# Web Scraper
- CSS selectors
- export functions
- pre-configured to insert results into an SQLite database and generate a CSV
- stop conditions:
  - time
  - number of results
  - number of websites
- a filter function to screen results
- post- and pre-processing functions
- init with options, or set them later with `spider.setVal1(v).setVal2(v2)` (builder / call-chaining design pattern)
- extensible
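Install from npm (assuming the package is published under the same name): `npm install simple-webscraper`.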
## API

Docs are on gh-pages.
```js
const { Spider } = require('simple-webscraper');

const startURL = "https://stackoverflow.com/questions/...";

const crawler = new Spider(startURL);

crawler.setRespSecW8(20)
  .appendSelector('p.info')
  .appendSelector('p.more-info')
  .appendFollowSelector('.btn.next')
  .appendFollowSelector('.btn.next-page')
  .setPostProcessTextFunct(text => text.replace('mother', 'yes'))
  .setFilterFunct(txt => !!txt.match('sunflower'))
  .setTimeLimit(120)  // sec
  .setThreadCount(8)  // #workers
  .setSiteCount(100)  // distinct URLs
  // run() returns void; provide an export function if you want to capture each result (see below)
  // by default results go to an sqlite ./db and are printed to the console
  .run();
```

Or use an init object in the constructor:
```js
// DEFAULT init options
const spiderOpts = {
  // Function<String, String, String, Promise>
  exportFunct: exporting.combine(exporting.console(), exporting.sqlite()),
  // predicate i.e. Function<String, Boolean>
  filterFunct: (txt) => true,
  // Array<String>
  followSelectors: [],
  // String
  logInfoFile: undefined, // logging goes to console
  // Integer
  redirFollowCount: 3,
  // Integer
  respSecW8: 10,
  // Array<String>
  selectors: [],
  // Integer
  resultCount: 100,
  // Integer
  siteCount: 10, // #sites
  // Integer
  threadCount: 4,
  // Integer
  timeLimit: 60, // sec
};
```
```js
const startURL = "https://stackoverflow.com/questions/...";

const crawler = new Spider(startURL, spiderOpts);

crawler.run();
```

See the export functions below to save results.
## Export Function

Must be of type `(uri: string, selector: string, text: string) => Promise<*>`.

There are a few configurable export functions that you can use.

Import the exporting module:

```js
const { exporting, Spider } = require('simple-webscraper');
```

Declare a spider:

```js
const spider = new Spider(uri, { /* opts */ });
```

### sqlite

Generates a `Result` table with `id INT`, `text TEXT`, `selector TEXT`, `uri TEXT` columns.

```js
spider.setExportFunct(exporting.sqlite()) // generates the output db name
  .run();

spider.setExportFunct(exporting.sqlite('my-database.sqlite'))
  .run();
```
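After a run you can inspect the results with any SQLite client. A minimal sketch using the `sqlite3` npm package (an assumption, not a dependency of this package), reading the `Result` table described above:

```js
const sqlite3 = require('sqlite3');

// open the database written by exporting.sqlite('my-database.sqlite')
const db = new sqlite3.Database('my-database.sqlite');

// columns per the docs above: id, text, selector, uri
db.all('SELECT uri, selector, text FROM Result', (err, rows) => {
  if (err) throw err;
  rows.forEach(r => console.log(`${r.uri} [${r.selector}] ${r.text.slice(0, 80)}`));
  db.close();
});
```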
### console

```js
spider.setExportFunct(exporting.console()) // default formatter
  .run();

spider.setExportFunct(exporting.console('%s :: %s => %s')) // string formatter for (uri, selector, text)
  .run();

spider.setExportFunct(exporting.console((uri, selector, text) => `${uri} :: ${text.slice(0, 100)}`))
  .run();
```
### file

```js
spider.setExportFunct(exporting.file()) // default file name, default formatter
  .run();

spider.setExportFunct(exporting.file('results.csv')) // custom file name, default csv formatter
  .run();

spider.setExportFunct(exporting.file('results.log', 'INFO %s, %s, %s')) // custom file name, string formatter
  .run();

spider.setExportFunct(exporting.file('results.log', (uri, selector, text) => `${uri} :: ${text.slice(0, 100)}`))
  .run();
```
### combine

Used to broadcast results to many exports.

```js
spider.setExportFunct(exporting.combine(
  exporting.sqlite(),
  exporting.console(),
  exporting.file(),
)).run();
```
### db

```js
spider.setExportFunct(exporting.db(dbURI)) // see the sequelize docs
  .run();
```
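`dbURI` is a Sequelize-style connection string. For example (the credentials and host below are hypothetical):

```js
// hypothetical connection string; any dialect sequelize supports should work
const dbURI = 'postgres://user:password@localhost:5432/scraper';

spider.setExportFunct(exporting.db(dbURI)).run();
```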
### default

Enabled by default; sends results to the console, a CSV file and an sqlite database.
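In other words, the default behaves roughly like a `combine` of the individual exports above. A sketch, not the exact implementation (the precise file and database names are assumptions):

```js
// roughly what the default export does (exact defaults may differ)
spider.setExportFunct(exporting.combine(
  exporting.console(),
  exporting.file(),   // CSV file
  exporting.sqlite(), // sqlite database
)).run();
```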
It's very easy to define your own export function, e.g. imagine wanting to POST each result to some third-party API.

```js
// POST each result to a third-party endpoint (myURI),
// using the global fetch available in Node >= 18
const myExportFunction = async (uri, selector, text) => {
  await fetch(myURI, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ uri, selector, text }),
  });
};
```
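Then register it like any other export function:

```js
spider.setExportFunct(myExportFunction).run();
```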
## Example

More examples in `./examples`.
```js
const { Spider, exporting } = require('simple-webscraper');

(async function () {
  const s = new Spider('https://www.jobsite.co.uk/jobs/javascript');

  const sqliteExport = await exporting.sqlite('./db', true /* force wipe if exists */);

  s.setExportFunct(sqliteExport)
    .appendSelector(".job > .row > .col-sm-12")
    // don't look for jobs in London, make sure they are graduate!
    .setFilterFunct(txt => !!txt.match('raduate') && !txt.match('London'))
    // next page
    .appendFollowSelector(".results-footer-links-container ul.pagination li a[href*='page=']")
    // stop after 3 websites (urls)
    .setSiteCount(3)
    // run for 30 sec
    .setTimeLimit(30)
    .run();
})();
```