simple-webscraper

Web scraping toolkit

Package Exports

  • simple-webscraper

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (simple-webscraper) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
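
For reference, a minimal sketch of the missing "exports" field as it could appear in simple-webscraper's own package.json (the ./index.js entry point is an assumption, not taken from the package):

{
  "name": "simple-webscraper",
  "exports": {
    ".": "./index.js"
  }
}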

Readme

Web Scraper

  • CSS selectors
  • inserts results into an SQLite database
  • stop conditions:
    • time limit
    • number of results
    • number of websites
    • filter function to accept or reject results
  • init with options, or set them later with chained setters such as spider.setVal1(v).setVal2(v2)
  • builder (call chaining) design pattern

API

// DEFAULT init options
const spiderOpts = {
  // (url: String, sel: String, txt: String) => Promise -- called for every result
  exportFunct: async (url, sel, txt) => null,
  // predicate i.e. (txt: String) => Boolean -- results that fail it are dropped
  filterFunct: (txt) => true,
  // Array<String> -- CSS selectors for links to follow
  followSelectors: [],
  // String -- error log file
  logErrFile: rootPath('errors.log'),
  // String -- info log file
  logInfoFile: rootPath('log'),
  // Integer -- max redirects to follow per request
  redirFollowCount: 3,
  // Integer -- seconds to wait for a response
  respSecW8: 10,
  // Array<String> -- CSS selectors for content to extract
  selectors: [],
  // Integer -- stop after this many results
  resultCount: 100,
  // Integer -- stop after this many distinct sites
  siteCount: 10,
  // Integer -- number of worker threads
  threadCount: 4,
  // Integer -- stop after this many seconds
  timeLimit: 60,
};

const startURL = "https://stackoverflow.com/questions/...";
const crawler = new Spider(startURL, spiderOpts);
crawler.run();

OR use setter methods to modify options after construction (OPTIONAL, since you can also set them on init):

const startURL = "https://stackoverflow.com/questions/...";
const crawler = new Spider(startURL);
crawler.setLogErrFile('msgs-err.log')
       .setLogInfoFile('msgs-info.log')
       .setRespSecW8(10)
       .appendSelector('p.info')
       .appendSelector('p.more-info')
       .appendFollowSelector('.btn.next')
       .appendFollowSelector('.btn.next-page')
       .setFilterFunct(txt => !!txt.match('sunflower'))
       .setTimeLimit(120) // sec
       .setThreadCount(8)
       .setSiteCount(100) // distinct URLs
       // every setter returns the Spider, so calls chain; run() returns void,
       // so provide an export function to consume each result (see below)
       .run();

Export Function

Must be of type (url: String, sel: String, txt: String) => Promise<*>. See ./db.js for an example that inserts every result into an SQLite database.

NOTE Results will be stored in ./db.
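
The export function does not have to target a database. As a minimal sketch, the hypothetical function below satisfies the same (url, sel, txt) => Promise<*> contract by appending each result to a JSON-lines file (results.jsonl is an arbitrary name, not part of this package):

const fs = require('fs/promises');

// one JSON line per result; fs.appendFile returns a Promise
const exportFunct = (url, sel, txt) =>
  fs.appendFile('results.jsonl', JSON.stringify({url, sel, txt}) + '\n');

// wire it up like any other option:
// new Spider(startURL).setExportFunct(exportFunct).run();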

Example

const Spider = require('./spider');
const db = require('./db');

(async function () {
  // recreate the results table from scratch
  await db.sequelize.sync({force: true});
  const s = new Spider('https://www.jobsite.co.uk/jobs/javascript');
  s.setExportFunct(async (url, sel, txt) => {
    try {
      // await inside try so insertion errors are actually caught
      return await db.Result.create({txt, selector: sel, url});
    } catch (e) {
      console.error(e);
    }
  }).appendSelector('.job > .row > .col-sm-12')
    // keep graduate jobs ('raduate' matches "Graduate" and "graduate") outside London
    .setFilterFunct(txt => !!txt.match('raduate') && !txt.match('London'))
    // follow the pagination links to the next page
    .appendFollowSelector(".results-footer-links-container ul.pagination li a[href*='page=']")
    // stop after 3 websites (URLs)
    .setSiteCount(3)
    // or after 30 sec, whichever comes first
    .setTimeLimit(30)
    .run();
})();
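
After the crawl has finished, the stored rows can be read back through the same Sequelize model. A quick sketch, assuming the Result model from ./db above (with the txt, selector and url fields used in the export function); it belongs inside an async function:

// list every stored result, truncated for readability
const rows = await db.Result.findAll();
for (const r of rows) {
  console.log(r.url, r.selector, r.txt.slice(0, 80));
}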