##Introduction
Node.js module for scraping a website together with its images, CSS, JS and other resources. Uses cheerio, request, bluebird, fs-extra and underscore.
##Installation
    npm install website-scraper
##Usage
    var scraper = require('website-scraper');

    scraper.scrape({
      url: 'http://nodejs.org/',
      path: '/path/to/save/'
    }, function (error, result) {
      /* some code here */
    });

##API
scrape(options, callback)
Makes a request to `url` and saves every file matched by `srcToLoad` to `path`.

`options` - object containing the following options:
- `url`: URL to load (required)
- `path`: path to save loaded files (required)
- `log`: boolean indicating whether to write the log to the console (optional, default: false)
- `indexFile`: filename for the index page (optional, default: 'index.html')
- `srcToLoad`: array of objects specifying the selectors and attributes used to select files for loading (optional, see example below)
- `directories`: array of objects specifying relative directories for file extensions. If `null`, all files are saved to `path` (optional, see example below)

`callback` - callback function (optional) that receives the following parameters:
- `error`: an `Error` object on failure, `null` on success
- `result`: `null` on failure; on success, an object containing:
  - `html`: HTML code of the index page
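The callback described above follows the usual Node.js error-first convention, so it can be wrapped in a promise. The sketch below is only an illustration and not part of the module: `scrapeAsync` and `fakeScrape` are hypothetical names, and the `scrape` function is passed in as an argument so the wrapper can be tried against a stub without touching the network.

```javascript
// Hypothetical helper (not part of website-scraper): wraps an error-first
// `scrape(options, callback)` function in a Promise.
function scrapeAsync(scrape, options) {
  return new Promise(function (resolve, reject) {
    scrape(options, function (error, result) {
      if (error) {
        reject(error);   // failure: error is an Error object, result is null
      } else {
        resolve(result); // success: error is null, result.html is the index page
      }
    });
  });
}

// Stub with the same callback contract, used here for demonstration only.
function fakeScrape(options, callback) {
  if (!options.url || !options.path) {
    callback(new Error('url and path are required'), null);
  } else {
    callback(null, { html: '<html></html>' });
  }
}

scrapeAsync(fakeScrape, { url: 'http://nodejs.org/', path: '/path/to/save/' })
  .then(function (result) {
    console.log(result.html);
  })
  .catch(function (error) {
    console.error(error.message);
  });
```

With the real module, the call might look like `scrapeAsync(scraper.scrape.bind(scraper), {url: 'http://nodejs.org/', path: '/path/to/save/'})`. The `bind` is a precaution in case `scrape` relies on `this`, which this README does not state.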
##Examples
Let's scrape http://nodejs.org/ with its images, CSS and JS files and save them to /path/to/save/. The index page will be named 'myIndex.html', and files will be separated into directories:
- `img` for .jpg, .png (full path /path/to/save/img)
- `js` for .js (full path /path/to/save/js)
- `css` for .css (full path /path/to/save/css)
- `fonts` for .ttf, .woff, .eot, .svg (full path /path/to/save/fonts)
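One way to read these directory rules is as a lookup from file extension to subdirectory. The sketch below shows that reading; it is not the module's own code. `directoryFor` and `savePath` are hypothetical helpers, and two behaviours are assumptions: extensions are compared case-insensitively, and files that match no rule are saved directly under `path`.

```javascript
var path = require('path');

// Directory rules taken from the example in this README.
var directories = [
  {directory: 'img', extensions: ['.jpg', '.png']},
  {directory: 'js', extensions: ['.js']},
  {directory: 'css', extensions: ['.css']},
  {directory: 'fonts', extensions: ['.ttf', '.woff', '.eot', '.svg']}
];

// Hypothetical helper: returns the subdirectory for a filename, or ''
// when `rules` is null or no rule lists the file's extension (assumption).
function directoryFor(filename, rules) {
  if (!rules) {
    return '';
  }
  var ext = path.extname(filename).toLowerCase();
  for (var i = 0; i < rules.length; i++) {
    if (rules[i].extensions.indexOf(ext) !== -1) {
      return rules[i].directory;
    }
  }
  return '';
}

// Hypothetical helper: builds the full save location under the `path` option.
// path.posix.join skips the empty segment when no directory applies.
function savePath(basePath, filename, rules) {
  return path.posix.join(basePath, directoryFor(filename, rules), filename);
}

console.log(savePath('/path/to/save', 'logo.png', directories)); // /path/to/save/img/logo.png
console.log(savePath('/path/to/save', 'logo.png', null));        // /path/to/save/logo.png
```

The full `scrape` call for this example: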
    scraper.scrape({
      url: 'http://nodejs.org/',
      path: '/path/to/save',
      indexFile: 'myIndex.html',
      srcToLoad: [
        {selector: 'img', attr: 'src'},
        {selector: 'link[rel="stylesheet"]', attr: 'href'},
        {selector: 'script', attr: 'src'}
      ],
      directories: [
        {directory: 'img', extensions: ['.jpg', '.png']},
        {directory: 'js', extensions: ['.js']},
        {directory: 'css', extensions: ['.css']},
        {directory: 'fonts', extensions: ['.ttf', '.woff', '.eot', '.svg']}
      ]
    }, function (error, result) {
      console.log(result);
    });

##Dependencies
- cheerio
- request
- bluebird
- fs-extra
- underscore