JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 3843
  • Score
    100M100P100Q118262F
  • License MIT

A library to test if a url(request) is crawled, usually used in a web crawler. Compatible with `request` and `node-webcrawler`

Package Exports

  • seenreq

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (seenreq) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

seenreq

A library to test if a url(request) is crawled, usually used in a web crawler. Compatible with request and node-crawler

Notice: Please do not use 0.1.6 or lower version on Node 4.0+

Install

$ npm install seenreq

Basic Usage

var seenreq = require('seenreq')
var seen = new seenreq();

//url to be normalized
var url = "http://www.GOOGLE.com";
console.log(seen.normalize(url));//GET http://www.google.com/\r\n

//request options to be normalized
var option = {
    uri:'http://www.GOOGLE.com'
};

console.log(seen.normalize(option));//GET http://www.google.com/\r\n

//return false if ask for a `request` never see
console.log(seen.exists(url));//false

//return true if got same `request`
console.log(seen.exists(opt));//true

When you call exists, the module will do normalization itself first and then check if exists.

Use Redis to store keys

seenreq default stores keys in memory, so process will use unlimited memory if there are unlimited keys. Redis will solve this problem.

var seenreq = require('seenreq')
var seen = new seenreq({
    repo:'redis',// use redis instead of memory
    host:'127.0.0.1',
    port:6379,
    clearOnQuit:false // default true.
});

var url = "http://www.GOOGLE.com";

//because of non-blocking I/O, you have to use a callback function to get result
seen.exists(url,{
    callback:function(err,result){
        if(err){
            console.error(err);
        }else{
            console.log(result);
        }
    }
});

Class:Seenreq

Instance of Seenreq

seen.normalize(uri|option)

  • uri String, option is Option of request or node-webcrawler. return normalized String.

seen.exists(uri|option|[uri][,options])

  • options, Warning: When using default repo if you call exists with an array of uri that have duplicate uris, the function won't remove.

seen.dispose()

  • dispose resources of repo. If you are using Redis and do not call dispose the connection will keep forever, that is your process will never exit.

Options

  • removeKeys: Array, ignore specified keys when doing normalization. For instance, there is a ts property in the url like http://www.xxx.com/index?ts=1442382602504 which is timestamp and it should be same whenever you visit.
  • callback: Function, return result if using Redis repo.

RoadMap

  • add mysql repo to persist keys to disk.
  • add keys life time management.