html-stemmer

Main repo: https://github.com/marcelpuyat/html-stemmer

Overview

Extracts the words from a file, stripping HTML tags, stemming each word with Porter2, and filtering out stop words.

Install

npm install html-stemmer

Usage

var htmlStemmer = require('html-stemmer');

htmlStemmer.initialize();

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
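
Once the callback has the array, it can be processed like any other JavaScript array. As a minimal sketch, the helper below counts how often each word occurs. The name countWords is hypothetical and not part of html-stemmer, and the input is a made-up array standing in for whatever the callback receives:

```javascript
// Hypothetical helper, not part of html-stemmer: tallies occurrences of each word.
function countWords(words) {
  // Object.create(null) avoids collisions with inherited keys such as 'constructor'.
  var counts = Object.create(null);
  words.forEach(function (word) {
    counts[word] = (counts[word] || 0) + 1;
  });
  return counts;
}

// Made-up input standing in for the array passed to the getStemmedWords callback.
var exampleWords = ['run', 'jump', 'run', 'walk', 'run'];
var exampleCounts = countWords(exampleWords);

console.log(exampleCounts.run);  // 3
console.log(exampleCounts.jump); // 1
```

In real use, countWords(stemmedWordsArray) would be called inside the callback shown above.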

Documentation

initialize(options)

Initializes the stemmer, using default options when not specified.

Example:

htmlStemmer.initialize({
  includeTags: true,
  caseSensitive: true,
  delimiter: /[^A-Za-z0-9]+/gi
});

Options:

Note that all of these are optional.

includeTags - true or false. When false, HTML tags are filtered out (e.g. '<body>' is deleted). false by default.

filters - An object that maps regular expressions to what they should be replaced by.

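// Example that replaces '&apos;' with an apostrophe and '&quot;' with a quotation mark
var filters = {};

// Object keys are always strings, so each RegExp is stored under its string form (e.g. '/&apos;/gi')
filters[/&apos;/gi] = '\'';
filters[/&quot;/gi] = '"';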

htmlStemmer.initialize({
  filters: filters
});

stopWords - true or false. When true, stop words (e.g. 'for', 'to') are excluded from the array produced by getStemmedWords. The list of stop words used is available here. true by default.

caseSensitive - true or false. When false, all characters are converted to lowercase. false by default.

stemmed - true or false. When true, each word is stemmed using Porter2. true by default.

delimiter - A RegExp used to split the data into tokens. /[^A-Za-z]+/gi by default.
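
To see what a given delimiter does to the input, the sketch below applies the default pattern and the letters-and-digits pattern from the initialize example to the same string, using plain String.prototype.split. The tokenize helper is hypothetical and only illustrates the regular expressions; it is not how html-stemmer is necessarily implemented internally:

```javascript
// Hypothetical helper, not part of html-stemmer: splits text on a delimiter
// and drops the empty strings that split() leaves at the edges.
function tokenize(text, delimiter) {
  return text.split(delimiter).filter(function (token) {
    return token.length > 0;
  });
}

var sample = "Hello, world! It's 2024.";

// Default delimiter: anything that is not a letter separates tokens.
var lettersOnly = tokenize(sample, /[^A-Za-z]+/);

// Delimiter from the initialize example: digits are kept as part of tokens.
var lettersAndDigits = tokenize(sample, /[^A-Za-z0-9]+/);

console.log(lettersOnly);      // [ 'Hello', 'world', 'It', 's' ]
console.log(lettersAndDigits); // [ 'Hello', 'world', 'It', 's', '2024' ]
```

With the default delimiter, numbers disappear and the apostrophe splits "It's" into two tokens, so a custom delimiter is worth considering when digits or contractions matter.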

getStemmedWords(filePath, callbackFn)

Reads the file at filePath and passes an array of its words, processed according to the options given to initialize, to callbackFn. Because the file is read asynchronously, the array is not returned directly; a callback function is required to receive it.

Example:

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
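
For code that prefers promises to callbacks, the call can be wrapped. This is a minimal sketch: getStemmedWordsAsync is a hypothetical name, not part of html-stemmer, and it assumes only the callback signature documented above, which passes the array and no error argument:

```javascript
// Hypothetical wrapper, not part of html-stemmer. The stemmer object is passed
// in so the wrapper works with any object exposing getStemmedWords(path, cb).
function getStemmedWordsAsync(stemmer, filePath) {
  return new Promise(function (resolve) {
    // The documented callback receives only the array, so there is no error
    // argument to forward to reject here.
    stemmer.getStemmedWords(filePath, resolve);
  });
}

// Usage, assuming html-stemmer is installed and initialized:
// var htmlStemmer = require('html-stemmer');
// htmlStemmer.initialize();
// getStemmedWordsAsync(htmlStemmer, 'filename').then(function (words) {
//   console.log(words);
// });
```

Because the documented callback has no error parameter, this promise can only resolve; how the library reports an unreadable file is not covered here.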