html-stemmer
Main repo: https://github.com/marcelpuyat/html-stemmer
Overview
Extracts all words from a file, filtering out HTML tags, stemming using Porter2 and filtering out stop words.
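The stages named above can be sketched roughly as follows. This is an illustration, not the package's actual implementation: `extractWords`, the tiny `STOP_WORDS` list, and the identity `stem` placeholder (standing in for a real Porter2 stemmer) are all hypothetical, and the order in which stop-word filtering and stemming happen is an assumption.

```javascript
// Illustrative sketch only: html-stemmer's internals may differ.
// STOP_WORDS is a tiny hypothetical list, not the list the package uses.
var STOP_WORDS = { 'for': true, 'to': true, 'the': true, 'a': true };

function extractWords(html, stem) {
  // Identity placeholder; a real pipeline would plug in a Porter2 stemmer here.
  stem = stem || function (word) { return word; };
  return html
    .replace(/<[^>]*>/g, ' ')   // drop anything that looks like an HTML tag
    .toLowerCase()
    .split(/[^a-z]+/)           // split on runs of non-letters
    .filter(function (word) {   // remove empty tokens and stop words
      return word.length > 0 &&
        !Object.prototype.hasOwnProperty.call(STOP_WORDS, word);
    })
    .map(function (word) { return stem(word); });
}

console.log(extractWords('<p>The road to Rome</p>')); // [ 'road', 'rome' ]
```

Passing a stemming function as the second argument replaces the placeholder, which keeps the sketch independent of any particular stemmer library.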
Install
npm install html-stemmer
Usage
var htmlStemmer = require('html-stemmer');
htmlStemmer.initialize();
htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
Documentation
initialize(options)
Initializes the stemmer, using the default value for any option that is not specified.
Example:
htmlStemmer.initialize({
  includeTags: true,
  caseSensitive: true,
  delimiter: /[^A-Za-z0-9]+/gi
});
Options:
Note that all of these are optional.
includeTags - true or false. Filters out HTML tags (e.g. '<body>' is deleted) when false. false by default.
filters - An object that maps regular expressions to what they should be replaced by.
    // Example that filters '&#39;' into an apostrophe and '&quot;' into a quotation mark
    var filters = {};
    filters[/&#39;/gi] = '\'';
    filters[/&quot;/gi] = '"';
    htmlStemmer.initialize({ filters: filters });
stopWords - true or false. Excludes stop words (e.g. 'for', 'to') from the final array produced by getStemmedWords when true. The list of stop words used is linked from the README in the main repo. true by default.
caseSensitive - true or false. Converts all characters to lowercase when false. false by default.
stemmed - true or false. Stems each word using Porter2 when true. true by default.
delimiter - A RegExp delimiter that is used to split the data into tokens. By default, /[^A-Za-z]+/gi is used.
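Two of these options lean on plain JavaScript behavior that is easy to overlook, so a short sketch may help. A RegExp used as an object key is converted to a string, and a RegExp delimiter is what `String.prototype.split` accepts. The helpers `regExpFromKey` and `applyFilters` below are hypothetical and only show one way such a filter map could be consumed; whether html-stemmer does this internally is an assumption.

```javascript
// Plain JavaScript behavior, independent of html-stemmer.
var filters = {};
filters[/&quot;/gi] = '"';

// A RegExp used as a property name is converted to a string, so the key is text.
console.log(Object.keys(filters)); // [ '/&quot;/gi' ]

// Hypothetical helper (not part of html-stemmer): rebuild a RegExp from such a key.
function regExpFromKey(key) {
  var lastSlash = key.lastIndexOf('/');
  return new RegExp(key.slice(1, lastSlash), key.slice(lastSlash + 1));
}

// Hypothetical helper: apply every pattern-to-replacement pair in turn.
function applyFilters(text, filterMap) {
  return Object.keys(filterMap).reduce(function (result, key) {
    return result.replace(regExpFromKey(key), filterMap[key]);
  }, text);
}

console.log(applyFilters('say &quot;hi&quot;', filters)); // say "hi"

// Splitting with a delimiter of the same shape as the documented default:
console.log('Hello, world! 42 times'.split(/[^A-Za-z]+/gi)); // [ 'Hello', 'world', 'times' ]
```

Because the default delimiter treats digits as separators, a token such as 42 disappears; a delimiter like /[^A-Za-z0-9]+/gi keeps numbers as tokens.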
getStemmedWords(filePath, callbackFn)
Passes an array of the words found in the file at filePath to callbackFn, processed according to the options specified in initialize (stemmed by default). Because file reading is done asynchronously, the array is delivered through the callback function rather than as a return value.
Example:
htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
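For code that prefers Promises, a callback of this shape can be wrapped with a small helper. The sketch below is hypothetical and not part of html-stemmer: `promisifyResultOnly` and the stand-in `fakeGetStemmedWords` are illustrative names. It assumes the callback receives only the result, as in the example above, so the returned Promise never rejects.

```javascript
// Hypothetical helper, not part of html-stemmer.
// Wraps any function of the form fn(arg, callback(result)) so it returns a Promise.
function promisifyResultOnly(fn, context) {
  return function (arg) {
    return new Promise(function (resolve) {
      fn.call(context, arg, resolve);
    });
  };
}

// Stand-in with the same call shape as getStemmedWords, so the sketch runs on its own.
function fakeGetStemmedWords(filePath, callbackFn) {
  setTimeout(function () { callbackFn(['exampl', 'word']); }, 0);
}

var getWords = promisifyResultOnly(fakeGetStemmedWords);
getWords('filename').then(function (words) {
  console.log(words); // [ 'exampl', 'word' ]
});
```

With the real package, the equivalent call would presumably be promisifyResultOnly(htmlStemmer.getStemmedWords, htmlStemmer), made after htmlStemmer.initialize(). How the package reports a missing or unreadable file is not documented here, so the wrapper makes no attempt to handle that case.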