JSPM

  • License ISC

Extracts all Porter2-stemmed words from an HTML file, with the goal of aiding web-based NLP.

Package Exports

  • html-stemmer

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (html-stemmer) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

html-stemmer

Main repo: https://github.com/marcelpuyat/html-stemmer

Overview

Extracts all words from a file, filtering out HTML tags, stemming using Porter2 and filtering out stop words.
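The steps in this overview can be sketched in plain JavaScript. This is only an illustration, not html-stemmer's actual source: the function name extractWords, the short stop-word list, the order of the steps, and the pluggable stem option are all assumptions made for the example. No Porter2 implementation is included, so stem defaults to returning the word unchanged.

```javascript
// Hypothetical sketch of the pipeline described above (not the package's code).
// opts.stem: caller-supplied stemming function, e.g. a Porter2 implementation.
// opts.stopWords: words to exclude; the default list here is a tiny stand-in.
function extractWords(html, options) {
  var opts = options || {};
  var stem = opts.stem || function (word) { return word; };
  var stopWords = opts.stopWords || ['a', 'an', 'and', 'for', 'the', 'to'];

  return html
    .replace(/<[^>]*>/g, ' ')   // replace each HTML tag with a space
    .toLowerCase()              // fold case, as when caseSensitive is false
    .split(/[^A-Za-z]+/)        // tokenize on runs of non-letter characters
    .filter(function (word) {
      // drop empty tokens left by split, then drop stop words
      return word.length > 0 && stopWords.indexOf(word) === -1;
    })
    .map(function (word) { return stem(word); });
}

var words = extractWords('<body><p>Running to the shops</p></body>');
console.log(words); // [ 'running', 'shops' ]
```

With a real stemmer passed as opts.stem, the surviving tokens would be reduced to their stems instead of being returned as-is.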

Install

npm install html-stemmer

Usage

var htmlStemmer = require('html-stemmer');

htmlStemmer.initialize();

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
    console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});

Documentation

initialize(options)

Initializes the stemmer, using the default value for any option that is not specified.

Example:

htmlStemmer.initialize({
  includeTags: true,
  caseSensitive: true,
  delimiter: /[^A-Za-z0-9]+/gi
});
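To see what a delimiter that also keeps digits changes, the snippet below compares it with the documented default. It assumes the delimiter is applied with String.prototype.split, which the README does not confirm; tokenize is a hypothetical helper written for this comparison.

```javascript
// Hypothetical helper: split on a delimiter RegExp and discard empty tokens.
// Assumes split-style tokenization; html-stemmer's internals may differ.
function tokenize(input, delimiter) {
  return input.split(delimiter).filter(function (token) {
    return token.length > 0;
  });
}

var text = 'HTML5 rocks in 2024';

var lettersOnly = tokenize(text, /[^A-Za-z]+/gi);         // documented default
var lettersAndDigits = tokenize(text, /[^A-Za-z0-9]+/gi); // digits kept in tokens

console.log(lettersOnly);      // [ 'HTML', 'rocks', 'in' ]
console.log(lettersAndDigits); // [ 'HTML5', 'rocks', 'in', '2024' ]
```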

Options:

Note that all of these are optional.

  • includeTags - true or false. When false, HTML tags are filtered out (e.g. '<body>' is deleted). false by default.
  • filters - An object that maps regular expressions to what they should be replaced by.
    // Example that filters '&apos;' into an apostrophe and '&quot;' into a quotation mark
    var filters = {};
    
    filters[/&apos;/gi] = '\'';
    filters[/&quot;/gi] = '"';
    
    htmlStemmer.initialize({
      filters: filters
    });
  • stopWords - true or false. When true, stop words (e.g. 'for', 'to') are excluded from the final array returned by getStemmedWords. List of stop words used is available here. true by default.
  • caseSensitive - true or false. Converts all characters to lowercase when false. false by default.
  • stemmed - true or false. Stems each word using Porter2 when true. true by default.
  • delimiter - A RegExp delimiter that is used to split the data into tokens. By default, /[^A-Za-z]+/gi is used.
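One detail of the filters option worth knowing: JavaScript object keys are always strings, so a RegExp used as a key is stored as its text form (for example '/&apos;/gi'). The sketch below shows that coercion and one way such keys could be turned back into regular expressions. applyFilters is a hypothetical helper for illustration; how html-stemmer itself interprets the keys is not documented here.

```javascript
var filters = {};
filters[/&apos;/gi] = '\'';
filters[/&quot;/gi] = '"';

// The RegExp keys have been coerced to strings
var keys = Object.keys(filters);
console.log(keys); // [ '/&apos;/gi', '/&quot;/gi' ]

// Hypothetical helper: rebuild each RegExp from its '/source/flags' key
// and apply the replacements in insertion order.
function applyFilters(text, filterMap) {
  return Object.keys(filterMap).reduce(function (result, key) {
    var parts = /^\/(.*)\/([a-z]*)$/.exec(key);
    // Keys that are not in '/source/flags' form are used as a pattern source
    var pattern = parts ? new RegExp(parts[1], parts[2]) : new RegExp(key, 'g');
    return result.replace(pattern, filterMap[key]);
  }, text);
}

var cleaned = applyFilters('It&apos;s &quot;fine&quot;', filters);
console.log(cleaned); // It's "fine"
```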

getStemmedWords(filePath, callbackFn)

Passes an array containing all stemmed words, produced according to the options specified in initialize, to callbackFn. Because the file is read asynchronously, the array is not returned directly; a callback function is required to receive it.

Example:

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
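For code that prefers Promises, the callback can be wrapped by hand. Because the callback receives only the result (no error-first argument), util.promisify does not fit directly. In the sketch below, getStemmedWordsAsync is a hypothetical helper, and fakeStemmer is a stand-in with made-up data so the example runs without the package installed. Since no error reporting is documented, this wrapper never rejects.

```javascript
// Hypothetical wrapper: `stemmer` is any object exposing
// getStemmedWords(filePath, callback), such as the html-stemmer module.
function getStemmedWordsAsync(stemmer, filePath) {
  return new Promise(function (resolve) {
    stemmer.getStemmedWords(filePath, resolve);
  });
}

// Stand-in for require('html-stemmer'); the returned words are invented.
var fakeStemmer = {
  getStemmedWords: function (filePath, callback) {
    setTimeout(function () { callback(['run', 'shop']); }, 0);
  }
};

getStemmedWordsAsync(fakeStemmer, 'filename').then(function (stemmedWordsArray) {
  console.log(stemmedWordsArray); // [ 'run', 'shop' ]
});
```

With the real package, the call would pass the module in place of fakeStemmer after initialize() has been called.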