JSPM

  • Created
  • Published
  • Downloads 148944
  • Score
    100M100P100Q154514F
  • License MIT

modest natural language processing

Package Exports

  • compromise
  • compromise/one
  • compromise/three
  • compromise/tokenize
  • compromise/two

Readme

compromise
modest natural language processing
npm install compromise
do you find it strange, how we struggle to parse text?
    ᔐᖜ↬-
    how error-prone and tricky the simplest things become?
    how easy text is to make, then how difficult it is to use?
how it becomes
basically a dead-end
for our information?
compromise tries its best to turn text into data.
it makes limited and sensible decisions.
it is not as smart as you'd think.
import nlp from 'compromise'

let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'
the idea is to be not fancy at all:
if (doc.has('simon says #Verb')) {
  return true
}
pull-out parts of a text:
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"

compute metadata, and grab it:

import plg from 'compromise-speech'
nlp.extend(plg)

let doc = nlp('Milwaukee has certainly had its share of visitors..')
doc.compute('syllables')
doc.places().json()
/*
[{
  "text": "Milwaukee",
  "terms": [{ 
    "normal": "milwaukee",
    "syllables": ["mil", "waukee"]
  }]
}]
*/

quickly flip between parsed and unparsed forms:

let doc = nlp('soft and yielding like a nerf ball')
doc.out({ '#Adjective': (m)=>`<i>${m.text()}</i>` })
// '<i>soft</i> and <i>yielding</i> like a nerf ball'

avoid idiomatic problems, and brittle parsers:

let doc = nlp("we're not gonna take it..")

doc.has('gonna') // true
doc.has('going to') // true (implicit)

// transform
doc.contractions().expand()
dox.text()
// 'we are not going to take it..'

whip stuff around like it's data:

let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(20)
doc.text()
// 'ninety five thousand and seventy two'

because it actually is:

let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'

Use it on the client-side:

<script src="https://unpkg.com/compromise"></script>
<script>
  var doc = nlp('two bottles of beer')
  doc.numbers().minus(1)
  document.body.innerHTML = doc.text()
  // 'one bottle of beer'
</script>

or likewise:

import nlp from 'compromise'

var doc = nlp('London is calling')
doc.verbs().toNegative()
// 'London is not calling'

compromise is ~200kb (minified):

it's pretty fast. It can run on keypress:

it works mainly by conjugating all forms of a basic word list.

The final lexicon is ~14,000 words:

you can read more about how it works, here. it's weird.

okay,

compromise/one

A tokenizer of words, sentences, and punctuation.

import nlp from 'compromise/one'

let doc = nlp("Wayne's World, party time")
let data = doc.json()
/* [{ 
    normal:"wayne's world party time",
    terms:[{ text: "Wayne's", normal: "wayne" }, 
      ...
      ] 
  }]
*/

one splits your text up, wraps it in a handy API,

    and does nothing else -
Output
  • .text() - return the document as text
  • .json() - return the document as data
  • .debug() - pretty-print the interpreted document
Utils
  • .all() - return the whole original document ('zoom out')
  • .found [getter] - is this document empty?
  • .tagger() - (re-)run the part-of-speech tagger on this document
  • .wordCount() - count the # of terms in the document
  • .length [getter] - count the # of characters in the document (string length)
  • .clone() - deep-copy the document, so that no references remain
  • .cache({}) - freeze the current state of the document, for speed-purposes
  • .uncache() - un-freezes the current state of the document, so it may be transformed
Accessors
Match

(match methods use the match-syntax.)

  • .match('') - return a new Doc, with this one as a parent
  • .not('') - return all results except for this
  • .matchOne('') - return only the first match
  • .if('') - return each current phrase, only if it contains this match ('only')
  • .ifNo('') - Filter-out any current phrases that have this match ('notIf')
  • .has('') - Return a boolean if this match exists
  • .lookBehind('') - search through earlier terms, in the sentence
  • .lookAhead('') - search through following terms, in the sentence
  • .before('') - return all terms before a match, in each phrase
  • .after('') - return all terms after a match, in each phrase
  • .lookup([]) - quick find for an array of string matches
Tag
  • .tag('') - Give all terms the given tag
  • .tagSafe('') - Only apply tag to terms if it is consistent with current tags
  • .unTag('') - Remove this term from the given terms
  • .canBe('') - return only the terms that can be this tag
Case
Whitespace
  • .pre('') - add this punctuation or whitespace before each match
  • .post('') - add this punctuation or whitespace after each match
  • .trim() - remove start and end whitespace
  • .hyphenate() - connect words with hyphen, and remove whitespace
  • .dehyphenate() - remove hyphens between words, and set whitespace
  • .toQuotations() - add quotation marks around these matches
  • .toParentheses() - add brackets around these matches
Loops
  • .map(fn) - run each phrase through a function, and create a new document
  • .forEach(fn) - run a function on each phrase, as an individual document
  • .filter(fn) - return only the phrases that return true
  • .find(fn) - return a document with only the first phrase that matches
  • .some(fn) - return true or false if there is one matching phrase
  • .random(fn) - sample a subset of the results
Insert
Transform

one is fast - most sentences take a 10th of a millisecond.

It can do ~1mb of text a second - or 10 wikipedia pages.

Infinite jest is takes 3s.

You can also paralellize, or stream text to it with compromise-speed.

compromise/two

A part-of-speech tagger, and grammar-interpreter.

import nlp from 'compromise/two'

let doc = nlp("Wayne's World, party time")
let str = doc.match('#Possessive #Noun').text()
// "Wayne's World"

two automatically calculates the very basic grammar of each word.

this is more useful than people sometimes realize.

Really light grammar helps you write cleaner templates, and get closer to the information.

Part-of-speech tagging is profoundly-difficult task to get 100% on. It is also a profoundly easy task to get 85% on.

Contractions

you can see the grammar of each word by running doc.debug(), and the reasoning for each tag with nlp.verbose('tagger').

compromise has 83 tags, arranged in a handsome graph.

#FirstName#Person#ProperNoun#Noun

if you prefer Penn tags, you can derive them with:

let doc = nlp('welcome thrillho')
doc.compute('penn')
doc.json()

compromise/three

Phrase and sentence tooling.

import nlp from 'compromise/three'

let doc = nlp("Wayne's World, party time")
let str = doc.people().normalize().text()
// "wayne"

three is a set of tooling to zoom into and operate on parts of a text.

.numbers() grabs all the numbers in a document, for example - and extends it with new methods, like .subtract().

Nouns
Verbs
Numbers
Sentences
Misc selections

.extend():

compromise comes with a considerate, common-sense baseline for english grammar. You're free to change, or lay-waste to any settings - which is the fun part actually.

the easiest part is just to suggest tags for any given words:

let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)

or make heavier changes with a compromise-plugin.

import nlp from 'compromise'
nlp.extend({
  // add new tags
  tags: {
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  },
  // add or change words in the lexicon
  words: {
    kermit: 'Character',
    gonzo: 'Character',
  },
  // add new methods to compromise
  api: (View) => {
    View.prototype.kermitVoice = function () {
      this.sentences().prepend('well,')
      this.match('i [(am|was)]').prepend('um,')
      return this
    }
  }
})

Docs:

gentle introduction:
Documentation:
Concepts API Plugins
Accuracy Accessors Adjectives
Caching Constructor-methods Dates
Case Contractions Export
Filesize Insert Hash
Internals Json Html
Justification Lists Keypress
Lexicon Loops Ngrams
Match-syntax Match Numbers
Performance Nouns Paragraphs
Plugins Output Scan
Projects Selections Sentences
Tagger Sorting Syllables
Tags Split Pronounce
Tokenization Text Strict
Named-Entities Utils Penn-tags
Whitespace Verbs Typeahead
World data Normalization
Fuzzy-matching Typescript
Talks:
Articles:
Some fun Applications:

API:

Constructor

(these methods are on the nlp object)

  • .tokenize() - parse text without running POS-tagging
  • .plugin() - mix in a compromise-plugin
  • .verbose() - log our decision-making for debugging
  • .version - current semver version of the library
  • .parseMatch() - pre-parse any match statements for faster lookups
  • .world() - grab all current linguistic data

Plugins:

These are some helpful extensions:

Adjectives

npm install compromise-adjectives

Dates

npm install compromise-dates

Export

npm install compromise-export

  • .export() - store a parsed document for later use
  • nlp.load() - re-generate a Doc object from .export() results
Html

npm install compromise-html

  • .html({}) - generate sanitized html from the document
Hash

npm install compromise-hash

  • .hash() - generate an md5 hash from the document+tags
  • .isEqual(doc) - compare the hash of two documents for semantic-equality
Keypress

npm install compromise-keypress

Ngrams

npm install compromise-plugin-stats

Paragraphs

npm install compromise-paragraphs this plugin creates a wrapper around the default sentence objects.

Syllables

npm install compromise-syllables

  • .syllables() - split each term by its typical pronounciation
Penn-tags

npm install compromise-penn-tags


Typescript

we're committed to typescript/deno support, both in main and in the official-plugins:

import nlp from 'compromise'
import stats from 'compromise-stats'

const nlpEx = nlp.extend(stats)

nlpEx('This is type safe!').ngrams({ min: 1 })

Limitations:

  • slash-support: We currently split slashes up as different words, like we do for hyphens. so things like this don't work: nlp('the koala eats/shoots/leaves').has('koala leaves') //false

  • inter-sentence match: By default, sentences are the top-level abstraction. Inter-sentence, or multi-sentence matches aren't supported without a plugin: nlp("that's it. Back to Winnipeg!").has('it back')//false

  • nested match syntax: the danger beauty of regex is that you can recurse indefinitely. Our match syntax is much weaker. Things like this are not (yet) possible: doc.match('(modern (major|minor))? general') complex matches must be achieved with successive .match() statements.

  • dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.

FAQ

    ☂️ Isn't javascript too...

      yeah it is!
      it wasn't built to compete with NLTK, and may not fit every project.
      string processing is synchronous too, and parallelizing node processes is weird.
      See here for information about speed & performance, and here for project motivations

    💃 Can it run on my arduino-watch?

      Only if it's water-proof!
      Read quick start for running compromise in workers, mobile apps, and all sorts of funny environments.

    🌎 Compromise in other Languages?

      we've got work-in-progress forks for German and French, in the same philosophy.
      and need some help.

    ✨ Partial builds?

      we do offer a compromise-tokenize build, which has the POS-tagger pulled-out.
      but otherwise, compromise isn't easily tree-shaken.
      the tagging methods are competitive, and greedy, so it's not recommended to pull things out.
      Note that without a full POS-tagging, the contraction-parser won't work perfectly. ((spencer's cool) vs. (spencer's house))
      It's recommended to run the library fully.

See Also:

MIT