JSPM

  • Downloads 845412
  • License MIT

Latin-script (natural language) parser

Package Exports

  • parse-latin
  • parse-latin/lib/plugin/break-implicit-sentences
  • parse-latin/lib/plugin/make-final-white-space-siblings
  • parse-latin/lib/plugin/make-initial-white-space-siblings
  • parse-latin/lib/plugin/merge-affix-exceptions
  • parse-latin/lib/plugin/merge-affix-symbol
  • parse-latin/lib/plugin/merge-initial-digit-sentences
  • parse-latin/lib/plugin/merge-initial-lower-case-letter-sentences
  • parse-latin/lib/plugin/merge-non-word-sentences
  • parse-latin/lib/plugin/merge-remaining-full-stops
  • parse-latin/lib/plugin/patch-position
  • parse-latin/lib/plugin/remove-empty-nodes

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (parse-latin) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

parse-latin

A Latin-script language parser for retext producing nlcst nodes.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job of tokenizing it.

parse-latin also does a decent job of tokenizing Latin-like scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).

Install

This package is ESM only: Node 12+ is needed to use it and it must be imported instead of required.

npm:

npm install parse-latin

Use

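// Recent ESM releases of unist-util-inspect export `inspect` by name;
// older CommonJS releases need a default import instead.
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'

// Parse a string into an nlcst tree.
const tree = new ParseLatin().parse('A simple sentence.')

console.log(inspect(tree))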

Which, when inspected, yields:

RootNode[1] (1:1-1:19, 0-18)
└─ ParagraphNode[1] (1:1-1:19, 0-18)
   └─ SentenceNode[6] (1:1-1:19, 0-18)
      ├─ WordNode[1] (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode[1] (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode[1] (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)
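
The tree printed above is made of plain objects: parent nodes carry a children array and literal nodes carry a value string. The following is a minimal sketch of walking such a tree. The tree is typed out by hand here (positions omitted) so the snippet runs without any packages, and the helpers textOf and countType are hypothetical names used for illustration, not part of parse-latin.

```javascript
// Hand-typed copy of the tree shown above, with positions left out.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all literal nodes.
function textOf(node) {
  return Array.isArray(node.children)
    ? node.children.map(textOf).join('')
    : node.value
}

// Hypothetical helper: count the nodes of a given type in a subtree.
function countType(node, type) {
  const own = node.type === type ? 1 : 0
  const nested = (node.children || []).reduce(
    (sum, child) => sum + countType(child, type),
    0
  )
  return own + nested
}

console.log(textOf(tree)) // "A simple sentence."
console.log(countType(tree, 'WordNode')) // 3
```

Because every character of the input ends up in exactly one literal node, joining the literals gives back the original string.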

API

This package exports the following identifiers: ParseLatin. There is no default export.

ParseLatin(value)

Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree. If value is passed here, it does not need to be passed to #parse() again.

ParseLatin#tokenize(value)

Tokenize value (string) into letters and numbers (words), white space, and everything else (punctuation). The returned nodes are a flat list without paragraphs or sentences.

Returns

Array.<Node> — Nodes.

ParseLatin#parse(value)

Tokenize value (string) into an NLCST tree. The returned node is a RootNode containing paragraphs and sentences.

Returns

Node — Root node.

Algorithm

Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, word, and punctuation tokens. It starts out with a simple definition, one that most other tokenizers use:

  • A “word” is one or more letter or number characters
  • A “white space” is one or more white space characters
  • A “punctuation” is one or more of anything else
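
The three definitions above can be sketched with a single regular expression. This is only an illustration of the starting point, not parse-latin's actual code: the function name naiveTokenize, the token shape, and the choice of the Unicode classes \p{L} and \p{N} for "letter or number" are assumptions made for this example.

```javascript
// One alternative per token kind: word, white space, anything else.
// Assumption: "letter or number" is approximated with \p{L} and \p{N}.
const TOKEN = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

// Hypothetical helper: split a string into a flat list of typed tokens.
function naiveTokenize(value) {
  const tokens = []

  for (const match of value.matchAll(TOKEN)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'

    tokens.push({type, value: match[0]})
  }

  return tokens
}

console.log(naiveTokenize('A simple sentence.').map((token) => token.type))
// [ 'word', 'whiteSpace', 'word', 'whiteSpace', 'word', 'punctuation' ]
```

Every character falls into exactly one of the three classes, so joining the token values gives back the input unchanged.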

Then, it manipulates and merges those tokens into an nlcst syntax tree, adding sentences and paragraphs where needed. In doing so, it accounts for cases such as:

  • Some punctuation marks are part of the word they occur in, such as non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
  • Some full stops do not mark a sentence end, such as 1., e.g., id.
  • Although full stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as .), ."
  • And many more exceptions
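
To give a feel for the first kind of exception, here is a rough sketch of one merge step: joining a punctuation mark into the surrounding word when it sits directly between two words. It is not how parse-latin implements its plugins. The names naiveTokenize and mergeInnerWordPunctuation, the token shape, and the small set of marks are all assumptions chosen for illustration, and cases such as G.I. or a trailing hyphen are deliberately left out.

```javascript
// Assumption: a small, hand-picked set of marks that may occur inside a word.
const INNER_WORD_MARKS = new Set(['-', '’', "'", ':', '/'])

const TOKEN = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

// Hypothetical helper: naive split into word, white space, and punctuation.
function naiveTokenize(value) {
  return Array.from(value.matchAll(TOKEN), (match) => ({
    type:
      match[1] !== undefined
        ? 'word'
        : match[2] !== undefined
          ? 'whiteSpace'
          : 'punctuation',
    value: match[0]
  }))
}

// Hypothetical helper: merge `word mark word` runs into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      INNER_WORD_MARKS.has(token.value) &&
      previous !== undefined &&
      previous.type === 'word' &&
      next !== undefined &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input list is left untouched.
      result.push({...token})
      index += 1
    }
  }

  return result
}

const merged = mergeInnerWordPunctuation(
  naiveTokenize('A non-profit opens at 11:00.')
)

console.log(merged.map((token) => token.value))
// [ 'A', ' ', 'non-profit', ' ', 'opens', ' ', 'at', ' ', '11:00', '.' ]
```

The final full stop stays a separate punctuation token because no word follows it, which is the behaviour a later sentence-splitting step would rely on.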

License

MIT © Titus Wormer