string-strip-html

Strips HTML tags from strings. Detects legit unencoded brackets.

Install

$ npm i string-strip-html

// consume as a CommonJS require:
const stripHtml = require('string-strip-html')
// or as an ES Module:
import stripHtml from 'string-strip-html'

// it does not assume the output must always be HTML, so it leaves legit brackets alone:
console.log(stripHtml('a < b and c > d')) // => 'a < b and c > d'
// leaves content between tags:
console.log(stripHtml('Some text <b>and</b> text.')) // => 'Some text and text.'
// adds spaces to prevent accidental string concatenation
console.log(stripHtml('aaa<div>bbb</div>ccc')) // => 'aaa bbb ccc'

Here's what you'll get:

Type	Key in `package.json`	Path	Size
Main export - CommonJS version, transpiled, contains `require` and `module.exports`	`main`	`dist/string-strip-html.cjs.js`	20 KB
ES module build that Webpack/Rollup understands. Untranspiled ES6 code with `import`/`export`.	`module`	`dist/string-strip-html.esm.js`	19 KB
UMD build for browsers: transpiled, minified, wrapped in an `iife`, with all dependencies baked in	`browser`	`dist/string-strip-html.umd.js`	38 KB

Purpose
Bigger picture
API
Devil is in the details...
- Whitespace management
Contributing
Licence

Purpose

Imagine you have a string with some HTML tags in it. This library makes those tags go poof.

I strongly believe JS libraries should do one thing and do it well.

In this case, it should strip HTML tags and only HTML tags. If we detect something else, only resembling a tag, we should not delete it, right?

I think stripping anything other than an HTML tag would not be doing it well.

Competitor libraries excuse the imperfections of their algorithms by saying that unencoded brackets are not allowed within HTML.

But hey, what if somebody wanted to strip HTML tags within simple text? Both inputs and outputs then could contain brackets and they would not be encoded, right?

Other HTML stripping libraries (like strip and striptags) assume the output must be HTML too. As a consequence, they:

- limit their functionality and algorithm creativity, ignoring false positives such as legit brackets in non-HTML scenarios,
- thus prevent other libraries that accept HTML as just one of their possible inputs from using their HTML stripping.

I had emotional debates on GitHub with people who explained to me that HTML must have no unencoded brackets. That was their response to me saying that HTML stripping libraries should strip only HTML tags (not `a < b and c > d`, for example). If something is deemed not to be an HTML tag, it should not be stripped, and I don't care if unencoded brackets are not allowed in HTML. It's outside the scope. My algorithm didn't detect it as a tag and thus left it in place. End of scope.

For example, text cleaning libraries (like Detergent) might implement HTML stripping, and their output will in most cases not be HTML (strictly speaking, since you can paste any text into HTML). A string like `a < b and c > d` should pass through HTML stripping intact. Encode the brackets afterwards if you need to, but the HTML stripping should not strip `< b and c >`.

The scope of this library is to take HTML and strip HTML tags and only HTML tags. If there's something else besides tags, something that doesn't belong in HTML, I don't care. Use a different tool to process your string further.
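
To make the "only strip what looks like a tag" idea concrete, here is a minimal sketch. It is not this library's algorithm: `naiveStrip` and the `TAG_LIKE` pattern are hypothetical names, and the rule they encode (a `<` only opens a tag when a letter, or `/` plus a letter, follows it directly) is an assumption chosen for illustration.

```javascript
// Hypothetical helper, NOT the library's actual algorithm:
// a bracketed span counts as a tag only if "<" is directly followed
// by a letter (or by "/" and a letter), the way real tags are written.
const TAG_LIKE = /<\/?[a-zA-Z][^<>]*>/g

function naiveStrip(input) {
  return input.replace(TAG_LIKE, '')
}

console.log(naiveStrip('a < b and c > d')) // => 'a < b and c > d'
console.log(naiveStrip('Some text <b>and</b> text.')) // => 'Some text and text.'
```

A rule this simple still misfires on text such as `a <b and c> d`, which is exactly the kind of false positive a careful stripper has to reason about.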

Bigger picture

I scratched my itch, producing detergent - I needed a tool to clean the text before pasting into HTML because clients would supply briefing documents in all possible forms and shapes and often text would contain invisible Unicode characters. I've been given: Excel files, PSDs, Illustrator files, PDFs and of course, good old "nothing" where I had to reference existing code.

Detergent would remove excessive whitespace, invisible characters and improve the text's English style. Detergent would also take HTML as input - stripping the tags, cleaning the text and giving back ready-to-paste sentences. But in most cases, Detergent's input is just text, and it does not always end up in HTML.

In September 2017, string.js, which originally performed the HTML stripping, was discovered to have vulnerabilities.

I was able to quickly replace all functions that Detergent was consuming from string.js except HTML-stripping.

This library is the last missing piece of a puzzle to drop string.js from Detergent dependencies.

API

Basically, string in, string out, with an optional second input argument - an Optional Options Object.

API - Input

Input argument	Type	Obligatory?	Description
`input`	String	yes	Text you want to strip HTML tags from
`opts`	Plain object	no	Optional options object, see below

If the supplied input arguments are of any other type, an error will be thrown.
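
The contract in the table can be pictured as a small guard. This is only a sketch under assumptions: `checkArgs` and `isPlainObject` are hypothetical helpers, and the error types and messages are illustrative rather than the ones the library throws.

```javascript
// Hypothetical guard mirroring the documented contract; names and
// error messages are illustrative, not taken from the library's source.
function isPlainObject(value) {
  return (
    value !== null &&
    typeof value === 'object' &&
    Object.getPrototypeOf(value) === Object.prototype
  )
}

function checkArgs(input, opts) {
  if (typeof input !== 'string') {
    throw new TypeError(`input must be a string, got ${typeof input}`)
  }
  if (opts !== undefined && !isPlainObject(opts)) {
    throw new TypeError('opts, when supplied, must be a plain object')
  }
}

checkArgs('<b>ok</b>') // passes silently
try {
  checkArgs(42)
} catch (err) {
  console.log(err.message) // => 'input must be a string, got number'
}
```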

Optional Options Object

options object's key	Type of its value	Default	Description
{
`ignoreTags`	Array of zero or more strings	`[]`	Any tags provided here will not be stripped from the input
`stripTogetherWithTheirContents`	Array of zero or more strings, or something falsey	`['script', 'style']`	My idea is you should be able to paste HTML and see only the text that would be visible in a browser window. Not CSS, not stuff from `script` tags. To turn this off, just set it to an empty array. Or something falsey.
}

The Optional Options Object is validated by check-types-mini so please behave: the settings' values have to match the API, and the settings object should not have any extra keys that are not defined in the API. Naughtiness will cause errors to be thrown. I know, it's strict, but it prevents API misconfigurations and helps to identify some errors early on.

Here is the O.O.O. in one place (in case you ever want to copy it):

stripHtml(
  str,
  {
    ignoreTags: [],
    stripTogetherWithTheirContents: ['script', 'style'],
  }
);
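
The strictness described above can be sketched as follows. This is an illustration only: it does not use check-types-mini and is not the library's source. `resolveOpts` and `isStringArray` are hypothetical names, and the error messages are made up.

```javascript
// Hypothetical sketch of strict options handling. It is NOT check-types-mini
// and not the library's source; names and messages are illustrative.
const defaults = {
  ignoreTags: [],
  stripTogetherWithTheirContents: ['script', 'style'],
}

function isStringArray(value) {
  return Array.isArray(value) && value.every(item => typeof item === 'string')
}

function resolveOpts(opts = {}) {
  // Reject keys that the API does not define
  Object.keys(opts).forEach(key => {
    if (!Object.prototype.hasOwnProperty.call(defaults, key)) {
      throw new Error(`unrecognised option key: ${key}`)
    }
  })
  const resolved = Object.assign({}, defaults, opts)
  if (!isStringArray(resolved.ignoreTags)) {
    throw new TypeError('ignoreTags must be an array of strings')
  }
  // Anything falsey switches the feature off, same as an empty array
  if (!resolved.stripTogetherWithTheirContents) {
    resolved.stripTogetherWithTheirContents = []
  } else if (!isStringArray(resolved.stripTogetherWithTheirContents)) {
    throw new TypeError('stripTogetherWithTheirContents must be an array of strings or falsey')
  }
  return resolved
}

console.log(resolveOpts().stripTogetherWithTheirContents) // => [ 'script', 'style' ]
console.log(resolveOpts({ stripTogetherWithTheirContents: null }).stripTogetherWithTheirContents) // => []

try {
  resolveOpts({ ignoreTag: ['b'] }) // typo: missing "s"
} catch (err) {
  console.log(err.message) // => 'unrecognised option key: ignoreTag'
}
```

Failing loudly on the misspelled key is the point: a silently ignored `ignoreTag` would leave the caller wondering why `<b>` is still being stripped.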

API - Output

A string of zero or more characters.

Devil is in the details...

Whitespace management

Two rules:

- Output will be trimmed. Any leading and trailing whitespace characters in the result will be deleted.
- Any whitespace between the tags will be deleted too. For example, z<a> <a>y => zy. Also, anything that string.trim() would reduce to a zero-length string will be removed, such as \n, \r and tabs: z \t\t\t y => zy.
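
A rough sketch of trimming the output and dropping whitespace-only text that sits between two tags. This is not the library's implementation: `stripWithWhitespaceRules` is a hypothetical name and the `TAG` pattern is a deliberately simple stand-in for real tag detection.

```javascript
// Hypothetical sketch; the tag pattern is a simple stand-in,
// not the library's real tag detection.
const TAG = /(<\/?[a-zA-Z][^<>]*>)/

function stripWithWhitespaceRules(input) {
  // split() with a capturing group keeps the tags in the result:
  // even indexes hold text, odd indexes hold the matched tags
  const chunks = input.split(TAG)
  const text = chunks.filter((chunk, i) => {
    const isTag = i % 2 === 1
    if (isTag) return false
    const betweenTags = i > 0 && i < chunks.length - 1
    // drop text that sits between two tags and trims to nothing
    return !(betweenTags && chunk.trim() === '')
  })
  // trim the final result
  return text.join('').trim()
}

console.log(stripWithWhitespaceRules('z<a> <a>y')) // => 'zy'
console.log(stripWithWhitespaceRules('  <b>hi</b>  ')) // => 'hi'
```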

Contributing

Hi! 99% of people in society are passive consumers. They wait for others to take action, they prefer to blend in. The remaining 1% are proactive citizens who will do something rather than wait. If you are one of that 1%, you're in luck because I am the same and together we can make something happen.

- If you want a new feature in this package or you would like to change some of its functionality, raise an issue on this repo. Also, you can email me. Just let it out.
- If you tried to use this library but it misbehaves, or you need advice setting it up, or its readme doesn't make sense, just document it and raise an issue on this repo. Alternatively, you can email me.
- If you don't like the code in here and would like to give advice about how something could be done better, please do. Same drill - GitHub issues or email, your choice.
- If you would like to add or change some features, just fork it, hack away, and file a pull request. I'll do my best to merge it quickly. Code style is airbnb, only without semicolons. If you use a good code editor, it will pick up the established ESLint setup.

Licence

MIT License (MIT)