Package Exports
- string-strip-html
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (string-strip-html) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
string-strip-html
Strips HTML tags from strings. Detects legit unencoded brackets.
Install
npm i string-strip-html
// consume as a CommonJS require:
const stripHtml = require('string-strip-html')
// or as an ES Module:
import stripHtml from 'string-strip-html'
// it does not assume the output must be always HTML and detects legit brackets:
console.log(stripHtml('a < b and c > d')) // => 'a < b and c > d'
// leaves content between tags:
console.log(stripHtml('Some text <b>and</b> text.')) // => 'Some text and text.'
// adds spaces to prevent accidental string concatenation
console.log(stripHtml('aaa<div>bbb</div>ccc')) // => 'aaa bbb ccc'
Here's what you'll get:
Type | Key in package.json |
Path | Size |
---|---|---|---|
Main export - CommonJS version, transpiled to ES5, contains require and module.exports |
main |
dist/string-strip-html.cjs.js |
18 KB |
ES module build that Webpack/Rollup understands. Untranspiled ES6 code with import /export . |
module |
dist/string-strip-html.esm.js |
17 KB |
UMD build for browsers, transpiled, minified, containing iife 's and has all dependencies baked-in |
browser |
dist/string-strip-html.umd.js |
33 KB |
Table of Contents
Purpose
This library deletes HTML tags from strings and doesn't assume anything about the output.
You might take HTML and strip all tags and paste it back into HTML. But equally, you can take a photo of a christmas card from your grandmother and OCR it, remove all cheeky HTML tags she put around her greetings, then print out this cleaned text and stick it on the wall. OK, I'm exaggerating, but the idea is, we will not assume anything about the input source or destination of the output of this library. We will dilligently identify and delete all and only all HTML tags.
Other HTML stripping libraries (like strip and striptags) assume too much. For example, they will remove legit brackets, such as a < b and c > d
arguing that they don't belong in HTML at the first place and that's some sneaky attack vector. But again, if you stripped HTML tags, then by definition it's not HTML any more and HTML requirements don't apply, do they?
The scope of this library is to take the HTML and strip HTML tags and only HTML tags. If there's something else there besides tags such as greater than signs that doesn't belong in HTML, I don't care. Use different tool to process your string further.
API
Basically, string-in string-out, with optional second input argument - an Optional Options Object.
API - Input
Input argument | Type | Obligatory? | Description |
---|---|---|---|
input |
String | yes | Text you want to strip HTML tags from |
opts |
Plain object | no | The Optional Options Object, see below for its API |
If input arguments are supplied have any other types, an error will be throw
n.
Optional Options Object
An Optional Options Object's key | Type of its value | Default | Description |
---|---|---|---|
{ | |||
ignoreTags |
Array of zero or more strings | [] |
Any tags provided here will not be stripped from the input |
stripTogetherWithTheirContents |
Array of zero or more strings, something falsey |
['script', 'style', 'xml'] |
My idea is you should be able to paste HTML and see only the text that would be visible in a browser window. Not CSS, not stuff from script tags. To turn this off, just set it to an empty array. Or something falsey. |
} |
The Optional Options Object is validated by check-types-mini so please behave: the settings' values have to match the API and settings object should not have any extra keys, not defined in the API. Naughtiness will cause error throw
s. I know, it's strict, but it prevents any API misconfigurations and helps to identify some errors early-on.
Here is the Optional Options Object in one place (in case you ever want to copy it):
{
ignoreTags: [],
stripTogetherWithTheirContents: ['script', 'style', 'xml'],
}
API - Output
A string of zero or more characters-length.
Devil is in the details...
Whitespace management
Two rules:
- Output will be trimmed. Any leading (in front) whitespaces characters as well as trailing (in the end of the result) will be deleted.
- Any whitespace between the tags will be deleted too. For example,
z<a> <a>y
=>zy
. Also, anythingstring.trim()
m-able to zero-length string will be removed, like aforementioned\n
and\r
and also tabs:z<b> \t\t\t <b>y
=>zy
.
Bigger picture
I scratched my itch, producing detergent - I needed a tool to clean the text before pasting into HTML because clients would supply briefing documents in all possible forms and shapes and often text would contain invisible Unicode characters. I've been given: Excel files, PSD's, Illustrator files, PDF's and of course, good old "nothing" where I had to reference existing code.
Detergent would remove the excessive whitespace, invisible characters and improve the text's English style. Detergent would also take HTML as input - stripping the tags, cleaning the text and giving back ready-to-paste sentences. But most of the cases, Detergent's input is just a text. And not always it ends up in HTML.
In September 2017, string.js which originally performed the HTML-stripping was discovered as having vulnerabilities.
I was able to quickly replace all functions that Detergent was consuming from string.js
except HTML-stripping.
This library is the last missing piece of a puzzle to get rid of string.js
.
Contributing
If you want a new feature in this package or you would like us to change some of its functionality, raise an issue on this repo.
If you tried to use this library but it misbehaves, or you need an advice setting it up, and its readme doesn't make sense, just document it and raise an issue on this repo.
If you would like to add or change some features, just fork it, hack away, and file a pull request. We'll do our best to merge it quickly. Code style is
airbnb-base
, only without semicolons. If you use a good code editor, it will pick up the established ESLint setup.
Licence
MIT License (MIT)
Copyright © 2018 Codsen Ltd, Roy Revelt