JSPM

metascraper

2.0.0
  • License MIT

A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

Package Exports

  • metascraper

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (metascraper) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

metascraper

Getting Started

metascraper is a library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

  • Have a high accuracy for online articles by default.
  • Be usable on the server and in the browser.
  • Make it simple to add new rules or override existing ones.
  • Don't restrict rules to CSS selectors or text accessors.

Installation

$ npm install metascraper --save

Usage

Let's extract accurate information from the following article:

const metascraper = require('metascraper')
const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  // Fetch the page; `url` is the final URL after any redirects.
  const {body: html, url} = await got(targetUrl)
  // Both `html` and `url` are required options.
  const metadata = await metascraper({html, url})
  console.log(metadata)
})()

Where the output will be something like:

{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}

Metadata

Here is a list of the metadata that metascraper collects by default:

  • author — eg. Noah Kulwin
    A human-readable representation of the author's name.

  • date — eg. 2016-05-27T00:00:00.000Z
    An ISO 8601 representation of the date the article was published.

  • description — eg. Venture capitalists are raising money at the fastest rate...
    The publisher's chosen description of the article.

  • image — eg. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg
    An image URL that best represents the article.

  • logo — eg. https://entrepreneur.com/favicon180x180.png
    An image URL that best represents the publisher brand.

  • publisher — eg. Fast Company
    A human-readable representation of the publisher's name.

  • title — eg. Meet Wall Street's New A.I. Sheriffs
    The publisher's chosen title of the article.

  • url — eg. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion
    The URL of the article.
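
As a rough illustration of working with these fields, the sketch below reports which of the default fields are absent from a result object. The helper `missingFields` is hypothetical and not part of metascraper. How the library represents a field it could not find (missing key, `null`, or empty string) is not stated here, so the sketch treats all three as missing.

```javascript
// Field names taken from the list above.
const DEFAULT_FIELDS = [
  'author', 'date', 'description', 'image',
  'logo', 'publisher', 'title', 'url'
]

// Hypothetical helper (not part of metascraper): returns the names of the
// default fields that are absent, null, or empty in a metadata object.
function missingFields (metadata) {
  return DEFAULT_FIELDS.filter(field => {
    const value = metadata[field]
    return value === undefined || value === null || value === ''
  })
}

// A partial result, using values from the usage example.
const sample = {
  author: 'Ellen Huet',
  date: '2016-05-24T18:00:03.894Z',
  title: 'As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance'
}

console.log(missingFields(sample))
```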

API

metascraper(options)

options

html

Required
Type: String

The HTML markup for extracting the content.

url

Required
Type: String

The URL associated with the HTML markup.

It is used to resolve relative links that may be present in the HTML markup.

It can also be used as a fallback field for different rules.
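
To show why the page URL is needed, here is a minimal sketch of resolving a relative link against it with the standard WHATWG `URL` constructor. The helper `resolveLink` and the image paths are made up for illustration; this is not necessarily how metascraper resolves links internally.

```javascript
// Hypothetical helper (not part of metascraper): resolves a link found in
// the markup against the URL of the page it came from.
function resolveLink (link, baseUrl) {
  return new URL(link, baseUrl).href
}

const pageUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

// A root-relative path is resolved against the page's origin.
const rootRelative = resolveLink('/images/cover.jpg', pageUrl)

// A bare relative path is resolved against the page's directory.
const pathRelative = resolveLink('cover.jpg', pageUrl)

// An absolute URL is returned unchanged.
const absolute = resolveLink('https://assets.bwbx.io/images/cover.jpg', pageUrl)

console.log(rootRelative)
console.log(pathRelative)
console.log(absolute)
```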

Comparison

To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:

Library    metascraper  html-metadata  node-metainspector  open-graph-scraper  unfluff
Correct    95.54%       74.56%         61.16%              66.52%              70.90%
Incorrect  1.79%        1.79%          0.89%               6.70%               10.27%
Missed     2.68%        23.67%         37.95%              26.34%              8.95%

A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly used, spec-compliant pieces of metadata, like Open Graph.
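
The fallback idea can be sketched as an ordered list of rules in which the first rule to return a value wins. Everything below (`titleRules`, `matchFirst`, `firstMatch`) is hypothetical and is not metascraper's actual rule API. The regular expressions only keep the sketch self-contained; they are too fragile for real-world HTML.

```javascript
// Hypothetical rule list for the title, ordered from most to least preferred.
// Each rule takes the HTML string and returns a value or null.
const titleRules = [
  html => matchFirst(html, /<meta\s+property="og:title"\s+content="([^"]*)"/i),
  html => matchFirst(html, /<meta\s+name="twitter:title"\s+content="([^"]*)"/i),
  html => matchFirst(html, /<title>([^<]*)<\/title>/i)
]

// Returns the trimmed first capture group, or null if absent or empty.
function matchFirst (html, pattern) {
  const match = pattern.exec(html)
  if (!match) return null
  const value = match[1].trim()
  return value === '' ? null : value
}

// Runs the rules in order and returns the first non-null result.
function firstMatch (rules, html) {
  for (const rule of rules) {
    const value = rule(html)
    if (value !== null) return value
  }
  return null
}

// No Open Graph or Twitter tags, so the <title> rule supplies the value.
const plainPage = '<html><head><title> Plain Title </title></head></html>'
console.log(firstMatch(titleRules, plainPage))
```

A page that only carries a `<title>` tag still yields a title here, whereas a scraper that checked Open Graph alone would report the field as missed.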

metascraper's default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned for that purpose than the other libraries.

If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.

License

metascraper © Ian Storm Taylor, Released under the MIT License.
Maintained by Kiko Beats with help from contributors.