Scraping API (official library)
Scraping API is an all-in-one solution for scraping web pages in Node.js without headaches.
Current status: Internal tests. Not available for the public for now.
Features
- Fully automated proxy rotation with HQ residential IPs. No captchas, and you will never be detected as a bot or proxy user
- Integrated data extraction with CSS / jQuery selectors, filters and iterators
- Bulk requests: Up to 3 per call
- Send JSON or form-encoded bodies and cookies
- Returns response body, headers, final URL & status code
- Supports redirects
- Coming Soon: Presets for popular websites
Get started in 5 minutes
Install the package from NPM:

```bash
npm install --save scrapingapi
```

Get your API key by simply creating an account on RapidAPI.
Enjoy scraping without headaches!
💡 TIP: You can test your requests with Insomnia (Open Source + Cross Platform)
Simple Usage Example
Here is an example that scrapes the current Bitcoin price and the search results from Google Search.
```js
const scraper = require("scrapingapi")(API_KEY);

scraper.get("https://www.google.com/search?q=bitcoin", { device: "desktop" }, {

    // Extract the current bitcoin price
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],

    // Search results
    results: {

        // For each Google search result
        $foreach: "h2:contains('Web results') + div",

        // We retrieve the link URL
        url: ["a[href]", "href", true, "url"],

        // And the title text
        title: ["h3", "text", true, "title"]
    }

}).then((response) => {
    console.log("Here are the results:", response);
});
```

The `scraper.get` method sends a GET request to the provided URL and returns a Promise resolving to a TScrapeResult object.
Jump: Request methods / Request options / Response object

You will get the following result:
```json
{
    "url": "https://www.google.com/search?q=bitcoin",
    "status": 200,
    "data": {
        "price": {
            "amount": 50655.51,
            "currency": "EUR"
        },
        "results": [{
            "url": "https://bitcoin.org/",
            "title": "Bitcoin - Open source P2P money"
        }, {
            "url": "https://coinmarketcap.com/currencies/bitcoin/",
            "title": "Bitcoin price today, BTC to USD live, marketcap and chart"
        }, {
            "url": "https://www.bitcoin.com/",
            "title": "Bitcoin.com | Buy BTC, ETH & BCH | Wallet, news, markets ..."
        }, {
            "url": "https://en.wikipedia.org/wiki/Bitcoin",
            "title": "Bitcoin - Wikipedia"
        }]
    }
}
```

Jump: Response object
Are you using TypeScript / ESM?
ESM imports are also supported.
If you're using TypeScript, it's advised to use import instead of require in order to benefit from type checking.

```ts
import Scraper from 'scrapingapi';
const scraper = new Scraper(API_KEY);
```

Extracted data typing
In addition to basic type checking, you can define the type of the scraped data.
```ts
...

type BitcoinGoogleResults = {
    // Metadata generated by the price filter
    price: {
        amount: number,
        currency: string
    },
    // An array containing an information object for each Google search result
    results: {
        url: string,
        title: string
    }[]
}

scraper.get<BitcoinGoogleResults>("https://www.google.com/search?q=bitcoin").then( ... );
```

Request: Methods
This library provides one method per supported HTTP method:
```ts
public get(
    url: string,
    options?: TOptions,
    extract?: TExtractor
): Promise<TScrapeResult>;
```

```ts
public post(
    url: string,
    body?: any,
    bodyType?: "json" | "form",
    options?: TOptions,
    extract?: TExtractor
): Promise<TScrapeResult>;
```

With the scrape method, you can also send up to 3 requests per call if each of them points to a different domain name.

```ts
public scrape( requests: TRequestWithExtractors[] ): Promise<TScrapeResult[]>;
```

Jump: Request options / Extractors / Response object
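As a sketch, a bulk call could be assembled like this. The URLs and extractors below are placeholders for illustration, not real endpoints, and the client setup follows the first example:

```javascript
// Hypothetical bulk call: up to 3 requests, each pointing to a
// different domain name, as scrape() requires.
const requests = [
    { url: "https://example.com/", withBody: true },
    { url: "https://example.org/", extract: { title: ["h1", "text", true, "title"] } },
    { url: "https://example.net/", withHeaders: true }
];

// Sanity check: every request targets a distinct hostname.
const hostnames = requests.map(r => new URL(r.url).hostname);
const allDistinct = new Set(hostnames).size === requests.length;

// With a configured client, this would be:
// scraper.scrape(requests).then(results => { /* one TScrapeResult per request */ });
```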
Request: Options
The options of each request are represented by the TRequestWithExtractors type (the following definition is a simplified version):
```ts
type TRequestWithExtractors = {

    // The URL you want to send the request to
    url: string,

    // The HTTP method. Default value: "GET"
    method?: HttpMethod,

    // The cookie string you want to pass to the request.
    // Example: "sessionId=34; userId=87;"
    cookies?: string,

    // The data to send with the request. Must be combined with bodyType.
    // Example: { "name": "bob", "age": 25 }
    body: { [key: string]: any },
    bodyType: typeof bodyTypes[number],

    // Extractor object that defines what data you want to extract from the webpage
    extract?: TExtractor,

    // true if you want to retrieve the response body string
    withBody?: boolean,

    // true if you want to retrieve the response headers
    withHeaders?: boolean,
}
```

Learn More: Allowed HTTP Methods / Allowed Body Types
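As a sketch, a POST request description matching this type could look like the following. The URL and payload are made up for illustration:

```javascript
// Hypothetical request description following TRequestWithExtractors.
const request = {
    url: "https://example.com/login",    // where the request is sent
    method: "POST",                      // overrides the "GET" default
    cookies: "sessionId=34; userId=87;", // cookie string passed along
    body: { name: "bob", age: 25 },      // must be combined with bodyType
    bodyType: "json",                    // "json" or "form"
    withBody: true                       // also return the response body string
};
```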
Extractors
As you've seen before, besides providing an undetectable scraping proxy, the scrapingapi library also allows you to extract and filter data from webpages with the optional extract option.
There are two types of extractors, which you can combine with each other.

```ts
type TExtractor = TValueExtractor | TItemsExtractor;
```

Value extractor
As indicated by its name, the value extractor gives you the tools to easily extract data from a webpage.

```ts
type TValueExtractor = [
    selector: "this" | string,
    attribute: "text" | "html" | string,
    required: boolean,
    ...filters: string[]
]
```

It's an array composed of at least three values:
Selector: A CSS / jQuery-like selector to match the DOM element you are interested in. For example:
- `h3`: Simply matches all `h3` elements
  - Matches: `<h3>This is a title</h3>`
  - Does not match, because it's not an `h3` element: `<p>Hello</p>`
- `a.myLink[href]`: Matches `a` elements having the class `myLink` and a defined `href` attribute
  - Matches: `<a class="myLink anotherclass" href="https://scrapingapi.io">Link Text</a>`
  - Does not match, because it doesn't contain the `myLink` class: `<a class="thisClassIsAlone" href="https://scrapingapi.io">Link Text</a>`
- `h2:contains('Scraping API') + div`: Matches `div` elements that directly follow an `h2` element whose content equals `Scraping API`
  - Matches: `<h2>Scraping API</h2> <div>is cool</div>`
  - Does not match, because the `div` element is not next to the `h2` element: `<h2>Scraping API</h2> <p>is maybe not</p> <div>well configured</div>`
Attribute: The DOM element attribute that contains the value you want to extract. It includes:
- Native HTML attributes: `href`, `class`, `src`, etc.
- `"text"`: Get the element's text content
- `"html"`: Get the element's HTML content

Required: A boolean that specifies whether this value is essential. If required is true and no value has been found, the whole item will be excluded from the response.
Filters: All the following values are filters that will be applied to the extracted value. Here are the built-in filters:
- `url`
- `title`
- `price`
For example:

```ts
["h3", "text", true, "title"]
```

- Selects all the `h3` elements
- Gets the text content of each of these elements
- This data is required: it must be present in the response
- Formats the data by passing it through the `title` filter
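The exact behavior of the built-in filters isn't documented here, but a rough local approximation of the price filter helps illustrate why the examples return `{ amount, currency }` objects. The `parsePrice` function and its currency mapping below are assumptions, not the library's actual implementation:

```javascript
// Rough, illustrative approximation of a "price" filter: turn raw
// text like "123.45 $" into { amount, currency }, or null when the
// text doesn't represent a price.
function parsePrice(text) {
    const match = text.match(/[\d]+(?:[.,]\d+)?/);
    if (!match) return null; // no numeric part: not a price
    const amount = parseFloat(match[0].replace(",", "."));
    // Minimal currency detection; the real filter likely supports more.
    const currency = text.includes("$") ? "USD"
                   : text.includes("€") ? "EUR"
                   : null;
    return { amount, currency };
}
```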
Item extractor
```ts
type TItemsExtractor = (
    { $foreach?: string }
    &
    { [name: string]: TExtractor }
)
```

The item extractor has 3 use cases. To illustrate them, let's go back to the Bitcoin Google Search example.
- Give a name to every value you've extracted:

```ts
{
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
}
```

- Define a structure for your data (you can nest multiple item extractors):

```ts
{
    informations: {
        price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
    }
}
```

- Iterate over a list of DOM elements to return an array with one entry per element (see the `$foreach` instruction):

```ts
{
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
    results: {
        $foreach: "h2:contains('Web results') + div", // This is our iterator
        url: ["a[href]", "href", true, "url"],
        title: ["h3", "text", true, "title"]
    }
}
```
The $foreach instruction
The $foreach instruction allows you to iterate over all the items that match a selector.
Important: Please note that all the selectors that follow - directly or indirectly - a $foreach instruction will be relative to the matched items.
Consider the following extractor:
```ts
{
    $foreach: "article.product",
    name: ["> h3", "text", true, "title"],
}
```

Its goal is to extract the title of every article element having the product class.
Since we're iterating over items via a $foreach, the `> h3` selector will be evaluated inside every `article.product` element.
In other words, the name data will match every `h3` that is a direct child of an `article.product` element.
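This scoping rule can be mimicked on plain data: everything that follows $foreach runs against one matched item at a time, never against the whole page. Here is a DOM-free sketch using a toy page model (not the library's real engine):

```javascript
// Toy page model: the "page" holds a list of article objects, and
// $foreach-style iteration scopes the remaining lookups to each one.
const page = {
    articles: [
        { h3: "Sandwich cat lost on a burger rocket" },
        { h3: "Aliens can't sleep because of this cute DJ" }
    ]
};

// Equivalent in spirit to { $foreach: "article.product", name: ["> h3", ...] }:
// the "> h3" lookup resolves inside each article, never globally.
const extracted = page.articles.map(article => ({ name: article.h3 }));
```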
Response
For each request you send, a TScrapeResult will be returned, containing the information you requested in the options.
```ts
type TScrapeResult<TData extends any = any> = {

    // The final response URL. Useful if the requested webpage sends redirects.
    url: string,

    // The response HTTP status code. 200 if everything is OK.
    status: number,

    // When you set the `withHeaders` option to true, an object containing the webpage response headers will be returned.
    headers?: { [key: string]: string },

    // When you set the `withBody` option to true, you will get the HTML of the requested webpage.
    body?: string,

    // When you specify extractors with the `extract` option, data will contain the extracted data.
    data?: TData
}
```

Learn more: List of HTTP status codes.
Optimize the response time
Disable these options as soon as you can:
- extract
- withBody
- withHeaders
These three features can be useful, but they use additional CPU resources, slow down communication between our proxies and our server, and increase the response size.
Another example
Consider that http://example.com/products contains the following HTML code:
```html
<h2>Space Cat Holograms to motivate you while programming</h2>
<p>Free shipping to all the Milky Way.</p>
<section id="products">
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4014371.jpg" />
        <h3>Sandwich cat lost on a burger rocket</h3>
        <strong class="red price">123.45 $</strong>
        <ul class="tags">
            <li>sandwich</li>
            <li>burger</li>
            <li>rocket</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575175.jpg" />
        <h3>Aliens can't sleep because of this cute DJ</h3>
        <ul class="tags">
            <li>aliens</li>
            <li>sleep</li>
            <li>cute</li>
            <li>dj</li>
            <li>music</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575192.jpg" />
        <h3>Travelling at the speed of light with a radioactive spaceship</h3>
        <p class="details">
            Warning: Contains Plutonium.
        </p>
        <strong class="red price">456.78 $</strong>
        <ul class="tags">
            <li>pizza</li>
            <li>slice</li>
            <li>spaceship</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575163.jpg" />
        <h3>Gentleman dropped his litter into a black hole</h3>
        <p class="details">
            Since he found this calm planet.
        </p>
        <strong class="red price">undefined</strong>
        <ul class="tags">
            <li>luxury</li>
            <li>litter</li>
        </ul>
    </article>
</section>
```

Let's extract the product list.
```ts
type Product = {
    name: string,
    image: string,
    price: { amount: number, currency: string },
    tags: { text: string }[],
    description?: string
}

scraper.get<Product[]>("http://example.com/products", {}, {
    $foreach: "#products > article.product",
    name: ["> h3", "text", true, "title"],
    image: ["> img", "src", true, "url"],
    price: ["> .price", "text", true, "price"],
    tags: {
        $foreach: "> ul.tags > li",
        text: ["this", "text", true, "title"]
    },
    description: ["> .details", "text", false]
});
```

Here is the response:
```json
{
    "url": "http://example.com/products",
    "status": 200,
    "data": [{
        "name": "Sandwich cat lost on a burger rocket",
        "image": "https://wallpapercave.com/wp/wp4014371.jpg",
        "price": { "amount": 123.45, "currency": "USD" },
        "tags": [
            { "text": "sandwich" },
            { "text": "burger" },
            { "text": "rocket" }
        ]
    },{
        "name": "Travelling at the speed of light with a radioactive spaceship",
        "image": "https://wallpapercave.com/wp/wp4575192.jpg",
        "price": { "amount": 456.78, "currency": "USD" },
        "tags": [
            { "text": "pizza" },
            { "text": "slice" },
            { "text": "spaceship" }
        ]
    }]
}
```
Did you notice? Two items were excluded, because the price data was marked as required, but:
- "Aliens can't sleep because of this cute DJ" doesn't contain any element matching `> .price`
- "Gentleman dropped his litter into a black hole" contains a `.price` element, but its text content doesn't represent a price
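The exclusion rule can be sketched in isolation: when a required value resolves to nothing, the whole item is dropped. The `toPrice` helper below is a stand-in for the real price filter, not the library's implementation:

```javascript
// Items mirroring the example: one valid price, one missing .price
// element, one .price whose text is not a price.
const items = [
    { name: "Sandwich cat lost on a burger rocket", priceText: "123.45 $" },
    { name: "Aliens can't sleep because of this cute DJ", priceText: null },
    { name: "Gentleman dropped his litter into a black hole", priceText: "undefined" }
];

// Stand-in for the price filter: null means "no usable value".
function toPrice(text) {
    if (!text) return null;
    const m = text.match(/[\d]+(?:\.\d+)?/);
    return m ? { amount: parseFloat(m[0]), currency: "USD" } : null;
}

// Because price is required, items without a usable price are excluded.
const kept = items
    .map(i => ({ name: i.name, price: toPrice(i.priceText) }))
    .filter(i => i.price !== null);
```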
Ready to scrape the web?
Why not play with the examples?
Need any additional information or help?
- Check whether an issue has already been created
- If not, feel free to create a new issue
- For more personal questions, or for professional inquiries:
Send me an email
contact@gaetan-legac.fr
Credits
Space cat images are from WallpaperCave.
