Scraping API (official library)
Scraping API is an all-in-one solution for scraping web pages in Node.js without headaches.
Current status: Internal tests. Not available for the public for now.
Features
- Fully automated proxy rotation with HQ residential IPs. No captchas, and you will never be detected as a bot or proxy user
- Integrated data extraction with CSS / jQuery selectors, filters and iterators
- Bulk requests: Up to 3 per call
- Send JSON or form-encoded bodies and cookies
- Returns response body, headers, final URL & status code
- Supports redirects
- Coming Soon: Presets for popular websites
Get started in 5 minutes
Install the package from NPM:

```bash
npm install --save scrapingapi
```

Get your API key by simply creating an account on RapidAPI.
Enjoy scraping without headaches!
💡 TIP: You can test your requests with Insomnia (Open Source + Cross Platform)
Simple Usage Example
Here is an example that scrapes the current Bitcoin price and the search results from Google Search.
```js
const scraper = require("scrapingapi")(API_KEY);

scraper.get("https://www.google.com/search?q=bitcoin", { device: "desktop" }, {

    // Extract the current bitcoin price
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],

    // Search results
    results: {

        // For each Google search result
        $foreach: "h2:contains('Web results') + div",

        // We retrieve the link URL
        url: ["a[href]", "href", true, "url"],

        // And the title text
        title: ["h3", "text", true, "title"]
    }

}).then((response) => {
    console.log("Here are the results:", response);
});
```

The `scraper.get` method sends a GET request to the provided URL and returns a Promise resolving to a TScrapeResult object.
Jump: Request methods / Request options / Response object

You will get the following result:
```json
{
    "url": "https://www.google.com/search?q=bitcoin",
    "status": 200,
    "data": {
        "price": {
            "amount": 50655.51,
            "currency": "EUR"
        },
        "results": [{
            "url": "https://bitcoin.org/",
            "title": "Bitcoin - Open source P2P money"
        }, {
            "url": "https://coinmarketcap.com/currencies/bitcoin/",
            "title": "Bitcoin price today, BTC to USD live, marketcap and chart"
        }, {
            "url": "https://www.bitcoin.com/",
            "title": "Bitcoin.com | Buy BTC, ETH & BCH | Wallet, news, markets ..."
        }, {
            "url": "https://en.wikipedia.org/wiki/Bitcoin",
            "title": "Bitcoin - Wikipedia"
        }]
    }
}
```

Jump: Response object
Are you using TypeScript / ESM?
ESM imports are also supported.
If you're using TypeScript, it's advised to use import instead of require in order to benefit from type checking.

```ts
import Scraper from 'scrapingapi';
const scraper = new Scraper(API_KEY);
```

Extracted data typing
In addition to basic type checking, you can define the type of the scraped data.
```ts
...

type BitcoinGoogleResults = {
    // Metadata generated by the price filter
    price: {
        amount: number,
        currency: string
    },
    // An array containing an information object for each Google search result
    results: {
        url: string,
        title: string
    }[]
}

scraper.get<BitcoinGoogleResults>("https://www.google.com/search?q=bitcoin").then( ... );
```

Request: Methods
This library provides one method per supported HTTP method:
```ts
public get(
    url: string,
    options?: TOptions,
    extract?: TExtractor
): Promise<TScrapeResult>;
```

```ts
public post(
    url: string,
    body?: any,
    bodyType?: "json" | "form",
    options?: TOptions,
    extract?: TExtractor
): Promise<TScrapeResult>;
```

With the scrape method, you can also send up to 3 requests per call if each of them points to a different domain name.

```ts
public scrape( requests: TRequestWithExtractors[] ): Promise<TScrapeResult[]>;
```

Jump: Request options / Extractors / Response object
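As a sketch, a bulk call could be assembled like this. The URLs and extractors below are placeholders for illustration, not real endpoints, and the client setup follows the first example:

```javascript
// Hypothetical bulk call: up to 3 requests, each pointing to a
// different domain name, as scrape() requires.
const requests = [
    { url: "https://example.com/", withBody: true },
    { url: "https://example.org/", extract: { title: ["h1", "text", true, "title"] } },
    { url: "https://example.net/", withHeaders: true }
];

// Sanity check: every request targets a distinct hostname.
const hostnames = requests.map(r => new URL(r.url).hostname);
const allDistinct = new Set(hostnames).size === requests.length;

// With a configured client, this would be:
// scraper.scrape(requests).then(results => { /* one TScrapeResult per request */ });
```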
Request: Options
The options of each request are represented by the TRequestWithExtractors type (the following definition is a simplified version):
```ts
type TRequestWithExtractors = {

    // The URL you want to send the request to
    url: string,

    // The HTTP method. Default value: "GET"
    method?: HttpMethod,

    // The cookie string you want to pass to the request.
    // Example: "sessionId=34; userId=87;"
    cookies?: string,

    // The data to send with the request. Must be combined with bodyType.
    // Example: { "name": "bob", "age": 25 }
    body: { [key: string]: any },
    bodyType: typeof bodyTypes[number],

    // Extractor object that defines what data you want to extract from the webpage
    extract?: TExtractor,

    // true if you want to retrieve the response body string
    withBody?: boolean,

    // true if you want to retrieve the response headers
    withHeaders?: boolean,
}
```

Learn More: Allowed HTTP Methods / Allowed Body Types
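As a sketch, a POST request description matching this type could look like the following. The URL and payload are made up for illustration:

```javascript
// Hypothetical request description following TRequestWithExtractors.
const request = {
    url: "https://example.com/login",    // where the request is sent
    method: "POST",                      // overrides the "GET" default
    cookies: "sessionId=34; userId=87;", // cookie string passed along
    body: { name: "bob", age: 25 },      // must be combined with bodyType
    bodyType: "json",                    // "json" or "form"
    withBody: true                       // also return the response body string
};
```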
Extractors
As you've seen before, besides providing an undetectable scraping proxy, the scrapingapi library also allows you to extract and filter data from webpages with the optional extract option.
There are two types of extractors, which you can combine with each other.

```ts
type TExtractor = TValueExtractor | TItemsExtractor;
```

Value extractor
As indicated by its name, the value extractor gives you the tools to easily extract data from a webpage.

```ts
type TValueExtractor = [
    selector: "this" | string,
    attribute: "text" | "html" | string,
    required: boolean,
    ...filters: string[]
]
```

It's an array composed of at least three values:
Selector: A CSS / jQuery-like selector to match the DOM element you are interested in. For example:
- `h3`: Simply matches all `h3` elements
  - Matches: `<h3>This is a title</h3>`
  - Does not match, because it's not an `h3` element: `<p>Hello</p>`
- `a.myLink[href]`: Matches `a` elements having the class `myLink` and a defined `href` attribute
  - Matches: `<a class="myLink anotherclass" href="https://scrapingapi.io">Link Text</a>`
  - Does not match, because it doesn't contain the `myLink` class: `<a class="thisClassIsAlone" href="https://scrapingapi.io">Link Text</a>`
- `h2:contains('Scraping API') + div`: Matches `div` elements that directly follow an `h2` element whose content equals `Scraping API`
  - Matches: `<h2>Scraping API</h2> <div>is cool</div>`
  - Does not match, because the `div` element is not next to the `h2` element: `<h2>Scraping API</h2> <p>is maybe not</p> <div>well configured</div>`
Attribute: The DOM element attribute that contains the value you want to extract. It includes:
- Native HTML attributes: `href`, `class`, `src`, etc.
- `"text"`: Get the element's text content
- `"html"`: Get the element's HTML content

Required: A boolean that specifies whether this value is essential. If required is true and no value has been found, the whole item will be excluded from the response.
Filters: All the following values are filters that will be applied to the extracted value. Here are the built-in filters:
- `url`
- `title`
- `price`
For example:

```ts
["h3", "text", true, "title"]
```

- Selects all the `h3` elements
- Gets the text content of each of these elements
- This data is required: it must be present in the response
- Formats the data by passing it through the `title` filter
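The exact behavior of the built-in filters isn't documented here, but a rough local approximation of the price filter helps illustrate why the examples return `{ amount, currency }` objects. The `parsePrice` function and its currency mapping below are assumptions, not the library's actual implementation:

```javascript
// Rough, illustrative approximation of a "price" filter: turn raw
// text like "123.45 $" into { amount, currency }, or null when the
// text doesn't represent a price.
function parsePrice(text) {
    const match = text.match(/[\d]+(?:[.,]\d+)?/);
    if (!match) return null; // no numeric part: not a price
    const amount = parseFloat(match[0].replace(",", "."));
    // Minimal currency detection; the real filter likely supports more.
    const currency = text.includes("$") ? "USD"
                   : text.includes("€") ? "EUR"
                   : null;
    return { amount, currency };
}
```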
Item extractor
```ts
type TItemsExtractor = (
    { $foreach?: string }
    &
    { [name: string]: TExtractor }
)
```

The item extractor has 3 use cases. To illustrate them, let's go back to the Bitcoin Google Search example.
- Give a name to every value you've extracted:

```ts
{
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
}
```

- Define a structure for your data (you can nest multiple item extractors):

```ts
{
    informations: {
        price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
    }
}
```

- Iterate over a list of DOM elements to return an array with one entry per element (see the `$foreach` instruction):

```ts
{
    price: ["#search .obcontainer .card-section > div:eq(1)", "text", true, "price"],
    results: {
        $foreach: "h2:contains('Web results') + div", // This is our iterator
        url: ["a[href]", "href", true, "url"],
        title: ["h3", "text", true, "title"]
    }
}
```
The $foreach instruction
The $foreach instruction allows you to iterate over all the items that match a selector.
Important: Please note that all the selectors that follow - directly or indirectly - a $foreach instruction will be relative to the matched items.
Consider the following extractor:
```ts
{
    $foreach: "article.product",
    name: ["> h3", "text", true, "title"],
}
```

Its goal is to extract the title of every article element having the product class.
Since we're iterating over items via a $foreach, the `> h3` selector will be evaluated inside every `article.product` element.
In other words, the name data will match every `h3` that is a direct child of an `article.product` element.
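This scoping rule can be mimicked on plain data: everything that follows $foreach runs against one matched item at a time, never against the whole page. Here is a DOM-free sketch using a toy page model (not the library's real engine):

```javascript
// Toy page model: the "page" holds a list of article objects, and
// $foreach-style iteration scopes the remaining lookups to each one.
const page = {
    articles: [
        { h3: "Sandwich cat lost on a burger rocket" },
        { h3: "Aliens can't sleep because of this cute DJ" }
    ]
};

// Equivalent in spirit to { $foreach: "article.product", name: ["> h3", ...] }:
// the "> h3" lookup resolves inside each article, never globally.
const extracted = page.articles.map(article => ({ name: article.h3 }));
```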
Response
For each request you send, a TScrapeResult will be returned, containing the information you requested in the options.
```ts
type TScrapeResult<TData extends any = any> = {

    // The final response URL. Useful if the requested webpage sends redirects.
    url: string,

    // The response HTTP status code. 200 if everything is OK.
    status: number,

    // When you set the `withHeaders` option to true, an object containing the webpage response headers will be returned.
    headers?: { [key: string]: string },

    // When you set the `withBody` option to true, you will get the HTML of the requested webpage.
    body?: string,

    // When you specify extractors with the `extract` option, data will contain the extracted data.
    data?: TData
}
```

Learn more: List of HTTP status codes.
Optimize the response time
Disable these options as soon as you can:
- extract
- withBody
- withHeaders
These three features can be useful, but they use additional CPU resources, slow down communication between our proxies and our server, and increase the response size.
Another example
Consider that http://example.com/products contains the following HTML code:
```html
<h2>Space Cat Holograms to motivate you while programming</h2>
<p>Free shipping to all the Milky Way.</p>
<section id="products">
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4014371.jpg" />
        <h3>Sandwich cat lost on a burger rocket</h3>
        <strong class="red price">123.45 $</strong>
        <ul class="tags">
            <li>sandwich</li>
            <li>burger</li>
            <li>rocket</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575175.jpg" />
        <h3>Aliens can't sleep because of this cute DJ</h3>
        <ul class="tags">
            <li>aliens</li>
            <li>sleep</li>
            <li>cute</li>
            <li>dj</li>
            <li>music</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575192.jpg" />
        <h3>Travelling at the speed of light with a radioactive spaceship</h3>
        <p class="details">
            Warning: Contains Plutonium.
        </p>
        <strong class="red price">456.78 $</strong>
        <ul class="tags">
            <li>pizza</li>
            <li>slice</li>
            <li>spaceship</li>
        </ul>
    </article>
    <article class="product">
        <img src="https://wallpapercave.com/wp/wp4575163.jpg" />
        <h3>Gentleman dropped his litter into a black hole</h3>
        <p class="details">
            Since he found this calm planet.
        </p>
        <strong class="red price">undefined</strong>
        <ul class="tags">
            <li>luxury</li>
            <li>litter</li>
        </ul>
    </article>
</section>
```

Let's extract the product list.
```ts
type Product = {
    name: string,
    image: string,
    price: { amount: number, currency: string },
    tags: { text: string }[],
    description?: string
}

scraper.get<Product[]>("http://example.com/products", {}, {
    $foreach: "#products > article.product",
    name: ["> h3", "text", true, "title"],
    image: ["> img", "src", true, "url"],
    price: ["> .price", "text", true, "price"],
    tags: {
        $foreach: "> ul.tags > li",
        text: ["this", "text", true, "title"]
    },
    description: ["> .details", "text", false]
});
```

Here is the response:
```json
{
    "url": "http://example.com/products",
    "status": 200,
    "data": [{
        "name": "Sandwich cat lost on a burger rocket",
        "image": "https://wallpapercave.com/wp/wp4014371.jpg",
        "price": { "amount": 123.45, "currency": "USD" },
        "tags": [
            { "text": "sandwich" },
            { "text": "burger" },
            { "text": "rocket" }
        ]
    },{
        "name": "Travelling at the speed of light with a radioactive spaceship",
        "image": "https://wallpapercave.com/wp/wp4575192.jpg",
        "price": { "amount": 456.78, "currency": "USD" },
        "tags": [
            { "text": "pizza" },
            { "text": "slice" },
            { "text": "spaceship" }
        ]
    }]
}
```
Did you notice? Two items were excluded, because the price data was marked as required, but:
- "Aliens can't sleep because of this cute DJ" doesn't contain any element matching `> .price`
- "Gentleman dropped his litter into a black hole" contains a `.price` element, but its text content doesn't represent a price
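The exclusion rule can be sketched in isolation: when a required value resolves to nothing, the whole item is dropped. The `toPrice` helper below is a stand-in for the real price filter, not the library's implementation:

```javascript
// Items mirroring the example: one valid price, one missing .price
// element, one .price whose text is not a price.
const items = [
    { name: "Sandwich cat lost on a burger rocket", priceText: "123.45 $" },
    { name: "Aliens can't sleep because of this cute DJ", priceText: null },
    { name: "Gentleman dropped his litter into a black hole", priceText: "undefined" }
];

// Stand-in for the price filter: null means "no usable value".
function toPrice(text) {
    if (!text) return null;
    const m = text.match(/[\d]+(?:\.\d+)?/);
    return m ? { amount: parseFloat(m[0]), currency: "USD" } : null;
}

// Because price is required, items without a usable price are excluded.
const kept = items
    .map(i => ({ name: i.name, price: toPrice(i.priceText) }))
    .filter(i => i.price !== null);
```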
Ready to scrape the web?
Why not play with the examples?
Need any additional information or help?
- Check whether an issue has already been created
- If not, feel free to create a new issue
- For more personal questions, or for professional inquiries:
Send me an email
contact@gaetan-legac.fr
Credits
Space cat images are from WallpaperCave.
