Scrapinghub’s New AI-Powered Developer Data Extraction API for E-Commerce & Article Extraction

Today, we’re delighted to announce the launch of the beta program for Scrapinghub’s new AI-powered developer data extraction API for automated product and article extraction.

After much development and refinement with alpha users, our team has brought this machine learning technology to the point where the data extraction engine can automatically identify common items on product and article web pages and extract them, with no need to develop and maintain an individual web crawler for each site.

This enables developers to turn unstructured product and article pages into structured datasets at a scale, speed and flexibility that is nearly impossible to achieve when developing spiders manually.

With the AI-enabled data extraction engine behind the developer API, you can extract product data from 100,000 e-commerce sites without having to write 100,000 custom spiders.

As a result, today we’re opening the developer API up to everyone in a public beta.

Join The Beta Program Today

If you are interested in e-commerce or media monitoring and would like early access to the data extraction developer API, be sure to sign up for the public beta program.

When you sign up for the beta program you will receive an API key and documentation on how to use the API. From there you are free to use the developer API for your own projects, and you retain ownership of the data you extract after the beta program closes.

What's even better, the beta program is completely free. You will be assigned a daily/monthly request quota which you are free to consume as you wish.

The beta program will run until July 9th, so if you’d like to be involved then be sure to sign up today as places are limited.

How to Use the API

Once you’ve been approved to join the beta program and have received your API key, using the API is very straightforward.

Currently, the API has a single endpoint: https://developerapi.scrapinghub.com/v1/extract. A request is composed of one or more queries where each query contains a URL to extract from, and a page type that indicates what the extraction result should be (product or article).

Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication where your API key is the username and the password is empty.

To make a request, simply send a POST request to the API along with your API key, target URL and pageType (either "article" or "product"):

curl --verbose \
 --user '[api key]':'' \
 --header 'Content-Type: application/json' \
 --data '[{"url": "https://blog.scrapinghub.com/gopro-study", "pageType": "article"}]' \
 https://developerapi.scrapinghub.com/v1/extract

Or, in Python:

import requests
response = requests.post('https://developerapi.scrapinghub.com/v1/extract',
                        auth=('[api key]', ''),
                        json=[{'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])
print(response.json())

To facilitate query batching (see below), API responses are wrapped in a JSON array. Here is an article from our blog that we want to extract structured data from:

[Screenshot of the article page]

And the response from the article extraction API:

[
    {
        "article": {
            "articleBody": "Unbeknownst to many, there is a data revolution happening in finance.\n\nIn their never ending search for alpha hedge funds and investment banks are increasingly turning to new alternative sources of data to give them an informational edge over the market.\n\nOn the 31st May, Scrapinghub got ...",
            "articleBodyRaw": "<span id=\"hs_cos_wrapper_post_body\" class=\"hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_rich_text\" data-hs-cos-general-type=\"meta_field\" data-hs-cos-type=\"rich_text\"><p><span>Unbeknownst to many, there is a data revolution ... ",
            "audioUrls": null,
            "author": "Ian Kerins",
            "authorsList": [
                "Ian Kerins"
            ],
            "breadcrumbs": null,
            "datePublished": "2018-06-19T00:00:00",
            "datePublishedRaw": "June 19, 2018",
            "description": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
            "headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
            "images": [
                "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"
            ],
            "inLanguage": "en",
            "mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg#keepProtocol",
            "probability": 0.8376080989837646,
            "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
            "videoUrls": null
        },
        "error": null,
        "html": "<!DOCTYPE html><!-- start coded_template: id:5871566911 path:generated_layouts/5871566907.html --><!-...",
        "product": null,
        "query": {
            "userMeta": "Ku chatlanin!",
            "userQuery": {
                "pageTypeHint": "article",
                "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data"
            }
        }
    }
]

Product & Article Extraction

As mentioned previously, the developer API can extract data from two types of web pages: product pages and article pages.

Product Extraction

The product extraction API enables developers to easily turn product pages into structured datasets for e-commerce monitoring applications.

To make a request to the product extraction API, simply set the “pageType” attribute to “product”, and provide the URL of a product page to the API. Example:

import requests
response = requests.post('https://developerapi.scrapinghub.com/v1/extract',
                        auth=('[api key]', ''),
                        json=[{'url': 'http://www.waterbedbargains.com/innomax-perfections-deep-fill-softside-waterbed/', 'pageType': 'product'}])
print(response.json()[0]['product'])

The product extraction API can extract the following fields:

brand (String): Brand or manufacturer of the product.
breadcrumbs (List of dictionaries with optional "name" and "link" string fields): Breadcrumbs (a specific navigation element) with optional name and URL.
description (String): Description of the product.
images (List of strings): URL or data URL values of all images of the product (may include the main image).
mainImage (String): URL or data URL value of the main image of the product.
name (String): The name of the product.
offers (List of dictionaries with "price" and "currency" string fields): Prices of the product. The price field is always present and is a valid number with a dot as the decimal separator; currency is optional and is given as it appears on the website, without extra normalization (for example, both "$" and "USD" are possible currencies).
probability (Float): Probability that this is a single product page.
properties (List of dictionaries with "key" and "value" string fields): Product properties or characteristics; the key field contains the property name and the value field contains the property value.
sku (String): SKU or other identifier (ISBN, GTIN, MPN, etc.) of the product.
url (String): URL of the page the product was extracted from.

All fields are optional (can be null), except for url and probability.
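Since every field other than url and probability may be null, client code should guard each access. A minimal sketch (the first_price helper and the sample product dictionary are our own illustration, not part of the API):

```python
def first_price(product):
    """Return the first offer price as a float, or None if absent."""
    # offers itself may be null, so fall back to an empty list.
    for offer in product.get('offers') or []:
        price = offer.get('price')
        if price is not None:
            # price is a string with a dot as the decimal separator
            return float(price)
    return None

# Example product dictionary in the shape described above (hypothetical data).
product = {
    'url': 'http://example.com/item',
    'probability': 0.95,
    'name': 'Example Widget',
    'brand': None,
    'offers': [{'price': '19.99', 'currency': '$'}],
}
print(first_price(product))  # 19.99
```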

Article Extraction

The article extraction API enables developers to easily turn articles into structured datasets for media monitoring applications.

To make a request to the article extraction API, simply set the “pageType” attribute to “article”, and provide the URL of an article to the API. Example:

import requests
response = requests.post('https://developerapi.scrapinghub.com/v1/extract',
                        auth=('[api key]', ''),
                        json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'}])
print(response.json()[0]['article'])

The article extraction API can extract the following fields:

headline (String): Article headline or title.
datePublished (String): Date, ISO-formatted with a 'T' separator; may contain a timezone.
datePublishedRaw (String): The same date as datePublished but before parsing, as it appeared on the site.
author (String): Author (or authors) of the article.
authorsList (List of strings): All authors of the article split into separate strings. For example, if author is "Alice and Bob", authorsList is ["Alice", "Bob"]; for a single author "Alice Jones", authorsList is ["Alice Jones"].
inLanguage (String): Language of the article, as an ISO 639-1 language code.
breadcrumbs (List of dictionaries with optional "name" and "link" string fields): Breadcrumbs (a specific navigation element) with optional name and URL.
mainImage (String): URL or data URL value of the main image of the article.
images (List of strings): URL or data URL values of all images of the article (may include the main image).
description (String): Short summary of the article, human-provided if available, or auto-generated.
articleBody (String): Text of the article, including sub-headings and image captions, with newline separators.
articleBodyRaw (String): HTML of the article body.
videoUrls (List of strings): URLs of all videos inside the article body.
audioUrls (List of strings): URLs of all audio inside the article body.
probability (Float): Probability that this is a single article page.
url (String): URL of the page the article was extracted from.

Similarly to the product extraction API, all article extraction fields are optional (can be null), except for url and probability.
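The probability field gives a simple way to discard pages that are unlikely to be genuine articles. A minimal sketch (the confident_articles helper, the sample results and the 0.5 threshold are our own illustration, not an API recommendation):

```python
def confident_articles(results, threshold=0.5):
    """Keep only results whose article extraction looks reliable.

    `results` is the JSON array returned by the API; entries where
    no article was extracted have "article" set to None.
    """
    articles = []
    for result in results:
        article = result.get('article')
        if article and article['probability'] >= threshold:
            articles.append(article)
    return articles

# Hypothetical, abbreviated API responses.
results = [
    {'article': {'url': 'a', 'probability': 0.84, 'headline': 'Real post'}, 'error': None},
    {'article': {'url': 'b', 'probability': 0.12, 'headline': None}, 'error': None},
    {'article': None, 'error': 'Downloader error'},
]
for article in confident_articles(results):
    print(article['headline'])  # Real post
```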

Batching Queries

Both the product and article extraction APIs let you submit multiple queries (up to 100) in a single API request:

import requests
response = requests.post('https://developerapi.scrapinghub.com/v1/extract',
                        auth=('[api key]', ''),
                        json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
                              {'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
                              {'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])
for query_result in response.json():
   print(query_result['article']['headline'])

The API returns extraction results as the data extraction engine produces them, so query results are not necessarily returned in the same order as the original queries.

If you need an easy way to associate results with the queries that generated them, pass an additional "meta" field in each query. The value you pass will appear as a "userMeta" field in the corresponding query result. For example, you can build a dictionary keyed on the "meta" field to match queries with their results:

import requests
queries = [{'meta': 'query1', 'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
          {'meta': 'query2', 'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
          {'meta': 'query3', 'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}]
response = requests.post('https://developerapi.scrapinghub.com/v1/extract',
                        auth=('[api key]', ''),
                        json=queries)
query_results = {result['query']['userMeta']:result for result in response.json()}
for query in queries:
   query_result = query_results[query['meta']]
   print(query_result['article']['headline'])
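Because each request is capped at 100 queries, a longer URL list has to be split across several requests. A minimal sketch (the chunked and extract_all helpers are our own, not part of the API):

```python
import requests

API_URL = 'https://developerapi.scrapinghub.com/v1/extract'
BATCH_SIZE = 100  # maximum number of queries per API request

def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def extract_all(urls, api_key, page_type='article'):
    """Extract every URL, issuing one API request per batch of 100."""
    results = []
    for batch in chunked(urls, BATCH_SIZE):
        queries = [{'url': url, 'pageType': page_type} for url in batch]
        response = requests.post(API_URL, auth=(api_key, ''), json=queries)
        response.raise_for_status()
        results.extend(response.json())
    return results
```

Remember that within each batch the results may come back in any order, so the "meta" technique above still applies.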

If you would like to learn more about the developer API’s functionality and how you can use it for your specific projects, check out the API documentation (sent to you when you sign up).

Don’t Forget! Join the Beta Program Today

The developer API beta program is only open until July 9th, so if you would like free early access to the future of product and article extraction, be sure to sign up for the public beta program today.
