Scrapy & AutoExtract API integration

We’ve just released a new open-source Scrapy middleware which makes it easy to integrate AutoExtract into your existing Scrapy spider. If you haven’t heard about AutoExtract yet, it’s an AI-based web scraping tool which automatically extracts data from web pages without the need to write any code. Learn more about AutoExtract here.

Installation

This project uses Python 3.6+ and pip. A virtual environment is strongly encouraged.

$ pip install git+https://github.com/scrapinghub/scrapy-autoextract


Configuration

Enable middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}

This middleware should be the last one to execute, so make sure to give it the highest order value among your downloader middlewares.
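For instance, if your project already registers other downloader middlewares, AutoExtract should come after them. A minimal sketch (the custom middleware name is a made-up placeholder):

DOWNLOADER_MIDDLEWARES = {
    # hypothetical project middleware, runs earlier in the download chain
    'myproject.middlewares.CustomHeadersMiddleware': 400,
    # AutoExtract runs last, with the highest order value
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}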

AutoExtract settings

Mandatory

These settings must be defined in order for AutoExtract to work; a combined settings sketch follows the lists below.

  • AUTOEXTRACT_USER: your AutoExtract API key
  • AUTOEXTRACT_PAGE_TYPE: the kind of data to be extracted (current options: "product" or "article")

Optional

  • AUTOEXTRACT_URL: AutoExtract service URL (default: autoextract.scrapinghub.com)
  • AUTOEXTRACT_TIMEOUT: response timeout from AutoExtract (default: 660 seconds)
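Taken together, a minimal settings.py fragment might look like this (the API key is a placeholder, and the optional settings are shown with their documented defaults):

# Mandatory
AUTOEXTRACT_USER = 'my_autoextract_apikey'      # your AutoExtract API key
AUTOEXTRACT_PAGE_TYPE = 'article'               # or 'product'

# Optional (documented defaults)
AUTOEXTRACT_URL = 'autoextract.scrapinghub.com'
AUTOEXTRACT_TIMEOUT = 660                       # seconds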

Spider

AutoExtract requests are opt-in: they must be enabled per request by adding the following to the request's meta:

meta['autoextract'] = {'enabled': True}
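In context, a request that opts in might look like this (the URL is a placeholder):

yield scrapy.Request(
    'https://example.com/news/some-article',     # placeholder URL
    meta={'autoextract': {'enabled': True}},     # opt this request in
    callback=self.parse,
)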

If the request was sent to AutoExtract, you can access the AutoExtract result inside your Scrapy spider through the response's meta attribute:

def parse(self, response):
    yield response.meta['autoextract']
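If you'd rather yield individual fields than the raw result, a sketch like this works, assuming the payload has the shape shown in the example output further below:

def parse(self, response):
    result = response.meta['autoextract']
    # The API returns a list of results; unwrap it if the middleware hasn't already
    if isinstance(result, list):
        result = result[0]
    # Keys below match the 'article' page type from the example output in this post
    article = result.get('article', {})
    yield {
        'headline': article.get('headline'),
        'author': article.get('author'),
        'url': response.url,
    }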


Example

In the Scrapy settings file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}

# Disable the AutoThrottle extension
AUTOTHROTTLE_ENABLED = False

AUTOEXTRACT_USER = 'my_autoextract_apikey'
AUTOEXTRACT_PAGE_TYPE = 'article'

In the spider:

import scrapy
from scrapy import Spider

class ExampleSpider(Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'autoextract': {'enabled': True}}, callback=self.parse)

    def parse(self, response):
        yield response.meta['autoextract']

Example output:

[{
    "query": {
        "domain": "example.com",
        "userQuery": {
            "url": "https://www.example.com/news/2019/oct/15/lorem-dolor-sit",
            "pageType": "article"
        },
        "id": "1570771884892-800e44fc7cf49259"
    },
    "article": {
        "articleBody": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat...",
        "description": "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatu",
        "probability": 0.9717171744215637,
        "inLanguage": "en",
        "headline": "Lorem Ipsum Dolor Sit Amet",
        "author": "Attila Toth",
        "articleBodyHtml": "<article>\n\n<p>Lorem ipsum...",
        "images": ["https://i.example.com/img/media/12a71d2200e99f9fff125972b88ff395f5e..."],
        "mainImage": "https://i.example.com/img/media/12a71d2200e99f9fff125972b88ff395f5e..."
    }
}]


Limitations

  • The incoming spider request is rendered by AutoExtract, not just downloaded by Scrapy, which can change the result: the request comes from a different IP, with different headers, etc.
  • Only GET requests are supported.
  • Custom headers and cookies are not supported (i.e. Scrapy features to set them don't work).
  • Proxies are not supported (they would work incorrectly, sitting between Scrapy and AutoExtract instead of between AutoExtract and the website).
  • The AutoThrottle extension can work incorrectly for AutoExtract requests, because AutoExtract timing can be much larger than the time required to download a page, so it's best to set AUTOTHROTTLE_ENABLED=False in the settings.
  • Redirects are handled by AutoExtract, not by Scrapy, so redirect middlewares might have no effect.
  • Retries should be disabled, because AutoExtract handles them internally (use RETRY_ENABLED=False in the settings). The exception is when AutoExtract returns HTTP code 429 because too many requests were sent in a short amount of time; for that case it's best to use RETRY_HTTP_CODES=[429]. A combined settings sketch follows this list.
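Putting the throttling and retry advice together, a sketch of the relevant settings; note that keeping RETRY_HTTP_CODES=[429] implies leaving RETRY_ENABLED at its default of True, so only 429 responses are retried by Scrapy:

# Disable AutoThrottle: AutoExtract response times would skew its timing model
AUTOTHROTTLE_ENABLED = False

# Let AutoExtract handle retries internally, but let Scrapy retry on HTTP 429
RETRY_HTTP_CODES = [429]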

Check out the middleware on GitHub or learn more about AutoExtract!
