Scrapy Cloud Secrets: Hub Crawl Frontier and How To Use It

Imagine a long crawling process, like extracting data from a website for a whole month. We can start it and leave it running until we get the results. Still, a whole month is plenty of time for something to go wrong: the target website can go down for a few minutes or hours, your crawling server can suffer a power outage, or the internet connection can fail.

Any of these is a realistic scenario that can happen at any moment, putting your data extraction pipeline at risk. If it does, you may need to restart your crawling process and wait even longer to get access to that precious data. There’s no need to panic, though: this is where Hub Crawl Frontier (HCF) comes to our rescue.

What is Hub Crawl Frontier (HCF)?

HCF is an API for storing request data, available through Scrapy Cloud projects. It is somewhat similar to Collections, but it is intended to store request data rather than to serve as generic key-value storage. If you are familiar with Scrapy, you may be wondering why one would use HCF when Scrapy can store and recover the crawling state by itself.

The difference is that Scrapy requires you to manage this state yourself by saving it to disk (which needs disk quota), and if you are running inside a container, as in Scrapy Cloud, local files are lost once the process finishes. Having external storage for requests takes this burden off your shoulders, leaving you to think about the extraction logic rather than about how to proceed if the crawl crashes and you need to restart.
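For comparison, Scrapy’s built-in persistence is switched on through the JOBDIR setting, which points at a local directory where the scheduler queue and seen requests are kept. A minimal sketch, assuming a hypothetical directory name:

```python
# settings.py (or the spider's custom_settings)
# Scrapy writes the pending requests and the duplicate filter state here.
# The directory name is only an example; it lives on local disk, which is
# exactly what gets lost when a container is discarded.
JOBDIR = "crawls/books-run-1"
```

The same directory has to be available on the next run for the crawl to resume, which is the part HCF moves off your machine.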

Structure of Hub Crawl Frontier

Before digging into an example of how to use HCF, I’ll go over how it is structured. We can create many Frontiers per project, and each one needs a name. Usually, the name will be the name of the spider, to avoid any confusion. Each Frontier is then broken into slots, something similar to sharding, which can be useful in a producer-consumer scenario (the topic of one of our upcoming blog posts). The catch is that we shouldn’t change the number of slots after the Frontier has been created, so keep that in mind when creating it.
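To get an intuition for why the number of slots should stay fixed, here is a small sketch of hash-based sharding. The slot_for function is hypothetical and is not necessarily how hcf-backend assigns slots; it only illustrates that when a slot is derived from a hash modulo the slot count, changing the count sends many URLs to a different slot than before.

```python
import hashlib


def slot_for(url: str, number_of_slots: int) -> str:
    """Map a URL to a slot name by hashing it (illustrative assumption only)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return str(int(digest, 16) % number_of_slots)


urls = [f"http://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 51)]

# Assign every URL once with 2 slots and once with 3 slots.
with_two_slots = {url: slot_for(url, 2) for url in urls}
with_three_slots = {url: slot_for(url, 3) for url in urls}

# URLs whose slot changes would no longer be deduplicated against their old slot.
moved = [url for url in urls if with_two_slots[url] != with_three_slots[url]]
print(f"{len(moved)} of {len(urls)} URLs would land in a different slot")
```

Under this assumption, a request already stored in one slot could be stored again in another after the change, so it is simpler to pick the slot count once.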

Using HCF

Now that we know what HCF is and how we could make use of it, it is time to see it working. For this purpose, we’ll build a simple Scrapy spider to extract book information from http://books.toscrape.com. To get started, we’ll create a new Scrapy project and install the required dependencies as shown below (type the commands in your terminal).

# setup
mkdir hcf_example
cd hcf_example
python3 -m venv .venv  # or your favorite virtual env
source .venv/bin/activate
 
# project
pip install scrapy scrapy-frontera hcf-backend
scrapy startproject hcf_example .
scrapy genspider books.toscrape.com books.toscrape.com

The commands above create a new directory for our project and a new virtual environment, to avoid messing up our operating system. Then they install Scrapy and some libraries to use HCF. Finally, they create a new Scrapy project and a spider. A side note on the extra libraries for HCF: there are a couple of libraries we could use, like scrapy-hcf, but it seems to have been unmaintained for a while. So, we’ll be using scrapy-frontera with HCF as a backend through hcf-backend.

Given that our project was successfully created and the dependencies were installed, we can write a minimal spider to extract the book data as shown in the following code snippet.

import scrapy


class BooksToscrapeComSpider(scrapy.Spider):
    name = 'books.toscrape.com'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Listing page: follow every book link to its detail page.
        for href in response.css('.product_pod h3 a::attr(href)').getall():
            yield response.follow(href, self.parse_book)

        # Pagination: follow the "next" link when there is one.
        next_page_href = response.css('.pager .next a::attr(href)').get()
        if next_page_href:
            yield response.follow(next_page_href, self.parse)

    def parse_book(self, response):
        # Book detail page: extract the title and price.
        return {
            'title': response.css('.product_main h1::text').get().strip(),
            'price': response.css('.product_main .price_color::text').get().strip(),
        }

If you are familiar with Scrapy, there’s nothing fancy in the code above. It is just a simple spider that navigates the listing pages and follows book links to their pages to extract the title and price.

We can run this spider from the terminal by typing scrapy crawl books.toscrape.com and we should see the result there (no errors and 1,000 items extracted). So far, we’re not interacting with HCF; we’ll start doing so with the following changes. First, we’ll need to update our project’s settings.py file with the following.

HCF_AUTH = 'YOUR API KEY HERE'
HCF_PROJECT_ID = 'YOUR SCRAPY CLOUD PROJECT ID'
SCHEDULER = 'scrapy_frontera.scheduler.FronteraScheduler'
BACKEND = 'hcf_backend.HCFBackend'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0,
}
SPIDER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0,
}

The SCHEDULER, SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES are set so that scrapy-frontera works. Then, we set HCF as the BACKEND and add the Scrapy Cloud API key (HCF_AUTH) and the project in which we’re creating the Frontier (HCF_PROJECT_ID). If you run the spider now, you’ll see some new logs, but it won’t be storing the requests in HCF yet. With these settings in place, we can update our spider so it starts interacting with HCF. The following changes should be applied in the books_toscrape_com.py file.

# Add both dictionaries as class attributes of BooksToscrapeComSpider,
# next to name, allowed_domains and start_urls.
frontera_settings = {
    'HCF_PRODUCER_FRONTIER': 'books_toscrape_com',
    'HCF_PRODUCER_NUMBER_OF_SLOTS': 1,
    'HCF_CONSUMER_FRONTIER': 'books_toscrape_com',
    'HCF_CONSUMER_SLOT': '0'
}

custom_settings = {
    'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse', 'parse_book'],
}

Recall that we are using scrapy-frontera to interact with HCF; that’s the main reason we need to set frontera_settings. Basically, we are setting the name of the Frontier where we are going to store the requests (HCF_PRODUCER_FRONTIER) and the one we are consuming them from (HCF_CONSUMER_FRONTIER).

The HCF_PRODUCER_NUMBER_OF_SLOTS setting is the number of slots to create for this producer, in this case only one. HCF_CONSUMER_SLOT is the slot we’re consuming from, which is slot 0 (there is only one slot and numbering starts from 0). Finally, we need to tell scrapy-frontera which requests it should send to the backend, which it decides by looking at the request callback. If the callback is one of the names set in FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER, the request is sent to the backend; otherwise it is processed as a local Scrapy request.
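The routing rule described above can be sketched in a few lines. This is only an illustration of the decision, not scrapy-frontera’s actual code; goes_to_frontier and the parse_login callback are hypothetical names introduced for the example.

```python
# Same list as in the spider's custom_settings.
CALLBACKS_TO_FRONTIER = ["parse", "parse_book"]


def goes_to_frontier(callback_name, callbacks_to_frontier=CALLBACKS_TO_FRONTIER):
    """Return True when a request with this callback would be sent to the backend."""
    return callback_name in callbacks_to_frontier


# parse_login is a made-up callback that is NOT listed, so it stays local.
for name in ["parse", "parse_book", "parse_login"]:
    destination = "frontier (HCF)" if goes_to_frontier(name) else "local Scrapy scheduler"
    print(f"{name}: {destination}")
```

In our spider both callbacks are listed, so every request it yields is stored in HCF.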

That’s it: we can now run our spider and it will store the requests in HCF. Just run the spider as we did before and it should work! But how can we tell that the requests were sent to HCF? For that, hcf-backend comes with a handy tool called hcfpal. From your terminal, run the command below and you should see the Frontier name.

PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal list

There are some other commands available in hcfpal, like counting the requests in a given Frontier.

PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal count books_toscrape_com

It will show you the request count per slot and total count (in case you have more than one slot).

Incremental crawl

Since we are storing the requests in HCF so that a crawl can be restarted, the same setup also works as an example of incremental crawling. There is no need for special logic: just run the spider again and it should fetch only new content. The requests are identified by their fingerprint, as in Scrapy. There is one catch when working with multiple slots: a given request is unique only within a given slot (we won’t bother with that for now and will leave it for a future article). To get started, let’s clean our Frontier by typing the following in our terminal.
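The per-slot uniqueness mentioned above can be pictured with a toy model. The fingerprint function below is a simplification (a hash of method and URL), and ToySlot is a hypothetical class; neither is the real Scrapy or HCF implementation.

```python
import hashlib


def fingerprint(method: str, url: str) -> str:
    """Simplified request fingerprint: a SHA-1 hash of method and URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class ToySlot:
    """A slot keeps its own set of seen fingerprints and its own queue."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def add(self, method: str, url: str) -> bool:
        """Store the request unless this slot has already seen it."""
        fp = fingerprint(method, url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True


slots = {"0": ToySlot(), "1": ToySlot()}
url = "http://books.toscrape.com/catalogue/page-2.html"

first = slots["0"].add("GET", url)       # new request in slot 0
repeat = slots["0"].add("GET", url)      # duplicate within slot 0
other_slot = slots["1"].add("GET", url)  # slot 1 has never seen it
print(first, repeat, other_slot)  # prints: True False True
```

With a single slot, as in this tutorial, the distinction does not matter: every duplicate is filtered.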

PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal delete books_toscrape_com 0 

Once it’s done, run the spider but stop it before it finishes (simulating an early stop). To do so, press CTRL + C once in the terminal (on macOS this is also Control + C, not Command + C). This sends Scrapy the signal to finish the process gracefully. Then, wait a bit so the crawling process finishes. As the process finishes, it logs the stats in the terminal, and we can use them to understand a bit of what’s happening.

For example, by looking at item_scraped_count I can see that 80 items were extracted. Also, pay attention to the stats starting with hcf/consumer and hcf/producer. These relate to the URLs found in our run: how many were processed/extracted (consumed) and how many were discovered/stored (produced). In my case, it consumed 84 requests and found 105 links (all new, as we had cleaned the Frontier before running).

After inspecting the stats, run the spider once again, without deleting the Frontier, and wait for it to finish. You should see that item_scraped_count now covers only the items that were not extracted in the previous crawl (in my case, 920 items). This happened because the duplicate requests were filtered by HCF, so they weren’t processed again.

You should also identify a similar behavior in hcf/consumer and hcf/producer stats, showing that some links were extracted but not all of them are new.

Finally, you can run the spider once more and it will just stop, logging no items scraped, because all the links it extracts were already processed in the previous runs. So, there is no new data to be processed and it finishes.
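The three runs above can be reproduced with a toy model of a single-slot Frontier that outlives the spider process. Everything here is an assumption made for illustration: ToyFrontier, links_on and run_spider are hypothetical, the site is modeled as 50 listing pages with 20 books each, and requests are consumed strictly one at a time, which the real crawl does not do.

```python
from collections import deque

PAGES, BOOKS_PER_PAGE = 50, 20  # modeled site: 1,000 books in total


class ToyFrontier:
    """Keeps pending requests and every request ever seen, across runs."""

    def __init__(self):
        self.seen = set()
        self.pending = deque()

    def add(self, request) -> bool:
        if request in self.seen:
            return False  # duplicate: filtered, never queued again
        self.seen.add(request)
        self.pending.append(request)
        return True


def links_on(page):
    """Links found on a listing page: its books plus the next page, if any."""
    first = (page - 1) * BOOKS_PER_PAGE + 1
    links = [("book", n) for n in range(first, first + BOOKS_PER_PAGE)]
    if page < PAGES:
        links.append(("page", page + 1))
    return links


def run_spider(frontier, max_requests=None):
    """Consume pending requests; max_requests simulates an early stop."""
    frontier.add(("page", 1))  # the start URL is offered on every run
    consumed = items = 0
    while frontier.pending and (max_requests is None or consumed < max_requests):
        kind, number = frontier.pending.popleft()
        consumed += 1
        if kind == "book":
            items += 1  # a book page yields one item
        else:
            for link in links_on(number):
                frontier.add(link)
    return items


frontier = ToyFrontier()
first_run = run_spider(frontier, max_requests=84)  # interrupted run
second_run = run_spider(frontier)                  # picks up what is left
third_run = run_spider(frontier)                   # nothing new to do
print(first_run, second_run, third_run)  # prints: 80 920 0
```

The model leaves out batching and concurrency, so its producer counts will not match the real hcf/producer stats, but the item counts follow the same pattern as the runs described above.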

Wrapping up

HCF is a kind of external storage for requests that is available in Scrapy Cloud projects and can be used by Scrapy spiders. There are many use cases for it, and we’ve been through two of them: recovering a crawling process and incremental crawling. In a future article, we’ll explore a bit more how we can configure HCF in our projects and how to use it in a producer-consumer architecture. If you are interested, I invite you to check the Shub Workflow basic tutorial (which has some information similar to this tutorial) and the Frontera docs.

Learn more

If you want to learn more about web data extraction and how it can serve your business you can check out our solutions to see how others are making use of web data. Also, if you’re considering outsourcing web scraping, you can watch our on-demand webinar to help you decide between in-house vs outsourced web data extraction.
