Frontera: The Brain Behind the Crawls

At Scrapinghub we’re always building and running large crawls; last year alone, 11 billion requests were made on Scrapy Cloud. Crawling millions of pages from the internet requires more sophistication than scraping a few contacts off a list: we need reliable data, up-to-date lists of item pages, and the ability to optimise our crawl as much as possible.

From these complex projects emerge technologies that can be used across all of our spiders, and we’re very pleased to release Frontera, a flexible frontier for web crawlers.

Frontera, formerly Crawl Frontier, is an open source framework we developed to facilitate building a crawl frontier, helping manage our crawling logic and sharing it between spiders in our Scrapy projects.

What is a crawl frontier?

A crawl frontier is the system in charge of the logic and policies to follow when crawling websites, and plays a key role in more sophisticated crawling systems. It allows us to set rules about what pages should be crawled next, visiting priorities and ordering, how often pages are revisited, and any behaviour we may want to build into the crawl.

While Frontera was originally designed for use with Scrapy, it’s completely agnostic and can be used with any other crawling framework or standalone project.

In this post we’re going to demonstrate how Frontera can improve the way you crawl using Scrapy. We’ll show you how you can use Scrapy to scrape articles from Hacker News while using Frontera to ensure the same articles aren’t visited again in subsequent crawls.

The frontier needs to be initialised with a set of starting URLs (seeds), after which the crawler asks the frontier which pages it should visit next. As the crawler visits pages, it reports back to the frontier with each page’s response and any extracted URLs.

The frontier decides how to use this information according to the defined logic. The process continues until an end condition is reached. Some crawlers never stop; we refer to these as continuous crawls.
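This request/response loop can be sketched with a toy in-memory frontier. The class and method names below are hypothetical, chosen just to illustrate the idea; Frontera’s real API is documented separately:

```python
from collections import deque

class ToyFrontier:
    """Illustrative only: a minimal FIFO frontier whose policy is
    'crawl each URL at most once'. Not Frontera's actual API."""

    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.seen = set(seeds)

    def get_next(self):
        # The crawler asks the frontier which page to visit next.
        return self.queue.popleft() if self.queue else None

    def page_crawled(self, url, extracted_links):
        # The crawler reports back; the frontier decides what to do
        # with newly discovered links according to its policy.
        for link in extracted_links:
            if link not in self.seen:
                self.seen.add(link)
                self.queue.append(link)

frontier = ToyFrontier(['https://news.ycombinator.com/'])
while (url := frontier.get_next()) is not None:
    # A real crawler would fetch the page here; we fake one discovery.
    links = ['https://news.ycombinator.com/news?p=2'] if 'p=' not in url else []
    frontier.page_crawled(url, links)
```

The loop ends once the queue is empty, i.e. the end condition is reached; a continuous crawl would simply keep feeding the frontier new links.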

Creating a Spider for HackerNews

Hopefully you’re now familiar with what Frontera does. If not, take a look at this textbook’s section for more theory on how a crawl frontier works.

You can check out the project we’ll be developing in this example from GitHub.

Let’s start by creating a new project and spider:

scrapy startproject hn_scraper
cd hn_scraper
scrapy genspider HackerNews news.ycombinator.com

You should have a directory structure similar to the following:

hn_scraper
hn_scraper/hn_scraper
hn_scraper/hn_scraper/__init__.py
hn_scraper/hn_scraper/__init__.pyc
hn_scraper/hn_scraper/items.py
hn_scraper/hn_scraper/pipelines.py
hn_scraper/hn_scraper/settings.py
hn_scraper/hn_scraper/settings.pyc
hn_scraper/hn_scraper/spiders
hn_scraper/hn_scraper/spiders/__init__.py
hn_scraper/hn_scraper/spiders/__init__.pyc
hn_scraper/hn_scraper/spiders/HackerNews.py
hn_scraper/scrapy.cfg

Due to the way the spider template is set up, your start_urls in spiders/HackerNews.py will look like this:

start_urls = (
    'http://www.news.ycombinator.com/',
)

So you will want to correct it like so:

start_urls = (
    'https://news.ycombinator.com/',
)

We also need to create an item definition for the article we’re scraping:

items.py
import scrapy

class HnArticleItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    id = scrapy.Field()

Here the url field will refer to the outbound URL, the title to the article’s title, the author to the submitter’s username, and the id to HN’s item ID.

We then need to define a link extractor so Scrapy will know which links to follow and extract data from.

Hacker News doesn’t use CSS classes for each item row, and to complicate matters the article’s item URL, author, and comment count sit in a separate row from the article title and outbound URL. We’ll need to use XPath in this case.

First let’s gather all of the rows containing a title and outbound URL. If you inspect the DOM, you will notice these rows contain 3 cells, whereas the subtext rows contain 2 cells. So we can use something like the following:

selector = Selector(response)

rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]' \
                          '//table[count(tr) > 1]//tr[count(td) = 3]')

We then iterate over each row, retrieving the article URL and title, and we also need to retrieve the item URL and author from the subtext row, which we can find using the following-sibling axis. You should create a method similar to the following:

def parse_item(self, response):
    selector = Selector(response)

    rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]' \
                              '//table[count(tr) > 1]//tr[count(td) = 3]')
    for row in rows:
        item = HnArticleItem()

        article = row.xpath('td[@class="title" and count(a) = 1]//a')
        article_url = self.extract_one(article, './@href', '')
        article_title = self.extract_one(article, './text()', '')
        item['url'] = article_url
        item['title'] = article_title

        subtext = row.xpath(
            './following-sibling::tr[1]//td[@class="subtext" and count(a) = 3]')
        if subtext:
            item_author = self.extract_one(subtext, './/a[1]/@href', '')
            item_id = self.extract_one(subtext, './/a[2]/@href', '')
            item['author'] = item_author[8:]
            item['id'] = int(item_id[8:])

        yield item

The extract_one method is a helper function to extract the first result:

def extract_one(self, selector, xpath, default=None):
    extracted = selector.xpath(xpath).extract()
    if extracted:
        return extracted[0]
    return default
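The `[8:]` slices in parse_item work because HN’s subtext links are relative and start with fixed 8-character prefixes. A quick check (the example hrefs below are illustrative):

```python
# HN subtext links look like "user?id=<username>" and "item?id=<number>".
# Both prefixes, "user?id=" and "item?id=", are exactly 8 characters long,
# so slicing off the first 8 characters leaves just the value.
author_href = 'user?id=pg'
item_href = 'item?id=9365102'

author = author_href[8:]      # the submitter's username
item_id = int(item_href[8:])  # HN's numeric item ID
```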

There’s currently a bug in Frontera’s SQLAlchemy middleware where callbacks aren’t called, so for now we need to inherit from Spider, override the parse method, and have it call our parse_item function. Here’s what the spider should look like:

spiders/HackerNews.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from hn_scraper.items import HnArticleItem


class HackernewsSpider(Spider):
    name = "HackerNews"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ('https://news.ycombinator.com/', )

    link_extractor = SgmlLinkExtractor(
        allow=('news', ),
        restrict_xpaths=('//a[text()="More"]', ))

    def extract_one(self, selector, xpath, default=None):
        extracted = selector.xpath(xpath).extract()
        if extracted:
            return extracted[0]
        return default

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            request = Request(url=link.url)
            request.meta.update(link_text=link.text)
            yield request

        for item in self.parse_item(response):
            yield item

    def parse_item(self, response):
        selector = Selector(response)

        rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]' \
                              '//table[count(tr) > 1]//tr[count(td) = 3]')
        for row in rows:
            item = HnArticleItem()

            article = row.xpath('td[@class="title" and count(a) = 1]//a')
            article_url = self.extract_one(article, './@href', '')
            article_title = self.extract_one(article, './text()', '')
            item['url'] = article_url
            item['title'] = article_title

            subtext = row.xpath(
                './following-sibling::tr[1]//td[@class="subtext" and count(a) = 3]')
            if subtext:
                item_author = self.extract_one(subtext, './/a[1]/@href', '')
                item_id = self.extract_one(subtext, './/a[2]/@href', '')
                item['author'] = item_author[8:]
                item['id'] = int(item_id[8:])

            yield item

Enabling Frontera in Our Project

Now all we need to do is configure the Scrapy project to use Frontera with the SQLAlchemy middleware. First, install Frontera:

pip install frontera

Then enable Frontera’s middlewares and scheduler by adding the following to settings.py:

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
}
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
FRONTERA_SETTINGS = 'hn_scraper.frontera_settings'

Next create a file named frontera_settings.py, as specified above in FRONTERA_SETTINGS, to store any settings related to the frontier:

BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///hn_frontier.db'
MAX_REQUESTS = 2000
MAX_NEXT_REQUESTS = 10
DELAY_ON_EMPTY = 0.0

Here we specify hn_frontier.db as the SQLite database file, which is where Frontera will store pages it has crawled.

Running the Spider

Let’s run the spider:

scrapy crawl HackerNews -o results.csv -t csv

You can review the items being scraped in results.csv while the spider is running.

You will notice the hn_frontier.db file we specified earlier will be created. You can browse it using the sqlite3 command line tool:

sqlite> attach "hn_frontier.db" as hns;
sqlite> .tables
hns.pages
sqlite> select * from hns.pages;
https://news.ycombinator.com/|f1f3bd09de659fc955d2db1e439e3200802c4645|0|20150413231805460038|200|CRAWLED|
https://news.ycombinator.com/news?p=2|e273a7bbcf16fdcdb74191eb0e6bddf984be6487|1|20150413231809316300|200|CRAWLED|
https://news.ycombinator.com/news?p=3|f804e8cd8ff236bb0777220fb241fcbad6bf0145|2|20150413231810321708|200|CRAWLED|
https://news.ycombinator.com/news?p=4|5dfeb8168e126c5b497dfa48032760ad30189454|3|20150413231811333822|200|CRAWLED|
https://news.ycombinator.com/news?p=5|2ea8685c1863fca3075c4f5d451aa286f4af4261|4|20150413231812425024|200|CRAWLED|
https://news.ycombinator.com/news?p=6|b7ca907cc8b5d1f783325d99bc3a8d5ae7dcec58|5|20150413231813312731|200|CRAWLED|
https://news.ycombinator.com/news?p=7|81f45c4153cc8f2a291157b10bdce682563362f1|6|20150413231814324002|200|CRAWLED|
https://news.ycombinator.com/news?p=8|5fbe397d005c2f79829169f2ec7858b2a7d0097d|7|20150413231815443002|200|CRAWLED|
https://news.ycombinator.com/news?p=9|14ee3557a2920b62be3fd521893241c43864c728|8|20150413231816426616|200|CRAWLED|

As shown above, the database has a single table, pages, which stores each URL along with its fingerprint, timestamp, and response code. This schema is specific to the SQLAlchemy backend; different backends may use different schemas, and some don’t persist crawled pages at all.
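If you prefer Python to the sqlite3 shell, the pages table can be read with the standard library. The column names below are inferred from the dump above, so treat them as assumptions; the snippet uses an in-memory stand-in seeded with one row so it is self-contained, but in a real run you would connect to hn_frontier.db instead:

```python
import sqlite3

# In a real run: conn = sqlite3.connect('hn_frontier.db')
conn = sqlite3.connect(':memory:')

# Stand-in table mirroring the dump's apparent layout
# (url, fingerprint, depth, timestamp, status code, state).
conn.execute("""CREATE TABLE pages (
    url TEXT, fingerprint TEXT, depth INTEGER,
    created_at TEXT, status_code INTEGER, state TEXT)""")
conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?, ?, ?)",
             ('https://news.ycombinator.com/', 'f1f3bd09de659fc9', 0,
              '20150413231805460038', 200, 'CRAWLED'))

# Pull back every page the frontier has marked as crawled.
crawled = conn.execute(
    "SELECT url FROM pages WHERE state = 'CRAWLED'").fetchall()
```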

Frontera backends aren’t limited to storing crawled pages. They’re the core component of Frontera and hold all the crawl frontier logic you wish to use, so the backend you choose depends heavily on what you want to achieve with Frontera.

In many cases you will want to create your own backend. This is a lot easier than it sounds, and you can find all the information you need in the documentation.
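As a rough sketch of what a backend boils down to, here is a toy depth-first variant: a stack instead of the FIFO queue we configured above. The method names are modelled loosely on Frontera’s backend interface, but this is an illustration, not a drop-in Frontera backend; consult the documentation for the real base class and signatures:

```python
class LifoMemoryBackend(object):
    """Illustrative only: a LIFO (depth-first) policy in memory.
    Newly discovered links are crawled before older ones."""

    def __init__(self):
        self.stack = []
        self.seen = set()

    def add_seeds(self, seeds):
        for seed in seeds:
            if seed not in self.seen:
                self.seen.add(seed)
                self.stack.append(seed)

    def page_crawled(self, url, links):
        # New links go on top of the stack, so the crawl
        # goes deep before it goes wide.
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.stack.append(link)

    def get_next_requests(self, max_n):
        # Hand back up to max_n URLs, most recently discovered first.
        batch = self.stack[-max_n:][::-1]
        self.stack = self.stack[:-max_n]
        return batch

backend = LifoMemoryBackend()
backend.add_seeds(['https://example.com/'])
backend.page_crawled('https://example.com/',
                     ['https://example.com/a', 'https://example.com/b'])
batch = backend.get_next_requests(2)
```

Swapping the stack for a priority queue keyed on, say, page depth or revisit interval gives you the kind of custom ordering policy the documentation walks through.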

Hopefully this tutorial has given you a good insight into Frontera and how you can use it to improve the way you manage your crawling logic. Feel free to check out the code and docs. If you run into a problem, please report it at the issue tracker.
