
How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.

In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.


What Makes a Crawler Polite?

  • A polite crawler respects robots.txt
  • A polite crawler never degrades a website’s performance
  • A polite crawler identifies its creator with contact information
  • A polite crawler is not a pain in the buttocks of system administrators

robots.txt

Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard. Some websites even use the crawlers’ user agent to specify separate rules for different web crawlers:

User-agent: Some_Annoying_Bot
Disallow: /

User-Agent: *
Disallow: /*.json
Disallow: /api
Disallow: /post
Disallow: /submit
Allow: /

Crawl-Delay

Making sure your crawler doesn’t hit a website too hard is mission critical to keeping it polite. Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.

When a website gets more requests than its web server can handle, it might become unresponsive. Don’t be the person who causes a headache for the website administrators.

User-Agent

If your crawler somehow breaks the cardinal rules above (or achieves aggressive sentience), there needs to be a way for website owners to contact you. You can do this by including your company name and an email address or website in the request’s User-Agent header. For example, Google’s crawler user agent is “Googlebot”.

Scrapinghub Abuse Report Form

Hey folks using our Scrapy Cloud platform! We trust you will crawl responsibly, but to support website administrators, we provide an abuse report form where they can report any misbehaviour from crawlers running on our platform. We’ll kindly pass the message along so that you can modify your crawls and avoid ruining a sysadmin’s day. If your crawlers are turning into Skynet and running roughshod over human law, we reserve the right to halt their crawling activities and thus avert the robot apocalypse.

How to be Polite using Scrapy

Scrapy is a bit like Optimus Prime: friendly, fast, and capable of getting the job done no matter what. However, much like Optimus Prime and his fellow Autobots, Scrapy occasionally needs to be kept in check. So here’s the nitty gritty for ensuring that Scrapy is as polite as can be.


Robots.txt

Crawlers created using Scrapy 1.1+ already respect robots.txt by default. If your crawlers have been generated using a previous version of Scrapy, you can enable this feature by adding this in the project’s settings.py:

ROBOTSTXT_OBEY = True

Then, every time your crawler tries to download a page from a disallowed URL, you’ll see a message like this:

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

Identifying your Crawler

It’s important to provide a way for sysadmins to easily contact you if they have any trouble with your crawler. If you don’t, they’ll have to dig into their logs and look for the offending IPs.

Be nice to the friendly sysadmins in your life and identify your crawler via the Scrapy USER_AGENT setting. Share your crawler name, company name and a contact email:

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

Introducing Delays

Scrapy spiders are blazingly fast. They can handle many concurrent requests and they make the most of your bandwidth and computing power. However, with great power comes great responsibility.

To avoid hitting the web servers too frequently, you need to use the DOWNLOAD_DELAY setting in your project (or in your spiders). Scrapy will then introduce a random delay ranging from 0.5 * DOWNLOAD_DELAY to 1.5 * DOWNLOAD_DELAY seconds between consecutive requests to the same domain. If you want to stick to the exact DOWNLOAD_DELAY that you defined, you have to disable RANDOMIZE_DOWNLOAD_DELAY.

By default, DOWNLOAD_DELAY is set to 0. To introduce a 5 second delay between requests from your crawler, add this to your settings.py:

DOWNLOAD_DELAY = 5.0
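
If you would rather wait the exact delay instead of the randomized one, a minimal settings.py sketch (the values are illustrative) could look like this:

DOWNLOAD_DELAY = 5.0               # base delay between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = False   # wait exactly DOWNLOAD_DELAY instead of 0.5x-1.5x of it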

If you have a multi-spider project crawling multiple sites, you can define a different delay for each spider with the download_delay (yes, it’s lowercase) spider attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'
    download_delay = 5.0
    ...

Concurrent Requests Per Domain

Another setting you might want to tweak to make your spider more polite is the number of concurrent requests it will do for each domain. By default, Scrapy will dispatch at most 8 requests simultaneously to any given domain, but you can change this value by updating the CONCURRENT_REQUESTS_PER_DOMAIN setting.

Heads up: the CONCURRENT_REQUESTS setting defines the maximum number of simultaneous requests that Scrapy’s downloader will make across all your spiders. When you’re crawling multiple domains at the same time, tweaking this setting is more about your own server’s performance and bandwidth than your target’s.
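
For instance, a settings.py sketch combining both settings might look like this (the values are purely illustrative, not recommendations):

CONCURRENT_REQUESTS_PER_DOMAIN = 2   # at most 2 simultaneous requests per domain (default: 8)
CONCURRENT_REQUESTS = 16             # global cap for the downloader across all domains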

AutoThrottle to Save the Day

Websites vary drastically in the number of requests they can handle. Adjusting this manually for every website that you are crawling is about as much fun as watching paint dry. To save your sanity, Scrapy provides an extension called AutoThrottle.

AutoThrottle automatically adjusts the delays between requests according to the current web server load. It first measures the latency of a single request, then adjusts the delay between requests to the same domain so that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests are simultaneously active. It also ensures that requests are evenly distributed over a given timespan.

To enable AutoThrottle, just include this in your project’s settings.py:

AUTOTHROTTLE_ENABLED = True

Scrapy Cloud users don’t have to worry about enabling it, because it’s already enabled by default.

There’s a wide range of settings to help you tweak the throttle mechanism, so have fun playing around!
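
As a starting point, here’s an illustrative sketch of the main AutoThrottle settings in settings.py (the values are examples, not recommendations):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, before any latency is measured
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay in case of high latencies
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per remote server
AUTOTHROTTLE_DEBUG = True              # log every throttling decision while you tune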

Use an HTTP Cache for Development

Developing a web crawler is an iterative process. However, running a crawler to check if it’s working means hitting the server multiple times for each test. To help you to avoid this impolite activity, Scrapy provides a built-in middleware called HttpCacheMiddleware. You can enable it by including this in your project’s settings.py:

HTTPCACHE_ENABLED = True

Once enabled, it caches every request made by your spider along with the related response. The next time you run your spider, it won’t hit the server for pages it has already fetched. It’s a win-win: your tests will run much faster and the website will save resources.
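
If you want to tweak the cache behaviour a bit further, here’s a small settings.py sketch (the values are illustrative):

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600               # consider cached responses stale after an hour (0 = never expire)
HTTPCACHE_DIR = 'httpcache'                    # stored inside the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors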

Don’t Crawl, use the API

Many websites provide HTTP APIs so that third parties can consume their data without having to crawl their web pages. Before building a web scraper, check if the target website already provides an HTTP API that you can use. If it does, go with the API. Again, it’s a win-win: you avoid digging into the page’s HTML and your crawler gets more robust because it doesn’t need to depend on the website’s layout.

Wrap Up

Let’s all do our part to keep the peace between sysadmins, website owners, and developers by making sure that our web crawling projects are as noninvasive as possible. Remember, we need to band together to delay the rise of our robot overlords, so let’s keep our crawlers, spiders, and bots polite.


To all website owners, help a crawler out and ensure your site has an HTTP API. And remember, if someone using our platform is overstepping their bounds, please fill out an Abuse Report form and we’ll take care of the issue.

For those new to our platform, Scrapy Cloud is forever free and is the peanut butter to Scrapy’s jelly. For our existing Scrapy and Scrapy Cloud users, hopefully you learned a few tips for how to both speed up your crawls and prevent abuse complaints. Let us know if you have any further suggestions in the comment section below!


Introducing Scrapy Cloud with Python 3 Support

It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3.


If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping projects. It pairs with Scrapy, the open source web scraping framework, and Portia, our open source visual web scraper.

Scrapy + Scrapy Cloud with Python 3

I’m sure you Scrapy users are breathing a huge sigh of relief! Scrapy has had official Python 3 support since May, and now you can deploy spiders that use the fancy new Python 3 features to Scrapy Cloud. You’ll have the beloved extended tuple unpacking, function annotations, keyword-only arguments and much more at your fingertips.
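
As a small, hypothetical illustration (the spider and selectors below are made up for the example), here’s what those Python 3 features can look like inside a spider:

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # extended tuple unpacking: separate the first product from the rest
        first, *rest = response.css('article.product_pod')
        yield self.build_item(first, featured=True)
        for product in rest:
            yield self.build_item(product)

    def build_item(self, product, *, featured: bool = False) -> dict:
        # keyword-only argument and function annotations, both Python 3 only
        return {
            'title': product.css('h3 a::attr(title)').extract_first(),
            'featured': featured,
        }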

Fear not if you are a Python 2 developer and can’t port your spiders’ codebase to Python 3, because Scrapy Cloud will continue supporting Python 2. In fact, Python 2 remains the default unless you explicitly set your environment to Python 3.

Deploying your Python 3 Spiders

Docker support was one of the new features that came along with the Scrapy Cloud 2.0 release in May. It brings more flexibility to your spiders, allowing you to define in which kind of runtime environment (AKA stack) they will be executed.

This configuration is done in your local project’s scrapinghub.yml. There, include a stacks section with scrapy:1.1-py3 as the stack for your Scrapy Cloud project:

projects:
    default: 99999
stacks:
    default: scrapy:1.1-py3

After doing that, you just have to deploy your project using shub:

$ shub deploy

Note: make sure you are using shub 2.3+ by upgrading it:

$ pip install shub --upgrade

And you’re all done! The next time you run your spiders on Scrapy Cloud, they will run on Scrapy 1.1 + Python 3.

Multi-target Deployment File

If you have a multi-target deployment file, you can define a separate stack for each project ID:

projects:
    default:
        id: 55555
        stack: scrapy:1.1
    py3:
        id: 99999
        stack: scrapy:1.1-py3

This allows you to deploy your local project to whichever Scrapy Cloud project you want, using a different stack for each one:

$ shub deploy py3

This deploys your crawler to project 99999 and uses Scrapy 1.1 + Python 3 as the execution environment.

You can find different versions of the Scrapy stack here.

Wrap Up

We hope that you’re as excited as we are for this newest upgrade to Python 3. If you have further questions or are interested in learning more about the souped up Scrapy Cloud, take a look at our Knowledge Base article.

For those new to our platform, Scrapy Cloud has a forever free subscription, so sign up and give us a try.


What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner.


Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair according to character traits and real world web data uses, so hopefully this isn’t too much of a stretch.

This should be spoiler free, with nothing revealed that you can’t get from the trailers! Fair warning, you’re going to have Ballroom Blitz stuck in your head all day. And if you haven’t seen Suicide Squad yet, hopefully we get you pumped up for this popcorn movie.

Market Research and Predictions: Deadshot

Deadshot’s claim to fame is accurate aim. He can predict bullet trajectories and he never misses a shot. So I paired him with using web data for market research and trend prediction. You can scrape multiple websites for price fluctuation, new products, reviews, and consumer trends. This is an automated process that allows you to quickly and accurately analyze data without needing to manually monitor websites.

Social Media Monitoring: Harley Quinn

Harley Quinn has a sunny personality that remains chipper even when faced with death, destruction, torture, and mayhem. She also always has a witty comeback no matter the situation. These traits go hand-in-hand with how brands should approach social media channels. Extracting web data from social media interactions helps you understand consumer opinions. You can monitor ongoing chatter about your company or your competition and respond in the most positive way possible.

Lead Generation and HR Recruitment: Amanda Waller

This is probably the most obvious pairing since Amanda Waller (played by the wonderful Viola Davis) is the one responsible for assembling the Suicide Squad. She carefully researched and compiled intimate details on all the criminals-turned-reluctant-heroes. This is an aspect of web data that benefits all sales, marketing, recruitment, and HR. With a pre-vetted pool, you’ll have access to qualified leads and decision-makers without needing to wade through the worst of the worst.

Tracking Criminal Activity in the Dark Web: Killer Croc

This sewer-dwelling villain thrives in dark and hidden spaces. He’s used to working underground and in places most people don’t even know exist. This makes Killer Croc the perfect backdrop for the type of web data located in the deep/dark web. The dark web is the part of the internet that is not indexed by search engines (Google, Bing, etc.) and is often a haven for criminal activity. Data scraped from this part of the web is commonly used by law enforcement agencies.

Competitive Pricing: Captain Boomerang

This jewelry thief goes around the world stealing from banks and committing acts of burglary – with a boomerang… Captain Boomerang knows all about pricing and the comparative value of products so he can get the largest bang for his buck. Similarly, web data is a great resource for new companies looking to research their industry and how their prices match up to the competition. And if you are an established company, this is a great way for you to keep track of newcomers and potential market disruptors.

Machine Learning Models: Enchantress

In her 6313 years of existence, the Enchantress has had to cope with changing times, customs, and civilizations. The ability to learn quickly and adapt to new situations is definitely an important part of her continued survival. Likewise, machine learning is a form of artificial intelligence that can learn when given new information. Train your machine learning models using datasets for conducting sentiment analysis, making predictions, and even automating web scraping. Whether you are a SaaS company specializing in developing machine learning technology or someone who needs machine learning analysis, you need to ensure you have up-to-date datasets.

Monitoring Resellers: Colonel Rick Flag

Colonel Rick Flag is a “good guy” whose job is to keep track of the Suicide Squad and kill them if they get out of line. Now obviously your relationship with resellers is not a life-and-death situation, but it can be good to know how your brand is being represented across the internet. Web scraping can help you keep track of reseller customer reviews and any contract violations that might be occurring.

Monitoring Legal Matters and Government Corruption: Katana

Katana the samurai is the enforcer of the Suicide Squad. She is there as an additional check to keep the criminal members in line. Similarly, web data allows reporters, lawyers, and concerned citizens to keep track of government officials, potential corruption charges, and changing legal matters. You can scrape obscure or poorly presented public records and then use that information to create accessible interfaces for easy reference and research.

Web Scraping for Fun: the Joker

I believe the Joker needs no introduction, whether you know this character from Jack Nicholson, Heath Ledger, or the new Jared Leto incarnation. He is unpredictable, has eclectic tastes, and is capable of doing anything. And honestly, this is what web scraping is all about. Whether you want to build a bike sharing app or monitor government corruption, web data provides the backbone for all of your creative endeavors.

Wrap Up

I hope you enjoyed this unorthodox tour of the world of web data! If you’re looking for some mindless fun, Suicide Squad ain’t half bad (it ain’t half good either). If you’re looking to explore how web data fits within your business or personal projects, feel free to reach out to us. And if you’re looking to hate on or defend Suicide Squad, comment below.

P.S. There is no way this movie is worse than Batman v Superman: Dawn of Justice

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub.


Scrapy

This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars on GitHub, making it the 10th most starred Python project there! We are very proud of this and want to thank all our users, stargazers and contributors!

What’s coming in Scrapy 1.2 (in a couple of weeks)?

  • The ability to specify the encoding of items in your JSON, CSV or XML output files
  • Creating Scrapy projects in any folder you want (not only the current one)

Scrapy Plugins

We’re moving various Scrapy middleware and helpers to their own repository under scrapy-plugins home on GitHub. They are all available on PyPI.
Many of these were previously found wrapped inside scrapylib (which will not see a new release).

Here are some of the newly released ones:

  • scrapy-querycleaner: used for cleaning up query parameters in URLs; helpful for when some of them are not relevant (you get the same page with or without them), thus avoiding duplicate page fetches.
  • scrapy-magicfields: automagically add special fields in your scraped items such as timestamps, response attributes, etc.

Libraries

Dateparser

In mid-June we released version 0.4 of Dateparser with quite a few parsing improvements and new features (as well as several bug fixes). For example, this version introduces its own parser, replacing dateutil’s. However, we may converge back at some point in the future.

It also handles relative dates in the future e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with one of its active forks, ruamel.yaml. We hope you enjoy it!

Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day😉

w3lib

w3lib v1.15 now has a canonicalize_url(), extracted from Scrapy helpers. You may find it handy when walking in the jungle of non-ASCII URLs in Python 3!
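
For example, here’s a quick interactive sketch of what canonicalization does (the URL is made up); query arguments get sorted and the URL is normalized:

>>> from w3lib.url import canonicalize_url
>>> canonicalize_url('http://www.example.com/do?b=2&a=1')
'http://www.example.com/do?a=1&b=2'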

Wrap Up

And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food since Scrapy is our go-to workhorse on a daily basis. However, there are certain situations where Scrapy can be overkill and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors. It powers the scraping API of the Scrapy framework.


Not to be confused with Parseltongue/Parselmouth

We extracted Parsel from Scrapy during Europython 2015 as a part of porting Scrapy to Python 3. As a library, it’s lighter than Scrapy (it relies on lxml and cssselect) and also more flexible, allowing you to use it within any Python program.


Using Parsel

Install Parsel using pip:

pip install parsel

And here’s how you use it. Say you have this HTML snippet in a variable:

>>> html = u'''
<ul>
    <li><a href="http://blog.scrapinghub.com">Blog</a></li>
    <li><a href="http://www.scrapinghub.com">Scrapinghub</a></li>
    <li class="external"><a href="http://www.scrapy.org">Scrapy</a></li>
</ul>
'''

You then import the Parsel library, load the HTML into a Parsel Selector and extract the links with an XPath expression:

>>> import parsel
>>> sel = parsel.Selector(html)
>>> sel.xpath("//a/@href").extract()
[u'http://blog.scrapinghub.com', u'http://www.scrapinghub.com', u'http://www.scrapy.org']

Note: Parsel works in both Python 3 and Python 2. If you’re using Python 2, remember to pass the HTML as a unicode object.

Sweet Parsel Features

One of the nicest features of Parsel is the ability to chain selectors. This allows you to chain CSS and XPath selectors however you wish, such as in this example:

>>> sel.css('li.external').xpath('./a/@href').extract()
[u'http://www.scrapy.org']

You can also iterate through the results of the .css() and .xpath() methods since each element will be another selector:

>>> for li in sel.css('ul li'):
...     print(li.xpath('./a/@href').extract_first())
...
http://blog.scrapinghub.com
http://www.scrapinghub.com
http://www.scrapy.org

You can find more examples of this in the documentation.

When to use Parsel

The beauty of Parsel is in its wide applicability. It is useful for a range of situations including:

  • Processing XML/HTML data in an IPython notebook
  • Writing end-to-end tests for your website or app
  • Simple web scraping projects with the Python Requests library (see the sketch after this list)
  • Simple automation tasks at the command-line
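
For example, here’s a minimal sketch of the Requests use case from the list above (the URL and CSS selector are illustrative):

import parsel
import requests

response = requests.get('http://books.toscrape.com/')
sel = parsel.Selector(text=response.text)
# grab the titles of the books listed on the first page
titles = sel.css('article.product_pod h3 a::attr(title)').extract()
print(titles[:5])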

And now, you can also run Parsel with the command-line tool for simple extraction tasks in your terminal. This new development is thanks to our very own Rolando who created parsel-cli.

Install parsel-cli with pip install parsel-cli and play around using the examples below (you need to have curl installed).

The following command will download and extract the list of Academy Award-winning films from Wikipedia:

curl -s https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films |\
    parsel-cli 'table.wikitable tr td i a::text'

You can also get the current top 5 news items from Hacker News using:

curl -s https://news.ycombinator.com |\
    parsel-cli 'a.storylink::attr(href)' | head -n 5

And how about obtaining a list of the latest YouTube videos from a specific channel?

curl -s https://www.youtube.com/user/crashcourse/videos |\
    parsel-cli 'h3 a::attr(href), h3 a::text' |\
    paste -s -d' \n' - | sed 's|^|http://youtube.com|'

Wrap Up

I hope that you enjoyed this little tour of Parsel, and I’m looking forward to seeing how these examples spark your imagination when you’re looking for solutions to your HTML parsing needs.

The next time you find yourself wanting to extract data from HTML/XML and don’t need Scrapy and its crawling capabilities, you know what to do: just Parsel it!

Feel free to reach out to us on Twitter and let us know how you use Parsel in your projects.

Incremental Crawls with Scrapy and DeltaFetch

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.


Scrapy is designed to be extensible, with loosely coupled components. You can easily extend Scrapy’s functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch’s main purpose is to avoid requesting pages that have already been scraped, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs in the spider’s start_urls attribute, or to requests generated in the spider’s start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the related request identifier (a.k.a. fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.

Now let’s see how to set up DeltaFetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

Our example project has a spider that crawls books.toscrape.com. It navigates through all the listing pages and visits every book details page to fetch data such as the book title, description and category. The crawler is executed once a day in order to capture new books that are included in the catalogue. There’s no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn’t change.

To see DeltaFetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider made 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 requests didn’t generate items, so the corresponding pages will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you’ll see that 1000 requests were skipped because items had been scraped from those pages in a previous crawl. This time the spider didn’t extract any items and made only 51 requests, all of them to listing pages from which no items had been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed based on the canonical URL, HTTP method and request body.

Some websites have several URLs for the same data. For example, an e-commerce site could have several URLs pointing to a single product, perhaps something like these (illustrative URLs, both identifying the product by its id parameter):
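
http://www.example.com/product.html?id=123
http://www.example.com/offers/today?id=123&ref=homepage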

Request fingerprints aren’t suitable in these situations as the canonical URL will differ despite the item being the same. In this example, we could use the product’s ID as the DeltaFetch key.

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from w3lib.url import url_query_parameter
from scrapy import Request

...

def parse(self, response):
    ...
    # use the product's 'id' query parameter as the DeltaFetch key,
    # instead of the default request fingerprint
    for product_url in response.css('a.product_listing::attr(href)').extract():
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.


DeltaFetch is quite handy in situations like the ones we’ve just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if those requests were not generated from the spider’s start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on GitHub for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap-up

You can find many interesting Scrapy plugins in the scrapy-plugins page on GitHub, and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you’d like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web, so that the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, the Peruvian Congress provides a website where people can check the list of bills that are being processed, voted on and eventually become law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated when bills are moved between commissions, approved and then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, there have been many laws passed that were to benefit only one specific company or individual.


However, transparency doesn’t guarantee accessibility. The site is very clunky: the information for each bill is spread across several pages, the bills are displayed in a very long list with far too many pages, and until very recently there was no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time consuming as there are around one thousand bills proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all.


The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.


The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.


Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for national day of peace, “peace consolidation”, “peace and reconciliation”, Peruvian Coffee, Peruvian Cuisine, and also national days for several Peruvian produce.

There was even more than one bill proposing the celebration of the same thing, on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. In the Peruvian legislation, dismissal of the second round of voting for any bill should be carried out only under exceptional circumstances.

However, the numbers show that a single round of voting has become the norm, as 88% of approved bills passed after only one round. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber where all the decisions are made. It’s also expected that members of Congress should use the time between the first and second voting for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having such information in a well-structured, machine-readable format is that we can create cool data visualizations, such as an interactive timeline that shows all the events that happened for a given bill.


Another cool thing is that this data allows us to monitor Congress’ activities. Our web app allows users to subscribe to an RSS feed in order to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when a bill threatens to do more wrong than good.

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making the transparency laws less useful than they should be. The work of volunteers is key in order to build tools that turn the otherwise clunky content into useful data for journalists, lawyers and regular citizens as well. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love!🙂
