Scrapy Tips from the Pros: March 2016 Edition

Welcome to the March Edition of Scrapy Tips from the Pros! Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll cover how to use a cookiejar with the CookiesMiddleware to get around websites that won’t allow you to crawl multiple pages at the same time using the same cookie. We’ll also share a handy tip on how to use multiple fallback XPath/CSS expressions with item loaders to get data from websites more reliably.

**Students reading this, we are participating in Google Summer of Code 2016 and some of our project ideas involve Scrapy! If you’re interested, take a look at our ideas and remember to apply before Friday, March 25!

If you are not a student, please share with your student friends. They could get a summer stipend and we might even hire them at the end.**

Work Around Sites With Weird Session Behavior Using a CookieJar

Websites that store your UI state on their server’s sessions are a pain to navigate, let alone scrape. Have you ever run into websites where one tab affects the other tabs open on the same site? Then you’ve probably run into this issue.

While this is frustrating for humans, it’s even worse for web crawlers, because it can severely hinder a crawling session. Unfortunately, this is a common pattern for ASP.NET and J2EE-based websites. And that’s where cookiejars come in. You won’t need a cookiejar often, but you’ll be glad to have it for those unexpected cases.

When your spider crawls a website, Scrapy automatically handles cookies for you, storing them and sending them in subsequent requests to the same site. But, as you may know, Scrapy requests are asynchronous. This means that you probably have multiple requests to the same website being handled concurrently while sharing the same cookie. To keep requests from affecting each other when crawling these types of websites, you must use different cookies for different requests.

You can do this by using a cookiejar to store separate cookies for different pages in the same website. The cookiejar is just a key-value collection of cookies that Scrapy keeps during the crawling session. You just have to define a unique identifier for each of the cookies that you want to store and then use that identifier when you want to use that specific cookie.

For example, say you want to crawl multiple categories on a website, but this website stores the data related to the category that you are crawling/browsing in the server session. To crawl the categories concurrently, you would need to create a cookie for each category by passing the category name as the identifier to the cookiejar meta parameter:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    urls = [
        'http://www.example.com/category/photo',
        'http://www.example.com/category/videogames',
        'http://www.example.com/category/tablets'
    ]

    def start_requests(self):
        for url in self.urls:
            # Use the last URL segment (the category name) as the cookiejar key
            category = url.split('/')[-1]
            yield scrapy.Request(url, meta={'cookiejar': category})

Three different cookies will be managed in this case (‘photo’, ‘videogames’ and ‘tablets’). You can create a new cookie whenever you pass a nonexistent key as the cookiejar meta value (like when a category name hasn’t been visited yet). When the key we pass already exists, Scrapy uses the respective cookie for that request.

So, if you want to reuse the cookie that has been used to crawl the ‘videogames’ page, for example, you just need to pass ‘videogames’ as the unique key to the cookiejar. Instead of creating a new cookie, it will use the existing one:

yield scrapy.Request('http://www.example.com/atari2600', meta={'cookiejar': 'videogames'})
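To make the "new key creates a cookie, existing key reuses it" behavior concrete, here is a minimal sketch that uses only the standard library. It is a simplified model of a keyed collection of cookie jars, not Scrapy’s own code, and the `jar_for` helper is a hypothetical name introduced just for this illustration:

```python
from collections import defaultdict
from http.cookiejar import CookieJar

# Simplified model: one independent cookie jar per key, created on first use.
jars = defaultdict(CookieJar)


def jar_for(meta):
    """Return the jar selected by a request's meta dict (hypothetical helper)."""
    return jars[meta.get('cookiejar')]


photo = jar_for({'cookiejar': 'photo'})             # new key: a new jar is created
games = jar_for({'cookiejar': 'videogames'})        # another new key: another jar
games_again = jar_for({'cookiejar': 'videogames'})  # existing key: same jar reused
default = jar_for({})                               # no key: shared jar stored under None
```

In this model, requests without a `cookiejar` key all share one default jar, which is exactly the situation the tip is meant to avoid. Also keep in mind that, according to the Scrapy documentation, the `cookiejar` meta key is not sticky: follow-up requests issued from a callback need to pass it along again, for example with `meta={'cookiejar': response.meta['cookiejar']}`.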

Adding Fallback CSS/XPath Rules

Item Loaders are useful when you need to accomplish more than simply populating a dictionary or an Item object with the data collected by your spider. For example, you might need to add some post-processing logic to the data that you just collected. This could be anything from capitalizing every word in a title to more complex operations. With an ItemLoader, you can decouple this post-processing logic from the spider in order to have a more maintainable design.
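As a rough illustration of the kind of post-processing meant here, the sketch below writes two cleanup steps as plain functions. The names `clean_title` and `parse_price` are hypothetical, and the price format is assumed to be a dollar amount like the ones used later in this post. In a real loader, functions like these would typically be attached to a field through a processor such as `MapCompose` rather than called by hand:

```python
import string
from decimal import Decimal


def clean_title(raw):
    """Collapse runs of whitespace and capitalize every word (hypothetical helper)."""
    return string.capwords(raw)


def parse_price(raw):
    """Turn text like '\n    $699.99\n' into a Decimal (assumes a '$' prefix)."""
    return Decimal(raw.strip().lstrip('$').replace(',', ''))


title = clean_title('  atari 2600   console ')
price = parse_price('\n    $699.99\n')
```

Keeping these steps as small standalone functions is what makes them easy to move out of the spider and into the loader.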

This tip shows you how to add extra functionality to an Item Loader. Let’s say that you are crawling Amazon.com and extracting the price for each product. You can use an Item Loader to populate a ProductItem object with the product data:

import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()


class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        ...

    def parse_product(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css('price', '#priceblock_ourprice ::text')
        loader.add_css('name', '#productTitle ::text')
        loader.add_value('url', response.url)
        yield loader.load_item()

This method works pretty well, unless the scraped product is a deal. This is because Amazon represents deal prices in a slightly different format than regular prices. While the price of a regular product is represented like this:

<span id="priceblock_ourprice" class="a-size-medium a-color-price">
    $699.99
</span>

The price of a deal is shown slightly differently:

<span id="priceblock_dealprice" class="a-size-medium a-color-price">
    $649.99
</span>

A good way to handle situations like this is to add a fallback rule for the price field in the Item Loader. This is a rule that is applied only if the previous rules for that field have failed. To accomplish this with the Item Loader, you can add an add_fallback_css method:

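from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst  # newer Scrapy releases: itemloaders.processors


class AmazonItemLoader(ItemLoader):
    # Output a single value per field instead of a list
    default_output_processor = TakeFirst()

    def get_collected_values(self, field_name):
        # Look the field up without creating an empty entry as a side effect
        return self._values.get(field_name, [])

    def add_fallback_css(self, field_name, css, *processors, **kw):
        # Apply this rule only if nothing has been collected for the field yet
        if not any(self.get_collected_values(field_name)):
            self.add_css(field_name, css, *processors, **kw)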

As you can see, the add_fallback_css method will use the CSS rule if there are no previously collected values for that field. Now, we can change our spider to use AmazonItemLoader and then add the fallback CSS rule to our loader:

def parse_product(self, response):
    loader = AmazonItemLoader(item=ProductItem(), response=response)
    loader.add_css('price', '#priceblock_ourprice ::text')
    loader.add_fallback_css('price', '#priceblock_dealprice ::text')
    loader.add_css('name', '#productTitle ::text')
    loader.add_value('url', response.url)
    yield loader.load_item()

This tip can save you time and make your spiders much more robust. If one CSS rule fails to get the data, the loader falls back to the next rule that can extract it.
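The fallback logic itself does not depend on Scrapy, so it can be sketched with the standard library alone. The class below is a toy stand-in for a loader’s per-field value store, not Scrapy code; `FallbackValues`, `load_price`, and the third `saleprice` rule are all made up for the illustration. Each argument stands for the list of strings a selector would have returned:

```python
from collections import defaultdict


class FallbackValues:
    """Toy stand-in for a loader's per-field value store (not Scrapy code)."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, field, extracted):
        # Primary rule: always record whatever was extracted
        self._values[field].extend(extracted)

    def add_fallback(self, field, extracted):
        # Fallback rule: record only if nothing truthy was collected so far
        if not any(self._values.get(field, [])):
            self.add(field, extracted)

    def first(self, field):
        # Mimics a take-first output step: first non-None, non-empty value
        for value in self._values.get(field, []):
            if value is not None and value != '':
                return value
        return None


def load_price(ourprice, dealprice, saleprice):
    """Apply one primary rule and two fallbacks, in order (hypothetical helper)."""
    values = FallbackValues()
    values.add('price', ourprice)
    values.add_fallback('price', dealprice)
    values.add_fallback('price', saleprice)  # 'saleprice' is a made-up third rule
    return values.first('price')


regular = load_price(['$699.99'], [], [])        # primary rule matches
deal = load_price([], ['$649.99'], ['$629.99'])  # first fallback wins, second is skipped
missing = load_price([], [], [])                 # no rule matched anything
```

Because the check uses `any()`, a rule that only produced empty strings still lets the next fallback run, while a whitespace-only string counts as collected and blocks it. If that matters for your pages, strip the extracted text before it reaches the check.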

If Item Loaders are new to you, check out the documentation.

Wrap Up

And there you have it! Please share any and all problems that you’ve run into while web scraping and extracting data. We’re always on the lookout for new tips and hacks to share in our Scrapy Tips from the Pros monthly column. Hit us up on Twitter or Facebook and let us know if we’ve helped your workflow.

And if you haven’t yet, give Portia, our open source visual web scraping tool, a try. We know you’re attached to Scrapy, but it never hurts to experiment with your stack 😉

Please apply to join us for Google Summer of Code 2016 by Friday, March 25!

2 thoughts on “Scrapy Tips from the Pros: March 2016 Edition”

  1. For sites with this weird session behaviour I’ve also found it’s useful to consider the duplicate filter too – if they’re using cookies to navigate they will most often share the same url for each subsequent page, so one can easily run into trouble here.
