Author: Valdir Stumm Jr

Deploy your Scrapy Spiders from GitHub


Up until now, your deployment process using Scrapy Cloud has probably been something like this: code and test your spiders locally, commit and push your changes to a GitHub repository, and finally deploy them to Scrapy Cloud using shub deploy. However, keeping development and deployment as separate manual steps can lead to problems such as unversioned or outdated code running in production.

The good news is that, from now on, you can have your code automatically deployed to Scrapy Cloud whenever you push changes to a GitHub repository. All you have to do is connect your Scrapy Cloud project with a repository branch and voilà!

Scrapy Cloud’s new GitHub integration will help you ensure that your code repository and your deployment environments are always in sync, getting rid of the error-prone manual deployment process and also speeding up the development cycle.

Check out how to set up automatic deploys in your projects:

If you are not that into videos, have a look at this guide.

Improving your workflow with the GitHub integration

You could use this feature to set up a multi-stage deploy workflow integrated with your repository. Let’s say you have a repo called foobar-crawler, with three main branches — development, staging and master — and you need one deployment environment for each one.

You create one Scrapy Cloud project for each branch:

  • foobar-dev
  • foobar-staging
  • foobar

And connect each of these projects with a specific branch from your foobar-crawler repository, as shown below for the development one:

Then, every time you push changes to one of these branches, the code is automatically deployed to the proper environment.

Wrapping up

If you have any feedback regarding this feature or the whole platform, leave us a comment.

Start deploying your Scrapy spiders from GitHub now.

Sign up for free

How to Build your own Price Monitoring Tool


Computers are great at repetitive tasks. They don't get distracted, bored, or tired. Automation is how you should approach the tedious tasks that are essential to running a successful business, as well as the mundane responsibilities that come with it. Price monitoring, for example, is a practice that every company should be doing, and it's a task that readily lends itself to automation.

In this tutorial, I’ll walk you through how to create your very own price monitoring tool from scratch. While I’m approaching this as a careful shopper who wants to make sure I’m getting the best price for a specific product, you could develop a similar tool to monitor your competitors using similar methods.

Why you should be monitoring competitor prices

Price monitoring is basically knowing how your competitors price their products, how your prices fit within your industry, and whether there are any fluctuations that you can take advantage of.

When it comes to mission critical tasks like price monitoring, it’s important to ensure accuracy, obtain up-to-date information, and have the capacity for massive scale. By pricing your products perfectly, you can make sure that your competitors aren’t undercutting you, which makes you more likely to nab customers.

In our article on how web data is used by startups, Max Robinson, owner of Ace Work Gear, shared his thoughts on the importance of price monitoring:

“But it occurred to me that if you aren’t offering competitive prices, then you’re essentially throwing money down the drain. Even if you have good visibility, users will look elsewhere to buy once they’ve seen your prices.”

And that’s part of why automation is so important. You don’t want to miss sudden sales or deals from competitors that might make your offerings less desirable.

Overview

In terms of using price monitoring as a consumer, the key is to be able to take advantage of rapid price drops so you can buy during lightning sales. For this tutorial, I used Scrapy, our open source web scraping framework, and Scrapy Cloud, our fully-featured production environment (there’s a forever free account option). Here is the basic outline of my approach:

  1. Develop web scrapers to periodically collect prices from a list of products and online retailers.
  2. Build a Python script to check whether there are price drops in the most recently scraped data and then send an email alert when there are.
  3. Deploy the project to Scrapy Cloud and schedule periodic jobs to run the spiders and the script every X minutes.

Collecting the Prices

I monitored prices from a couple of online retailers. To scrape the prices, I built one Scrapy spider for each of them. The spiders work by:

  1. Reading a list of product URLs from a JSON file
  2. Scraping the prices for the listed products
  3. Storing the prices in a Scrapy Cloud Collection (an efficient key-value storage)

Here is a sample JSON file with product URLs:

{
    "headsetlogitech": [
        "https://www.retailer1.com/pagefor-logitech-headset",
        "http://www.retailer2.com/pagefor-headset-logitech",
        "http://www.retailer3.com/pagefor-headset-log"
    ],
    "webcamlogitech": [
        "https://www.retailer1.com/pagefor-logitech-webcam",
        "http://www.retailer2.com/pagefor-webcam-logitech",
        "http://www.retailer3.com/pagefor-webcam-log"
    ]
}

If you want to monitor more retailers than the three I implemented, all you need to do is add their URLs to the JSON file and then create the requisite Scrapy spider for each website.

The Spiders

If you are new to the world of Scrapy and web scraping, then I suggest that you check out this tutorial first. When building a spider, you need to pay attention to the layout of each retailer’s product page. For most of these stores, the spider code will be really straightforward, containing only the extraction logic using CSS selectors. In this case, the URLs are read during the spider’s startup.

Here’s an example spider for Best Buy:

class BestbuySpider(BaseSpider):
    # BaseSpider (defined in the project) reads the product URLs from the
    # JSON file and generates the initial requests.
    name = "bestbuy.com"

    def parse(self, response):
        item = response.meta.get('item', {})
        item['url'] = response.url
        item['title'] = response.css(
            'div#sku-title > h1::text'
        ).extract_first().strip()
        item['price'] = float(response.css(
            'div.price-block ::attr(data-customer-price)'
        ).extract_first(default=0))
        yield item

BaseSpider contains the logic to read the URLs from the JSON file and generate requests. In addition to the spiders, I created an item pipeline to store product data in a Scrapy Cloud collection. You can check out the other spiders that I built in the project repository.
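Here is a minimal sketch of what that base class might look like. The actual implementation lives in the project repository, so treat the class body below as an illustration (only the resources/urls.json path is taken from the project layout):

import json

import scrapy


class BaseSpider(scrapy.Spider):
    # Sketch: load the product URLs from the JSON file and generate one
    # request per URL, tagging each request with its product key.
    urls_file = 'resources/urls.json'

    def start_requests(self):
        with open(self.urls_file) as f:
            products = json.load(f)
        for product_name, urls in products.items():
            for url in urls:
                # parse(), defined in each retailer spider, completes the
                # item with the scraped title and price.
                yield scrapy.Request(
                    url,
                    meta={'item': {'product': product_name}},
                    callback=self.parse,
                )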

Building the Price Monitoring Script

Now that the spiders have been built, you should start getting product prices that are then stored in a collection. To monitor price fluctuations, the next step is to build a Python script that will pull data from that collection, check if the most recent prices are the lowest in a given time span, and then send an email alert when it finds a good deal.
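At its core, that check boils down to comparing the most recent price against the best price seen within the chosen time window. Here is a minimal sketch of the comparison logic (the function name and data layout are assumptions for illustration, not the actual script):

def is_price_drop(prices, margin=0.0):
    # prices: (timestamp, price) pairs within the time window,
    # ordered from oldest to newest.
    if len(prices) < 2:
        return False
    latest = prices[-1][1]
    best_previous = min(price for _, price in prices[:-1])
    # Only flag drops larger than the configured margin.
    return latest < best_previous - margin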

Here is my model email notification that is sent out when there’s a price drop:

[Screenshot: the email alert sent when a price drop is detected]

You can find the source code for the price monitor in the project repository. As you might have noticed, there are customizable options via command line arguments. You can:

  • modify the time frame in which the prices are compared to find out whether the latest price is the best of the day, the week, the month, and so forth.
  • set a price margin to ignore insignificant price drops since some retailers have minuscule price fluctuations throughout the day. You probably don’t want to receive an email when the product that you’re interested in drops one cent…
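Those options could be wired up with argparse along these lines (a sketch with hypothetical flag names; check the repository for the real ones):

import argparse

parser = argparse.ArgumentParser(description='Check the collection for price drops.')
parser.add_argument('--days', type=int, default=1,
                    help='time window, in days, used when comparing prices')
parser.add_argument('--threshold', type=float, default=1.0,
                    help='minimum price difference worth an email alert')
args = parser.parse_args()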

Deployment and Execution

Now that you have the spider(s) and the script, you need to deploy both to Scrapy Cloud, our PaaS for web crawlers.

I scheduled my spiders to collect prices every 30 minutes and the script to check this data at 30 minute intervals as well. You can configure this through your Scrapy Cloud dashboard, easily changing the periodicity depending on your needs.

[Screenshot: scheduling periodic jobs in the Scrapy Cloud dashboard]

Check out this video to learn how to deploy Scrapy spiders and this tutorial on how to run a regular Python script on Scrapy Cloud.

How to run this project in your own Scrapy Cloud account:

  • Clone the project:
    • git clone git@github.com:scrapinghub/sample-projects.git
  • Add the products you want to monitor to resources/urls.json
  • Sign up for Scrapy Cloud (it’s free!)
  • Create a project on Scrapy Cloud
  • Deploy your local project to Scrapy Cloud
  • Create a periodic execution job to run each spider
  • Create a periodic execution job to run the monitor script
  • Sit back, relax, and let automation work its magic

Scaling up

This price monitor is a good fit for individuals interested in getting the best deals for their wishlist. However, if you’re looking to scale up and create a reliable tool for monitoring competitors, here are some typical challenges that you will face:

  • Getting prices from online retailers who feature millions of products can be overwhelming. Scraping these sites requires advanced crawling strategies to make sure that you always have hot data that is relevant.
  • Online retailers typically have layout variations throughout their website and the smallest shifts can bring your crawler to a screeching halt. To get around this, you might need to use advanced techniques such as machine learning to help with data discovery.
  • Running into anti-bot software can shut your price gathering activities down. You will need to develop some sophisticated techniques for bypassing these obstacles.

If you’re curious about how to implement or develop an automated price monitoring tool, feel free to reach out with any questions.

Tell us about your needs

Wrap up

To sum up, there’s no reason why you should be manually searching for prices and monitoring competitors. Using Scrapy, Scrapy Cloud, a Python script, and just a little bit of programming know-how, you can easily get your holiday shopping done under budget with deals delivered straight to your inbox.

If you’re looking for a professional-grade competitor and price monitoring service, get in touch!

An Introduction to XPath: How to Get Started


XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document and is actually one of the languages that you can use to extract web data using Scrapy. The other is CSS, and while CSS selectors are a popular choice, XPath actually lets you do more.

With XPath, you can extract data based on text elements’ contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may just save the day (and a bunch of your time!).

This is an introductory tutorial that will walk you through the basic concepts of XPath, crucial to a good understanding of it, before diving into more complex use cases.

Note: You can use the XPath playground to experiment with XPath. Just paste the HTML samples provided in this post and play with the expressions.

The basics

Consider this HTML document:

<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>

XPath handles any XML/HTML document as a tree. This tree's root node is not part of the document itself. It is in fact the parent of the document element node (<html> in the case of the HTML above). This is what the XPath tree for the HTML document looks like:

[Diagram: the XPath tree for the sample HTML document]

As you can see, there are many node types in an XPath tree:

  • Element node: represents an HTML element, a.k.a. an HTML tag.
  • Attribute node: represents an attribute from an element node, e.g. the "href" attribute in <a href="http://www.example.com">example</a>.
  • Comment node: represents comments in the document (<!-- … -->).
  • Text node: represents the text enclosed in an element node (example in <p>example</p>).

Distinguishing between these different types is useful to understand how XPath expressions work. Now let’s start digging into XPath.

Here is how we can select the title element from the page above using an XPath expression:

/html/head/title

This is what we call a location path. It allows us to specify the path from the context node (in this case the root of the tree) to the element we want to select, as we do when addressing files in a file system. The location path above has three location steps, separated by slashes. It roughly means: start from the ‘html’ element, look for a ‘head’ element underneath, and a ‘title’ element underneath that ‘head’. The context node changes in each step. For example, the head node is the context node when the last step is being evaluated.

However, we usually don’t know or don’t care about the full explicit node-by-node path, we just care about the nodes with a given name. We can select them using:

//title

Which means: look in the whole tree, starting from the root of the tree (//) and select only those nodes whose name matches title. In this example, // is the axis and title is the node test.

In fact, the expressions we’ve just seen are using XPath’s abbreviated syntax. Translating //title to the full syntax we get:

/descendant-or-self::node()/child::title

So, // in the abbreviated syntax is short for descendant-or-self, which means the current node or any node below it in the tree. This part of the expression is called the axis and it specifies a set of nodes to select from, based on their direction on the tree from the current context (downwards, upwards, on the same tree level). Other examples of axes are: parent, child, ancestor, etc — we’ll dig more into this later on.

The next part of the expression, node(), is called a node test, and it contains an expression that is evaluated to decide whether a given node should be selected or not. In this case, it selects nodes of all types. Then we have another axis, child, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title.

So, the axis defines where in the tree the node test should be applied and the nodes that match the node test will be returned as a result.

You can test nodes against their name or against their type.

Here are some examples of name tests:

  • /html: Selects the node named html, which is under the root.
  • /html/head: Selects the node named head, which is under the html node.
  • //title: Selects all the title nodes from the HTML tree.
  • //h2/a: Selects all a nodes which are directly under an h2 node.

And here are some examples of node type tests:

  • //comment(): Selects only comment nodes.
  • //node(): Selects any kind of node in the tree.
  • //text(): Selects only text nodes, such as "This is the first paragraph".
  • //*: Selects all nodes, except comment and text nodes.

We can also combine name and node tests in the same expression. For example:

//p/text()

This expression selects the text nodes from inside p elements. In the HTML snippet shown above, it would select “This is the first paragraph.”.
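If you want to try these expressions from Python rather than the playground, you can use parsel, the selector library that Scrapy uses under the hood (pip install parsel):

from parsel import Selector

html = '''
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
'''

sel = Selector(text=html)
print(sel.xpath('/html/head/title/text()').extract_first())  # My page
print(sel.xpath('//p/text()').extract_first())  # This is the first paragraph.
print(sel.xpath('//h2/a/text()').extract_first())  # page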

Now, let’s see how we can further filter and specify things. Consider this HTML document:

<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>

Say we want to select only the first li node from the snippet above. We can do this with:

//li[position() = 1]

The expression surrounded by square brackets is called a predicate and it filters the node set returned by //li (that is, all li nodes from the document) using the given condition. In this case it checks each node’s position using the position() function, which returns the position of the current node in the resulting node set (notice that positions in XPath start at 1, not 0). We can abbreviate the expression above to:

//li[1]

Both XPath expressions above would select the following element:

<li>Quote 1</li>

Check out a few more predicate examples:

  • //li[position() mod 2 = 0]: Selects the li elements at even positions.
  • //li[a]: Selects the li elements which enclose an a element.
  • //li[a or h2]: Selects the li elements which enclose either an a or an h2 element.
  • //li[ a [ text() = "link" ] ]: Selects the li elements which enclose an a element whose text is "link". Can also be written as //li[ a/text()="link" ].
  • //li[last()]: Selects the last li element in the document.

So, a location path is basically composed of steps separated by /, and each step can have an axis, a node test and a predicate. Here we have an expression composed of two steps, each one with an axis, a node test and a predicate:

//li[ 4 ]/h2[ text() = "Quote 4 title" ]

And here is the same expression, written using the non-abbreviated syntax:

/descendant-or-self::node()
    /child::li[ position() = 4 ]
        /child::h2[ text() = "Quote 4 title" ]

We can also combine multiple XPath expressions in a single one using the union operator |. For example, we can select all a and h2 elements in the document above using this expression:

//a | //h2

Now, consider this HTML document:

<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>

Say we want to select only the a elements whose link points to an HTTPS URL. We can do it by checking their href attribute:

//a[starts-with(@href, "https")]

This expression first selects all the a elements from the document and for each of those elements, it checks whether their href attribute starts with “https”. We can access any node attribute using the @attributename syntax.

Here we have a few additional examples using attributes:

  • //a[@href="https://scrapy.org"]: Selects the a elements pointing to https://scrapy.org.
  • //a/@href: Selects the value of the href attribute from all the a elements in the document.
  • //li[@id]: Selects only the li elements which have an id attribute.
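Here is a quick check of a couple of these attribute expressions, again using parsel and the link list shown above:

from parsel import Selector

html = '''
<ul>
  <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
  <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
  <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
  <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
</ul>
'''

sel = Selector(text=html)
print(sel.xpath('//a[starts-with(@href, "https")]/@href').extract())
# ['https://scrapy.org', 'https://scrapinghub.com', 'https://blog.scrapinghub.com']
print(sel.xpath('//li[@id]/a/text()').extract())
# ['Scrapy', 'Quotes To Scrape']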

More on Axes

We’ve seen only two types of axes so far:

  • descendant-or-self
  • child

But there’s plenty more where they came from and we’ll see a few examples. Consider this HTML document:

<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>

Now we want to extract only the first paragraph after each of the titles. To do that, we can use the following-sibling axis, which selects all the siblings after the context node. Siblings are nodes that are children of the same parent; for example, all child nodes of the body tag are siblings. This is the expression:

//h1/following-sibling::p[1]

In this example, the context node where the following-sibling axis is applied to is each of the h1 nodes from the page.

What if we want to select only the text that is right before the footer? We can use the preceding-sibling axis:

//div[@id='footer']/preceding-sibling::text()[1]

In this case, we are selecting the first text node before the div footer (“A single paragraph, with no markup”).

XPath also allows us to select elements based on their text content. We can use such a feature, along with the parent axis, to select the parent of the p element whose text is “Footer text”:

//p[ text()="Footer text" ]/..

The expression above selects <div id="footer"><p>Footer text</p></div>. As you may have noticed, we used .. here as a shortcut to the parent axis.

As an alternative to the expression above, we could use:

//*[p/text()="Footer text"]

It selects, from all elements, the ones that have a p child whose text is "Footer text", getting the same result as the previous expression.
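Running a couple of the axis-based expressions through parsel confirms the behaviour described above:

from parsel import Selector

html = '''
<body>
  <p>Intro paragraph</p>
  <h1>Title #1</h1>
  <p>A random paragraph #1</p>
  <h1>Title #2</h1>
  <p>A random paragraph #2</p>
  <p>Another one #2</p>
  A single paragraph, with no markup
  <div id="footer"><p>Footer text</p></div>
</body>
'''

sel = Selector(text=html)
print(sel.xpath('//h1/following-sibling::p[1]/text()').extract())
# ['A random paragraph #1', 'A random paragraph #2']
print(sel.xpath('//*[p/text()="Footer text"]/@id').extract_first())
# footer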

You can find additional axes in the XPath specification: https://www.w3.org/TR/xpath/#axes

Wrap up

XPath is very powerful and this post is just an introduction to the basic concepts. If you want to learn more about it, check out these resources:

And stay tuned, because we will post a series with more XPath tips from the trenches in the following months.

How to Deploy Custom Docker Images for Your Web Crawlers


What if you could have complete control over your environment? Your crawling environment, that is… One of the many benefits of our upgraded production environment, Scrapy Cloud 2.0, is that you can customize your crawler runtime environment via Docker images. It’s like a superpower that allows you to use specific versions of Python, Scrapy and the rest of your stack, deciding if and when to upgrade.


With this new feature, you can tailor a Docker image to include any dependency your crawler might have. For instance, if you wanted to crawl JavaScript-based pages using Selenium and PhantomJS, you would have to include the PhantomJS executable somewhere in the PATH of your crawler’s runtime environment.

And guess what, we’ll be walking you through how to do just that in this post.

Heads up, while we have a forever free account, this feature is only available for paid Scrapy Cloud users. The good news is that it’s easy to upgrade your account. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

Using a custom image to run a headless browser

Download the sample project or clone the GitHub repo to follow along.

Imagine you created a crawler to handle website content that is rendered client-side via JavaScript. You decide to use Selenium and PhantomJS. However, since PhantomJS is not installed by default on Scrapy Cloud, trying to deploy your crawler the usual way would result in this message showing up in the job logs:

selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable needs to be in PATH.

PhantomJS, which is a C++ application, needs to be installed in the runtime environment. You can do this by creating a custom Docker image that downloads and installs the PhantomJS executable.
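For reference, here is roughly what a spider driving PhantomJS through Selenium looks like. This is a simplified sketch (the spider name, URL and selectors are illustrative), not the exact code from the sample project:

import scrapy
from selenium import webdriver


class JSQuotesSpider(scrapy.Spider):
    name = 'jsquotes'
    start_urls = ['http://quotes.toscrape.com/js/']

    def __init__(self, *args, **kwargs):
        super(JSQuotesSpider, self).__init__(*args, **kwargs)
        # This call fails at startup unless the phantomjs binary is in the PATH.
        self.driver = webdriver.PhantomJS()

    def parse(self, response):
        # Let PhantomJS render the JavaScript-generated markup, then build
        # a Scrapy selector from the rendered page source.
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        for quote in rendered.css('div.quote span.text::text').extract():
            yield {'quote': quote}

    def closed(self, reason):
        self.driver.quit()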

Building a custom Docker image

First you have to install a command line tool that will help you with building and deploying the image:

$ pip install shub-image

Before using shub-image, you have to include scrapinghub-entrypoint-scrapy in your project’s requirements file, which is a runtime dependency of Scrapy Cloud.

$ echo scrapinghub-entrypoint-scrapy >> ./requirements.txt

Once you have done that, run the following command to generate an initial Dockerfile for your custom image:

$ shub-image init --requirements ./requirements.txt

It will ask you whether you want to save the Dockerfile, so confirm by answering Y.

Now it's time to include the installation steps for the PhantomJS binary in the generated Dockerfile. All you need to do is copy the PhantomJS installation step (the RUN instruction marked with a comment below) and put it in the proper place inside your Dockerfile:

FROM python:2.7
RUN apt-get update -qq && \
    apt-get install -qy htop iputils-ping lsof ltrace strace telnet vim && \
    rm -rf /var/lib/apt/lists/*
# Added step: download PhantomJS and put the binary in the PATH
RUN wget -q https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/bin && \
    rm -rf phantomjs-2.1.1-linux-x86_64.tar.bz2 phantomjs-2.1.1-linux-x86_64
ENV TERM xterm
ENV PYTHONPATH $PYTHONPATH:/app
ENV SCRAPY_SETTINGS_MODULE demo.settings
RUN mkdir -p /app
WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app

The Docker image you’re going to build with shub-image has to be uploaded to a Docker registry. I used Docker Hub, the default Docker registry, to create a repository under my user account:

[Screenshot: the Docker Hub repository created for the image]

Once this is done, you have to define the images setting in your project’s scrapinghub.yml (replace stummjr/demo with your own):

projects:
    default: PUT_YOUR_PROJECT_ID_HERE
requirements_file: requirements.txt
images:
    default: stummjr/demo

This will tell shub-image where to push the image once it’s built and also where Scrapy Cloud should pull the image from when deploying.

Now that you have everything configured as expected, you can build, push and deploy the Docker image to Scrapy Cloud. This step may take a couple of minutes, so now might be a good time to go grab a cup of coffee. 🙂

$ shub-image upload --username stummjr --password NotSoEasy
The image stummjr/demo:1.0 build is completed.
Pushing stummjr/demo:1.0 to the registry.
The image stummjr/demo:1.0 pushed successfully.
Deploy task results:
You can check deploy results later with 'shub-image check --id 1'.
Deploy results:
{u'status': u'progress', u'last_step': u'pulling'}
{u'status': u'ok', u'project': 98162, u'version': u'1.0', u'spiders': 1}

If everything went well, you should now be able to run your PhantomJS spider on Scrapy Cloud. If you followed along with the sample project from the GitHub repo, your crawler should have collected 300 quotes scraped from the page that was rendered with PhantomJS.

Wrap Up

You now officially know how to use custom Docker images with Scrapy Cloud to supercharge your crawling projects. For example, you might want to do OCR using Tesseract in your crawler. Now you can: it's just a matter of creating a Docker image with the Tesseract command line tool and pytesseract installed. You can also install tools from apt repositories and even compile the libraries/tools that you want.

Warning: this feature is still in beta, so be aware that some Scrapy Cloud features, such as addons, dependencies and settings, still don’t work with custom images.

For further information, check out the shub-image documentation.

Feel free to comment below with any other ideas or tips you’d like to hear more about!

This feature is a perk of paid accounts, so painlessly upgrade to unlock custom Docker images for your projects. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

How to Crawl the Web Politely with Scrapy


The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.

In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.


What Makes a Crawler Polite?

  • A polite crawler respects robots.txt.
  • A polite crawler never degrades a website's performance.
  • A polite crawler identifies its creator with contact information.
  • A polite crawler is not a pain in the buttocks of system administrators.

robots.txt

Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard. Some websites even use the crawlers’ user agent to specify separate rules for different web crawlers:

User-agent: Some_Annoying_Bot
Disallow: /

User-Agent: *
Disallow: /*.json
Disallow: /api
Disallow: /post
Disallow: /submit
Allow: /

Crawl-Delay

Mission critical to having a polite crawler is making sure your crawler doesn’t hit a website too hard. Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.

When a website gets overloaded with more requests than the web server can handle, it might become unresponsive. Don't be that guy or girl that causes a headache for the website administrators.

User-Agent

However, if you have ignored the cardinal rules above (or your crawler has achieved aggressive sentience), there needs to be a way for the website owners to contact you. You can do this by including your company name and an email address or website in the request’s User-Agent header. For example, Google’s crawler user agent is “Googlebot”.

Scrapinghub Abuse Report Form

Hey folks using our Scrapy Cloud platform! We trust you will crawl responsibly, but to support website administrators, we provide an abuse report form where they can report any misbehaviour from crawlers running on our platform. We'll kindly pass the message along so that you can modify your crawls and avoid ruining a sysadmin's day. If your crawlers are turning into Skynet and running roughshod over human law, we reserve the right to halt their crawling activities and thus avert the robot apocalypse.

How to be Polite using Scrapy

Scrapy is a bit like Optimus Prime: friendly, fast, and capable of getting the job done no matter what. However, much like Optimus Prime and his fellow Autobots, Scrapy occasionally needs to be kept in check. So here’s the nitty gritty for ensuring that Scrapy is as polite as can be.


Robots.txt

Crawlers created using Scrapy 1.1+ already respect robots.txt by default. If your crawlers have been generated using a previous version of Scrapy, you can enable this feature by adding this in the project’s settings.py:

ROBOTSTXT_OBEY = True

Then, every time your crawler tries to download a page from a disallowed URL, you’ll see a message like this:

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

Identifying your Crawler

It’s important to provide a way for sysadmins to easily contact you if they have any trouble with your crawler. If you don’t, they’ll have to dig into their logs and look for the offending IPs.

Be nice to the friendly sysadmins in your life and identify your crawler via the Scrapy USER_AGENT setting. Share your crawler name, company name and a contact email:

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

Introducing Delays

Scrapy spiders are blazingly fast. They can handle many concurrent requests and they make the most of your bandwidth and computing power. However, with great power comes great responsibility.

To avoid hitting the web servers too frequently, you need to use the DOWNLOAD_DELAY setting in your project (or in your spiders). Scrapy will then introduce a random delay ranging from 0.5 * DOWNLOAD_DELAY to 1.5 * DOWNLOAD_DELAY seconds between consecutive requests to the same domain. If you want to stick to the exact DOWNLOAD_DELAY that you defined, you have to disable RANDOMIZE_DOWNLOAD_DELAY.

By default, DOWNLOAD_DELAY is set to 0. To introduce a 5 second delay between requests from your crawler, add this to your settings.py:

DOWNLOAD_DELAY = 5.0

If you have a multi-spider project crawling multiple sites, you can define a different delay for each spider with the download_delay (yes, it’s lowercase) spider attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'
    download_delay = 5.0
    ...

Concurrent Requests Per Domain

Another setting you might want to tweak to make your spider more polite is the number of concurrent requests it will do for each domain. By default, Scrapy will dispatch at most 8 requests simultaneously to any given domain, but you can change this value by updating the CONCURRENT_REQUESTS_PER_DOMAIN setting.

Heads up, the CONCURRENT_REQUESTS setting defines the maximum amount of simultaneous requests that Scrapy’s downloader will do for all your spiders. Tweaking this setting is more about your own server performance / bandwidth than your target’s when you’re crawling multiple domains at the same time.
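For example, to stay gentle with each individual site while still crawling several sites in parallel, you could combine the two settings in your settings.py like this (the numbers are just illustrative):

CONCURRENT_REQUESTS = 32            # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # at most 2 simultaneous requests per site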

AutoThrottle to Save the Day

Websites vary drastically in the number of requests they can handle. Adjusting this manually for every website that you are crawling is about as much fun as watching paint dry. To save your sanity, Scrapy provides an extension called AutoThrottle.

AutoThrottle automatically adjusts the delays between requests according to the current web server load. It first calculates the latency from one request. Then it will adjust the delay between requests for the same domain in a way that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests will be simultaneously active. It also ensures that requests are evenly distributed in a given timespan.

To enable AutoThrottle, just include this in your project’s settings.py:

AUTOTHROTTLE_ENABLED = True

Scrapy Cloud users don’t have to worry about enabling it, because it’s already enabled by default.

There’s a wide range of settings to help you tweak the throttle mechanism, so have fun playing around!
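For instance, these are some of the AutoThrottle settings you can tune in settings.py (the values shown are illustrative, not recommendations):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when latencies are high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average concurrent requests per remote server
AUTOTHROTTLE_DEBUG = True               # log every throttling decision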

Use an HTTP Cache for Development

Developing a web crawler is an iterative process. However, running a crawler to check if it’s working means hitting the server multiple times for each test. To help you to avoid this impolite activity, Scrapy provides a built-in middleware called HttpCacheMiddleware. You can enable it by including this in your project’s settings.py:

HTTPCACHE_ENABLED = True

Once enabled, it caches every request made by your spider along with the related response. So the next time you run your spider, it will not hit the server for requests already done. It’s a win-win: your tests will run much faster and the website will save resources.

Don’t Crawl, use the API

Many websites provide HTTP APIs so that third parties can consume their data without having to crawl their web pages. Before building a web scraper, check if the target website already provides an HTTP API that you can use. If it does, go with the API. Again, it’s a win-win: you avoid digging into the page’s HTML and your crawler gets more robust because it doesn’t need to depend on the website’s layout.

Wrap Up

Let’s all do our part to keep the peace between sysadmins, website owners, and developers by making sure that our web crawling projects are as noninvasive as possible. Remember, we need to band together to delay the rise of our robot overlords, so let’s keep our crawlers, spiders, and bots polite.


To all website owners, help a crawler out and ensure your site has an HTTP API. And remember, if someone using our platform is overstepping their bounds, please fill out an Abuse Report form and we’ll take care of the issue.

For those new to our platform, Scrapy Cloud is forever free and is the peanut butter to Scrapy’s jelly. For our existing Scrapy and Scrapy Cloud users, hopefully you learned a few tips for how to both speed up your crawls and prevent abuse complaints. Let us know if you have any further suggestions in the comment section below!

Sign up for free

Introducing Scrapy Cloud with Python 3 Support


It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3.


If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping projects. It pairs with Scrapy, the open source web scraping framework, and Portia, our open source visual web scraper.

Scrapy + Scrapy Cloud with Python 3

I’m sure you Scrapy users are breathing a huge sigh of relief! While Scrapy with official Python 3 support has been around since May, you can now deploy your Scrapy spiders using the fancy new features introduced with Python 3 to Scrapy Cloud. You’ll have the beloved extended tuple unpacking, function annotations, keyword-only arguments and much more at your fingertips.

Fear not if you are a Python 2 developer and can’t port your spiders’ codebase to Python 3, because Scrapy Cloud will continue supporting Python 2. In fact, Python 2 remains the default unless you explicitly set your environment to Python 3.

Deploying your Python 3 Spiders

Docker support was one of the new features that came along with the Scrapy Cloud 2.0 release in May. It brings more flexibility to your spiders, allowing you to define in which kind of runtime environment (AKA stack) they will be executed.

This configuration is done in your local project's scrapinghub.yml. There you have to include a stacks section, setting scrapy:1.1-py3 as the stack for your Scrapy Cloud project:

projects:
    default: 99999
stacks:
    default: scrapy:1.1-py3

After doing that, you just have to deploy your project using shub:

$ shub deploy

Note: make sure you are using shub 2.3+ by upgrading it:

$ pip install shub --upgrade

And you’re all done! The next time you run your spiders on Scrapy Cloud, they will run on Scrapy 1.1 + Python 3.

Multi-target Deployment File

If you have a multi-target deployment file, you can define a separate stack for each project ID:

projects:
    default:
        id: 55555
        stack: scrapy:1.1
    py3:
        id: 99999
        stack: scrapy:1.1-py3

This allows you to deploy your local project to whichever Scrapy Cloud project you want, using a different stack for each one:

$ shub deploy py3

This deploys your crawler to project 99999 and uses Scrapy 1.1 + Python 3 as the execution environment.

You can find different versions of the Scrapy stack here.

Wrap Up

We hope that you’re as excited as we are for this newest upgrade to Python 3. If you have further questions or are interested in learning more about the souped up Scrapy Cloud, take a look at our Knowledge Base article.

For those new to our platform, Scrapy Cloud has a forever free subscription, so sign up and give us a try.

Sign up for free

Incremental Crawls with Scrapy and DeltaFetch


Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.


Scrapy is designed to be extensible, with loosely coupled components. You can easily extend Scrapy's functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch's main purpose is to avoid requesting pages that have already been scraped, even if it happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs listed in the spider's start_urls attribute, and to requests generated in the spider's start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the related request identifier (a.k.a. fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.
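To get a feel for what such a fingerprint looks like, you can compute one with Scrapy's own helper. This is the same kind of request fingerprint that DeltaFetch relies on, although the exact storage details are internal to the plugin:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

fp = request_fingerprint(Request('http://books.toscrape.com/'))
print(fp)  # a 40-character SHA1 hex digest identifying this request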

Now let’s see how to set up Deltafetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

The example project for this post has a spider that crawls books.toscrape.com. It navigates through all the listing pages and visits every book details page to fetch data such as the book title, description and category. The crawler is executed once a day in order to capture new books that are added to the catalogue. There's no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn't change.

To see Deltafetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider did 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 page requests haven’t generated items and so they will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you'll see that 1000 requests were skipped because items had been scraped from those pages in a previous crawl. This time the spider didn't extract any items and made only 51 requests, all of them to listing pages from which no items have been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed based on the canonical URL, HTTP method and request body.

Some websites have several URLs for the same data. For example, an e-commerce site could have the following URLs pointing to a single product:

  • www.example.com/product?id=123
  • www.example.com/deals?id=123
  • www.example.com/category/keyboards?id=123
  • www.example.com/category/gaming?id=123

Request fingerprints aren’t suitable in these situations as the canonical URL will differ despite the item being the same. In this example, we could use the product’s ID as the DeltaFetch key.

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from w3lib.url import url_query_parameter
from scrapy import Request

...

def parse(self, response):
    ...
    # Extract the href values (plain strings), not the selector objects.
    for product_url in response.css('a.product_listing::attr(href)').extract():
        yield Request(
            product_url,
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.

[Screenshot: the DeltaFetch and DotScrapy Persistence addons enabled on the project's Addons page]

DeltaFetch is quite handy in situations like the ones we've just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if these requests were not generated from the spider's start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on GitHub for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap-up

You can find many interesting Scrapy plugins in the scrapy-plugins page on GitHub and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you'd like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Scraping Infinite Scrolling Pages


Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.


In the era of single page apps and tons of AJAX requests per page, a lot of websites have replaced “previous/next” pagination buttons with a fancy infinite scrolling mechanism. Websites using this technique load new items whenever the user scrolls to the bottom of the page (think Twitter, Facebook, Google Images). Even though UX experts maintain that infinite scrolling provides an overwhelming amount of data for users, we’re seeing an increasing number of web pages resorting to presenting this unending list of results.

When developing our web scrapers, one of the first things we do is look for UI components with links that might lead us to the next page of results. Unfortunately, these links aren’t present on infinite scrolling web pages.

While this scenario might seem like a classic case for a JavaScript engine such as Splash or Selenium, it’s actually a simple fix. Instead of simulating user interaction with such engines, all you have to do is inspect your browser’s AJAX requests when you scroll the target page and then re-create those requests in your Scrapy spider.

Let’s use Spidy Quotes as an example and build a spider to get all the items listed on it.

Inspecting the Page

First things first, we need to understand how the infinite scrolling works on this page and we can do so by using the Network panel in the Browser’s developer tools. Open the panel and then scroll down the page to see the requests that the browser is firing:

[Screenshot: the browser's Network panel showing the AJAX requests fired while scrolling]

Click on a request for a closer look. The browser sends a request to /api/quotes?page=x and then receives a JSON object like this in response:

{
   "has_next":true,
   "page":8,
   "quotes":[
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["individuality", "majority", "minority", "wisdom"],
         "text":"Whenever you find yourself on the side of the ..."
      },
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["books", "contentment", "friends"],
         "text":"Good friends, good books, and a sleepy ..."
      }
   ],
   "tag":null,
   "top_ten_tags":[["love", 49], ["inspirational", 43], ...]
}

This is the information we need for our spider. All it has to do is generate requests to “/api/quotes?page=x” for an increasing x until the has_next field becomes false. The best part of this is that we don’t even have to scrape the HTML contents to get the data we need. It’s all in a beautiful machine-readable JSON.

Building the Spider

Here is our spider. It extracts the target data from the JSON content returned by the server. This approach is easier and more robust than digging into the page’s HTML tree, trusting that layout changes will not break our spiders.

import json
import scrapy
 
 
class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5
 
    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

To further practice this tip, you can experiment with building a spider for our blog since it also uses infinite scrolling to load older posts.

Wrap Up

If you were feeling daunted by the prospect of scraping infinite scrolling websites, hopefully you’re feeling a bit more confident now. The next time that you have to deal with a page based on AJAX calls triggered by user actions, take a look at the requests that your browser is making and then replay them in your spider. The response is usually in a JSON format, making your spider even simpler.

And that’s it for June! Please let us know what you would like to see in future columns by reaching out on Twitter. We also recently released a Datasets Catalog, so if you’re stumped on what to scrape, take a look for some inspiration.

How to Debug your Scrapy Spiders


Welcome to Scrapy Tips from the Pros! Every month we release a few tricks and hacks to help speed up your web scraping and data extraction activities. As the lead Scrapy maintainers, we have run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with suggestions for future topics.


Your spider isn’t working and you have no idea why. One way to quickly spot potential issues is to add a few print statements to find out what’s happening. This is often my first step and sometimes all I need to do to uncover the bugs that are preventing my spider from running properly. If this method works for you, great, but if it’s not enough, then read on to learn about how to deal with the nastier bugs that require a more thorough investigation. In this post, I’ll introduce you to the tools that should be in the toolbelt of every Scrapy user when it comes to debugging spiders.

Scrapy Shell is your Best Friend

Scrapy shell is a full-featured Python shell loaded with the same context that you would get in your spider callback methods. You just have to provide a URL and Scrapy Shell will let you interact with the same objects that your spider handles in its callbacks, including the response object.

$ scrapy shell http://blog.scrapinghub.com
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f0638a2cbd0>
[s]   item       {}
[s]   request    <GET http://blog.scrapinghub.com>
[s]   response   <200 https://blog.scrapinghub.com/>
[s]   settings   <scrapy.settings.Settings object at 0x7f0638a2cb50>
[s]   spider     <DefaultSpider 'default' at 0x7f06371f3290>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

After loading it, you can start playing around with the response in order to build the selectors to extract the data that you need:

>>> response.css("div.post-header > h2 ::text").extract()
...

If you’re not familiar with Scrapy Shell, give it a try. It’s a perfect fit for your development workflow, sitting right after the page inspection in the browser. You can create and test your spider’s extraction rules and use them in your spider’s code once you’ve built the ones you need.

Learn more about Scrapy Shell through the official documentation.

Start Scrapy Shell from your Spider Code

If your spider has been behaving unexpectedly for certain responses, you can quickly see what’s happening using the scrapy.shell.inspect_response method in your spider code. This will open a Scrapy shell session that will let you interact with the current response object.

For example, imagine that your spider is not extracting the expected amount of items from certain pages and you want to see what’s wrong with the response returned by the website:

import scrapy
from scrapy.shell import inspect_response


class BlogSpider(scrapy.Spider):
    ...
    def parse(self, response):
        if len(response.css('div.post-header > h2 ::text')) > EXPECTED:
            # generate the items
            ...
        else:
            inspect_response(response, self)
        ...

Once the execution hits the inspect_response call, Scrapy Shell is opened and you can interact with the response to see what’s happening.

Quickly Attaching a Debugger to your Spider

Another approach to debugging spiders is to use a regular Python debugger such as pdb or PuDB. I use PuDB because it’s quite a powerful yet easy-to-use debugger and all I need to do to activate it is to put this code in the line where I want a breakpoint:

import pudb; pudb.set_trace()

And when the breakpoint is reached, PuDB opens up a cool text-mode UI in your terminal that will bring back fond memories from the old days of using the Turbo Pascal debugger.

Take a look: [Screenshot: PuDB's text-mode debugging UI running in the terminal]

You can install PuDB using pip:

$ pip install pudb

Check out this video where our very own @eliasdorneles demonstrates a few tips on how to use PuDB: https://vimeo.com/166584837

Scrapy parse CLI command

There are certain scraping projects where you need your spiders to run for a long time. However, after a few hours of running, you might sadly see in the logs that one of your spiders had issues scraping specific URLs. You want to debug the spider, but you certainly don’t want to run the whole crawling process again and have to wait until that specific callback is called for that specific URL so that you can start your debugger.

Don’t worry, the parse command from Scrapy CLI is here to save the day! You just need to provide the spider name, the callback from the spider that should be used and the URL that you want to parse:

$ scrapy parse http://blog.scrapinghub.com/comments/bla --spider blog -c parse_comments

In this case, Scrapy is going to call the parse_comments method from the blog spider to parse the blog.scrapinghub.com/comments/bla URL. If you don't specify the spider, Scrapy will search for a spider capable of handling this URL in your project, based on the spiders' allowed_domains attribute.

It will then show you a summary of your callback’s execution:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'comments': [
    {'content': u"I've seen this language ...",
     'username': u'forthemostpart'},
    {'content': u"It's a ...",
     'username': u'YellowAfterlife'},
    ...
    {'content': u"There is a macro for ...",
    'username': u'mrcdk'}]}]
# Requests  -----------------------------------------------------------------
[]

You can also attach a debugger inside the method to help you figure out what’s happening (see the previous tip).

Scrapy fetch and view commands

Inspecting page contents in browsers might be deceiving, since their JavaScript engine could render content that the Scrapy downloader will not. If you want to quickly check exactly how a page will look when downloaded by Scrapy, you can use these commands:

  • fetch: downloads the HTML using Scrapy Downloader and prints to stdout.
  • view: downloads the HTML using Scrapy Downloader and opens it with your default browser.

Examples:

$ scrapy fetch http://blog.scrapinghub.com > blog.html
$ scrapy view http://scrapy.org

Post-Mortem Debugging Over Spiders with --pdb Option

Writing fail-proof software is nearly impossible. This situation is worse for web scrapers since they deal with web content that is frequently changing (and breaking). It’s better to accept that our spiders will eventually fail and to make sure that we have the tools to quickly understand why it’s broken and to be able to fix it as soon as possible.

Python tracebacks are great, but in some cases they don’t provide us with enough information about what happened in our code. This is where post-mortem debugging comes into play. Scrapy provides the --pdb command line option that fires a pdb session right where your crawler has broken, so you can inspect its context and understand what happened:

$ scrapy crawl blog -o blog_items.jl --pdb

If your spider dies due to a fatal exception, the pdb debugger will open and you can thoroughly inspect its cause of death.
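Once you land in the pdb prompt, the usual post-mortem commands apply; for example (the variable you inspect will of course depend on your own callback):

(Pdb) where           # print the traceback that led to the crash
(Pdb) up              # move up to the frame where the error originated
(Pdb) p response.url  # print a variable from that frame
(Pdb) l               # list the source code around the current line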

Wrap-up

And that’s it for the Scrapy Tips from the Pros May edition. Some of these debugging tips are also available in the official Scrapy documentation.

Please let us know what you’d like to see in the future since we’re here to help you scrape the web more effectively. We’ll see you next month!

 

Scrapy + MonkeyLearn: Textual Analysis of Web Data

Scrapy + MonkeyLearn: Textual Analysis of Web Data

We recently announced our integration with MonkeyLearn, bringing machine learning to Scrapy Cloud. MonkeyLearn offers numerous text analysis services via its API. Since there are so many uses for this platform addon, we’re launching a series of tutorials to help get you started.


To kick off the MonkeyLearn Addon Tutorial series, let’s start with something we can all identify with: shopping. Whether you need to buy something for yourself, friends or family, or even the office, you need to evaluate cost, quality, and reviews. And when you’re working on a budget of both money and time, it can be helpful to automate the process with web scraping.

When scraping shopping and e-commerce sites, you’re most likely going to want product categories. Typically, you’d do this using the breadcrumbs. However, the challenge comes when you want to scrape several websites at once while keeping categories consistent throughout.

This is where MonkeyLearn comes in. You can use their Retail Classifier to classify products based on their descriptions, taking away the ambiguity of varied product categories.

This post will walk you through how to use MonkeyLearn’s Retail Classifier through the MonkeyLearn addon on Scrapy Cloud to scrape and categorize products from an online retailer.

Say Hello to Scrapy Cloud 2.0

For those of you who are new here, Scrapy Cloud is our cloud-based platform that lets you easily deploy and run Scrapy and Portia web spiders without needing to deal with servers, libraries and dependencies, scheduling, storage, or monitoring. Scrapy Cloud recently underwent an upgrade and now features Docker support and a whole host of other updates.

In this tutorial, we’re using Scrapy to crawl and extract data. Scrapy’s decoupled architecture lets you use ready-made integrations for your spiders. The MonkeyLearn addon implements a Scrapy middleware. The addon takes every item scraped and sends the fields of your choice to MonkeyLearn for analysis. The classifier then stores the resulting category in another field of your choice. This lets you classify items without any extra code.
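To make the data flow concrete, here is a rough before/after sketch of a single item (the field names match the ProductItem class shown later in this post; the field values and the exact structure of the stored predictions are illustrative only):

# item as yielded by your spider (values made up for illustration)
{'title': 'NETGEAR Nighthawk X4 AC2350 router',
 'description': 'Enjoy speedy Wi-Fi around your home...',
 'url': 'http://example.com/routers/nighthawk-x4'}

# the same item after the MonkeyLearn middleware has processed it;
# the structure of the 'category' value may differ in practice
{'title': 'NETGEAR Nighthawk X4 AC2350 router',
 'description': 'Enjoy speedy Wi-Fi around your home...',
 'url': 'http://example.com/routers/nighthawk-x4',
 'category': [{'label': 'Routers', 'probability': 0.92}]}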

If you are a new user, sign up for Scrapy Cloud for free to continue on with this addon tutorial.

Meet MonkeyLearn’s Retail Classifier

We’ll begin by trying out the MonkeyLearn Retail Classifier with a sample description:

Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 AC2350 R7500-100NAS router, which features 4 high-performance antennas and Beamforming+ technology for optimal wireless range. Dynamic QoS prioritization automatically adjusts bandwidth.

Paste this sample into the test form under the Sandbox > Classify tab and hit Submit:

[Screenshot: the Sandbox > Classify test form with the sample description pasted in]

You should get the following results:

[Screenshot: the classification results, showing the predicted category path and its confidence]

MonkeyLearn’s engine analyzed the description and identified that the product belongs in the Electronics / Computers / Networking / Routers category path. As a bonus, it reports how confident it is in each prediction.

The same example using curl would be:

curl --data '{"text_list": ["Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 AC2350 R7500-100NAS router, which features 4 high-performance antennas and Beamforming+ technology for optimal wireless range. Dynamic QoS prioritization automatically adjusts bandwidth."]}' \
-H "Authorization:Token <YOUR TOKEN GOES HERE>" \
-H "Content-Type: application/json" \
-D - \
"https://api.monkeylearn.com/v2/classifiers/cl_oFKL5wft/classify/?"

You can sign up for free on MonkeyLearn and replace <YOUR TOKEN GOES HERE> with your own API token to play with the Retail Classifier further.
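If you prefer Python to curl, roughly the same request can be made with the requests library (same endpoint, classifier id, and token placeholder as above; this is just a sketch and not part of the addon setup):

import requests

response = requests.post(
    'https://api.monkeylearn.com/v2/classifiers/cl_oFKL5wft/classify/',
    headers={'Authorization': 'Token <YOUR TOKEN GOES HERE>'},
    json={'text_list': [
        'Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 '
        'AC2350 R7500-100NAS router, which features 4 high-performance antennas '
        'and Beamforming+ technology for optimal wireless range. Dynamic QoS '
        'prioritization automatically adjusts bandwidth.'
    ]},
)
print(response.json())  # the predicted categories and probabilities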

Using MonkeyLearn with a Scrapy Cloud Project

Now we are going to deploy a Scrapy project to Scrapy Cloud and use the MonkeyLearn addon to categorize the scraped data. You can clone this project, or build your own spiders, and follow the steps described below.

1. Build your Scrapy spiders

For this tutorial, we built a spider for a fictional e-commerce website. The spider is pretty straightforward so that you can easily clone the whole project and try it by yourself. However, you should be aware of some details when building a spider from scratch to use with the MonkeyLearn addon.

First, the addon requires your spiders to generate Item objects from a pre-defined Item class. In our case, it’s the ProductItem class:

class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    category = scrapy.Field()  # filled in by the MonkeyLearn addon, not by the spider

Second, you have to declare an additional field in your Item class where the MonkeyLearn addon will store the analysis results. For our spider, these results will be stored in the category field of each scraped item.
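For reference, a spider callback that yields these items could look roughly like this (the start URL and CSS selectors are hypothetical, since the project targets a fictional e-commerce site; adjust the item import to your own project layout):

import scrapy
from myproject.items import ProductItem  # hypothetical project module

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['http://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['url'] = response.urljoin(product.css('a::attr(href)').extract_first())
            item['title'] = product.css('h2.title::text').extract_first()
            item['description'] = product.css('p.description::text').extract_first()
            # 'category' stays empty here -- the MonkeyLearn addon populates it
            yield item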

2. Set up shub

shub is the command line tool for managing your Scrapy Cloud services, and you will use it to deploy your Scrapy projects there. You can install it with pip:

$ pip install shub

Now authenticate yourself on Scrapy Cloud:

$ shub login
Enter your API key from https://app.scrapinghub.com/account/apikey
API key: <YOUR SCRAPINGHUB API KEY GOES HERE>
Validating API key...
API key is OK, you are logged in now.

You can get your API key in your account profile page.

3. Deploy your project

First go to Scrapy Cloud’s web dashboard and create a project there.

Then return to your command line terminal, go to your local project’s folder and run shub deploy. It will ask for the target project ID (i.e. the Scrapy Cloud project that you want to deploy your spider to), which you can get through the Code & Deploys link on your project page.

$ cd product-crawler
$ shub deploy
Target project ID: <YOUR PROJECT ID>
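Depending on your shub version, the project ID can also be stored in a scrapinghub.yml file at the root of your project, so you don’t have to type it on every deploy. A minimal version of that file looks like this (12345 stands in for your real project ID):

projects:
  default: 12345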

Now your project is ready to run in Scrapy Cloud.

4. Enable the MonkeyLearn addon on Scrapy Cloud

Note that before you enable the addon, you have to create an account on MonkeyLearn.

To enable the addon, head to the Addons Setup section in your Scrapy Cloud project’s settings:

[Screenshot: the Addons Setup section in the Scrapy Cloud project settings]

You can configure the addon with the following settings:

  • MonkeyLearn token: your MonkeyLearn API token. You can access it from your account settings on the MonkeyLearn website.
  • MonkeyLearn field to process: a list of item text fields (separated by commas) that will be used as input for the classifier. In this tutorial it is: title,description.
  • MonkeyLearn field output: the name of the new field that will be added to your items in order to store the categories returned by the classifier.
  • MonkeyLearn module: the id of the classifier that you are going to use. In this tutorial, the id is ‘cl_oFKL5wft’.
  • MonkeyLearn batch size: the number of items the addon will retain before sending them to MonkeyLearn for analysis.

You can find the id of any classifier in the URL:

[Screenshot: a MonkeyLearn classifier page, with the classifier id visible in the browser’s address bar]

When you’re done filling out all the fields, the addon configuration should look something like this:

[Screenshot: the completed MonkeyLearn addon configuration]

5. Run your Spiders

Now that you have the Retail Classifier enabled, run the spider by going to your project’s Jobs page. Click ‘Run Spider’, select the spider and then confirm.

Give the spider a couple of minutes to gather results. You can then view the job’s items and you should see that the category field has been filled by MonkeyLearn:

[Screenshot: scraped items in Scrapy Cloud with the category field populated by MonkeyLearn]

You can then download the results as a JSON or XML file and sort the products by the categories and probabilities returned by the addon.
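As a quick illustration, here is one way you might tally the most common top prediction in a downloaded JSON file (the file name is made up, and the snippet assumes each item’s category field holds a list of {'label': ..., 'probability': ...} predictions; check a few items first, since the structure stored by the addon may differ):

import json
from collections import Counter

with open('items.json') as f:
    items = json.load(f)

# count the most confident predicted label per item
counts = Counter(
    item['category'][0]['label']
    for item in items
    if item.get('category')
)
print(counts.most_common(10))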

Wrap Up

Using MonkeyLearn’s Retail Classifier with Scrapy on Scrapy Cloud allows you to immediately analyze your data for easier categorization and analysis. So the next time you’ve got a massive list of people to shop for, try using immediate textual analysis with web scraping to simplify the process.

We’ll continue the series with walkthroughs on using the MonkeyLearn addon for language detection, sentiment analysis, keyword extraction, or any custom classification or extraction that you may need personally or professionally. We’ll explore different uses and hopefully help you make the most of this new platform integration.

If you haven’t already, sign up for MonkeyLearn (for free) and sign up for the newly upgraded Scrapy Cloud (for free) and get to experimenting.