Author: Elias Dorneles

How to Run Python Scripts in Scrapy Cloud

You can deploy, run, and maintain control over your Scrapy spiders in Scrapy Cloud, our production environment. Keeping control means you need to be able to know what’s going on with your spiders and to find out early if they are in trouble.

[Image: No one wants to lose control of a swarm of spiders. No one. (Reuters/Daniel Munoz)]

This is one of the reasons why being able to run any Python script in Scrapy Cloud is a nice feature. You can customize to your heart’s content and automate any crawling-related tasks that you may need (like monitoring your spiders’ execution). Plus, there’s no need to scatter your workflow since you can run any Python code in the same place that you run your spiders and store your data.

While this is just the tip of the iceberg, I’ll demonstrate how to use custom Python scripts to notify you about jobs with errors. If this tutorial sparks some creative applications, let me know in the comments below.

Setting up the Project

We’ll start off with a regular Scrapy project that includes a Python script for building a summary of jobs with errors that have finished in the last 24 hours. The results will be emailed to you using AWS Simple Email Service.

You can check out the sample project code here.

Note: To use this script you will have to modify the settings at the beginning with your own AWS keys so that the email function works.

In addition to the traditional Scrapy project structure, it also contains a check_jobs.py script in the bin folder. This is the script responsible for building and sending the report.

.
├── bin
│   └── check_jobs.py
├── sc_scripts_demo
│   ├── __init__.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── bad_spider.py
│       └── good_spider.py
├── scrapy.cfg
└── setup.py
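
As a rough sketch of the idea behind check_jobs.py (not the project’s actual code), the report could be built with the python-scrapinghub client and sent through boto3’s SES client; the 'errors' and 'finished_time' fields below are assumptions about the job summaries and may need adjusting:

# Rough sketch only, NOT the sample project's check_jobs.py.
# Assumes python-scrapinghub >= 2.0 and boto3; the 'errors' and
# 'finished_time' summary fields are assumptions.
import time

import boto3
from scrapinghub import ScrapinghubClient


def jobs_with_errors(api_key, project_id):
    project = ScrapinghubClient(api_key).get_project(project_id)
    since_ms = (time.time() - 24 * 60 * 60) * 1000  # last 24 hours, in epoch millis
    for job in project.jobs.iter(state='finished'):
        if job.get('finished_time', 0) >= since_ms and job.get('errors', 0) > 0:
            yield '{} finished with {} errors'.format(job['key'], job['errors'])


def send_report(email, lines):
    ses = boto3.client('ses')  # needs AWS credentials with SES access
    ses.send_email(
        Source=email,
        Destination={'ToAddresses': [email]},
        Message={'Subject': {'Data': 'Scrapy Cloud jobs with errors'},
                 'Body': {'Text': {'Data': '\n'.join(lines)}}},
    )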

The deploy is done via shub, just like with your regular projects.

But first, you have to make sure that your project’s setup.py file lists the script that you’ll want to run (see the scripts entry in the snippet below):

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    scripts      = ['bin/check_jobs.py'],
    entry_points = {'scrapy': ['settings = sc_scripts_demo.settings']},
)

Note: If there’s no setup.py file in your project root yet, you can run shub deploy and the deploy process will generate it for you.

Once you have included the scripts parameter in the setup.py file, you can deploy your spider to Scrapy Cloud with this command:

$ shub deploy

Running the Script on Scrapy Cloud

Running a Python script is very much like running a Scrapy spider in Scrapy Cloud. All you need to do is set the job type as “Scripts” and then select the script you want to execute.

The check_jobs.py script expects three arguments: your Scrapinghub API key, an email address to send the report to, and the project ID (the numeric value in the project URL).

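Inside the script, the three arguments can be read with the standard argparse module (a minimal sketch; the actual check_jobs.py may name and order its arguments differently):

# Minimal sketch of the argument handling; the real script may differ.
import argparse

parser = argparse.ArgumentParser(description='Report Scrapy Cloud jobs with errors.')
parser.add_argument('api_key', help='your Scrapinghub API key')
parser.add_argument('email', help='address to send the report to')
parser.add_argument('project_id', type=int, help='numeric project ID from the project URL')
args = parser.parse_args()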

Scheduling Periodic Execution on Scrapy Cloud

Since this script is meant to be executed once a day, you need to schedule it under Periodic Jobs.

Select the script to run, configure when you want it to run, and specify any arguments that may be necessary.

After scheduling the periodic job (I’ve set it up to run once a day at 7 AM UTC), it will show up in your project’s list of Periodic Jobs.

Note: You can run the script immediately by clicking the play button as well.

And you’re done! The script will run every day at 7 AM UTC and send a report of jobs with errors (if any) right into your email inbox.

Helpful Tips

Heads up, here’s what else you should know about Python scripts in Scrapy Cloud:

  • The output of print statements shows up in the log with level INFO, prefixed with [stdout]. It’s generally better to use the standard logging module to log messages with proper levels (e.g. to report errors or warnings); see the sketch after this list.
  • After about an hour of inactivity, jobs are killed. If you plan to leave a script running for hours, make sure that it logs something in the output every few minutes to avoid this grisly fate.
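
For instance, a script can use the logging module for proper levels and, when it runs for a long time, log a heartbeat every few minutes to avoid the inactivity timeout. A small self-contained example:

# Using the standard logging module in a Scrapy Cloud script.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('check_jobs')

logger.info('starting the check')      # appears in the job log with level INFO
logger.warning('something looks off')  # a real WARNING, unlike a print()

# For a long-running task, keep logging so the job is not killed for
# inactivity (shortened here to a small loop).
for chunk in range(3):
    time.sleep(60)
    logger.info('processed chunk %d, still alive', chunk)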

Wrap Up

While this specific example demonstrated how to automate the reporting of jobs with errors, keep in mind that you can use any Python script with Scrapy Cloud. This is helpful for customizing your crawls, monitoring jobs, and also handling post-processing tasks.

Read more about this and other features in the Scrapy Cloud online documentation.

Scrapy Cloud is forever free, so no need to worry about a bait-and-switch. Try it out and let me know what Python scripts you’re using in the comments below.

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food since Scrapy is our go-to workhorse on a daily basis. However, there are certain situations where Scrapy can be overkill and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors. It powers the scraping API of the Scrapy framework.

[Image: Not to be confused with Parseltongue/Parselmouth]

We extracted Parsel from Scrapy during EuroPython 2015 as part of porting Scrapy to Python 3. As a library, it’s lighter than Scrapy (it relies on lxml and cssselect) and also more flexible, allowing you to use it within any Python program.

Using Parsel

Install Parsel using pip:

pip install parsel

And here’s how you use it. Say you have this HTML snippet in a variable:

>>> html = u'''
... <ul>
...     <li><a href="http://blog.scrapinghub.com">Blog</a></li>
...     <li><a href="https://www.scrapinghub.com">Scrapinghub</a></li>
...     <li class="external"><a href="http://www.scrapy.org">Scrapy</a></li>
... </ul>
... '''

You then import the Parsel library, load it into a Parsel Selector and extract links with an XPath expression:

>>> import parsel
>>> sel = parsel.Selector(html)
>>> sel.xpath("//a/@href").extract()
[u'http://blog.scrapinghub.com', u'https://www.scrapinghub.com', u'http://www.scrapy.org']

Note: Parsel works in both Python 3 and Python 2. If you’re using Python 2, remember to pass the HTML as a unicode object.

Sweet Parsel Features

One of the nicest features of Parsel is the ability to chain selectors. This allows you to chain CSS and XPath selectors however you wish, such as in this example:

>>> sel.css('li.external').xpath('./a/@href').extract()
[u'http://www.scrapy.org']

You can also iterate through the results of the .css() and .xpath() methods since each element will be another selector:

>>> for li in sel.css('ul li'):
...     print(li.xpath('./a/@href').extract_first())
...
http://blog.scrapinghub.com
https://www.scrapinghub.com
http://www.scrapy.org

You can find more examples of this in the documentation.

When to use Parsel

The beauty of Parsel is in its wide applicability. It is useful for a range of situations including:

  • Processing XML/HTML data in an IPython notebook
  • Writing end-to-end tests for your website or app
  • Simple web scraping projects with the Python Requests library (see the example after this list)
  • Simple automation tasks at the command-line
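
For example, a tiny Requests + Parsel scraper fits in a few lines (the URL and CSS selector below are only illustrative):

# A minimal Requests + Parsel scraper; the URL and selector are illustrative.
import parsel
import requests

response = requests.get('http://blog.scrapinghub.com')
selector = parsel.Selector(text=response.text)
for title in selector.css('h2 a::text').extract():
    print(title)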

And now, you can also run Parsel with the command-line tool for simple extraction tasks in your terminal. This new development is thanks to our very own Rolando who created parsel-cli.

Install parsel-cli with pip install parsel-cli and play around using the examples below (you need to have curl installed).

The following command will download and extract the list of Academy Award-winning films from Wikipedia:

curl -s https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films |\
    parsel-cli 'table.wikitable tr td i a::text'

You can also get the current top 5 news items from Hacker News using:

curl -s https://news.ycombinator.com |\
    parsel-cli 'a.storylink::attr(href)' | head -n 5

And how about obtaining a list of the latest YouTube videos from a specific channel?

curl -s https://www.youtube.com/user/crashcourse/videos |\
    parsel-cli 'h3 a::attr(href), h3 a::text' |\
    paste -s -d' \n' - | sed 's|^|http://youtube.com|'

Wrap Up

I hope you enjoyed this little tour of Parsel, and I look forward to seeing how these examples spark your imagination when you’re finding solutions for your HTML parsing needs.

The next time you find yourself wanting to extract data from HTML/XML and don’t need Scrapy and its crawling capabilities, you know what to do: just Parsel it!

Feel free to reach out to us on Twitter and let us know how you use Parsel in your projects.

Skinfer: A Tool for Inferring JSON Schemas

Imagine that you have a lot of samples of a certain kind of data in JSON format. Maybe you want to get a better feel for it: which fields appear in all records, which appear only in some, and what their types are. In other words, you want to know the schema of the data that you have.

We’d like to present skinfer, a tool we built for inferring a schema from samples in JSON format. Skinfer will take a list of JSON samples and give you one JSON schema that describes all of the samples. (For more information about JSON Schema, we recommend the online book Understanding JSON Schema.)

Install skinfer with pip install skinfer, then generate a schema by running the schema_inferer command, passing it a list of JSON samples (either a JSON Lines file with all the samples or a list of JSON files given on the command line).

Here is an example of usage with a simple input:

$ cat samples.json
{"name": "Claudio", "age": 29}
{"name": "Roberto", "surname": "Gomez", "age": 72}
$ schema_inferer --jsonlines samples.json
{
    "$schema": "http://json-schema.org/draft-04/schema",
    "required": [
        "age",
        "name"
    ],
    "type": "object",
    "properties": {
        "age": {
            "type": "number"
        },
        "surname": {
            "type": "string"
        },
        "name": {
            "type": "string"
        }
    }
}

Once you’ve generated a schema for your data, you can:

  1. Run it against other samples to see if they share the same schema (see the snippet after this list)
  2. Share it with anyone who wants to know the structure of your data
  3. Complement it manually, adding descriptions for the fields
  4. Use a tool like docson to generate a nice page documenting the schema of your data (see example here)
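
For the first point, the jsonschema package makes the check straightforward (a small sketch, assuming the generated schema was saved to a file named schema.json):

# Validate a new sample against the generated schema.
# Assumes the jsonschema package and a previously saved schema.json.
import json

import jsonschema

with open('schema.json') as f:
    schema = json.load(f)

sample = {"name": "Ana", "age": 41}
jsonschema.validate(sample, schema)  # raises ValidationError if it doesn't conform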

Another interesting feature of Skinfer is that it can also merge a list of schemas, giving you a new schema that describes samples from all the previously given schemas. For this, use the json_schema_merger command, passing it a list of schemas.

This is cool because you can continuously keep updating a schema even after you’ve already generated it: you can just merge it with the one you already have.

Feel free to dive into the code, explore the docs and please file any issues that you have on GitHub. 🙂

XPath Tips from the Web Scraping Trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors. In case you’re looking for a tutorial, here is an XPath tutorial with nice examples.

In this post, we’ll show you some tips we found valuable when using XPath in the trenches, using Scrapy’s Selector API for our examples.

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.

Here is why: the expression .//text() yields a collection of text elements (a node-set). When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> xp = lambda x: sel.xpath(x).extract() # let's type this only once
>>> xp('//a//text()') # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> xp('string(//a//text())')  # convert it to a string
[u'Click here to go to the ']

Converting a node to a string, however, concatenates its own text with the text of all its descendants:

>>> xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

So, in general:

GOOD:

>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:

>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:

>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:

>>> xp("substring-after(//a//text(), 'Next ')")
[u'']

You can read more detailed explanations about string values of nodes and node-sets in the XPath spec.

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel = Selector(text="""
... <ul class="list">
...     <li>1</li>
...     <li>2</li>
...     <li>3</li>
... </ul>
... <ul class="list">
...     <li>4</li>
...     <li>5</li>
...     <li>6</li>
... </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]") # get all first LI elements under their respective parents
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]") # get the first LI element in the whole document
[u'<li>1</li>']
>>> xp("//ul/li[1]")  # get all first LI elements under an UL parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document
[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents.

(//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.
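
A quick check with a fresh Selector (the HTML is just an illustration):

>>> sel = Selector(text='<p><a href="#top">a1</a><a href="#bottom">a2</a></p><p><a href="#top">a3</a></p>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//a[starts-with(@href, '#')][1]")  # first local anchor under each parent
[u'<a href="#top">a1</a>', u'<a href="#top">a3</a>']
>>> xp("(//a[starts-with(@href, '#')])[1]")  # the first local anchor in the whole document
[u'<a href="#top">a1</a>']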

When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
>>> xp = lambda x: sel.xpath(x).extract()

BAD: doesn’t work because there are multiple classes in the attribute

>>> xp("//*[@class='content']")
[]

BAD: gets more than we want

>>> xp("//*[contains(@class,'content')]")
[u'<p class="content-author">Someone</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Read more about what you can do with Scrapy’s Selectors here.

Learn to use all the different axes

It is handy to know how to use XPath axes; you can follow the examples given in the tutorial to quickly review them.

In particular, note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and for ancestor and parent.
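
For example, starting from the first p below, following reaches into the descendants of later siblings, while following-sibling stays on the same level:

>>> sel = Selector(text='<div><p>1</p><p>2<span>x</span></p><p>3</p></div>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp('//p[1]/following::*')  # everything after the first p, at any depth
[u'<p>2<span>x</span></p>', u'<span>x</span>', u'<p>3</p>']
>>> xp('//p[1]/following-sibling::*')  # only the later siblings of the first p
[u'<p>2<span>x</span></p>', u'<p>3</p>']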

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of script and style tags and also skips whitespace-only text nodes. Source: http://stackoverflow.com/a/19350897/2572383
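
A quick demonstration with Scrapy’s Selector (the HTML is just an illustration):

>>> sel = Selector(text='<script>var x = 1;</script><style>p {}</style><p>Hello <strong>world</strong>!</p>')
>>> sel.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
[u'Hello ', u'world', u'!']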

Do you have another XPath tip?

Please, leave us a comment with your tips or questions. 🙂

And for everybody who contributed tips and reviewed this article, a big thank you!