Handling JavaScript in Scrapy with Splash

A common problem when developing spiders is dealing with sites that rely heavily on JavaScript. Many pages won’t render properly unless their JavaScript is executed. In this post we’re going to show you how you can use Splash to handle JavaScript in your Scrapy projects.

What is Splash?

Splash is our in-house solution for JavaScript rendering, implemented in Python using Twisted and Qt. Splash is a lightweight web browser capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more. Best of all, it’s open source!

Setting Up Splash

The easiest way to set up Splash is through Docker:

$ docker run -p 8050:8050 scrapinghub/splash

Splash will now be running on localhost:8050. If you’re using boot2docker on OS X, it will instead be available at the IP address of the boot2docker virtual machine.

If you would like to install Splash without using Docker, please refer to the documentation.

Using Splash with Scrapy

Now that Splash is running, you can test it in your browser:

http://localhost:8050/

On the right enter a URL (e.g. http://amazon.com) and click ‘Render me!’. Splash will display a screenshot of the page as well as charts and a list of requests with their timings. At the bottom you should see a textbox containing the rendered HTML.

Manually

You can use scrapy.Request to send URLs to Splash’s render.json endpoint yourself:

# requires: import json, import scrapy, and
# from scrapy.http import Headers
req_url = "http://localhost:8050/render.json"
body = json.dumps({
    "url": url,   # the page you want Splash to render
    "har": 1,     # include an HAR record of the render
    "html": 0,    # set to 1 to also return the rendered HTML
})
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(req_url, self.parse_link, method='POST',
                     body=body, headers=headers)
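
The parse_link callback can then decode the JSON that render.json returns. A minimal sketch (the keys you read depend on the arguments you sent; with "har": 1 the render data is available under the ‘har’ key):

def parse_link(self, response):
    # render.json returns a JSON object whose keys depend on the
    # arguments sent above; "har": 1 adds an HAR record of the render
    data = json.loads(response.body)
    har = data.get('har')
    # ... pull whatever you need out of the HAR entries here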

If you’re using CrawlSpider, the easiest way is to add a process_links callback to your spider that replaces links with their Splash equivalents:

# requires: from urllib import urlencode   (urllib.parse.urlencode on Python 3)
def process_links(self, links):
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links
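
Here is a minimal sketch of how that callback is wired into a CrawlSpider via the process_links argument of a rule (the spider name, start URL and LinkExtractor settings are placeholders, and the import paths assume Scrapy 1.0+):

from urllib import urlencode  # urllib.parse.urlencode on Python 3

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SplashCrawlSpider(CrawlSpider):
    name = 'splash_crawl'
    start_urls = ['http://example.com']

    rules = (
        # every extracted link passes through process_links before being followed
        Rule(LinkExtractor(), callback='parse_item', process_links='process_links'),
    )

    def process_links(self, links):
        # rewrite each link so it is fetched through Splash's render.html endpoint
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
        return links

    def parse_item(self, response):
        # response.body is the HTML rendered by Splash
        pass

Note that only the extracted links are rewritten here; the initial requests for start_urls still go to the site directly.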

ScrapyJS (recommended)

The preferred way to integrate Splash with Scrapy is using ScrapyJS. You can install ScrapyJS using pip:

pip install scrapyjs

To use ScrapyJS in your project, you first need to enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

The middleware needs to take precedence over the HttpProxyMiddleware, which by default is at position 750, so we set the middleware to 725.

You will then need to set the SPLASH_URL setting in your project’s settings.py:

SPLASH_URL = 'http://localhost:8050/'

Don’t forget, if you are using boot2docker on OS X, you will need to set this to the IP address of the boot2docker virtual machine, e.g.:

SPLASH_URL = 'http://192.168.59.103:8050/'

Scrapy currently doesn’t provide a way to override request fingerprint calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'

If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapyjs.splash_request_fingerprint.
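
As a rough sketch of that pattern for the filesystem backend (essentially what scrapyjs.SplashAwareFSCacheStorage provides; the scrapy.extensions.httpcache import path assumes Scrapy 1.0+, older releases expose it under scrapy.contrib.httpcache):

import os

from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapyjs import splash_request_fingerprint


class MySplashAwareCacheStorage(FilesystemCacheStorage):
    def _get_request_path(self, spider, request):
        # use the Splash-aware fingerprint instead of the default request_fingerprint
        key = splash_request_fingerprint(request)
        return os.path.join(self.cachedir, spider.name, key[0:2], key)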

Now that the Splash middleware is enabled, you can begin rendering your requests with Splash using the ‘splash’ meta key.

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …

The ‘args’ dict contains the arguments to send to Splash; you can find a full list of available arguments in the HTTP API documentation. We don’t need to provide the url parameter in args, as it is pre-filled from the URL in the Request object. By default the endpoint is set to ‘render.json’, but here we have overridden it and set it to ‘render.html’ to get an HTML response.

Running Custom JavaScript

Sometimes when visiting a website, you may need to press a button or close a modal to view the page properly. Splash allows you to run your own JavaScript code within the context of the web page you’re requesting. There are several ways you can accomplish this:

Using the js_source Parameter

The js_source parameter can be used to send the JavaScript you want to execute. It’s recommended that you send this as a POST parameter rather than GET due to web servers and proxies enforcing limits on the size of GET parameters. See curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/json' \
    -d '{"js_source": "document.title=\"My Title\";", "url": "http://example.com"}' \
    'http://localhost:8050/render.html'

Sending JavaScript in the Body

You can also set the Content-Type header to ‘application/javascript’ and send the JavaScript code you would like to execute in the body of the request. See curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/javascript' \
    -d 'document.title="My Title";' \
    'http://localhost:8050/render.html?url=http://domain.com'

Splash Scripts (recommended)

Splash supports Lua scripts through its execute endpoint. This is the preferred way to execute JavaScript as you can preload libraries, choose when to execute the JavaScript, and retrieve the output.

Here’s an example script:

function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end

This will return a JSON object containing the title:

{
    "title": "Some title"
}

Every script requires a main function to act as the entry point. You can return a Lua table, which will be rendered as JSON; that’s what we have done here. The splash:go function tells Splash to visit the provided URL. The splash:evaljs function lets you execute JavaScript within the page context and get the result back; if you don’t need the result, use splash:runjs instead.
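
With ScrapyJS, a script like this can be run from a spider by pointing the ‘splash’ meta key at the execute endpoint and passing the script as the lua_source argument. Here is a minimal sketch (the spider name and start URL are placeholders; splash.args.url picks up the URL of the Scrapy request):

import json

import scrapy

LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
"""

class TitleSpider(scrapy.Spider):
    name = 'titles'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {'lua_source': LUA_SCRIPT},
                }
            })

    def parse(self, response):
        # the Lua table returned by main() arrives as JSON in the response body
        data = json.loads(response.body)
        self.log(data['title'])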

You can test your Splash scripts in your browser by visiting your Splash instance’s index page (e.g. http://localhost:8050/). It’s also possible to use Splash with IPython notebook as an interactive web-based development environment; see here for more details. Note: until Splash 1.5 is released you will need to use the master branch, so instead of docker pull scrapinghub/splash-jupyter (as stated in the docs) you should do docker pull scrapinghub/splash-jupyter:master.

A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:

function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end

Here we use splash:autoload to load the jQuery library from Google’s CDN. We then tell Splash to visit the website and run our custom JavaScript to click the button with jQuery’s click function. The rendered HTML is then returned to the browser.

You can find more info on running JavaScript with Splash in the docs, and for a more in-depth tutorial, check out the Splash Scripts Tutorial.

We hope this tutorial gave you a nice introduction to Splash, and please let us know if you have any questions or comments!

Scrapinghub crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?”

— Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web, that is, web content not indexed by traditional search engines such as Google, Bing and others. This content, according to current estimates, dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven very hard to track down in a systematic way.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

new search engine exposes the dark web

Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to be able to bring our web scraping expertise and open source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!

New Changes to Our Scrapy Cloud Platform

We are proud to announce some exciting changes we’ve introduced this week. These changes bring a much more pleasant user experience, and several new features including the addition of Portia to our platform!

Here are the highlights:

An Improved Look and Feel

We have introduced a number of improvements in the way our dashboard looks and feels. This includes a new layout based on Bootstrap 3, a more user-friendly color scheme, the ability to schedule jobs once per month, and a greatly improved spiders page with pagination and search.

Filtering and pagination of spiders:

spiders

A new user interface for adding periodic jobs:

periodic-jobs

A new user interface for scheduling spiders:

schedule-spider

And much more!

Your Organization on Scrapy Cloud

You are now able to create and manage organizations in Scrapy Cloud, add members if necessary and from there create new projects under your organization. This will make it much easier for you to manage your projects and keep them all in one place. Also, to make things simpler, our billing system will soon be managed at the organization level rather than per individual user.

You are now able to create projects within the context of an organization; however, other organization members will need to be invited in order to access them. A user can be invited to a project even if that user is not a member of the project’s organization.

Export Items as XML

Due to popular demand, we have added the ability to download your items as XML:

export-items-xml

Improvements to Periodic Jobs

We have made several improvements to the way periodic jobs are handled:

  • There is no delay when creating or editing a job. For example, if you create a new job at 11:59 to run at 12:00, it will do so without any trouble.
  • If there is any downtime, jobs that were intended to be scheduled during the downtime will be scheduled automatically once the service is restored.
  • You can now schedule jobs at specific dates in the month.

Portia Now Available in Dash

Last year, we open-sourced our annotation based scraping tool, Portia. We have since been working to integrate it into Dash, and it’s finally here!

We have added an ‘Open in Portia’ button to your projects’ Autoscraping page, so you can now open your Scrapy Cloud projects in Portia. We intend Portia to be a successor to our existing Autoscraping interface, and hope you find it a much more pleasant experience. No longer do you have to do a preliminary crawl to begin annotating; you can jump straight in!

Check out this demo of how you can create a spider using Portia and Dash!

Enjoy the new features, and of course if you have any feedback please don’t hesitate to post on our support forum!

Introducing ScrapyRT: An API for Scrapy spiders

We’re proud to announce our new open source project, ScrapyRT! ScrapyRT, short for Scrapy Real Time, allows you to extract data from a single web page via an API using your existing Scrapy spiders.

Why did we start this project?

We needed to be able to retrieve the latest data for a previously scraped page, on demand. ScrapyRT made this easy by allowing us to reuse our spider logic to extract data from a single page, rather than running the whole crawl again.

How does ScrapyRT work?

ScrapyRT runs as a web service and retrieving data is as simple as making a request with the URL you want to extract data from and the name of the spider you would like to use.

Let’s say you were running ScrapyRT on localhost, you could make a request like this:

http://localhost:9080/crawl.json?spider_name=foo&url=http://example.com/product/1

ScrapyRT will schedule a request in Scrapy for the URL specified and use the ‘foo’ spider’s parse method as a callback. The data extracted from the page will be serialized into JSON and returned in the response body. If the spider specified doesn’t exist, a 404 will be returned. The majority of Scrapy spiders will be compatible without any additional programming necessary.
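
For example, you could call the service from Python with the requests library; a minimal sketch (the spider name and URL come from the example above, and scraped items are returned under the ‘items’ key of the JSON response):

import requests

resp = requests.get('http://localhost:9080/crawl.json', params={
    'spider_name': 'foo',
    'url': 'http://example.com/product/1',
})
data = resp.json()
for item in data.get('items', []):
    print(item)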

How do I use ScrapyRT in my Scrapy project?

 > git clone https://github.com/scrapinghub/scrapyrt.git
 > cd scrapyrt
 > pip install -r requirements.txt
 > python setup.py install
 > cd ~/your-scrapy-project
 > scrapyrt

ScrapyRT will be running on port 9080, and you can schedule your spiders per the example shown earlier.

We hope you find ScrapyRT useful and look forward to hearing your feedback!

Comment here or discuss on HackerNews.

Looking back at 2014

One year ago we were looking back at the great 2013 we had and realized we would have quite a big challenge in front of us to match that growth. So here are some highlights of the things we’ve been up to this year; let’s see how well we did!

2014 was quite the travelling year for Scrapinghub! We sponsored both the US PyCon in Montreal and the Spanish PyCon in Zaragoza. We’ve also been to Codemotion in Madrid and PythonBrasil. We hope to hit the road during 2015 too, bringing some spider magic to even more cities!

PyCon US Scrapinghub booth

During this year we’ve also continued to work on both new and ongoing Professional Services projects for clients all around the world, and we’re glad to see that our efforts are paying off: we have increased our customer base while maintaining the same quality standards we’ve had since we were just a few people back in 2010.

Our platform has grown too! There’s been a steady effort to get it to the point where we’ve been able to accommodate the ever-increasing volume of scraping we and our customers have been doing. In 2014 alone Scrapy Cloud has scraped and stored data from over 10 billion pages (more than 5 times the amount we did in 2013!) and an extra 5 billion have passed through Crawlera.

We are excited to see our revenue tripling from last year, and it makes us very proud to have grown organically so far. We can only imagine what we could do with some funding, but we won’t do anything that could jeopardize the way we run the company, which has proven very successful.

Open Source

On the open source front, we’ve been spending a lot of time improving our annotation-based scraping tool Portia. Our main focus has been on integrating it into our Scrapy Cloud platform, and soon Scrapinghub users will be able to open their Autoscraping projects in Portia. You can see an example of the current Dash integration here. This will eventually be the successor to our Autoscraping tool. If you just cannot wait to try Portia, you’re in luck: we open sourced it some time ago (it was a trending Python project on GitHub for a month!), so you can try it locally if you wish!

We also have a number of new and interesting open source projects: Dateparser, Crawl Frontier and Splash.

  • Dateparser is a parser for human-readable dates that detects and supports multiple languages. It can even read text such as “2 weeks ago” and determine the date relative to the current time! The project already has over 300 stars and 17 forks on GitHub.

  • Crawl Frontier is a framework for building the frontier part of your web crawler, that is, the part of a crawling system that decides the logic and policies to follow when a crawler visits websites: which pages should be crawled next, their priority and ordering, how often pages are revisited, and so on. Although originally designed for use with Scrapy, Crawl Frontier can now be used with any other crawling framework or project you wish!

  • Splash is a JavaScript rendering service with an HTTP API. It’s currently our in-house solution for crawling JavaScript-powered websites, and we’re hopeful that usage will keep growing outside of Scrapinghub.

Of course, all our other existing open source projects such as Scrapy, webstruct and others have seen major improvements, something that will keep on going during 2015 and beyond.

We’re glad to be able to share all these experiences, numbers and new projects with you, but we know very well that behind every single one of them stands the hard work of our team; saying that this wouldn’t have been possible without them is an understatement. And the team has grown: 2014 marks the second year in a row that our team has doubled, and we’re now 85 Scrapinghubbers (we’ll be over 90 by the end of January)!

Green markers represent 2014 new hires. Click for an interactive map of our full team.

So here’s a sincere thank you from the Scrapinghub team to all of our customers and supporters. Thanks for an amazing 2014 and watch out 2015, here we come! Happy New Year!

XPath tips from the web scraping trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to specify document locations more flexibly than CSS selectors. In case you’re looking for a tutorial, here is an XPath tutorial with nice examples.

In this post, we’ll show you some tips we found valuable when using XPath in the trenches, using the Scrapy Selector API for our examples.

Avoid using contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text elements (a node-set). When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> xp = lambda x: sel.xpath(x).extract() # let's type this only once
>>> xp('//a//text()') # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> xp('string(//a//text())')  # convert it to a string
[u'Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

So, in general:

GOOD:

>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:

>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:

>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:

>>> xp("substring-after(//a//text(), 'Next ')")
[u'']

You can read more detailed explanations about string values of nodes and node-sets in the XPath spec.

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel=Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]") # get all first LI elements under whatever it is its parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]") # get the first LI element in the whole document
[u'<li>1</li>']
>>> xp("//ul/li[1]")  # get all first LI elements under an UL parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document
[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents.

(//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.

When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
>>> xp = lambda x: sel.xpath(x).extract()

BAD: doesn’t work because there are multiple classes in the attribute

>>> xp("//*[@class='content']")
[]

BAD: gets more than we want

>>> xp("//*[contains(@class,'content')]")
[u'<p class="content-author">Someone</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Read more about what you can do with Scrapy’s Selectors here.

Learn to use all the different axes

It is handy to know how to use the XPath axes; you can follow the examples given in the tutorial to quickly review this.

In particular, you should note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and for ancestor and parent.
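
For example, following-sibling selects only the siblings that come after the context node, while following selects everything that comes after it in document order. A quick illustration using the same Selector setup as the examples above:

>>> sel = Selector(text='<div><p>first</p><span>second <em>nested</em></span></div>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp('//p/following-sibling::*')  # only elements that are siblings of the <p>
[u'<span>second <em>nested</em></span>']
>>> xp('//p/following::*')  # every element after the <p>, at any depth
[u'<span>second <em>nested</em></span>', u'<em>nested</em>']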

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of script and style tags and also skips whitespace-only text nodes. Source: http://stackoverflow.com/a/19350897/2572383
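
A quick check with the Selector API (same setup as the examples above):

>>> sel = Selector(text='<div><style>p {color: red}</style><p>Hello</p> <script>var x = 1;</script></div>')
>>> sel.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
[u'Hello']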

Do you have another XPath tip?

Please, leave us a comment with your tips or questions. :)

And for everybody who contributed tips and reviewed this article, a big thank you!

Introducing Data Reviews

One of the most time-consuming parts of building a spider is reviewing the scraped data and making sure it conforms to the requirements and expectations of your client or team. In many cases, depending on how well the requirements are written, this ends up taking more time than writing the spider code itself. To make the process more efficient, we have introduced the ability to comment on data directly in Dash (the Scrapinghub UI), right next to the data, instead of relying on other channels (like issue trackers, emails or chat).

data_reviews

With this new feature you can discuss problems with data right where they appear without having to copy/paste data around, and have a conversation with your client or team until the issue is resolved. This reduces the time spent on data QA, making the whole process more productive and rewarding.

So go ahead, start adding comments to your data (you can comment whole items or individual fields) and let the conversation flow around data! You can mark resolved issues by archiving comments, and you will see jobs with unresolved (unarchived) comments directly on the Jobs Dashboard.

Last, but not least, you have the Data Reviews API to insert comments programmatically. This is useful, for example, to report problems in post-processing scripts that analyze the scraped data.

Happy scraping!
