
Scrapy on the Road to Python 3 Support

Scrapy is one of the few popular Python packages (almost 10k GitHub stars) that's not yet compatible with Python 3. The team and community around it are working to make it compatible as soon as possible. Here's an overview of what has been happening so far.

You’re invited to read along and participate in the porting process from Scrapy’s Github repository.

First off you may be wondering: “why is Scrapy not in Python 3 yet?”

If you asked around you likely heard an answer like “It’s based on Twisted, and Twisted is not fully ported yet, you know?”. Many blame Twisted, but other things are actually holding back the Scrapy Python 3 port.

When it comes to Twisted, its most important parts are already ported. But if we want Scrapy spiders to download from HTTP URLs, we really need a fix or workaround for Twisted's HTTP agent, since it doesn't work on Python 3.

The Scrapy team started to make moves towards Python 3 support two years ago by porting some of Scrapy's dependencies. For about a year now, a subset of the Scrapy tests has been executed under Python 3 on each commit. Apart from Twisted, one bottleneck that blocked progress for a while was that most of Scrapy requires Request or Response objects to work – this was recently resolved, as described below.

Scrapy core devs meet and prepare a sprint

During EuroPython 2015, from 20 to 28 July, several Scrapy core developers gathered in Bilbao and were able to make progress on the porting process – meeting in person for the first time after several years of working together.

There was a Scrapy sprint scheduled for the weekend. Mikhail Korobov, Daniel Graña and Elias Dorneles teamed up to prepare Scrapy for it by porting Request and Response in advance. This way, it would be easier for other people to join and contribute during the weekend sprint.

In the end, there wasn't enough time to fully port Request and Response before the weekend sprint. Some of the issues they faced were:

  • Should HTTP headers be bytes or unicode? Is the answer different for keys and values? Some header values are usually UTF-8 (e.g. cookies); HTTP Basic Auth headers are usually latin1; and for other headers there is no single universal encoding either. Generally, bytes for HTTP headers make sense, but there is a gotcha: if you're porting an existing project from Python 2.x to 3.x, code which was working before may silently start producing incorrect results. For example, say a response has an 'application/json' content type. If header values are bytes, in Python 2.x `content_type == 'application/json'` returns True, but in Python 3.x it returns False, because you're then comparing a unicode literal with bytes (see the short sketch after this list).
  • How should URLs be percent-escaped and unescaped properly? Proper escaping depends on the web page encoding and on which part of the URL is being escaped, and this matters whenever a page author uses non-ASCII URLs. After some experiments we found that browsers do crazy things here: the URL path is encoded to UTF-8 before escaping, but the query string is encoded to the web page encoding before escaping. You can't verify this from browser UIs: what browsers send to servers is consistent, namely a UTF-8 path and the page encoding for the query string (in both Firefox and Chrome, on OS X and Linux), but what they display to users depends on the browser and operating system.
  • URL-related functions are very different in Python 2.x and 3.x: in Python 2.x they only accept bytes, while in Python 3.x they only accept unicode. Combined with the encoding craziness above, this makes porting the code harder still.
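To make the header gotcha and the escaping behaviour concrete, here is a small illustrative sketch (Python 3 syntax; this is not Scrapy code, just the bare comparisons and standard-library calls, with latin1 assumed as the page encoding):

    # The bytes-vs-unicode gotcha for header values:
    headers = {b'Content-Type': b'application/json'}
    content_type = headers[b'Content-Type']
    content_type == u'application/json'   # True on Python 2, silently False on Python 3
    content_type == b'application/json'   # True on both: compare bytes with bytes

    # What browsers do when escaping non-ASCII URLs:
    from urllib.parse import quote, urlencode
    quote(u'/caf\xe9'.encode('utf-8'))              # path uses UTF-8:          '/caf%C3%A9'
    urlencode({'q': u'caf\xe9'.encode('latin1')})   # query uses page encoding: 'q=caf%E9'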

The EuroPython sprint weekend arrives

To unblock further porting, the team decided to use bytes for HTTP headers and to temporarily disable some of the tests for non-ASCII URL handling, thus eliminating the two bottlenecks that were holding things back.

The sprint itself was quiet but productive. Some highlights on what developers did there:

  • They unblocked the main problem with the port, the handling of URLs and headers, and then migrated the Request and Response classes so it is finally possible to divide the work and port each component independently;
  • Elias split the Scrapy Selectors into a separate library (called Parsel), which has reached a stable point and which Scrapy now depends on (work is underway on its documentation ahead of an official release); a minimal usage example follows this list;
  • Mikhail and Daniel ported several Scrapy modules to make further contributions easier;
  • A mystery contributor came, silently ported a Scrapy module, and left without a trace (please reach out!);
  • Two more people joined who were completely new to Scrapy and had fun setting up their first project.
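For a quick taste of Parsel on its own, outside of Scrapy, here is a minimal sketch (the selector API mirrors the one Scrapy users already know):

    from parsel import Selector

    sel = Selector(text=u'<html><body><h1>Hello Parsel</h1></body></html>')
    print(sel.css('h1::text').extract())        # [u'Hello Parsel']
    print(sel.xpath('//h1/text()').extract())   # [u'Hello Parsel']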

The road is long, but the path is clear!

In the end the plan worked as expected. After porting Request and Response objects and making some hard decisions, the road to contributions is open.

In the weeks that followed the sprint, developers continued to work on the port. They also received important contributions from the community. As you can see here, Scrapy has already had several pull requests merged (for example from @GregoryVigoTorres and @nyov).

Before the sprint there were ~250 tests passing in Python 3. The number is now over 600. These recent advances helped increase our test coverage under Python 3 from 19% to 54%.

Our next major goal is to port the Twisted HTTP client so spiders can actually download something from remote sites.

Python 3 support is still some way off, but when it comes to the port Scrapy is in much better shape now. Join us and contribute to porting Scrapy by following these guidelines. We have added a badge to GitHub to show the progress of Python 3 support. The percentage is calculated from the tests that pass on Python 3 versus the total number of tests available; currently, 633 of 1153 tests pass on Python 3.


Thanks for reading!

Introducing Javascript support for Portia

Today we released the latest version of Portia bringing with it the ability to crawl pages that require JavaScript. To celebrate this release we are making Splash available as a free trial to all Portia users so you can try it out with your projects.

How to use it

Open a project within Portia. If you don't already have an instance of Portia, you can get started by downloading Portia from GitHub or by signing up for our hosted instance.

If you would like to crawl using JavaScript in your project you can do so by:

  1. Navigating to your spider in Portia.
  2. Opening the Crawling tab.
  3. Clicking the Enable JS checkbox.

By ticking this checkbox you will be able to annotate pages that require JavaScript in order to be crawled correctly.

After you enable JavaScript you will be given the ability to limit which pages JavaScript is enabled for. You can choose for JavaScript to only be run on certain pages, in the same way that you can limit which pages are followed by Portia.

The reason you may want to limit which pages load JavaScript is that loading JavaScript can increase the amount of time required to run your spider.

Once you have made these changes you can publish your spider and it will be able to crawl pages that require JavaScript.

How do I know if I need it?

When you are creating a spider you can use the show followed links checkbox to decide whether you need JavaScript enabled on a page or not. By showing followed links you can see which links are followed only if JavaScript is enabled and which links are always followed.

Links followed by Food.com. Green links will always be followed. Red links will never be followed. Blue links will only be followed if JavaScript is enabled.

To decide if you need JavaScript enabled for extracting data you can try to create a sample for the page that you wish to extract data from. If JavaScript is not enabled for this page and you can see the data you wish to extract then you don’t need to change anything; your spider works! If you don’t see the data you want then you can:

  • Enable JavaScript for the spider as described above, or
  • Add a matching pattern for this URL to the enable JavaScript patterns.

Using your own Splash

If you already have your own dedicated Splash instance you can enable it for your project by adding its URL and your API key to the Portia addon in your project settings. If you would like to request your own Splash instance, please visit your organization's dashboard page. If you would like to learn more about Splash, you can do so here.

This should be all you need to get started with JavaScript in Portia. If you need anything else, help@scrapinghub.com is here to assist you. Happy Scraping!

Distributed Frontera: Web Crawling at Scale

Over the last half year we have been working on a distributed version of our frontier framework, Frontera. This work was partially funded by DARPA and is going to be included in the DARPA Open Catalog on Friday 7th.

The project came about when a client of ours expressed interest in building a crawler that’s able to identify frequently changing hub pages.

In case you aren't familiar, hub pages are pages which contain a large number of outgoing links to authority sites, and authority sites are sites with a high authority score, which is determined by the relevance of the page. [2][3]

The client wanted us to build a crawler that would crawl around 1 billion pages per week, batch process them, output the pages that are hubs changing with some predefined periodicity, and keep the system running to pick up the latest changes.

Single-thread mode

We began by building a single-thread crawl frontier framework, allowing us to store and generate batches of documents to crawl while working seamlessly with the Scrapy ecosystem.

This is basically the original Frontera, intended to solve:

  • Cases when one needs to isolate URL ordering/queueing from the spider, e.g. a distributed infrastructure or the need for remote management of ordering/queueing.
  • Cases when URL metadata storage is needed, e.g. to display its contents somewhere.
  • Cases when one needs advanced URL ordering logic. If a website is big and it’s expensive to crawl the whole website, Frontera can be used for crawling the most important documents.

Single-thread Frontera has two storage backends: memory and SQLAlchemy. You can use any RDBMS of your choice, such as SQLite, MySQL or Postgres. If you wish to use your own crawling strategy, it has to be programmed into the backend and spider code.
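As a rough illustration, a single-thread setup boils down to a small settings module along these lines (a sketch based on the Frontera quickstart; exact module paths and setting names may differ between versions, so check the documentation):

    # frontera_settings.py -- a single-thread frontier with a SQLAlchemy backend
    BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
    SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'  # any RDBMS URI works here
    MAX_REQUESTS = 2000        # stop after this many requests (0 means no limit)
    MAX_NEXT_REQUESTS = 256    # size of the batches handed out to the spider

    # In the Scrapy project's settings.py the frontier is plugged in as the scheduler:
    # SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    # FRONTERA_SETTINGS = 'myproject.frontera_settings'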

You can find the repository here.

Distributed mode

Later, we began investigating how to scale the existing solution and make it work with an arbitrary number of documents. Considering our access pattern, we were interested in a key-value store that could afford random reads/writes as well as efficient batch processing, while remaining scalable.

HBase [4] turned out to be a good choice for this. A good example is Mail.ru’s content system built on HBase. [5]

The next choice was the communication layer: talking to HBase directly from Python isn't reliable, so we decided to use Kafka instead. As a bonus we got partitioning by domain name, which makes it easier to ensure each domain is downloaded by at most one spider, as well as the ability to replay the log out of the box, which can be useful when changing the crawling strategy on the fly.

This resulted in the architecture outlined below:

Distributed Frontera architecture diagram

Let's start with the spiders. The seed URLs defined by the user inside the spiders are propagated to the strategy workers and DB workers by means of a Kafka topic named 'Spider Log'. Strategy workers decide which pages to crawl using HBase's state cache, assign a score to each page and send the results to the 'Scoring Log' topic.

The DB worker stores all kinds of metadata, including content and scores. It checks the spiders' consumer offsets, generates new batches when needed and sends them to the 'New Batches' topic. Spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the 'Spider Log' topic, where they are stored and scored. That's it: we have a closed circle.
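To illustrate the flow (and only to illustrate it: this is not Distributed Frontera's actual code, and the topic names and message format below are made up), a strategy worker's main loop conceptually looks like this:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer('spider-log', bootstrap_servers='localhost:9092')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    def score(url):
        # Placeholder scoring model: prefer shorter URLs.
        return 1.0 / (1 + len(url))

    for message in consumer:
        event = json.loads(message.value.decode('utf-8'))
        for link in event.get('links', []):
            # Decide whether the link is worth crawling, assign a score and
            # publish the decision for the DB worker to pick up.
            decision = {'url': link, 'score': score(link)}
            producer.send('scoring-log', json.dumps(decision).encode('utf-8'))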

The main advantage of this design is real-time operation: the crawling strategy can be changed without having to stop the crawl. It's worth mentioning that the crawling strategy can be implemented as a separate module, containing the logic for checking the stopping condition, URL ordering, and the scoring model.

Distributed Frontera is polite to web hosts by design, and each host is downloaded by no more than one spider process. This is achieved by Kafka topic partitioning. All Distributed Frontera components are written in Python, which is much easier to customize than C++ or Java, the most common languages for large-scale web crawlers. [6]
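The politeness guarantee rests on a simple idea: derive the Kafka partition from the host name, so every request for a given host lands in the same partition and therefore with the same spider. A minimal sketch of such a partitioner (not the project's actual implementation):

    import hashlib
    from urllib.parse import urlparse

    def partition_for(url, num_partitions):
        """Map a URL to a partition by its host, so that one host is always
        handled by the same spider process."""
        host = urlparse(url).netloc.lower()
        digest = hashlib.sha1(host.encode('utf-8')).digest()
        return int.from_bytes(digest[:4], 'big') % num_partitions

    # Two URLs from the same host always map to the same partition:
    assert partition_for('http://example.com/a', 12) == partition_for('http://example.com/b', 12)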

Here are the use-cases for distributed version:

  • You have a set of URLs and need to revisit them, e.g. to track changes.
  • Building a search engine with content retrieval from the Web.
  • All kinds of research work on the web graph: gathering link statistics, studying graph structure, tracking domain counts, etc.
  • You have a topic and you want to crawl the documents about that topic.
  • More general focused crawling tasks: e.g. searching for pages that are large hubs and change frequently.

Hardware requirements

One spider thread can crawl around 1200 pages/minute from 100 web hosts in parallel. The spider-to-worker ratio is about 4:1 when content isn't stored; storing content could require more workers.

Each strategy worker instance requires 1 GB of RAM to run the state cache, which can be tuned. For example, if you need to crawl at 15K pages/minute, you need 12 spiders and 3 pairs of strategy and DB workers; all these processes consume 18 cores in total.

Using distributed Frontera

You can find the repository here.

The tutorial should help you get started, and there are also some useful notes in the slides Alexander presented at EuroPython last month.

You may also want to check out the Frontera documentation as well as our blog post that provides an example of using Frontera in a project.

Thanks for reading! If you have any questions or comments please share your thoughts below.

References

1. Image taken from http://soltisconsulting.co/2013/08/28/the-pagerank-periodicals-an-educational-presentation-and-interpretive-study-part-1b-introduction-of-the-hypertext-induced-topic-search-hits-algorithm/
2. https://en.wikipedia.org/wiki/HITS_algorithm
3. http://blog.scrapinghub.com/2015/06/19/link-analysis-algorithms-explained/
4. HBase was modeled after Google’s BigTable system, and you may find this paper useful to better understand it: http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
5. http://habrahabr.ru/company/mailru/blog/167297/
6. The strategy worker can be implemented in a different language; the only requirement is a Kafka client.

The Road to Loading JavaScript in Portia

Support for JavaScript has been a much requested feature ever since Portia’s first release 2 years ago. The wait is nearly over and we are happy to inform you that we will be launching these changes in the very near future. If you’re feeling adventurous you can try it out on the develop branch at Github. This post aims to highlight the path we took to achieving JavaScript support in Portia.

The Plan

As with everything in software, we started out by investigating what our requirements were and what others had done in this situation. We were looking for a solution that was reliable and would allow for reproducible interaction with the web pages.

Reliability: A solution that could render the pages in the same way during spider creation and crawling.

Interaction: A system that would allow us to record the user’s actions so that they could be replayed while crawling.

The Investigation

The investigation produced some interesting and some crazy ideas; here are the ones we probed further:

  1. Placing the Portia UI inside a browser add-on and taking advantage of the additional privileges to read from and interact with the page.
  2. Placing the Portia UI inside a bookmarklet and, after some post-processing of the page on our server, allowing interaction with the page.
  3. Rendering a static screenshot of the page with the coordinates of all its elements and sending them to the UI. Interaction involves re-rendering the whole screenshot.
  4. Rendering a tiled screenshot of the page along with coordinates. When an interaction event is detected, the representation is updated on the server and the updated tiles are sent to the UI to be rendered along with the updated DOM.
  5. Rendering the page in an iframe with a proxy to avoid cross-origin issues and disable unwanted activity.
  6. Rendering the page on the server and sending the DOM to the user. Whenever the user interacts with the page, the server forwards any changes that happen as a result of that interaction.
  7. Building a desktop application using Webkit to have full control over the UI, page rendering and everything else we might need.
  8. Building an internal application using Webkit to run on a server accessible through a web based VNC.

We rejected 7 and 8 because they would increase the barrier to entry for Portia and make it more difficult to use. This is the approach Import.io takes for its spider creation tool.

1 and 2 were rejected because it would be hard to fit the whole Portia UI into an add-on in the way we'd prefer, although we may revisit these options in the future. ParseHub and Kimono use these methods to great effect.

3 and 4 were investigated further, inspired by the work done by LibreOffice on their Android document editor. In the end, though, it was clunky and we could achieve better performance by sending DOM updates rather than image tiles.

The Solution

The solution we have now built is a combination of 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user allowing the page to be loaded and interacted with in a controlled manner.

We looked at using existing solutions including Selenium, PhantomJS and Splash. All of these technologies are wrappers around WebKit providing domain-specific functionality. We use Splash for our browser not because it is a Scrapinghub technology but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.
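For readers curious about what this looks like from Python, here is a minimal sketch of fetching a JavaScript-rendered page through Splash's HTTP API (assuming a Splash instance running locally on port 8050):

    import requests

    # Ask Splash to load the page, run its JavaScript and return the resulting DOM.
    resp = requests.get('http://localhost:8050/render.html',
                        params={'url': 'http://example.com', 'wait': 0.5})
    html = resp.text  # the HTML after JavaScript has run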

The server-side browser receives input from the user. Websockets are used to send events and DOM updates between the user and the server. Initially we looked at React's virtual DOM, and while it worked, it wasn't perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. This, in conjunction with the Mutation Summary library, allows us to update the page in the UI for the user when they interact with it.

We now proxy all of the resources that the page needs rather than loading them from the host directly. The advantage of this is that we can load resources either from the cache in our server-side browser or from the original host, and provide SSL protection for the resources if the host doesn't already provide it.

The Future

Before JS support (left), After JS support (right)

For now we’re very happy with how it works and hope it will make it easier for users to extract the data they need.

This initial release will provide the means to crawl and extract pages that require JavaScript, but we want to make it better! We are now building a system to allow actions to be recorded and replayed on pages during crawling. We are hoping that this feature will make filling out forms, pressing buttons and triggering infinite scrolling simple and easy to use.

If you have any ideas for what features you would like to see in Portia leave a comment below!

EuroPython 2015

EuroPython 2015 is happening this week and, as part of it, we're having our largest company meetup so far, with more than 30 members of our fully remote-working team attending. The event, held in Bilbao, started on Monday and is providing great quality talks and sessions, and plenty of tasty Spanish dishes.


As sponsors we've been privileged with a nice booth where we are connecting with Pythonistas, discussing scraping practices and giving away lots of swag such as hats, t-shirts, stickers, bottle openers and clocks.


If you're attending EuroPython, make sure to drop by and get one of these. Our booth is in the main area (exhibition hall and lounge area) of the Euskalduna Conference Center. :)

On the conference's first day, Scrapinghubbers gave two talks: one about Frontera, hosted by Alexander Sibiryakov, and one about testing, hosted by Eugene Amirov – both available below.

On Tuesday we hosted two more talks: one about best practices in web scraping, by Shane Evans, and one about Scrapy, by Juan Riaza – also available below. In addition, we had a poster session about Frontera and a recruiting session sharing job opportunities and the perks of working for Scrapinghub.

On the third day, Lluis Esquerda hosted a talk about CityBikes, his personal project on bike sharing networks (check the video below). As night came we went to Hotel Ercilla, where the EuroPython organizers were hosting the conference's social event (a.k.a. "Pyntxos Night"). There, we enjoyed the nice food and the other attendees' company, and even showed some tricky moves on the dance floor. :)

After resting from the party, Thursday came and with it Juan Riaza's Scrapy Helpdesk. We invited attendees to join us, learn about Scrapy and pick up some swag at our booth.

On Friday, Juan Riaza hosted a three-hour training session on Scrapy, sharing tips and building spiders in real time with the attendees. We also went to the Guggenheim Museum, where we met the "Mom Spider" or, as some people say, one of Scrapinghub's ancestors.

With the weekend came the EuroPython sprints, where we teamed up to port Scrapy to Python 3. During the sprint we managed to unblock our main problem with the port, the handling of URLs and headers, and then migrated the Request and Response classes, so it is finally possible to divide the work and port each component independently. We also managed to split the Scrapy Selectors into a separate library (called Parsel), which has reached a stable point and which Scrapy now depends on (we're working on its documentation to make an official release). We continued the Python 3 porting in the following weeks back home and received important contributions from the community that helped increase our test coverage under Python 3 from 19% to 54% (see the current progress in this badge).

Here are the full recorded version of the talks given by Scrapinghubbers:

Alexander’s talk “Frontera: open source large-scale web crawling framework”

Eugene’s talk “Sustainable way of testing your code”

Shane’s talk “Advanced Web Scraping”

Juan’s talk on “Dive into Scrapy”

Lluis’ talk “CityBikes: bike sharing networks around the world”

It was an amazing experience. Our special thanks to the EuroPython organizers and to all the team members that made it happen!

StartupChats Remote Working Q&A

Earlier this week, Scrapinghub was invited along with several other fully-distributed companies to participate in a remote working Q&A hosted by Startups Canada.

The various companies invited, as well as their guests attending, shared their insights into a number of questions related to building a remote company and being a remote worker.

We’re sharing our answers and our favourite answers from other companies.

Q1 What are some of the major benefits to using remote workers?



Q2 What are some of the major benefits in working remotely?




Q3 How can you stay connected with your team while working remotely?



Q4 What is the best software for ensuring effective communication/collaboration?


Q5 Are there any home-office necessities for working remotely?


Q6 Do certain roles lend themselves better to remote workers? Are there some that don’t?


Q7 How do you maintain a team atmosphere and culture when your employees are working remotely?




Q8 Is information security a greater issue for businesses that use remote workers/allow employees to work remotely?


Q9 How can a business protect its data while using remote workers?


Q10 Is a certain amount of face-to-face interaction recommended ie pre-hire/once a year? When and how much?



Q11 Any final tips for our audience today? Do you know of any helpful blogs/resources on remote working?






Thanks to everyone at QuickBooks and Startup Canada, as well as Joe Johnson, Wade Foster, Alex Brown and Tom Redman!

PyCon Philippines 2015

Earlier this month we attended PyCon Philippines as a gold sponsor, presenting on the 2nd day. This was particularly exciting as it was the first time the whole Philippines team was together in one place and it was nice meeting each other in person!

Check out the slides below:

The talk started with how people used to scrape manually in the past and the pain of handling timeouts, retries, HTTP errors and so forth. We presented Scrapy as a solution to these issues and explained how it addresses them, as well as giving a brief history of how Scrapy came to be.


We proceeded with a live demo showing how to scrape the Republic Acts of the Philippines from the Philippine government website as well as scraping clickthecity.com to retrieve cinema screening schedules in Metro Manila. Some of the audience joined in during the demo and we helped answer their questions.

We also talked about some of the projects we have done for customers at Scrapinghub:

  • Collecting product information from various retailers worldwide for a UK analytics company. This data is used to discover who has the cheapest products, which retailers are running promotions and so on. This is useful for customers who want to find the best deals, for retailers to see how they compare, and for brands to ensure retailers are conforming to their guidelines.
  • Scraping home appliances along with their price, specifications and ratings for a U.S. Department of Energy laboratory. This data is used to better understand the relation of product price, energy efficiency and other factors and their evolution over time.
  • DARPA’s Memex project.

We then showed a number of side projects including:

  • Using Scrapy to crawl the Metro Rail Transit website with the aid of computer vision to gauge the number of people on a scale of 1 to 10 from their CCTV images. The collected data was visualised and presented, with an explanation of how historical data could be used to predict future results.
  • Minibalita.com: Jolo showed scraping Philippine news websites and then running them through his site TextTeaser to produce article summaries. Balita is the Tagalog word for news.
  • Mikko presented his crawling of the Philippines’ 2013 general election, emphasising the power of structured data in finding trends. Some unusual trends discussed were:

    • How one clustered precinct of around 400 people only voted for one party list despite having the option to vote for two.
    • How one clustered precinct only voted for one senator, out of the maximum of 12 votes they can cast for the position.
    • How 70 clustered precincts recorded a 100% voter turnout which is highly unlikely to happen.

Finally we discussed some of the legalities of scraping, about which there were a lot of questions. We also had many questions on how we deal with complaints and with sites blocking us, and how to handle sites that make heavy use of JavaScript and AJAX.


Afterwards we had dinner in a nearby mall with the other speakers and it was great meeting like minded people from the Python community.
