
Aduana: Link Analysis to Crawl the Web at Scale

Crawling vast numbers of websites for specific types of information is impractical. Unless, that is, you prioritize what you crawl. Aduana is an experimental tool that we developed to help you do that. It’s a special backend for Frontera, our tool to expedite massive crawls in parallel (primer here).

Aduana is designed for situations where the information you’re after is dispersed all over the web. It provides you with two link analysis algorithms, PageRank and HITS (how they work). These algorithms analyze link structure and page content to identify relevant pages. Aduana then prioritizes the pages you’ll crawl next based on this information.

You can use Aduana for a number of things. For instance:

  • Analyzing news.
  • Searching locations and people.
  • Performing sentiment analysis.
  • Finding companies to classify them.
  • Extracting job listings.
  • Finding all sellers of certain products.

Concrete example: imagine that you’re working for a travel agency and want to monitor trends on location popularity. Marketing needs this to decide what locations they’ll put forward. One way you could do this is by finding mentions of locations on news websites, forums, and so forth. You’d then be able to observe which locations are trending and which ones are not. You could refine this even further by digging into demographics or the time of year – but let’s keep things simple for now.

This post will walk you through that scenario. We’ll explore how you can extract locations from web pages using geotagging algorithms and feed this data into Aduana to prioritize crawling the most relevant web pages. We’ll then discuss how performance bottlenecks led us to build Aduana in the first place. Lastly, we’ll give you its backstory and a glimpse at what’s coming next.

In case you’d like to try this yourself as you read, you’ll find the source code for everything discussed in this post in the Aduana repository’s examples directory.

Using GeoNames and Neural Networks to Guide Aduana

Geotagging is a hard problem

Extracting location names from arbitrary text is a lot easier said than done. You can’t just take a geographical dictionary (a gazetteer) and match its contents against your text. Two main reasons:

  • Location names can look like common words.
  • Multiple locations can share the same name.

For instance, GeoNames references 8 locations called Hello, one called Hi, and a whopping 95 places called Hola:

Using Named-Entity Recognition to Extract Location Names

Our first idea was to use named-entity recognition (NER) to reduce the number of false positives and false negatives. A NER tokenizer lets you classify elements in a text into predefined categories such as person names or location names. We figured we’d run our text through the NER and match the relevant output against the gazetteer.

CLAVIN is one of the few open source libraries available for this task. It uses heuristics that disambiguate locations based on the text’s context.

We built a similar solution in Python. It’s based on NLTK, a library for natural language processing that includes a part of speech (POS) tagger. A POS tagger is an algorithm that assigns parts of speech – e.g. noun or verb – to each word. This lets us extract geopolitical entities (GPEs) from a text directly. And then match the GPEs against GeoNames to get candidate locations.
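That pipeline can be sketched with NLTK’s standard tokenizer, POS tagger, and named-entity chunker. This is an illustration of the approach, not the exact code from the examples directory, and it needs the usual NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) downloaded beforehand:

```python
import nltk

def extract_gpes(text):
    """Return the geopolitical entity (GPE) strings NLTK finds in `text`.

    Requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker and
    words NLTK data packages (install via nltk.download(...)).
    """
    tokens = nltk.word_tokenize(text)   # split the text into words
    tagged = nltk.pos_tag(tokens)       # assign a part of speech to each word
    tree = nltk.ne_chunk(tagged)        # group tagged words into named entities
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "GPE"]
```

Each returned GPE string would then be looked up in GeoNames to produce candidate locations.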

The result is better than naive location name matching but not bulletproof. Consider the following paragraph from this blog entry to illustrate:

Bilbao can easily be seen in a weekend, and if you’re really ambitious, squeezed into 24 hours. Lots of tourists drop in before heading out to San Sebastián, but a quick trip to the Bizkaia province leaves out so many beautiful places. Puerto Viejo in Algorta – up the river from Bilbao on the coast – is one of those places that gets overlooked, leaving it for only for those who’ve done their research.

Green words are locations our NER tokenizer recognized. Red words are those it missed. As you can see, it ignored Bizkaia. It also missed Viejo after incorrectly splitting Puerto Viejo into two separate words.

Undeterred, we tried a different approach. Rather than extracting GPEs using a NER and feeding them directly into a gazetteer, we’d use the gazetteer to build a neural network and feed the GPEs into that instead.

Using a Hopfield Network to Extract Location Names

We initially experimented with centrality measures using the NetworkX library. In short, a centrality measure assigns a score to each node in a graph to represent its importance. A typical one would be PageRank.

But we were getting poor results: node values were too similar. We were leaning towards introducing non-linearities into the centrality measures to work around that, when it hit us: a special form of recurrent neural network called a Hopfield network has them built in.
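The flat-scores problem is easy to reproduce. Here is a toy power-iteration PageRank in plain Python (a sketch for illustration, unrelated to Aduana’s implementation); on a structurally symmetric graph such as a directed cycle, every node gets an identical score, so the ranking carries no information:

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Toy power-iteration PageRank; assumes every node has outlinks."""
    out = [[] for _ in range(n)]
    for src, dst in edges:
        out[src].append(dst)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for src in range(n):
            share = d * rank[src] / len(out[src])
            for dst in out[src]:
                nxt[dst] += share
        rank = nxt
    return rank

# A 6-node directed cycle: structurally, every node is identical.
scores = pagerank([(i, (i + 1) % 6) for i in range(6)], 6)
# All six scores come out as exactly 1/6 -- there is nothing to rank.
```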

We built our Hopfield network (source code) with each unit representing a location. Unit activation values range between -1 and +1, and represent how sure we are that a location appears in a text. The network’s links connect related locations with positive weights – e.g. a city with its state, a country with its continent, or a country with neighboring countries. Links also connect mutually exclusive locations with negative weights, to address cases where different locations have identical names. This network then lets us pair each location in a text with the unit that has the highest activation.

You’d normally train such a network’s weights until you minimize an error function. But manually building a training corpus would take too much time and effort for this post’s purpose. As a shortcut, we simply set the weights of the network with fixed values (an arbitrary +1 connection strength for related locations) or using values based on GeoNames (an activation bias based on the location’s population relative to its country).
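A miniature version of that setup, with made-up weights and biases purely for illustration (not Aduana’s actual code): three candidate locations, a +1 link between related ones, a -1 link between namesakes, and a population-style bias.

```python
import numpy as np

# Three candidate locations for the words "Bilbao" and "Bizkaia".
names = ["Bilbao (Spain)", "Bilbao (Mexico)", "Bizkaia (Spain)"]

# +1 between related locations (Bilbao lies in Bizkaia), -1 between
# mutually exclusive candidates sharing the same name.
W = np.array([
    [ 0.0, -1.0,  1.0],
    [-1.0,  0.0,  0.0],
    [ 1.0,  0.0,  0.0],
])

# Activation bias standing in for GeoNames population data (made-up values).
bias = np.array([0.3, 0.1, 0.2])

act = np.zeros(3)
for _ in range(100):               # iterate the update rule to convergence
    act = np.tanh(W @ act + bias)  # tanh keeps activations in (-1, +1)

best = names[int(np.argmax(act))]  # the Spanish Bilbao wins; its Mexican
                                   # namesake is driven negative
```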

The example we used earlier yields the following neural network:

Colors represent the final unit activations. Warmer colors (red) show activated units; cooler colors (blue) deactivated ones.

The activation values for our example paragraph are:

The results are still not perfect but very promising. Puerto gets recognized as Puerto Viejo but is incorrectly assigned to a location in Mexico. This may look strange at first, but Mexico actually has locations named San Sebastián and Bilbao. Algorta is correctly identified but received a low score. It’s too weakly connected to the related locations in the graph. Both issues occur because we didn’t train our network.

That’s sensible and good enough for our purpose. So let’s move forward and make Aduana use this data.

Running the Spider

You can safely skip this section if you’re not running the code as you read.

You’ll find the spider’s source code in the Aduana repository and Aduana installation instructions in the documentation.

To run the spider:

git clone
cd aduana
python setup.py develop
cd examples/locations
pip install -r requirements.txt
scrapy crawl locations

If you run into problems with the dependencies – namely with numpy and scipy – install them manually. One at a time, in order:

pip install numpy
pip install scipy
pip install sklearn

Prioritizing Crawls Using Aduana

Our spider starts with a seed of around 1000 news articles. It extracts links to new pages and location data as it crawls. It then relies on Aduana to tell it what to crawl next.

Each page’s text goes through our geotagging algorithm. Our spider only keeps locations with an activation score of 0.90 or above. It writes the results to a locations.csv file. Each row contains the date and time of the crawl, the GeoNames ID, the location’s name, and the number of times the location appears in the page.

Our spider uses this data to compute a score for each page. The score is a function of the proportion of locations in the text:
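One simple scoring function along those lines (an illustrative sketch, not necessarily the exact formula the spider uses) scores a page by the density of recognized location mentions:

```python
def page_score(n_location_words, n_words):
    """Illustrative page score: the fraction of words in the page that are
    recognized location mentions, clamped to [0, 1].

    This is a stand-in sketch, not necessarily the spider's exact formula.
    """
    if n_words == 0:
        return 0.0
    return min(1.0, n_location_words / n_words)
```

A page with 10 location mentions out of 100 words would score 0.1; a page with none scores 0.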

Aduana then uses these scores as inputs for its HITS algorithm. This allows it to identify which pages are most likely to mention locations and rank pages on the fly.

Promising Results

Our spider is still running as we write this. Aduana ranked about 2.87M web pages. Our spider crawled 129k. And our geotagging algorithm found over 300k locations on these pages. The results so far look good.

In a future blog post we’ll dig into our findings and see if anything interesting stands out.

To conclude this section, Aduana helps a lot to guide your crawler when running broad crawls. And depending on what you’re using it for, consider implementing some kind of machine learning algorithm to extract the information you need to compute page scores.

Aduana Under the Hood

Performance problems with graph databases

If you’re familiar with graph databases, you may have been wondering all along why we created Aduana instead of using off-the-shelf solutions like GraphX and Neo4j. The main reason is performance.

We actually wanted to use a graph database at first. We figured we’d pick one, write a thin layer of code so Frontera can use it as a storage backend, and call it a day. This was the logical choice because the web is a graph and popular graph databases include link analysis algorithms out of the box.

We chose GraphX because it’s open source and works with HDFS. We ran the provided PageRank application on the LiveJournal dataset. The results were surprisingly poor. We cancelled the job 90 minutes in. It was already consuming 10GB of RAM.

Wondering why, we ended up reading GraphChi: Large-Scale Graph Computation on Just a PC. It turns out that distributed PageRank on a 40 million node graph is only about twice as fast as running GraphChi on a single computer. That eye opener made us reconsider whether we should distribute the work at all.

We tested a few non-distributed libraries with the same LiveJournal data. The results:

To be clear, take these timings with a grain of salt. We only tested each library for a brief period of time and we didn’t fine-tune anything. What counted for us was the order of magnitude. The non-distributed algorithms were much faster than the distributed ones.

We then came across two more articles that confirmed what we had found: Scalability! But at what COST? and Bigger data; same laptop. The second one is particularly interesting. It discusses computing PageRank on a 128 billion edge graph using a single computer.

In the end, SNAP wasn’t fast enough for our needs, and FlashGraph and X-Stream were rather complex. We realized it would be simpler to develop our own solution.

Aduana Is Built on Top of LMDB

With this in mind, we began implementing a Frontera storage engine on top of a regular key-value store: LMDB. Aduana was born.

We considered using LevelDB. It might have been faster. LMDB is optimized for reads after all; we do a lot of writing. We picked LMDB regardless because:

  • It’s easy to deploy: only 3 source files written in C.
  • It’s plenty fast.
  • Multiple processes can access the database simultaneously.
  • It’s released under the OpenLDAP license: open source and non-GPL.

When Aduana needs to recompute PageRank/HITS scores, it stores the vertex data as a flat array in memory, and streams the links between them from the database.
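Sketched in Python (Aduana itself does this in C against LMDB, and the on-disk format here is invented for illustration: one little-endian 32-bit source/destination pair per edge), the streaming structure looks like this; only two float arrays of vertex size ever live in memory:

```python
import struct

def pagerank_streamed(edge_file, n_vertices, out_degree, d=0.85, iters=20):
    """PageRank where each iteration streams (src, dst) edge pairs from
    disk; per-vertex scores are kept in flat in-memory arrays."""
    rank = [1.0 / n_vertices] * n_vertices
    for _ in range(iters):
        nxt = [(1 - d) / n_vertices] * n_vertices
        with open(edge_file, "rb") as f:
            while chunk := f.read(8):                 # one edge = 8 bytes
                src, dst = struct.unpack("<II", chunk)
                nxt[dst] += d * rank[src] / out_degree[src]
        rank = nxt
    return rank
```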

Using LMDB and writing the algorithms in C was well worth the effort in the end. Aduana now fits a 128 billion edge graph onto a 1TB disk, and its vertex data in 16GB of memory. Which is pretty good.
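Those figures match a back-of-envelope estimate, assuming two 32-bit vertex ids per edge and a few bytes of score data per vertex (our assumption for illustration, not Aduana’s exact layout):

```python
edges = 128e9                  # 128 billion edges
edge_bytes = 2 * 4             # two 32-bit vertex ids per edge (assumed)
disk = edges * edge_bytes      # 1.024e12 bytes, i.e. about 1 TB

vertices = 4e9                 # ~4 billion vertices fit in a 32-bit id
vertex_bytes = 4               # e.g. one 32-bit score per vertex (assumed)
ram = vertices * vertex_bytes  # 1.6e10 bytes, i.e. about 16 GB
```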

Let’s now segue into why we needed any of this to begin with.

Aduana’s Backstory and Future

Aduana came about when we looked into identifying and monitoring specialized news and alerts on a massive scale for one of our Professional Services clients.

To cut a long story short, we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens.

We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month. And Crawlera, our smart proxy rotator, lets them work around crawler countermeasures when needed.

But no. The sorry reality is that relevant pages and updates were still coming in too slowly. We needed to prioritize more. That got us into link analysis and ultimately to Aduana itself.

We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after.

It’s still experimental. And not production-ready yet. Our next steps will be to improve how Aduana decides to revisit web pages. And make it play well with Distributed Frontera.

If you’d like to dig deeper or contribute, don’t hesitate to fork Aduana’s Github repository. The documentation includes a section on Aduana’s internals if you’d like to extend it.

As an aside, we’re hiring; and we’re for hire if you need help turning web content into useful data.

Scrapy on the Road to Python 3 Support

Scrapy is one of the few popular Python packages (almost 10k GitHub stars) that’s not yet compatible with Python 3. The team and community around it are working to make it compatible as soon as possible. Here’s an overview of what has been happening so far.

You’re invited to read along and participate in the porting process from Scrapy’s Github repository.

First off you may be wondering: “why is Scrapy not in Python 3 yet?”

If you asked around you likely heard an answer like “It’s based on Twisted, and Twisted is not fully ported yet, you know?”. Many blame Twisted, but other things are actually holding back the Scrapy Python 3 port.

When it comes to Twisted, its most important parts are already ported. But if we want Scrapy spiders to download from HTTP URLs, we really need a fix or workaround for Twisted’s HTTP agent, since it doesn’t work on Python 3.

The Scrapy team started to make moves towards Python 3 support two years ago by porting some of Scrapy’s dependencies. For about a year now, a subset of Scrapy tests has been executed under Python 3 on each commit. Apart from Twisted, one bottleneck that blocked progress for a while was that most of Scrapy requires Request or Response objects to work – this was recently resolved as described below.

Scrapy core devs meet and prepare a sprint

During EuroPython 2015, from the 20th to the 28th of July, several Scrapy core developers gathered in Bilbao and were able to make progress on the porting process, meeting in person for the first time after several years of working together.

There was a Scrapy sprint scheduled for the weekend. Mikhail Korobov, Daniel Graña, and Elias Dorneles teamed up to prepare Scrapy for it by porting Request and Response in advance. This way, it would be easier for other people to join and contribute during the weekend sprint.

In the end, time was short for them to fully port Request and Response before the weekend sprint. Some of the issues they faced were:

  • Should HTTP headers be bytes or unicode? Is it different for keys and values? Some header values are usually UTF-8 (e.g. cookies); HTTP Basic Auth headers are usually latin1; for other headers there is no single universal encoding either. Generally, bytes for HTTP headers make sense, but there is a gotcha: if you’re porting an existing project from Python 2.x to 3.x, code which was working before may start to silently produce incorrect results. For example, let’s say there is a response with 'application/json' content type. If header values are bytes, in Python 2.x `content_type == 'application/json'` will return True, but in Python 3.x it will return False because you’re then comparing a unicode literal with bytes.
  • How to percent-escape and unescape URLs properly? Proper escaping depends on the web page encoding and on which part of the URL is being escaped. This matters if a webpage author uses non-ascii URLs. After some experiments we found that browsers do crazy things here: the URL path is encoded to UTF-8 before escaping, but the query string is encoded to the web page encoding before escaping. You can’t rely on browser address bars to verify this: what browsers send to servers is consistent, namely a UTF-8 path and a page-encoded query string, in both Firefox and Chrome on OS X and Linux, but what they display to users depends on the browser and operating system.
  • URL-related functions are very different in Python 2.x and 3.x. In Python 2.x they only accept bytes, while in Python 3.x they only accept unicode. Combined with the encoding craziness, this makes porting the code harder still.
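The content-type gotcha from the first point is easy to demonstrate (Python 3 semantics shown; on Python 2 the same comparison succeeded):

```python
# Header value kept as bytes, as Scrapy decided to do.
content_type = b"application/json"

# Python 2: b"..." == "..." was True. Python 3: silently False, no error.
assert (content_type == "application/json") is False

# An explicit decode restores the Python 2 behaviour.
assert content_type.decode("latin1") == "application/json"
```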

The EuroPython sprint weekend arrives

To unblock further porting, the team decided to use bytes for HTTP headers and momentarily disable some of the tests for non-ascii URL handling, thus eliminating the two bottlenecks that were holding things back.

The sprint itself was quiet but productive. Some highlights on what developers did there:

  • They unblocked the main problem with the port, the handling of urls and headers, and then migrated the Request and Response classes so it’s finally possible to divide the work and port each component independently;
  • Elias split the Scrapy Selectors into a separate library (called Parsel), which reached a stable point and which Scrapy now depends on (there is work being done in the documentation to make an official release);
  • Mikhail and Daniel ported several Scrapy modules to make further contributions easier;
  • A mystery contributor came, silently ported a Scrapy module, and left without a trace (please reach out!);
  • Two newcomers joined; they were completely new to Scrapy and had fun setting up their first project.

The road is long, but the path is clear!

In the end the plan worked as expected. After porting Request and Response objects and making some hard decisions, the road to contributions is open.

In the weeks that followed the sprint, developers continued to work on the port. They also got important contributions from the community. As you can see here, Scrapy already got several pull requests merged (for example from @GregoryVigoTorres and @nyov).

Before the sprint there were ~250 tests passing in Python 3. The number is now over 600. These recent advances helped increase our test coverage under Python 3 from 19% to 54%.

Our next major goal is to port the Twisted HTTP client so spiders can actually download something from remote sites.

It’s still a long way to full Python 3 support, but Scrapy is in much better shape now. Join us and contribute to porting Scrapy by following these guidelines. We have added a badge to GitHub to show the progress of Python 3 support. The percentage is calculated from the tests that pass in Python 3 versus the total number of tests. Currently, 633 of 1153 tests pass on Python 3.


Thanks for reading!

Introducing Javascript support for Portia

Today we released the latest version of Portia bringing with it the ability to crawl pages that require JavaScript. To celebrate this release we are making Splash available as a free trial to all Portia users so you can try it out with your projects.

How to use it

Open a project within Portia. If you don’t already have an instance of Portia you can get started by downloading Portia from GitHub or by signing up for our hosted instance.

If you would like to crawl using JavaScript in your project you can do so by:

  1. Navigating to your spider in Portia.
  2. Opening the Crawling tab.
  3. Clicking the Enable JS checkbox.

By clicking this checkbox you will be able to annotate pages that require JavaScript in order to be crawled correctly.

After you enable JavaScript you can limit which pages JavaScript is enabled for. You can choose for JavaScript to run only on certain pages, in the same way that you can limit which pages are followed by Portia.

The reason you may want to limit which pages load JavaScript is that loading JavaScript can increase the amount of time required to run your spider.

Once you have made these changes you can publish your spider and it will be able to crawl pages that require JavaScript.

How do I know if I need it?

When you are creating a spider you can use the show followed links checkbox to decide whether you need JavaScript enabled on a page. By showing followed links you can see which links are followed only if JavaScript is enabled and which links are always followed.

Green links will always be followed. Red links will never be followed. Blue links will only be followed if JavaScript is enabled.

To decide if you need JavaScript enabled for extracting data you can try to create a sample for the page that you wish to extract data from. If JavaScript is not enabled for this page and you can see the data you wish to extract then you don’t need to change anything; your spider works! If you don’t see the data you want then you can:

  • Enable JavaScript for the spider as described above, or
  • Add a matching pattern for this URL to the enable JavaScript patterns.

Using your own Splash

If you already have your own dedicated Splash instance you can enable it for your project by adding its URL and your API key to the Portia addon in your project settings. If you would like to request your own Splash instance please visit your organization’s dashboard page. If you would like to learn more about Splash you can do so here.

This should be all you need to get started with JavaScript in Portia. If you need anything else, we’re here to assist you. Happy Scraping!

Distributed Frontera: Web Crawling at Scale

Over the last half year we have been working on a distributed version of our frontier framework, Frontera. This work was partially funded by DARPA and is going to be included in the DARPA Open Catalog on Friday 7th.

The project came about when a client of ours expressed interest in building a crawler that’s able to identify frequently changing hub pages.

In case you aren’t familiar, hub pages are pages that contain a large number of outgoing links to authority sites. Authority sites are sites with a high authority score, which is determined by the relevance of their pages. [2][3]

The client wanted us to build a crawler that would crawl around 1 billion pages per week, batch process them, output the pages that are hubs changing with some predefined periodicity, and keep the system running to capture the latest changes.

Single-thread mode

We began by building a single-threaded crawl frontier framework, allowing us to store and generate batches of documents to crawl while working seamlessly with the Scrapy ecosystem.

This is basically the original Frontera, intended to solve:

  • Cases when one needs to isolate URL ordering/queueing from the spider, e.g. a distributed infrastructure or remote management of ordering/queueing.
  • Cases when URL metadata storage is needed, e.g. to display its contents somewhere.
  • Cases when one needs advanced URL ordering logic. If a website is big and it’s expensive to crawl the whole website, Frontera can be used for crawling the most important documents.

Single-thread Frontera has two storage backends: memory and SQLAlchemy. You can use any RDBMS of your choice such as SQLite, MySQL, Postgres. If you wish to use your own crawling strategy, it should be programmed in the backend and spider code.

You can find the repository here.

Distributed mode

Later, we began investigating how to scale the existing solution and make it work with an arbitrary number of documents. Given our access pattern, we were interested in a key-value store that could handle random reads and writes as well as efficient batch processing, while remaining scalable.

HBase [4] turned out to be a good choice for this; see [5] for an example of a large content system built on HBase.

Next, communicating with HBase directly from Python isn’t reliable, so we decided to use Kafka as a communication layer. As a bonus we got partitioning by domain name, which makes it easier to ensure each domain is downloaded by at most one spider, plus out-of-the-box log replay, which is useful when changing the crawling strategy on the fly.

This resulted in the architecture outlined below:


Let’s start with spiders. The seed URLs defined by the user inside spiders are propagated to strategy workers and DB workers by means of a Kafka topic named ‘Spider Log’. Strategy workers decide which pages to crawl using HBase’s state cache, assign a score to each page, and send the results to the ‘Scoring Log’ topic.

The DB worker stores all kinds of metadata, including content and scores. It checks the spiders’ consumer offsets, generates new batches when needed, and sends them to the ‘New Batches’ topic. Spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the ‘Spider Log’ topic, where they are stored and scored. That’s it: we have a closed circle.

The main advantage of this design is real-time operation: the crawling strategy can be changed without having to stop the crawl. It’s worth mentioning that the crawling strategy can be implemented as a separate module containing the logic for checking the crawl-stopping condition, URL ordering, and the scoring model.

Distributed Frontera is polite to web hosts by design and each host is downloaded by no more than one spider process. This is achieved by Kafka topic partitioning. All distributed Frontera components are written in Python, which is much easier to customize than C++ or Java, the most common languages for large-scale web crawlers. [6]
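The politeness guarantee can be sketched like this: the Kafka partition for a URL is derived from its host name, so every URL of a given host lands in the same partition, which exactly one spider process consumes. The partition count and hashing scheme below are illustrative, not Frontera’s exact code:

```python
import hashlib
from urllib.parse import urlparse

N_PARTITIONS = 12   # illustrative: one partition per spider process

def partition_for(url):
    """Map a URL to a partition via its host, so that at most one
    spider process ever downloads from any given host."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS
```

Two URLs on example.com always map to the same partition, so example.com is never crawled by two spiders at once.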

Here are the use cases for the distributed version:

  • You have a set of URLs and need to revisit them, e.g. to track changes.
  • Building a search engine with content retrieval from the Web.
  • All kinds of research work on the web graph: gathering link statistics, graph structure, tracking domain counts, etc.
  • You have a topic and you want to crawl the documents about that topic.
  • More general focused crawling tasks: e.g. searching for pages that are large hubs and change frequently.

Hardware requirements

One spider thread can crawl around 1200 pages/minute from 100 web hosts in parallel. The spider-to-worker ratio is about 4:1 when content isn’t stored; storing content could require more workers.

1 GB of RAM is required for each strategy worker instance to run the state cache, which can be tuned. For example, if you need to crawl at 15K pages/minute you need 12 spiders and three strategy worker / DB worker pairs; all these processes consume 18 cores in total.
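That sizing follows directly from the single-thread throughput above:

```python
target = 15_000      # desired pages per minute
per_spider = 1_200   # pages per minute per spider thread (from above)

spiders = target // per_spider           # 12 spider processes (rounding down)
worker_pairs = spiders // 4              # 4:1 ratio -> 3 strategy/DB worker pairs
processes = spiders + 2 * worker_pairs   # 18 processes, one core each
```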

Using distributed Frontera

You can find the repository here.

The tutorial should help you get started, and there are also some useful notes in the slides Alexander presented at EuroPython last month.

You may also want to check out the Frontera documentation as well as our blog post that provides an example of using Frontera in a project.

Thanks for reading! If you have any questions or comments please share your thoughts below.


4. HBase was modeled after Google’s BigTable system, and you may find this paper useful to better understand it:
6. The strategy worker can be implemented in a different language; the only requirement is a Kafka client.

The Road to Loading JavaScript in Portia

Support for JavaScript has been a much requested feature ever since Portia’s first release 2 years ago. The wait is nearly over and we are happy to inform you that we will be launching these changes in the very near future. If you’re feeling adventurous you can try it out on the develop branch at Github. This post aims to highlight the path we took to achieving JavaScript support in Portia.

The Plan

As with everything in software, we started out by investigating what our requirements were and what others had done in this situation. We were looking for a solution that was reliable and would allow for reproducible interaction with the web pages.

Reliability: A solution that could render the pages in the same way during spider creation and crawling.

Interaction: A system that would allow us to record the user’s actions so that they could be replayed while crawling.

The Investigation

The results of the investigation produced some interesting and some crazy ideas; here are the ones we probed further:

  1. Placing the Portia UI inside a browser add-on and taking advantage of the additional privileges to read from and interact with the page.
  2. Placing the Portia UI inside a bookmarklet and, after some post-processing of the page on our server, allowing interaction with the page.
  3. Rendering a static screenshot of the page with the coordinates of all its elements and sending them to the UI. Interaction involves re-rendering the whole screenshot.
  4. Rendering a tiled screenshot of the page along with coordinates. When an interaction event is detected, the representation is updated on the server and the updated tiles are sent to the UI to be rendered along with the updated DOM.
  5. Rendering the page in an iframe with a proxy to avoid cross-origin issues and disable unwanted activity.
  6. Rendering the page on the server and sending the DOM to the user. Whenever the user interacts with the page, the server forwards any changes that happen as a result.
  7. Building a desktop application using Webkit to have full control over the UI, page rendering and everything else we might need.
  8. Building an internal application using Webkit to run on a server accessible through a web based VNC.

We rejected 7 and 8 because they would raise the barrier to entry for using Portia. Another vendor uses this method for its spider creation tool.

1 and 2 were rejected because it would be hard to fit the whole Portia UI into an add-on in the way we’d prefer, although we may revisit these options in the future. ParseHub and Kimono use this method to great effect.

3 and 4 were investigated further, inspired by the work done by LibreOffice for their Android document editor. In the end, though, it was clunky, and we could achieve better performance by sending DOM updates rather than image tiles.

The Solution

The solution we have now built is a combination of 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user allowing the page to be loaded and interacted with in a controlled manner.

We looked at using existing solutions including Selenium, PhantomJS and Splash. All of these technologies are wrappers around WebKit providing domain-specific functionality. We use Splash for our browser not because it is a Scrapinghub technology, but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.

The server-side browser gets input from the user. Websockets are used to send events and DOM updates between the user and the server. Initially we looked at React’s virtual DOM, and while it worked it wasn’t perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. This, in conjunction with the Mutation Summary library, allows us to update the page in the UI for the user when they interact with it.

We now proxy all of the resources that the page needs rather than loading them from the host. The advantage of this is that we can load resources from the cache in our server side browser or from the original host and provide SSL protection to the resources if the host doesn’t already provide it.

The Future

Before JS support (left), After JS support (right)


For now we’re very happy with how it works and hope it will make it easier for users to extract the data they need.

This initial release provides the means to crawl and extract pages that require JavaScript, but we want to make it better! We are now building a system that lets actions be recorded and replayed on pages during crawling. We hope this feature will make filling out forms, pressing buttons and triggering infinite scrolling simple.

If you have any ideas for what features you would like to see in Portia leave a comment below!

EuroPython 2015

EuroPython 2015 is happening this week, and as part of it we’re having our largest company meetup so far, with more than 30 members of our fully remote team attending. The event, held in Bilbao, started on Monday and has been providing great talks, sessions and plenty of tasty Spanish dishes.


As sponsors we’ve been privileged with a nice booth, where we are connecting with Pythonistas, discussing scraping practices and giving away lots of swag such as hats, t-shirts, stickers, bottle openers and clocks.


If you’re attending EuroPython, make sure to drop by and get one of these. Our booth is in the main area (exhibition hall and lounge) of the Euskalduna Conference Center. :)

On the conference’s first day, Scrapinghubbers gave two talks: one about Frontera, by Alexander Sibiryakov, and one about testing, by Eugene Amirov – both available below.

On Tuesday we hosted two more talks: one on web scraping best practices, by Shane Evans, and one about Scrapy, by Juan Riaza – also available below. In addition, we held a poster session about Frontera and a recruiting session sharing job opportunities and the perks of working for Scrapinghub.

On the third day, Lluis Esquerda gave a talk about CityBikes, his personal project on bike sharing networks (check the video below). As night came we went to Hotel Ercilla, where the EuroPython organizers were hosting the conference’s social event (a.k.a. “Pyntxos Night”). There we enjoyed the nice food, the company of other attendees, and even some tricky moves on the dance floor. :)

After resting from the party, Thursday came and with it Juan Riaza’s Scrapy Helpdesk. We invited attendees to join us and learn about Scrapy, and to pick up some swag at our booth.

On Friday Juan Riaza hosted a three-hour training session on Scrapy, sharing tips and building spiders in real time with the attendees. We also went to the Guggenheim Museum, where we met the “Mom Spider” – or, as some people say, one of Scrapinghub’s ancestors.

With the weekend came the EuroPython sprints, where we teamed up to port Scrapy to Python 3. During the sprint we managed to unblock the main problem with the port – the handling of URLs and headers – and then migrated the Request and Response classes, so it is finally possible to divide the work and port each component independently. We also split the Scrapy selectors into a separate library, called Parsel, which has reached a stable point; Scrapy now depends on it (we’re working on its documentation ahead of an official release). We continued the Python 3 porting in the following weeks back home and got important contributions from the community that helped increase our test coverage under Python 3 from 19% to 54% (see the current progress in this badge).

Here are the full recordings of the talks given by Scrapinghubbers:

Alexander’s talk “Frontera: open source large-scale web crawling framework”

Eugene’s talk “Sustainable way of testing your code”

Shane’s talk “Advanced Web Scraping”

Juan’s talk on “Dive into Scrapy”

Lluis’ talk “CityBikes: bike sharing networks around the world”

It was an amazing experience – our special thanks to the EuroPython organizers and to all the team members who made it happen!

StartupChats Remote Working Q&A

Earlier this week, Scrapinghub was invited, along with several other fully-distributed companies, to participate in a remote-working Q&A hosted by Startup Canada.

The invited companies, along with their attending guests, shared insights into a number of questions about building a remote company and being a remote worker.

We’re sharing our answers and our favourite answers from other companies.

Q1 What are some of the major benefits to using remote workers?

Q2 What are some of the major benefits in working remotely?

Q3 How can you stay connected with your team while working remotely?

Q4 What is the best software for ensuring effective communication/collaboration?

Q5 Are there any home-office necessities for working remotely?

Q6 Do certain roles lend themselves better to remote workers? Are there some that don’t?

Q7 How do you maintain a team atmosphere and culture when your employees are working remotely?

Q8 Is information security a greater issue for businesses that use remote workers/allow employees to work remotely?

Q9 How can a business protect its data while using remote workers?

Q10 Is a certain amount of face-to-face interaction recommended (e.g. pre-hire, or once a year)? When and how much?

Q11 Any final tips for our audience today? Do you know of any helpful blogs/resources on remote working?

Thanks to everyone at QuickBooks and Startup Canada, as well as Joe Johnson, Wade Foster, Alex Brown and Tom Redman!
