We are growing. Rapidly. This past year, we have gone from 80 to 114 Scrapinghubbers. Companies expanding so rapidly can risk losing what made them them in the first place. We don’t want that to happen, especially since we have the added challenge of being an entirely distributed company with remote coworkers scattered around the world in 36 countries.
However, we also want to be sure to adapt to the changing dynamic. Sticking to a method that worked well with 40 people means stagnation. It falls apart when you double your numbers. We have had to learn how to communicate through these growth spurts while maintaining our core culture.
One of the further challenges has been to stay true to our quirky selves while building up. With engineers, programmers, data scientists, writers, marketers, and a whole slew of other positions, we’ve got a lot of different personalities. Extroverts, introverts, niche interests in dystopian novels; you name it, we’ve got it. With applications coming in each day, narrowing down qualified candidates can be challenging because above all else, we stress a cultural fit. We want to ensure that each new hire fits in with their team and with the wider community as a whole.
Here’s a breakdown of the key steps we take to get there.
Hiring a Cultural Fit
CVs and resumes are a very flat way to learn about a person. We know that sometimes people have trouble showing their full potential through such a limited document. So, as part of our screening process, we assign a trial task to give people a chance to shine. This gives both us and the candidate a little taste of what it would be like to work together in a real world scenario. A successful project then leads into an interview.
Our interviews reflect the position that is being filled. Whether that includes a sales pitch or a detailed overview of relevant background experiences, rest assured that we’re not going to ask you any odd hypotheticals about being trapped in a blender. Instead, we will look for situations where you have demonstrated:
We cover all of our bases before making an offer. Picking the best candidate right off the bat smoothes the rest of the hiring process and ensures that we strengthen the baseline of our company culture.
Once we find the right cultural fit, onboarding our new teammate is the next priority. This includes maintaining open communication and transparency throughout this process. Immediately working closely with a new hire in a remote setting is critical to easing people into our company environment. It can be an overwhelming enough experience to be onboarded onsite while in the physical presence of your coworkers. That overwhelming sensation can multiply exponentially when you are dealing with remote situations.
This is why we have a buddy system. We pair up new hires with veterans from their own teams who can be their mentor (much like Yoda).
This mentor is the point of contact for the new hire and the gateway to the pulse of Scrapinghub. Mentors are crucial to the continued success of our remote lifestyle. Mentors help with all questions big or small and are always available for reassurance and feedback.
Confidence is one of the foundations of independence and we strongly believe in working with self-assured coworkers who are excited about taking control of their own projects.
Open Lines of Communication
Since Scrapinghub was founded around Scrapy, the Open Source project created by our co-founders, most of our culture was inherited from the Open Source community. This includes an atmosphere of collaboration and problem-solving alongside innovation. While this is a great environment for independent individuals who enjoy working on passion projects, it can be challenging to keep everyone on the same page.
Factor in time zones, language barriers, and varying levels of technical expertise, and it’s like we’re attempting to herd cats. This is why it is critical for us to keep many lines of communication open.
Here are some of the methods that we use to ensure that our teams are all on the same page:
1. Weekly Newsletter
This internal newsletter covers new projects, products, features, and new hires, and it features remarkable Scrapinghubbers every week. It provides a weekly overview of achievements within the company.
2. Slack
We love Slack. Slack is easily our go-to communication route between Scrapinghubbers. It’s immediate, clearly shows who’s online (important with our various timezones), and has a ton of different features.
We have many channels that not only cover clients and projects, but also different topics of interest to Scrapinghubbers. These interests include #radiohub to share music, #beer for beer lovers, and channels for our individual countries, regions, and everything else.
3. Office Tour Week
We don’t have cubicles to show off. Instead, we work from places like:
This is how the Office Tour Week started. It is a way for us to share our lives with our distributed colleagues. We remain connected in spite of the distance through these exchanges and glimpses into each other’s lives.
The “tour” takes place on our Google+ Community where people share photos and/or videos from their workplaces and their lives.
We plan to continue to grow with our tried-and-true roadmap. In case you’re struggling with developing a company culture that is realistic, productive, and cohesive, try out the methods that have worked for us.
From finding the right candidates to hiring a cultural fit and maintaining open lines of communication with a mentor, it is a process. Please share yours and let us know what tips you’ve picked up along the way!
*Note: This post was coauthored by Rocio Aramberri and Cecilia Haynes
We recently released Dateparser 0.3.1 with support for Belarusian and Indonesian, as well as the Jalali calendar used in Iran and Afghanistan. With this in mind, we’re taking the opportunity to introduce and demonstrate the features of Dateparser.
Dateparser is an open source library we created to parse dates written using natural language into Python. It translates specific dates like ‘5:47pm 29th of December, 2015’ or ‘9:22am on 15/05/2015’, and even relative times like ‘10 minutes ago’, into Python datetime objects. From there, it’s simple to convert the datetime object into any format you like.
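As a small illustration of that last step, here is a sketch that uses only the standard library. The `parsed` value is built by hand and merely stands in for what a parser might return for ‘5:47pm 29th of December, 2015’.

```python
from datetime import datetime

# Stand-in for a parsed result; constructed by hand for illustration.
parsed = datetime(2015, 12, 29, 17, 47)

# ISO 8601 is convenient for storage and sorting.
iso_format = parsed.isoformat()

# strftime covers any custom layout you might need.
custom_format = parsed.strftime("%d/%m/%Y %H:%M")

print(iso_format)     # 2015-12-29T17:47:00
print(custom_format)  # 29/12/2015 17:47
```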
Who benefits from Dateparser
When scraping dates from the web, you need them in a structured format so you can easily search, sort, and compare them.
Dates written in natural language aren’t suitable for this. For example, the 24th of December shows up first if you sort the 25th of November and the 24th of December alphanumerically. Dateparser solves this by taking the natural language date and parsing it into a datetime object.
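The sorting problem is easy to reproduce. In this minimal sketch the datetime values are written by hand (with an arbitrarily chosen year) rather than produced by Dateparser, so it runs with the standard library alone.

```python
from datetime import datetime

raw_dates = ["25th of November", "24th of December"]

# String sorting compares characters, so "24..." comes before "25...".
alphanumeric = sorted(raw_dates)

# Hand-written stand-ins for the parsed values (the year is an assumption).
parsed = {
    "25th of November": datetime(2015, 11, 25),
    "24th of December": datetime(2015, 12, 24),
}

# Sorting by the datetime objects gives the chronological order.
chronological = sorted(raw_dates, key=parsed.get)

print(alphanumeric)   # ['24th of December', '25th of November']
print(chronological)  # ['25th of November', '24th of December']
```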
A bonus perk of Dateparser is that you don’t need to worry about translation. It supports a range of languages including English, French, Spanish, Russian, and Chinese. Better yet, Dateparser autodetects languages, so you don’t need to write any additional code.
This makes Dateparser especially useful when you want to scrape data from websites in multiple languages, such as international job boards or real estate listings, without necessarily knowing what language the data you’re scraping is in. Think Indeed.
Why we didn’t use similar libraries
Dateparser was developed while we were working on a broad crawl project that involved scraping many forums and blogs. The websites were written in numerous languages and we needed to parse dates in a consistent format.
None of the existing solutions met our needs. So we created Dateparser as a simple set of functions that sanitised the input and passed it to the dateutil library, using parserinfo objects to work with other languages.
This process worked well at first. But as the crawling project matured we ran into problems with short words, strings containing several languages, and a host of other issues. We decided to move away from parserinfo objects and handle language translation on our own. With the help of contributors from the Scrapy community, we significantly improved Dateparser’s language detection feature and made it easier to add languages by using YAML to store the translations.
Dateparser is simple to use and highly extendable. We have successfully used it to extract dates on over 100 million web pages. It’s well tested and robust.
Peeking under the hood
You can install Dateparser via pip. Import the library and use the dateparser.parse method to try it out:
$ pip install dateparser
…
$ python
>>> import dateparser
>>> dateparser.parse('1 week and one day ago')
datetime.datetime(2015, 9, 27, 0, 17, 59, 738361)
Contributing new languages
Supporting new languages is simple. If yours is missing and you’d like to contribute, send us a pull request after updating the languages.yaml file. Here is what the definitions for French look like:
name: French
skip: ["le", "environ", "et", "à", "er"]
monday:
    - Lundi
...
november:
    - Novembre
    - Nov
december:
    - Décembre
    - Déc
...
year:
    - an
    - année
    - années
...
simplifications:
    - avant-hier: 2 jour
    - hier: 1 jour
    - aujourd'hui: 0 jours
    - d'une: 1
    - un: 1
    - une: 1
    - (\d+)\sh\s(\d+)\smin: \1h\2m
    - (\d+)h(\d+)m?: \1:\2
    - moins\s(?:de\s)?(\d+)(\s?(?:[smh]|minute|seconde|heure)): \1\2
When parsing dates, you don’t need to set the language explicitly. Dateparser will detect it for you:
$ python
>>> import dateparser
>>> dateparser.parse('aujourd\'hui')  # French for 'today'
datetime.datetime(2015, 10, 13, 12, 3, 19, 262752)
>>> dateparser.parse('il ya 2 jours')  # French for '2 days ago'
datetime.datetime(2015, 10, 11, 12, 3, 19, 262752)
See the documentation for more examples.
How we measure up
Dateutil is the most popular Python library to parse dates. Dateparser actually uses dateutil.parser as a base, and builds its features on top of it. However, Dateutil was designed for formatted dates, e.g. ‘22-10-15’, rather than natural language dates such as ‘10pm yesterday’.
Parsedatetime is closer to Dateparser in that it also parses natural language dates. One advantage of Parsedatetime is that it supports future relative dates like ‘tomorrow’. However, while Parsedatetime also supports non-English languages, you must specify the locale manually, whereas Dateparser detects the language for you.
Parsedatetime also requires more boilerplate code. Compare:
from datetime import datetime
from time import mktime

import parsedatetime

cal = parsedatetime.Calendar()
time_struct, parse_status = cal.parse('today')
datetime.fromtimestamp(mktime(time_struct))
import dateparser
dateparser.parse('today')
If you are dealing with multiple languages and want a simple API with no unnecessary boilerplate, then Dateparser is likely a good fit for your needs.
Give Dateparser a go
Dateparser is an extensible, easy-to-use, and effective method for parsing international dates from websites. Its unique features arose from the specific problem we needed to address, namely parsing dates from websites whose language we did not know in advance.
The library has been well tested against a large number of sites in over 20 languages and we continue to refine and improve it. Contributors are most welcome, so if you’re interested, please don’t hesitate to get involved!
Crawling vast numbers of websites for specific types of information is impractical. Unless, that is, you prioritize what you crawl. Aduana is an experimental tool that we developed to help you do that. It’s a special backend for Frontera, our tool to expedite massive crawls in parallel (primer here).
Aduana is designed for situations where the information you’re after is dispersed all over the web. It provides you with two link analysis algorithms, PageRank and HITS (how they work). These algorithms analyze link structure and page content to identify relevant pages. Aduana then prioritizes the pages you’ll crawl next based on this information.
You can use Aduana for a number of things. For instance:
- Analyzing news.
- Searching locations and people.
- Performing sentiment analysis.
- Finding companies to classify them.
- Extracting job listings.
- Finding all sellers of certain products.
Concrete example: imagine that you’re working for a travel agency and want to monitor trends on location popularity. Marketing needs this to decide what locations they’ll put forward. One way you could do this is by finding mentions of locations on news websites, forums, and so forth. You’d then be able to observe which locations are trending and which ones are not. You could refine this even further by digging into demographics or the time of year – but let’s keep things simple for now.
This post will walk you through that scenario. We’ll explore how you can extract locations from web pages using geotagging algorithms, and feed this data into Aduana to prioritize crawling the most relevant web pages. We’ll then discuss how performance bottlenecks led us to build Aduana in the first place. Lastly, we’ll give you its backstory and a glimpse at what’s coming next.
In case you’d like to try this yourself as you read, you’ll find the source code for everything discussed in this post in the Aduana repository’s examples directory.
Using GeoNames and Neural Networks to Guide Aduana
Geotagging is a hard problem
Extracting location names from arbitrary text is a lot easier said than done. You can’t just take a geographical dictionary (a gazetteer) and match its contents against your text. Two main reasons:
- Location names can look like common words.
- Multiple locations can share the same name.
For instance, GeoNames references 8 locations called Hello, one called Hi, and a whopping 95 places called Hola.
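To see why naive matching misfires, consider this minimal sketch. The four-entry gazetteer is a hypothetical stand-in for GeoNames, and the sample sentence is made up.

```python
import re

# Hypothetical miniature gazetteer; a real one holds millions of names.
GAZETTEER = {"hello", "hi", "hola", "bilbao"}


def naive_locations(text):
    """Return every word of the text that also appears in the gazetteer."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word in GAZETTEER]


matches = naive_locations("Hello from Bilbao, and hi to everyone!")
print(matches)  # ['hello', 'bilbao', 'hi']
```

Only one of the three matches is an actual location mention. The other two are greetings that happen to share a name with a place.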
Using Named-Entity Recognition to Extract Location Names
Our first idea was to use named-entity recognition (NER) to reduce the number of false positives and false negatives. A NER tokenizer classifies elements in a text into predefined categories such as person names or location names. We figured we’d run our text through the NER and match the relevant output against the gazetteer.
CLAVIN is one of the few open source libraries available for this task. It uses heuristics that disambiguate locations based on the text’s context.
We built a similar solution in Python. It’s based on NLTK, a library for natural language processing that includes a part of speech (POS) tagger. A POS tagger is an algorithm that assigns parts of speech – e.g. noun or verb – to each word. This lets us extract geopolitical entities (GPEs) from a text directly, and then match the GPEs against GeoNames to get candidate locations.
The result is better than naive location name matching but not bulletproof. Consider the following paragraph from this blog entry to illustrate:
Bilbao can easily be seen in a weekend, and if you’re really ambitious, squeezed into 24 hours. Lots of tourists drop in before heading out to San Sebastián, but a quick trip to the Bizkaia province leaves out so many beautiful places. Puerto Viejo in Algorta – up the river from Bilbao on the coast – is one of those places that gets overlooked, leaving it for only for those who’ve done their research.
Green words are locations our NER tokenizer recognized. Red words are those it missed. As you can see, it ignored Bizkaia. It also missed Viejo after incorrectly splitting Puerto Viejo into two separate words.
Undeterred, we tried a different approach. Rather than extracting GPEs using a NER and feeding them directly into a gazetteer, we’d use the gazetteer to build a neural network and feed the GPEs into that instead.
Using a Hopfield Network to Extract Location Names
We initially experimented with centrality measures using the NetworkX library. In short, a centrality measure assigns a score to each node in a graph to represent its importance. A typical one would be PageRank.
But we were getting poor results: node values were too similar. We were leaning towards introducing non-linearities to the centrality measures to work around that. Then it hit us: a special form of recurrent neural network called a Hopfield network had them built-in.
We built our Hopfield network (source code) with each unit representing a location. Unit activation values range between -1 and +1, and represent how sure we are that a location appears in a text. The network’s links connect related locations with positive weights – e.g. a city with its state, a country with its continent, or a country with neighboring countries. Links also connect mutually exclusive locations with negative weights, to address cases where different locations have identical names. This network then lets us pair each location in a text with the unit that has the highest activation.
You’d normally train such a network’s weights until you minimize an error function. But manually building a training corpus would take too much time and effort for this post’s purpose. As a shortcut, we simply set the weights of the network with fixed values (an arbitrary +1 connection strength for related locations) or using values based on GeoNames (an activation bias based on the location’s population relative to its country).
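The idea can be sketched in a few lines of plain Python. This is not the code from the repository. It is a toy continuous Hopfield-style network with a tanh update, three hand-picked units, and made-up weights and biases, all of which are assumptions for illustration.

```python
import math

# Hypothetical units: two candidate readings of "Bilbao" plus a related province.
units = ["Bilbao, ES", "Bilbao, MX", "Bizkaia, ES"]

# Symmetric link weights (illustrative values).
weights = {
    ("Bilbao, ES", "Bizkaia, ES"): 1.0,   # related locations support each other
    ("Bilbao, ES", "Bilbao, MX"): -1.0,   # same name, mutually exclusive
}

# Made-up activation biases standing in for population-based ones.
bias = {"Bilbao, ES": 0.1, "Bilbao, MX": 0.0, "Bizkaia, ES": 0.5}


def weight(a, b):
    """Look up a symmetric weight; unconnected units have weight 0."""
    return weights.get((a, b), weights.get((b, a), 0.0))


def settle(steps=20):
    """Update units one at a time until activations stabilise in [-1, +1]."""
    activation = {unit: 0.0 for unit in units}
    for _ in range(steps):
        for unit in units:
            total = bias[unit] + sum(
                weight(unit, other) * activation[other]
                for other in units
                if other != unit
            )
            activation[unit] = math.tanh(total)
    return activation


activations = settle()
best_bilbao = max(["Bilbao, ES", "Bilbao, MX"], key=activations.get)
print(best_bilbao)
```

With these toy values the Spanish reading ends up strongly activated while the Mexican one is suppressed, because the province mentioned in the same text supports only one of them.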
The example we used earlier yields the following neural network:
Colors represent the final unit activations. Warmer colors (red) show activated units; cooler colors (blue) deactivated ones.
The activation values for our example paragraph are:
The results are still not perfect but very promising. Puerto gets recognized as Puerto Viejo but is incorrectly assigned to a location in Mexico. This may look strange at first, but Mexico actually has locations named San Sebastián and Bilbao. Algorta is correctly identified but received a low score. It’s too weakly connected to the related locations in the graph. Both issues occur because we didn’t train our network.
That’s sensible and good enough for our purpose. So let’s move forward and make Aduana use this data.
Running the Spider
You can safely skip this section if you’re not running the code as you read.
To run the spider:
git clone https://github.com/scrapinghub/aduana.git
python setup.py develop
pip install -r requirements.txt
scrapy crawl locations
If you get problems with the dependencies – namely with numpy and scipy – install them manually. One at a time, in order:
pip install numpy
pip install scipy
pip install scikit-learn
Prioritizing Crawls Using Aduana
Our spider starts with a seed of around 1000 news articles. It extracts links to new pages and location data as it crawls. It then relies on Aduana to tell it what to crawl next.
Each page’s text goes through our geotagging algorithm. Our spider only keeps locations with an activation score of 0.90 or above. It outputs the rows in a locations.csv file. Each row contains the date and time of the crawl, the GeoName ID, the location’s name, and the number of times the location appears in the page.
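The filtering and output step could look roughly like the following sketch. The field names, the sample detections, and the placeholder GeoName IDs are hypothetical, and the rows go to an in-memory buffer instead of a real locations.csv file.

```python
import csv
import io
from datetime import datetime

ACTIVATION_THRESHOLD = 0.90  # threshold stated in the text


def location_rows(crawled_at, detections):
    """Yield one CSV row per location that passes the activation threshold."""
    for detection in detections:
        if detection["activation"] >= ACTIVATION_THRESHOLD:
            yield [
                crawled_at.isoformat(),
                detection["geoname_id"],
                detection["name"],
                detection["count"],
            ]


# Made-up detections; the IDs are placeholders, not real GeoName IDs.
detections = [
    {"geoname_id": 1, "name": "Bilbao", "activation": 0.97, "count": 2},
    {"geoname_id": 2, "name": "Algorta", "activation": 0.40, "count": 1},
]

rows = list(location_rows(datetime(2015, 10, 1, 12, 0), detections))

buffer = io.StringIO()  # stands in for the locations.csv file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```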
Our spider uses this data to compute a score for each page. The score is a function of the proportion of locations in the text:
Aduana then uses these scores as inputs for its HITS algorithm. This allows it to identify which pages are most likely to mention locations and rank pages on the fly.
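For readers unfamiliar with HITS, here is a textbook version in plain Python. It is only a sketch of the basic algorithm on a made-up four-page graph. It does not show how Aduana folds page scores into the computation, which this post does not detail.

```python
import math


def normalise(scores):
    """Scale scores so that their Euclidean norm is 1."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {node: value / norm for node, value in scores.items()}


def hits(graph, iterations=50):
    """Return (hub, authority) scores for a dict of page -> outgoing links."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        authority = {node: 0.0 for node in nodes}
        for source, targets in graph.items():
            for target in targets:
                authority[target] += hub[source]
        authority = normalise(authority)
        # A page is a good hub if it links to good authorities.
        hub = normalise(
            {node: sum(authority[t] for t in graph.get(node, ())) for node in nodes}
        )
    return hub, authority


# Hypothetical link graph.
toy_graph = {
    "hub1": ["news1", "news2"],
    "hub2": ["news1"],
    "news1": [],
    "news2": [],
}
hub_scores, authority_scores = hits(toy_graph)
top_hub = max(hub_scores, key=hub_scores.get)
top_authority = max(authority_scores, key=authority_scores.get)
print(top_hub, top_authority)
```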
Our spider is still running as we write this. Aduana ranked about 2.87M web pages. Our spider crawled 129k. And our geotagging algorithm found over 300k locations on these pages. The results so far look good.
In a future blog post we’ll dig into our findings and see if anything interesting stands out.
To conclude this section, Aduana helps a lot to guide your crawler when running broad crawls. And depending on what you’re using it for, consider implementing some kind of machine learning algorithm to extract the information you need to compute page scores.
Aduana Under the Hood
Performance problems with graph databases
We actually wanted to use a graph database at first. We figured we’d pick one, write a thin layer of code so Frontera can use it as a storage backend, and call it a day. This was the logical choice because the web is a graph and popular graph databases include link analysis algorithms out of the box.
We chose GraphX because it’s open source and works with HDFS. We ran the provided PageRank application on the LiveJournal dataset. The results were surprisingly poor. We cancelled the job 90 minutes in. It was already consuming 10GB of RAM.
Wondering why, we ended up reading GraphChi: Large-Scale Graph Computation on Just a PC. It turns out that running PageRank on a 40 million node graph with a distributed system is only twice as fast as running it with GraphChi on a single computer. That eye-opener made us reconsider whether we should distribute the work.
We tested a few non-distributed libraries with the same LiveJournal data. The results:
To be clear, take these timings with a grain of salt. We only tested each library for a brief period of time and we didn’t fine-tune anything. What counted for us was the order of magnitude. The non-distributed algorithms were much faster than the distributed ones.
We then came across two more articles that confirmed what we had found: Scalability! But at what COST? and Bigger data; same laptop. The second one is particularly interesting. It discusses computing PageRank on a 128 billion edge graph using a single computer.
In the end, SNAP wasn’t fast enough for our needs, and FlashGraph and X-Stream were rather complex. We realized it would be simpler to develop our own solution.
Aduana Is Built on Top of LMDB
With this in mind, we began implementing a Frontera storage engine on top of a regular key-value store: LMDB. Aduana was born.
We considered using LevelDB. It might have been faster. LMDB is optimized for reads after all; we do a lot of writing. We picked LMDB regardless because:
- It’s easy to deploy: only 3 source files written in C.
- It’s plenty fast.
- Multiple processes can access the database simultaneously.
- It’s released under the OpenLDAP license: open source and non-GPL.
When Aduana needs to recompute PageRank/HITS scores, it stores the vertex data as a flat array in memory, and streams the links between them from the database.
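The access pattern can be illustrated with a small Python sketch. Aduana itself is written in C on top of LMDB, so everything here is a stand-in: the `stream_edges` generator plays the role of the database scan, and the graph is a made-up four-vertex example.

```python
from array import array


def stream_edges():
    """Stand-in for streaming (source, target) links from the key-value store."""
    yield from [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]


def pagerank_pass(rank, out_degree, damping=0.85):
    """One PageRank iteration: vertex data in flat arrays, edges streamed once."""
    n = len(rank)
    new_rank = array("d", [(1.0 - damping) / n] * n)
    for source, target in stream_edges():
        new_rank[target] += damping * rank[source] / out_degree[source]
    return new_rank


n_vertices = 4

# Out-degrees are also computed in a single streaming pass.
out_degree = array("L", [0] * n_vertices)
for source, _ in stream_edges():
    out_degree[source] += 1

rank = array("d", [1.0 / n_vertices] * n_vertices)
for _ in range(50):
    rank = pagerank_pass(rank, out_degree)

best_vertex = max(range(n_vertices), key=lambda i: rank[i])
print(best_vertex, list(rank))
```

Only the per-vertex arrays need to fit in memory. The edge list, which is far larger, is read sequentially on each pass.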
Using LMDB and writing the algorithms in C was well worth the effort in the end. Aduana now fits a 128 billion edge graph onto a 1TB disk, and its vertex data in 16GB of memory. Which is pretty good.
Let’s now segue into why we needed any of this to begin with.
Aduana’s Backstory and Future
Aduana came about when we looked into identifying and monitoring specialized news and alerts on a massive scale for one of our Professional Services clients.
To cut a long story short, we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens.
We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month. And Crawlera, our smart proxy rotator, lets them work around crawler countermeasures when needed.
But no. The sorry reality is that relevant pages and updates were still coming in too slowly. We needed to prioritize more. That got us into link analysis and ultimately to Aduana itself.
We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after.
It’s still experimental. And not production-ready yet. Our next steps will be to improve how Aduana decides to revisit web pages. And make it play well with Distributed Frontera.
Scrapy is one of the few popular Python packages (almost 10k GitHub stars) that’s not yet compatible with Python 3. The team and community around it are working to make it compatible as soon as possible. Here’s an overview of what has been happening so far.
You’re invited to read along and participate in the porting process from Scrapy’s GitHub repository.
First off you may be wondering: “why is Scrapy not in Python 3 yet?”
If you asked around you likely heard an answer like “It’s based on Twisted, and Twisted is not fully ported yet, you know?”. Many blame Twisted, but other things are actually holding back the Scrapy Python 3 port.
When it comes to Twisted, its most important parts are already ported. But if we want Scrapy spiders to download from HTTP urls, we are really going to need a fix or workaround on Twisted’s http agent, since it doesn’t work on Python 3.
The Scrapy team started to make moves towards Python 3 support two years ago by porting some of the Scrapy dependencies. For about a year now a subset of Scrapy tests is executed under Python 3 on each commit. Apart from Twisted, one bottleneck that was blocking the progress for a while was that most of Scrapy requires Request or Response objects to work – which was recently resolved as described below.
Scrapy core devs meet and prepare a sprint
During EuroPython 2015, held from the 20th to the 28th of July, several Scrapy core developers gathered in Bilbao and made progress on the porting process. It was their first time meeting in person after several years of working together.
There was a Scrapy sprint scheduled for the weekend. Mikhail Korobov, Daniel Graña, and Elias Dorneles teamed up to prepare Scrapy for it by porting Request and Response in advance. This way, it would be easier for other people to join and contribute during the weekend sprint.
In the end, time was short for them to fully port Request and Response before the weekend sprint. Some of the issues they faced were:
- Should HTTP headers be bytes or unicode? Is it different for keys and values? Some header values are usually UTF-8 (e.g. cookies); HTTP Basic Auth headers are usually latin1; for other headers there is no single universal encoding either. Generally, bytes for HTTP headers make sense, but there is a gotcha: if you’re porting an existing project from Python 2.x to 3.x, code which was working before may start silently producing incorrect results. For example, let’s say there is a response with an ‘application/json’ content type. If header values are bytes, in Python 2.x `content_type == 'application/json'` will return True, but in Python 3.x it will return False because you’re then comparing a unicode literal with bytes.
- How to percent-escape and unescape URLs properly? Proper escaping depends on web page encoding and on a part of URL being escaped. This matters if a webpage author uses non-ascii URLs. After some experiments we found that browsers are doing crazy things here: URL path is encoded to UTF-8 before escaping, but the query string is encoded to web page encoding before escaping. You can’t trust browser UIs to check that. What they are sending to servers is consistent, namely a UTF-8 path and page-encoding for the query string – in each of Firefox and Chrome on OS X and Linux. But what they display to users depends on their browser and operating system.
- URL-related functions are very different in Python 2.x and 3.x. In Python 2.x they only accept bytes, while in Python 3.x they only accept unicode. Combined with the encoding craziness, this makes porting the code harder still.
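Both gotchas are easy to reproduce on Python 3 with the standard library. The snippet below is a minimal sketch. The header value and the sample URL parts are made up, and latin-1 stands in for whatever encoding a page might declare.

```python
from urllib.parse import quote

# Headers as bytes: comparing with a str literal is silently False on Python 3.
content_type = b"application/json"
bytes_vs_text = content_type == "application/json"
decoded_vs_text = content_type.decode("latin1") == "application/json"

# Escaping depends on the encoding applied before percent-encoding.
# Path: encoded as UTF-8, as browsers do.
path_escaped = quote("/café", encoding="utf-8")
# Query string: encoded with the page encoding (latin-1 assumed here).
query_escaped = quote("café", encoding="latin-1")

print(bytes_vs_text, decoded_vs_text)  # False True
print(path_escaped)                    # /caf%C3%A9
print(query_escaped)                   # caf%E9
```

The same character ends up as two different byte sequences depending on which part of the URL it sits in, which is why a single escaping helper is not enough.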
The EuroPython sprint weekend arrives
To unblock further porting, the team decided to use bytes for HTTP headers and temporarily disable some of the tests for non-ascii URL handling, thus eliminating the two bottlenecks that were holding things back.
The sprint itself was quiet but productive. Some highlights on what developers did there:
- They unblocked the main problem with the port, the handling of urls and headers, and then migrated the Request and Response classes so it’s finally possible to divide the work and port each component independently;
- Elias split the Scrapy Selectors into a separate library (called Parsel), which reached a stable point and which Scrapy now depends on (there is work being done in the documentation to make an official release);
- Mikhail and Daniel ported several Scrapy modules to make further contributions easier;
- A mystery contributor came, silently ported a Scrapy module, and left without a trace (please reach out!);
- Two more people joined. They were completely new to Scrapy and had fun setting up their first project.
The road is long, but the path is clear!
In the end the plan worked as expected. After porting Request and Response objects and making some hard decisions, the road to contributions is open.
In the weeks that followed the sprint, developers continued to work on the port. They also got important contributions from the community. As you can see here, Scrapy already got several pull requests merged (for example from @GregoryVigoTorres and @nyov).
Before the sprint there were ~250 tests passing in Python 3. The number is now over 600. These recent advances helped increase our test coverage under Python 3 from 19% to 54%.
Our next major goal is to port the Twisted HTTP client so spiders can actually download something from remote sites.
There is still a long way to go before full Python 3 support, but when it comes to Python 3 porting, Scrapy is in much better shape now. Join us and contribute to porting Scrapy following these guidelines. We have added a badge to GitHub to show the progress of Python 3 support. The percentage is calculated using the tests that pass in Python 3 vs the total number of tests available. Currently, 633 tests pass on Python 3 out of 1153 in total.
Thanks for reading!
- Navigating to your spider in Portia.
- Opening the Crawling tab.
- Clicking the Enable JS checkbox.
Using your own Splash
If you already have your own dedicated Splash instance, you can enable it for your project by adding its URL and your API key to the Portia addon in your project settings. If you would like to request your own Splash instance, please visit your organization’s dashboard page. If you would like to learn more about Splash, you can do so here.
Over the last half year we have been working on a distributed version of our frontier framework, Frontera. This work was partially funded by DARPA and is going to be included in the DARPA Open Catalog on Friday 7th.
The project came about when a client of ours expressed interest in building a crawler that’s able to identify frequently changing hub pages.
In case you aren’t familiar, hub pages are pages that contain a large number of outgoing links to authority sites. Authority sites are sites with a high authority score, which is determined by the relevance of the page.
The client wanted us to build a crawler that would crawl around 1 billion pages per week, batch process them, output the hub pages that change with some predefined periodicity, and keep running to pick up the latest changes.
We began by building a single-threaded crawl frontier framework that lets us store and generate batches of documents to crawl and works seamlessly with the Scrapy ecosystem.
This is basically the original Frontera, intended to solve:
- Cases when one needs to isolate URL ordering/queueing from the spider, e.g. a distributed infrastructure or a need for remote management of ordering/queueing.
- Cases when URL metadata storage is needed e.g. to demonstrate its contents somewhere.
- Cases when one needs advanced URL ordering logic. If a website is big and it’s expensive to crawl the whole website, Frontera can be used for crawling the most important documents.
Single-thread Frontera has two storage backends: memory and SQLAlchemy. You can use any RDBMS of your choice such as SQLite, MySQL, Postgres. If you wish to use your own crawling strategy, it should be programmed in the backend and spider code.
You can find the repository here.
Later, we began investigating how to scale the existing solution and make it work with an arbitrary number of documents. Considering our access pattern, we were interested in a scalable key-value store that could offer random reads and writes as well as efficient batch processing.
Since communicating with HBase directly from Python isn’t reliable, our next choice was to use Kafka as a communication layer. As a bonus, we got partitioning by domain name, which makes it easier to ensure each domain is downloaded by at most one spider, and also the ability to replay the log out of the box, which can be useful when changing the crawling strategy on the fly.
This resulted in the architecture outlined below:
Let’s start with spiders. The seed URLs defined by the user inside spiders are propagated to strategy workers and DB workers by means of a Kafka topic named ‘Spider Log’. Strategy workers decide which pages to crawl using HBase’s state cache, assign a score to each page, and send the results to the ‘Scoring Log’ topic.
The DB worker stores all kinds of metadata, including content and scores. It checks the spiders’ consumer offsets, generates new batches if needed, and sends them to the ‘New Batches’ topic. Spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the ‘Spider Log’ topic, where they are stored and scored. That’s it: we have a closed loop.
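The loop described above can be simulated in a single process to see how the pieces feed each other. This is only a toy sketch: plain Python queues stand in for the three Kafka topics, a dictionary stands in for the web, and the scoring rule is a placeholder. `FAKE_WEB`, `run_crawl`, and all variable names are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for real page downloads.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}


def run_crawl(seeds, batch_size=2):
    spider_log = deque(seeds)  # stands in for the 'Spider Log' topic
    scoring_log = deque()      # stands in for the 'Scoring Log' topic
    new_batches = deque()      # stands in for the 'New Batches' topic
    seen = set()               # strategy worker's state: URLs already scored
    queue = []                 # DB worker's stored (url, score) pairs
    crawled = []

    while spider_log or queue:
        # Strategy worker: score every URL it has not seen before.
        while spider_log:
            url = spider_log.popleft()
            if url not in seen:
                seen.add(url)
                score = 1.0 / (1 + url.count("/"))  # placeholder scoring model
                scoring_log.append((url, score))
        # DB worker: store the scores and generate the next batch.
        while scoring_log:
            queue.append(scoring_log.popleft())
        if queue:
            queue.sort(key=lambda item: item[1], reverse=True)
            new_batches.append([url for url, _ in queue[:batch_size]])
            del queue[:batch_size]
        # Spider: "download" each page and report the extracted links.
        while new_batches:
            for url in new_batches.popleft():
                crawled.append(url)
                spider_log.extend(FAKE_WEB.get(url, []))
    return crawled


pages = run_crawl(["http://a.example/"])
```

Starting from a single seed, the simulation visits every reachable page exactly once and then stops because nothing new arrives in the spider log.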
The main advantage of this design is real-time operation. The crawling strategy can be changed without having to stop the crawl. It’s worth mentioning that the crawling strategy can be implemented as a separate module containing the logic for checking the crawling stopping condition, URL ordering, and the scoring model.
Distributed Frontera is polite to web hosts by design: each host is downloaded by no more than one spider process, which is achieved through Kafka topic partitioning. All distributed Frontera components are written in Python, which is much easier to customize than C++ or Java, the most common languages for large-scale web crawlers.
Here are the use cases for the distributed version:
- You have a set of URLs and need to revisit them, e.g. to track changes.
- Building a search engine with content retrieval from the Web.
- All kinds of research work on the web graph: gathering link statistics, structure of the graph, tracking domain count, etc.
- You have a topic and you want to crawl the documents about that topic.
- More general focused crawling tasks: e.g. searching for pages that are large hubs and change frequently.
One spider thread can crawl around 1200 pages/minute from 100 web hosts in parallel. The ratio of spiders to workers is about 4:1 without content; storing content could require more workers.
Each strategy worker instance requires 1 GB of RAM to run the state cache, which can be tuned. For example, to crawl at 15K pages/minute you need 12 spiders and 3 pairs of strategy and DB workers; together these processes consume 18 cores.
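The sizing example can be reproduced with a little arithmetic. The sketch below uses only the figures quoted above and assumes one core per process, which is what makes 12 spiders plus 3 worker pairs add up to 18 cores. The function and constant names are made up for the example.

```python
import math

PAGES_PER_SPIDER_PER_MINUTE = 1200  # "around 1200 pages/minute" per spider
SPIDERS_PER_WORKER_PAIR = 4         # spiders-to-workers ratio of about 4:1
RAM_GB_PER_STRATEGY_WORKER = 1      # state cache size before any tuning


def plan_cluster(spiders):
    """Estimate throughput and resources for a given number of spiders."""
    pairs = math.ceil(spiders / SPIDERS_PER_WORKER_PAIR)
    return {
        "pages_per_minute": spiders * PAGES_PER_SPIDER_PER_MINUTE,
        "worker_pairs": pairs,
        # Each pair is one strategy worker plus one DB worker; assume one core per process.
        "processes": spiders + 2 * pairs,
        "strategy_worker_ram_gb": pairs * RAM_GB_PER_STRATEGY_WORKER,
    }


plan = plan_cluster(12)
```

With 12 spiders this gives 14,400 pages/minute (roughly the 15K quoted), 3 worker pairs, and 18 processes.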
Using distributed Frontera
You can find the repository here.
Thanks for reading! If you have any questions or comments please share your thoughts below.
1. Image taken from http://soltisconsulting.co/2013/08/28/the-pagerank-periodicals-an-educational-presentation-and-interpretive-study-part-1b-introduction-of-the-hypertext-induced-topic-search-hits-algorithm/
4. HBase was modeled after Google’s BigTable system, and you may find this paper useful to better understand it: http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
6. A strategy worker can be implemented in a different language; the only requirement is a Kafka client.
As with everything in software, we started out by investigating what our requirements were and what others had done in this situation. We were looking for a solution that was reliable and would allow for reproducible interaction with the web pages.
Reliability: A solution that could render the pages in the same way during spider creation and crawling.
Interaction: A system that would allow us to record the user’s actions so that they could be replayed while crawling.
The investigation produced some interesting and some crazy ideas. Here are the ones we probed further:
1. Placing the Portia UI inside a browser add-on and taking advantage of the additional privileges to read from and interact with the page.
2. Placing the Portia UI inside a bookmarklet and, after doing some post-processing of the page on our server, allowing interaction with the page.
3. Rendering a static screenshot of the page with the coordinates of all its elements and sending them to the UI. Interaction involves re-rendering the whole screenshot.
4. Rendering a tiled screenshot of the page along with coordinates. When an interaction event is detected, the representation is updated on the server and the updated tiles are sent to the UI to be rendered along with the updated DOM.
5. Rendering the page in an iframe with a proxy to avoid cross-origin issues and disable unwanted activity.
6. Rendering the page on the server and sending the DOM to the user. Whenever the user interacts with the page, the server would forward any changes that happen as a result of that interaction.
7. Building a desktop application using WebKit to have full control over the UI, page rendering and everything else we might need.
8. Building an internal application using WebKit to run on a server accessible through a web-based VNC.
We rejected 7 and 8 because they would raise the barrier to entry for Portia and make it more difficult to use. This approach is used by Import.io for their spider creation tool.
1 and 2 were rejected because it would be hard to fit the whole Portia UI into an add-on in the way we’d prefer, although we may revisit these options in the future. ParseHub and Kimono use these methods to great effect.
3 and 4 were investigated further, inspired by the work done by LibreOffice for their Android document editor. In the end, though, the approach was clunky, and we could achieve better performance by sending DOM updates rather than image tiles.
The solution we have now built is a combination of 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user allowing the page to be loaded and interacted with in a controlled manner.
We looked at using existing solutions including Selenium, PhantomJS and Splash. All of these technologies are wrappers around WebKit providing domain-specific functionality. We use Splash for our browser not because it is a Scrapinghub technology but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.
The server-side browser gets input from the user. WebSockets are used to send events and DOM updates between the user and the server. Initially we looked at React’s virtual DOM, and while it worked, it wasn’t perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. In conjunction with the Mutation Summary library, this allows us to update the page in the UI when the user interacts with it.
We now proxy all of the resources that the page needs rather than loading them from the host. The advantage of this is that we can load resources from the cache in our server-side browser or from the original host, and provide SSL protection to the resources if the host doesn’t already provide it.
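One way to picture resource proxying is as a URL rewrite: every resource reference on the page is resolved against the page URL and then pointed at a proxy endpoint, which can serve it from a cache or fetch it from the original host. The sketch below shows only that rewrite step. The endpoint `https://portia.example/proxy`, the `url` query parameter, and the function name are all hypothetical, and this is not Portia’s actual implementation.

```python
from urllib.parse import quote, urljoin


def proxied_url(resource_url, page_url, proxy_base="https://portia.example/proxy"):
    """Rewrite a resource URL so that it is fetched through a (hypothetical) proxy."""
    # Resolve relative references such as "/img/logo.png" against the page URL.
    absolute = urljoin(page_url, resource_url)
    # Percent-encode the whole target so it survives as a single query parameter.
    return f"{proxy_base}?url={quote(absolute, safe='')}"


rewritten = proxied_url("/img/logo.png", "http://shop.example/products/1")
```

Because the rewritten URL always points at an HTTPS endpoint, the browser loads the resource over SSL even when the original host only offers plain HTTP.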
For now we’re very happy with how it works and hope it will make it easier for users to extract the data they need.
If you have any ideas for features you would like to see in Portia, leave a comment below!