
Google Summer of Code 2015

We are very excited to be participating in Google Summer of Code again this year. After a successful experience last year, when Julia Medina (now a proud Scrapinghubber!) worked on Scrapy's API cleanup and per-spider settings, we are back this year with 3 ideas approved.

We would like to thank the Python Software Foundation for taking us on again this year, and we wish the best of luck to our students and all Summer of Code participants.

Link Analysis Algorithms Explained

When scraping content from the web, you often crawl websites which you have no prior knowledge of. Link analysis algorithms are incredibly useful in these scenarios to guide the crawler to relevant pages.

This post aims to provide a lightweight introduction to page ranking algorithms so you have a better understanding of how to implement and use them in your spiders. There will be a follow up post soon detailing how to use these algorithms in your crawl.

PageRank

PageRank is perhaps the most well known page ranking algorithm. The PageRank algorithm uses the link structure of the web to calculate a score. This score provides insight into how relevant the page is and in our case can be used to guide the crawler. Many search engines use this score to influence search results.

One possible way of coming up with PageRank is by modelling the surfing behavior of a user on the web graph. Imagine that at a given instant of time t the surfer, S, is on page i. We denote this event as S_t = i. Page i has n_i outgoing and m_i incoming links as depicted in the figure.

[Figure: a web graph showing page i with its incoming and outgoing links]

While on page i, the surfer can do one of the following to continue navigating:

  1. With probability c, follow a randomly chosen outgoing link.
  2. With probability 1 - c, select an arbitrary page on the web uniformly at random. When this happens we say that the surfer has teleported to another page.

So what happens if a page has no outgoing links? It’s reasonable to assume the user won’t stay there, so we assume they will ‘teleport’, meaning they will visit another page through different means, such as entering the address manually.
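
To make the model concrete, here is a minimal Monte Carlo sketch of the random surfer (the toy graph and all names are illustrative, not taken from any library): counting how often each page is visited approximates the probabilities we derive next.

```python
import random

# Toy web graph: page index -> list of pages it links to (illustrative only).
links = {0: [1, 2], 1: [2], 2: []}
c = 0.85            # probability of following an outgoing link
n_steps = 100_000

visits = {page: 0 for page in links}
page = 0
for _ in range(n_steps):
    visits[page] += 1
    out = links[page]
    if out and random.random() < c:
        page = random.choice(out)            # follow a random outgoing link
    else:
        page = random.choice(list(links))    # teleport to a random page

# For large n_steps the visit frequencies approximate P(S_t = i).
frequencies = {p: count / n_steps for p, count in visits.items()}
print(frequencies)
```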

Now that we have a model of the user behavior let’s calculate P(S_t = i): the probability of being at page i at time instant t. The total number of pages is N.

P\left(S_t = i\right) = \sum_{j=1}^N P\left(S_{t} = i \mid S_{t-1} = j\right)P\left(S_{t-1} = j\right)

Now, the probability of going from page j to page i depends on whether j links to i (j \rightarrow i) or not (j \nrightarrow i).

If j \rightarrow i then:

P\left(S_{t} = i \mid S_{t-1} = j\right) = c \frac{1}{n_j} + (1 - c)\frac{1}{N}

If j \nrightarrow i but n_j \neq 0, then the only possibility is that the user chooses to teleport and lands on page i.

P\left(S_{t} = i \mid S_{t-1} = j\right) = (1 - c)\frac{1}{N}

If j \nrightarrow i and n_j = 0 then the only possibility is that the user teleports to page i.

P\left(S_{t} = i \mid S_{t-1} = j\right) = \frac{1}{N}

We have assumed uniform probabilities in two cases:

  • When following outgoing links we assume all links have equal probability of being visited.
  • When teleporting to another page we assume all pages have equal probability of being visited.

In the next section we’ll remove this second assumption about teleporting in order to calculate personalized PageRank.

Using the formulas above, splitting the sum over j into pages that link to i, pages that don't link to i but have outgoing links, and dangling pages, and using the fact that \sum_j P(S_{t-1} = j) = 1, we obtain:

P\left(S_t = i\right) = c \sum_{j \rightarrow i} \frac{P(S_{t-1} = j)}{n_j} + \frac{1}{N}\left(1 - c\sum_{n_j \neq 0}P(S_{t-1} = j)\right)

Finally, for convenience let’s call \pi_i^t = P\left(S_t = i\right):

\pi_i^t = c \sum_{j \rightarrow i} \frac{\pi_j^{t-1}}{n_j} + \frac{1}{N}\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right)

PageRank is defined as the limit of the above sequence:

\text{PageRank}(i) = \pi_i = \lim_{t \to \infty}\pi_i^t

In practice, however, PageRank is computed by iterating the above formula a finite number of times: either a fixed number or until changes in the PageRank score are low enough.
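
As an illustration, here is a minimal sketch of that iteration in Python. The graph representation, the function name, and the default damping factor c = 0.85 are our own choices for the example, not taken from any particular library; r is the teleport distribution, uniform by default. The same function is reused below for Personalized PageRank by passing a non-uniform r.

```python
def pagerank(links, n_pages, c=0.85, r=None, n_iter=50):
    """Iterate pi_i = c * sum_{j -> i} pi_j / n_j + r_i * (1 - c * sum_{n_j != 0} pi_j)."""
    if r is None:
        r = [1.0 / n_pages] * n_pages        # plain PageRank: uniform teleporting
    pi = [1.0 / n_pages] * n_pages           # start from the uniform distribution
    for _ in range(n_iter):
        new_pi = [0.0] * n_pages
        for j, out in links.items():
            if out:                          # pages with no outlinks only teleport
                share = c * pi[j] / len(out)
                for i in out:
                    new_pi[i] += share
        # probability mass that was passed along outgoing links
        linked_mass = c * sum(pi[j] for j, out in links.items() if out)
        for i in range(n_pages):
            new_pi[i] += r[i] * (1.0 - linked_mass)
        pi = new_pi
    return pi

# Toy graph: page 0 links to 1 and 2, page 1 links to 2, page 2 is a dead end.
print(pagerank({0: [1, 2], 1: [2], 2: []}, n_pages=3))
```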

If we look at the formula for \pi_i^t we see that the PageRank of a page has two parts. One part depends on how many pages are linking to the page but the other part is distributed equally to all pages. This means that all pages are going to get at least:

\frac{1}{N}\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right) \geq \frac{1 - c}{N}

This gives link spammers an opportunity to artificially increase the PageRank of any page they want by maintaining link farms, which are huge collections of pages controlled by the spammer.

As PageRank gives all pages a minimum score, all of these pages will have some PageRank that can be redirected to the page the spammer wants to push up in search results.

Spammers will try to build backlinks to their pages by linking to their sites on pages they don’t own. This is most common on blog comments and forums where content is accepted from users.

[Figure: link spam posted in blog comments and forums funneling PageRank to the spammer’s page]

Trying to detect web spam is a never-ending war between search engines and spammers. To help filter out spam we can use Personalized PageRank which works by not assigning a free score to undeserving pages.

Personalized PageRank

Personalized PageRank is obtained very similarly to PageRank, but instead of a uniform teleport probability, each page has its own probability r_i of being teleported to, irrespective of the originating page:

P\left(j \mathbin{\text{teleports to}} i \mid j\: \text{teleports}\right) = r_i

The update equations are therefore:
\pi_i^t = c \sum_{j \rightarrow i} \frac{\pi_j^{t-1}}{n_j} + r_i\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right)

Of course it must be that:

\sum_{i=1}^N r_i = 1

As you can see, plain PageRank is just the special case where r_i = 1/N.

There are several ways the score r_i can be calculated. For example, it could be computed using a text classification algorithm on the page content. Alternatively, it could be spread uniformly over a set of trusted seed pages and set to 0 for the rest, in which case we get TrustRank.
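
For instance, a TrustRank-style run with the sketch above only needs a teleport vector concentrated on the seed pages (the seed set here is made up for the example):

```python
# Reuses the pagerank() sketch from the PageRank section above.
links = {0: [1, 2], 1: [2], 2: []}
n_pages = 3

seeds = {0}                                   # hypothetical set of trusted pages
r = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n_pages)]
assert abs(sum(r) - 1.0) < 1e-9               # r must be a probability distribution

print(pagerank(links, n_pages, r=r))          # personalized PageRank / TrustRank
```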

Of course, there are ways to defeat this algorithm:

  • Good pages can link to spam pages.
  • Spam pages can manage to get good scores, for example by adding certain keywords to their content (content spamming).
  • Link farms can be improved by duplicating good pages but altering their links. An example would be mirrors of Wikipedia which add links to spam pages.

HITS

HITS (hyperlink-induced topic search) is another link analysis algorithm that assigns two scores: a hub score and an authority score. A page’s hub score is influenced by the authority scores of the pages it links to, and its authority score by the hub scores of the pages linking to it. Twitter makes use of HITS to suggest users to follow.

The idea is to compute for each page a pair of numbers called the hub and authority scores. A page is considered a good hub when it points to a lot of pages with high authority, and a page has high authority if it is pointed to by many good hubs.

The following graph shows several pages with one clear hub H and two clear authorities A_1 and A_2.

[Figure: a graph with a hub H pointing to authorities A_1 and A_2]

Mathematically this is expressed as:

h_i = \sum_{j: i \to j} a_j
a_i = \sum_{j: j \to i} h_j

Where h_i represents the hub score of page i and a_i represents its authority score.

Similar to PageRank, these equations are solved iteratively until they converge to the required precision. HITS was conceived as a ranking algorithm for user queries where the set of pages that were not relevant to the query were filtered out before computing HITS scores.

For the purposes of our crawler we make a compromise: authority scores are modulated with the topic specific score r_i to give the following modified equations:

h_i = \sum_{j: i \to j} r_j a_j
a_i = \sum_{j: j \to i} h_j

As we can see totally irrelevant pages (r_j = 0) don’t contribute back authority.
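
Here is a minimal sketch of this modified HITS iteration (the graph representation, names, and the normalization choice are ours for illustration; with r_j = 1 for every page it reduces to plain HITS):

```python
def hits(links, n_pages, r=None, n_iter=50):
    """Iterate h_i = sum_{j: i->j} r_j * a_j and a_i = sum_{j: j->i} h_j."""
    if r is None:
        r = [1.0] * n_pages                  # plain HITS: every page fully relevant
    hubs = [1.0] * n_pages
    auth = [1.0] * n_pages
    for _ in range(n_iter):
        # hub score: sum of (topic-weighted) authority scores of the pages i links to
        hubs = [sum(r[j] * auth[j] for j in links.get(i, [])) for i in range(n_pages)]
        # authority score: sum of the hub scores of the pages linking to i
        auth = [0.0] * n_pages
        for j, out in links.items():
            for i in out:
                auth[i] += hubs[j]
        # normalize so the scores stay bounded across iterations
        h_norm = sum(hubs) or 1.0
        a_norm = sum(auth) or 1.0
        hubs = [h / h_norm for h in hubs]
        auth = [a / a_norm for a in auth]
    return hubs, auth

# Toy graph: page 0 is a hub pointing at pages 1 and 2.
print(hits({0: [1, 2], 1: [2], 2: []}, n_pages=3))
```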

HITS is slightly more expensive to run than PageRank because it has to maintain two sets of scores and propagate scores twice per iteration. However, it’s particularly useful for crawling as it propagates scores back to the parent pages, providing a more accurate prediction of the strength of a link.

Further reading:

  • HITS: Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.” Journal of the ACM (JACM) 46.5 (1999): 604-632.
  • PageRank and Personalized Pagerank: Page, Lawrence, et al. “The PageRank citation ranking: Bringing order to the web.” (1999).
  • TrustRank: Gyöngyi, Zoltán, Hector Garcia-Molina, and Jan Pedersen. “Combating web spam with trustrank.” Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.

We hope this post has served as a good introduction to page ranking algorithms and their usefulness in web scraping projects. Stay tuned for our next post where we’ll show you how to use these algorithms in your Scrapy projects!

EuroPython, here we go!

We are very excited about EuroPython 2015!

33 Scrapinghubbers from 15 countries will be meeting (most of them, for the first time) in Bilbao, for what is going to be our largest get-together event so far. We are also thrilled to have gotten our 8 sessions accepted (5 talks, 1 poster, 1 tutorial, 1 helpdesk) and couldn’t feel prouder of being a Gold Sponsor.

Here is a summary of the talks, tutorials and poster sessions that our staff will be giving at EuroPython 2015.

Juan Riaza

Dive into Scrapy / 45 minute talk incl. Q&A

Tuesday 21 July at 11:45

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

In this talk some advanced techniques will be shown based on how Scrapy is used at Scrapinghub.

Scrapy Helpdesk / 3 hours helpdesk

Date and Time yet to be defined

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

This helpdesk is run by members of Scrapinghub, where Scrapy was built and designed.

Scrapy Workshop / 3 hours training

Friday 24 July at 14:30

If you want to get data from the web, and there are no APIs available, then you need to use web scraping! Scrapy is the most effective and popular choice for web scraping and is used in many areas such as data science, journalism, business intelligence, web development, etc.

This workshop will provide an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples.

Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices encouraged by Scrapy to any scraping task.

Shane Evans

Advanced Web Scraping / 45 minute talk incl. Q&A

Tuesday 21 July at 11:00

Python is a fantastic language for writing web scrapers. There is a large ecosystem of useful projects and a great developer community. However, it can be confusing once you go beyond the simpler scrapers typically covered in tutorials.

In this talk, we will explore some common real-world scraping tasks. You will learn best practices and get a deeper understanding of which tools and techniques can be used.

Topics covered will include:

  • Crawling – single pages, websites, focused crawlers, etc.
  • Data extraction – techniques for “scraping” data from web pages (e.g. regular expressions, XPath, machine learning)
  • Deployment – how to run and maintain different kinds of web scrapers
  • Real world examples

Alexander Sibiryakov

Frontera: open source large-scale web crawling framework / 30 minute talk incl. Q&A

Monday 20 July at 15:15

In this talk I’m going to introduce Scrapinghub’s new open source framework, Frontera. Frontera allows you to build both real-time distributed web crawlers and website-focused ones.

Offering:

  • customizable URL metadata storage (RDBMS or Key-Value based),
  • crawling strategies management,
  • transport layer abstraction,
  • fetcher abstraction.

Along with the framework description, I’ll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics of the Spanish internet collected with the newly built crawler. Happy EuroPythoning!

Frontera: open source large-scale web crawling framework / Poster session

Date and Time yet to be defined

In this poster session I’m going to introduce Scrapinghub’s new open source framework, Frontera. Frontera allows you to build both real-time distributed web crawlers and website-focused ones.

Offering:

  • customizable URL metadata storage (RDBMS or Key-Value based),
  • crawling strategies management,
  • transport layer abstraction,
  • fetcher abstraction.

Along with the framework description, I’ll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics of the Spanish internet collected with the newly built crawler. Happy EuroPythoning!

Eugene Amirov

Sustainable way of testing your code / 30 minute talk incl. Q&A

Monday 20 July at 15:45

How to write a test so you will remember what it does a year from now? How to write selective tests with different inputs? What is a test? How to subclass test cases and yet maintain control over which tests run? How to extend or filter the inputs used in parent classes? Are you a tiny bit intrigued now? :)

This is not another talk about how to test, but about how to organize your tests so they remain maintainable. I will be using the nose framework as an example, but the main ideas should be applicable to any other framework you choose. To explain how some parts of the code work, I will have to briefly touch on some advanced Python topics, although I will provide the need-to-know basics, so people with any level of Python knowledge can enjoy the ride.

Lluis Esquerda

CityBikes: bike sharing networks around the world / 45 minute talk incl. Q&A

Wednesday 22 July at 11:00

CityBikes [1] started in 2010 as a FOSS alternative endpoint (and Android client) for gathering information about Barcelona’s Bicing bike sharing service. It later evolved into an open API [2] providing bike sharing data for (almost) any service worldwide.

Fast forward to today and, after some C&D letters, there is support for more than 200 cities, more than 170M historical entries have been gathered for analysis (in approximately a year), and the CityBikes API is the main source of open bike share data worldwide. This talk will tour how we got there with the help of Python and the community [3].

PS: We have a realtime map, it is awesome [4].

[1]: http://citybik.es
[2]: http://api.citybik.es
[3]: http://github.com/eskerda/pybikes
[4]: http://upcoming.citybik.es

In case you haven’t registered for EuroPython yet, you can do it here.

If you have any suggestions for anything specific you would like to hear us talk about or questions you would like us to answer related to our talks, please tell us in the comments.

Using git to manage vacations in a large distributed team

Here at Scrapinghub we are a remote team of 100+ engineers distributed among 30+ countries. As part of their standard contract, Scrapinghubbers get 20 vacation days per year and local country holidays off, and yet we spend almost zero time managing this. How do we do it? The answer is “git”, and here we explain how.

We have a GitHub repository where each employee has a (YAML) file describing their personal vacations and each country has another one describing its public holidays. Employees are linked to the countries they live in. All this structured information is compiled into an ICS file and exposed as a (company-accessible) calendar with all staff time off (including country holidays, medical leave, conferences, etc – it’s all in the YAML file, with that level of detail). Anyone can check this calendar to see who is off this week, next week, or at any time.
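
We haven’t published the exact schema, but to give an idea, a per-employee file and the compile step might look roughly like this (the YAML layout, field names and the compile_ics helper are all hypothetical, not our actual tooling):

```python
# vacations/jane_doe.yaml -- hypothetical layout:
#
# country: uy
# timeoff:
#   - {type: vacation, start: 2015-07-20, end: 2015-07-24}
#   - {type: conference, start: 2015-07-20, end: 2015-07-24, note: EuroPython}

from datetime import timedelta
import yaml  # pip install pyyaml

def compile_ics(yaml_path, employee):
    """Turn one employee's time-off YAML file into ICS calendar events (sketch only)."""
    with open(yaml_path) as f:
        data = yaml.safe_load(f)             # PyYAML parses the ISO dates for us
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0"]
    for entry in data.get("timeoff", []):
        lines += [
            "BEGIN:VEVENT",
            "SUMMARY:%s - %s" % (employee, entry["type"]),
            "DTSTART;VALUE=DATE:%s" % entry["start"].strftime("%Y%m%d"),
            # DTEND is exclusive for all-day events, so add one day
            "DTEND;VALUE=DATE:%s" % (entry["end"] + timedelta(days=1)).strftime("%Y%m%d"),
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)

print(compile_ics("vacations/jane_doe.yaml", "Jane Doe"))
```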

On top of this we have built some graphs to monitor your accrued vacation days (1.66 days per worked month) and warn you if you haven’t taken holidays in a long time.

[Figure: graph of accrued vacation days over time]

Now here comes the best part: when Scrapinghubbers want to take vacations, all they need to do is fork the time off repository and send a pull request to their manager, who merges it to approve the vacation. Because we are all technical people, we wanted to make the time off request process as developer-friendly as possible. Carefully wording a polite email to your manager is a thing of the past! Just send a pull request!

For country holidays we apply something similar. In a large distributed team there are many countries involved (in our case, more than 30), and the list of country holidays is better maintained by the people who live in those countries. They know which holidays apply better than a Human Resources person collecting them from Google. So Scrapinghubbers are required to submit (and help maintain) their country holidays proactively and responsibly. This is also done by sending pull requests (with changes to the country file) that get signed off (i.e. merged) by the HR manager.

This model is heavily based on trust and on having a technically fluent team (working with pull requests may put non-technical people off or require training), but we are happy to have both. If your company is in a position to try this model, we strongly suggest you give it a try. Even if you can’t apply it company-wide, you could try it in specific teams. Having everything in structured data means you can deliver this information in any format requested by your company’s HR team. We have been using this method for half a year now and have never looked back. We couldn’t think of a better way to manage vacations and country holidays for such a large distributed team that wouldn’t involve large Human Resources overheads. And we hate unnecessary overheads!

We plan to continue writing about how we manage our distributed team, so stay tuned and follow us on Twitter. If you would like to work in a strong technical team and can’t wait to request your vacations with git, we are always hiring.

Gender Inequality Across Programming Languages

Gender inequality is a hot topic in the tech industry. Over the last several years we’ve gathered business profiles for our clients, and we realised this data would prove useful in identifying trends in how gender and employment relate to one another.

The following study is based on UK profiles; the gender of each profile was inferred from the given name, which covered approximately 80% of users. We collected data from 2010 through 2015, so we were able to identify changes from year to year.

The following languages were analyzed:

  • Python
  • Ruby
  • Java
  • C#
  • C++
  • JavaScript
  • PHP

Results

[Figure: male and female percentages in the IT industry]

[Figure: male and female percentages outside the IT industry]

[Figure: male and female percentages by programming language]

[Figure: female percentage by year]

Ruby by a large margin appears to have the highest percentage of women, and C++ the lowest.

Gender imbalance seems to be less prominent outside the IT industry, but the percentage of women across languages seems to be increasing over time.

Methodology

The source for this study was provided by our Data Services massive business profiles collection. The gender of a profile was inferred by its given name.

The programming language associated with the user was determined by inspecting the descriptions of the person’s prior experience. Two methodologies were used for this.

The first methodology, which we’ll refer to as ‘Method 1’, associated a user with a programming language if the language name appeared in the description.

As this can lead to a user being assigned more than one language, we also used an alternate methodology, ‘Method 2’, that assigned a language if that language was the only one which appeared in the description.

The results presented above are the average of these two methodologies.
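
To make the two rules concrete, here is roughly how the assignment could be implemented (the language list, regular expression, and function names are ours for illustration; the actual study code differs):

```python
import re

LANGUAGES = ["python", "ruby", "java", "c#", "c++", "javascript", "php"]

def languages_mentioned(description):
    """Return the set of analyzed languages that appear in a job description."""
    text = description.lower()
    return {lang for lang in LANGUAGES
            if re.search(r"(?<![\w+#])" + re.escape(lang) + r"(?![\w+#])", text)}

def method_1(description):
    # Method 1: assign every language whose name appears in the description.
    return languages_mentioned(description)

def method_2(description):
    # Method 2: assign a language only if it is the single language mentioned.
    found = languages_mentioned(description)
    return found if len(found) == 1 else set()

print(method_1("Senior Java and JavaScript developer"))   # {'java', 'javascript'}
print(method_2("Senior Java and JavaScript developer"))   # set()
```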

We considered using people’s list of skills, but decided against it as it would’ve prevented us from retrieving results by year.

We excluded languages such as BASIC due to ambiguity, and analyzed jobs from 2010 to the present. We also excluded languages where there weren’t enough jobs to keep our confidence interval below 1% at a 95% confidence level.

If you would like additional information about our methodology or have any suggestions for a study please contact sales@scrapinghub.com.

Traveling Tips for Remote Workers

Being free to work from wherever you feel like, with no boundaries tying you to a specific place or country: this is one of the greatest advantages of working remotely, and it’s leading many people to travel around the globe while completing their work. Today Claudio Salazar, a Scrapinghubber from Chile, is here to share his experiences and tips for those who want to work on the road.

Claudio’s traveling adventures started in September 2013 and so far he has visited 8 countries and more than 20 cities. He was never the kind of guy that loved traveling, but motivated by the need to improve his English skills, he decided to buy his first flight ticket and get started.

When asked about the benefits of starting a journey like his, he points to the escape from routine as one of the most positive aspects, impacting both your morale and your open-mindedness. “I think that staying in one place for a long time makes you live in a routine, but staying in constant change refills your energies, since every day you wake up with the discovery of new things in mind. This improves your motivation to work and keeps you in a good mood.”

But like many trips, things are not always so easy and comfortable. Claudio has faced some drawbacks since he left Chile, such as dealing with different time zones and adapting to new places and cultures. As well as planning a nice trip, you’ll also need to be flexible, adapting your working hours and sometimes having to attend virtual meetings at late or unusual hours. “An important thing when you try this lifestyle is the flexibility you’ll need to have. Scrapinghub gives me the freedom to manage my working hours and actually work at any time without restriction.”

After traveling around the world for 16 months, Claudio is currently living the good life with his girlfriend in Paris, France. If you’d like to start a journey like Claudio’s, here’s some good advice:

Plans & Visas

First, you need to research the countries you want to visit – check whether they are safe, their legislation, visa requirements, where you plan to live, and so on. Also, always keep in mind the next country you plan to visit, especially if you are traveling as a tourist, because when you enter a country you will probably be asked for your outbound ticket. Figure out the details before arriving and avoid unnecessary stress.

Health Insurance

Make sure you have health insurance. You never know when you might get sick, and medical services are expensive in any country, so you’d better be prepared. You can find many companies offering health insurance online, and many sites that offer comparisons between them.

If you want to visit multiple countries, a good option is to look for continental insurance and check whether it fits your needs. Most policies must be contracted from your home country, so figure it out beforehand. Usually you can contract a policy for 3 months and renew it, but if you miss the deadline while traveling you can’t re-contract it. There are also a few more expensive options that let you contract the insurance independently of your departure or current location.

When you fall sick, you’ll have to call the phone number your insurance company provided and they’ll make an appointment at the hospital nearest to your residence. If you need medicines you might have to buy them yourself at a pharmacy, depending on your health insurance, and then ask for a refund.

Accommodation

Make sure you rent an apartment or room before arriving in the country, because you could be asked where you are going to live during your stay. Try to get a flat with a nice desk, a comfortable chair and an internet connection so you can do your work properly. Sites like Airbnb are useful for finding shorter-term lets; more expensive than a 6-12 month lease but cheaper than a hotel.

Credit Cards & Bank Statements

Print your bank statements before traveling because you’ll probably be asked for them (a typical “show me the money” situation on arrival). Remember to always travel with two credit cards, and carry debit cards for emergencies.

Getting Along

As a foreigner you will learn new things daily as you meet people. A good tip is to check beforehand for popular forums and communities online, or even well-known sites like Facebook and Couchsurfing. Another good option is finding a meetup site to meet people and have fun. As a foreigner, you’re likely to receive some kind of special treatment from the locals.

Getting Things Done

Since you’ll be working while traveling, one of your challenges will be to keep fulfilling your responsibilities at work while on the road. Aside from the trip preparations, you’ll have to manage any unexpected travel issue and still get things done remotely.

Make sure to always bring your gadgets (smartphone, tablet) and your notebook with you – you never know when one of them may break or malfunction, so a backup device can keep you in communication with your team and buy you time until you address the issue. A good tip is to buy a prepaid SIM card with 3G or 4G internet when you arrive in a country so you have a backup internet connection if needed.

Also, before renting a flat or choosing a new place to visit, it is important to check for nearby cafes with an internet connection, coworking spaces and wifi zones – so you have more options to keep working in case your internet lets you down. It’s a good idea to narrow your choices to places with these resources available nearby.

In addition to Claudio, many other Scrapinghubbers have been traveling while working remotely, and you can see a few of their adventures and routes in the following map (feel free to navigate through the left menu for more):

Do you have an interesting story or tip to share about traveling while working remotely? We’d be glad to hear it! Feel free to share in the comments below. Safe travels!

A Career in Remote Working

This year I have reached a major milestone in my life, which is getting my bachelor’s degree in mathematics. When I made the decision to go back to college, it was solely because of my experience working in your company; I figured out that having a math background would be a great foundation for getting into ML-related stuff.

Now my life journey has a new beginning, and I wouldn’t be here if it weren’t for the opportunity you gave me. I will always be grateful to you.

— Rolando, our first hire

From the beginning, Scrapinghub has been a fully remote company, and now boasts over 100 employees working from all over the world, either from their homes or local coworking spaces. Our decision to maintain a fully remote workforce has proven to be a very good decision, allowing us to access a much wider range of talent compared to hiring locally.

So what about the career of someone who is considering remote work? It can seem risky to leave your cushy on-site job behind in favour of working for a company located across the globe, but the risk is smaller than you would imagine.

Rolando is a great example of someone who has had a lot of success working remotely for the past 7 years. Born in Argentina and raised in Bolivia, Rolando Espinoza, now 30 years old, has been a very important part of Scrapinghub since the very beginning. Rolando began his journey at our founder Pablo’s first company, Insophia, where he started as a Python developer. By June 2010 he was already working on the first version of Scrapinghub’s dashboard.

The only time Rolando has worked on-site since then was as a software developer in Uruguay, alongside founders Pablo and Shane, at the former headquarters of Insophia, where Scrapinghub’s Uruguay office now resides.

Working at Scrapinghub allowed Rolando to pursue a degree in mathematics, as he was able to fit his work schedule around his studies. At Scrapinghub, we allocate teams who share similar time zones and give each member control over their own schedule. You can work mornings, nights, or during whichever time you feel like. Day-to-day rearrangements can always be made and agreed upon within the team, and this flexibility has shown to be a win-win for all.

Rolando was very happy with his decision to complete a degree in mathematics. In his own words: “being a remote employee, rather than an independent freelancer, gives the opportunity to work in really interesting and challenging large-scale projects along with very smart people.”

Rolando worked on a number of machine learning projects here at Scrapinghub, allowing him to make use of the mathematical knowledge he was gaining at university from the outset. He is now looking forward to joining a CS/Math graduate program in the near future.

Rolando comments: “Regarding machine learning, now I have a better grasp and can understand most of the notation and mathematical terminology in the algorithms and related papers.”

Having worked on everything from web development to large-scale web data mining, and from open source projects to large professional services projects such as the Memex project, he thinks “this is something very attractive, especially for those who live in cities or countries with a small and narrow software industry.”

Rolando believes that working with smart, highly skilled colleagues from such a variety of countries has been an excellent opportunity to learn from them and push himself further. He’s proud of being able to keep up with the expectations of a company that strives to provide world-class services with top talent from all over the world.

We’re all incredibly grateful here at Scrapinghub for all of Rolando’s excellent work and contributions to the company, which he has brought from the very beginning.

If you’re inspired by Rolando’s story, excited about the prospect of working remotely and looking to join a team of smart, motivated people, check out our open positions.
