
EuroPython 2015

EuroPython 2015 is happening this week and, as part of it, we’re having our largest company meetup so far, with more than 30 members of our fully remote team attending. The event, held in Bilbao, started on Monday and is providing great talks, sessions and plenty of tasty Spanish dishes.

[Photo: the Scrapinghub booth at EuroPython]

As sponsors, we’ve been privileged with a nice booth where we are connecting with Pythonistas, discussing scraping practices and giving away lots of swag such as hats, t-shirts, stickers, bottle openers and clocks.

[Photo: EuroPython swag from Scrapinghub]

If you’re attending EuroPython, make sure to drop by and get one of these. Our booth is in the main area (exhibition hall and lounge area) of the Euskalduna Conference Center. :)

On the conference’s first day, Scrapinghubbers gave two talks: one about Frontera, by Alexander Sibiryakov, and one about testing, by Eugene Amirov – both available below.

On Tuesday we hosted two more talks: one about best practices for web scraping, by Shane Evans, and one about Scrapy, by Juan Riaza – also available below. In addition, we had a poster session about Frontera and a recruiting session sharing job opportunities and the perks of working for Scrapinghub.

On the third day, Lluis Esquerda hosted a talk about CityBikes, his personal project on bike sharing networks (check the video below). As night came we went to Hotel Ercilla, where the EuroPython organizers were holding the conference’s social event (a.k.a. “Pyntxos Night”). There we enjoyed the nice food, the company of other attendees and even a few tricky moves on the dance floor. :)

After recovering from the party, Thursday came and with it Juan Riaza’s Scrapy Helpdesk. We invited attendees to join us to learn about Scrapy and pick up some swag at our booth.

On Friday, Juan Riaza hosted a 3-hour training session on Scrapy, sharing tips and building spiders in real time with the attendees. We also went to the Guggenheim Museum, where we met the “Mom Spider” or, as some people say, one of Scrapinghub’s ancestors.

Here are the full recorded versions of the talks given by Scrapinghubbers:

Alexander’s talk “Frontera: open source large-scale web crawling framework”

Eugene’s talk “Sustainable way of testing your code”

Shane’s talk “Advanced Web Scraping”

Juan’s talk on “Dive into Scrapy”

Lluis’ talk “CityBikes: bike sharing networks around the world”

Look forward to tomorrow when we’ll be working on the Scrapy sprint!

StartupChats Remote Working Q&A

Earlier this week, Scrapinghub was invited along with several other fully-distributed companies to participate in a remote working Q&A hosted by Startups Canada.

The various companies invited, as well as their guests attending, shared their insights into a number of questions related to building a remote company and being a remote worker.

We’re sharing our answers and our favourite answers from other companies.

Q1 What are some of the major benefits to using remote workers?



Q2 What are some of the major benefits in working remotely?




Q3 How can you stay connected with your team while working remotely?



Q4 What is the best software for ensuring effective communication/collaboration?


Q5 Are there any home-office necessities for working remotely?


Q6 Do certain roles lend themselves better to remote workers? Are there some that don’t?


Q7 How do you maintain a team atmosphere and culture when your employees are working remotely?




Q8 Is information security a greater issue for businesses that use remote workers/allow employees to work remotely?


Q9 How can a business protect its data while using remote workers?


Q10 Is a certain amount of face-to-face interaction recommended ie pre-hire/once a year? When and how much?



Q11 Any final tips for our audience today? Do you know of any helpful blogs/resources on remote working?






Thanks to everyone at QuickBooks and Startup Canada, as well as Joe Johnson, Wade Foster, Alex Brown and Tom Redman!

PyCon Philippines 2015

Earlier this month we attended PyCon Philippines as a gold sponsor, presenting on the 2nd day. This was particularly exciting as it was the first time the whole Philippines team was together in one place and it was nice meeting each other in person!

Check out the slides below:

The talk started with how people used to scrape manually and the pain of handling timeouts, retries, HTTP errors and so forth. We presented Scrapy as a solution to these issues and explained how it addresses them, as well as giving a brief history of how Scrapy came to be.
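To give a flavour of what this looks like in practice, here is a minimal spider. It is not taken from the talk, just an illustrative sketch of our own using the quotes.toscrape.com demo site as an example target, showing how Scrapy’s built-in settings take care of retries and timeouts so you don’t have to write that plumbing yourself:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal spider: Scrapy handles retries, timeouts and HTTP errors for us."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    # Tune the built-in downloader middlewares instead of writing
    # our own retry/timeout handling code.
    custom_settings = {
        "RETRY_TIMES": 3,             # retry failed requests up to 3 times
        "DOWNLOAD_TIMEOUT": 30,       # give up on a request after 30 seconds
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408],
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy schedules and deduplicates these requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```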


We proceeded with a live demo showing how to scrape the Republic Acts of the Philippines from the Philippine government website as well as scraping clickthecity.com to retrieve cinema screening schedules in Metro Manila. Some of the audience joined in during the demo and we helped answer their questions.

We also talked about some of the projects we have done for customers at Scrapinghub:

  • Collecting product information from various retailers worldwide for a UK analytics company. This data is used to discover who has the cheapest products, which retailers are running promotions, and so on. It is useful for customers who want to find the best deals, for retailers to see how they compare, and for brands to ensure retailers are conforming to their guidelines.
  • Scraping home appliances along with their prices, specifications and ratings for a U.S. Department of Energy laboratory. This data is used to better understand the relationship between product price, energy efficiency and other factors, and how it evolves over time.
  • DARPA’s Memex project.

We then showed a number of side projects including:

  • Using Scrapy to crawl the Metro Rail Transit website and, with the aid of computer vision, gauge crowding on a scale of 1 to 10 from its CCTV images. The collected data was visualised and presented, along with an explanation of how historical data could be used to predict future results.
  • Minibalita.com: Jolo showed how he scrapes Philippine news websites and runs the articles through his site TextTeaser to produce summaries. Balita is the Tagalog word for news.
  • Mikko presented his crawling of the Philippines’ 2013 general election, emphasising the power of structured data in finding trends. Some unusual trends discussed were:

    • How one clustered precinct of around 400 people only voted for one party list despite having the option to vote for two.
    • How one clustered precinct only voted for one senator, out of the maximum of 12 votes they can cast for the position.
    • How 70 clustered precincts recorded a 100% voter turnout which is highly unlikely to happen.

Finally we discussed some of the legalities of scraping, which prompted a lot of questions. We also had many questions on how we deal with complaints and with sites blocking us, and on how to handle sites that make heavy use of JavaScript and AJAX.


Afterwards we had dinner in a nearby mall with the other speakers, and it was great meeting like-minded people from the Python community.

Google Summer of Code 2015

We are very excited to be participating in Google Summer of Code again this year. After a successful experience last year, when Julia Medina (now a proud Scrapinghubber!) worked on Scrapy’s API cleanup and per-spider settings, we are back this year with three approved ideas.

We would like to thank the Python Software Foundation for accepting us again this year, and we wish the best of luck to our students and all Summer of Code participants.

Link Analysis Algorithms Explained

When scraping content from the web, you often crawl websites which you have no prior knowledge of. Link analysis algorithms are incredibly useful in these scenarios to guide the crawler to relevant pages.

This post aims to provide a lightweight introduction to page ranking algorithms so you have a better understanding of how to implement and use them in your spiders. There will be a follow up post soon detailing how to use these algorithms in your crawl.

PageRank

PageRank is perhaps the most well known page ranking algorithm. The PageRank algorithm uses the link structure of the web to calculate a score. This score provides insight into how relevant the page is and in our case can be used to guide the crawler. Many search engines use this score to influence search results.

One possible way of coming up with PageRank is by modelling the surfing behavior of a user on the web graph. Imagine that at a given instant of time t the surfer, S, is on page i. We denote this event as S_t = i. Page i has n_i outgoing and m_i incoming links as depicted in the figure.

[Figure: a web graph, showing page i with its n_i outgoing and m_i incoming links]

While on page i, the surfer does one of the following to continue navigating:

  1. With probability c, follow a random outgoing link.
  2. With probability 1 - c, jump to a page selected at random from the entire web. When this happens we say that the surfer has teleported to another page.

So what happens if a page has no outgoing links? It’s reasonable to assume the user won’t stick around, so we assume they always teleport, meaning they visit another page through other means, such as entering an address manually.

Now that we have a model of the user behavior let’s calculate P(S_t = i): the probability of being at page i at time instant t. The total number of pages is N.

P\left(S_t = i\right) = \sum_{j=1}^N P\left(S_{t} = i \mid S_{t-1} = j\right)P\left(S_{t-1} = j\right)

Now, the probability of going from page j to page i differs depending on whether j links to i (j \rightarrow i) or not (j \nrightarrow i).

If j \rightarrow i then:

P\left(S_{t} = i \mid S_{t-1} = j\right) = c \frac{1}{n_j} + (1 - c)\frac{1}{N}

If j \nrightarrow i but n_j \neq 0, then the only possibility is that the user chooses to teleport and lands on page i.

P\left(S_{t} = i \mid S_{t-1} = j\right) = (1 - c)\frac{1}{N}

If j \nrightarrow i and n_j = 0 then the only possibility is that the user teleports to page i.

P\left(S_{t} = i \mid S_{t-1} = j\right) = \frac{1}{N}

We have assumed uniform probabilities in two cases:

  • When following outgoing links we assume all links have equal probability of being visited.
  • When teleporting to another page we assume all pages have equal probability of being visited.

In the next section we’ll remove this second assumption about teleporting in order to calculate personalized PageRank.

Using the formulas above, with some manipulation we obtain:

P\left(S_t = i\right) = c \sum_{j \rightarrow i} \frac{P(S_{t-1} = j)}{n_j} + \frac{1}{N}\left(1 - c\sum_{n_j \neq 0}P(S_{t-1} = j)\right)

Finally, for convenience let’s call \pi_i^t = P\left(S_t = i\right):

\pi_i^t = c \sum_{j \rightarrow i} \frac{\pi_j^{t-1}}{n_j} + \frac{1}{N}\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right)

PageRank is defined as the limit of the above sequence:

\text{PageRank}(i) = \pi_i = \lim_{t \to \infty}\pi_i^t

In practice, however, PageRank is computed by iterating the above formula a finite number of times: either a fixed number or until changes in the PageRank score are low enough.
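As a concrete illustration, here is a small Python sketch of that iteration. It is our own illustrative code (the function name and the toy graph are made up), not a production implementation:

```python
import numpy as np

def pagerank(out_links, n_pages, c=0.85, n_iter=50):
    """Iterate the PageRank update formula on a graph given as
    {page: [pages it links to]}, with pages numbered 0..n_pages-1."""
    pi = np.full(n_pages, 1.0 / n_pages)   # start from the uniform distribution
    for _ in range(n_iter):
        new_pi = np.zeros(n_pages)
        # First term: with probability c the surfer follows an outgoing link.
        for j, links in out_links.items():
            if links:
                share = c * pi[j] / len(links)
                for i in links:
                    new_pi[i] += share
        # Second term: the remaining probability mass (teleportation, plus
        # everything sitting on dangling pages) is spread uniformly over all N pages.
        linking_mass = sum(pi[j] for j, links in out_links.items() if links)
        new_pi += (1.0 - c * linking_mass) / n_pages
        pi = new_pi
    return pi

# Tiny example: page 2 has no outgoing links (a dangling page).
scores = pagerank({0: [1, 2], 1: [2], 2: []}, n_pages=3)
```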

If we look at the formula for \pi_i^t we see that the PageRank of a page has two parts. One part depends on how many pages are linking to the page but the other part is distributed equally to all pages. This means that all pages are going to get at least:

\frac{1}{N}\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right) \geq \frac{1 - c}{N}

This gives link spammers an opportunity to artificially increase the PageRank of any page they want by maintaining link farms, which are huge collections of pages controlled by the spammer.

As PageRank will give all pages a minimum score, all of these pages will have some PageRank that can be redirected to the page the spammer wants to rise in search results.

Spammers will try to build backlinks to their pages by linking to their sites on pages they don’t own. This is most common on blog comments and forums where content is accepted from users.

[Figure: web spam]

Trying to detect web spam is a never-ending war between search engines and spammers. To help filter out spam we can use Personalized PageRank which works by not assigning a free score to undeserving pages.

Personalized PageRank

Personalized PageRank is obtained much like PageRank, but instead of a uniform teleport probability, each page has its own probability r_i of being teleported to, irrespective of the originating page:

P\left(j \mathbin{\text{teleports to}} i \mid j\: \text{teleports}\right) = r_i

The update equations are therefore:
\pi_i^t = c \sum_{j \rightarrow i} \frac{\pi_j^{t-1}}{n_j} + r_i\left(1 - c\sum_{n_j \neq 0}\pi_j^{t-1}\right)

Of course it must be that:

\sum_{i=1}^N r_i = 1

As you can see plain PageRank is just a special case where r_i = 1/N.

There are several ways the score r_i can be calculated. For example, it could be computed using a text classification algorithm on the page content. Alternatively, it could be distributed uniformly over a set of trusted seed pages and set to zero for all other pages, in which case we get TrustRank.
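Continuing the earlier sketch, personalization only changes where the teleportation mass goes. The following hypothetical snippet shows the modified update and a TrustRank-style choice of r:

```python
import numpy as np

def personalized_pagerank(out_links, n_pages, r, c=0.85, n_iter=50):
    """Same iteration as the plain PageRank sketch, except the teleportation
    mass is distributed according to r (which must sum to 1), not uniformly."""
    r = np.asarray(r, dtype=float)
    pi = r.copy()                            # a natural starting point
    for _ in range(n_iter):
        new_pi = np.zeros(n_pages)
        for j, links in out_links.items():
            if links:
                share = c * pi[j] / len(links)
                for i in links:
                    new_pi[i] += share
        linking_mass = sum(pi[j] for j, links in out_links.items() if links)
        new_pi += r * (1.0 - c * linking_mass)   # teleport according to r_i, not 1/N
        pi = new_pi
    return pi

# TrustRank-style personalization: teleport only to trusted seed pages 0 and 1.
r = [0.5, 0.5, 0.0, 0.0]
scores = personalized_pagerank({0: [2], 1: [2, 3], 2: [3], 3: []}, n_pages=4, r=r)
```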

Of course, there are ways to defeat this algorithm:

  • Good pages can link to spam pages.
  • Spam pages could manage to get good scores, for example by adding certain keywords to their content (content spamming).
  • Link farms can be improved by duplicating good pages but altering their links. An example would be mirrors of Wikipedia which add links to spam pages.

HITS

HITS (hyperlink-induced topic search) is another link analysis algorithm that assigns each page two scores: a hub score and an authority score. A page’s hub score depends on the authority scores of the pages it links to, and its authority score depends on the hub scores of the pages linking to it. Twitter makes use of HITS to suggest users to follow.

The idea is to compute for each page a pair of numbers called the hub and authority scores. A page is considered a good hub when it points to a lot of pages with high authority, and a page has high authority if it’s pointed to by many hubs.

The following graph shows several pages with one clear hub H and two clear authorities A_1 and A_2.

[Figure: a hub H pointing to two authorities A_1 and A_2]

Mathematically this is expressed as:

h_i = \sum_{j: i \to j} a_j
a_i = \sum_{j: j \to i} h_j

Where h_i represents the hub score of page i and a_i represents its authority score.

Similar to PageRank, these equations are solved iteratively until they converge to the required precision. HITS was conceived as a ranking algorithm for user queries, where pages not relevant to the query were filtered out before computing the scores.

For the purposes of our crawler we make a compromise: authority scores are modulated with the topic-specific score r_i, giving the following modified equations:

h_i = \sum_{j: i \to j} r_j a_j
a_i = \sum_{j: j \to i} h_j

As we can see, totally irrelevant pages (r_j = 0) don’t contribute authority back.

HITS is slightly more expensive to run than PageRank because it has to maintain two sets of scores and also propagate scores twice. However, it’s particularly useful for crawling as it propagates scores back to the parent pages, providing a more accurate prediction of the strength of a link.
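A sketch of the modified iteration follows. The code is illustrative only (the names are ours), and we’ve added a normalization step, as is standard for HITS, to keep the scores from growing without bound; passing r=None gives plain HITS:

```python
import numpy as np

def hits(out_links, n_pages, r=None, n_iter=50):
    """Hub/authority iteration over {page: [pages it links to]}; if a relevance
    vector r is given, authority contributions are modulated by it as above."""
    if r is None:
        r = np.ones(n_pages)                  # plain HITS
    hub = np.ones(n_pages)
    auth = np.ones(n_pages)
    for _ in range(n_iter):
        # Authority of i: sum of the hub scores of pages linking to i.
        new_auth = np.zeros(n_pages)
        for j, links in out_links.items():
            for i in links:
                new_auth[i] += hub[j]
        # Hub of i: sum of the (relevance-weighted) authority scores i links to.
        new_hub = np.zeros(n_pages)
        for i, links in out_links.items():
            new_hub[i] = sum(r[j] * new_auth[j] for j in links)
        # Normalize so the scores stay bounded across iterations.
        auth = new_auth / (np.linalg.norm(new_auth) or 1.0)
        hub = new_hub / (np.linalg.norm(new_hub) or 1.0)
    return hub, auth
```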

Further reading:

  • HITS: Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.” Journal of the ACM (JACM) 46.5 (1999): 604-632.
  • PageRank and Personalized Pagerank: Page, Lawrence, et al. “The PageRank citation ranking: Bringing order to the web.” (1999).
  • TrustRank: Gyöngyi, Zoltán, Hector Garcia-Molina, and Jan Pedersen. “Combating web spam with trustrank.” Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.

We hope this post has served as a good introduction to page ranking algorithms and their usefulness in web scraping projects. Stay tuned for our next post where we’ll show you how to use these algorithms in your Scrapy projects!

EuroPython, here we go!

We are very excited about EuroPython 2015!

33 Scrapinghubbers from 15 countries will be meeting (most of them for the first time) in Bilbao, for what is going to be our largest get-together so far. We are also thrilled to have had 8 sessions accepted (5 talks, 1 poster, 1 tutorial, 1 helpdesk) and couldn’t feel prouder to be a Gold Sponsor.

Here is a summary of the talks, tutorials and poster sessions that our staff will be giving at EuroPython 2015.

juan-talkJuan Riaza

Dive into Scrapy / 45 minute talk incl. Q&A

Tuesday 21 July at 11:45

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

In this talk some advanced techniques will be shown based on how Scrapy is used at Scrapinghub.

Scrapy Helpdesk / 3 hours helpdesk

Date and Time yet to be defined

Scrapy is an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.

This helpdesk is run by members of Scrapinghub, where Scrapy was built and designed.

Scrapy Workshop / 3 hours training

Friday 24 July at 14:30

If you want to get data from the web, and there are no APIs available, then you need to use web scraping! Scrapy is the most effective and popular choice for web scraping and is used in many areas such as data science, journalism, business intelligence, web development, etc.

This workshop will provide an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples.

Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices encouraged by Scrapy to any scraping task.

shane-talkShane Evans

Advanced Web Scraping / 45 minute talk incl. Q&A

Tuesday 21 July at 11:00

Python is a fantastic language for writing web scrapers. There is a large ecosystem of useful projects and a great developer community. However, it can be confusing once you go beyond the simpler scrapers typically covered in tutorials.

In this talk, we will explore some common real-world scraping tasks. You will learn best practices and get a deeper understanding of what tools and techniques can be used.

Topics covered will include:

  • Crawling – single pages, websites, focussed crawlers, etc.
  • Data extraction – techniques for “scraping” data from web pages (e.g. regular expressions, XPath, machine learning)
  • Deployment – how to run and maintain different kinds of web scrapers
  • Real world examples

alexander-talkAlexander Sibiryakov

Frontera: open source large-scale web crawling framework / 30 minute talk incl. Q&A

Monday 20 July at 15:15

In this talk I’m going to introduce Scrapinghub’s new open source framework, Frontera. Frontera allows you to build real-time distributed web crawlers as well as focused website crawlers.

Offering:

  • customizable URL metadata storage (RDBMS or key-value based),
  • crawling strategies management,
  • transport layer abstraction,
  • fetcher abstraction.

Along with a description of the framework, I’ll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics on the Spanish internet collected with the newly built crawler. Happy EuroPythoning!

Frontera: open source large-scale web crawling framework / Poster session

Date and Time yet to be defined

In this poster session I’m going to introduce Scrapinghub’s new open source framework, Frontera. Frontera allows you to build real-time distributed web crawlers as well as focused website crawlers.

Offering:

  • customizable URL metadata storage (RDBMS or key-value based),
  • crawling strategies management,
  • transport layer abstraction,
  • fetcher abstraction.

Along with a description of the framework, I’ll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics on the Spanish internet collected with the newly built crawler. Happy EuroPythoning!

eugene-talkEugene Amirov

Sustainable way of testing your code / 30 minute talk incl. Q&A

Monday 20 July at 15:45

How do you write a test so you will still remember what it does a year from now? How do you write selective tests with different inputs? What is a test? How do you subclass test cases and yet keep control over which tests run? How do you extend or filter the inputs used in parent classes? Are you a tiny bit intrigued now? :)

This is not another talk about how to test, but about how to organize your tests so they stay maintainable. I will be using the nose framework as an example, but the main ideas should be applicable to any other framework you choose. While explaining how some parts of the code work, I will have to briefly touch on some advanced Python topics, although I will provide the need-to-know basics, so people with any level of Python knowledge can enjoy the ride.

lluis-talkLluis Esquerda

CityBikes: bike sharing networks around the world / 45 minute talk incl. Q&A

Wednesday 22 July at 11:00

CityBikes [1] started in 2010 as a FOSS alternative endpoint (and Android client) for gathering information about Barcelona’s Bicing bike sharing service. It later evolved into an open API [2] providing bike sharing data for (almost) any service worldwide.

Fast forward to today: after some C&D letters, there’s support for more than 200 cities, more than 170M historical entries have been gathered for analysis (in approximately a year), and the CityBikes API is the main source of open bike share data worldwide. This talk will tour how we got there with the help of Python and the community [3].

PS: We have a realtime map, it is awesome [4].

[1]: http://citybik.es
[2]: http://api.citybik.es
[3]: http://github.com/eskerda/pybikes
[4]: http://upcoming.citybik.es

In case you haven’t registered for EuroPython yet, you can do it here.

If you have any suggestions for anything specific you would like to hear us talk about or questions you would like us to answer related to our talks, please tell us in the comments.

Using git to manage vacations in a large distributed team

Here at Scrapinghub we are a remote team of 100+ engineers distributed across 30+ countries. As part of their standard contract, Scrapinghubbers get 20 vacation days per year plus their local country holidays off, and yet we spend almost zero time managing this. How do we do it? The answer is git, and here we explain how.

We have a GitHub repository where each employee has a (YAML) file describing their personal vacations and each country has another describing its public holidays. Employees are linked to the countries they live in. All this structured information is compiled into an ICS file and exposed as a (company-accessible) calendar of all staff time off (including country holidays, medical leave, conferences, etc – it’s all in the YAML file, with that level of detail). Anyone can check this calendar to see who is off this week, next week or anytime.
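We can’t share the internal tooling, but a minimal sketch of the compilation step might look like the following, assuming a hypothetical per-employee YAML layout with a name field and a list of timeoff entries (each with a type, a start date and an end date):

```python
import glob
from datetime import timedelta

import yaml  # PyYAML

def compile_calendar(pattern="timeoff/*.yaml"):
    """Read every employee's YAML file and emit a single ICS calendar string.
    The file layout used here is hypothetical, just one way it could look."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//timeoff//EN"]
    for path in glob.glob(pattern):
        with open(path) as f:
            person = yaml.safe_load(f)
        for entry in person.get("timeoff", []):
            start = entry["start"]                  # e.g. 2015-08-03, parsed as a date
            end = entry["end"] + timedelta(days=1)  # DTEND is exclusive in iCalendar
            lines += [
                "BEGIN:VEVENT",
                "SUMMARY:%s (%s)" % (person["name"], entry.get("type", "vacation")),
                "DTSTART;VALUE=DATE:%s" % start.strftime("%Y%m%d"),
                "DTEND;VALUE=DATE:%s" % end.strftime("%Y%m%d"),
                "END:VEVENT",
            ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)
```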

On top of this we have built some graphs to monitor your accrued vacation days (1.66 days per month worked) and warn you if you haven’t taken a holiday in a long time.

[Figure: accrued vacation days over time]

Now here comes the best part: when Scrapinghubbers want to take vacation, all they need to do is fork the time off repository and send a pull request to their manager, who merges it to approve the vacation. Because we are all technical people, we wanted to make the time off request process as developer-friendly as possible. Carefully wording a polite email to your manager is a thing of the past. Just send a pull request!

For country holidays we apply something similar. In a large distributed team there are many countries involved (in our case, more than 30) and the list of country holidays is better maintained by the people who live in those countries: they know which holidays apply better than a Human Resources person collecting them from Google. So Scrapinghubbers are required to submit (and help maintain) their country holidays proactively and responsibly. This is also done by sending pull requests (with changes to the country file) that get signed off (i.e. merged) by the HR manager.

This model relies heavily on trust and on having a technically fluent team (working with pull requests may put non-technical people off or require training), but we are happy to have both. If your company is in a position to try this model, we strongly suggest you give it a try. Even if you can’t apply it company-wide, you could try it in specific teams. Having everything as structured data means you can deliver this information in any format requested by your company’s HR team. We have been using this method for half a year now and have never looked back. We couldn’t think of a better way to manage vacations and country holidays for such a large distributed team without significant Human Resources overhead. And we hate unnecessary overhead!

We plan to continue writing about how we manage our distributed team, so stay tuned and follow us on Twitter. If you would like to work in a strong technical team and can’t wait to request your vacations with git, we are always hiring.
