We are very excited to be participating in Google Summer of Code again this year. After a successful experience last year, when Julia Medina (now a proud Scrapinghubber!) worked on Scrapy API cleanup and per-spider settings, we are back with 3 approved ideas:
When scraping content from the web, you often crawl websites you have no prior knowledge of. Link analysis algorithms are incredibly useful in these scenarios to guide the crawler to relevant pages.
It's not uncommon to need to crawl a large number of unfamiliar websites when gathering content. Page ranking algorithms are incredibly useful in these scenarios as it can be tricky to determine which pages are relevant to the content you're looking for.
Here at Scrapinghub we are a remote team of 100+ engineers distributed across 30+ countries. As part of their standard contract, Scrapinghubbers get 20 vacation days per year plus local country holidays off, and yet we spend almost zero time managing this. How do we do it? The answer is “git”, and here we explain how.