Browsed by Author: Pablo Hoffman

Google Summer of Code 2015

We are very excited to be participating in Google Summer of Code again this year. After a successful experience last year, when Julia Medina (now a proud Scrapinghubber!) worked on the Scrapy API cleanup and per-spider settings, we are back this year with three ideas approved...

Using git to manage vacations in a large distributed team

Here at Scrapinghub we are a remote team of 100+ engineers distributed across 30+ countries. As part of their standard contract, Scrapinghubbers get 20 vacation days per year plus their local country holidays off, and yet we spend almost zero time managing any of it. How do we do it? The answer is Git, and here we explain how.

Why we moved to Slack

We are veterans of the group chat arena. We have been using one form or another since we started Scrapinghub in 2010, and I've personally been using corporate group chats since 2004. We started Scrapinghub on our own hosted ejabberd server, then moved to HipChat in 2013, and we have just finished moving to Slack. Thanks to Slack's migration tools, the process went pretty smoothly. In this post...

Open Source at Scrapinghub

Here at Scrapinghub we love open source. We love using it and contributing to it. Over the years we have open sourced a few projects that we keep using over and over, in the hope that they will make others' lives easier. Writing reusable code is harder than it sounds, but it enforces good practices such as accurate documentation, extensive testing, and attention to backward compatibility. In the end...

Marcos Campal Is a Scrapinghubber!

We’re excited to welcome Marcos Campal to the Scrapinghub engineering team.

Introducing Crawlera, a Smart Page Downloader

We are proud to introduce Crawlera, a smart web downloader designed specifically for web crawling.
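To give a concrete sense of what a proxy-style downloader looks like from the client side, here is a minimal sketch using Python's requests library. The host, port, and credential format below are placeholders for illustration, not Crawlera's actual interface:

    import requests

    # Placeholder endpoint and API key -- illustrative only, not
    # Crawlera's real host or credential scheme.
    PROXY = "http://API_KEY:@proxy.example.com:8010"

    # Route the request through the downloader as an HTTP proxy.
    response = requests.get(
        "http://example.com",
        proxies={"http": PROXY, "https": PROXY},
    )
    print(response.status_code)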

Git Workflow for Scrapy Projects

Our customers often ask us what the best workflow is for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders, project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded in each environment. This works reasonably well, until you have to...
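As a minimal sketch of that setup, a settings.py along these lines would load production spiders by default and add development spiders only on demand (the SCRAPY_ENV environment variable is our own illustration, not a Scrapy built-in):

    # settings.py
    import os

    # Production spiders are always loaded.
    SPIDER_MODULES = ["project.spiders_prod"]

    # Development spiders are added only when SCRAPY_ENV=dev is set.
    # The SCRAPY_ENV variable is illustrative, not part of Scrapy.
    if os.environ.get("SCRAPY_ENV") == "dev":
        SPIDER_MODULES.append("project.spiders_dev")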

How to Fill Login Forms Automatically

We often have to write spiders that need to log in to sites in order to scrape data from them. Our customers provide us with the site, username, and password, and we do the rest.
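For reference, the usual Scrapy pattern for this is FormRequest.from_response, which pre-fills the form found in the login page and submits your credentials before the crawl proceeds. The spider name, URL, and form field names below are illustrative:

    import scrapy

    class LoginSpider(scrapy.Spider):
        # Illustrative values: use the target site's real URL and
        # form field names.
        name = "login_example"
        start_urls = ["http://example.com/login"]

        def parse(self, response):
            # from_response picks up the form in the page and merges
            # our credentials into its fields before submitting.
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "user", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # A simple failure check; the marker text depends on the site.
            if b"Login failed" in response.body:
                self.logger.error("Login failed")
                return
            # Logged in -- continue scraping from here.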

Spider activity graphs

Today we are introducing a new feature called spider activity graphs. These let you quickly visualize how your spiders are performing, which makes them a very useful tool in busy projects for spotting spiders that are not working as expected.

Scrapy 0.15 dropping support for Python 2.5

After a year of considering it, we have decided to go ahead and drop support for Python 2.5 in Scrapy.
