Browsed by Category: scrapy-cloud

Introducing Data Reviews

One of the most time-consuming parts of building a spider is reviewing the scraped data and making sure it conforms to the requirements and expectations of your client or team. This process is so time consuming that, in many cases, it ends up taking longer than writing the spider code itself, depending on how well the requirements are written. To make this process more efficient we have...

Introducing Dash

We're excited to introduce Dash, a major update to our scraping platform.

Why MongoDB Is a Bad Choice for Storing Our Scraped Data

MongoDB was used early on at Scrapinghub to store scraped data because it's convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the...
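As a rough illustration of the convenience described above, here is a minimal sketch of storing and querying nested, schema-less scraped items with pymongo. The database and collection names, the item fields, and the example URL are illustrative assumptions, not part of the original post.

```python
# Minimal sketch: schema-less storage of scraped items in MongoDB via pymongo.
# Database/collection names and item fields are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["scraping"]["items"]

# Scraped items are free-form, possibly nested dicts -- no schema needed upfront.
item = {
    "spider": "books",
    "url": "http://example.com/book/1",
    "title": "Example Book",
    "offers": [{"price": 9.99, "currency": "USD"}],
}
collection.insert_one(item)

# Browsing and querying the stored data is equally direct.
for doc in collection.find({"spider": "books"}).limit(10):
    print(doc["title"])
```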

Git Workflow for Scrapy Projects

Our customers often ask us what the best workflow is for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders, project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded in each environment (see the sketch below). This works reasonably well, until you have to...
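A minimal settings.py sketch of that split, assuming an environment variable (here called SCRAPY_ENV, an illustrative name) is used to select the environment:

```python
# settings.py (sketch) -- selects which spider package Scrapy loads
# based on an environment variable; the variable name is an assumption.
import os

BOT_NAME = "project"

if os.environ.get("SCRAPY_ENV") == "prod":
    SPIDER_MODULES = ["project.spiders_prod"]
else:
    SPIDER_MODULES = ["project.spiders_dev"]

# New spiders generated with `scrapy genspider` go into the active package.
NEWSPIDER_MODULE = SPIDER_MODULES[0]
```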

Spiders activity graphs

Today we are introducing a new feature called Spider activity graphs. These let you quickly visualize how your spiders are performing, a very useful tool on busy projects for finding out which spiders are not working as expected.