Spidermon: Scrapinghub’s Secret Sauce To Our Data Quality & Reliability Guarantee
If you know anything about Scrapinghub, you know that we are obsessed with data quality and data reliability.
Beyond building some of the most powerful web scraping tools in the world, we also specialise in helping companies extract the data they need for their mission-critical business requirements. Most notably, companies that:
- rely on web data to make critical business decisions, or
- provide web data-driven services to their end customers.
In both scenarios, having a reliable data extraction infrastructure that delivers high-quality data is the #1 priority.
When your business's success is dependent on maintaining reliable, high-quality data, any disruption to the quality or reliability of your data feeds can have devastating consequences for your business.
Imagine you are the provider of product and price monitoring services for the world's largest consumer brands (think Nike, Unilever, P&G). Your customers rely on your data-driven services to fuel their competitor research, dynamic pricing and new product research. If their data feeds stop working or suddenly start to experience data quality issues, you are in serious hot water.
Your customers will be flying blind, unable to make data-driven product pricing or positioning decisions until the data quality issue is resolved, potentially costing them hundreds of thousands or even millions of dollars in lost revenue.
Not to mention the technical headaches you will face when trying to diagnose and rectify the underlying problems; the reputational and contractual risks to your business would be enormous. Even a single occurrence of a serious data quality issue could kill your business.
As a result, data quality and reliability are burning needs for many companies.
To ensure our clients feel safe relying on Scrapinghub to deliver their data, we developed Spidermon, our (until recently) proprietary library for monitoring Scrapy spiders.
Spidermon is Scrapinghub’s battle-tested library for monitoring Scrapy spiders. Over the last 3 years, Spidermon has been central to our ability to consistently deliver the most reliable data feeds on the market.
In fact, we are so confident in our capabilities that we guarantee data quality and reliability in all our customer service level agreements (SLAs).
Spidermon is a Scrapy extension for monitoring Scrapy spiders. It provides a suite of data validation, stats monitoring, and notification tools that enable Scrapinghub to quickly develop robust spider monitoring functionality for our clients’ spiders.
Until recently, Spidermon was a proprietary internal technology. However, as open source is in our DNA, Scrapinghub’s co-founders Shane Evans and Pablo Hoffman were adamant that we open source the technology to help developers scrape the web more efficiently at scale.
If you would like to learn more about how you can integrate Spidermon into your own web scraping projects then be sure to check out Spidermon: Scrapinghub’s Open Source Spider Monitoring Library and Spidermon’s GitHub repository.
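To give a flavour of what integration involves, here is a minimal sketch of the Scrapy settings used to wire Spidermon into a project. The setting names and extension path are Spidermon’s own; the monitor suite path (`myproject.monitors.SpiderCloseMonitorSuite`) is a placeholder for a suite you would define yourself.

```python
# settings.py -- minimal sketch of enabling Spidermon in a Scrapy project.

SPIDERMON_ENABLED = True

EXTENSIONS = {
    # Spidermon's Scrapy extension hooks into the spider lifecycle.
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}

# Monitor suites to run when each spider finishes.
# "myproject.monitors.SpiderCloseMonitorSuite" is a placeholder path.
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "myproject.monitors.SpiderCloseMonitorSuite",
)
```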
Spidermon’s Role In Our Data Quality & Reliability Guarantees
Spidermon is central to Scrapinghub’s four-layer data quality assurance process:
- Layer 1 - Pipelines: Pipelines are rule-based Scrapy constructs designed to cleanse and validate the data as it is being scraped.
- Layer 2 - Spidermon: As described above, Spidermon is a spider monitoring framework we’ve developed to monitor and validate the data as it is being scraped.
- Layer 3 - Manually-Executed Automated QA: The third component of Scrapinghub’s QA process is the Python-based automated tests our dedicated QA team develops and executes. During this stage, datasets are analysed to identify any potential sources of data corruption. If any issues are found, these are then manually inspected by a QA engineer.
- Layer 4 - Manual/Visual QA: The final step is to manually investigate any issues flagged by the automated QA process and additionally manually spot check sample sets of data to validate that the automated QA steps haven’t missed any data issues.
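To illustrate Layer 1, here is a simplified, self-contained sketch of a rule-based cleansing and validation pipeline. It mimics the `process_item` interface of Scrapy item pipelines, but defines its own `DropItem` stand-in so it runs without Scrapy installed; the field names and rules are purely illustrative.

```python
# A simplified sketch of a rule-based cleansing/validation pipeline in the
# style of a Scrapy item pipeline. In a real project you would raise
# scrapy.exceptions.DropItem; here we define a stand-in so the sketch is
# self-contained.

class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class PriceValidationPipeline:
    def process_item(self, item, spider):
        # Cleanse: normalise a raw price string like "$1,299.00" to a float.
        raw = str(item.get("price", "")).strip()
        cleaned = raw.replace("$", "").replace(",", "")
        try:
            item["price"] = float(cleaned)
        except ValueError:
            raise DropItem(f"Unparseable price: {raw!r}")
        # Validate: enforce a simple value restriction.
        if item["price"] <= 0:
            raise DropItem(f"Non-positive price: {item['price']}")
        return item

pipeline = PriceValidationPipeline()
clean = pipeline.process_item({"name": "Shoe", "price": "$1,299.00"}, spider=None)
```

A real pipeline would chain several such rules, each cleansing or rejecting items as they stream out of the spider.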
The monitoring functionality Spidermon provides powers Scrapinghub’s automatic and manual QA activities.
Once integrated with a spider, Spidermon continuously runs in the background monitoring the data extraction process for potential sources of data quality or reliability issues.
Spidermon verifies the scraped data against a schema that defines the expected structure, data types and value restrictions, ensuring that if a spider ceases to extract all the target data correctly, the underlying problem can be identified and resolved immediately, before the data quality issue ever reaches the client's data infrastructure.
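Conceptually, that schema check works like the following hand-rolled sketch (the real library validates items against schema definitions such as JSON Schema; the schema and field names here are illustrative):

```python
# A minimal illustration of schema-based item validation: required fields,
# expected types, and value restrictions. The schema and fields are
# hypothetical examples, not Spidermon's API.

SCHEMA = {
    "name":  {"type": str, "required": True},
    "price": {"type": float, "required": True, "min": 0.0},
    "url":   {"type": str, "required": True},
}

def validate(item, schema):
    """Return a list of validation errors (an empty list means the item passes)."""
    errors = []
    for field, rules in schema.items():
        if field not in item:
            if rules.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        value = item[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum {rules['min']}")
    return errors

ok_item = {"name": "Shoe", "price": 49.99, "url": "https://example.com/shoe"}
bad_item = {"name": "Shoe", "price": -1.0}
errors = validate(bad_item, SCHEMA)
```

When a spider silently stops extracting a field, checks like these surface the problem as a validation error instead of a gap in the delivered dataset.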
Spidermon also monitors the spider execution for bans, errors and item coverage drops, among other aspects of a typical spider's execution that may indicate the early signs of a reliability issue. This ensures that our engineers can investigate the cause before gaps appear in the data feed (which is extremely important for some of our clients, especially investment asset managers who need perfect data integrity to backtest their investment theses).
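The kind of execution check involved can be sketched as a simple function over the crawl stats. The stat names below mirror Scrapy's standard stat keys; the thresholds and the history-comparison logic are illustrative assumptions, not Spidermon's actual defaults.

```python
# A simplified sketch of an end-of-crawl health check: compare this run's
# stats against thresholds and recent history to flag bans, errors, and
# item coverage drops. Thresholds are illustrative.

def check_spider_health(stats, recent_item_counts,
                        max_ban_ratio=0.05, max_drop_ratio=0.3):
    """Return a list of warnings suggesting a reliability issue."""
    warnings = []
    requests = stats.get("downloader/request_count", 0)
    banned = stats.get("downloader/response_status_count/403", 0)
    if requests and banned / requests > max_ban_ratio:
        warnings.append("possible ban: high rate of 403 responses")
    if stats.get("log_count/ERROR", 0) > 0:
        warnings.append("errors logged during the crawl")
    items = stats.get("item_scraped_count", 0)
    if recent_item_counts:
        baseline = sum(recent_item_counts) / len(recent_item_counts)
        if items < baseline * (1 - max_drop_ratio):
            warnings.append("item coverage drop versus recent runs")
    return warnings

stats = {
    "downloader/request_count": 1000,
    "downloader/response_status_count/403": 120,
    "log_count/ERROR": 0,
    "item_scraped_count": 400,
}
warnings = check_spider_health(stats, recent_item_counts=[900, 950, 1000])
```

Catching a rising 403 rate or a sudden drop in items scraped, before the run even finishes delivering, is what lets engineers step in ahead of any gap in the feed.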
What makes Spidermon so powerful is the fact it can provide these technical capabilities extremely efficiently at enormous scale. Scrapinghub extracts data from over 8 billion pages per month, with Spidermon monitoring the spider execution for every single one.
If an issue is ever detected, a QA engineer is immediately alerted and can then diagnose and resolve the underlying problem.
If you would like to take a deeper look at Scrapinghub’s four-layer data quality assurance process, the exact data validation tests we conduct and how you can build your own quality system, then be sure to check our whitepaper: Data Quality Assurance: A Sneak Peek Inside Scrapinghub’s Quality Assurance System.
We can safely say that without Spidermon, Scrapinghub would never have been able to scale our operation to its current size whilst simultaneously improving the quality and reliability of our data feeds.
Your Data Extraction Needs
At Scrapinghub we specialise in turning unstructured web data into structured data. If you need to start or scale your web scraping projects, our Solution Architecture team is available for a free consultation, where we will evaluate and develop the architecture for a data extraction solution that meets your data and compliance requirements.
At Scrapinghub we always love to hear what our readers think of our content, and we welcome any questions you may have. So please leave a comment below with your thoughts, and perhaps consider sharing what you are working on right now!