When it comes to web scraping, one key element is often overlooked until it becomes a big problem: data quality.
Getting consistently high-quality data from the web is critical to the success of any web scraping project, particularly when scraping at scale or extracting mission-critical data where accuracy is paramount.
Data quality can be the difference between a project being discontinued and one that gives your business a huge competitive edge in its market.
In this article we’re going to talk about data quality assurance for web scrapers, give you a sneak peek into some of the tools and techniques Scrapinghub has developed, and share some big news: we are open sourcing one of our most powerful quality assurance tools. These QA processes enable us to verify the quality of our clients’ data at scale, and to confidently give all our clients data quality and coverage guarantees.
The Importance of Data Quality
From a business perspective, the most important consideration of any web scraping project is the quality of the data being extracted. Without a consistent high quality data feed your web scraping infrastructure will never be able to help your business achieve its objectives.
Today, with the growing prevalence of big data, artificial intelligence and data driven decision making, a reliable source of rich and clean data is a major competitive advantage. Compounding this is the fact that many companies are now directly integrating web scraped data into their own customer-facing products, making real-time data QA a huge priority for them.
Scraping at scale only magnifies the importance of data quality. Poor data accuracy or coverage in a small web scraping project is a nuisance, but usually manageable. However, when you are scraping hundreds of thousands or millions of web pages per day, even a small drop in accuracy or coverage could have huge consequences for your business.
At the commencement of any web scraping project, you always need to be thinking about how you are going to achieve the high levels of data quality you need when scraping the web.
Challenges of Data Quality Assurance
We know that getting high quality data when scraping the web is often of critical importance to your business, but what makes it so complex?
Several factors combine to make it so:
#1 Requirements - The first and most important aspect of data quality verification is clearly defined requirements. Without knowing what data you require, what the final data should look like and what accuracy/coverage level you require, it is very hard to verify the quality of your data. Quite often companies come to Scrapinghub without clear data requirements laid out, so we need to work with the client to properly define what these requirements are. We find that a good question to ask is:
“What effect would a 5% data quality inaccuracy have on your engineers or downstream systems?”
In order to make your data quality targets realistic and achievable, it is important that you specify your requirements clearly and that they be “testable”, particularly when one or more of the following is true:
- The website in question has considerable variability i.e. there are several different page layouts and permutations of the desired entities being scraped.
- The number of desired fields per item is high (more than 15).
- The expected number of items is known to be very large (on the order of hundreds of thousands).
- The website in question is highly category-based, and these categories result in duplication of identical entities in different categories.
- The website is highly category-based but such categorisation is not available to the end user (for manual inspection and cross-referencing).
- It is desired to scrape based on some geographic filtering (e.g. postcode, city).
- The data is to be scraped using a mobile app.
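One way to make requirements “testable” in the sense above is to encode them as a machine-checkable spec that every scraped item is validated against. The sketch below is a minimal illustration of that idea in plain Python; the field names and rules are hypothetical examples, not a real client spec.

```python
# A minimal sketch of scraping requirements encoded as a machine-checkable
# spec. Field names and rules here are hypothetical examples.
REQUIREMENTS = {
    "title": {"type": str, "required": True},
    "price": {"type": float, "required": True, "min": 0.0},
    "postcode": {"type": str, "required": False},
}

def violations(item, spec=REQUIREMENTS):
    """Return a list of human-readable requirement violations for one item."""
    problems = []
    for field, rules in spec.items():
        value = item.get(field)
        if value is None:
            if rules.get("required"):
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(value, rules["type"]):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
        elif "min" in rules and value < rules["min"]:
            problems.append(f"{field} below minimum: {value}")
    return problems
```

With a spec like this, “95% accuracy” stops being a vague aspiration and becomes a number you can compute over every delivered dataset.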
#2 Efficiency at Scale - The beauty of web scraping is that it has an unmatched ability to scale very easily compared to other data gathering techniques. However, data QA often isn’t able to match the scalability of your web scraping spiders, particularly when it involves only manual inspection of a sample of the data and visual comparison with the scraped pages.
#3 Website Changes - Perhaps the biggest cause of poor data coverage or accuracy is changes to the underlying structure of all or parts of the target website. With the increasing usage of A/B split testing, seasonal promotions and regional/multilingual variations, large websites are constantly making small tweaks to the structure of their web pages that can break web scraping spiders. As a result, it is very common for the coverage and accuracy of the data from your spiders to degrade over time unless you have continuous monitoring and maintenance processes in place.
#4 Semantics - Verifying the semantics of textual information, or the meaning of the data that is being scraped, remains a challenge for automated QA. While we and others are developing technologies to assist in the verification of the semantics of the data we extract from websites, no system is 100% perfect. As a result, manual QA of the data is often required to ensure its accuracy.
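Although full semantic verification is out of reach for automated tooling, cheap heuristic checks can still catch the most common semantic failures before a human ever looks at the data. The sketch below shows a few such heuristics; the field names and rules are hypothetical, and a real project would tune these to its own requirements.

```python
from datetime import datetime

def semantic_issues(item):
    """Naive heuristic semantic checks. These catch common extraction
    mistakes; genuinely ambiguous cases still need human review."""
    issues = []
    # A price should parse as a positive number once currency symbols go.
    try:
        if float(str(item.get("price", "")).replace("$", "")) <= 0:
            issues.append("non-positive price")
    except ValueError:
        issues.append("unparseable price")
    # A publication date should be a real calendar date.
    try:
        datetime.strptime(item.get("published", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("unparseable date")
    # Obvious placeholders suggest the wrong page element was extracted.
    if str(item.get("title", "")).strip().lower() in ("", "null", "n/a", "undefined"):
        issues.append("placeholder title")
    return issues
```

Checks like these won’t tell you whether a product description is *correct*, but they reliably flag items where the scraper has clearly picked up the wrong thing.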
How to Structure an Automated QA System for Web Scraping
At a high level, your QA system is trying to assess the quality/correctness of your data along with the coverage of the data you have scraped.
Data Quality and Correctness
- Verify that the correct data has been scraped (fields scraped are taken from the correct page elements).
- Where applicable, the data scraped has been post-processed and presented in the format requested by you during the requirement collection phase (e.g. formatting, added/stripped characters, etc.).
- The field names match the intended field names stipulated by you.
- Item coverage - verify that all available items have been scraped (items are the individual products, articles, property listings, etc.).
- Field coverage - verify that all the available fields for each item have been scraped.
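The two coverage checks above lend themselves to simple automation. As a rough sketch (with an illustrative 95% threshold, not a universal standard), field coverage can be computed per run and any field that falls below the agreed level flagged for inspection:

```python
def field_coverage(items, expected_fields):
    """Fraction of items with a non-empty value for each expected field."""
    total = len(items)
    coverage = {}
    for field in expected_fields:
        filled = sum(1 for it in items if it.get(field) not in (None, "", []))
        coverage[field] = filled / total if total else 0.0
    return coverage

def coverage_alerts(items, expected_fields, threshold=0.95):
    """Return the fields whose coverage fell below the agreed threshold."""
    cov = field_coverage(items, expected_fields)
    return [f for f, ratio in cov.items() if ratio < threshold]
```

Item coverage works the same way at the dataset level: compare the number of items scraped against the number the site is known (or estimated) to contain.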
Depending on the scale, number of spiders, and the degree of complexity of your web scraping requirements, there are different approaches you can take when developing an automated quality assurance system for your web scraping.
- Project Specific Test Framework - Here you develop a custom automated test framework for every web scraping project you work on. Such an approach is preferred if the scraping requirements are complex and/or your spider functionality is highly rules-based, with field interdependencies and other nuances.
- Generic Test Framework - However, if web scraping is going to be at the core of who you are as a business, and you will be constantly developing new spiders to scrape a wide variety of data types then developing a generic test framework is often the best solution. What’s more, for projects that have a custom automated test framework in place, these generic tests can serve to add an additional layer of assurance and test coverage.
Due to the number of clients we scrape the web for and the wide variety of web scraping projects we have in production at any one time, Scrapinghub has experience with both approaches. We’ve developed bespoke project-specific automated test frameworks for individual projects with unique requirements. Principally, though, we rely on the generic automated test framework we’ve developed, which can be used to validate the data scraped by any spider.
When used alongside Spidermon (more on this below), this framework allows us to quickly add a quality assurance layer to any new web scraping project we undertake.
The other key component of any web scraping quality assurance system is a reliable system for monitoring the status and output of your spiders in real-time.
A spider monitoring system allows you to detect sources of potential quality issues immediately after spider execution completes.
At Scrapinghub we’ve developed Spidermon, which allows developers (and indeed other stakeholders such as QA personnel and project managers) to automatically monitor spider execution. It verifies the scraped data against a schema that defines the expected structure, data types and value restrictions. It can also monitor bans, errors, and item coverage drops, among other aspects of a typical spider execution. In addition to the post-execution data validation that Spidermon performs, we often leverage real-time data validation, particularly for long-running spiders, so that a developer can stop a spider as soon as it is detected to be scraping unusable data.
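The item-coverage-drop monitoring mentioned above can be illustrated with a short sketch. This is not Spidermon’s actual API, just the underlying idea: compare the current run’s item count against a baseline of recent runs and alert when it falls by more than an agreed margin (the 20% default here is an illustrative assumption).

```python
def coverage_drop_alert(current_count, previous_counts, max_drop=0.2):
    """Flag a run whose item count fell more than max_drop below the
    average of recent runs (a common symptom of a layout change or ban).
    Returns a message string, or None if the run looks healthy."""
    if not previous_counts:
        return None  # no baseline yet; nothing to compare against
    baseline = sum(previous_counts) / len(previous_counts)
    if baseline and current_count < baseline * (1 - max_drop):
        return (f"item count dropped to {current_count} "
                f"(baseline {baseline:.0f}, allowed drop {max_drop:.0%})")
    return None
```

Wired into a post-execution hook, a check like this turns a silent, gradual coverage degradation into an immediate, actionable alert.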
This brings us to the big news we have to announce. Scrapinghub is delighted to announce that in the coming weeks we are going to open source Spidermon, making it an easy-to-use add-on for all your Scrapy spiders. It can also be used with spiders developed using other Python libraries and frameworks such as BeautifulSoup.
Spidermon is an extremely powerful and robust spider monitoring add-on that has been developed and tested on millions of spiders over its lifetime. To be the first to get access to Spidermon, be sure to get yourself on our email list (below) so we can let you know as soon as Spidermon is released.
Next we’ll take a look at Scrapinghub’s quality assurance process to see how all these elements fit together in an enterprise-scale QA system.
Scrapinghub’s Quality Assurance Process
To ensure the highest data quality from our web scraping, Scrapinghub applies a four-layer QA process to all the projects we undertake with clients.
- Layer 1 - Pipelines: Pipelines are rule-based Scrapy constructs designed to cleanse and validate the data as it is being scraped.
- Layer 2 - Spidermon: As described above, Spidermon is a spider monitoring framework we’ve developed to monitor and validate the data as it is being scraped.
- Layer 3 - Manually-Executed Automated QA: The third component of Scrapinghub’s QA process is the Python-based automated tests our dedicated QA team develops and executes. During this stage, datasets are analysed to identify any potential sources of data corruption. Any issues found are then manually inspected by the QA engineer.
- Layer 4 - Manual/Visual QA: The final step is to manually investigate any issues flagged by the automated QA process and additionally manually spot check sample sets of data to validate that the automated QA steps haven’t missed any data issues.
Only after passing through all four of these layers is the dataset then delivered to the client.
To get a detailed behind-the-scenes look at how Scrapinghub’s quality system works, the exact data validation tests we conduct and how you can build your own quality system, click on the image below to download our Web Scraping Quality Assurance Guide.
Wrapping Things Up
As you have seen, there is often quite a bit of work in ensuring your web scraping projects are actually yielding the high quality data you need to grow your business. Hopefully, this article has made you more aware of the challenges you will face and how you could go about solving them.
At Scrapinghub we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped data in your business, feel free to contact our Sales team, who will talk you through the services we offer, from startups right through to Fortune 100 companies.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on right now.