Solution Architecture Part 5: Designing A Well-Optimised Web Scraping Solution
In the fifth and final post of this solution architecture series, we will share with you how we architect a web scraping solution, all the core components of a well-optimized solution, and the resources required to execute it.
To give you an inside look at this process in action, we will share behind-the-scenes examples of projects we’ve scoped for our clients.
But first, let’s take a look at the main components you need for every web scraping project…
Disclaimer: I am not a lawyer, and the recommendations in this guide do not constitute legal advice. Our Head of Legal is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Scrapinghub’s experience helping our clients (startups to Fortune 100s) maintain GDPR compliance while scraping billions of web pages each month. If you want assistance with your specific situation then you should consult a lawyer.
Web Scraping Building Blocks
There are a few core components to every web scraping project that you need to have in place if you want to reliably extract high-quality data from the web at scale:
- Crawler Hosting
- Proxy Management
- Crawler Monitoring & Data QA
- Crawler Maintenance
However, depending on your project requirements you might also need to make use of other technologies to extract the data you need:
- Headless Browser
- Intelligent Crawlers
- Data Post-Processing
The amount of resources required to develop and maintain the project will be determined by the type and frequency of data needed and the complexity of the project.
Designing A Web Scraping Solution
Talking about the building blocks of web scraping projects is all well and good, however, the best way to see how to scope a solution is to look at real examples.
In the first example, we’ll look at one of the most common web scraping use cases: product monitoring. Every day Scrapinghub receives numerous requests from companies looking to develop internal product intelligence capabilities using web scraped data. Here we will look at a typical example:
Project Requirements: The customer wanted to extract product data from specific product pages on Amazon.com. They would provide a batch of search terms (~500 keywords per day), and the crawlers would search for those keywords and extract all products associated with them, capturing the following fields:
- Product URL
- Product name
- Image URLs
- Product Description
- Product Information
- Up to 20 one-star reviews
- Up to 20 five-star reviews
The extracted data would be used in a customer-facing product intelligence tool for consumer brands looking to monitor their own products along with those of their competitors.
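Because the extracted data feeds a customer-facing tool, each record needs to be checked before it is delivered. The sketch below is illustrative, not Scrapinghub’s actual schema: the field names are assumptions based on the list above, and the check shows the kind of completeness rule a data QA layer would enforce.

```python
# Hypothetical per-product record check: field names are assumptions
# derived from the requirements list, not a real Scrapinghub schema.

REQUIRED_FIELDS = ["product_url", "product_name", "image_urls", "description"]

def validate_product(record: dict) -> list:
    """Return a list of QA problems found in an extracted product record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing or empty field: {field}")
    # Review lists are capped at 20 per the project requirements.
    for field in ("one_star_reviews", "five_star_reviews"):
        if len(record.get(field, [])) > 20:
            problems.append(f"{field} exceeds the 20-review cap")
    return problems

record = {
    "product_url": "https://www.amazon.com/dp/B000000000",  # hypothetical ASIN
    "product_name": "Example Product",
    "image_urls": ["https://example.com/img1.jpg"],
    "description": "A sample description.",
    "one_star_reviews": [],
    "five_star_reviews": [],
}
print(validate_product(record))  # -> []
```

A record that fails any of these checks would be flagged by the monitoring layer rather than passed through to the client’s application.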
Legal Assessment: Typically, extracting product data poses very few legal issues provided that the crawler (1) doesn’t have to scrape behind a login (which often isn’t the case), (2) is only scraping factual or non-copyrightable information, and (3) the client doesn’t want to recreate the target website’s whole store, which may bring database rights into question. In this case, the project had little to no legal challenges.
Technical Feasibility: Although this project required a large scale crawling of a complex and ever-changing website, projects like this are Scrapinghub’s bread and butter. We have considerable experience delivering similar (and more complex) projects for clients so this was a very manageable project. We would be able to reuse a considerable amount of code used elsewhere to enable us to get the project up and running very quickly for the client.
Solution: After assessing the project the solution architect then developed a custom solution to meet the client's requirements. The solution consisted of three main parts:
- Data Discovery - the solution proposed that Scrapinghub manually develop data discovery crawlers for the site to automatically enter the product keywords into the search field, navigate to the associated “shelf page” and extract the product URLs for each product. These crawlers would then iterate through each “shelf page” until all product URLs had been extracted.
- Data Extraction - once the product URLs had been extracted, they would then be sent to the data extraction crawlers, which would navigate to each individual product page and extract the required data.
- Extraction Scale & Reliability - given the volume of requests the crawlers would be making to the same website each day, there was an obvious requirement for a proxy solution, in this case, Crawlera. It was also recommended that this project make use of Scrapinghub’s open source projects Spidermon and Arche for data quality monitoring.
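The data discovery step above can be sketched in a few lines. This is a minimal illustration of turning each keyword into a sequence of paginated “shelf page” URLs; the URL pattern is an assumption for illustration only, and a real crawler would also parse each page for product URLs and stop when no results remain.

```python
# Sketch of the data discovery step: keyword -> paginated search URLs.
# The search URL pattern is an assumption, shown for illustration.
from urllib.parse import quote_plus

def shelf_page_urls(keyword: str, max_pages: int = 3):
    """Yield paginated search-result ("shelf page") URLs for one keyword."""
    base = "https://www.amazon.com/s"
    for page in range(1, max_pages + 1):
        yield f"{base}?k={quote_plus(keyword)}&page={page}"

urls = list(shelf_page_urls("wireless headphones", max_pages=2))
print(urls)
```

In production this logic would live inside a Scrapy spider, with the product URLs from each shelf page queued for the data extraction crawlers.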
Outcome: Scrapinghub successfully implemented this project for the client. The crawlers developed now extract ~500,000 products per day from the site, which the client inputs directly into their customer-facing product monitoring application.
In the next example, we’re going to take a look at a more complex web scraping project that required us to use artificial intelligence to extract article data from 300+ news sources.
Project Requirements: The customer wanted to develop a news aggregator app that would curate news content for specific industries and interests. They provided an initial list of 300 news sites they wanted to crawl; however, they indicated that this number was likely to rise as their company grew. The client required every article in specific categories to be extracted from all the target sites, crawling each site every 15 minutes to every hour depending on the time of day. The client needed to extract the following data from every article:
- Canonical URL
- Article Image (Top Image)
- Media URLs
- Publish Date
Once extracted, this data would be fed directly into their customer-facing app, so ensuring high-quality, reliable data was a critical requirement.
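Since the articles flow straight into a customer-facing app, a check like the following would sit between extraction and publication. The field names are illustrative assumptions mirroring the list above, and the date check assumes ISO-formatted publish dates.

```python
# Illustrative pre-publication check for an extracted article record.
# Field names and the ISO date format are assumptions for this sketch.
from datetime import datetime

def article_is_publishable(article: dict) -> bool:
    """Verify required fields are present and the publish date parses."""
    required = ("canonical_url", "top_image", "media_urls", "publish_date")
    if any(field not in article for field in required):
        return False
    try:
        datetime.fromisoformat(article["publish_date"])
    except ValueError:
        return False
    return True
```

Articles failing the check would be routed to the data QA layer for review instead of appearing in the app.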
Legal Assessment: With article extraction, you always want to be cognizant of the fact that the articles are copyrighted material of the target website. You must ensure that you are not simply copying an entire article and republishing it. In this case, since the customer was aggregating the content internally and only republishing headlines and short snippets of the content, it was deemed that this project could fall under the fair use doctrine under copyright law. There are various copyright considerations and use cases to take into account when dealing with article extraction, so it is always best to consult with your legal counsel first.
Technical Feasibility: Although the project was technically feasible, due to the scale of the project (developing high-frequency crawlers for 300+ websites) the natural concern was that it would be financially unviable to pursue such a project.
As a rule of thumb, it takes an experienced crawl engineer 1-2 days to develop a robust and scalable crawler for one website. A rough calculation quickly shows that manually developing 300+ crawlers would be a very costly project, even at one work day per crawler.
With this in mind, our solution architecture team explored the use of AI-enabled intelligent crawlers that would remove the need to code custom crawlers for every website.
Solution: After conducting the technical feasibility assessment the solution architect then developed a custom solution to meet the client's requirements. The solution consisted of three main parts:
- Data Discovery - the solution proposed that Scrapinghub manually develop data discovery crawlers for each site that would crawl the target website and locate articles for extraction. As these crawlers were crawling list pages, they were much easier to develop and there was a high level of transferability between sites to significantly cut down the development time.
- Data Extraction - once the article URLs were extracted, these would be sent to Scrapinghub’s automatic data extraction API, which would use AI crawlers to extract the article data without having to manually develop extraction crawlers for each site.
- Extraction Scale & Reliability - similar to the previous project, given the number of sites being crawled, this project required the use of a sophisticated proxy management system and data quality assurance layer. For this project, the solution architect recommended using Crawlera as the proxy solution and Scrapinghub’s open source projects Spidermon and Arche for data quality monitoring.
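To make the data extraction step concrete, here is a sketch of how discovered article URLs could be batched for an automatic extraction API. The payload shape and "pageType" field follow the pattern of Scrapinghub’s automatic extraction service at the time, but treat the endpoint and field names as assumptions and consult the current API documentation.

```python
# Sketch of handing discovered article URLs to an automatic extraction
# API. Payload shape and endpoint are assumptions for illustration.
import json

def build_extraction_payload(article_urls):
    """One query per discovered article URL, requesting article extraction."""
    return [{"url": url, "pageType": "article"} for url in article_urls]

payload = build_extraction_payload([
    "https://news.example.com/story-1",  # hypothetical discovered URLs
    "https://news.example.com/story-2",
])
body = json.dumps(payload)

# The batch would then be POSTed with an API key, e.g. with requests:
#   requests.post("https://autoextract.scrapinghub.com/v1/extract",
#                 auth=(API_KEY, ""), json=payload)
```

The API would return structured article data (headline, body, publish date, and so on) for each URL, with no per-site extraction code to maintain.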
Outcome: Scrapinghub successfully implemented this project for the client, who is now able to extract 100,000-200,000 articles per day from the target websites for the news aggregation app.
Your Web Scraping Project
So there you have it: this is the four-step process Scrapinghub uses to architect solutions for our clients’ web scraping projects. At Scrapinghub we have extensive experience architecting and developing data extraction solutions for every possible use case.
Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.
If you need to start or scale your web scraping project, our Solution Architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.