For E-Commerce Data Scientists: Lessons Learned Scraping 100 Billion Product Pages
Web scraping can look deceptively easy these days. There are numerous open-source libraries and frameworks, visual scraping tools and data extraction tools that make it trivial to scrape data from a website. However, when you want to scrape websites at scale, things start to get very tricky, very fast.
In this series of articles, we will share with you the lessons we’ve learned scraping over 100 billion product pages since 2010, give you an in-depth look at the challenges you will face when extracting product data from e-commerce stores at scale and share with you some of the best practices to address those challenges.
In this article, the first of the series, we will give you an overview of the main challenges you will face scraping product data at scale and the lessons Scrapinghub has learned from scraping 100 billion product pages.
What’s Important When Scraping At Scale?
Unlike your standard web scraping application, scraping e-commerce product data at scale has a unique set of challenges that make web scraping vastly more difficult.
At its core, these challenges can be boiled down to two things: speed and data quality.
As time is usually a limiting constraint, scraping at scale requires your crawlers to scrape the web at very high speeds without compromising data quality. This need for speed makes scraping large volumes of product data very challenging.
Challenge #1 - Sloppy and Always Changing Website Formats
It might be obvious and it might not be the sexiest challenge, but sloppy and always changing website formats are by far the biggest challenge you will face when extracting data at scale. Not necessarily because of the complexity of the task, but because of the time and resources you will spend dealing with it. Here are a couple of examples we run into regularly:
- Stores that remove pages when they discontinue products, then suddenly begin returning 200 response codes from their 404 error handler after a website upgrade.
- Stores that abuse Ajax calls so much that you can only get the information you’re after by either rendering the page (resulting in much slower crawls) or mimicking the API calls (resulting in more development effort).
Sloppy code like this not only makes writing your spider a pain, it can also render visual scraping tools and automatic extraction tools unviable.
When scraping at scale, not only do you have to navigate potentially hundreds of websites with sloppy code, you will also have to deal with constantly evolving websites. A good rule of thumb is to expect your target website to make changes that will break your spider (drop in data extraction coverage or quality) every 2-3 months.
That mightn’t sound like too big a deal, but when you are scraping at scale those incidents really add up. For example, one of Scrapinghub’s larger e-commerce projects has ~4,000 spiders targeting about 1,000 e-commerce websites, which means it can experience 20-30 spider failures per day.
Variations in website layouts from regional and multilingual websites, A/B split testing and packaging/pricing variants also create a world of problems that routinely break spiders.
No Easy Solution
Unfortunately, there is no magic bullet that will completely solve these problems. A lot of the time it is just a matter of committing more resources to your project as you scale. To take the previous project as an example again, it has a team of 18 full-time crawl engineers and 3 dedicated QA engineers to ensure the client always has a reliable data feed.
With experience, however, your team will learn to create ever more robust spiders that can detect and deal with quirks in your target websites’ formats.
Instead of having multiple spiders for all the possible layouts a target website might use, it is best practice to have only one product extraction spider that can deal with all the possible rules and schemes used by different page layouts. The more configurable your spiders are the better.
Although these practices will make your spiders more complex (some of our spiders are thousands of lines long), they will ensure that your spiders are easier to maintain.
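One common way to make a single spider cover several page layouts is to express each field as an ordered list of candidate selectors, tried in turn until one matches. Here is a minimal sketch of the idea; the patterns, class names and fields are purely illustrative, not taken from any real site (regex-on-HTML is used only to keep the example dependency-free — a real spider would use proper selectors):

```python
import re

# Each field maps to an ordered list of candidate patterns, one per known
# page layout. The spider tries them in turn and keeps the first match, so
# supporting a new layout usually means adding one entry to the config
# rather than writing a new spider.
LAYOUT_RULES = {
    "name": [
        r'<h1 class="product-title">(.*?)</h1>',  # current layout
        r'<h2 id="prod-name">(.*?)</h2>',         # legacy layout
    ],
    "price": [
        r'<span class="price">\$([\d.]+)</span>',
        r'data-price="([\d.]+)"',
    ],
}

def extract_product(html, rules=LAYOUT_RULES):
    """Return a dict of extracted fields; unmatched fields come back as None."""
    item = {}
    for field, patterns in rules.items():
        item[field] = None
        for pattern in patterns:
            match = re.search(pattern, html)
            if match:
                item[field] = match.group(1)
                break
    return item
```

The payoff is that a layout change shows up as a `None` field (easy to monitor for) rather than a crash, and fixing it is a one-line config change.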
As most companies need to extract product data on a daily basis, waiting a couple of days for your engineering team to fix a broken spider isn’t an option. When these situations arise, Scrapinghub uses a machine learning based data extraction tool that we’ve developed as a fallback until the spider has been repaired. This ML-based extraction tool automatically identifies the target fields on the target website (product name, price, currency, image, SKU, etc.) and returns the desired results.
In the coming weeks, we’ll be releasing this tool to the public, along with articles on how you can incorporate machine learning into your data extraction processes. In the meantime, you can gain early access by contacting our sales team.
Challenge 2: Scalable Architecture
The next challenge you will face is building a crawling infrastructure that will scale as the number of requests per day increases, without degrading in performance.
When extracting product data at scale a simple web crawler that crawls and scrapes data serially just won’t cut it. Typically, a serial web scraper will make requests in a loop, one after the other, with each request taking 2-3 seconds to complete.
This approach is fine if your crawler is only required to make fewer than 40,000 requests per day (one request every 2 seconds equals 43,200 requests per day). However, past this point you will need to transition to a crawling architecture that will allow you to make millions of requests per day with no decrease in performance.
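The arithmetic behind that 40,000-requests-per-day ceiling is worth making explicit: throughput is simply seconds-in-a-day divided by per-request latency, multiplied by how many requests are in flight at once. A back-of-the-envelope calculation (the 2-second latency and concurrency figure are illustrative assumptions):

```python
def daily_capacity(seconds_per_request, concurrency=1):
    """Requests per day for a crawler with the given per-request latency
    and number of requests in flight at once (86,400 seconds in a day)."""
    return int(86_400 / seconds_per_request * concurrency)

# A serial crawler at 2 s/request tops out at 43,200 requests/day...
print(daily_capacity(2))                   # 43200
# ...while the same latency with 50 concurrent requests reaches millions.
print(daily_capacity(2, concurrency=50))   # 2160000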
As this topic warrants an article unto itself, in the coming weeks we will publish a dedicated article on how to design and build your own high-throughput scraping architecture. For the remainder of this section, however, we will discuss some of the higher-level principles and best practices.
As we’ve discussed, speed is key when it comes to scraping product data at scale. You need to ensure that you can find and scrape all the required product pages in the time allotted (often one day). To do this you need to do the following:
Separate Product Discovery From Product Extraction
To scrape product data at scale you need to separate your product discovery spiders from your product extraction spiders.
The goal of the product discovery spider is to navigate to the target product category (or “shelf”) pages and store the URLs of the products in those categories for the product extraction spiders. As the product discovery spider adds product URLs to the queue, the product extraction spiders scrape the target data from those product pages.
This can be accomplished with the aid of a crawl frontier such as Frontera, the open source crawl frontier developed by Scrapinghub. While Frontera was originally designed for use with Scrapy, it’s completely agnostic and can be used with any other crawling framework or standalone project. In this article, we share how you could use Frontera to scrape HackerNews at scale.
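The shape of this pipeline can be sketched in a few lines. This is a simplified in-memory stand-in, not Frontera itself: a real crawl frontier persists, deduplicates and prioritises the queue across machines, and the `fetch` callable here stands in for an HTTP client plus parser:

```python
from collections import deque

# Stand-in for a crawl frontier such as Frontera: a shared queue of
# product URLs that decouples discovery from extraction.
url_frontier = deque()

def discovery_spider(shelf_pages):
    """Walk category ('shelf') pages and enqueue product URLs only --
    no heavy field extraction happens in this spider."""
    for page in shelf_pages:
        for url in page["product_urls"]:
            url_frontier.append(url)

def extraction_spider(fetch):
    """Drain the frontier, fetching and extracting each product page.
    In production, many extraction spiders consume the same frontier."""
    items = []
    while url_frontier:
        url = url_frontier.popleft()
        items.append(fetch(url))
    return items
```

Because the two sides only share the queue, they can be scaled independently: you can run one discovery spider and many extraction workers against the same frontier.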
Allocate More Resources To Product Extraction
As each product category “shelf” can contain anywhere from 10 to 100 products, and extracting product data is more resource-intensive than extracting a product URL, discovery spiders typically run faster than product extraction spiders. When this is the case, you need to have multiple extraction spiders for every discovery spider. A good rule of thumb is to create a separate extraction spider for each ~100,000 page bucket.
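That rule of thumb reduces to a one-line sizing calculation; the figures below are placeholders, not recommendations for any particular site:

```python
import math

def extraction_spiders_needed(total_product_pages, pages_per_spider=100_000):
    """How many extraction spiders to pair with one discovery spider,
    using the ~100k-pages-per-spider rule of thumb from the text."""
    return max(1, math.ceil(total_product_pages / pages_per_spider))

# e.g. a site with ~450,000 product pages would get 5 extraction spiders.
print(extraction_spiders_needed(450_000))  # 5
```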
Challenge 3: Maintaining Throughput Performance
Scraping at scale can easily be compared to Formula 1, where your goal is to shave every unnecessary gram of weight from your car and squeeze the last fraction of horsepower from the engine, all in the name of speed. The same is true for web scraping at scale.
When extracting large volumes of data you are always on the lookout for ways to minimise the request cycle time and maximise your spiders’ use of the available hardware resources, all in the hope of shaving a couple of milliseconds off each request.
To do this your team will need to develop a deep understanding of the web scraping framework, proxy management and hardware you are using, so you can tune them for optimal performance. You will also need to focus on crawl efficiency:
When scraping at scale you should always be focused on solely extracting the exact data you need in as few requests as possible. Any additional requests or data extraction slow the pace at which you can crawl a website. Keep these tips in mind when designing your spiders:
- If you can get the data you need from the shelf page (e.g. product names, prices, ratings, etc.) without requesting each individual product page, then don’t request the product pages.
- Don’t request or extract images unless you really have to.
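To make the first tip concrete: if a shelf page lists name and price for every product tile, one request can yield dozens of items with no follow-up requests at all. A sketch of such a parser, with purely illustrative markup (again using a regex only to keep the example dependency-free; a real Scrapy spider would use CSS or XPath selectors):

```python
import re

# One product tile on a hypothetical shelf page; the class names and
# structure are made up for illustration.
SHELF_ITEM = re.compile(
    r'<li class="product">\s*'
    r'<a href="(?P<url>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

def parse_shelf(html):
    """Yield one item per product tile on a shelf page. For these fields,
    no request to the individual product page is ever made."""
    for match in SHELF_ITEM.finditer(html):
        yield {
            "url": match.group("url"),
            "name": match.group("name"),
            "price": float(match.group("price")),
        }
```

On a shelf listing 60 products, this turns 61 requests (shelf plus every product page) into 1.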
Challenge 4: Anti-Bot Countermeasures
If you are scraping e-commerce sites at scale you are guaranteed to run into websites employing anti-bot countermeasures.
For most smaller websites, the anti-bot countermeasures will be quite basic (banning IPs that make excessive requests). However, larger e-commerce websites such as Amazon make use of sophisticated anti-bot countermeasures, such as Distil Networks, Incapsula or Akamai, that make extracting data significantly more difficult.
With that in mind the first and most essential requirement for any project scraping product data at scale is to use proxy IPs. When scraping at scale you will need a sizeable list of proxies, and will need to implement the necessary IP rotation, request throttling, session management and blacklisting logic to prevent your proxies from getting blocked.
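The rotation and blacklisting logic mentioned above can be sketched as follows. This is a deliberately simplified in-memory version under assumed parameters (the 300-second ban window is an arbitrary choice); production systems also handle sessions, per-domain throttling and geo-targeting:

```python
import itertools
import time

class ProxyRotator:
    """Round-robin over a proxy pool, skipping proxies that were recently
    banned. Bans expire after `ban_seconds` so proxies re-enter the pool."""

    def __init__(self, proxies, ban_seconds=300):
        self.pool = itertools.cycle(proxies)
        self.size = len(proxies)
        self.ban_seconds = ban_seconds
        self.banned = {}  # proxy -> timestamp when it was banned

    def mark_banned(self, proxy):
        """Call this when a request through `proxy` gets blocked."""
        self.banned[proxy] = time.time()

    def next_proxy(self):
        """Return the next usable proxy, skipping banned ones."""
        for _ in range(self.size):
            proxy = next(self.pool)
            banned_at = self.banned.get(proxy)
            if banned_at is None or time.time() - banned_at > self.ban_seconds:
                return proxy
        raise RuntimeError("all proxies are currently banned")
```

Even this toy version hints at why proxy management balloons in scope: add session stickiness, per-site throttling and ban-detection heuristics and you have a full engineering project of its own.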
Unless you have, or are willing to commit, a sizeable team to manage your proxies, you should outsource this part of the scraping process. There are a huge number of proxy services available that provide varying levels of service.
However, our recommendation is to go with a proxy provider who can provide a single endpoint for proxy configuration and hide all the complexities of managing your proxies. Scraping at scale is resource intensive enough without trying to reinvent the wheel by developing and maintaining your own internal proxy management infrastructure.
This is the approach most of the large e-commerce companies use. A number of the world’s largest e-commerce companies use Crawlera, the smart downloader developed by Scrapinghub, to completely outsource their proxy management. When your crawlers are making 20 million requests per day, it makes much more sense to focus on analysing the data, not managing proxies.
Unfortunately, just using a proxy service won’t be enough to ensure you can evade anti-bot countermeasures on larger e-commerce websites. More and more websites are using sophisticated countermeasures that monitor your crawler’s behaviour to detect that it isn’t a real human visitor.
Not only do these anti-bot countermeasures make scraping e-commerce sites more difficult; overcoming them incorrectly can significantly dent your crawler’s performance.
This means that to ensure you can achieve the necessary throughput from your spiders to deliver daily product data you often need to painstakingly reverse engineer the anti-bot countermeasures used on the site and design your spider to counteract them without using a headless browser.
Challenge 5: Data Quality
From a data scientist’s perspective, the most important consideration in any web scraping project is the quality of the data being extracted. Scraping at scale only makes this focus on data quality more important.
When extracting millions of data points every single day, it is impossible to manually verify that all your data is clean and intact. It is very easy for dirty or incomplete data to creep into your data feeds and disrupt your data analysis efforts.
This is especially true when scraping products on multiple versions of the same store (different languages, regions, etc.) or separate stores.
Outside of a careful QA process during the spider design phase, where the spider’s code is peer reviewed and tested to ensure it extracts the desired data in the most reliable way possible, the best method of ensuring the highest possible data quality is to develop an automated QA monitoring system.
As part of any data extraction project you need to plan and develop a monitoring system that will alert you to any data inconsistencies and spider errors. At Scrapinghub we’ve developed machine learning algorithms designed to detect:
- Data Validation Errors - Every data item has a defined data type and values that follow a consistent pattern. Our data validation algorithms flag to the project’s QA team any data items that are inconsistent with what is expected for that data type, at which point the data is manually checked and the alert either verified or flagged as an error.
- Product Variation Errors - When scraping the same product data from multiple versions of the same website (different languages, regions, etc.), supposedly fixed values such as product weight or dimensions can vary. This can be the result of a website’s anti-bot countermeasures feeding one or more of your crawlers falsified information. Again, you need to have algorithms in place to identify and flag any occurrences of this.
- Volume Based Inconsistencies - Another key monitoring script is one that detects any abnormal variations in the number of records returned. This could signify that there have been changes made to the website or that your crawler is being fed falsified information.
- Site Changes - Structural changes to the target websites are the main reason crawlers break, so our dedicated monitoring system watches for them quite aggressively. The tool performs frequent checks on the target site to make sure nothing has changed since the previous crawl; if changes are found, it sends out notifications.
We will discuss all of these in a later article dedicated to automated quality assurance.
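To give a feel for the data validation idea (the first item in the list above), here is a heavily simplified, rule-based sketch rather than Scrapinghub’s actual ML system: each field has an expected type and pattern, and any item that deviates is flagged for manual review. The schema is illustrative:

```python
import re

# Expected type and constraints for each field; items that deviate are
# flagged for the QA team rather than silently ingested into the feed.
PRODUCT_SCHEMA = {
    "name":     {"type": str, "pattern": r"\S"},          # non-empty
    "price":    {"type": float, "min": 0.0},
    "currency": {"type": str, "pattern": r"^[A-Z]{3}$"},  # e.g. "USD"
}

def validation_errors(item, schema=PRODUCT_SCHEMA):
    """Return a list of human-readable problems with one scraped item."""
    errors = []
    for field, rules in schema.items():
        value = item.get(field)
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}, got {value!r}")
            continue
        if "pattern" in rules and not re.search(rules["pattern"], value):
            errors.append(f"{field}: {value!r} does not match expected pattern")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: {value!r} is below the minimum")
    return errors
```

Run over millions of items per day, even a crude check like this catches layout breakage (missing fields), parsing bugs (negative prices) and locale mix-ups (lowercase or symbolic currencies) long before an analyst does.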
Wrapping Things Up
As you have seen, scraping product data at scale creates its own unique set of challenges. Hopefully this article has made you more aware of the challenges you will face and how you should go about solving them.
However, this is just the first article in the series, so if you are interested in reading the next articles as soon as they are published, be sure to sign up to our email list.
For those of you who are interested in scraping the web at scale but are wrestling with the decision of whether to build a dedicated web scraping team in-house or outsource it to a dedicated web scraping firm, be sure to check out our guide, Enterprise Web Scraping: A Guide to Scraping the Web at Scale.
At Scrapinghub we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped product data in your business, feel free to contact our sales team, who will talk you through the services we offer everyone from startups right through to Fortune 100 companies.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.