Building Blocks of an Unstoppable Web Scraping Infrastructure
More and more businesses leverage the power of web scraping, and extracting data from the web is becoming popular. That doesn't mean the technical challenges are gone: building a sustainable web scraping infrastructure takes expertise and experience. Here at Scrapinghub, we scrape 9 billion pages per month. In this article, we summarize the essential elements of web scraping: the building blocks you need to take care of in order to develop a healthy web data pipeline.
The building blocks:
- Web spiders
- Spider management
- Data QA
- Proxy management
It can be difficult to decide which library or tool to use, especially if you’ve never done web scraping before. For one-off projects, it doesn't really matter what library you use. Choose the one that is easiest to get started with in your preferred programming language. But for long-term projects, where you will need to maintain your spiders and maybe build on top of them, you should choose a tool that lets you focus on your project-specific needs and not on the underlying scraping logic.
Now that we've got the spiders right, the next element you need in your web scraping stack is spider management. Spider management is about creating an abstraction on top of your spiders. A spider management platform, like Scrapy Cloud, makes it quick to get a sense of how your spiders are performing. You can schedule jobs and automate spiders, and you can focus on the crawlers rather than the servers.
The trade-off is that in the short term it's much quicker to just render JS and get the data, but in the long term it consumes too many hardware resources. So if execution time and hardware usage matter to you, inspect the website properly first and see whether there are any AJAX requests or hidden API calls in the background. If there are, try to replicate them in your scraper instead of executing JS.
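As a rough illustration of replicating a background API call instead of rendering JS, here is a minimal sketch. The endpoint URL, the `page` and `per_page` parameters, and the `items`, `name`, and `price` fields are all assumptions made up for this example; a real site will have its own, which you would find in the browser's network tab.

```python
import json
from urllib.parse import urlencode

# Hypothetical JSON endpoint spotted in the browser's network tab (an assumption).
API_URL = "https://example.com/api/products"


def build_page_url(page, page_size=50):
    """Build the URL for one page of the hypothetical JSON endpoint."""
    return f"{API_URL}?{urlencode({'page': page, 'per_page': page_size})}"


def parse_products(body):
    """Extract only the fields we care about from one JSON response body."""
    payload = json.loads(body)
    return [
        {"name": item["name"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


# A canned response standing in for what such an endpoint might return.
sample_body = '{"items": [{"name": "Widget", "price": 9.99, "id": 1}]}'
products = parse_products(sample_body)
```

In a real scraper you would request `build_page_url(n)` with your HTTP client of choice and pass the response body to `parse_products`, so no browser or JS engine is involved.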
If there's just no other way to get the data and you have to render JS no matter what, then try to use a more lightweight solution like Splash instead of a full-fledged headless browser. Splash was created for scraping and only renders JS, nothing else, which is exactly what we need to scrape a site.
The next building block is data QA. All our scraping efforts are only worth it if the output produced is the data we expect in the correct format. To make sure this is the case, we can do several things:
- validate the output data against a predefined JSON schema
- check the coverage of the extracted fields
- check duplicates
- compare two scraping jobs
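The checks listed above could be sketched roughly as follows, using only the standard library. This is a simplified stand-in, not a full JSON schema validator: the `EXPECTED_FIELDS` mapping, the field names, and the `tolerance` threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical expectations: required field name -> expected Python type.
EXPECTED_FIELDS = {"url": str, "title": str, "price": float}


def invalid_items(items):
    """Return items that miss a required field or carry a wrong type."""
    return [
        item for item in items
        if any(not isinstance(item.get(name), kind)
               for name, kind in EXPECTED_FIELDS.items())
    ]


def field_coverage(items):
    """Fraction of items with a non-empty value for each expected field."""
    if not items:
        return {name: 0.0 for name in EXPECTED_FIELDS}
    return {
        name: sum(1 for item in items
                  if item.get(name) not in (None, "")) / len(items)
        for name in EXPECTED_FIELDS
    }


def duplicate_keys(items, key="url"):
    """Values of `key` that occur in more than one item."""
    counts = Counter(item.get(key) for item in items)
    return sorted(v for v, n in counts.items() if v is not None and n > 1)


def coverage_drop(previous, current, tolerance=0.1):
    """Fields whose coverage fell by more than `tolerance` between two jobs."""
    return sorted(
        name for name in previous
        if previous[name] - current.get(name, 0.0) > tolerance
    )


items = [
    {"url": "https://example.com/a", "title": "A", "price": 1.0},
    {"url": "https://example.com/a", "title": "A", "price": 1.0},
    {"url": "https://example.com/b", "title": "", "price": None},
    {"url": "https://example.com/c", "title": "C", "price": 3.5},
]
coverage = field_coverage(items)
```

For production use, a dedicated JSON schema validator is likely a better fit than the hand-rolled type check shown here.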
It's also useful to set up alerts and notifications that are triggered when a certain event happens. Scrapinghub recently open sourced two tools that we use for spider monitoring and to ensure data quality. One is Spidermon, which can be used, for example, to send email or Slack notifications and to create custom reports based on web scraping stats. The other is Arche, which can be used to analyze and verify scraped data using a set of rules.
The last building block I want to mention, and a very important one, is proxy management.
It’s not an easy task to scale web scraping. You cannot solve everything just by throwing more hardware at the problem. Sometimes it’s impossible to scale. For example, if your target website has 50,000 visitors/month and you want to scrape it 50,000 times in a month… that’s not going to work.
There are projects where we need ongoing maintenance and we scrape hundreds of millions or even billions of URLs. For this kind of volume, you definitely need a proxy solution, like Crawlera. You can figure out how many and what kinds of proxies you need based on the target websites, request volume and geolocation (for example: example.com, 3M requests/day, USA).
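To make the sizing question concrete, here is a back-of-the-envelope sketch. The per-IP limit of 10 requests per minute is purely an assumption for illustration; the real tolerance depends on the target website, and bans, retries and geolocation needs would push the number higher.

```python
import math


def proxies_needed(requests_per_day, max_requests_per_ip_per_minute):
    """Lower bound on IPs needed if requests were spread evenly over 24 hours."""
    per_ip_per_day = max_requests_per_ip_per_minute * 60 * 24
    return math.ceil(requests_per_day / per_ip_per_day)


# 3M requests/day against an assumed limit of 10 requests/minute per IP.
estimate = proxies_needed(3_000_000, 10)
print(estimate)  # 209
```

Treat the result as a floor rather than a plan: traffic is rarely spread evenly, and some proxies will get blocked along the way.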
Web scraping ethics
Before finishing up this article, it’s important to set this straight. When you scrape a website, you have to make sure you actually respect it. Below are some best practices you can follow to scrape respectfully.
Don't be a burden
The most important rule when you scrape a website is not to harm it. Do not make too many requests. Making requests too frequently could make it hard for the website's server to serve other visitors. Limit the number of requests in accordance with what the target website can handle.
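One simple way to limit request frequency is to enforce a minimum delay between consecutive requests to the same domain. The sketch below assumes a hypothetical `DomainThrottle` helper written for this example; the right delay value is something you have to judge per website.

```python
import time


class DomainThrottle:
    """Enforce a minimum delay between consecutive requests to one domain."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last_request = {}  # domain -> time the last request was allowed

    def wait(self, domain):
        """Block until `domain` may be requested again; return seconds slept."""
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[domain] = now + delay
        return delay
```

Calling `throttle.wait("example.com")` before every request to that domain keeps requests at least `min_delay_seconds` apart, while requests to other domains are not held back.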
Before scraping, always inspect the robots.txt file first. This will give you a good idea of which parts of the website you are free to visit and which pages you should stay away from.
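Python's standard library ships a robots.txt parser that can automate this check. The sketch below parses a made-up robots.txt inline so that it runs without network access; the rules and the `examplebot` name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules standing in for a site's real robots.txt (an assumption).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "examplebot"  # hypothetical crawler name

can_fetch_public = parser.can_fetch(USER_AGENT, "https://example.com/products/1")
can_fetch_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
crawl_delay = parser.crawl_delay(USER_AGENT)  # seconds, or None if not specified
```

Against a live site you would instead call `parser.set_url(...)` with the site's robots.txt URL followed by `parser.read()`, then run the same `can_fetch` checks before each request.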
Define a user-agent that clearly describes you or your company. It’s also best to include contact information in your user-agent, so the website's owners can let you know if they have any issues with what you’re doing.
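A descriptive user-agent might be assembled as in the sketch below. The bot name, info URL, and email address are placeholders, and the `name/version (+url; email)` layout is just one common convention, not a requirement.

```python
def build_user_agent(bot_name, version, info_url, contact_email):
    """Compose a user-agent that says who is crawling and how to reach them."""
    return f"{bot_name}/{version} (+{info_url}; {contact_email})"


# Placeholder values; replace with your own company details.
headers = {
    "User-Agent": build_user_agent(
        "examplebot", "1.0", "https://example.com/bot", "crawler@example.com"
    )
}
```

The resulting `headers` dictionary can be passed to whichever HTTP client your spider uses.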
Pages behind a login wall
There are cases where you can only access certain pages if you are logged in. If you want to scrape those pages, you need to be very careful. If, by logging in or otherwise explicitly agreeing, you accept the website's terms and conditions, and those terms state that you cannot scrape, then you CANNOT scrape. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.
Getting started with web scraping is easy. But as you scale, it gets more and more difficult to keep your spiders working. You need a game plan for tackling website layout changes, keeping data quality high, and meeting your proxy needs. If you get the building blocks described here right, you should be well on your way.