Building Blocks of an Unstoppable Web Scraping Infrastructure

More and more businesses are leveraging the power of web scraping, and extracting data from the web keeps growing in popularity. That doesn't mean the technical challenges are gone, though: building a sustainable web scraping infrastructure takes expertise and experience. Here at Scrapinghub, we scrape 9 billion pages per month. In this article, we summarize the essential elements of web scraping: the building blocks you need to take care of in order to develop a healthy web data pipeline.

The building blocks:

  • Web spiders
  • Spider management
  • JavaScript rendering
  • Data QA
  • Proxy management

Web spiders

Let's start with the obvious: spiders. In the web scraping community, a spider is a script or program that extracts data from a website. Spiders are essential to scraping the web, and there are many libraries and tools you can use to build them. In Python, you have Scrapy, the web scraping framework, or Beautiful Soup. If you're programming in Java, you have jsoup or HtmlUnit. In JavaScript, it's Puppeteer or Cheerio. I could mention many other libraries; these are just the most popular ones.

It can be difficult to decide which library or tool to pick, especially if you've never done web scraping before. For one-off projects, it doesn't really matter which library you use: choose the one that is easiest to get started with in your preferred programming language. For long-term projects, where you will need to maintain your spiders and maybe build on top of them, choose a tool that lets you focus on your project-specific needs rather than on the underlying scraping logic.
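
To make this concrete, here is a minimal Scrapy spider. It is only a sketch: it targets books.toscrape.com, a public practice site, and the selectors are specific to that site's layout.

    import scrapy


    class BooksSpider(scrapy.Spider):
        """Follow listing pages and yield one item per book."""
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            for book in response.css("article.product_pod"):
                yield {
                    "title": book.css("h3 a::attr(title)").get(),
                    "price": book.css("p.price_color::text").get(),
                }
            # Follow pagination until there are no more pages.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

You can run a standalone spider like this with scrapy runspider books_spider.py -o books.json, without generating a full project first.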

Spider management

Now that we have the spiders right, the next element you need in your web scraping stack is spider management. Spider management is about creating an abstraction on top of your spiders. A spider management platform, like Scrapy Cloud, makes it quick to get a sense of how your spiders are performing, lets you schedule jobs and automate your spiders, and lets you focus on the crawlers rather than the servers.
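
As an illustration, scheduling and inspecting runs programmatically might look like the sketch below. It assumes the python-scrapinghub client library; the API key, project ID, and spider name are placeholders.

    from scrapinghub import ScrapinghubClient

    # Placeholders -- use your own API key and project ID.
    client = ScrapinghubClient("YOUR_API_KEY")
    project = client.get_project(123456)

    # Schedule a run of the "books" spider.
    job = project.jobs.run("books")
    print("scheduled job:", job.key)

    # Get a quick sense of how recent runs performed.
    for summary in project.jobs.iter(spider="books", state="finished", count=5):
        print(summary["key"], "finished with", summary.get("items", 0), "items")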

JavaScript rendering

JavaScript is widely used on websites, and sometimes it can be hard to get at data that is rendered with JS. Many people render JS even when they don't have to. Generally, when you know a website uses JS to render its content, your first instinct might be to grab a headless browser driven by something like Selenium. That is the easiest way to go.

The trade-off is that while in the short term it's much quicker to just render the JS and grab the data, in the long term it eats up far more hardware resources. So if execution time and hardware usage matter to you, inspect the website properly first and check whether there are AJAX requests or hidden API calls happening in the background. If there are, try to replicate those requests in your scraper instead of executing JS.
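
For instance, if the page fills its product grid from a JSON endpoint you spotted in the browser's network tab, calling that endpoint directly is far cheaper than rendering the page. The URL, parameters, and response fields below are hypothetical; the real ones come from inspecting your target site.

    import requests

    # Hypothetical endpoint spotted in the browser's network tab.
    API_URL = "https://www.example.com/api/products"

    response = requests.get(
        API_URL,
        params={"page": 1, "page_size": 50},
        headers={"User-Agent": "my-scraper (contact@example.com)"},
        timeout=30,
    )
    response.raise_for_status()

    for product in response.json().get("results", []):
        print(product.get("name"), product.get("price"))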

If there is simply no other way to get the data and you have to render JS no matter what, try a more lightweight solution like Splash instead of a full-fledged headless browser. Splash was created for scraping: it renders JS and nothing else, which is exactly what we need to scrape a site.
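
Splash exposes an HTTP API, so fetching a rendered page can be as simple as the sketch below. It assumes a Splash instance running locally; the target URL and wait time are illustrative.

    import requests

    # Assumes a local Splash instance, e.g. started with:
    #   docker run -p 8050:8050 scrapinghub/splash
    SPLASH_URL = "http://localhost:8050/render.html"

    response = requests.get(
        SPLASH_URL,
        params={
            "url": "https://www.example.com/js-heavy-page",
            "wait": 2,       # give the page's scripts time to run
            "timeout": 30,
        },
        timeout=60,
    )
    response.raise_for_status()

    html = response.text  # fully rendered HTML, ready for your usual parsing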

Data quality

The next building block is data QA. All our scraping efforts are only worth it if the output is the data we expect, in the correct format. To make sure this is the case, we can do several things:

  • validate the output data against a predefined JSON schema (see the sketch after this list)
  • check the coverage of the extracted fields
  • check for duplicates
  • compare two scraping jobs
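
For the first item on the list, a minimal schema-validation sketch with the jsonschema library could look like this; the schema and sample items are illustrative.

    from jsonschema import Draft7Validator

    # Illustrative schema -- in practice it mirrors your target item spec.
    ITEM_SCHEMA = {
        "type": "object",
        "properties": {
            "title": {"type": "string", "minLength": 1},
            "price": {"type": "number", "minimum": 0},
            "url": {"type": "string"},
        },
        "required": ["title", "price", "url"],
    }

    validator = Draft7Validator(ITEM_SCHEMA)

    def validation_errors(items):
        """Yield (item_index, message) for every schema violation found."""
        for i, item in enumerate(items):
            for error in validator.iter_errors(item):
                yield i, error.message

    items = [
        {"title": "A Light in the Attic", "price": 51.77, "url": "http://books.toscrape.com/"},
        {"title": "", "price": -3},  # broken item: empty title, negative price, missing url
    ]
    for index, message in validation_errors(items):
        print(f"item {index}: {message}")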

It's also useful to set up alerts and notifications that are triggered when a certain event happens. Scrapinghub recently open sourced two tools that we use for spider monitoring and for ensuring data quality. One is Spidermon, which can, for example, send email or Slack notifications and create custom reports based on web scraping stats. The other is Arche, which can be used to analyze and verify scraped data against a set of rules.
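
As a rough idea of how Spidermon is wired in, the sketch below follows the monitor pattern from its documentation; the item threshold, module path, and suite name are assumptions you would adapt to your own project.

    # monitors.py -- a minimal Spidermon monitor suite
    from spidermon import Monitor, MonitorSuite, monitors


    @monitors.name("Item count")
    class ItemCountMonitor(Monitor):

        @monitors.name("Minimum items scraped")
        def test_minimum_number_of_items(self):
            item_count = self.data.stats.get("item_scraped_count", 0)
            self.assertGreaterEqual(item_count, 100)  # assumed threshold


    class SpiderCloseMonitorSuite(MonitorSuite):
        monitors = [ItemCountMonitor]


    # settings.py -- enable the extension and attach the suite on spider close
    # EXTENSIONS = {"spidermon.contrib.scrapy.extensions.Spidermon": 500}
    # SPIDERMON_ENABLED = True
    # SPIDERMON_SPIDER_CLOSE_MONITORS = ("myproject.monitors.SpiderCloseMonitorSuite",)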

Proxy management

The last thing I want to mention, but a very important one, is proxy management.

Scaling web scraping is not an easy task, and you cannot solve everything just by throwing more hardware at the problem. Sometimes scaling is simply impossible: for example, if your target website gets 50,000 visitors a month and you want to scrape it 50,000 times in a month, that's not going to work.

There are projects that need ongoing maintenance, where we scrape hundreds of millions or even billions of URLs. For that kind of volume, you definitely need a proxy solution, like Crawlera. You can figure out how many proxies you need, and of what kind, based on the target websites, request volume, and geolocation (for example: example.com, 3M requests/day, USA).
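
As a rough sketch, routing one-off requests through Crawlera looks like the snippet below; the API key is a placeholder, and Scrapy users would typically plug in the scrapy-crawlera middleware instead.

    import requests

    # Placeholder key; Crawlera authenticates via proxy basic auth (key as username).
    CRAWLERA_PROXY = "http://YOUR_API_KEY:@proxy.crawlera.com:8010"

    response = requests.get(
        "https://www.example.com/",
        proxies={"http": CRAWLERA_PROXY, "https": CRAWLERA_PROXY},
        # For HTTPS, either install Crawlera's CA certificate or, while testing,
        # skip verification (not recommended in production).
        verify=False,
        timeout=60,
    )
    print(response.status_code)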


Web scraping ethics

Before finishing up this article, it's important to set this straight: when you scrape a website, you have to make sure you actually respect it. Below are some best practices you can follow to scrape respectfully.

Don't be a burden

The most important rule when scraping a website is not to harm it. Don't make too many requests: hitting the server too frequently can make it hard for the website to serve its other visitors. Limit the number and rate of your requests in accordance with what the target website can handle.
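
In Scrapy, for example, politeness comes down to a handful of settings; the values below are illustrative and should be tuned per target site.

    # settings.py -- illustrative politeness settings
    DOWNLOAD_DELAY = 1.0                   # wait at least a second between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 2     # keep per-site parallelism low
    AUTOTHROTTLE_ENABLED = True            # back off automatically if responses slow down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    RETRY_TIMES = 2                        # avoid hammering a struggling server with retries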

Robots.txt

Before scraping, always inspect the robots.txt file first. It gives you a good idea of which parts of the website you are free to visit and which pages you should stay away from.
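
Scrapy honors robots.txt out of the box when ROBOTSTXT_OBEY is set to True. If you're not using a framework that does this for you, the Python standard library can check the rules; the user-agent string and URLs below are placeholders.

    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()

    url = "https://www.example.com/some/page"
    if parser.can_fetch("my-scraper", url):
        print("allowed to fetch", url)
    else:
        print("disallowed by robots.txt:", url)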

User-agent

Define a user-agent that clearly describes you or your company. It's also best to include contact information in your user-agent, so the site owners can let you know if they have any issue with what you're doing.
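
In Scrapy this is a single setting; the company name and contact details below are, of course, placeholders.

    # settings.py
    USER_AGENT = "ExampleCorp scraper (+https://www.example.com/bot; scraping@example.com)"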

Pages behind a login wall

There are cases when you can only access certain pages if you are logged in. If you want to scrape those pages, you need to be very careful. If, by logging in, you explicitly agree to terms and conditions that state you cannot scrape, then you cannot scrape. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.

Closing note

Getting started with web scraping is easy, but as you scale, it gets more and more difficult to keep your spiders working. You need a game plan for tackling website layout changes, keeping data quality high, and covering your proxy needs. Perfect the building blocks above and you should be well on your way.

Learn more about web scraping use cases or our developer tools that make web scraping a breeze.
