Web scraping means extracting data from the web and putting it in a structured format. Getting structured data from publicly available websites should not be an issue, since anybody with an internet connection can access them, and you should be able to structure that data as well. In reality, though, it’s not that easy.
One of the main use cases of web scraping is in the e-commerce world: price monitoring and price intelligence. However, you can also use web scraping for lead generation, market research, and business automation, among others.
In this article, you will learn the subtle ways a website can recognize you as a bot rather than a human. We also share our knowledge on how to overcome these challenges and get access to publicly available data.
Web Scraping Best Practices
At Scrapinghub, we care about ensuring that our services respect the rights of websites and companies whose data we scrape.
There are a couple of things to keep in mind when you’re dealing with a web scraping project, in order to respect the website.
Always inspect the robots.txt file and make sure you respect the rules of the site. Make sure you only crawl pages that are allowed to be crawled.
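As a minimal sketch of that check, Python's standard `urllib.robotparser` module can answer whether a URL may be crawled. The robots.txt content, the `example-bot` user-agent name, and the example.com URLs below are assumptions made up for illustration; a real crawler would download the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch it from
# https://<site>/robots.txt, for example with set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-bot"  # hypothetical crawler name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# True when the rules allow this user-agent to fetch the URL.
allowed = rules.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")

# Crawl-delay in seconds, or None when the file does not specify one.
delay = rules.crawl_delay(USER_AGENT)
```

Running the check before every request keeps the crawler inside the pages the site owner has allowed.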
Don't be a burden
When you start scraping a website, be really careful about how you send your requests, because you don’t want to harm the website. Harming the website is not good for anybody.
- Limit your requests coming from the same IP address
- Respect the delay between requests that is outlined in robots.txt
- Schedule your crawls to run during off-peak hours
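One way to limit the request rate is a small throttle that enforces a minimum delay between consecutive requests. This is only a sketch: the `Throttle` class and its parameters are hypothetical names, and the delay value would normally come from the site's robots.txt or from your own conservative default.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds to keep between requests
        self._clock = clock         # injectable so the class is easy to test
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep until min_delay has passed since the last call.

        Returns the number of seconds actually slept.
        """
        waited = 0.0
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

Calling `throttle.wait()` right before each request to the same site keeps the crawler from sending requests faster than the configured delay.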
Still, even when you are careful with your scraper, you might get banned. This is when you need to improve how you scrape and apply some techniques to get the data. But remember, be nice in how you scrape! Read more about best practices.
What are anti-bots?
Anti-bot systems are created to block bots from accessing websites. These systems use a set of approaches to differentiate bots from humans. Anti-bot mechanisms can mitigate DDoS attacks, credential stuffing, and credit card fraud. In the case of ethical web scraping, though, you’re not doing any of these. You just want to get access to publicly available data, in the nicest way possible. Often the website doesn’t have an API, so you have no other option but to scrape it.
At its core, every anti-bot system tries to recognize whether an activity is performed by a bot or by a human. In this section, we go through the ways a bot can get caught while trying to access a website.
When your browser sends a request to the server, it also sends a header. The header contains several values, and these differ from browser to browser. Chrome, Firefox, and Safari all have their own header patterns. For example, this is what a Chrome request header looks like:
A bot is easy to recognize if its header pattern does not match that of a regular browser. If you use a pattern that is inconsistent with known browsers’ patterns, you might get throttled or even blocked.
When you started out with web scraping you probably had user-agents like these:
Then you realized that this is not enough to access the page, so you set a custom user-agent that looks similar to a real browser's. Like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0
In the past, changing the user-agent (user-agent spoofing) might have been enough to access a website, but nowadays you need to do more than this.
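Part of doing more is sending a whole header set that agrees with the user-agent, rather than the user-agent alone. The sketch below builds (but does not send) such a request with Python's standard library. The `Accept` and `Accept-Language` values are illustrative assumptions; in practice you would copy the headers from a real browser session so that they match the browser you claim to be.

```python
from urllib.request import Request

# Illustrative header set modeled on a desktop Firefox browser.
# The Accept and Accept-Language values are assumptions; copy the real
# values from an actual browser session before relying on them.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) "
        "Gecko/20100101 Firefox/73.0"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> Request:
    """Create a request carrying the whole header set, not only the user-agent."""
    return Request(url, headers=BROWSER_HEADERS)


# The request object is only constructed here; nothing is sent.
request = build_request("https://example.com/")
```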
A more sophisticated way to detect bots is TCP/IP fingerprinting. TCP is the backbone of the internet: whenever you or your scraper use the internet, you are using TCP. The TCP/IP stack leaves a lot of parameters (like TTL or the initial window size) to be set by the device or operating system in use. If these parameter values are not consistent with the client you claim to be, you can get caught.
For example, if you send a request posing as a Chrome browser on Windows but your TTL (time to live) is 64 (maybe because you use a Linux-based proxy), the value does not match what is expected for Windows (128), so your request can be filtered out.
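To make the TTL example concrete, here is a rough sketch of the kind of consistency check an anti-bot system could run. It is not how any particular product works. The function names are hypothetical, the user-agent parsing is deliberately naive, and the table only lists the commonly used default TTLs (128 for Windows, 64 for Linux and macOS).

```python
# Common initial TTL values by operating system family.
DEFAULT_TTL = {"windows": 128, "linux": 64, "macos": 64}


def initial_ttl(observed_ttl: int) -> int:
    """Guess the sender's initial TTL.

    Every router hop decrements the TTL by one, so the observed value is
    rounded up to the nearest common starting value.
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for candidate in (64, 128, 255):
        if observed_ttl <= candidate:
            return candidate


def claimed_os(user_agent: str) -> str:
    """Very naive operating system detection from a user-agent string."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    if "mac os x" in ua:
        return "macos"
    if "linux" in ua:
        return "linux"
    return "unknown"


def ttl_matches_user_agent(observed_ttl: int, user_agent: str) -> bool:
    """Return False when the TTL contradicts the OS named in the user-agent."""
    os_name = claimed_os(user_agent)
    if os_name not in DEFAULT_TTL:
        return True  # nothing to compare against
    return initial_ttl(observed_ttl) == DEFAULT_TTL[os_name]
```

With a Windows user-agent, an observed TTL of 52 rounds up to 64 and fails the check, while an observed TTL of 116 rounds up to 128 and passes.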
A lot of crawling happens from datacenter IP addresses. If the website owner recognizes that there are a lot of non-human requests coming from this set of IPs, they can just block all the requests coming from that specific datacenter so the scrapers will not be able to access the site. To overcome this, you need to use other datacenter proxies or residential proxies. Or just use a service that handles proxy management.
Some websites intentionally block access if your request comes from a specific (or suspicious) region. Another case where geographical location can be a challenge for you is when the website gives you different content based on where you are. This can be easily solved by utilizing proxies in the proper regions.
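A simple way to combine both ideas, rotating IPs and choosing the right region, is a proxy pool keyed by region. The sketch below is an assumption-laden illustration: the `RegionalProxyPool` class is a hypothetical helper, and the proxy addresses come from IP ranges reserved for documentation, so they are placeholders rather than working proxies.

```python
from itertools import cycle

# Hypothetical proxy endpoints grouped by region. The addresses are from
# ranges reserved for documentation and do not point to real proxies.
PROXIES_BY_REGION = {
    "us": ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    "de": ["http://198.51.100.20:8080"],
}


class RegionalProxyPool:
    """Hand out proxies for a region in round-robin order."""

    def __init__(self, proxies_by_region):
        # Keep one endless round-robin iterator per non-empty region.
        self._cycles = {
            region: cycle(proxies)
            for region, proxies in proxies_by_region.items()
            if proxies
        }

    def next_proxy(self, region: str) -> str:
        """Return the next proxy URL for the given region."""
        if region not in self._cycles:
            raise KeyError(f"no proxies configured for region {region!r}")
        return next(self._cycles[region])
```

Each call to `next_proxy("us")` returns the next address in the list and wraps around at the end, so consecutive requests leave from different IPs in the same region.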
Detecting bots on the front-end
So far we have looked at how your scraper can be recognized as a bot on the back-end. There are also ways on the front-end that can get your scraper in trouble.
If there are inconsistencies in the information collected on the front-end, anti-bot systems can be triggered, and the website starts showing you captchas or makes the site difficult to scrape in other ways.
Back in the day, captchas used HIP (Human Interactive Proof) with the premise that humans are better at solving visual puzzles than machines. Machine learning algorithms weren’t developed enough to solve captchas like this:
However, as machine learning technologies have evolved, a machine can now solve this type of captcha easily. More sophisticated image-based tests were then introduced, which posed a bigger challenge for machines.
The most recent versions of captchas are much more transparent and user-friendly because they are based on behavioral patterns. The idea behind these captchas is that they are transparent to the user. They track mouse movements, clicks, and keystrokes. Human behavior on a website is much more complex than bot behavior.
The key to handling modern captchas is to be smart about the manner of your scraping. If you can figure out what triggers the captcha for that specific site you’re dealing with, solve that problem first, instead of trying to handle the captcha itself.
Use more or different proxies (if you’ve been using datacenter IPs, try to switch to residential ones). Or make requests less frequently based on how the website reacts.
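Making requests less frequently based on how the website reacts can be sketched as an adaptive delay. Which status codes count as a sign of blocking (403, 429, and 503 here), and the doubling and halving factors, are assumptions chosen for illustration; the function name and its parameters are hypothetical.

```python
# Status codes treated as a hint that the site is pushing back.
# This choice is an assumption; adjust it to what the target site returns.
BLOCK_SIGNALS = (403, 429, 503)


def next_delay(current: float, status: int,
               base: float = 1.0, maximum: float = 60.0) -> float:
    """Return the delay in seconds to wait before the next request.

    The delay doubles (up to `maximum`) when the response looks like a
    block, halves (down to `base`) after a successful response, and stays
    unchanged for any other status code.
    """
    if status in BLOCK_SIGNALS:
        return min(current * 2, maximum)
    if 200 <= status < 300:
        return max(current / 2, base)
    return current
```

A crawl loop would call `next_delay` after every response and sleep for the returned number of seconds, so the scraper slows down as soon as the site starts pushing back.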
A bot is designed to be efficient and find the quickest way to extract data. That often means looking behind the curtain and using a path that a regular user neither sees nor uses. Anti-bot systems can pick up on this behavior. Another important aspect is the amount and frequency of requests you make. The more frequent your requests (from the same IP) are, the higher the chance that your scraper will be recognized.
It’s not an easy task to scale up your web scraping project. I hope this overview gave you some insights on how to maintain successful requests and minimize blocks. As mentioned above, one of the building blocks of a healthy web scraping project is proxy management. If you need a tool to make web scraping easier, try Crawlera for free. Crawlera's rotating proxy network is built with a proprietary ban detection and request throttling algorithm. Crawlera will ensure your web scraped data is delivered successfully!