The Challenges E-Commerce Retailers Face Managing Their Web Scraping Proxies
These days, web scraping is ubiquitous among the big e-commerce companies because of the advantages data-driven decision making brings to staying competitive in such a tight-margin business.
E-commerce companies are increasingly using web data to fuel their competitor research, dynamic pricing and new product research.
For these e-commerce sites, the most important consideration is the reliability of their data feed and its ability to return the data they need at the required frequency.
As a result, these e-commerce sites face big challenges managing their proxies so that they can reliably scrape the web without disruption.
In this article, we’re going to talk about those challenges and how the best web scrapers get around them.
Challenge #1 - The Sheer Number of Requests Being Made
The sheer number of requests being made (upwards of 20 million successful requests per day) is a huge challenge for companies. With millions of requests per day, companies also need thousands of IPs in their proxy pools to cope with the request volume.
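To put that volume in perspective, a rough back-of-the-envelope calculation helps. The sketch below converts the 20 million requests per day figure into an average request rate, then estimates a minimum pool size. The per-IP rate limit used here (one request every 10 seconds) is purely an illustrative assumption, not a figure from any particular website.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(requests_per_day: int) -> float:
    """Average requests per second needed to reach a daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def min_pool_size(requests_per_day: int, max_rps_per_ip: float) -> int:
    """Smallest number of IPs that keeps every IP at or under max_rps_per_ip."""
    return math.ceil(average_rate(requests_per_day) / max_rps_per_ip)


daily_volume = 20_000_000  # figure quoted above
assumed_ip_limit = 0.1     # ASSUMPTION: 1 request per IP every 10 seconds

print(f"{average_rate(daily_volume):.0f} requests/second on average")
print(f"{min_pool_size(daily_volume, assumed_ip_limit)} IPs needed at minimum")
```

Under these assumptions the average works out to roughly 231 requests per second and a little over 2,300 IPs, before allowing for retries, bans or traffic peaks.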
Not only do they need a large pool, they also need a pool that contains a wide range of proxy types (location, datacenter/residential, etc.) to enable them to reliably scrape the precise data they need.
However, managing proxy pools of this scale can be very time-consuming. Developers and data scientists often report spending more time managing proxies and troubleshooting data quality issues than analyzing the extracted data.
To cope with this level of complexity when scraping the web at this scale, you will need to add a robust intelligence layer to your proxy management logic.
The more sophisticated and automated your proxy management layer, the more efficient and hassle-free managing your proxy pool will be.
On that note, let’s dive deeper into proxy management layers and how the best e-commerce companies solve the challenges associated with them.
Challenge #2 - Building a Robust Intelligence Layer
When scraping the web at a relatively small scale (a couple of thousand pages per day), you can get away with a simple proxy management infrastructure if your spiders are well designed and you have a large enough pool.
However, when you are scraping the web at scale, this simply won’t cut it. Very quickly you’ll run into the following challenges when building a large-scale web scraper.
- Ban Identification - Your proxy solution needs to be able to detect numerous types of bans (captchas, redirects, blocks, ghosting, etc.) so that you can troubleshoot and fix the underlying problem. Making things more difficult, your solution also needs to create and manage a ban database for every single website you scrape, which is not a trivial task.
- Retry Errors - If a request through one proxy hits an error, ban or timeout, your solution needs to be able to retry that request with a different proxy.
- Request Headers - Managing and rotating user agents, cookies, etc. is crucial to having a healthy crawl.
- Control Proxies - Some scraping projects require you to keep a session with the same proxy, so you’ll need to configure your proxy pool to allow for this.
- Add Delays - Automatically randomize delays and adjust request throttling to help cloak the fact that you are scraping and to access difficult sites.
- Geographical Targeting - Sometimes you’ll need to be able to configure your pool so that only some proxies will be used on certain websites.
As a result, companies need to implement robust proxy management logic that rotates IPs, selects geographically specific IPs, throttles requests, identifies bans and captchas, automates retries, and manages sessions, user agents and blacklisting, all to prevent their proxies from getting blocked and disrupting their data feed.
The problem is that most solutions on the market sell only proxies, or proxies with simple rotation logic at best. So companies often need to build and refine this intelligent proxy management layer themselves, which requires significant development effort.
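To make a few of these requirements concrete, here is a minimal sketch of ban identification, retries through a different proxy, and simple blacklisting. Everything in it is hypothetical: the `Response` type, the `ProxyPool` class, the detection rules and the `fake_send` transport are stand-ins for illustration, and real ban rules would need to be tuned per website.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str = ""
    location: str = ""


def classify_ban(resp: Response) -> Optional[str]:
    """Return a ban label, or None if the response looks healthy.

    ASSUMPTION: these rules are illustrative; real rules differ per website.
    """
    if resp.status in (403, 429):
        return "blocked"
    if resp.status in (301, 302) and "login" in resp.location:
        return "redirect"
    if "captcha" in resp.body.lower():
        return "captcha"
    if resp.status == 200 and not resp.body.strip():
        return "ghosted"  # empty page served with a success status
    return None


@dataclass
class ProxyPool:
    proxies: list[str]
    max_failures: int = 3  # ASSUMPTION: blacklist a proxy after 3 bans
    failures: dict[str, int] = field(default_factory=dict)

    def pick(self, exclude: set[str]) -> str:
        """Pick a random proxy that is neither blacklisted nor already tried."""
        candidates = [
            p for p in self.proxies
            if self.failures.get(p, 0) < self.max_failures and p not in exclude
        ]
        if not candidates:
            raise RuntimeError("no healthy proxies left for this request")
        return random.choice(candidates)

    def report_failure(self, proxy: str) -> None:
        self.failures[proxy] = self.failures.get(proxy, 0) + 1


def fetch_with_retries(url: str, pool: ProxyPool,
                       send: Callable[[str, str], Response],
                       max_attempts: int = 3) -> Response:
    """Retry a request through different proxies until one is not banned."""
    tried: set[str] = set()
    for _ in range(max_attempts):
        proxy = pool.pick(exclude=tried)
        tried.add(proxy)
        resp = send(url, proxy)
        if classify_ban(resp) is None:
            return resp
        pool.report_failure(proxy)
    raise RuntimeError(f"all {max_attempts} attempts were banned for {url}")


def fake_send(url: str, proxy: str) -> Response:
    # Hypothetical transport: proxy "10.0.0.1" is always served a captcha.
    if proxy == "10.0.0.1":
        return Response(200, "<html>Please solve this CAPTCHA</html>")
    return Response(200, "<html>price: 19.99</html>")


pool = ProxyPool(["10.0.0.1", "10.0.0.2"])
result = fetch_with_retries("https://example.com/product/1", pool, fake_send)
print(result.body)
```

This sketch keeps a single failure count per proxy. A production system would more likely track failures per proxy and per target website, since a proxy banned on one site may still work well on another.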
The other option is to use a proxy solution that takes care of all the proxy management for you. More on this later.
Challenge #3 - Precision/Accessing the Data You Want
As is often the case with e-commerce product data, the prices and specifications of products vary depending on the location of the user.
As a result, to get the most accurate picture of a product’s pricing or feature data, companies often want to request each product from different locations/zip codes. This adds another layer of complexity to an e-commerce web scraping proxy pool, as you now need a proxy pool that contains proxies from different locations and has implemented the necessary logic to select the correct proxies for the target locations.
At lower volumes, it is often ok to just manually configure a proxy pool to only use certain proxies for specific web scraping projects. However, this can become very complex as the number and complexity of the web scraping projects increase. That is why an automated approach to proxy selection is key when scraping at scale.
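As a rough illustration of automated proxy selection, the sketch below indexes a pool by country and picks a matching proxy for each request. The `Proxy` and `GeoProxySelector` names, the proxy addresses and the country tags are all made up for this example; a real pool would also need zip-code level targeting and health tracking.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proxy:
    address: str
    country: str  # ISO country code, e.g. "US"
    kind: str     # "datacenter" or "residential"


class GeoProxySelector:
    """Hypothetical selector that groups proxies by country."""

    def __init__(self, proxies):
        self._by_country = defaultdict(list)
        for proxy in proxies:
            self._by_country[proxy.country].append(proxy)

    def select(self, country: str, kind: Optional[str] = None) -> Proxy:
        """Return a random proxy in the country, optionally filtered by type."""
        candidates = [
            p for p in self._by_country.get(country, [])
            if kind is None or p.kind == kind
        ]
        if not candidates:
            raise LookupError(f"no proxy available for {country} ({kind or 'any'})")
        return random.choice(candidates)


# Placeholder addresses from documentation-only IP ranges.
selector = GeoProxySelector([
    Proxy("203.0.113.10:8000", "US", "datacenter"),
    Proxy("203.0.113.11:8000", "US", "residential"),
    Proxy("198.51.100.7:8000", "DE", "residential"),
])

print(selector.select("US", kind="residential").address)
print(selector.select("DE").address)
```

Raising an error when no proxy matches, rather than silently falling back to another location, is a deliberate choice here: a request served from the wrong country could return prices for the wrong market.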
Challenge #4 - Reliability and Data Quality
As stated at the start of this article, the most important consideration in the development of any proxy management solution for large-scale e-commerce web scraping is that it is robust/reliable and returns high-quality data for analysis.
Often, the data these e-commerce companies are extracting is mission critical to the success of their businesses and their ability to remain competitive in the marketplace. As a result, any disruption or reliability issue with their data feed is a huge concern for most companies conducting large-scale web scraping.
Even a disruption of a couple of hours will likely prevent them from having up-to-date product data for setting product pricing for the next day.
The other issue is cloaking, the practice of e-commerce websites feeding incorrect product data to requests they believe come from web scrapers. This can cause huge headaches for the data scientists working in these companies, as there will always be a question mark over the validity of their data, planting a seed of doubt in their minds as to whether they can make decisions based on what the data is telling them.
This is where having a robust and reliable proxy management infrastructure along with an automated QA process in place really helps. Not only does it remove a lot of the headaches of having to manually configure and troubleshoot proxy issues, it also gives companies a high degree of confidence in the reliability of their data feed.
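An automated QA process can take many forms. As one minimal sketch, the check below flags records with missing fields, non-numeric prices, or prices that moved sharply since the previous crawl. The field names and the 50% threshold are assumptions chosen for illustration, and a large price move is only a hint that the data deserves a closer look, not proof of cloaking.

```python
from typing import Optional

REQUIRED_FIELDS = ("sku", "title", "price")  # ASSUMPTION: example schema
MAX_PRICE_CHANGE = 0.5  # ASSUMPTION: flag moves larger than 50% for review


def validate_record(record: dict, previous_price: Optional[float] = None) -> list:
    """Return a list of human-readable issues found in one scraped record."""
    issues = []
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            issues.append(f"missing field: {name}")

    price = record.get("price")
    if isinstance(price, (int, float)):
        if price <= 0:
            issues.append("non-positive price")
        elif previous_price and abs(price - previous_price) / previous_price > MAX_PRICE_CHANGE:
            issues.append(f"price moved more than {MAX_PRICE_CHANGE:.0%} since last crawl")
    elif price not in (None, ""):
        issues.append("price is not numeric")
    return issues


clean = {"sku": "A1", "title": "Kettle", "price": 19.99}
suspicious = {"sku": "A1", "title": "Kettle", "price": 4.99}
broken = {"sku": "A1", "title": "", "price": "N/A"}

print(validate_record(clean, previous_price=21.00))
print(validate_record(suspicious, previous_price=21.00))
print(validate_record(broken))
```

Records that come back with issues can be routed to a re-crawl through a different proxy or to manual review before they reach the pricing team.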
Best Proxy Solution for Enterprise Web Scraping
Ok, we’ve discussed the challenges of managing proxies for enterprise web scraping projects. But how do you overcome these challenges and build your own proxy management system for your large-scale web scraping projects?
In reality, enterprise web scrapers have two options when it comes to building their proxy infrastructure for their web scraping projects.
- Build the entire infrastructure in-house
- Use a single endpoint proxy solution that deals with all the complexities of managing proxies
One solution is to build a robust proxy management solution in-house that will take care of all the necessary IP rotation, request throttling, session management and blacklisting logic to prevent your spiders being blocked.
There is nothing wrong with this approach, provided that you have the available resources and expertise to build and maintain such an infrastructure. To say that a proxy management infrastructure designed to handle 300 million requests per month (the scale a lot of e-commerce sites scrape at) is complex is an understatement. This kind of infrastructure is a significant development project.
For most companies, their #1 priority is the data, not proxy management. As a result, a lot of the largest e-commerce companies completely outsource proxy management using a single endpoint proxy solution.
Single Endpoint Solution
Our recommendation is to go with a proxy provider who can provide a single endpoint for proxy configuration and hide all the complexities of managing your proxies. Scraping at scale is resource intensive enough without trying to reinvent the wheel by developing and maintaining your own internal proxy management infrastructure.
This is the approach most of the large e-commerce retailers take. Three of the world’s five largest e-commerce companies use Crawlera, the smart downloader developed by Scrapinghub, as their primary proxy solution, completely outsourcing their proxy management to it. In total, Crawlera processes 8 billion requests per month.
The beauty of Crawlera is that instead of having to manage a pool of IPs, your spiders just send a request to Crawlera's single endpoint API where Crawlera retrieves and returns the desired data.
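From the spider's point of view, a single endpoint solution usually looks like one authenticated proxy address that every request is routed through. The sketch below shows that pattern with Python's standard library. The host, port and credentials are placeholders invented for this example, not Crawlera's actual connection details, so check your provider's documentation for the real values and authentication scheme.

```python
import urllib.request

# ASSUMPTIONS: placeholder endpoint and credentials, for illustration only.
PROXY_HOST = "proxy.example.com"
PROXY_PORT = 8010
PROXY_USER = "YOUR_USERNAME"
PROXY_PASSWORD = "YOUR_PASSWORD"


def proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Build the URL of the single proxy endpoint, including credentials."""
    return f"http://{user}:{password}@{host}:{port}"


def make_proxy_opener(endpoint: str) -> urllib.request.OpenerDirector:
    """Create an opener that routes all HTTP(S) traffic through one endpoint."""
    handler = urllib.request.ProxyHandler({"http": endpoint, "https": endpoint})
    return urllib.request.build_opener(handler)


def fetch(opener: urllib.request.OpenerDirector, url: str,
          timeout: float = 30.0) -> bytes:
    """Fetch a page through the proxy endpoint and return the raw body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


endpoint = proxy_url(PROXY_USER, PROXY_PASSWORD, PROXY_HOST, PROXY_PORT)
opener = make_proxy_opener(endpoint)

# With a real endpoint and credentials, a spider would then simply call:
# html = fetch(opener, "https://example.com/product/1")
```

The spider code contains no pool, rotation or ban handling at all; that logic lives behind the endpoint.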
Under the hood, Crawlera manages a massive pool of proxies, carefully rotating, throttling, blacklisting and selecting the optimal IPs to use for any individual request to give the optimal results at the lowest cost. This completely removes the hassle of managing IPs and enables users to focus on the data, not proxies.
The huge advantage of this approach is that it is extremely scalable. Crawlera can scale from a few hundred requests per day to millions of requests per day without any additional workload from the user. Simply increase the number of requests you are making and Crawlera will take care of the rest.
Better yet, with Crawlera you only pay for successful requests that return your desired data, not IPs or the amount of bandwidth you use.
Crawlera also comes with global support. Clients know that they can get expert input into any issue that may arise 24 hours per day, 7 days a week no matter where they are in the world.
If you'd like to learn more about Crawlera, be sure to talk to our team about your project.
Wrapping Things Up
As you have seen, there are a lot of challenges associated with managing proxies for large-scale web scraping projects. However, they are surmountable if you have adequate resources and expertise to implement a robust proxy management infrastructure. If not, you should seriously consider a single endpoint proxy solution such as Crawlera.
If you are interested in scraping the web at scale but are wrestling with the decision of whether to build up a dedicated web scraping team in-house or outsource it to a dedicated web scraping firm, be sure to check out our guide, Enterprise Web Scraping: The Build In-House or Outsource Decision.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.