A Sneak Peek Inside Crawlera: The World’s Smartest Web Scraping Proxy Network
“How does Scrapinghub Crawlera work?” is the most common question we get asked from customers who after struggling for months (or years) with constant proxy issues, only to have them disappear completely when they switch to Crawlera.
Today we’re going to give you a behind the scenes look at Crawlera so you can see for yourself why it is the world’s smartest web scraping proxy network and the only solution that allows you to completely forget about proxies, and focus on the data instead.
What is Crawlera?
First off, Crawlera is not like your average proxy solutions. Instead of just providing users with a batch of IPs and maybe rudimentary proxy rotation functionality, Crawlera is a complete proxy management solution for web scraping at scale.
Simply make a HTTP request to Crawlera’s single endpoint API, and Crawlera will return a successful response. No need to worry about managing proxies.
Crawlera achieves this by managing thousands of proxies so that you don’t have to: carefully rotating, throttling, blacklisting and selecting the optimal IPs to use for any individual request to give the optimal results with the maximum efficiency.
Completely removing the hassle of managing IPs and allowing you to focus on the data, not proxies.
From day one, Crawlera was designed exclusively for web scrapers. It has been optimised to ensure users can reliably extract the data they need at scale, without having to worry about bans or failed requests.
How Does Crawlera Work?
Crawlera completely removes the hassle of managing proxy pools and all the headaches that go with it - rotating proxies, defining ban rules for each website, managing user agents, throttling, etc.
Without Crawlera you would make a request to your own proxy pool and hope that it is successful.
Instead, make a request to the Crawlera API and it takes care of the rest.
Once, the HTTP request is sent to the API, Crawlera’s intelligent IP management layer kicks in. This layer automatically determines the target domain and selects the optimal proxy pool and request configuration to yield the best results. Configuring the delay between requests, user agents, cookies, etc. to ensure successful requests.
However, when the request is blocked is where Crawlera’s capabilities really shine through.
Instead of just returning a failed request to the client, Crawlera will change the proxies, request delays, user agents, etc. as required to return a successful request to the client.
Crawlera Handles All Your Proxy Headaches
Instead of having to manually design, build and configure your own proxy management infrastructure, with Crawlera all the individual elements that make a great proxy management solution tick are available out of the box and completely automated for your requests.
Crawlera automatically finds the sweet spot between crawl speed and reliability, the two most important factors for any web scraping professional.
As a web scraper, in most cases we want to extract our target data as fast as possible. However, with high speed comes an increasing risk that the requests will be blocked. As a result, finding the crawl speed sweet spot is our #1 priority.
If you designed your own proxy management solution, you embark on a tedious process of trial and error to find this elusive crawl speed sweet spot. Which is very time consuming and often doesn’t yield the optimal results.
In contrast, Crawlera is optimised for successful requests and automatically finds this sweet spot for you. Crawlera constantly throttles the request delays to ensure the maximum successful request throughput for your crawl. With the precise delay being determined by the target website, the estimated load on the website, and it’s ban history. Crawlera selects the optimal throttling configuration to give you the maximum request reliability.
Key to Crawlera’s functionality is its ban detection functionality, comprising of a constantly updated database of over 130+ ban types, response codes and captchas, developed as a result of Crawlera serving over 8 billion requests per month.
Crawlera monitors every request for bans, even those that are not immediately obvious from the response code. When detected, Crawlera automatically retries the request with a different IP (while also modifying the delay) and puts the banned IP in what we call “Jail Time”: A state where that same proxy won’t be used to make a request to that same website for a variable period of time.
Even so, you are able to add your own custom ban rules to cope with specific project requirements and challenges.
Cookie & Session Handling
Depending on your project requirements your spiders may need to store and manage cookies so that they can navigate to the target data and parse the data.
With Crawlera, these cookie and session requirements will be handled for you, eliminating the need to handle them on the crawler side.
However, you can easily disable this functionality to allow you to manually configure your crawler to manage cookies and sessions if your project requirements necessitate custom cookie and session handling.
Crawlera also uses browser like headers to give your requests more human-like behaviour.
When a request is received, Crawlera will select the most appropriate header from a browser header pool removing the need for you to worry about header configuration and rotation.
In certain circumstances, you might like to manually control the headers being used with your requests. As a result, Crawlera’s header functionality can be disabled by simply passing it a single command with your API request.
Headless Browser Support
With the increasing complexity of modern websites (most require 50 to 150 single requests to load) scraping efficiently at scale can become a real challenge.
To help combat this problem, Crawlera comes with headless browser support. Enabling you to easily integrate headless browsers such as Splash and Puppeteer with your crawlers, significantly reducing the time it takes to get successful requests.
If you’d like to learn more about Crawlera’s full feature set then be sure to check out Crawlera’s technical documentation.
Built For Reliability at Scale
Despite having all this advanced functionality available out of the box, where Crawlera really shines though is the reliability and predictability it gives you when scraping at scale.
If you start a web scraping project, often times you can get away with a simple proxy pool at the start but as you scale you will have to develop all the functionality listed above or else suffer from constant reliability issues.
In contrast, from day one Crawlera has been built with scraping at scale in mind.
Crawlera’s #1 priority is to give you maximum request reliability. As a result, Crawlera has been painstakingly designed to ensure it will consistently return successful requests to your crawlers no matter the website being scraped.
And if your request volume increases, you don’t need to worry about getting more proxies or reconfiguring your proxy management logic. Instead, you simply increase your allocated number of concurrent requests and Crawlera will handle the rest. Giving users great predictability of their web scraping costs as they scale.
Ensuring that developers and companies don’t need to invest huge amounts of time, engineering resources and money into developing their own scalable proxy management solutions. Allowing you to focus on the data instead.
Try Crawlera The World’s Smartest Proxy Network Today!
If you’re tired of troubleshooting proxy issues and would like to give Crawlera a try, then sign up today and you can cancel within 7 days if Crawlera isn’t for you, or schedule a call with our crawl consultant team. Scrapinghub offers Crawlera in two flavours:
- Crawlera Self-service: For web scraping teams (or individual developers) that are tired of managing their own proxy pools and that are ready to integrate an off-the-shelf proxy API into their web scraping stack that only charges for successful requests.
- Crawlera Enterprise: For larger organisations with mission-critical web crawling requirements looking for a dedicated crawling partner who’s tools and team of crawl consultants can help them crawl more reliably at scale, build custom solutions for their specific requirements, help debug any issues they may run into when scraping the web, and offer enterprise SLAs.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might. So please leave a comment below with what you thought of the article and what you are working on.