How to use proxies with Python Requests module

Sending HTTP requests in Python is not always straightforward. The standard library offers modules such as urllib (and urllib2 in Python 2) to deal with HTTP requests, and there are also third-party tools like Requests. Many developers use Requests because it is high level and designed to make sending HTTP requests extremely easy.

But choosing the most suitable tool for your needs is only the first step. In the web scraping world, there are many obstacles to overcome. One huge challenge is when your scraper gets blocked. To solve this problem, you need to use proxies. In this article I’m going to show you how to use proxies with the Requests module so your scraper doesn’t get banned.

Requests and proxies

In this part we're going to cover how to configure proxies in Requests. To get started we need a working proxy and a URL we want to send the request to.

Basic usage

import requests

proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}

r = requests.get("http://toscrape.com", proxies=proxies)

The proxies dictionary must follow this scheme: it is not enough to define only the proxy address and port, you also need to specify the protocol. You can use the same proxy for multiple protocols. If your proxy requires authentication, use this syntax:

http://user:pass@10.10.10.10:8000
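Putting it together, a proxies dictionary with an authenticated proxy would look like this (the credentials and address are hypothetical, shown for illustration only):

import requests

# hypothetical credentials and proxy address, for illustration only
proxies = {
    "http": "http://user:pass@10.10.10.10:8000",
    "https": "http://user:pass@10.10.10.10:8000",
}

r = requests.get("http://toscrape.com", proxies=proxies)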

Environment variables

In the example above, proxies are defined for each individual request. If you don’t need this kind of per-request customization, you can simply set these environment variables:

export HTTP_PROXY="http://10.10.10.10:8000"
export HTTPS_PROXY="http://10.10.10.10:1212"

This way you don’t need to define any proxies in your code. Just make the request and it will work.
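As a quick sketch, assuming the variables above are exported in the same shell that runs the script, a plain request is routed through the configured proxies without any proxy-related code:

import requests

# no proxies argument needed: Requests picks up HTTP_PROXY / HTTPS_PROXY
# from the environment automatically
r = requests.get("http://toscrape.com")
print(r.status_code)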

Proxy with session

Sometimes you need to create a session and use a proxy at the same time to request a page. In this case, you first create a new Session object, add the proxies to it, and then send the request through that session:

import requests

s = requests.Session()
s.proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}

r = s.get("http://toscrape.com")
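You can still override the session-level proxies for a single call; a minimal sketch, assuming a hypothetical second proxy address:

# a proxies argument passed to an individual request takes precedence
# over the session-level setting for that call (hypothetical second proxy)
r = s.get("http://toscrape.com", proxies={"http": "http://10.10.10.11:8000"})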

IP rotation

As discussed earlier, a common problem we encounter while extracting data from the web is that our scraper gets blocked. It is frustrating because if we can’t even reach the website, we won’t be able to scrape it. The solution is to use proxies, or rather a pool of multiple proxies, which lets us get around the IP ban.

To be able to rotate IPs, we first need a pool of IP addresses. We can use free proxies found on the internet, or we can use a commercial solution. Be aware that if your product or service relies on scraped data, a free proxy solution will probably not be enough for your needs. If a high success rate and data quality are important to you, you should choose a paid proxy solution like Crawlera.

IP rotation with Requests

So let’s say we have a list of proxies. Something like this:

ip_addresses = [
    "85.237.57.198:44959",
    "116.0.2.94:43379",
    "186.86.247.169:39168",
    "185.132.179.112:1080",
    "190.61.44.86:9991",
]

Then, we can randomly pick a proxy to use for our request. If the proxy works properly we can access the given site. If there’s a connection error we might want to delete this proxy from the list and retry the same URL with another proxy.

import random
import requests

try:
    # pick a random proxy from the pool and use it for both protocols
    proxy_index = random.randint(0, len(ip_addresses) - 1)
    proxies = {"http": ip_addresses[proxy_index], "https": ip_addresses[proxy_index]}
    requests.get(url, proxies=proxies)
except requests.exceptions.RequestException:
    # implement here what to do when there’s a connection error
    # for example: remove the used proxy from the pool and retry the request using another one
    pass

There are multiple ways to handle connection errors, because sometimes the proxy you are trying to use is simply banned. In that case, there’s not much you can do other than removing it from the pool and retrying with another proxy. Other times, if the proxy isn’t banned, you just have to wait a little before using it again.
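To make the removal-and-retry idea concrete, here is a minimal sketch, assuming the ip_addresses pool from above and a hypothetical fetch_with_rotation helper:

import random
import requests

def fetch_with_rotation(url, pool, max_retries=3):
    # try up to max_retries proxies, dropping each one that fails
    for _ in range(max_retries):
        if not pool:
            break
        address = random.choice(pool)
        proxies = {"http": address, "https": address}
        try:
            return requests.get(url, proxies=proxies, timeout=10)
        except requests.exceptions.RequestException:
            # remove the failing proxy and retry with another one
            pool.remove(address)
    raise RuntimeError("no working proxy found")

response = fetch_with_rotation("http://toscrape.com", ip_addresses)

A more robust version would also distinguish proxies that are merely rate-limited (retry them later) from those that are banned outright (drop them for good).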

Implementing your own smart proxy solution that finds the best way to deal with these errors is very hard. That’s why you should consider using a managed solution, like Crawlera, to avoid all the unnecessary pain of managing proxies yourself.

Using Crawlera with Requests

As a closing note, I want to show you how to solve proxy issues in the easiest way with Crawlera.

import requests

url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = ":"  # your Crawlera API key goes before the colon, with an empty password
proxies = {
    "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

r = requests.get(url, proxies=proxies, verify=False)

What does this piece of code do? It sends a successful HTTP request. When you use Crawlera, you don’t need to deal with proxy rotation manually. Everything is taken care of internally.

If you find that managing proxies on your own is too complex and you’re looking for an easy solution, give Crawlera a try.
