How to use a proxy in Puppeteer

Puppeteer is a high-level API for headless Chrome. It’s one of the most popular tools for web automation and web scraping in Node.js. In web scraping, many developers use it to handle JavaScript rendering and web data extraction. In this article, we cover how to set up a proxy in Puppeteer and what your options are if you want to rotate proxies.


Puppeteer and proxies

In this section, we’re going to configure Puppeteer to use a proxy. For this, you will need a working proxy and a destination URL to send the request to.

'use strict';

const puppeteer = require('puppeteer');

(async () => {
  // The proxy is set once, at launch, for the whole browser instance.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://10.10.10.10:8000'],
  });
  const page = await browser.newPage();
  await page.goto('http://toscrape.com');
  await browser.close();
})();

As simple as that. This code will ensure that every request goes through the defined proxy. One downside with Puppeteer is that you cannot define proxies for each request in a simple way. So, the specified proxy will be used for all the requests of the browser instance.
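If your proxy requires a username and password, the flag alone is usually not enough: to our knowledge, Chromium's --proxy-server flag takes only the scheme, host and port, and credentials are supplied separately with page.authenticate. The sketch below shows one way to handle that. The helper names proxyLaunchConfig and openThroughProxy are made up for this illustration, and Puppeteer is loaded lazily so the URL-splitting helper can be tried on its own.

```javascript
'use strict';

// Hypothetical helper: split a proxy URL into the Chromium flag and
// optional credentials. Not part of Puppeteer itself.
function proxyLaunchConfig(proxyUrl) {
  const url = new URL(proxyUrl);
  return {
    // Only scheme://host:port goes into the flag.
    args: [`--proxy-server=${url.protocol}//${url.host}`],
    credentials: url.username
      ? {
          username: decodeURIComponent(url.username),
          password: decodeURIComponent(url.password),
        }
      : null,
  };
}

// Hypothetical wrapper: open one page through the given proxy and return its title.
async function openThroughProxy(proxyUrl, targetUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when a browser is needed
  const {args, credentials} = proxyLaunchConfig(proxyUrl);
  const browser = await puppeteer.launch({args});
  try {
    const page = await browser.newPage();
    if (credentials) {
      // Answers the proxy's authentication challenge for this page.
      await page.authenticate(credentials);
    }
    await page.goto(targetUrl);
    return await page.title();
  } finally {
    await browser.close();
  }
}
```

For example, openThroughProxy('http://user:pass@10.10.10.10:8000', 'http://toscrape.com') would launch the browser with the same flag as the example above and authenticate before navigating.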

IP rotation with Puppeteer

When you scrape the web at scale, you need to rotate proxies to avoid bans. If you want to implement your own IP pool in Puppeteer, you will find that you can only set a proxy at the browser level (as in the code above) and not per request. This is not ideal if you need a different proxy for each request. See this GitHub issue for more information about this topic.
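Because the proxy is fixed at launch, one workaround is to start a fresh browser for each page visit and hand it the next proxy from a pool. The sketch below illustrates this under stated assumptions: createProxyRotator and visitWithRotation are hypothetical names, the pool is a plain array of proxy URLs without credentials, and rotation happens per page visit (every sub-request of a page shares that page's proxy), not per individual request.

```javascript
'use strict';

// Hypothetical helper: returns a function that cycles through the pool in order.
function createProxyRotator(proxies) {
  if (proxies.length === 0) {
    throw new Error('proxy pool is empty');
  }
  let index = 0;
  return () => {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Visit each URL with a fresh browser bound to the next proxy in the pool.
async function visitWithRotation(urls, proxies) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when actually scraping
  const nextProxy = createProxyRotator(proxies);
  const titles = [];
  for (const url of urls) {
    const browser = await puppeteer.launch({
      args: [`--proxy-server=${nextProxy()}`],
    });
    try {
      const page = await browser.newPage();
      await page.goto(url);
      titles.push(await page.title());
    } finally {
      await browser.close();
    }
  }
  return titles;
}
```

Launching a browser per visit is slow and memory-hungry, so this is only practical for small jobs.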

To rotate proxies in Puppeteer and use a different IP address for each request, you need a proxy server. You can implement your own or use a backconnect proxy service. Be aware that implementing your own proxy server can lead you down a rabbit hole of problems that have nothing to do with web scraping and distract you from what you really want to achieve: extracting the data. So it’s not recommended. But if you decide to go this way, here is an example built with proxy-chain:

const ProxyChain = require('proxy-chain');

// Map each User-Agent value to the upstream proxy that should handle it.
const proxies = {
  'useragent1': 'http://user:pass@85.237.57.198:44959',
  'useragent2': 'http://user:pass@116.0.2.94:43379',
  'useragent3': 'http://user:pass@186.86.247.169:39168',
};

const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: ({request}) => {
    // Pick the upstream proxy based on the request's User-Agent header.
    const userAgent = request.headers['user-agent'];
    const proxy = proxies[userAgent];
    return {
      upstreamProxyUrl: proxy,
    };
  },
});

server.listen(() => console.log('Proxy server works!'));
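On the Puppeteer side, the browser points at the local proxy server and each page picks its upstream by setting its User-Agent to one of the keys in the map. The following is a minimal sketch of that idea, not a tested integration. The names upstreams, selectUpstream and visitVia are hypothetical, the proxy server is assumed to be listening on localhost:8000, and we assume the overridden User-Agent header actually reaches the proxy. That should hold for plain HTTP targets, but it is worth verifying for HTTPS, where the proxy only sees a CONNECT request. selectUpstream also returns null for unknown keys, which, as we understand proxy-chain's documentation, means "no upstream proxy".

```javascript
'use strict';

// Hypothetical routing table, mirroring the shape used by the proxy server example.
const upstreams = {
  useragent1: 'http://user:pass@85.237.57.198:44959',
  useragent2: 'http://user:pass@116.0.2.94:43379',
};

// Look up the upstream for a request's headers; null means "no match".
function selectUpstream(headers, table) {
  const userAgent = headers['user-agent'];
  return Object.prototype.hasOwnProperty.call(table, userAgent)
    ? table[userAgent]
    : null;
}

// Hypothetical wrapper: load a page through the local proxy server, using the
// User-Agent value as the routing key.
async function visitVia(userAgentKey, targetUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when a browser is needed
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://localhost:8000'],
  });
  try {
    const page = await browser.newPage();
    await page.setUserAgent(userAgentKey);
    await page.goto(targetUrl);
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

Keep in mind that the target site also receives the routing key as its User-Agent, so a real setup would more likely route on something the site never sees.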

Puppeteer with Crawlera

If you don’t want to implement your own JS proxy server, you can use a proxy rotation service, like Crawlera. This is the simplest way to use proxies with Puppeteer. If you don’t want to struggle with IP rotation and just want successful requests, this is how to use Puppeteer with Crawlera:

Note: It is recommended to use Puppeteer 1.17 with Chromium 76.0.3803.0. For newer versions of Puppeteer, the latest Chromium snapshot that can be used is r669921.

  1. Set ignoreHTTPSErrors to true in the puppeteer.launch method
  2. Specify Crawlera’s host and port in the --proxy-server flag
  3. Send Crawlera credentials in the Proxy-Authorization header

Here’s an example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: [
      '--proxy-server=proxy.crawlera.com:8010'
    ]
  });
  const page = await browser.newPage();
  // Replace <API KEY> with your Crawlera API key (username, with an empty password).
  await page.setExtraHTTPHeaders({
    'Proxy-Authorization': 'Basic ' + Buffer.from('<API KEY>:').toString('base64'),
  });
  console.log('Opening page ...');
  try {
    await page.goto('https://httpbin.scrapinghub.com/redirect/6', {timeout: 180000});
  } catch (err) {
    console.log(err);
  }
  console.log('Taking a screenshot ...');
  await page.screenshot({path: 'screenshot.png'});
  await browser.close();
})();
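The Proxy-Authorization value in the example is ordinary HTTP Basic authentication: the credentials are joined as username:password and base64-encoded. The sketch below pulls that into small helpers so the key does not have to be hard-coded. It assumes the API key is used as the username with an empty password. The helper names and the CRAWLERA_APIKEY environment variable are made up for this illustration.

```javascript
'use strict';

// Hypothetical helper: build an HTTP Basic auth value from a username and password.
function basicProxyAuth(username, password = '') {
  const token = Buffer.from(`${username}:${password}`).toString('base64');
  return `Basic ${token}`;
}

// Hypothetical helper: read the key from the environment instead of hard-coding it.
// CRAWLERA_APIKEY is an assumed variable name, not an official one.
function proxyAuthHeaders(env = process.env) {
  const apiKey = env.CRAWLERA_APIKEY;
  if (!apiKey) {
    throw new Error('CRAWLERA_APIKEY is not set');
  }
  return {'Proxy-Authorization': basicProxyAuth(apiKey)};
}
```

With these in place, the header setup becomes await page.setExtraHTTPHeaders(proxyAuthHeaders()), and a missing key fails immediately instead of producing unauthenticated requests.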

With Crawlera, you don’t have to struggle with IPs and rotation. Crawlera will take care of making your requests successful. For more tips on how to use Crawlera with Puppeteer see our support page. If you want to try Crawlera for FREE, go here!

