Spoofing your Scrapy bot IP using tsocks

Spoofing your Scrapy bot IP using tsocks

It is well known that many websites show different content depending on the region where they’re accessed. For example, some retailer sites show products available only for the region (US, Europe) of the user accessing the site.

Although this can be quite convenient for the website customers, it can be a pain for developers writing a spider for the site and running it from their local machines.

There is a simple way to proxy all requests as if they came from another server. You only need SSH access to this other server, no need to install any HTTP proxy. For this, you can use a program called tsocks.

Here’s how to do it in Ubuntu, though this recipe should be easy to extended to other Linux distros.

First, install tsocks with:

$ apt-get install tsocks

Then add this content to ~/.tsocksrc (update: recent versions settings are stored at ~/.tsocks.conf, but it may vary across distributions):

server = 127.0.0.1
server_type = 5
server_port = 9999

Next, SSH to the remote server you want to use:

$ ssh -D 9999 some_remote_server

And finally, in another terminal (without closing the SSH console), just run Scrapy by prefixing it with the tsocks command, like this:

$ tsocks scrapy crawl myspider

That’s all. Your spider will run in your local machine but proxying all communication through the remote server. No need to change any settings or configuration.

Be the first to know. Gain insights. Make better decisions.

Use web data to do all this and more. We’ve been crawling the web since 2010 and can provide you with web data as a service.

Tell me more

One thought on “Spoofing your Scrapy bot IP using tsocks

Leave a Reply

Your email address will not be published. Required fields are marked *