How to Deploy Custom Docker Images for Your Web Crawlers

How to Deploy Custom Docker Images for Your Web Crawlers

What if you could have complete control over your environment? Your crawling environment, that is… One of the many benefits of our upgraded production environment, Scrapy Cloud 2.0, is that you can customize your crawler runtime environment via Docker images. It’s like a superpower that allows you to use specific versions of Python, Scrapy and the rest of your stack, deciding if and when to upgrade.

docker

With this new feature, you can tailor a Docker image to include any dependency your crawler might have. For instance, if you wanted to crawl JavaScript-based pages using Selenium and PhantomJS, you would have to include the PhantomJS executable somewhere in the PATH of your crawler’s runtime environment.

And guess what, we’ll be walking you through how to do just that in this post.

Heads up, while we have a forever free account, this feature is only available for paid Scrapy Cloud users. The good news is that it’s easy to upgrade your account. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

Using a custom image to run a headless browser

Download the sample project or clone the GitHub repo to follow along.

Imagine you created a crawler to handle website content that is rendered client-side via Javascript. You decide to use selenium and PhantomJS. However, since PhantomJS is not installed by default on Scrapy Cloud, trying to deploy your crawler the usual way would result in this message showing up in the job logs:

selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable needs to be in PATH.

PhantomJS, which is a C++ application, needs to be installed in the runtime environment. You can do this by creating a custom Docker image that downloads and installs the PhantomJS executable.

Building a custom Docker image

First you have to install a command line tool that will help you with building and deploying the image:

$ pip install shub

Before using shub, you have to include scrapinghub-entrypoint-scrapy in your project’s requirements file, which is a runtime dependency of Scrapy Cloud.

$ echo scrapinghub-entrypoint-scrapy >> ./requirements.txt

Once you have done that, run the following command to generate an initial Dockerfile for your custom image:

$ shub image init --requirements ./requirements.txt

It will ask you whether you want to save the Dockerfile, so confirm by answering Y.

Now it’s time to include the installation steps for PhantomJS binary in the generated Dockerfile. All you need to do is copy the highlighted code below and put it in the proper place inside your Dockerfile:

FROM python:2.7
RUN apt-get update -qq && \
    apt-get install -qy htop iputils-ping lsof ltrace strace telnet vim && \
    rm -rf /var/lib/apt/lists/*
RUN wget -q https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/bin && \
    rm -rf phantomjs-2.1.1-linux-x86_64.tar.bz2 phantomjs-2.1.1-linux-x86_64
ENV TERM xterm
ENV PYTHONPATH $PYTHONPATH:/app
ENV SCRAPY_SETTINGS_MODULE demo.settings
RUN mkdir -p /app
WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app

The Docker image you’re going to build with shub has to be uploaded to a Docker registry. I used Docker Hub, the default Docker registry, to create a repository under my user account:

docker-hub

Once this is done, you have to define the images setting in your project’s scrapinghub.yml (replace stummjr/demo with your own):

projects:
    default: PUT_YOUR_PROJECT_ID_HERE
requirements_file: requirements.txt
images:
    default: stummjr/demo

This will tell shub where to push the image once it’s built and also where Scrapy Cloud should pull the image from when deploying.

Now that you have everything configured as expected, you can build, push and deploy the Docker image to Scrapy Cloud. This step may take a couple minutes, so now might be a good time to go grab a cup of coffee. 🙂

$ shub image upload --username stummjr --password NotSoEasy
The image stummjr/demo:1.0 build is completed.
Pushing stummjr/demo:1.0 to the registry.
The image stummjr/demo:1.0 pushed successfully.
Deploy task results:
You can check deploy results later with 'shub image check --id 1'.
Deploy results:
{u'status': u'progress', u'last_step': u'pulling'}
{u'status': u'ok', u'project': 98162, u'version': u'1.0', u'spiders': 1}

If everything went well, you should now be able to run your PhantomJS spider on Scrapy Cloud. If you followed along with the sample project from the GitHub repo, your crawler should have collected 300 quotes scraped from the page that was rendered with PhantomJS.

Wrap Up

You now officially know how to use custom Docker images with Scrapy Cloud to supercharge your crawling projects. For example, you might want to do OCR using Tesseract in your crawler. Now you can, it’s just a matter of creating a Docker image with the Tesseract command line tool and pytesseract installed. You can also install tools from apt repositories and even compile the libraries/tools that you want.

Warning: this feature is still in beta, so be aware that some Scrapy Cloud features, such as addons, dependencies and settings, still don’t work with custom images.

For further information, check out the shub documentation.

Feel free to comment below with any other ideas or tips you’d like to hear more about!

This feature is a perk of paid accounts, so painlessly upgrade to unlock custom docker images for your projects. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

Be the first to know. Gain insights. Make better decisions.

Use web data to do all this and more. We’ve been crawling the web since 2010 and can provide you with web data as a service.

Tell me more

11 thoughts on “How to Deploy Custom Docker Images for Your Web Crawlers

      1. Hi can i deploy a selenium docker image in order to use it in a scrapy spider and run it on scrapinghub??

  1. Hi,
    I’ve tried what you explained exactly on my spider and it didn’t work out on scrapy cloud after deployment.
    I’m using selenuim with phantomjs in a separate file and calling it from my spider.
    can you help?

  2. Hi Vladir,

    If I could get selenium to work on scrapinghub I’d upgrade immediately 🙂 I’ve recently had to move some of my scrapy spiders to Selenium since websites seem to be making it more difficult to scrape (or just moving to modern frameworks with late loading of content).

    However, I get the following error when running “shub-image init –requirements ./requirements.txt”:

    File “/Users/xxx/.VirtualEnvs/classifieds/lib/python2.7/site-packages/shub_image/push.py”, line 4, in
    from shub.deploy import list_targets
    ImportError: cannot import name list_targets

    Any idea why ?
    Thank you,
    Charlie

    1. ps. This was a perfectly new virtualenv. Prior commands followed:

      mkvirtualenv classifieds
      pip install shub-image
      pip install Scrapy
      echo scrapinghub-entrypoint-scrapy > requirements.txt
      pip install ruamel.yaml # Note that this was required
      shub-image init –requirements ./requirements.txt # Error encountered

      Thank you very much for the help.

  3. I believe this is just what I need and a very well-written tutorial, but when I do
    shub image upload –username stummjr –password NotSoEasy

    I get

    Steps: 0%| | 0/1E
    rror unauthorized: authentication required: {u’message’: u’unauthorized: authent
    ication required’}

    1. After several tries I authenticated. Not sure why the delay. Maybe it takes awhile for your registration to take effect.

Leave a Reply

Your email address will not be published. Required fields are marked *