Introducing ScrapyRT: An API for Scrapy spiders

We’re proud to announce our new open source project, ScrapyRT! ScrapyRT, short for Scrapy Real Time, allows you to extract data from a single web page via an API using your existing Scrapy spiders.

Why did we start this project?

We needed to be able to retrieve the latest data for a previously scraped page, on demand. ScrapyRT made this easy by allowing us to reuse our spider logic to extract data from a single page, rather than running the whole crawl again.

How does ScrapyRT work?

ScrapyRT runs as a web service. To retrieve data, you make a request with the URL you want to extract data from and the name of the spider you would like to use.

If you were running ScrapyRT on localhost, you could make a request like this:

http://localhost:9080/crawl.json?spider_name=foo&url=http://example.com/product/1

ScrapyRT will schedule a request in Scrapy for the URL specified and use the ‘foo’ spider’s parse method as a callback. The data extracted from the page will be serialized into JSON and returned in the response body. If the spider specified doesn’t exist, a 404 will be returned. Most Scrapy spiders will be compatible without any additional programming.
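As a rough illustration, here is how a client might build and send that request from Python using only the standard library. This is a minimal sketch, not part of ScrapyRT itself: the helper names `build_crawl_url` and `fetch_page_data` and the 30-second timeout are assumptions made for the example, and nothing is assumed about the JSON layout beyond it being valid JSON.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_crawl_url(spider_name, page_url, host="localhost", port=9080):
    """Build a crawl.json request URL like the one in the example above.

    Hypothetical helper for illustration; not part of ScrapyRT.
    """
    # urlencode percent-escapes the target URL, so any "?" or "&" it
    # contains cannot be mistaken for ScrapyRT's own query parameters.
    query = urlencode({"spider_name": spider_name, "url": page_url})
    return f"http://{host}:{port}/crawl.json?{query}"


def fetch_page_data(spider_name, page_url, timeout=30):
    """Return the decoded JSON body, or None if ScrapyRT answers with a 404.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    request_url = build_crawl_url(spider_name, page_url)
    try:
        with urlopen(request_url, timeout=timeout) as response:
            return json.load(response)
    except HTTPError as err:
        if err.code == 404:  # per the post: the named spider does not exist
            return None
        raise
```

With ScrapyRT running locally, `fetch_page_data("foo", "http://example.com/product/1")` would send the request shown above. Escaping the target URL matters most when the page address carries its own query string.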

How do I use ScrapyRT in my Scrapy project?

 > git clone https://github.com/scrapinghub/scrapyrt.git
 > cd scrapyrt
 > pip install -r requirements.txt
 > python setup.py install
 > cd ~/your-scrapy-project
 > scrapyrt

ScrapyRT will be running on port 9080, and you can schedule your spiders per the example shown earlier.

We hope you find ScrapyRT useful and look forward to hearing your feedback!

Comment here or discuss on HackerNews.

7 thoughts on “Introducing ScrapyRT: An API for Scrapy spiders”

  1. Hi, sorry for the OT comment but there are two things you might want to take a look at:

    1) This post was re-published at http://thetrendythings.com/read/19151 with an advertisement placed next to it. I’m assuming they did this without your authorization, just like they did with an article of mine. If that’s the case, I’d suggest getting in touch with them to have it taken down.

    2) The certificate you are using for https://blog.scrapinghub.com/ is invalid because it is issued for the *.wordpress.com domain. This manifests itself when logging into this comment system here via Twitter, for example, because it redirects back via HTTPS after authorization, leading to a nasty warning. You might want to fix that so as not to scare off potential commenters.

    Cheers
    Moritz

    1. Hi Moritz, thanks for your comments. Here are the answers to your questions:

      1) They did mention blog.scrapinghub.com as the source, right below the title here: http://thetrendythings.com/read/19151
      2) Our blog is supposed to be accessed only via http (not https). I wasn’t aware of the certificate problem when adding comments; I’ll let our sysadmin team know so they can take a look.

      1. Hi Pablo,

        you’re welcome!

        1) Sure, but there is no license statement on your site which allows commercial use of your content, so I thought you might be interested in other people making money from it. I guess this post is mostly a release announcement, so you probably don’t care as much about it and are happy about the increased visibility. In that case I’d suggest placing your content under the CC-BY license or something similar so that other people may profit from it, too, in a clearly legal way.

        2) Excellent!

        Moritz

  2. Hmmm, the URL parameter is required. I’d prefer to be able to run a spider with custom parameters that I define and still see the JSON response from ScrapyRT. It’s great and really needed, but not complete.
