Introducing ScrapyRT: An API for Scrapy spiders

We’re proud to announce our new open source project, ScrapyRT! ScrapyRT, short for Scrapy Real Time, allows you to extract data from a single web page via an API using your existing Scrapy spiders.

Why did we start this project?

We needed to be able to retrieve the latest data for a previously scraped page, on demand. ScrapyRT made this easy by allowing us to reuse our spider logic to extract data from a single page, rather than running the whole crawl again.

How does ScrapyRT work?

ScrapyRT runs as a web service. Retrieving data is as simple as making a request with the URL you want to extract data from and the name of the spider you would like to use.

If you were running ScrapyRT on localhost, you could make a request like this:

http://localhost:9080/crawl.json?spider_name=foo&url=http://example.com/product/1

ScrapyRT will schedule a request in Scrapy for the URL specified and use the ‘foo’ spider’s parse method as a callback. The data extracted from the page will be serialized into JSON and returned in the response body. If the spider specified doesn’t exist, a 404 will be returned. Most Scrapy spiders will be compatible without any additional programming.
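To make the request flow concrete, here is a minimal Python sketch of a client for the endpoint described above. It assumes ScrapyRT is listening on localhost:9080, as in the example. The helper names `build_crawl_url` and `fetch_page_data` are hypothetical and not part of ScrapyRT. The sketch also percent-encodes the target URL, which the example request above does not do; that is our choice for safety with URLs that carry their own query strings. The shape of the returned JSON is not assumed here, so the body is simply decoded and handed back.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_crawl_url(spider_name, url, host="localhost", port=9080):
    """Build a crawl.json request URL (hypothetical helper, not a ScrapyRT API)."""
    # urlencode percent-encodes the target URL so that any '&' or '?'
    # inside it cannot be mistaken for part of the outer query string.
    query = urllib.parse.urlencode({"spider_name": spider_name, "url": url})
    return f"http://{host}:{port}/crawl.json?{query}"


def fetch_page_data(spider_name, url, timeout=30):
    """Return the decoded JSON body, or None if the service answers 404."""
    request_url = build_crawl_url(spider_name, url)
    try:
        with urllib.request.urlopen(request_url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # The post says an unknown spider produces a 404.
        if err.code == 404:
            return None
        raise
```

With the service running, calling `fetch_page_data("foo", "http://example.com/product/1")` would issue the same request as the URL shown earlier and return whatever JSON the ‘foo’ spider produced for that page.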

How do I use ScrapyRT in my Scrapy project?

 > git clone https://github.com/scrapinghub/scrapyrt.git
 > cd scrapyrt
 > pip install -r requirements.txt
 > python setup.py install
 > cd ~/your-scrapy-project
 > scrapyrt

ScrapyRT will be running on port 9080, and you can schedule your spiders per the example shown earlier.

We hope you find ScrapyRT useful and look forward to hearing your feedback!

Comment here or discuss on HackerNews.