Browsed by
Author: Richard Dowinton

Frontera: The Brain Behind the Crawls

At Scrapinghub we're always building and running large crawls–last year we had 11 billion requests made on Scrapy Cloud alone. Crawling millions of pages from the internet requires more sophistication than getting a few contacts of a list, as we need to make sure that we get reliable data, up to date lists of item pages and are able to optimise our crawl as much as possible.

Scrape Data Visually with Portia and Scrapy Cloud

Note: Portia is no longer available for new users. It has been disabled for all the new organisations from August 20, 2018 onward.

It’s been several months since we first integrated Portia into our Scrapy Cloud platform, and last week we officially began to phase out Autoscraping in favor of Portia.

Handling JavaScript in Scrapy with Splash

A common roadblock when developing spiders is dealing with sites that use a heavy amount of JavaScript. Many modern websites run entirely on JavaScript, and require scripts to be run in order for the page to render properly. In many cases, pages also present modals and other dialogues that need to be interacted with to show the full page. In this post we’re going to show you how you can use...

New Changes to Our Scrapy Cloud Platform

We are proud to announce some exciting changes we've introduced this week. These changes bring a much more pleasant user experience, and several new features including the addition of Portia to our platform!

Introducing ScrapyRT: An API for Scrapy spiders

We’re proud to announce our new open source project, ScrapyRT! ScrapyRT, short for Scrapy Real Time, allows you to extract data from a single web page via an API using your existing Scrapy spiders.