
Scrape Data Visually with Portia and Scrapy Cloud

It’s been several months since we first integrated Portia into our Scrapy Cloud platform, and last week we officially began to phase out Autoscraping in favor of Portia.

In case you aren’t familiar with Portia, it’s an open source tool we developed for visually scraping websites. Portia allows you to make templates of pages you want to scrape and uses those templates to create a spider to scrape similar pages.


Autoscraping is the predecessor to Portia, and for the time being it’s still available to users who already have Autoscraping-based projects. Any new projects, as well as existing projects without Autoscraping spiders, will only be able to use Portia.

In this post we’re going to introduce Portia by creating a spider for Allrecipes. Let’s start by creating a new Portia project:

[Screenshot: portia-create-project]

Once the project has been created, you will be redirected to the main Portia screen:

[Screenshot: portia-project-screen]

To create a new spider for the project, begin by entering the website URL in Portia’s address bar and clicking the ‘New Spider’ button. Portia will create a new spider and display the page:

[Screenshot: portia-loaded-page]

You can navigate the site like you normally would until you find a page containing data you want to scrape. Sites which require JavaScript to render aren’t currently supported.

[Screenshot: portia-page-to-annotate]

Once you’ve found a page with data you’re interested in, click the ‘Annotate this page’ button at the top to create a new template.

You will notice that hovering over an element highlights it, and clicking it creates an annotation. An annotation maps an element’s attribute or content to a field in an item you wish to scrape.

[Screenshot: portia-create-annotation]

In the screenshot above we have clicked the title of a recipe. On the left of the annotation window you will see an ‘Attribute’ dropdown. This allows you to select which part of the element you wish to map. In this case we’re going to map the content, but when annotating elements like images you may want to select a different attribute such as the ‘src’ value.

The value which will be extracted for this particular page is shown in the middle of the annotation window under ‘Value’. On the right you can select the field to map the attribute to. Because new projects are created with a default item, there are already fields we can map to.

Let’s say we don’t want to use the default item. For the time being we will discard the annotation by clicking the red trash can icon at the top of the annotation window.

[Screenshot: portia-delete-annotation]

Move your mouse to the right to display the right-hand sidebar and expand the ‘Extracted item’ tab. You will notice the current extracted item type is ‘default’; click the ‘Edit items’ button.

[Screenshot: portia-edit-items]

Here you can edit the default item and its fields, as well as create more items if you wish. In this example we’ll simply edit the default item:

[Screenshot: portia-recipe-item]

Click ‘Save changes’ and you will now be able to map elements to your new set of fields. Once you have annotated everything you wish to extract, click ‘Save template’ and you will be redirected to the spider’s start URL. You can now test your spider by visiting another page similar to the one you annotated:

[Screenshot: portia-extracted-items]

Once you’ve tested several pages and are satisfied your spider is working, you can now deploy your project to Dash. Click the project link in the breadcrumbs (displayed top left) to leave the spider and go to the project page.

[Screenshot: portia-publish-changes]

Click the ‘Publish changes’ button on the right-hand sidebar to publish your project, and you should receive a message box asking if you want to be redirected to the schedule page. Click ‘OK’ and you will be redirected to the jobs page in Dash where you can now schedule your spider.

[Screenshot: portia-dash-schedule-spider]

Click the ‘Schedule’ button on the top right and select your spider from the dropdown. Click ‘Schedule’ and Dash will start your spider. You will notice that items are scraped just as with any standard Scrapy spider, and you can go into the job’s items page and download the scraped items as you normally would:

[Screenshot: portia-scraped-items]

That’s all there is to it! Hopefully this demonstrates just how easy it is to create spiders using Portia without writing any code whatsoever. There are a lot of features we didn’t cover, so we recommend taking a look at the documentation on GitHub if you want to learn more. Portia is open source, so you can run your own instance if you don’t wish to use Scrapy Cloud, and we are open to pull requests!


Scrapinghub: A Remote Working Success Story

When Scrapinghub came into the world in 2010, one thing we wanted was for it to be a company which could be powered by a global workforce, each individual working remotely from anywhere in the world.
Reduced commuting time and, as a consequence, increased family time were the primary reasons for this. In Uruguay, Pablo was commuting long distances to do work which realistically could have been done just as easily from home, and Shane wanted to divide his time between Ireland, London and Japan. Having a regular office space was never going to work out for these guys.

Where we are based!

The Pitfalls of Open Plan

From the employee’s point of view, as well as eliminating the daily commute and wiping out the associated costs – fuel, parking, tax and insurance if you own a car, or bus/train fares if relying on public transport – remote working allows you to work in an office space of your choosing. You decorate and fit out your own space to your own tastes. No more putting up with the pitfalls of open plan, with distractions and interruptions possible every minute of the day. No more complaining about air con or the lack thereof. How often have you picked up a cold or flu from a sick co-worker? The spread of colds and other illnesses is a huge disadvantage of a shared working space.

Communication

Yes, an open plan environment is good for collaboration, but with the likes of Skype and Google Hangouts we get all the benefits of face-to-face communication in an instant. All you need is a webcam, a mic and a Google+ or Skype account. Simple! We can hold meetings, conduct interviews, brainstorm and share presentations.
For real-time messaging, HipChat and Slack are the primary team communication tools used by remote companies. At Scrapinghub, we use Slack as a platform to bring all of our communication together in one place. It’s great for real-time messaging, as well as sharing and archiving documents. It encourages daily intercommunication and significantly reduces the number of emails sent.

Savings

From an employer’s point of view, a major benefit of a fully remote company is huge cost savings on office rent. This in particular is important for small start-ups who might be tight on initial cashflow. Other benefits include having wider access to talent and the fact that remote working is a huge selling point to potential hires.

Productivity

From an employer’s perspective, the big downside of remote working is the worry of reduced productivity when employees work unsupervised from home. Research by Harvard Business Review shows that productivity actually increases when people are trusted by their company to work remotely, mainly thanks to a quieter environment. Even so, some employers are very slow to move from a traditional workplace to remote working because of these productivity worries.
Whether productivity increases or decreases depends entirely on the inner workings of the individual business; if a culture of creativity, trust and motivation exists, then it’s the perfect working model.

Social Interaction

Scrapinghub regularly holds a virtual office tour day. We often have meetings via Hangouts, and we get little glimpses into our colleagues’ offices, often on the other side of the world. A poster or book spine might catch the eye, and these virtual office tour days are a way of learning more about each other, stepping into a colleague’s space for just a minute and seeing what it’s like on their end.
Social interaction is also encouraged through the use of an off-topic channel on Slack and different communities on Google+ such as Scholars, Book Club and Technology. Scrapinghubbers can discuss non-work-related issues this way, and many team members meet up with colleagues from around the world when they are travelling.

Top Tips for Remote Working

  1. Routine: If your working hours are self-monitored, try to work the same hours every day. Set your alarm clock and get up at the same time each morning. Create a work schedule and set yourself a tea break and lunch break to give your day structure.
  2. Work space: Ensure you have a defined work space in your home, preferably with a door so you can close it if there are children, guests or pets that may distract you.
  3. Health:  Sitting at a computer for hours on end isn’t healthy for the body. Try to get up from your computer every 30 minutes or so and walk around. Stretch your arms above your head and take some deep breaths.  This short time will also give your eyes a break from the screen.
  4. Social:  If you find working from home lonely, then why not look into co-working? This involves people from different organisations using a shared work environment.
  5. Focus: It’s very tempting to check your personal email and social media when working from home. This is hugely distracting so use an app like Self Control to block distracting sites for a set period of time.


Bye Bye HipChat, Hello Slack!

For many years now, we have used HipChat for our team collaboration. For the most part we were satisfied, but there were a number of pain points, and over time Slack started to seem like the better option for us. Last month we decided to ditch HipChat in favor of Slack. Here’s why.

User Interface


Slack has a much more visual interface; avatars are shown alongside messages, and the application as a whole is much more vibrant and colorful. One thing we really like about Slack is you can see a nice summary of your recent mentions. Another cool feature is the ability to star items such as messages and files, which you can later access from the Flexpane menu in the top right corner.

[Screenshot: Slack notification preferences]

Notifications can be configured on a case-by-case basis for channels and groups, and you can even add highlight words to let you know when someone mentions a word or phrase.

Migration


Migrating from HipChat to Slack was a breeze. Slack allows easy importing of logs from HipChat, not to mention other chat services such as Flowdock and Campfire, as well as text files and CSV.

Slack has a great guide for importing chat logs from another service.

Teams, Groups and Channels


With Slack, accounts are at the team level just as in HipChat; however, you can use the same email address for multiple teams, whereas in HipChat each account must have its own email address. Another benefit of Slack is that you can sign into multiple teams simultaneously. We found that with HipChat our clients would sometimes run into trouble because they were using HipChat for their own company as well, so this feature helps a lot in that regard.

Admittedly, the distinction between channels and private groups was a little confusing at first, as their functionality is very similar. Unlike channels, which are visible to everyone, private groups are visible only to the creator and those who were invited. Accounts can be restricted to a subset of channels, and single-channel guests can only access one channel.

Integration with Third-Party Services


Slack integrates with a large number of services, and this doesn’t include their community-built integrations.

We created a Slack bot based on Limbo (previously called Slask) using Slack’s Real Time Messaging API which allowed us to easily port our build bot to Slack.
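For the curious, here is a rough sketch of what talking to the Real Time Messaging API looks like from Python. This is not our build bot: the SLACK_TOKEN environment variable is a hypothetical placeholder, and it assumes the requests and websocket-client packages are installed. The rtm.start call returns a WebSocket URL, and events then arrive over that socket as JSON:

# Rough sketch of a Slack RTM client (not our build bot).
# Assumes `requests` and `websocket-client` are installed and that the
# SLACK_TOKEN environment variable (hypothetical placeholder) holds a valid token.
import json
import os

import requests
import websocket


def main():
    token = os.environ["SLACK_TOKEN"]
    # rtm.start returns, among other things, the WebSocket URL to connect to.
    session = requests.get("https://slack.com/api/rtm.start",
                           params={"token": token}).json()
    ws = websocket.create_connection(session["url"])
    while True:
        event = json.loads(ws.recv())
        # Print plain channel messages as they arrive.
        if event.get("type") == "message" and "text" in event:
            print(event.get("user"), event["text"])


if __name__ == "__main__":
    main()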

[Screenshot: Our #news channel]

Slack’s Twitter integration notifies you of mentions and retweets in real time, allowing us to quickly respond to relevant tweets and keep in tune with our audience. We also created a news channel and made use of Slack’s RSS integration to post the latest articles from various sources and keep everyone in the know.

Downsides of Slack


There’s no self-hosted version of Slack available, and while this wasn’t a feature we needed, this may be a deal breaker for some companies.

In our case, the only big negative we found was there’s no native app for Linux, but we found HipChat’s Linux client to have fallen behind the Mac and web clients, so it was easy to ignore this caveat in favor of a more consistent experience between platforms. We also noticed the mobile apps sync a lot better compared to HipChat when using multiple devices.

One minor grievance is the lack of @here in Slack. In HipChat, @here would notify all channel members who are available; in Slack the only alternative is @channel, which addresses the whole room regardless of availability.

Pricing and Feature Comparison


It’s worth pointing out that HipChat is a lot cheaper than Slack, and HipChat Basic’s message history is up to 25,000 messages compared to Slack Lite’s 10,000 message limit. Both HipChat Basic and Slack Lite are limited to 5GB file storage.

Slack

Pros:
  • Support for multiple teams
  • Much better search
  • Larger selection of integrations
  • More customizable
  • Sleeker UI

Cons:
  • No Windows or Linux client
  • No self-hosted version available

Platforms: OS X, Web, iOS, Android (Windows Desktop + Phone in development)
API: Yes
Supported integrations: 73 available
Video: No (coming soon)
Screen sharing: No (coming soon)
Pricing: Lite: Free; Standard: $6.67 per user / month; Plus: $12.50 per user / month

HipChat

Pros:
  • Video conferencing
  • Screen sharing
  • More affordable
  • Simpler interface

Cons:
  • No support for multiple groups under one account
  • Poor search functionality

Platforms: Windows, OS X, Linux, Web, iOS, Android
API: Yes
Supported integrations: 65 available
Video: Yes (Plus only)
Screen sharing: Yes (Plus only)
Pricing: Basic: Free; Plus: $2 per user / month


Final Thoughts


On the whole we’ve been really impressed by Slack, and for such a new application (its initial release was in August 2013!) it’s very slick and well polished. The lack of video and screen sharing hasn’t been a problem for us, as we use Google+ Hangouts for meetings, and Slack includes a nice ‘/hangout’ command to start a hangout with everyone in the room. For those who need these features, you will be pleased to know that earlier this year Slack acquired Screenhero with the aim of adding voice, video and screen sharing to Slack.


The History of Scrapinghub

Joanne O’Flynn meets with Pablo Hoffman and Shane Evans to find out what inspired them to set up web crawling company Scrapinghub.

Scrapinghub may be a very young company but it already has a great story to tell. Shane Evans from Ireland and Pablo Hoffman from Uruguay came together in a meeting of great web crawling minds to form the business after working together on the same project but for different companies.

In 2007, Shane was leading software development for MyDeco, a London-based startup. Shane’s team needed data to develop the MyDeco systems and set about trying to find an appropriate company they could trust to deliver it. Frustrated with the lack of high-quality software and services available, Shane took the matter into his own hands and decided to create a framework with his team to build web crawlers to the highest standard.

This turned out well, and it didn’t take long to write plugins for the most important websites they wanted to obtain data from. Continued support for more websites and maintenance of the framework was required, so Shane looked for someone to help.

Meanwhile, after studying computer science and electrical engineering, Pablo graduated in 2007. Soon after graduating, he set up Insophia, a Python development outsourcing company in Montevideo, Uruguay. One of Pablo’s main clients recommended Insophia to MyDeco and it wasn’t long before he was running the MyDeco web scraping team and helping to develop the data processing architecture.

Pablo could see massive potential in the code and, six months later, asked Shane if they could open source the web crawler. Cleverly combining the words ‘Scrape’ and ‘Python’, they christened the project Scrapy. After Pablo spent many months refining it and releasing updates, Scrapy became quite popular, and word reached his ears that several high-profile companies, including a social media giant, were using the technology!

In 2010, after developing a fantastic working relationship, Shane and Pablo could see there was an opportunity to start a company that could really make a difference by providing web crawling services, while continuing to advance open source projects. The two men decided to go into business together with one goal: To make it easier to get structured data from the internet.

With a tight and knowledgeable core group of developers and a huge drive to provide the most functional and efficient web crawlers, Shane and Pablo formed Scrapinghub. They started off as just a handful of hard-working programmers; by 2011 there were 10 employees, by 2012 there were 20 and staff numbers continued to double each year, up to the present where there are now almost 100 employees – or Scrapinghubbers as they affectionately call themselves – globally dedicated to developing the best web crawling and data processing solutions.

In 2014 alone, the company scraped and stored data from more than 10 billion pages (more than five times the amount the company did in 2013) and an extra 5 billion have passed through Crawlera. 2015 is already off to a great start with the release of two new open source projects: ScrapyRT and Skinfer, a tool for inferring JSON schemas. The icing on the cake is Scrapinghub’s February announcement of its participation in the DARPA project Memex. It’s testament to the knowledge and experience of a very dedicated team working together all around the world. From small beginnings come great things and it’s clear that for Scrapinghub, a very bright future awaits.

Skinfer: A Tool for Inferring JSON Schemas

Imagine that you have a lot of samples of a certain kind of data in JSON format. Maybe you want to get a better feel for it: to know which fields appear in all records, which appear only in some, and what their types are. In other words, you want to know the schema of the data that you have.

We’d like to introduce skinfer, a tool we built for inferring the schema from samples in JSON format. Skinfer will take a list of JSON samples and give you one JSON schema that describes all of the samples. (For more information about JSON Schema, we recommend the online book Understanding JSON Schema.)

Install skinfer with pip install skinfer, then generate a schema by running the schema_inferer command and passing it a list of JSON samples (either a JSON lines file containing all the samples, or a list of JSON files passed via the command line).

Here is an example of usage with a simple input:

$ cat samples.json
{"name": "Claudio", "age": 29}
{"name": "Roberto", "surname": "Gomez", "age": 72}
$ schema_inferer --jsonlines samples.json
{
    "$schema": "http://json-schema.org/draft-04/schema",
    "required": [
        "age",
        "name"
    ],
    "type": "object",
    "properties": {
        "age": {
            "type": "number"
        },
        "surname": {
            "type": "string"
        },
        "name": {
            "type": "string"
        }
    }
}

Once you’ve generated a schema for your data, you can:

  1. Run it against other samples to see if they share the same schema
  2. Share it with anyone who wants to know the structure of your data
  3. Complement it manually, adding descriptions for the fields
  4. Use a tool like docson to generate a nice page documenting the schema of your data (see example here)

Another interesting feature of Skinfer is that it can also merge a list of schemas, giving you a new schema that describes samples from all previously given schemas. For this, use the json_schema_merger command, passing it a list of schemas.

This is cool because you can continuously keep updating a schema even after you’ve already generated it: you can just merge it with the one you already have.
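For example, something like the following should work (the file names here are made up, and we’re assuming json_schema_merger takes the schema files as positional arguments and prints the merged schema to standard output, as schema_inferer does with samples):

$ json_schema_merger old_schema.json new_schema.json > merged_schema.json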

Feel free to dive into the code, explore the docs and please file any issues that you have on GitHub. :)

Handling JavaScript in Scrapy with Splash

A common problem when developing spiders is dealing with sites that use a heavy amount of JavaScript. It’s not uncommon for a page to require JavaScript to be executed in order to render the page properly. In this post we’re going to show you how you can use Splash to handle JavaScript in your Scrapy projects.

What is Splash?

Splash is our in-house solution for JavaScript rendering, implemented in Python using Twisted and Qt. Splash is a lightweight web browser capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more. Best of all, it’s open source!

Setting Up Splash

The easiest way to set up Splash is through Docker:

$ docker run -p 8050:8050 scrapinghub/splash

Splash will now be running on localhost:8050. If you’re on OS X, it will be running on the IP address of boot2docker’s virtual machine.

If you would like to install Splash without using Docker, please refer to the documentation.

Using Splash with Scrapy

Now that Splash is running, you can test it in your browser:

http://localhost:8050/

On the right enter a URL (e.g. http://amazon.com) and click ‘Render me!’. Splash will display a screenshot of the page as well as charts and a list of requests with their timings. At the bottom you should see a textbox containing the rendered HTML.
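You can also hit the HTTP API directly. Here’s a quick sketch using the requests library against the render.html endpoint (example.com simply stands in for whatever page you want to render):

# Quick sketch: fetch browser-rendered HTML through Splash's render.html endpoint.
# Assumes Splash is listening on localhost:8050 and `requests` is installed.
import requests

resp = requests.get("http://localhost:8050/render.html", params={
    "url": "http://example.com",  # the page you want Splash to render
    "wait": 0.5,                  # give the page half a second to finish loading
})
print(resp.text[:200])  # first 200 characters of the rendered HTML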

Manually

You can use Request to send links to Splash:

import json

import scrapy
from scrapy.http import Headers

req_url = "http://localhost:8050/render.json"
body = json.dumps({
    "url": url,
    "har": 1,   # include a HAR record of the page load
    "html": 0,  # we don't need the rendered HTML in this example
})
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(req_url, self.parse_link, method='POST',
                     body=body, headers=headers)

If you’re using CrawlSpider, the easiest way is to override the process_links function in your spider to replace links with their Splash equivalents:

from urllib.parse import urlencode  # Python 2: from urllib import urlencode

def process_links(self, links):
    # Rewrite each extracted link so it is fetched through Splash's render.html endpoint
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links

ScrapyJS (recommended)

The preferred way to integrate Splash with Scrapy is using ScrapyJS. You can install ScrapyJS using pip:

pip install scrapyjs

To use ScrapyJS in your project, you first need to enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

The middleware needs to take precedence over the HttpProxyMiddleware, which by default is at position 750, so we set the middleware to 725.

You will then need to set the SPLASH_URL setting in your project’s settings.py:

SPLASH_URL = 'http://localhost:8050/'

Don’t forget, if you are using boot2docker on OS X, you will need to set this to the IP address of the boot2docker virtual machine, e.g.:

SPLASH_URL = 'http://192.168.59.103:8050/'

Scrapy currently doesn’t provide a way to override request fingerprints calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'

If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapyjs.splash_request_fingerprint.
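As a rough illustration of the pattern, here is a minimal sketch that uses the filesystem backend as a stand-in for ‘your’ backend (scrapyjs already ships SplashAwareFSCacheStorage for this particular case). It assumes Scrapy 1.0+, where the class lives in scrapy.extensions.httpcache (older releases use scrapy.contrib.httpcache), and that the backend computes its cache key in _get_request_path; other backends may do this elsewhere:

# Minimal sketch only: making a custom cache storage backend Splash-aware.
# FilesystemCacheStorage stands in for your own backend; the method that calls
# request_fingerprint may be named differently in other backends.
import os

from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapyjs import splash_request_fingerprint


class MySplashAwareCacheStorage(FilesystemCacheStorage):

    def _get_request_path(self, spider, request):
        # Use the Splash-aware fingerprint so responses are keyed by the
        # rendered target URL rather than by the Splash endpoint URL.
        key = splash_request_fingerprint(request)
        return os.path.join(self.cachedir, spider.name, key[0:2], key)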

Now that the Splash middleware is enabled, you can begin rendering your requests with Splash using the ‘splash’ meta key.

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …

The ‘args’ dict contains the arguments to send to Splash; you can find a full list of available arguments in the HTTP API documentation. We don’t need to provide the url parameter in args, as it will be pre-filled from the URL in the Request object. By default the endpoint is set to ‘render.json’, but here we have overridden it and set it to ‘render.html’ to get an HTML response.

Running Custom JavaScript

Sometimes when visiting a website, you may need to press a button or close a modal to view the page properly. Splash allows you to run your own JavaScript code within the context of the web page you’re requesting. There are several ways you can accomplish this:

Using the js_source Parameter

The js_source parameter can be used to send the JavaScript you want to execute. It’s recommended that you send this as a POST parameter rather than GET due to web servers and proxies enforcing limits on the size of GET parameters. See curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/json' \
    -d '{"js_source": "document.title=\"My Title\";", "url": "http://example.com"}' \
    'http://localhost:8050/render.html'

Sending JavaScript in the Body

You can also set the Content-Type header to ‘application/javascript’ and send the JavaScript code you would like to execute in the body of the request. See curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/javascript' \
    -d 'document.title="My Title";' \
    'http://localhost:8050/render.html?url=http://domain.com'

Splash Scripts (recommended)

Splash supports Lua scripts through its execute endpoint. This is the preferred way to execute JavaScript as you can preload libraries, choose when to execute the JavaScript, and retrieve the output.

Here’s an example script:

function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end

This will return a JSON object containing the title:

{
    "title": "Some title"
}

Every script requires a main function to act as the entry point. You can return a Lua table, which will be rendered as JSON; that is what we have done here. The splash:go function tells Splash to visit the provided URL. The splash:evaljs function allows you to execute JavaScript within the page context and returns the result; if you don’t need the result, you should use splash:runjs instead.

You can test your Splash scripts in your browser by visiting your Splash instance’s index page (e.g. http://localhost:8050/). It’s also possible to use Splash with IPython notebook as an interactive web-based development environment; see here for more details. Note: until Splash 1.5 is released you will need to use the master branch, so instead of docker pull scrapinghub/splash-jupyter (as stated in the docs) you should do docker pull scrapinghub/splash-jupyter:master.

A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:

function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end

Here we use splash:autoload to load the jQuery library from Google’s CDN. We then tell Splash to visit the website and run our custom JavaScript to click the button with jQuery’s click function. The rendered HTML is then returned to the browser.

You can find more info on running JavaScript with Splash in the docs, and for a more in-depth tutorial, check out the Splash Scripts Tutorial.

We hope this tutorial gave you a nice introduction to Splash, and please let us know if you have any questions or comments!

Scrapinghub Crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?”

— Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web, that is, web content not indexed by traditional search engines such as Google, Bing and others. According to current estimates, this content dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven really hard to track down in a systematic way.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

[Video: “New Search Engine Exposes the Dark Web” – Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes]

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to be able to bring our web scraping expertise and open source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!
