
Scrapinghub: A Remote Working Success Story

When Scrapinghub came into the world in 2010, we wanted it to be a company powered by a global workforce, with each individual working remotely from anywhere in the world.
Reduced commuting time, and as a consequence increased family time, were the primary reasons for this. In Uruguay, Pablo was commuting long distances to do work which realistically could have been done just as easily from home, and Shane wanted to divide his time between Ireland, London and Japan. Having a regular office space was never going to work out for these guys.

Where we are based!

The Pitfalls of Open Plan

From the employee’s point of view, as well as eliminating the daily commute and wiping out the associated costs – fuel, parking, tax and insurance if you own a car, or bus/train fares if relying on public transport – remote working allows you to work in an office space of your choosing. You decorate and fit out your own space to your own tastes. No more putting up with the pitfalls of open plan, with distractions and interruptions possible every minute of the day. No more complaining about air con or lack thereof. How often have you picked up a cold or flu from a sick co-worker? The spread of colds and other illnesses is a huge disadvantage of a shared working space.

Communication

Yes, an open-plan environment is good for collaboration, but with the likes of Skype and Google Hangouts we get all the benefits of face-to-face communication in an instant. All you need is a webcam, a mic and a Google+ or Skype account. Simple! We can hold meetings, conduct interviews, brainstorm and share presentations.
For real-time messaging, HipChat and Slack are the primary team communication tools used by remote companies. At Scrapinghub, we use Slack as a platform to bring all of our communication together in one place. It’s great for real-time messaging, as well as sharing and archiving documents. It encourages daily intercommunication and significantly reduces the amount of email sent.

Savings

From an employer’s point of view, a major benefit of a fully remote company is huge cost savings on office rent. This in particular is important for small start-ups who might be tight on initial cashflow. Other benefits include having wider access to talent and the fact that remote working is a huge selling point to potential hires.

Productivity

The big downside of remote working from an employer’s perspective is, of course, the worry of reduced productivity when employees work unsupervised from home. Research published by Harvard Business Review suggests that productivity actually increases when people are trusted by their company to work remotely, mainly thanks to a quieter environment. Even so, some employers are very slow to change from a traditional workplace to remote working because of productivity worries.
Whether productivity increases or decreases depends entirely on the inner workings of the individual business; if a culture of creativity, trust and motivation exists, then it’s the perfect working model.

Social Interaction

Scrapinghub regularly holds a virtual office tour day. We often have meetings via Hangouts and catch little glimpses of our colleagues’ offices, often on the other side of the world. A poster or book spine might catch the eye, and these virtual office tour days are a way of learning more about each other, stepping into a colleague’s space for a minute to see what it’s like on their end.
Social interaction is also encouraged through an off-topic channel on Slack and different communities on Google+ such as Scholars, Book Club and Technology. Scrapinghubbers can discuss non-work-related issues this way, and many team members meet up with colleagues from around the world when they are travelling.

Top Tips for Remote Working

  1. Routine: If your working hours are self-monitored, try to work the same hours every day. Set your alarm clock and get up at the same time each morning. Create a work schedule and set yourself a tea break and lunch break to give your day structure.
  2. Work space: Ensure you have a defined work space in your home, preferably with a door so you can close it if there are children, guests or pets that may distract you.
  3. Health:  Sitting at a computer for hours on end isn’t healthy for the body. Try to get up from your computer every 30 minutes or so and walk around. Stretch your arms above your head and take some deep breaths.  This short time will also give your eyes a break from the screen.
  4. Social:  If you find working from home lonely, then why not look into co-working? This involves people from different organisations using a shared work environment.
  5. Focus: It’s very tempting to check your personal email and social media when working from home. This is hugely distracting so use an app like Self Control to block distracting sites for a set period of time.

Bye Bye HipChat, Hello Slack!

For many years now, we have used HipChat for our team collaboration. For the most part we were satisfied, but there were a number of pain points and over time Slack started to seem like the better option for us. Last month we decided to ditch HipChat in favour of Slack. Here’s why.

User Interface

Slack has a much more visual interface; avatars are shown alongside messages, and the application as a whole is much more vibrant and colorful. One thing we really like about Slack is that you can see a nice summary of your recent mentions. Another cool feature is the ability to star items such as messages and files, which you can later access from the Flexpane menu in the top right corner.

Slack notification preferences

Notifications can be configured on a case-by-case basis for channels and groups, and you can even add highlight words to let you know when someone mentions a word or phrase.

Migration

Migrating from HipChat to Slack was a breeze. Slack allows easy importing of logs from HipChat, not to mention other chat services such as Flowdock and Campfire, as well as text files and CSV.

Slack has a great guide for importing chat logs from another service.

Teams, Groups and Channels

With Slack, accounts are at the team level just as in HipChat; however, you can use the same email address for multiple teams, whereas HipChat requires each account to have its own email address. Another benefit of Slack is that you can sign into multiple teams simultaneously. With HipChat, our clients would sometimes run into trouble because they were using HipChat for their own company as well, so this feature helps a lot in this regard.

Admittedly, the distinction between channels and private groups was a little confusing at first, as their functionality is very similar. Unlike channels, which are visible to everyone, private groups are visible only to their creator and those who were invited. Accounts can also be restricted to a subset of channels, and single-channel guests can access only one channel.

Integration with Third-Party Services

Slack integrates with a large number of services out of the box, and that’s before counting community-built integrations.

We created a Slack bot based on Limbo (previously called Slask) using Slack’s Real Time Messaging API which allowed us to easily port our build bot to Slack.
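
For the curious, here’s a minimal sketch of what talking to the RTM API involves (the token is a placeholder, and a bot framework like Limbo handles all of this plumbing for you):

import json

import requests
import websocket  # pip install websocket-client

# rtm.start returns, among other things, a websocket URL to connect to.
resp = requests.get('https://slack.com/api/rtm.start',
                    params={'token': 'xoxb-your-token'})
ws = websocket.create_connection(resp.json()['url'])

while True:
    event = json.loads(ws.recv())
    if event.get('type') == 'message':
        print(event.get('user'), event.get('text'))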

Our #news channel

Slack’s Twitter integration notifies you of mentions and retweets in real time, allowing us to quickly respond to relevant tweets and keep in tune with our audience. We also created a news channel and made use of Slack’s RSS integration to post the latest articles from various sources and keep everyone in the know.

Downsides of Slack

There’s no self-hosted version of Slack available, and while this wasn’t a feature we needed, this may be a deal breaker for some companies.

In our case, the only big negative we found was that there’s no native app for Linux, but HipChat’s Linux client had fallen behind the Mac and web clients anyway, so it was easy to ignore this caveat in favor of a more consistent experience across platforms. We also noticed the mobile apps sync much better than HipChat’s when using multiple devices.

One minor grievance is the lack of @here in Slack. In HipChat, @here notifies all channel members who are available; in Slack, the only alternative is @channel, which addresses the whole room regardless of availability.

Pricing and Feature Comparison

It’s worth pointing out that HipChat is a lot cheaper than Slack: HipChat Basic’s message history covers up to 25,000 messages compared to Slack Lite’s 10,000-message limit, and both free tiers are limited to 5GB of file storage.

Slack

  Pros:
  • Support for multiple teams
  • Much better search
  • Larger selection of integrations
  • More customizable
  • Sleeker UI

  Cons:
  • No Windows or Linux client
  • No self-hosted version available

  Platforms: OS X, Web, iOS, Android (Windows Desktop + Phone in development)
  API: Yes
  Supported integrations: 73 available
  Video: No (coming soon)
  Screen sharing: No (coming soon)
  Pricing: Lite: Free; Standard: $6.67 per user / month; Plus: $12.50 per user / month

HipChat

  Pros:
  • Video conferencing
  • Screen sharing
  • More affordable
  • Simpler interface

  Cons:
  • No support for multiple groups under one account
  • Poor search functionality

  Platforms: Windows, OS X, Linux, Web, iOS, Android
  API: Yes
  Supported integrations: 65 available
  Video: Yes (Plus only)
  Screen sharing: Yes (Plus only)
  Pricing: Basic: Free; Plus: $2 per user / month

Final Thoughts

On the whole we’ve been really impressed by Slack, and for such a new application (its initial release was in August 2013!) it’s very slick and well polished. The lack of video and screen sharing hasn’t been a problem for us, as we use Google+ Hangouts for meetings, and Slack includes a nice ‘/hangout’ command to start a hangout with everyone in the room. For those who need these features, you will be pleased to know that earlier this year Slack acquired Screenhero, with the aim of adding voice, video and screen sharing to Slack.

The History of Scrapinghub

Joanne O’Flynn meets with Pablo Hoffman and Shane Evans to find out what inspired them to set up web crawling company Scrapinghub.

Scrapinghub may be a very young company but it already has a great story to tell. Shane Evans from Ireland and Pablo Hoffman from Uruguay came together in a meeting of great web crawling minds to form the business after working together on the same project but for different companies.

In 2007, Shane was leading software development for MyDeco, a London-based startup. Shane’s team needed data to develop the MyDeco systems and set about trying to find an appropriate company that they could trust to deliver it. Frustrated with the lack of high-quality software and services available, Shane took matters into his own hands and decided to create a framework with his team to build web crawlers to the highest standard.

This turned out well, and it didn’t take long to write plugins for the most important websites they wanted to obtain data from. Continued support for more websites and maintenance of the framework were required, so Shane set out to find someone to help.

Meanwhile, after studying computer science and electrical engineering, Pablo graduated in 2007. Soon after graduating, he set up Insophia, a Python development outsourcing company in Montevideo, Uruguay. One of Pablo’s main clients recommended Insophia to MyDeco and it wasn’t long before he was running the MyDeco web scraping team and helping to develop the data processing architecture.

Pablo could see massive potential in the code and, six months later, asked Shane if they could open source the web crawler. The Scrapy project was born, its name cleverly combining ‘scrape’ and ‘Python’. After Pablo spent many months refining it and releasing updates, Scrapy became quite popular, and word reached his ears that several high-profile companies, including a social media giant, were using the technology!

In 2010, after developing a fantastic working relationship, Shane and Pablo could see there was an opportunity to start a company that could really make a difference by providing web crawling services, while continuing to advance open source projects. The two men decided to go into business together with one goal: to make it easier to get structured data from the internet.

With a tight and knowledgeable core group of developers and a huge drive to provide the most functional and efficient web crawlers, Shane and Pablo formed Scrapinghub. They started off as just a handful of hard-working programmers; by 2011 there were 10 employees, by 2012 there were 20, and staff numbers continued to double each year, up to the present day, when there are almost 100 employees – or Scrapinghubbers, as they affectionately call themselves – globally dedicated to developing the best web crawling and data processing solutions.

In 2014 alone, the company scraped and stored data from more than 10 billion pages (more than five times the amount it did in 2013), and an extra 5 billion pages have passed through Crawlera. 2015 is already off to a great start with the release of two new open source projects: ScrapyRT and Skinfer, a tool for inferring JSON schemas. The icing on the cake is Scrapinghub’s February announcement of its participation in the DARPA project Memex. It’s a testament to the knowledge and experience of a very dedicated team working together all around the world. From small beginnings come great things, and it’s clear that for Scrapinghub, a very bright future awaits.

Skinfer: A Tool for Inferring JSON Schemas

Imagine that you have a lot of samples of a certain kind of data in JSON format. Maybe you want to get a better feel for it: which fields appear in all records, which appear only in some, and what their types are. In other words, you want to know the schema of the data that you have.

We’d like to present skinfer, a tool that we built for inferring the schema from samples in JSON format. Skinfer takes a list of JSON samples and gives you one JSON schema that describes all of them. (For more information about JSON Schema, we recommend the online book Understanding JSON Schema.)

Install skinfer with pip install skinfer, then generate a schema by running the schema_inferer command, passing it a list of JSON samples (either a JSON lines file containing all samples or a list of JSON files given on the command line).

Here is an example of usage with a simple input:

$ cat samples.json
{"name": "Claudio", "age": 29}
{"name": "Roberto", "surname": "Gomez", "age": 72}
$ schema_inferer --jsonlines samples.json
{
    "$schema": "http://json-schema.org/draft-04/schema",
    "required": [
        "age",
        "name"
    ],
    "type": "object",
    "properties": {
        "age": {
            "type": "number"
        },
        "surname": {
            "type": "string"
        },
        "name": {
            "type": "string"
        }
    }
}

Once you’ve generated a schema for your data, you can:

  1. Run it against other samples to see if they share the same schema (see the sketch after this list)
  2. Share it with anyone who wants to know the structure of your data
  3. Complement it manually, adding descriptions for the fields
  4. Use a tool like docson to generate a nice page documenting the schema of your data (see example here)
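
As a sketch of the first item, you could validate new samples with the jsonschema Python package (our choice here, not skinfer’s; any JSON Schema validator will do):

import json

from jsonschema import validate, ValidationError  # pip install jsonschema

with open('schema.json') as f:
    schema = json.load(f)  # the schema generated above

# Check each new sample (one JSON document per line) against the schema.
with open('new_samples.json') as f:
    for line in f:
        try:
            validate(json.loads(line), schema)
        except ValidationError as e:
            print(e.message)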

Another interesting feature of skinfer is that it can merge a list of schemas, giving you a new schema that describes samples from all of the previously given schemas. For this, use the json_schema_merger command, passing it a list of schemas.

This is cool because you can keep updating a schema even after you’ve generated it: just merge the new one with the one you already have, as sketched below.
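
For instance, a workflow like the following keeps a schema up to date as new samples arrive (a hypothetical invocation; check the commands’ --help output for the exact arguments):

$ schema_inferer --jsonlines new_samples.json > new_schema.json
$ json_schema_merger schema.json new_schema.json > merged_schema.json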

Feel free to dive into the code, explore the docs and please file any issues that you have on GitHub. :)

Handling JavaScript in Scrapy with Splash

A common problem when developing spiders is dealing with sites that use a heavy amount of JavaScript. It’s not uncommon for a page to require JavaScript to be executed in order to render the page properly. In this post we’re going to show you how you can use Splash to handle JavaScript in your Scrapy projects.

What is Splash?

Splash is our in-house solution for JavaScript rendering, implemented in Python using Twisted and Qt. Splash is a lightweight web browser capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more. Best of all, it’s open source!

Setting Up Splash

The easiest way to set up Splash is through Docker:

$ docker run -p 8050:8050 scrapinghub/splash

Splash will now be running on localhost:8050. If you’re on OS X, it will be running on the IP address of boot2docker’s virtual machine.

If you would like to install Splash without using Docker, please refer to the documentation.

Using Splash with Scrapy

Now that Splash is running, you can test it in your browser:

http://localhost:8050/

On the right enter a URL (e.g. http://amazon.com) and click ‘Render me!’. Splash will display a screenshot of the page as well as charts and a list of requests with their timings. At the bottom you should see a textbox containing the rendered HTML.
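
The same endpoints are available to any HTTP client. Here’s a minimal sketch using the requests library (the wait value is an arbitrary choice for this example):

import requests

# Ask Splash for the rendered HTML of a page, giving its JavaScript
# half a second to run before the snapshot is taken.
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'http://amazon.com', 'wait': 0.5})
print(resp.text)  # the page HTML after JavaScript execution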

Manually

You can use Request to send links to Splash:

import json

import scrapy
from scrapy.http import Headers

# Inside a spider callback: POST the target URL to Splash's render.json
# endpoint, requesting a HAR record but not the rendered HTML.
req_url = "http://localhost:8050/render.json"
body = json.dumps({
    "url": url,
    "har": 1,
    "html": 0,
})
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(req_url, self.parse_link, method='POST',
                     body=body, headers=headers)

If you’re using CrawlSpider, the easiest way is to override the process_links function in your spider to replace links with their Splash equivalents:

from urllib import urlencode  # on Python 3: from urllib.parse import urlencode

def process_links(self, links):
    # Rewrite each extracted link to go through Splash's render.html endpoint.
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links

ScrapyJS (recommended)

The preferred way to integrate Splash with Scrapy is using ScrapyJS. You can install ScrapyJS using pip:

pip install scrapyjs

To use ScrapyJS in your project, you first need to enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

The middleware needs to take precedence over the HttpProxyMiddleware, which by default is at position 750, so we set the middleware to 725.

You will then need to set the SPLASH_URL setting in your project’s settings.py:

SPLASH_URL = 'http://localhost:8050/'

Don’t forget, if you are using boot2docker on OS X, you will need to set this to the IP address of the boot2docker virtual machine, e.g.:

SPLASH_URL = 'http://192.168.59.103:8050/'

Scrapy currently doesn’t provide a way to override request fingerprints calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'

If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapyjs.splash_request_fingerprint.
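
For example, a subclassed filesystem backend might look like the following sketch (class and import paths follow Scrapy 0.24-era conventions and may differ in your version):

import os

from scrapy.contrib.httpcache import FilesystemCacheStorage
from scrapyjs import splash_request_fingerprint

class SplashAwareFilesystemCacheStorage(FilesystemCacheStorage):
    # Override the one place the fingerprint is used, so that Splash
    # requests with different rendering arguments get separate cache entries.
    def _get_request_path(self, spider, request):
        key = splash_request_fingerprint(request)
        return os.path.join(self.cachedir, spider.name, key[0:2], key)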

Now that the Splash middleware is enabled, you can begin rendering your requests with Splash using the ‘splash’ meta key.

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is the result of the render.html call;
        # it contains HTML processed by a browser.
        pass

The ‘args’ dict contains the arguments to send to Splash; you can find a full list of available arguments in the HTTP API documentation. We don’t need to provide the url parameter in args, as it is pre-filled from the URL in the Request object. By default the endpoint is set to ‘render.json’, but here we have overridden it and set it to ‘render.html’ to get an HTML response.

Running Custom JavaScript

Sometimes when visiting a website, you may need to press a button or close a modal to view the page properly. Splash allows you to run your own JavaScript code within the context of the web page you’re requesting. There are several ways you can accomplish this:

Using the js_source Parameter

The js_source parameter can be used to send the JavaScript you want to execute. It’s recommended that you send this as a POST parameter rather than GET, as web servers and proxies enforce limits on the size of GET parameters. See the curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/json' \
    -d '{"js_source": "document.title=\"My Title\";", "url": "http://example.com"}' \
    'http://localhost:8050/render.html'

Sending JavaScript in the Body

You can also set the Content-Type header to ‘application/javascript’ and send the JavaScript code you would like to execute in the body of the request. See curl example below:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/javascript' \
    -d 'document.title="My Title";' \
    'http://localhost:8050/render.html?url=http://domain.com'

Splash Scripts (recommended)

Splash supports Lua scripts through its execute endpoint. This is the preferred way to execute JavaScript as you can preload libraries, choose when to execute the JavaScript, and retrieve the output.

Here’s an example script:

function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end

This will return a JSON object containing the title:

{
    "title": "Some title"
}

Every script requires a main function to act as the entry point. You can return a Lua table which will be rendered as JSON, which is what we have done here. The splash:go function is used to tell Splash to visit the provided URL. The splash:evaljs function allows you to execute JavaScript within the page context, however, if you don’t need the result you should use splash:runjs instead.
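
To run such a script from code rather than the browser, you can POST it to the execute endpoint. A minimal sketch with the requests library:

import requests

lua_script = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(0.5)
    return {title=splash:evaljs("document.title")}
end
"""

# Arguments sent alongside lua_source are exposed to the script
# as splash.args.
resp = requests.post('http://localhost:8050/execute',
                     json={'lua_source': lua_script,
                           'url': 'http://example.com'})
print(resp.json())  # e.g. {'title': 'Example Domain'}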

You can test your Splash scripts in your browser by visiting your Splash instance’s index page (e.g. http://localhost:8050/). It’s also possible to use Splash with IPython Notebook as an interactive web-based development environment; see here for more details. Note: until Splash 1.5 is released you will need to use the master branch, so instead of docker pull scrapinghub/splash-jupyter (as stated in the docs) you should do docker pull scrapinghub/splash-jupyter:master.

A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:

function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end

Here we use splash:autoload to load the jQuery library from Google’s CDN. We then tell Splash to visit the website and run our custom JavaScript to click the button with jQuery’s click function. The rendered HTML is then returned to the browser.

You can find more info on running JavaScript with Splash in the docs, and for a more in-depth tutorial, check out the Splash Scripts Tutorial.

We hope this tutorial gave you a nice introduction to Splash, and please let us know if you have any questions or comments!

Scrapinghub Crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?”

— Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web, that is, web content not indexed by traditional search engines such as Google, Bing and others. According to current estimates, this content dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven very hard to track down in a systematic way.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to be able to bring our web scraping expertise and open source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!

New Changes to Our Scrapy Cloud Platform

We are proud to announce some exciting changes we’ve introduced this week. These changes bring a much more pleasant user experience, and several new features including the addition of Portia to our platform!

Here are the highlights:

An Improved Look and Feel

We have introduced a number of improvements in the way our dashboard looks and feels. This includes a new layout based on Bootstrap 3, a more user-friendly color scheme, the ability to schedule jobs once per month, and a greatly improved spiders page with pagination and search.

Filtering and pagination of spiders (screenshot).

A new user interface for adding periodic jobs (screenshot).

A new user interface for scheduling spiders (screenshot).

And much more!

Your Organization on Scrapy Cloud

You are now able to create and manage organizations in Scrapy Cloud, add members if necessary and from there create new projects under your organization. This will make it much easier for you to manage your projects and keep them all in one place. Also, to make things simpler, our billing system will soon be managed at the organization level rather than per individual user.

Projects are created within the context of an organization; however, other organization members will need to be invited in order to access them. A user can be invited to a project even if that user is not a member of the project’s organization.

Export Items as XML

Due to popular demand, we have added the ability to download your items as XML (screenshot).

Improvements to Periodic Jobs

We have made several improvements to the way periodic jobs are handled:

  • There is no delay when creating or editing a job. For example, if you create a new job at 11:59 to run at 12:00, it will do so without any trouble.
  • If there is any downtime, jobs that were intended to be scheduled during the downtime will be scheduled automatically once the service is restored.
  • You can now schedule jobs at specific dates in the month.

Portia Now Available in Dash

Last year, we open-sourced our annotation based scraping tool, Portia. We have since been working to integrate it into Dash, and it’s finally here!

We have added an ‘Open in Portia’ button to your projects’ Autoscraping page, so you can now open your Scrapy Cloud projects in Portia. We intend Portia to be a successor to our existing Autoscraping interface, and hope you find it to be a much more pleasant experience. No longer do you have to do a preliminary crawl to begin annotating; you can jump straight in!

Check out this demo of how you can create a spider using Portia and Dash!

Enjoy the new features, and of course if you have any feedback please don’t hesitate to post on our support forum!
