Web Scraping to Create Open Data

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

https://en.wikipedia.org/wiki/Open_data

My first experience with open data was in the year 2010. I wanted to create a better app for Bicing, the local bike sharing system in Barcelona. Their website was a nightmare to use and I was tired of needing to walk to each station, trying to guess which ones had bicycles. There was no app for Android, other than a couple of unofficial attempts that didn’t work at all.

I began as most would; I searched the internet and found a library named python-bicing that was somehow able to retrieve station and bike information. This was my first time using Python and, after some investigation, I learned what the code was doing: accessing the official website, parsing the JavaScript that generated their buggy map and giving back a nice chunk of Python objects that represented bike share stations.

This, I learned, was called web scraping. It felt like I had figured out a magic trick that would always let me access the data I needed without having to rely on faulty websites.
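As a minimal sketch of that idea (this is not the actual python-bicing code), assume the page embeds its station data in JavaScript calls such as `addStation(...)`. That function name, the sample markup and the values in it are invented for illustration; a real scraper would first download the page and would need a pattern matched to the real markup.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Station:
    name: str
    latitude: float
    longitude: float
    bikes: int
    free_slots: int


# Hypothetical page content with made-up values. In practice this string
# would be the HTML downloaded from the official website.
SAMPLE_HTML = """
<html><body><script>
var map = new Map("canvas");
addStation("Pl. Catalunya", 41.3870, 2.1700, 7, 14);
addStation("Arc de Triomf", 41.3910, 2.1806, 0, 21);
</script></body></html>
"""

# Matches the assumed addStation(name, lat, lng, bikes, free_slots) calls.
STATION_CALL = re.compile(
    r'addStation\(\s*"(?P<name>[^"]*)"\s*,\s*'
    r'(?P<lat>-?\d+(?:\.\d+)?)\s*,\s*'
    r'(?P<lng>-?\d+(?:\.\d+)?)\s*,\s*'
    r'(?P<bikes>\d+)\s*,\s*(?P<free>\d+)\s*\)'
)


def parse_stations(html: str) -> List[Station]:
    """Turn every addStation(...) call found in the page into a Station."""
    return [
        Station(
            name=m["name"],
            latitude=float(m["lat"]),
            longitude=float(m["lng"]),
            bikes=int(m["bikes"]),
            free_slots=int(m["free"]),
        )
        for m in STATION_CALL.finditer(html)
    ]


stations = parse_stations(SAMPLE_HTML)
# The question I actually cared about: which stations have bikes right now?
with_bikes = [s.name for s in stations if s.bikes > 0]
```

The regular expression is the fragile part: it is tied to the exact shape of the page's script, so a change in the markup means updating the pattern.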

The rise of OpenBicing and CityBikes

Shortly after, I launched OpenBicing, an Android app for the local bike sharing system in Barcelona, together with a backend that used python-bicing. I also shared a public API that provided this information so that nobody else had to do the dirty work ever again.

Since other cities were having the same problem, we expanded the scope of the project worldwide and renamed it CityBikes. That was 6 years ago.

To date, CityBikes is the most comprehensive and widely used open API for bike sharing information, with support for over 400 cities worldwide. Our API processes around 10 requests per second and we scrape each of the 418 feeds about every three minutes. Making our core library available for anyone to contribute has been crucial in maintaining and adding coverage for all of the supported systems.

The open data fallacy

We are usually regarded as “an open data project” even though less than 10% of our feeds are properly licensed, documented and machine-readable. The remaining 90% is composed of 188 feeds that are machine-readable but neither licensed nor documented, and 230 that are maintained entirely by scraping HTML pages.

NABSA (North American BikeShare Association) recently published GBFS (General Bikeshare Feed Specification). This is clearly a step in the right direction, but I can’t help but look at the almost 60% of services we currently support through scraping and wonder how long it will take the remaining organizations to release their information, if ever. This is all the more true considering that these numbers do not even take worldwide coverage into account.
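For contrast with scraping, here is a rough sketch of what consuming a machine-readable feed can look like. The field names (`last_updated`, `ttl`, `data.stations`, `station_id`, `num_bikes_available`, `num_docks_available`) follow the GBFS `station_status` feed as I understand it, so check them against the version of the specification a given system publishes. The payload itself is a made-up sample, not data from a real system.

```python
import json

# Made-up sample shaped like a GBFS station_status feed.
SAMPLE_STATION_STATUS = """
{
  "last_updated": 1459324800,
  "ttl": 60,
  "data": {
    "stations": [
      {"station_id": "1", "num_bikes_available": 5, "num_docks_available": 10},
      {"station_id": "2", "num_bikes_available": 0, "num_docks_available": 15}
    ]
  }
}
"""


def bikes_by_station(payload: str) -> dict:
    """Map each station_id to its number of available bikes."""
    feed = json.loads(payload)
    return {
        station["station_id"]: station["num_bikes_available"]
        for station in feed["data"]["stations"]
    }


availability = bikes_by_station(SAMPLE_STATION_STATUS)
total_bikes = sum(availability.values())
```

There is no pattern matching against markup here: the structure is documented, so the code reads named fields instead of guessing at the layout of a page.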

Over the last few years there has been a progression by transportation companies and city councils toward providing their information as “open data”. Directive 2003/98/EC encourages EU member states to release information regarding public services.

Yet, in most cases, little is done to compel Public Private Partnerships (PPPs) to release their public information under a non-restrictive license, or even to transfer ownership of the data to city councils so that it can be included in their open data portals.

Even with the increasing number of companies and institutions interested in participating in open data, by no means should we consider open data a reality or something to be taken for granted. I firmly believe in the future and benefits of open data, and I have seen those benefits materialize all around CityBikes. But as technologists we need to stress the fact that the data is not out there yet.

The benefits of open data

When I started this project, I sought to make a difference in Barcelona. Now you can find tons of bike sharing apps that use our API on all major platforms. It doesn’t matter that these are not our own apps. They are solving the same problem we were trying to fix, so their success is our success.

Besides popular apps like Moovit or CityMapper, there are many neat projects out there, some of which are published under free software licenses. Ideally, a city council could create a customization of any of these apps for their own use.

Most official applications for bike sharing systems have terrible ratings. The core business of transportation companies is running a service, so they have no real motivation to create an engaging UI or innovate further. In some cases, the city council does not even own the rights to the data and is completely at the mercy of the company providing the transportation service.

Open data over apps

When providing public services, city councils and companies often lose sight of what they should offer as an aid to the service. They focus on a nice map or a flashy application, rather than providing the data behind these service aids. Maps, apps, and websites have a limited focus and usually serve a single purpose. Data, on the other hand, is malleable and the purest form of representation. While you can’t create something new from looking at and playing with a static map (except, of course, if you scrape it), data can be used to create countless different iterations. It can even provide a bridge that will allow anyone to participate, improve and build on top of these public services.

Wrap Up

At this point, you might wonder why I care so much about bike sharing. To me it’s not about bike sharing anymore. CityBikes is just too good of an open data metaphor, a simulation in which public information is freely accessible to everyone. It shows the benefits of open data and the deficiencies that arise from the lack thereof.

We shouldn’t have to create open data by scraping websites. This information should be already available, easily accessed and provided in a machine-readable format from the original providers, be they city councils or transportation companies. However, until there’s another option, we’ll always have scraping.

7 thoughts on “Web Scraping to Create Open Data”

  1. How often do you need to update your scraper scripts because the source webpage or script has changed? Or have they not changed in a few years?

    1. Luckily, these kinds of websites rarely change, so a good estimate would be that spiders need an update every one or two years, at most. Something that I did not cover is the reuse of implementations between services: data from the current 418 supported cities is provided by a set of 36 spiders.

  2. Did you run into problems with authorities claiming that your scraping was a violation of their TOS?
    There have, after all, been some well-known cases of website owners sending cease and desist orders and suing.
    I am really interested to know what your experience with this has been.

    1. That has happened, but less frequently than one would think. Over the years we have received just 3 violation reports in the form of a cease and desist, so authorities have not really been involved.

      Scraping websites is relatively safe compared to tapping into private APIs used by official mobile clients. We no longer support private APIs. Whilst this is the most straightforward way to get the information out there and push for the release of the data, the risks outweigh the benefits. Our current approach is about reaching sufficient critical mass to convince transportation companies of the benefits of releasing their non-critical data under non-restrictive licenses.

      Sometimes it is the other way around. We often get contacted by city councils and small companies about adding support for their service. That directly means their system is going to be compatible with most bike sharing apps.

      As a side note and, of course, only in the transportation space, trademark infringement is the most common reason for lawsuits and cease and desist orders. There have been many cases of unofficial apps getting a C&D for not stating clearly that they are unrelated to the official service. It is one thing to create an alternative app and quite another to deceive users into thinking they are using the original service.
