A (not so) Short Story on Getting Decent Internet Access

This is a tale of trial, tribulation, and triumph. It is the story of how I overcame obstacles including an inconveniently placed grove of eucalyptus trees, armed with little more than a broom and a pair of borrowed binoculars, to establish a stable internet connection.

I am a remote worker, and access to stable internet ranks right up there with food and shelter. With the rise of digital nomads and distributed companies, I am sure many of you can identify with the frustration of nonexistent or slow internet. Read on to learn more than you ever thought you would about how to MacGyver your very own homemade fiber to the home (FTTH) connection.

Searching for a connection

This adventure took place in September 2015. My wife and I decided to move from Montevideo, Uruguay, to our hometown of Durazno. We bought a cozy house on the outskirts of the city, about 3 km from downtown.

The first thing we checked was the availability of service from the main ISP, Antel. They had two options:

  • Fiber to the home (FTTH)
  • Mobile access (LTE)

I walked around the area in an attempt to find a Network Access Point (NAP). The NAP is part of the ISP’s infrastructure, and the presence of one indicates that the zone has FTTH coverage. I know what NAPs look like, specifically the ones used by Antel, so this was a fast way to determine whether I would have FTTH access. To my surprise, I was unable to find one, and Antel’s coverage map confirmed that there were none available.

Copper lines were also a bust because my house was too far from the closest node. Besides, the speed was close to that of a 56K dial-up modem, and there were many disconnection issues. My neighbour had the same service and cancelled it; he said it was useless and gave him trouble even though he only used it for email and day-to-day web access.

I then bought a dongle and tested Antel’s LTE/4G service with both my laptop and smartphone. The results were appalling. The upload speed of 128-256 Kbps was by far the biggest letdown.

3G coverage was also terrible, and I experienced many disconnections. This took mobile technology off the table.

In case you care to see details, here are the speed tests I performed onsite.

Challenge accepted

I looked deeper into Antel’s FTTH deployment zone and saw that it was only 1.5 km away in a straight line. I figured it was worth a shot to try creating a wireless Point-to-Point link.

Planning the link

Point-to-Point Link Requirements

Line of Sight: I needed to put the two radios in a place where they could see each other. Since I was going to use an unlicensed band (2.4 GHz or 5 GHz), this was a strict requirement.

Fresnel Zone Clearance: the second necessity was enough Fresnel zone clearance. Basically, you need to mount the two radios high enough that the zone around the direct line of sight stays clear of obstructions; otherwise you get interference and poor link quality.

Point-to-Point Link Issues

The most destructive force for radio signals is water (signals just don’t get through it easily, or at all), and trees are full of water. I realized that the eucalyptus grove (labeled “Major issue” in the image below) was right in the signal path. Diving into Google Earth and Google Maps gave me some insight into dodging the grove and other troublesome terrain heights. I then stood on the roof of my house with the borrowed binoculars to help determine a placement for the remote endpoint.

[Image: graph of the link path, with the eucalyptus grove marked as “Major issue”]

Through the binoculars, I spotted a white balloon that looked like some kind of tank in the FTTH zone. I drove up and realized that it was a water tank situated near a large warehouse for trucks. It made sense to put the radio on top of this tank to benefit from its height (labeled “FTTH” in the image above) for the Fresnel Zone clearance.

I rang up the owner and arranged a meeting with him to explain the situation. I found it difficult to explain what I needed and he had no idea what I was talking about, so he took some convincing.

The owner’s main concern was that I might interfere with his internet access. I reassured him and explained that I would be using a different service from the one he was using; he had nothing to worry about, as I only needed the height of his tower.

He agreed and let me perform some tests to check the viability of my plan.

[Image: the water tower]

The Parameters

Google Earth revealed:

  • 93 m of terrain elevation at the tank + 7 m for the height of the tower (labeled “FTTH”) = 100 m
  • 104 m at my house (labeled “No INET” in the Google Earth picture) + 4 m for the small warehouse that would house the radio = 108 m

The major concern was that at the halfway point the terrain rose to 103 m, and I needed at least 4.4 m of Fresnel zone clearance, i.e. a clear path at 107.4 m.
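
If you want to check that 4.4 m figure yourself, here is a quick sketch of the first Fresnel zone radius at the midpoint of the link. It assumes a 1.5 km path and a frequency around 5.8 GHz in the 5 GHz band (the exact channel is an assumption on my part):

import math

def first_fresnel_radius_m(distance_km, freq_ghz):
    """Radius of the first Fresnel zone at the midpoint of the path, in metres."""
    wavelength = 0.3 / freq_ghz           # metres, since c ~ 3e8 m/s
    d1 = d2 = distance_km * 1000.0 / 2    # both half-paths, in metres
    return math.sqrt(wavelength * d1 * d2 / (d1 + d2))

print(round(first_fresnel_radius_m(1.5, 5.8), 1))  # ~4.4 m, so 103 m + 4.4 m = 107.4 m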

There was nothing I could do to boost the height of the tank, so I had to hope for the best and test out the connection.

Testing the connection

I borrowed two Ubiquiti NanoStation Loco M5s (5 GHz band) to test the link. I knew that if I could establish a link with these lower-capacity devices, then I was on the right track.

At the FTTH point (the water tank) I placed a Nano.

[Image: setting up the radio on the water tank]

I put the other one onto a broomstick. I then wandered around my property (including the roof of the house and the warehouse) with this makeshift aligning tool, waving it around and trying to find the best connection spot.

I was shocked when I finally managed to get a one-way 90 Mbps while holding the broom and thought: this might actually work!

[Image: iperf results]

I used iperf to test the connection. It runs in a client/server model: I ran the server at the “FTTH” point and the client at the “No INET” point.
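
If you want to script this kind of test instead of reading the output by hand, here is a minimal sketch that drives the stock iperf client from Python. The server address is a placeholder, and it assumes the classic iperf command-line tool is installed on both endpoints:

import subprocess

# On the "FTTH" endpoint, start the server with:  iperf -s
# On the "No INET" endpoint, run the client against the server's address.
SERVER_IP = '192.168.1.20'  # placeholder: address of the machine at the water tank

output = subprocess.check_output(['iperf', '-c', SERVER_IP, '-t', '30'], text=True)
print(output)  # per-interval and summary bandwidth figures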

Here are the iperf results:

[Image: iperf output]

Choosing Hardware

I decided to commit to the Point-to-Point link as the solution to my internet woes. The next step was figuring out the right pair of antennas to get the most out of this link.

Product          NSM5                      NSM5 Loco                   NanoBeam M5               LiteBeam M5
Gain             16 dBi                    13 dBi                      16 dBi                    23 dBi
Max Power        8 W                       5.5 W                       6 W                       4 W
Wind Survival    200 km/h                  200 km/h                    n/a                       n/a
Weight           0.4 kg                    0.18 kg                     0.32 kg                   0.75 kg
CPU              MIPS 24KC (400 MHz)       MIPS 24KC (400 MHz)         MIPS 74KC (560 MHz)       MIPS 74K
Memory           32 MB (SDRAM)             32 MB (SDRAM)               64 MB (DDR2)              64 MB
Network          2 x 10/100 Ethernet       1 x 10/100 Ethernet         1 x 10/100 Ethernet       1 x 10/100 Ethernet
RX Sensitivity   54M -> -75 dBm (802.11a)  54M -> -75 dBm (802.11b/g)  54M -> -75 dBm (802.11a)  54M -> -84 dBm (802.11a)
Amazon Price     USD 88                    USD 62                      USD 69                    USD 55
Part Name        NSM5                      LOCOM5                      NBE-M5-16                 LBE-M5-23
MIMO             2×2                       2×2                         2×2                       1×1 (SISO)

I picked the NanoBeam, one of the newest products in the airMAX line. I chose it because it is a MIMO 2×2 radio (unlike the LiteBeam) and has good antenna gain. It also features a newer CPU and DDR2 memory, is fairly light, and has very high wind survivability.

Installing the link

List of Materials:

  • 2 x NanoBeam M5 ($139 + $40 [courier] )
  • 1 x APC UPS (used at home)
  • 1 x ZTE F660 ONT (provided by ISP)
  • 1 x watertight enclosure ($60)
  • 50m of Cat6 STP (~ $35)
  • 1 x TP Link WDR 3600 (with OpenWRT v14.07 “Barrier Breaker” – this was my home router)
  • apcupsd (to monitor the UPS)

Total Cost: $274

I enrolled in a basic fiber plan at the “FTTH” point (30/4 Mbps) for $25 a month.

[Image: testing with the computer]

Wrap Up

I’ve yet to experience any lag or disconnection issues since I set up the Point-to-Point FTTH link. I’ve even run Netflix and Spotify at the same time to test the connection, all while downloading several things on the computer. If a house is ever built right in the signal path, I’ll need to raise the antennas to keep the link operating. But so far, so good!

[Image: the end result]

SUCCESS!

Just goes to show what a little perseverance can do when paired with a broomstick and borrowed binoculars.

Technical details

For those who are especially interested in the subject, here are the nitty-gritty details on signal strength, along with other information:

[Image: technical details]

Here are the results of a full-duplex “Network Speed Test” performed with the NanoBeam’s own speed test tool:

[Image: full-duplex speed test]

The same test, but one-way only:

[Image: one-way speed test]

ISP speed test:

[Image: ISP speed test]

[Image: status]

Scrapy Tips from the Pros: April 2016 Edition

Welcome to the April Edition of Scrapy Tips from the Pros. Each month we’ll release a few tricks and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we only have one tip for you, but it’s a doozy! So if you ever find yourself scraping an ASP.Net page where you need to submit data through a form, step back and read this post.

Dealing with ASP.Net Pages, PostBacks and View States

Websites built using ASP.Net technologies are typically a nightmare for web scraping developers, mostly due to the way they handle forms.

These types of websites usually send state data in requests and responses in order to keep track of the client’s UI state. Think about those websites where you register by going through many pages while filling in your data in HTML forms. An ASP.Net website would typically store the data that you filled out on the previous pages in a hidden field called “__VIEWSTATE”, which contains a huge string like the one shown below:

[Image: example of a __VIEWSTATE value]

I’m not kidding, it’s huge! (dozens of kB sometimes)

This is a Base64-encoded string representing the client UI state, and it contains the values from the form. This setup is particularly common for web applications where user actions in forms trigger POST requests back to the server to fetch data for other fields.

The __VIEWSTATE field is passed around with each POST request that the browser makes to the server. The server then decodes and loads the client’s UI state from this data, performs some processing, computes the value for the new view state based on the new values and renders the resulting page with the new view state as a hidden field.

If the __VIEWSTATE is not sent back to the server, you are probably going to see a blank form as a result because the server completely lost the client’s UI state. So, in order to crawl pages resulting from forms like this, you have to make sure that your crawler is sending this state data with its requests, otherwise the page will not load what it’s expected to load.

Here’s a concrete example so that you can see firsthand how to handle these types of situations.

Scraping a Website Based on ViewState

The scraping guinea pig today is spidyquotes.herokuapp.com/search.aspx. SpidyQuotes lists quotes from famous people and its search page allows you to filter quotes by author and tag:

[Image: SpidyQuotes search page]

A change in the Author field fires up a POST request to the server to fill the Tag select box with the tags that are related to the selected author. Clicking Search brings up any quotes that fit the tag from the selected author:

[Image: search results for the selected author and tag]

In order to scrape these quotes, our spider has to simulate the user interaction of selecting an author, a tag and submitting the form. Take a closer look at each step of this flow by using the Network Panel that you can access through your browser’s Developer Tools. First, visit spidyquotes.herokuapp.com/search.aspx and then load the tool by pressing F12 or Ctrl+Shift+I (if you are using Chrome) and clicking on the Network tab.

[Image: the Network tab in the browser’s Developer Tools]

Select an author from the list and you will see that a request to “/filter.aspx” has been made. Clicking on the resource name (filter.aspx) leads you to the request details where you can see that your browser sent the author you’ve selected along with the __VIEWSTATE data that was in the original response from the server.

[Image: request details for filter.aspx]

Choose a tag and click Search. You will see that your browser sent the values selected in the form along with a __VIEWSTATE value different from the previous one. This is because the server included some new information in the view state when you selected the author.

[Image: form data sent with the new __VIEWSTATE value]

Now you just need to build a spider that does the exact same thing that your browser did.

Building your Spider

Here are the steps that your spider should follow:

  1. Fetch spidyquotes.herokuapp.com/search.aspx
  2. For each Author found in the form’s authors list:
    • Create a POST request to /filter.aspx passing the selected Author and the __VIEWSTATE value
  3. For each Tag found in the resulting page:
    • Issue a POST request to /filter.aspx passing the selected Author, selected Tag and view state
  4. Scrape the resulting pages

Coding the Spider

Here’s the spider I developed to scrape the quotes from the website, following the steps just described:

import scrapy

class SpidyQuotesViewStateSpider(scrapy.Spider):
    name = 'spidyquotes-viewstate'
    start_urls = ['http://spidyquotes.herokuapp.com/search.aspx']
    download_delay = 1.5

    def parse(self, response):
        for author in response.css('select#author > option ::attr(value)').extract():
            yield scrapy.FormRequest(
                'http://spidyquotes.herokuapp.com/filter.aspx',
                formdata={
                    'author': author,
                    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first()
                },
                callback=self.parse_tags
            )

    def parse_tags(self, response):
        for tag in response.css('select#tag > option ::attr(value)').extract():
            yield scrapy.FormRequest(
                'http://spidyquotes.herokuapp.com/filter.aspx',
                formdata={
                    'author': response.css(
                        'select#author > option[selected] ::attr(value)'
                    ).extract_first(),
                    'tag': tag,
                    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first()
                },
                callback=self.parse_results,
            )

    def parse_results(self, response):
        for quote in response.css("div.quote"):
            yield {
                # Read each field from the individual quote element, not the
                # whole response, so every item gets its own data.
                'quote': quote.css('span.content ::text').extract_first(),
                'author': quote.css('span.author ::text').extract_first(),
                'tag': quote.css('span.tag ::text').extract_first(),
            }

Step 1 is done by Scrapy, which reads start_urls and generates a GET request to /search.aspx.

The parse() method is in charge of Step 2. It iterates over the Authors found in the first select box and creates a FormRequest to /filter.aspx for each Author, simulating the user selecting every element in the list. It is important to note that the parse() method reads the __VIEWSTATE field from the form it receives and passes it back to the server, so that the server can keep track of where we are in the page flow.

Step 3 is handled by the parse_tags() method. It’s pretty similar to the parse() method as it extracts the Tags listed and creates POST requests passing each Tag, the Author selected in the previous step and the __VIEWSTATE received from the server.

Finally, in Step 4 the parse_results() method parses the list of quotes presented by the page and generates items from them.

Simplifying your Spider Using FormRequest.from_response()

You may have noticed that before sending a POST request to the server, our spider extracts the pre-filled values that came in the form it received from the server and includes these values in the request it’s going to create.

We don’t need to manually code this since Scrapy provides the FormRequest.from_response() method. This method reads the response object and creates a FormRequest that automatically includes all the pre-filled values from the form, along with the hidden ones. This is how our spider’s parse_tags() method looks:

def parse_tags(self, response):
    for tag in response.css('select#tag > option ::attr(value)').extract():
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'tag': tag},
            callback=self.parse_results,
        )

So, whenever you are dealing with forms containing some hidden fields and pre-filled values, use the from_response method because your code will look much cleaner.

Wrap Up

And that’s it for this month! You can read more about ViewStates here. We hope you found this tip helpful and we’re excited to see what you can do with it. We’re always on the lookout for new hacks to cover, so if you have any obstacles that you’ve faced while scraping the web, please let us know.

Feel free to reach out on Twitter or Facebook with what you’d like to see in the future.

 

Machine Learning with Web Scraping: New MonkeyLearn Addon

Say Hello to the MonkeyLearn Addon

We deal in data. Vast amounts of it. But while we’ve been traditionally involved in providing you with the data that you need, we are now taking it a step further by helping you analyze it as well.

To this end, we’d like to officially announce the MonkeyLearn integration for Scrapy Cloud. This feature will bring machine learning technology to the data that you extract through Scrapy and Portia. We also offer a MonkeyLearn Scrapy Middleware so you can use it on your own platform.

MonkeyLearn is a classifier service that lets you analyze text. It provides machine learning capabilities such as categorizing products or performing sentiment analysis to figure out whether a customer review is positive or negative.

You can use MonkeyLearn as an addon for Scrapy Cloud. It only takes a minute to enable and once this is done, your items will flow through a pipeline directly into MonkeyLearn’s service. You specify which field you want to analyze, which MonkeyLearn classifier to apply to it, and which field it should output the result to.

Say you were involved in the production of Batman vs Superman and you’re interested in how people reacted to your high budget movie. You could use Scrapy or Portia to track mentions of this film across the web and then use MonkeyLearn to perform sentiment analysis on the samples that you collect. But don’t get too excited because you might not like the results of your search…

[Image: Sad Affleck]

There are so many ways that you can use this addon with our platform, so we’ll be featuring a series of tutorials that will help you make the most out of this partnership.

Getting Started with MonkeyLearn

MonkeyLearn provides public modules that are ready to go, or you can create your own text analysis module by training a custom machine learning model.

For example, using traditional sentiment analysis on comments filled with trolls would return a 100% negative rating. To develop a “Troll Finder” you would need to create a custom model with a higher tolerance for the extreme negativity. You could create categories like “troll”, “ubertroll”, and “trollmaster” for further categorization. Check out MonkeyLearn’s tutorial to help you through this task.

Before you get started with the MonkeyLearn addon on Scrapy, you first need to sign up for the MonkeyLearn service. They offer a free account, so you don’t need to worry about the cash monies. Once you’ve signed up, you’ll be taken to your dashboard:

[Image: MonkeyLearn dashboard]

Click the “Explore” option in the top menu to check out the whole range of ready-made classifiers that you can apply to the scraped data. There are a ton of different options to choose from including sentiment analysis for product reviews, language detectors, and extractors for useful data such as phone numbers and addresses.

[Image: MonkeyLearn’s ready-made classifiers]

Choose the classifier that you’re interested in and make a note of its ID. You can find the ID in the URL:

[Image: the classifier ID in the URL]

And now that you’re all set on the MonkeyLearn side, it’s time to head back over to Scrapy Cloud.

Addon Walkthrough

You can access the MonkeyLearn addon through your dashboard. Navigate to Addons Setup:

[Image: Addons page]

Enable the addon and click Configure:

[Image: enabling and configuring the MonkeyLearn addon]

Head down to Settings:

[Image: addon settings]

To configure the addon, you need to set your MonkeyLearn API key, specify the classifier you want to use, and name the field where you want the result to be stored. You’ll need the classifier ID you noted earlier on the MonkeyLearn platform.

MonkeyLearn reads the content from the fields you’ve specified, performs the classification on that data, and returns the result of the classification/analysis in the field that you defined as the output (categories) field.

For example, to detect the category of a movie based on its title, you would put the ID of the module you want to use in the first text box, your authorization token in the second, the item field you want to analyze (title, in our case) in the third, and the name of the field that will store the MonkeyLearn results in the fourth.

[Image: the MonkeyLearn addon configuration fields]
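
As a rough illustration, here is what an item might look like before and after the addon runs. The field names and values are made up, and the exact shape of the stored result depends on the classifier you picked:

# Hypothetical item as scraped, before the addon runs:
item_before = {'title': 'The Dark Knight', 'url': 'http://example.com/movies/1'}

# The same item after MonkeyLearn classifies the 'title' field and stores
# the result in the output field configured above (value shape is made up):
item_after = {'title': 'The Dark Knight', 'url': 'http://example.com/movies/1',
              'category': 'Action'}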

And you’re all done! Locked and loaded and ready to go with MonkeyLearn.

Using MonkeyLearn with Scrapy and Portia

Since the MonkeyLearn addon is a part of Scrapy Cloud, it works with both Scrapy (Python-based web scraping framework) and Portia (visual web scraping tool). These data extraction tools are both open source, so you can easily run both on your own system.

The addon means you don’t need to worry about learning MonkeyLearn’s API and how to route requests manually. If you need to use MonkeyLearn outside of Scrapy Cloud, you can use the middleware for the same purpose.

When to use MonkeyLearn

We’re really excited about this integration because it is a huge step in closing the gap between data acquisition and analysis.

MonkeyLearn offers a range of text analysis services including:

  • Classifying products into categories based on their name
  • Detecting the language of text
  • Sentiment analysis
  • Keyword extractor
  • Taxonomy classifier
  • News categorizer
  • Entity extraction

We’ll delve into what you can do with each of these tools in future tutorials. For now, feel free to experiment and explore this integration in your web scraping projects.

Wrap Up

Data and textual analysis becomes more efficient when you combine MonkeyLearn’s machine learning capabilities with our data extraction platform. Whether you are using this for personal projects (tracking and monitoring advance reviews for Captain America: Civil War Team Cap) or for professional tasks, we’re excited to see what you come up with.

Keep your eyes peeled: the first tutorial will walk you through using the Retail Classifier with the MonkeyLearn addon. Sign up for free for Scrapy Cloud and for MonkeyLearn and give this addon a whirl.

Mapping Corruption in the Panama Papers with Open Data

We are at a point in the digital age where corruption is increasingly difficult to hide. Information leaks are abundant and shocking.

We rely on whistleblowers for many of these leaks. They have access to confidential information that’s impossible to obtain elsewhere. However, we also live in a time where data is more open and accessible than at any other point in history. With the rise of Open Data, people can no longer shred away their misdeeds. Nothing is ever truly deleted from the internet.

It might surprise you how many insights into corruption and graft are hiding in plain sight through openly available information. The only barriers are clunky websites, inexperience in data extraction, and unfamiliarity with data analysis tools.

We now collectively have the resources to produce our own Panama Papers. Not just as one-offs, but as regular accountability checks on those in positions of power. This is especially the case if we combine our information to create further links.

One example of this democratization of information is a recent project in Peru called Manolo and its intersection with the Panama Papers. Manolo used web scraping of open data to collect information on Peruvian government officials and lobbyists.

Manolo

Manolo is a web application that uses Scrapy to extract records (2.2 million so far) of the visitors frequenting several Peruvian state institutions. It then repackages the data into an easily searchable interface, unlike the government websites.

Peruvian journalists frequently use Manolo. It has even helped them uncover illegal lobbying by tracking visits to specific government officials by construction company representatives who are currently under investigation.

Developed by Carlos Peña, a Scrapinghub engineer, Manolo is a prime example of what a private citizen can accomplish. By opening access to the Peruvian media, this project has opened up a much needed conversation about transparency and accountability in Peru.

[Image: Clunky government website]

[Image: Scrapy working its magic]

[Image: Extracted data in a structured format]

[Image: Final transformation into Manolo]

Cross-Referencing Datasets

With leaks like the Panama Papers as a starting point, web scraping can be used to build datasets to discover wrongdoing and to call out corrupt officials.

For example, you could cross-reference names and facts from the Panama Papers with the data that you retrieve via web scraping. This would give you more context and could lead to new findings.
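
At its simplest, that cross-referencing is a set intersection between two name lists. Here is a minimal sketch; the file names and the crude normalization are placeholders for whatever your real datasets look like:

def normalize(name):
    # Crude normalization: lowercase and collapse whitespace.
    return ' '.join(name.lower().split())

with open('panama_papers_names.txt') as f:      # placeholder input files
    leaked = {normalize(line) for line in f if line.strip()}

with open('scraped_visitor_names.txt') as f:
    scraped = {normalize(line) for line in f if line.strip()}

# Names appearing in both datasets are worth a closer look.
for name in sorted(leaked & scraped):
    print(name)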

We actually tested this out ourselves with Manolo. One of the names found in the Panama Papers is Virgilio Acuña Peralta, currently a Peruvian congressman. We found his name in Manolo’s database since he visited the Ministry of Mining last year.

[Image: Virgilio Acuña Peralta’s visit record in Manolo]

According to the Peruvian news publication Ojo Público, Acuña wanted to use Mossack Fonseca to reactivate an offshore company that he could use to secure construction contracts with the Peruvian state. As a congressman, this is illegal. In Peru, there are efforts to investigate Virgilio Acuña and his brother, who recently ran for president, for money laundering.

[Image: Virgilio Acuña, congressman (middle), with Cesar Acuña (who campaigned for the Peruvian presidency last month) on the right. Photo via @Ojo_Publico]

Another name reported in the Panama Papers by Ojo Público was Jaime Carbajal Peréz, a close associate of former Peruvian president Alan García.

Ojo Público states that in 2008, Carbajal, along with colleague Percy Uriarte and others, bought the offshore company Winscombe Management Corp. from Mossack Fonseca. Carbajal and Uriarte own a business that sells books to state-run schools. A further plot twist is that the third owner of the bookstore, José Antonio Chang, was the Minister of Education from 2006 to 2011 and Prime Minister from 2010 to 2011.

A quick search of the Manolo database reveals that Percy Uriarte visited the Peruvian Ministry of Education 45 times between 2013 and 2015. IDL-Reporteros, another Peruvian news outlet, reported that the company led by Carbajal illegally sold books to the Peruvian government in 2010. He used a front company for these transactions since he was forbidden by law to engage in contracts with the state due to his close association with the former president.

[Image: Percy Uriarte’s visits in the Manolo database]

Data Mining

Data from previous leaks such as the Saudi Cables, Swiss Leaks and the Offshore Leaks has been released publicly. In many cases, the data has been indexed and catalogued, making it easy for you to navigate and search. And now we’re just waiting on the full data dump from the Panama Papers leaks.

You can use information retrieval techniques to dig deeper into the leaks. You could find matching records, sift through the data by tagging specific words, parts of speech, or phrases, and identify different entities like institutions or people.

Creating Open Data

If the information is not readily available in a convenient database, then you can always explore open data yourself:

  • Identify sources of open data that have the information you need. These are often government websites and public records.
  • Scrape the data. You can do this in two ways: using a visual web scraper like Portia (no programming needed) or a framework like Scrapy that lets you customize your code (see the bare-bones spider sketch after this list). A big bonus is that both are open source tools and completely free.
  • Download your data and import it into your favorite data analysis software. Then start digging!
  • Package your findings by creating visual representations of the data with tools like Tableau or Plot.ly.
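
To make the scraping step concrete, here is a bare-bones sketch of the kind of spider you would write with Scrapy. The URL and the CSS selectors are placeholders for whatever public records site you are targeting:

import scrapy

class PublicRecordsSpider(scrapy.Spider):
    name = 'public-records'
    start_urls = ['http://example.gov/visitor-log']  # placeholder URL

    def parse(self, response):
        # Placeholder selectors: adapt them to the markup of the real site.
        for row in response.css('table.records tr'):
            yield {
                'visitor': row.css('td.visitor ::text').extract_first(),
                'official': row.css('td.official ::text').extract_first(),
                'date': row.css('td.date ::text').extract_first(),
            }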

We actually offer platform integrations with Machine Learning programs like BigML and MonkeyLearn. We’re looking into integrating more data tools later this year, so keep an eye out!

Wrap Up

Data and corruption are everywhere, and they can both seem difficult to access. That’s where web scraping comes in. We are at a point where citizens have the tools necessary to hold elected officials, businesses, and folks in power accountable for illegal and corrupt actions. By increasing transparency and creating further links between information leaks and readily available data, we can remove the loopholes where firms like Mossack Fonseca exist.

Frameworks like Scrapy are great for those who know how to code, but there shouldn’t be a barrier to acquiring data. Journalists who wish to take advantage of these vast sources of information can use visual web scrapers like Portia to get the data for current and future investigations.

Web Scraping to Create Open Data

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

https://en.wikipedia.org/wiki/Open_data

My first experience with open data was in 2010. I wanted to create a better app for Bicing, the local bike sharing system in Barcelona. Their website was a nightmare to use, and I was tired of having to walk to each station, trying to guess which ones had bicycles. There was no app for Android, other than a couple of unofficial attempts that didn’t work at all.

I began as most would; I searched the internet and found a library named python-bicing that was somehow able to retrieve station and bike information. This was my first time using Python and, after some investigation, I learned what the code was doing: accessing the official website, parsing the JavaScript that generated their buggy map and giving back a nice chunk of Python objects that represented bike share stations.

This, I learned, was called web scraping. It was as if I had discovered a magic trick that would always let me access the data I needed without having to rely on faulty websites.

The rise of OpenBicing and CityBikes

Shortly after, I launched OpenBicing, an Android app for the local bike sharing system in Barcelona, together with a backend that used python-bicing. I also shared a public API that provided this information so that nobody else had to do the dirty work ever again.

Since other cities were having the same problem, we expanded the scope of the project worldwide and renamed it CityBikes. That was 6 years ago.

To date, CityBikes is the most comprehensive and widely used open API for bike sharing information, with support for over 400 cities worldwide. Our API processes around 10 requests per second and we scrape each of the 418 feeds about every three minutes. Making our core library available for anyone to contribute has been crucial in maintaining and adding coverage for all of the supported systems.
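
If you want to poke at the data yourself, the API is a couple of plain JSON endpoints. Here is a minimal sketch with requests; the endpoint paths and field names are as I recall them from the v2 API, so double-check against the documentation:

import requests

# List every bike sharing network the API knows about.
networks = requests.get('http://api.citybik.es/v2/networks').json()['networks']
print(len(networks), 'networks')

# Fetch the stations of one network, with free bikes and empty slots per station.
detail = requests.get('http://api.citybik.es' + networks[0]['href']).json()
for station in detail['network']['stations'][:5]:
    print(station['name'], station['free_bikes'], station['empty_slots'])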

The open data fallacy

We are usually regarded as “an open data project”, even though less than 10% of our feeds come from properly licensed, documented and machine-readable sources. The remaining 90% is composed of 188 feeds that are machine-readable but neither licensed nor documented, and 230 that are maintained entirely by scraping HTML pages.

[Image: breakdown of CityBikes feeds by openness]

NABSA (the North American BikeShare Association) recently published GBFS (General Bikeshare Feed Specification). This is clearly a step in the right direction, but I can’t help looking at the almost 60% of services we currently support through scraping and wondering how long it will take the remaining organizations to release their information, if ever. And these numbers don’t even take worldwide coverage into account.

Over the last few years there has been a progression by transportation companies and city councils toward providing their information as “open data”. Directive 2003/98/EC encourages EU member states to release information regarding public services.

Yet, in most cases, there is little effort to compel public-private partnerships (PPPs) to release their public information under a non-restrictive license, or even to transfer ownership of the data to city councils so it can be included in their open data portals.

Even with the increasing number of companies and institutions interested in participating in open data, by no means should we consider open data a reality or something to be taken for granted. I firmly believe in the future and benefits of open data, and I have seen them happening all around CityBikes, but as technologists we need to stress the fact that the data is not out there yet.

The benefits of open data

When I started this project, I sought to make a difference in Barcelona. Now you can find tons of bike sharing apps that use our API on all major platforms. It doesn’t matter that these are not our own apps. They are solving the same problem we were trying to fix, so their success is our success.

Besides popular apps like Moovit or CityMapper, there are many neat projects out there, some of which are published under free software licenses. Ideally, a city council could create a customization of any of these apps for their own use.

Most official applications for bike sharing systems have terrible ratings. The core business of transportation companies is running a service, so they have no real motivation to create an engaging UI or innovate further. In some cases, the city council does not even own the rights to the data, being completely at the mercy of the company providing the transportation service.

Open data over apps

When providing public services, city councils and companies often get lost in what they should offer as an aid to the service. They focus on a nice map or a flashy application, rather than providing the data behind these service aids. Maps, apps, and websites have a limited focus and usually serve a single purpose. On the other hand, data is malleable and the purest form of representation. While you can’t create something new from looking and playing with a static map (except, of course, if you scrape it), data can be used to create countless different iterations. It can even provide a bridge that will allow anyone to participate, improve and build on top of these public services.

Wrap Up

At this point, you might wonder why I care so much about bike sharing. To me it’s not about bike sharing anymore. CityBikes is just too good of an open data metaphor, a simulation in which public information is freely accessible to everyone. It shows the benefits of open data and the deficiencies that arise from the lack thereof.

We shouldn’t have to create open data by scraping websites. This information should be already available, easily accessed and provided in a machine-readable format from the original providers, be they city councils or transportation companies. However, until there’s another option, we’ll always have scraping.

Scrapy Tips from the Pros: March 2016 Edition

Welcome to the March Edition of Scrapy Tips from the Pros! Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll cover how to use a cookiejar with the CookiesMiddleware to get around websites that won’t allow you to crawl multiple pages at the same time using the same cookie. We’ll also share a handy tip on how to use multiple fallback XPath/CSS expressions with item loaders to get data from websites more reliably.

Students reading this: we are participating in Google Summer of Code 2016 and some of our project ideas involve Scrapy! If you’re interested, take a look at our ideas and remember to apply before Friday, March 25!

If you are not a student, please share this with your student friends. They could get a summer stipend and we might even hire them at the end.

Work Around Sites With Weird Session Behavior Using a CookieJar

Websites that store your UI state on their server’s sessions are a pain to navigate, let alone scrape. Have you ever run into websites where one tab affects the other tabs open on the same site? Then you’ve probably run into this issue.

While this is frustrating for humans, it’s even worse for web crawlers. It can severely hinder a web crawling session. Unfortunately, this is a common pattern for ASP.Net and J2EE-based websites. And that’s where cookiejars come in. While the cookiejar is not a frequent need, you’ll be so glad that you have it for those unexpected cases.

When your spider crawls a website, Scrapy automatically handles cookies for you, storing them and sending them back in subsequent requests to the same site. But, as you may know, Scrapy requests are asynchronous. This means that you probably have multiple requests to the same website being handled concurrently while sharing the same cookies. To avoid having requests affect each other when crawling these types of websites, you must set different cookies for different requests.

You can do this by using a cookiejar to store separate cookies for different pages in the same website. The cookiejar is just a key-value collection of cookies that Scrapy keeps during the crawling session. You just have to define a unique identifier for each of the cookies that you want to store and then use that identifier when you want to use that specific cookie.

For example, say you want to crawl multiple categories on a website, but this website stores the data related to the category that you are crawling/browsing in the server session. To crawl the categories concurrently, you would need to create a cookie for each category by passing the category name as the identifier to the cookiejar meta parameter:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    urls = [
        'http://www.example.com/category/photo',
        'http://www.example.com/category/videogames',
        'http://www.example.com/category/tablets'
    ]

    def start_requests(self):
        for url in self.urls:
            category = url.split('/')[-1]
            # One cookie session per category, keyed by the category name.
            yield scrapy.Request(url, meta={'cookiejar': category})

Three separate cookie sessions will be managed in this case (‘photo’, ‘videogames’ and ‘tablets’). A new one is created whenever you pass a key that doesn’t exist yet as the cookiejar meta value (for example, when a category hasn’t been visited before). When the key already exists, Scrapy uses the corresponding cookies for that request.

So, if you want to reuse the cookie that has been used to crawl the ‘videogames’ page, for example, you just need to pass ‘videogames’ as the unique key to the cookiejar. Instead of creating a new cookie, it will use the existing one:

yield scrapy.Request('http://www.example.com/atari2600', meta={'cookiejar': 'videogames'})
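
One detail worth remembering: the cookiejar meta key is not sticky, so if your callbacks generate further requests you have to keep passing it along yourself. A minimal sketch of a callback inside your spider (the selector and callback name are placeholders):

def parse_category(self, response):
    # Keep using the same cookie session for follow-up requests in this category.
    for href in response.css('a.product ::attr(href)').extract():
        yield scrapy.Request(
            response.urljoin(href),
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_product,
        )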

Adding Fallback CSS/XPath Rules

Item Loaders are useful when you need to accomplish more than simply populating a dictionary or an Item object with the data collected by your spider. For example, you might need to add some post-processing logic to the data that you just collected, anything from capitalizing every word in a title to more complex operations. With an ItemLoader, you can decouple this post-processing logic from the spider in order to have a more maintainable design.
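
For instance, capitalizing every word of a title takes a single input processor. A quick sketch using the processors that ship with Scrapy:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class TitleCasingLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # Strip and title-case every value extracted into the 'name' field.
    name_in = MapCompose(str.strip, str.title)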

This tip shows you how to add extra functionality to an Item Loader. Let’s say that you are crawling Amazon.com and extracting the price for each product. You can use an Item Loader to populate a ProductItem object with the product data:

import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()


class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        ...

    def parse_product(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css('price', '#priceblock_ourprice ::text')
        loader.add_css('name', '#productTitle ::text')
        loader.add_value('url', response.url)
        yield loader.load_item()

This method works pretty well, unless the scraped product is a deal. This is because Amazon represents deal prices in a slightly different format than regular prices. While the price of a regular product is represented like this:

<span id="priceblock_ourprice" class="a-size-medium a-color-price">
    $699.99
</span>

The price of a deal is shown slightly differently:

<span id="priceblock_dealprice" class="a-size-medium a-color-price">
    $649.99
</span>

A good way to handle situations like this is to add a fallback rule for the price field in the Item Loader. This is a rule that is applied only if the previous rules for that field have failed. To accomplish this with the Item Loader, you can add an add_fallback_css method:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class AmazonItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    def get_collected_values(self, field_name):
        # Values collected so far for this field, or an empty list if none.
        return (self._values[field_name]
                if field_name in self._values
                else self._values.default_factory())

    def add_fallback_css(self, field_name, css, *processors, **kw):
        if not any(self.get_collected_values(field_name)):
            self.add_css(field_name, css, *processors, **kw)

As you can see, the add_fallback_css method will use the CSS rule if there are no previously collected values for that field. Now, we can change our spider to use AmazonItemLoader and then add the fallback CSS rule to our loader:

def parse_product(self, response):
    loader = AmazonItemLoader(item=ProductItem(), response=response)
    loader.add_css('price', '#priceblock_ourprice ::text')
    loader.add_fallback_css('price', '#priceblock_dealprice ::text')
    loader.add_css('name', '#productTitle ::text')
    loader.add_value('url', response.url)
    yield loader.load_item()

This tip can save you time and make your spiders much more robust. If one CSS rule fails to get the data, there will be other rules that can be applied which will extract the data you need.

If Item Loaders are new to you, check out the documentation.

Wrap Up

And there you have it! Please share any and all problems that you’ve run into while web scraping and extracting data. We’re always on the lookout for new tips and hacks to share in our Scrapy Tips from the Pros monthly column. Hit us up on Twitter or Facebook and let us know if we’ve helped your workflow.

And if you haven’t yet, give Portia, our open source visual web scraping tool, a try. We know you’re attached to Scrapy, but it never hurts to experiment with your stack 😉

Please apply to join us for Google Summer of Code 2016 by Friday, March 25!

This Month in Open Source at Scrapinghub March 2016

Welcome to This Month in Open Source at Scrapinghub! In this monthly column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource [@] scrapinghub.com or on Twitter @scrapinghub.

Scrapy

The big news for Scrapy lately is that Python 3 is now supported for the majority of use cases, the exceptions being FTP and email. We are very proud of the work done by our community of users and contributors, both old and new. It was a long ride, but we’re finally here. You all made it happen!

Check out the cool stuff that we packed into this release and please pay attention to the following backward incompatible changes:

  • When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now “private” instead of “public”. You can use FILES_STORE_S3_ACL to change it.
  • The scrapy shell now supports URLs without a scheme. See issue 1498.
  • The --lsprof command line option has been removed. See issue 1689.
  • Scrapy no longer retries requests that receive an HTTP 400 Bad Request response. See issue 1289.

Scrapy 1.1 is not officially released yet (we’re aiming for the end of March), but Release Candidate 3 is available for you to test. It’s the last mile, so we’d really appreciate it if you could report any issues you run into with Scrapy 1.1.0rc3 so that we can do our best to fix them.

Oh, and for those who want to stay on more stable (and less shiny) ground, we released Scrapy 1.0.5 with a few bug fixes.

Splash 2.0 is out (Qt 5 and Python 3 inside)

Splash 2.0 is out! (Actually we’re already at v2.0.3 and 2.1 will be released soon)

  • Python 3 support
  • Now runs on Qt 5
  • Improved UI
  • Built-in support for JSON and Base64
  • Fixed a problem with cookies set by JavaScript
  • Fixed a bug that prevented updating proxy settings

Check out the repository here.

Google Summer of Code 2016

This is our third year of participating in Google Summer of Code and we’ve got plenty of possible project ideas for Scrapy, Portia, Splash, and Frontera. This program is open to students who are interested in working on open source projects with professional mentors. We’ve actually hired two of our previous participants, so you might even get a job out of this opportunity!

Scrapinghub is running under the Python Software Foundation umbrella, so please take the time to read through their guidelines before applying.

Applications opened on March 14 and close on March 25. We’re looking forward to working with you!

Libraries

Dateparser 0.3.4

Changes from 0.3.1 (last October):

New features:

  • Added Hijri Calendar support.
  • Added settings for better control over parsing dates.
  • Support for converting parsed times to a given timezone, for both complete and relative dates.
  • Finnish language support. (from 0.3.3)

Improvements:

  • Fixed a problem with caching datetime.now() in FreshnessDateDataParser.
  • Added abbreviated month and weekday names for several languages.
  • More simplifications for Russian and Ukrainian.
  • Fixed a problem with parsing the time component of date strings containing several kinds of apostrophes.
  • Faster parsing after switching to the regex module. (from 0.3.3)
  • You can now use the RETURN_AS_TIMEZONE_AWARE setting to return timezone-aware date objects. (from 0.3.3)
  • Fixed conflicts caused by the similarity of month/weekday names across languages. (from 0.3.3)

Note that 0.3.4 pins python-dateutil at 2.4.2 or earlier; it does not work with python-dateutil 2.5.

Portia 2.0 Beta

The beta version of Portia 2.0 is out! This major release comes with a completely overhauled UI and plenty of new fancy tricks (including multiple item extraction) to help make automatic data extraction even easier. Stay tuned for the official release and in the meantime, try out Portia 2.0 beta and let us know what you think.

The other big news in the Portia camp is the closure of Kimono Labs. For those affected, we offer a Kimono Labs to Portia migration so that you don’t need to lose any of your work.

Frontera

We released Frontera version 0.4 in January; however, we feel it deserves more coverage.

  • Distributed Frontera and Frontera were merged together into a single project to make it easier to use and understand.
  • We completely redesigned the Backend concept. It now consists of Queue, Metadata and States objects for low-level code and higher-level Backend implementations for crawling policies. Currently there is one revisiting policy demonstrating how to use that abstraction.
  • The overall distributed concept is now integrated into Frontera, making the difference between running components in single-process mode and in distributed spiders/backend run modes clearer.
  • Significantly restructured and expanded the documentation, addressing user needs in a more accessible way and using the most popular Frontera use cases as a starting point.
  • A smaller configuration footprint: the current version needs smaller configuration files for a quick start.

Let us know what you think! (use v0.4.1 from PyPI)

Scrapinghub Command Line Client

The Scrapinghub command line client, Shub, has long lived as merely a fork of scrapyd-client, the command line client for scrapyd. Last January, we freed it in the form of Shub v2.0! This release brings many new features and major improvements in usability.

If you work with multiple Scrapinghub projects, or even multiple API keys, you were probably irritated by the amount of repetition you needed to put into your scrapy.cfg file.

Shub v2.0 now reads from its own configuration file, scrapinghub.yml, where you can configure different projects or keys in a single file. You don’t need to worry about migrating your configuration, as Shub will automatically generate new configuration files from your old ones. To avoid storing your API keys in version control, you can run shub login, which will take your API key and create a configuration file, .scrapinghub.yml, in your home directory. Shub reads this file by default, so you don’t need to specify the API key in future deployments.

If you’re new to deploying your projects to Scrapinghub, or have just started a new project, running shub deploy in the project folder will guide you through a wizard and automatically generate your configuration files. No need to copy-and-paste from our web interface anymore!

We haven’t only worked on deploying projects and onboarding new users: Shub now provides a much nicer shell experience, with a dedicated help page for every command (shub schedule --help) and extensive error messages. If you’re not used to installing Python packages from the command line, our new stand-alone binaries (including for Windows) might be for you.

A particularly long-awaited new feature is the ability to view log entries, or items, live as they are being scraped. Just run shub log -f JOBID and watch your spiders at work. Shub will tell you the JOBID when you schedule a run via shub schedule. Alternatively, you can simply look it up on the web interface.

Find the full documentation here. You can install shub v2.0.2 via pip install -U shub, or get the binaries here.

Don’t forget to tell us what you think!

Wrap Up

Thus concludes the March edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors so please explore our GitHub. And remember, students, there are a variety of projects available for our open source projects, so apply to work with Scrapinghub on Google Summer of Code 2016.
