Author: Carlos Peña

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, the Peruvian Congress provides a website where people can check the list of bills that are being processed, voted on, and eventually turned into law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated as bills move between commissions, are approved, and are then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, many laws have been passed that benefited only one specific company or individual.


However, transparency doesn’t guarantee accessibility. The site is very clunky, and the information for each bill is spread across several pages. Bills are displayed in one very long paginated list, and until very recently there was no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time consuming, as around one thousand bills are proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all.


The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.
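The extraction step such a spider performs can be sketched without the framework. The snippet below parses bill rows out of an HTML table using only Python’s standard library; the table layout and field names are hypothetical stand-ins for the real Congress markup, which a Scrapy spider would handle with its own selectors.

```python
from html.parser import HTMLParser

# Sketch of the extraction logic a bill-scraping spider would run on each
# listing page. The <tr>/<td> layout here is an assumed, simplified version
# of the real Congress table.
class BillRowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "tr":
            self.cells = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.cells:
            number, date, title = self.cells[:3]
            self.rows.append({"number": number, "date": date, "title": title})

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

# Illustrative page content; real pages would be fetched by the spider.
html_page = """
<table>
  <tr><td>04895/2015-CR</td><td>2016-07-01</td><td>Ley que declara de interes nacional el cafe</td></tr>
</table>
"""
parser = BillRowParser()
parser.feed(html_page)
print(parser.rows)
```

In the real spiders, each extracted record would then be written to the PostgreSQL database on every 8-hour crawl.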

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.
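With the bills in a relational database, keyword search reduces to a simple query. Here is a minimal sketch of that idea using SQLite in place of PostgreSQL; the table schema, column names and sample bills are illustrative, not the actual ProyectosDeLey.pe schema.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database the spiders populate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (number TEXT, title TEXT, authors TEXT)")
conn.executemany(
    "INSERT INTO bills VALUES (?, ?, ?)",
    [
        ("00123/2016-CR", "Ley del Dia Nacional del Cafe Peruano", "Perez"),
        ("00456/2016-CR", "Ley de Transparencia Fiscal", "Gomez"),
    ],
)

def search_bills(keyword):
    # Substring match on the title, the core of a basic keyword search.
    cur = conn.execute(
        "SELECT number, title FROM bills WHERE title LIKE ?",
        (f"%{keyword}%",),
    )
    return cur.fetchall()

print(search_bills("Cafe"))
```

A Django view would wrap a query like this one and render the matches for the user.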


The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.
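The headline statistic follows directly from the scraped counts:

```python
# Figures from the scraped data: 5402 proposed bills, 740 enacted.
total_bills = 5402
became_law = 740

success_rate = became_law / total_bills * 100
print(f"{success_rate:.1f}% of proposed bills became law")  # roughly 13.7%
```

In other words, fewer than one in seven bills ever makes it into law.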


Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for a national day of peace, of “peace consolidation”, of “peace and reconciliation”, of Peruvian Coffee, of Peruvian Cuisine, and national days for several other Peruvian products.

There was even more than one bill proposing to celebrate the same thing on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. Under Peruvian legislation, the second round of voting should be waived only under exceptional circumstances.

However, the numbers show that a single round of voting has become the norm: 88% of the approved bills passed after only one round. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber, where all the decisions are made. Members of Congress are also expected to use the time between the first and second votes for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having such information in a well-structured, machine-readable format is that we can create cool data visualizations, such as an interactive timeline that shows every event in a given bill’s life.


Another cool thing is that this data allows us to monitor Congress’ activities. Our web app allows users to subscribe to an RSS feed to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when a bill threatens to do more harm than good.
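An RSS feed of new bills is straightforward to produce once the data is structured. The sketch below hand-rolls a minimal RSS 2.0 document with the standard library; the real site more likely uses Django’s syndication framework, and the bill data and URL here are illustrative.

```python
import xml.etree.ElementTree as ET

def bills_to_rss(bills):
    # Build a minimal RSS 2.0 document: one <item> per bill.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Latest Congress bills"
    for bill in bills:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = bill["title"]
        ET.SubElement(item, "link").text = bill["url"]
    return ET.tostring(rss, encoding="unicode")

feed = bills_to_rss(
    [{"title": "Proyecto de Ley 00123", "url": "https://example.org/bill/00123"}]
)
print(feed)
```

Feed readers subscribed to this document would pick up each new bill as soon as a crawl finds it.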

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making transparency laws less useful than they should be. Volunteer work is key to building tools that turn otherwise clunky content into useful data for journalists, lawyers and regular citizens alike. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love! 🙂

How Web Scraping is Revealing Lobbying and Corruption in Peru

Update: With the release of the Panama Papers, reliable means of exposing corruption, money laundering and tax evasion are now even more important. Web scraping provides an avenue to find, collate, and organize data without relying on information leaks.

Paid political influence and corruption continue to be major issues in a surprising number of countries throughout the world. The recent “Operation Lava Jato” in Brazil, in which officials from the state company Petrobras were accused of taking bribes from construction companies in exchange for contracts financed with taxpayer money, is a reminder of this. It has been suggested that unregulated lobbying can lead to this kind of corruption, and Transparency International, a non-governmental organization that monitors corruption, has strongly recommended that governments regulate lobbying.

I live in Peru and with corruption scandals regularly making headlines, I was curious to see how Peruvian officials fared and to examine the role that lobbyists play in my government.

Is it possible to track lobbying in Peru?

The Peruvian Law of Transparency requires government institutions to maintain a website where they publish the list of people that visit their offices. Every visitor must register their ID document number, the reason they’re visiting, and the public servant whom they are visiting along with the time of entrance, time of exit and date of visit. You can find all of this information on websites such as this.

The problem with open data

While almost all institutions have their visitor lists available online, they all suffer from the same broken user-interface model. One of the major issues with these websites is that you cannot search for visitors; you can only browse who visited on a particular day. If you’re looking for a known lobbyist, you would need to visually scan several pages, as there can be up to 400 visitors per day in some institutions.

Obviously this method of searching is time-consuming, tedious and error-prone: it’s easy to miss the person you are looking for. The problem is compounded when you want to track a particular person across more than one public institution.

The web scraping solution

To help journalists track lobbyists, I started the project Manolo. Manolo is a simple search tool that contains the visitor records of several government institutions in Peru. You can type any name and search for any individual across 14 public institutions in a matter of milliseconds.

Under the hood

Unfortunately, Peruvian institutions do not provide their visitor data in a structured way. There’s no API to fetch this data in a machine-readable format, so the only way to get the information is by scraping their websites and then mining the data.

Manolo consists of a bunch of web scrapers written with Scrapinghub’s popular Python framework Scrapy. The Scrapy spiders crawl the visitor lists once a week, extracting structured data from the HTML content and storing it in a PostgreSQL database. Elasticsearch indexes this data and users can then perform searches via a Django-based web UI, available on the Manolo website.
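The name search that Elasticsearch provides can be illustrated in miniature with a pure-Python inverted index: map each lowercased token of a visitor’s name to the records containing it, then intersect the token sets at query time. The record fields and names below are illustrative, not real Manolo data.

```python
from collections import defaultdict

# Toy stand-ins for scraped visitor records.
records = [
    {"visitor": "Juan Perez Diaz", "institution": "Ministerio de Transportes"},
    {"visitor": "Maria Lopez", "institution": "Congreso"},
    {"visitor": "Juan Perez Diaz", "institution": "Palacio de Gobierno"},
]

# Inverted index: token -> set of record positions, the core idea behind
# Elasticsearch's full-text search.
index = defaultdict(set)
for i, rec in enumerate(records):
    for token in rec["visitor"].lower().split():
        index[token].add(i)

def search(name):
    # A record matches only if it contains every query token,
    # regardless of word order or letter case.
    tokens = name.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index[t] for t in tokens))
    return [records[i] for i in sorted(hits)]

print(search("perez juan"))  # both visits by Juan Perez Diaz
```

Because the index is built once per crawl and lookups are set intersections, queries over millions of records return in milliseconds, which is exactly what makes the Manolo search feel instant.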


So far, my friend @matiskay and I have created 14 spiders which have scraped more than 2 million records. You can find the source code for the spiders and the search tool here.


Who uses Manolo?

I have always aimed to provide this search tool online for anyone to use free of charge. At the beginning, I was just hoping that I could convince one or two journalists to use Manolo when searching for information.

You can imagine my surprise when two national newspapers (El Comercio and Diario Exitosa), one TV station (Latina) and two news web portals (Utero.pe and LaMula) became active users of Manolo.

One of the most impressive uses of Manolo was when a journalist used it to find out that the legal representative of a company that signed numerous and extremely profitable construction contracts with the government was a regular visitor to the President’s residence building. According to the data scraped by Manolo, this representative visited some of the closest allies of the president 33 times. He was already under investigation because his company pocketed the money from the contracts and never built anything.

The discovery of the 33 visits made big headlines in 4 nationwide newspapers. Soon after, journalists from a variety of media outlets started using Manolo to find suspicious visits of known lobbyists to ministries responsible for the construction of public infrastructure. They even found that lobbyists involved in the famous “Lava Jato” corruption scandal were also regular visitors of Peruvian institutions.


Chocherin Visited Government Palace 33 Times

Wrap Up

Government corruption through lobbying is shockingly prevalent. Transparency is the solution, but even publicly available information can be convoluted and difficult to access. And that’s where web scraping comes in. Open data is what we should all be striving towards and hopefully others will follow Manolo’s lead.

Our use of Scrapy can easily be applied to any other situation where you are looking to scrape websites for information on government officials (or anyone else, for that matter). Even those without a technical background can scrape websites using Portia, our free and open source visual web scraping tool.