Join Scrapinghub for Google Summer of Code 2016

It’s that time of year again! Google Summer of Code 2016 applications are upon us and we welcome any and all students who are interested in open source and web scraping. For those just hearing about this program, Google Summer of Code provides stipends ($5500 for a successfully completed project) to students who are interested in writing code for open source projects. This is a full-time commitment and a great way to get involved in the open source community.

This is our third year participating in this prestigious program and we’re excited to announce projects around Scrapy, Portia, Splash, and Frontera. We feature projects that range from “Easy” to “Advanced” and we’re happy to have students with different levels of technical skills.

Student applications are accepted from March 14 to March 25, and students accepted to our projects will be announced on April 22, 2016. We’re very excited to mentor Python enthusiasts!

Please take a look at our guidelines and application tips so that we can make this process as easy and straightforward as possible.

Project Ideas

Here are our available open source project ideas. Browse around:

Scrapy

Scrapy is our Python-based web scraping framework.

Advanced:

Intermediate:

Easy:

Portia

Portia is our visual web scraper. This tool allows you to get the data you need from websites without needing to write a single line of code.

Advanced:

  • Portia spider generation: make Portia spiders less sensitive to layout changes on websites by detecting when the layout of a website changes and using crawled datasets and the new page structure to auto-repair the spiders.

Splash

Splash is a headless browser that executes JavaScript for people crawling websites.

Intermediate:

  • Web scraping helpers: provide an easy way to click a link, submit a form, and extract data from a webpage using Splash Scripts.
  • Migrate to QtWebEngine: migrate the Splash rendering engine from QtWebKit to QtWebEngine.

Frontera

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives. It allows you to build a large-scale online web crawler. Frontera takes care of the logic and policies to follow during a crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it can do this in a distributed manner.
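To make the idea of a crawl frontier a little more concrete, here is a minimal sketch in plain Python of a component that stores extracted links and decides which page to visit next. It is an illustration only, not Frontera's code or API: the class name `SimpleFrontier`, the methods `add_links` and `next_requests`, and the numeric link scores are all assumptions made for this example, and it runs in a single process rather than in a distributed manner.

```python
import heapq
import itertools


class SimpleFrontier:
    """Toy crawl frontier (hypothetical, not Frontera's API)."""

    def __init__(self):
        self._heap = []                  # entries are (-score, order, url)
        self._seen = set()               # URLs already scheduled once
        self._order = itertools.count()  # tie-breaker: earlier links first

    def add_links(self, links):
        """Schedule (url, score) pairs extracted by the crawler."""
        for url, score in links:
            if url in self._seen:
                continue  # never schedule the same URL twice
            self._seen.add(url)
            # heapq is a min-heap, so negate the score to pop the highest first
            heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_requests(self, max_n):
        """Return up to max_n URLs to visit next, highest score first."""
        batch = []
        while self._heap and len(batch) < max_n:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = SimpleFrontier()
    frontier.add_links([("http://example.com/", 1.0),
                        ("http://example.com/about", 0.2)])
    frontier.add_links([("http://example.com/news", 0.8),
                        ("http://example.com/", 0.5)])  # duplicate, ignored
    print(frontier.next_requests(2))
```

The priority queue stands in for the "logic and policies" part: changing how the score is computed changes the crawl order without touching the code that actually downloads pages.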

Advanced:

  • Reliable Queue|Spider communication: provide reliable communication between the ZeroMQ queue and spiders, and fix known issues with message queues being consumed when there are no spiders running.
  • Frontera Web UI: create a web management UI for Frontera. This would allow people to view and control errors, download speed, and storage contents, and to perform advanced crawler management.
  • Frontera cluster provisioning service: build a service to monitor host resources and Frontera processes in a cluster, automatically restarting failed processes and providing an easy way to configure each component.

Intermediate:

  • Python 3 support: migrate the framework code to Python 3 while maintaining compatibility with Python 2.
  • Docker support: provide Docker containers for all Frontera components, allowing easier setup of the distributed mode.

Wrap Up

Scrapinghub is a sub-organization under the umbrella of the Python Software Foundation, so please take a moment to read through their guidelines and expectations.

This is a great opportunity to hone your coding skills on some popular open source projects, and you might even get a job out of it. We’ve actually hired two of our previous participants. You never know what might come of participating in Scrapinghub’s Google Summer of Code 2016 projects!

Apply Now
