What I Learned as a Google Summer of Code student at Scrapinghub

Google Summer of Code (GSoC) was a great experience for a student like me. I learned a great deal about open source communities and about contributing to their complex projects, and my mentors, Konstantin and Cathal, taught me a lot about programming and software engineering practices. The most valuable lesson, though, was discovering what it is like to be a software engineer, which prepared me to keep pursuing my dream career in technology.

What is GSoC?

GSoC is a program hosted by Google in which students spend their summer contributing to open source projects from various organizations. Fortunately, my proposal for Scrapy, a project under the Python Software Foundation, was accepted. Before applying, I had no idea what open source was, so I took small steps to get started: I spent three months studying smaller projects within the Scrapy organization to familiarize myself with the codebase. The application felt much easier once I was used to contributing. I also received a lot of help from people in the community, especially Konstantin, who would later become my GSoC mentor. In my experience, a willingness to learn and to ask for help is the most important thing for anyone who wants to start participating in an open source project.

My Expectations

In the beginning, I did not know what to expect from GSoC because of the mixed reviews from past students. Since each student works on an individual project, I figured their experiences must differ widely. I also wondered how a student-built project could have a significant impact on the community. However, I came out of GSoC with a completely different perspective: it was up to me to make or break my project, and I was responsible for pushing myself harder. I also believe my project was a success, because it attracted other contributors' attention, and I hope it inspires more students to do the same!

My GSoC Journey

The first week of GSoC was the most confusing part of the program, because my original project proposal was too generic. Building the product I had originally planned took more effort than I anticipated. That was okay, though: it is a common mistake among students, since most of us had never worked on a large-scale project before.

My first task was to profile a Scrapy spider to identify the components that consumed significant run time, so that we would know where speed improvements would pay off. Profiling is a meticulous process, and I had to spend a long time on it. At first I was bored; I had not joined GSoC to analyze graphs, or so I thought. But I learned that everything has to start small. I could have spent less time profiling and started coding instead, but I would have been building something meaningless, since I did not yet know which component needed optimizing. It would have been like building a skyscraper on an insecure foundation.
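As a rough illustration of what that profiling step looks like, here is a minimal sketch using Python's standard cProfile module. The `parse_all` function is a hypothetical stand-in for the code being measured; the real target was Scrapy's internals, not this function.

```python
import cProfile
import pstats
from urllib.parse import urlparse

# Hypothetical stand-in for a piece of spider code being measured;
# the real profiling target was Scrapy's internals, not this function.
def parse_all(urls):
    return [urlparse(u) for u in urls]

urls = ["https://example.com/page?id=%d" % i for i in range(10000)]

profiler = cProfile.Profile()
profiler.enable()
results = parse_all(urls)
profiler.disable()

# Print the five entries with the largest cumulative time, which is
# how you spot the components worth optimizing.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Reading output like this over and over is exactly the "analyzing graphs" tedium described above, but it is what points you at the right component.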

After a lot of profiling, I finally found the component that needed optimization: the URL parsing library, urllib, from CPython. I then had to profile candidate replacements, and I grew impatient because I still could not get my hands on any coding. The potential libraries fell into two categories: those that did not improve anything at all, and those that were fast but not compatible with Scrapy. Being stuck on the same problem for so long left me feeling hopeless, but that is something every student developer has to experience eventually. I took a deep breath and continued with the task I had been assigned. Eventually, I decided to build a library from scratch to replace urllib, and I named the project Scurl (GitHub repository).
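To give a sense of the kind of measurement involved in comparing candidates, here is a hypothetical micro-benchmark of stdlib URL parsing using `timeit`. The URL and iteration count are illustrative; absolute numbers vary by machine, and only relative comparisons between candidate parsers matter.

```python
import timeit
from urllib.parse import urlparse

# Hypothetical micro-benchmark: the sort of measurement used to judge
# whether a replacement for stdlib URL parsing is worth the effort.
url = "https://example.com/some/long/path?query=value&page=2#frag"

elapsed = timeit.timeit(lambda: urlparse(url), number=100_000)
print(f"100k urlparse calls took {elapsed:.3f}s")

# Sanity-check that the parser splits the URL into its components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)
```

A crawler parses URLs constantly, so even a small per-call saving here multiplies across millions of requests, which is what made this component worth replacing.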

After a month of development, Scurl reached a stable stage. However, the Chromium source it used came from another project on GitHub, so Scurl would not last long if that source could not be updated. What if Chromium released a patch to the components the library uses? What if the source code changed completely in twenty years? My project would be thrown away.

Chromium is a gigantic project; building the Chromium source on an average machine is slower than a snail running half a mile. Working with just two of its components was genuinely difficult. Since I had no prior experience with Chromium's source, or with C++, I spent a lot of time tracking down which source files my project needed. It took so long that I thought of giving up several times. But giving up was not an option: I had already spent two and a half months on the project. Despite the struggle, I learned how to update the Chromium source code, which allows others to maintain the library with ease.

Lessons Learned

Overall, GSoC gave me the chance to become a better software developer. Not only did I have an opportunity to hone my programming skills, but I also trained my mind to be ready for the career path I have chosen. I am truly grateful for the experience. I want to thank my mentors, Konstantin and Cathal, for their help and support, and I hope I have inspired others to step out of their comfort zones and do the same!

Special thanks to Tram Nguyen and Samuel Coveney for helping me edit this article!
