Category: Open source

Do Androids Dream of Electric Sheep?

It has become very easy to do machine learning: you install an ML library like scikit-learn or xgboost, choose an estimator, feed it some training data, and get a model which can be used for predictions. OK, but what’s next? How would you know if it works well? Cross-validation! Good! How would you know that you haven’t messed up the cross-validation? Are there data leaks? If the quality is not good enough, how do you improve it? Are there data preprocessing…

Read More

Improved Frontera: Web Crawling at Scale with Python 3 Support

Python is our go-to language and Python 2 is losing traction. In order to survive, older programs need to be Python 3 compatible. And so we’re pleased to announce that Frontera will remain alive and kicking because it now fully supports Python 3! Joining the ranks of Scrapy and Scrapy Cloud, you can officially continue to quickly create and scale fully formed crawlers without any issues in your Python 3-ready stack. As a key web crawling toolbox…

Read More

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera. If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub. Scrapy This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars…

Read More

This Month in Open Source at Scrapinghub June 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera. If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource@scrapinghub.com or on Twitter @scrapinghub. Scrapy 1.1 For those who missed the big news, Scrapy 1.1 is live! It’s the first official release that comes with Python 3 support, so you can…

Read More

This Month in Open Source at Scrapinghub March 2016

Welcome to This Month in Open Source at Scrapinghub! In this monthly column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera. If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource [@] scrapinghub.com or on Twitter @scrapinghub. Scrapy The big news for Scrapy lately is that Python 3 is now supported for the majority of use cases, the exceptions being FTP and…

Read More

Portia: The Open Source Alternative to Kimono Labs

Attention Kimono users: we’ve created an exporter so you can easily convert your projects from Kimono to Portia! Imagine your business depended heavily on a third-party tool and one day that company decided to shut down its service with only 2 weeks’ notice. That, unfortunately, is what happened to users of Kimono Labs yesterday. And it’s one of the many reasons why we love open source so much. Portia is an open source visual scraping tool developed by Scrapinghub…

Read More

Scrapy on the Road to Python 3 Support

Scrapy is one of the few popular Python packages (almost 10k GitHub stars) that’s not yet compatible with Python 3. The team and community around it are working to make it compatible as soon as possible. Here’s an overview of what has been happening so far. You’re invited to read along and participate in the porting process from Scrapy’s GitHub repository. First off, you may be wondering: “why is Scrapy not in Python 3 yet?” If you asked around you…

Read More

The Road to Loading JavaScript in Portia

Support for JavaScript has been a much-requested feature ever since Portia’s first release 2 years ago. The wait is nearly over and we are happy to inform you that we will be launching these changes in the very near future. If you’re feeling adventurous, you can try it out on the develop branch on GitHub. This post aims to highlight the path we took to achieving JavaScript support in Portia. The Plan As with everything in software, we started…

Read More

Google Summer of Code 2015

We are very excited to be participating again this year in Google Summer of Code. After a successful experience last year, when Julia Medina (now a proud Scrapinghubber!) worked on Scrapy API cleanup and per-spider settings, we are back again this year with 3 ideas approved: Jacob de Mayer from Germany is working on Simplified Scrapy Addons to make it super simple to enable extensions on Scrapy. A highly welcome addition to Scrapy and its growing user base! You can follow…

Read More

Frontera: The Brain Behind the Crawls

At Scrapinghub we’re always building and running large crawls–last year we had 11 billion requests made on Scrapy Cloud alone. Crawling millions of pages from the internet requires more sophistication than getting a few contacts off a list, as we need to make sure that we get reliable data and up-to-date lists of item pages, and are able to optimise our crawl as much as possible. From these complex projects emerge technologies that can be used across all of…

Read More