Birthdays are a big deal at Scrapinghub. We always make sure to celebrate each team member on their special day, recognizing achievements and sending well-wishes for the year to come. Well, on December 15, 2015, we celebrated one of the most momentous birthdays of the year: our own. We are officially 5 years old and what an amazing 5 years it has been.
From humble beginnings of a mere 3 Scrapinghubbers, we are now 124 strong and distributed over 37 countries. With each step forward, we have added new products, developed more open source projects, and created further avenues to remain the leaders in web scraping services.
So take a walk with us down memory lane and join us in celebrating this major milestone.
Now that we’ve looked back at our accomplishments, it’s time to turn our sights forward. 2016 is looking to be a year of major releases and huge leaps in scraping the web even more efficiently.
Projects and Products
It will be easier than ever to get data without needing to write a line of code. With our upcoming Portia release, we’ll be improving the UX and simplifying the process of web scraping. Our goal is to make Portia more intelligent. To start, it will be able to automatically extract data from lists or tables in web pages. For all of you advanced users, we will also be including the ability to export your Portia spider as a Scrapy spider written in Python so that you can fine-tune the code however you like. Portia will also be able to decide what should be crawled and when actions should be run so that you can focus on getting data and not shepherding your spiders.
Kumo: Scrapy Cloud 2.0
The new generation of web scraping platforms is coming! Kumo, AKA Scrapy Cloud 2.0, was originally developed as a lab project, and we are almost ready to share a beta version. A key feature of Kumo is better resource usage through advanced cluster management software. We are able to allocate resources more efficiently, allowing us to offer better performance and more flexible pricing plans.
Kumo also provides a way to build and deploy Docker images. This feature is incredibly helpful for those who want to deploy crawlers that use their favorite Natural Language Processing toolkit, which may require a couple of C or Fortran libraries to be compiled. Kumo allows finer control over the production environment.
Our platform developers are working on a brand new Web API for Scrapinghub called OneAPI. OneAPI will cover the entire Scrapinghub platform and will provide a single, consistent, and well-documented interface for all platform resources. We aim to provide the best possible experience for developers who are using our platform by making our new API easy to learn and use.
We are proud to announce that Python 3 support for Scrapy is almost ready to go! Scrapy core developers are hard at work and Scrapy 1.1 will be released soon with basic Python 3 support. This will be the year of Python 3 at Scrapinghub because once Scrapy fully supports Python 3, we will start rolling it out across all of our projects.
Crawlera is also receiving an overhaul this year. We’re updating the architecture and doing more data analysis to improve how we manage IP addresses and crawl settings. We are providing more feedback to users along with advanced crawling features. Our goal is for Crawlera to become even more reliable and scalable, so that you can continue to crawl uninterrupted.
Frontera is no longer only a crawl frontier framework. It’s also a set of components that allows you to distribute and scale custom web crawlers. Plus, it includes Scrapy support out of the box. Frontera allows us to build large-scale web crawlers in Python. This will be a great year for Frontera, with a lot of features coming, including a web management UI, adaptive revisiting, Docker containers, and a standalone Frontera-based crawler.
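If the term "crawl frontier" is new to you, here is a rough idea of what one does: it keeps track of which URLs have already been scheduled and decides which ones the crawler should fetch next. The sketch below is a toy, in-memory illustration written with the Python standard library only. It is not Frontera's API, and the `SimpleFrontier` class and its method names are hypothetical, made up for this example.

```python
import heapq
from itertools import count


class SimpleFrontier:
    """Toy crawl frontier (hypothetical, not Frontera's API).

    Hands out each URL at most once, lowest priority value first.
    """

    def __init__(self):
        self._heap = []        # entries are (priority, insertion order, url)
        self._seen = set()     # every URL that has ever been scheduled
        self._order = count()  # tie-breaker so equal priorities stay FIFO

    def add(self, url, priority=0):
        """Schedule a URL unless it was scheduled before; return True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_batch(self, size):
        """Pop up to `size` URLs for the crawler to fetch next."""
        batch = []
        while self._heap and len(batch) < size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

    def __len__(self):
        """Number of URLs still waiting to be fetched."""
        return len(self._heap)
```

A crawler would call `add()` for every link it discovers and `next_batch()` whenever it is ready to fetch more pages. A real frontier also has to handle persistence, distribution across workers, and revisiting policies, which this sketch leaves out entirely.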
With the growing number of Scrapinghubbers, we’ve been able to continue to expand our teams and our resources. We’re building more datasets and products based on data for tasks like lead generation, competitive intelligence, trend prediction, and marketing analytics.
In non-platform news, keep your eyes peeled for two new monthly columns on our blog. We’re ramping up this year to provide more tips, workflow hacks, and release information so that you are up-to-date on the happenings at Scrapinghub.
Scrapy Tips from the Pros
We are Scrapy experts who have done battle in the trenches and faced pretty much any scraping-related problem you could conceivably imagine. Scrapy was developed by our two founders, we maintain it, and our products and services are designed to make the most out of this framework. We know our Scrapy and we will share our hard-won knowledge with you in this newly launched monthly column. Each post will feature a couple of tricks and new developments to help you overcome any obstacles that you may come across while scraping the web.
This Month in Open Source at Scrapinghub
It’s shaping up to be a pretty busy year for us, what with releases and advancements in Portia, Scrapy Cloud, Frontera, etc. So, we’re making it easier for you to keep up with the slew of Scrapinghubber projects coming at you with the release of this column. Each month we’ll round up the most significant updates so that you can always catch up on anything you might have missed.
We hope that you are as psyched about the coming year as we are. Lots of improvements are in store and we invite you to come along for the ride. Follow us on Twitter, Facebook, and Instagram or subscribe to our RSS Feed. Reach out and let us know what you would like to see from Scrapinghub in 2016.