This Month in Open Source at Scrapinghub March 2016

Welcome to This Month in Open Source at Scrapinghub! In this monthly column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource [@] scrapinghub.com or on Twitter @scrapinghub.

Scrapy

The big news for Scrapy lately is that Python 3 is now supported for the majority of use cases, the exceptions being FTP and email. We are very proud of the work done by our community of users and contributors, both old and new. It was a long ride, but we’re finally here. You all made it happen!

Check out the cool stuff that we packed into this release and please pay attention to the following backward incompatible changes:

  • When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now “private” instead of “public”. You can use FILES_STORE_S3_ACL to change it.
  • The scrapy shell command now supports URLs without a scheme. See issue 1498.
  • The --lsprof command line option has been removed. See issue 1689.
  • Scrapy no longer retries requests that receive an HTTP 400 Bad Request response. See issue 1289.
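
If your project relied on the old behaviour, the S3 ACL and HTTP 400 retry changes listed above can be reverted per project in settings.py. The snippet below is a minimal sketch, not an official migration recipe. FILES_STORE_S3_ACL is the setting named above, "public-read" is the standard S3 canned ACL for world-readable objects, and RETRY_HTTP_CODES is Scrapy's retry setting. The bucket name is hypothetical, and the list of retry codes is an assumption about the 1.1 defaults that you should check against your installed version.

```python
# settings.py (sketch): opt back in to the pre-1.1 behaviour.

# Hypothetical bucket; replace with your own S3 location.
FILES_STORE = "s3://example-bucket/files/"

# Uploaded files are private by default in Scrapy 1.1.
# "public-read" is the S3 canned ACL that makes them world-readable again.
FILES_STORE_S3_ACL = "public-read"

# Assumed Scrapy 1.1 defaults, plus 400 so that Bad Request responses
# are retried as before. Verify the defaults for your version.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400]
```

Only override these if you really need the old behaviour; private uploads are the safer default.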

Scrapy 1.1 is not officially released yet (we’re aiming for the end of March), but Release Candidate 3 is available for you to test. It’s the last mile, so we’d really appreciate it if you could report any issues you run into with Scrapy 1.1.0rc3 so that we can do our best to fix them.

Oh, and for those who want to stay on stabler (and less-shiny) grounds, we released Scrapy 1.0.5 with a few bug fixes.

Splash 2.0 is out (Qt 5 and Python 3 inside)

Splash 2.0 is out! (Actually, we’re already at v2.0.3, and 2.1 will be released soon.)

  • Python 3 support
  • Now runs on Qt 5
  • Improved UI
  • Built-in support for JSON and Base64
  • Fixed a problem with cookies set from JavaScript
  • Fixed a bug that prevented updating proxy settings

Check out the repository here.

Google Summer of Code 2016

This is our third year of participating in Google Summer of Code and we’ve got plenty of possible project ideas for Scrapy, Portia, Splash, and Frontera. This program is open to students who are interested in working on open source projects with professional mentors. We’ve actually hired two of our previous participants, so you might even get a job out of this opportunity!

Scrapinghub is running under the Python Software Foundation umbrella, so please take the time to read through their guidelines before applying.

Applications opened on March 14 and close on March 25. We’re looking forward to working with you!

Libraries

Dateparser 0.3.4

Changes from 0.3.1 (last October):

New features:

  • Added Hijri Calendar support.
  • Added settings for better control over parsing dates.
  • Support for converting parsed times to a given timezone, for both complete and relative dates.
  • Finnish language support. (from 0.3.3)

Improvements:

  • Fixed a problem with caching datetime.now in FreshnessDateDataParser.
  • Added month and weekday name abbreviations for several languages.
  • More simplifications for Russian and Ukrainian languages.
  • Fixed a problem with parsing time component of date strings with several kinds of apostrophes.
  • Faster parsing after switching to regex module. (from 0.3.3)
  • You can now use the RETURN_AS_TIMEZONE_AWARE setting to return timezone-aware date objects. (from 0.3.3)
  • Fixed conflicts with month/weekday names similarity across languages. (from 0.3.3)

Note that 0.3.4 requires python-dateutil 2.4.2 or earlier. It doesn’t work with python-dateutil 2.5.
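
As a rough illustration of the RETURN_AS_TIMEZONE_AWARE setting mentioned above, the sketch below passes it through the settings argument of dateparser.parse. The helper name parse_aware is made up for this example, and dateparser has to be installed separately.

```python
# Settings dict passed to dateparser; the key is the one named in the changelog.
AWARE_SETTINGS = {"RETURN_AS_TIMEZONE_AWARE": True}


def parse_aware(text):
    """Parse a date string and ask dateparser for a timezone-aware result.

    parse_aware is a hypothetical helper, not part of dateparser.
    """
    # Imported lazily so this module loads even where dateparser is missing.
    import dateparser  # third-party: pip install dateparser

    return dateparser.parse(text, settings=AWARE_SETTINGS)
```

Calling parse_aware("2 hours ago") should then return a datetime with tzinfo set, or None if the string cannot be parsed.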

Portia 2.0 Beta

The beta version of Portia 2.0 is out! This major release comes with a completely overhauled UI and plenty of new fancy tricks (including multiple item extraction) to help make automatic data extraction even easier. Stay tuned for the official release and in the meantime, try out Portia 2.0 beta and let us know what you think.

The other big news in the Portia camp is the closure of Kimono Labs. For those affected, we offer a Kimono Labs to Portia migration so that you don’t need to lose any of your work.

Frontera

We released Frontera version 0.4 in January, but we feel it deserves more coverage.

  • Distributed Frontera and Frontera were merged together into a single project to make it easier to use and understand.
  • We completely redesigned the Backend concept. It now consists of Queue, Metadata and States objects for low-level code and higher-level Backend implementations for crawling policies. Currently there is one revisiting policy demonstrating how to use that abstraction.
  • The overall distributed concept is now integrated into Frontera. This makes the difference between using components in single-process mode and in the distributed spiders/backend run modes clearer.
  • We significantly restructured and expanded the documentation, addressing user needs in a more accessible way by presenting the most popular Frontera use cases as a starting point.
  • Less of a configuration footprint since the current version requires smaller configuration files to perform a quick start.

Let us know what you think! (use v0.4.1 from PyPI)

Scrapinghub Command Line Client

The Scrapinghub command line client, Shub, has long lived as merely a fork of scrapyd-client, the command line client for scrapyd. Last January, we freed it in the form of Shub v2.0! This release brings many new features and major improvements in usability.

If you work with multiple Scrapinghub projects, or even multiple API keys, you were probably irritated by the amount of repetition you needed to put into your scrapy.cfg file.

Shub v2.0 now reads from its own configuration file, scrapinghub.yml, where you can configure different projects or keys on a single line. You don’t need to worry about migrating your configuration as Shub will automatically generate new configuration files from your old ones. To avoid storing your API keys in version control, you can run shub login which will take your API key and create a configuration file, .scrapinghub.yml, in your home directory. Shub will read this file by default, so you don’t need to specify the API key in future deployments.

If you’re new to deploying your projects to Scrapinghub, or have just started a new project, running shub deploy in the project folder will guide you through a wizard and automatically generate your configuration files. No need to copy-and-paste from our web interface anymore!

We haven’t only worked on deploying projects and onboarding new users. Shub now provides a much nicer shell experience, with a dedicated help page for every command (shub schedule --help) and extensive error messages. If you’re not used to installing Python packages from the command line, our new stand-alone binaries (including for Windows) might be for you.

A particularly long-awaited new feature is the ability to view log entries, or items, live as they are being scraped. Just run shub log -f JOBID and watch your spiders at work. Shub will let you know the JOBID when you schedule a run via shub schedule. Alternatively, you can simply look it up on the web interface.

Find the full documentation here. You can install shub v2.0.2 via pip install -U shub, or get the binaries here.

Don’t forget to tell us what you think!

Wrap Up

Thus concludes the March edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so please explore our GitHub. And remember, students, there are a variety of project ideas available across our open source projects, so apply to work with Scrapinghub on Google Summer of Code 2016.
