This Month in Open Source at Scrapinghub June 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource@scrapinghub.com or on Twitter at @scrapinghub.

Scrapy 1.1

For those who missed the big news, Scrapy 1.1 is live! It’s the first official release that comes with Python 3 support, so you can go ahead and move your stack over.

The major changes in this release since the RC1 we announced in February include improved HTTPS connections (with proxy support) and better handling of URLs with non-ASCII characters. If you upgrade, make sure you also upgrade w3lib to 1.14.2.

We’re very grateful for the feedback we received during the release candidate phase. A huge thanks to all the reporters, reviewers and code/documentation contributors.

If you find anything that’s not working, please take a few minutes to report the issue(s) on GitHub.

Notable limitations still present in this release include:

  • Scrapy 1.1 doesn’t work on Windows under Python 3 (Twisted is not fully ported to Python 3 on Windows, but we’ll keep an eye on the situation).
  • Scrapy’s FTP support, Telnet console, and email sending do not yet work under Python 3.

Splash 2.1

Splash 2.1 is out! See the project’s changelog for the full list of new features.

If you’re using the Scrapy-Splash plugin (formerly “scrapyjs”), we encourage you to upgrade to the latest v0.7 version. It includes many goodies that make integrating with Scrapy much easier. Check the latest README for details, especially the scrapy_splash.SplashRequest utility.
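For context, wiring scrapy-splash into a project is mostly a settings change. The sketch below follows the project’s README (middleware names and order values as documented there; verify against the README for your version):

```python
# settings.py -- scrapy-splash wiring, per the project's README
SPLASH_URL = 'http://localhost:8050'  # address of your running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

In a spider you then yield `scrapy_splash.SplashRequest(url, self.parse, args={'wait': 0.5})` instead of a plain Request, and the rendered page comes back as the response.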

Google Summer of Code 2016

We’re thrilled to have 5 students this year:

  • Aron Bordin is working on supporting spiders in other programming languages with Scrapy.
  • Preetwinder Bath is porting Frontera to Python 3.
  • Tamer Tas is working on dockerization and orchestration of Frontera deployments.
  • Avishkar Gupta is replacing PyDispatcher to improve Scrapy’s signaling API performance.
  • Michael Manukyan is adding web scraping helpers for Splash.

We’d like to thank the Python Software Foundation for again having Scrapinghub as a sub-org this year!

Libraries

cssselect maintenance

Scrapy relies on lxml and cssselect for all the XPath and CSS selection awesomeness that we use each and every day at Scrapinghub. We learned that Simon Sapin, the author of the cssselect package, was looking for new maintainers. So we put ourselves forward, and now cssselect is hosted under the Scrapy organization on GitHub. Don’t worry though, Simon is still involved! We’re planning on fixing a few corner cases and maybe working on CSS Selectors Level 4. We’ll definitely need assistance with this task, so please reach out if you’re interested in helping out!

Dateparser 0.3.5

We released Dateparser 0.3.5 with support for dates in Danish and Japanese. It also handles accented dates much better and works with the latest version of python-dateutil.

Check the full release notes here.

js2xml

This side project of mine is now hosted under Scrapinghub’s organization on GitHub. It’s a little helper library to convert JavaScript code into an XML tree. This means you can use XPath and CSS selectors to extract data (strings, objects, function arguments, etc.) from HTML-embedded JavaScript (this does not interpret it though). You’d be amazed at how much valuable data is “hidden” in JavaScript inside web pages.

It’s on PyPI and is now Python 3-compatible.

Check this Jupyter/ipython notebook for an overview of what you can do with it and make sure to let us know what you think.

w3lib 1.14.2

We updated our w3lib library to handle non-ASCII URLs better, as part of adding Python 3 support to Scrapy 1.1. We recommend that you upgrade to the latest 1.14.2 version.

parsel 1.0.2

If you’re using Scrapy 1.1, you’re using parsel under the hood. Parsel is Scrapy Selectors as an independent package. There’s a new release of parsel that fixes the hiding of XPath exceptions.

Portia

We’ve made some changes to Slybot, the Portia crawler, that include:

  • Re-added nested regions and text data annotations.
  • Selectors now handle comments correctly.
  • Added automatic link following seeded with start URLs and sample URLs.
  • Allowed adjusting the Splash wait time.

For Portia itself:

  • New download API endpoint: GET portia/api/projects/PID/download[/SID]

Most of the recent developments have been taking place in the Portia beta.

The big changes include:

  • Clustering of pages during extraction to decide which sample to use.
  • Download a Portia spider as Scrapy code: GET portia/api/projects/PID/download[/SID]?format=code
  • Uses a Django-style Storage object for accessing files.
  • More consistent database access for the MySQL backend.
  • Better element overlays; they can now be split across lines.
  • Re-added the toggle CSS option for samples, so you can now annotate hidden elements.
  • UI usable on low-resolution screens, thanks to smarter wrapping.
  • Informs you of unpublished changes when using the Git backend.

Try out the beta using the nui-develop branch.

Frontera

Frontera 0.5 introduces an improved crawling strategy, new logging, and better test coverage.

Mosquitera

Scrapy-mosquitera is a library that helps Scrapy spiders perform more efficient crawls. In its basic form, it’s a collection of matchers and a mixin that narrow the crawl down to a specific date range. However, you can extend it to other domains (URL paths, location filtering, etc.). You can find more details about how it works, and how to create your own matchers, in the documentation.
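To illustrate the idea behind matchers (this is a standalone sketch, not scrapy-mosquitera’s actual API), a date-range matcher is just a predicate the spider can consult before following a link or keeping an item:

```python
from datetime import date


def make_date_range_matcher(start, end):
    """Return a matcher that accepts only dates inside [start, end]."""
    def matches(candidate):
        return start <= candidate <= end
    return matches


# Keep only pages dated within June 2016
in_june = make_date_range_matcher(date(2016, 6, 1), date(2016, 6, 30))
print(in_june(date(2016, 6, 15)), in_june(date(2016, 7, 1)))
```

The real library packages this pattern up with Scrapy-aware plumbing so the matching happens automatically during the crawl.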

Wrap Up

This concludes the June edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so if you’re interested, feel free to explore our GitHub.
