Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.
If you’re interested in learning more or even becoming a contributor, reach out to us by email at firstname.lastname@example.org or on Twitter @scrapinghub.
For those who missed the big news, Scrapy 1.1 is live! It’s the first official release that comes with Python 3 support, so you can go ahead and move your stack over.
The major changes in this release since the RC1 we announced in February are improved HTTPS connections (with proxy support) and better handling of URLs with non-ASCII characters. Make sure you upgrade w3lib to 1.14.2.
We’re very grateful for the feedback we received during the release candidate phase. A huge thanks to all the reporters, reviewers and code/documentation contributors.
Notable limitations still present in this release include:
- Scrapy 1.1 doesn’t work on Windows under Python 3 (Twisted is not fully ported to Python 3 on Windows, but we’ll keep an eye out for updates to this situation).
- Scrapy’s FTP support, Telnet console, and email features do not work in Python 3.
Splash 2.1 now lets you:
- Save large arguments to the Splash server so you don’t need to send them in every request. This is particularly useful when you want to cache Lua scripts.
- Take screenshots of page regions instead of the whole viewport.
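To give a feel for the argument caching described above, here is a rough sketch of the request payloads involved. It assumes the `save_args` and `load_args` parameters of the Splash HTTP API and the `X-Splash-Saved-Arguments` response header; the helper names are ours, the header format (semicolon-separated `name=hash` pairs) is an assumption based on our reading of the docs, and nothing is actually sent over the network here.

```python
import json

# A small Lua script we would rather not resend with every request.
LUA_SOURCE = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def first_request_payload(url, lua_source):
    """Payload for the first request: send the script and ask Splash to cache it."""
    return {"url": url, "lua_source": lua_source, "save_args": ["lua_source"]}


def parse_saved_arguments(header_value):
    """Turn 'name=hash;other=hash2' (assumed header format) into a dict."""
    pairs = (item.split("=", 1) for item in header_value.split(";") if item)
    return {name.strip(): digest.strip() for name, digest in pairs}


def followup_payload(url, saved):
    """Payload for later requests: refer to the cached script by its hash."""
    return {"url": url, "load_args": saved}


# Hypothetical header value, standing in for a real Splash response.
saved = parse_saved_arguments("lua_source=abc123")
print(json.dumps(followup_payload("http://example.com", saved)))
```

In practice you would POST these payloads as JSON to a Splash endpoint such as `/execute` and fall back to resending the full argument if the server no longer has it cached.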
If you’re using the Scrapy-Splash plugin (formerly “scrapyjs”), we encourage you to upgrade to v0.7, the latest version. It includes many goodies that make integrating with Scrapy much easier. Check the latest README for details, especially the scrapy_splash.SplashRequest utility.
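As a starting point, a Scrapy-Splash setup in `settings.py` looks roughly like the fragment below. The middleware paths and priorities follow the README as we understand it, and the Splash address is an assumption (a local instance on the default port), so double-check against the README for the version you install.

```python
# settings.py (fragment)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Dupefilter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# In a spider you would then yield something like:
#     SplashRequest(url, self.parse, args={"wait": 0.5})
# instead of a plain scrapy.Request.
```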
Google Summer of Code 2016
We’re thrilled to have 5 students this year:
- Aron Bordin is working on supporting spiders in other programming languages with Scrapy.
- Preetwinder Bath is porting Frontera to Python 3.
- Tamer Tas is working on dockerization and orchestration of Frontera deployments.
- Avishkar Gupta is replacing PyDispatcher to improve Scrapy’s signaling API performance.
- Michael Manukyan is adding web scraping helpers for Splash.
Scrapy relies on lxml and cssselect for all the XPath and CSS selection awesomeness that we use each and every day at Scrapinghub. We learned that Simon Sapin, author of the cssselect package, was looking for new maintainers. So we put ourselves forward and now cssselect is hosted under the Scrapy organization on GitHub. Don’t worry though, Simon is still involved! We’re planning on fixing a few corner cases and maybe working on CSS Selectors Level 4. We’ll definitely need assistance with this task, so please reach out if you’re interested in helping out!
We released Dateparser 0.3.5 with support for dates in Danish and Japanese. It also handles dates with accents much better and works with the latest version of python-dateutil.
Check the full release notes here.
It’s on PyPI and is now Python 3-compatible.
Check this Jupyter/IPython notebook for an overview of what you can do with it and make sure to let us know what you think.
We updated our w3lib library to handle non-ASCII URLs better, as part of adding Python 3 support to Scrapy 1.1. We recommend that you upgrade to version 1.14.2, the latest release.
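If you’re wondering what “handling non-ASCII URLs” means in practice, the sketch below shows the general idea using only the standard library: the host is IDNA-encoded and the path, query, and fragment are percent-encoded as UTF-8. This is not w3lib’s implementation (w3lib offers this kind of normalization through helpers such as `w3lib.url.safe_url_string`); the function name `to_ascii_url` is ours, and the sketch deliberately ignores user credentials in the URL.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched, including "%" so existing escapes are not re-encoded.
SAFE = "/%:@!$&'()*+,;=~?"


def to_ascii_url(url):
    """Return an ASCII-only version of `url` (simplified sketch)."""
    parts = urlsplit(url)
    # Internationalized host names become Punycode via the IDNA codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Non-ASCII characters elsewhere are percent-encoded as UTF-8 bytes.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("http://example.com/café?q=naïve"))
```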
We’ve made some changes to Slybot, the Portia crawler, that include:
- Re-added nested regions and text data annotations.
- Selectors now handle comments correctly.
- Added automatic link following seeded with start URLs and sample URLs.
- Added the option to adjust the Splash wait time.
For Portia itself:
- New download API endpoint: GET portia/api/projects/PID/download[/SID]
Most of the recent developments have been taking place in the Portia beta.
The big changes include:
- Clustering of pages during extraction to decide which sample to use for extraction.
- Download Portia spider as Scrapy code: GET portia/api/projects/PID/download[/SID]?format=code
- Files are now accessed through a Django-style Storage object.
- Database access is more consistent for the MySQL backend.
- Better element overlays; they can now be split across lines.
- Re-added the toggle CSS option for samples, so you can now annotate hidden elements.
- UI usable on low resolution screens, thanks to smarter wrapping.
- Inform user of unpublished changes when using Git backend.
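The download endpoints mentioned above are easy to script. Here is a minimal sketch that only builds the URLs; the base address, project ID, and spider ID are placeholders of our own, and the function name is hypothetical. The `format=code` variant applies to the beta.

```python
from urllib.parse import urlencode


def portia_download_url(base_url, project_id, spider_id=None, as_code=False):
    """Build a Portia download URL for a project, or for one spider in it."""
    path = f"portia/api/projects/{project_id}/download"
    if spider_id is not None:
        path += f"/{spider_id}"
    url = f"{base_url.rstrip('/')}/{path}"
    if as_code:
        # Ask for the spider as Scrapy code (Portia beta).
        url += "?" + urlencode({"format": "code"})
    return url


# Placeholder host and IDs, for illustration only.
print(portia_download_url("http://localhost:9001", "PID"))
print(portia_download_url("http://localhost:9001", "PID", "SID", as_code=True))
```

You can then fetch the resulting URL with a GET request from whichever HTTP client you prefer.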
Try out the beta using the nui-develop branch.
Frontera 0.5 introduces an improved crawling strategy, new logging, and better test coverage.
Scrapy-mosquitera is a library that helps Scrapy spiders crawl more efficiently. In its basic form, it’s a collection of matchers and a mixin to narrow down the crawl to a specific date range. However, you can extend it to apply to any domain (URL paths, location filtering, etc.). You can find more details about how it works and how you can create your own matchers in the documentation.
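To illustrate the idea of narrowing a crawl to a date range, here is a plain-Python sketch. It is not scrapy-mosquitera’s API: both function names are hypothetical, and the stopping rule assumes a listing sorted newest-first, which is our simplification.

```python
from datetime import date


def in_date_range(candidate, since=None, until=None):
    """Return True if `candidate` falls within [since, until] (open-ended if None)."""
    if since is not None and candidate < since:
        return False
    if until is not None and candidate > until:
        return False
    return True


def should_keep_paginating(listing_dates, since):
    """On a newest-first listing, keep going only while the oldest item is recent enough."""
    return bool(listing_dates) and min(listing_dates) >= since


# Hypothetical dates scraped from one listing page.
page_dates = [date(2016, 6, 3), date(2016, 5, 28), date(2016, 5, 20)]
wanted = [d for d in page_dates if in_date_range(d, since=date(2016, 5, 25))]
print(wanted, should_keep_paginating(page_dates, since=date(2016, 5, 25)))
```

The payoff is that the spider can stop requesting further pages once everything on the current page is older than the range it cares about.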
This concludes the June edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so if you’re interested, feel free to explore our GitHub.