Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.
If you’re interested in learning more or even becoming a contributor, reach out to us by emailing firstname.lastname@example.org or on Twitter @scrapinghub.
This past May, the release of Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And two weeks ago, Scrapy reached 15,000 stars on GitHub, making it the 10th most-starred Python project there! We are very proud of this and want to thank all our users, stargazers, and contributors!
What’s coming in Scrapy 1.2 (in a couple of weeks)?
- The ability to specify the encoding of items in your JSON, CSV or XML output files
- Creating Scrapy projects in any folder you want (not only the current one)
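Here is a quick sketch of how those two features look in practice; the setting name and project path below are illustrative, based on the 1.2 release notes:

```python
# settings.py -- a sketch of the new output-encoding option in Scrapy 1.2.
# FEED_EXPORT_ENCODING controls the encoding used when writing your
# JSON, CSV, or XML feed output files (e.g. plain UTF-8 instead of
# ASCII with escaped characters in JSON).
FEED_EXPORT_ENCODING = "utf-8"

# And from the command line, "scrapy startproject" now accepts a target
# directory, so the project no longer has to live in the current folder
# ("myproject" and the path are placeholder names):
#
#   scrapy startproject myproject /path/to/any/folder
```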
We’re moving various Scrapy middlewares and helpers to their own repositories under the scrapy-plugins organization on GitHub. They are all available on PyPI. Many of these were previously bundled inside scrapylib (which will not see a new release).
Here are some of the newly released ones:
- scrapy-querycleaner: cleans up query parameters in URLs; helpful when some of them are irrelevant (you get the same page with or without them), thus avoiding duplicate page fetches.
- scrapy-magicfields: automagically add special fields in your scraped items such as timestamps, response attributes, etc.
- scrapy-deltafetch: skips pages from which you already scraped items in previous crawls; our very own Valdir wrote about it in July’s Scrapy Tips From the Pros column.
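Since these now live as standalone packages, enabling one is a pip install plus a couple of settings. A minimal sketch for scrapy-deltafetch, following its README (the middleware path and setting names are the plugin's, the priority value is an arbitrary example):

```python
# settings.py -- enabling scrapy-deltafetch in a Scrapy project.
# First: pip install scrapy-deltafetch

# Register the spider middleware (100 is an example priority).
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}

# Turn it on; requests whose responses already yielded items in a
# previous crawl will be skipped on subsequent runs.
DELTAFETCH_ENABLED = True
```

The other plugins follow the same pattern: install from PyPI, add the middleware to the relevant settings dict, and configure via their documented options.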
In mid-June we released version 0.4 of Dateparser with quite a few parsing improvements and new features (as well as several bug fixes). For example, this version introduces its own parser, replacing the one from dateutil. However, the two may converge again at some point in the future.
It also handles relative dates in the future, e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with ruamel.yaml, one of its actively maintained forks. We hope you enjoy it!
Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day ;-)
w3lib v1.15 now ships a canonicalize_url() function, extracted from Scrapy’s helpers. You may find it handy when walking in the jungle of non-ASCII URLs in Python 3!
And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!