This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub.

Scrapy

This past May, the release of Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. Two weeks ago, Scrapy reached 15k stars on GitHub, making it the 10th most starred Python project there! We are very proud of this and want to thank all our users, stargazers, and contributors!

What’s coming in Scrapy 1.2, due in a couple of weeks?

  • The ability to specify the encoding of items in your JSON, CSV, or XML output files
  • The ability to create Scrapy projects in any folder you want (not only the current one)
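
To illustrate the first item: JSON writers commonly escape non-ASCII characters as \uXXXX sequences, and an encoding option lets you write them as readable text instead. The sketch below reproduces that difference with Python's standard json module only; it is not Scrapy's exporter code, and the sample item is made up. In Scrapy 1.2 the option is expected to be a setting named FEED_EXPORT_ENCODING (for example, FEED_EXPORT_ENCODING = 'utf-8' in settings.py); treat that name as an assumption until the release notes confirm it.

```python
import json

# Hypothetical scraped item, used only for illustration.
item = {"name": "Café Zürich", "price": "12 €"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(item, ensure_ascii=True)

# With escaping turned off, the same item serializes to readable text
# that can then be written to a file in the encoding of your choice.
readable = json.dumps(item, ensure_ascii=False)
utf8_bytes = readable.encode("utf-8")

print(escaped)
print(readable)
```

Both strings decode to the same item; only the on-disk representation differs.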

Scrapy Plugins

We’re moving various Scrapy middleware and helpers to their own repositories under the scrapy-plugins organization on GitHub. They are all available on PyPI.
Many of these were previously bundled in scrapylib (which will not see a new release).

Here are some of the newly released ones:

  • scrapy-querycleaner: cleans up query parameters in URLs; helpful when some of them are not relevant (you get the same page with or without them), so you avoid fetching duplicate pages.
  • scrapy-magicfields: automagically adds special fields to your scraped items, such as timestamps, response attributes, etc.
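
The idea behind scrapy-querycleaner can be sketched with the standard library alone. The helper below, remove_query_params, is a hypothetical name and not the plugin's API; it drops parameters whose names match a pattern, so URLs that differ only in irrelevant parameters collapse into one.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def remove_query_params(url, pattern):
    """Return url without the query parameters whose names match pattern."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not re.fullmatch(pattern, name)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://www.example.com/product?id=7&utm_source=newsletter",
    "http://www.example.com/product?id=7&utm_source=twitter&utm_medium=social",
    "http://www.example.com/product?id=7",
]

# All three URLs point at the same page once tracking parameters are removed.
cleaned = {remove_query_params(u, r"utm_.*") for u in urls}
print(cleaned)
```

In a crawl, deduplicating on the cleaned URL means the page is fetched once instead of three times.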

Libraries

Dateparser

In mid-June, we released version 0.4 of Dateparser with quite a few parsing improvements and new features (as well as several bug fixes). For example, this version introduces its own parser, replacing the one from dateutil. However, we may converge back at some point in the future.

It also handles relative dates in the future, e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with one of its active forks, ruamel.yaml. We hope you enjoy it!

Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day 😉

w3lib

w3lib v1.15 now has a canonicalize_url() function, extracted from Scrapy helpers. You may find it handy when walking through the jungle of non-ASCII URLs in Python 3!

Wrap Up

And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!
