This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub.


Scrapy

The release of Scrapy 1.1 (with Python 3 support) this past May was a big milestone for our Python web scraping community. And two weeks ago, Scrapy reached 15,000 stars on GitHub, making it the 10th most starred Python project on GitHub! We are very proud of this and want to thank all our users, stargazers and contributors!

What’s coming in Scrapy 1.2 (in a couple of weeks)?

  • The ability to specify the encoding of items in your JSON, CSV or XML output files
  • The ability to create Scrapy projects in any folder you want, not only the current one (see the sketch right after this list)
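
Here is a minimal sketch of what the first item should look like once 1.2 lands (based on the current development docs, so consider it provisional):

    # settings.py -- upcoming feed export encoding setting (Scrapy 1.2).
    # With 'utf-8', non-ASCII characters are written to your JSON output
    # as-is instead of \uXXXX escape sequences.
    FEED_EXPORT_ENCODING = 'utf-8'

As for the second item, startproject will accept an optional target directory, along the lines of scrapy startproject myproject path/to/anywhere.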

Scrapy Plugins

We’re moving various Scrapy middlewares and helpers into their own repositories under the scrapy-plugins organization on GitHub. They are all available on PyPI. Many of these were previously bundled inside scrapylib (which will not see a new release).

Here are some of the newly released ones:

  • scrapy-querycleaner: cleans up query parameters in URLs; helpful when some of them are irrelevant (you get the same page with or without them), thus avoiding duplicate page fetches.
  • scrapy-magicfields: automagically adds special fields to your scraped items, such as timestamps, response attributes, etc. (a sketch of how to enable both follows this list).
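
Both are a pip install away and are enabled like any other Scrapy middleware. The snippet below is only a sketch; see each plugin’s README on GitHub for the exact middleware paths, setting names and field syntax.

    # settings.py -- illustrative sketch only; check each plugin's README
    # for the exact middleware paths and setting names.
    SPIDER_MIDDLEWARES = {
        'scrapy_querycleaner.QueryCleanerMiddleware': 100,
        'scrapy_magicfields.MagicFieldsMiddleware': 200,
    }

    # querycleaner: drop query parameters whose names match this pattern
    QUERYCLEANER_REMOVE = 'sessionid|utm_.*'

    # magicfields: extra fields stamped onto every scraped item
    MAGIC_FIELDS = {
        'timestamp': '$time',
        'source_url': 'scraped from $url',
    }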

Libraries

Dateparser

In mid-June we released version 0.4 of Dateparser with quite a few parsing improvements and new features, as well as several bug fixes. For example, this version introduces its own parser, replacing the one from dateutil (although we may converge back at some point in the future).

It also handles relative dates in the future, e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with one of its actively maintained forks, ruamel.yaml. We hope you enjoy it!
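
A quick taste, assuming dateparser 0.4 is installed (the relative examples are resolved against the moment you run them):

    import dateparser  # pip install dateparser

    # plain dates keep working as before
    dateparser.parse('12/12/12')      # datetime.datetime(2012, 12, 12, 0, 0)

    # new in 0.4: relative dates in the future
    dateparser.parse('tomorrow')      # tomorrow, as a datetime
    dateparser.parse('in two weeks')  # two weeks from now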

Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day ;-)

w3lib

w3lib v1.15 now includes a canonicalize_url() function, extracted from Scrapy’s helpers. You may find it handy when walking through the jungle of non-ASCII URLs in Python 3!
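
A minimal example (illustrative URLs, of course):

    from w3lib.url import canonicalize_url  # w3lib >= 1.15

    # sorts query arguments and normalizes the URL
    canonicalize_url('http://www.example.com/do?c=3&b=2&a=1')
    # 'http://www.example.com/do?a=1&b=2&c=3'

    # non-ASCII paths and query strings come back safely percent-encoded,
    # which is exactly what you want when building requests in Python 3
    canonicalize_url(u'http://www.example.com/résumé?q=héllo')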

Wrap Up

And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!
