Introducing w3lib and scrapely

In an effort to make the Scrapy code smaller and more reusable, we've been working on splitting the Scrapy codebase into two separate libraries:

  1. w3lib
  2. scrapely

w3lib

A library of simple, reusable functions for working with URLs, HTML, forms, and HTTP: the kind of things that aren't found in the Python standard library. It has no external dependencies.
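
Here's a minimal sketch of the kind of helpers it provides (these functions live in w3lib.url and w3lib.html; the URLs are just examples):

from w3lib.url import safe_url_string, url_query_cleaner
from w3lib.html import remove_tags

# Percent-encode non-ASCII characters so the URL is safe to send over HTTP
safe_url_string(u'http://example.com/café')
# -> 'http://example.com/caf%C3%A9'

# Keep only the query parameters you care about
url_query_cleaner('http://example.com/?id=5&utm_source=feed', ['id'])
# -> 'http://example.com/?id=5'

# Strip markup from an HTML snippet, keeping the text
remove_tags(u'Hello <b>world</b>!')
# -> u'Hello world!'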

For more info, see the w3lib page on GitHub.

scrapely

Scrapely is a library for extracting structured data from HTML pages. What sets it apart from other Python web scraping libraries is that it doesn't depend on lxml or libxml2; instead, it uses an internal pure-Python parser that can handle poorly formed HTML. The HTML is converted into an array of token ids, which is then used to match the items to be extracted.

Scrapely depends on numpy (which it uses to speed up calculations) and on w3lib.
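
To give an idea of the workflow, here's a minimal sketch using scrapely's Scraper class and its train/scrape methods (the URLs and field names are just examples):

from scrapely import Scraper

s = Scraper()

# Train with one example page plus the data it contains; scrapely learns
# the token patterns surrounding each value
train_url = 'http://pypi.python.org/pypi/w3lib/1.1'
s.train(train_url, {'name': 'w3lib 1.1', 'author': 'Scrapy project'})

# Scrape a similar page: the learned patterns are matched against its tokens
s.scrape('http://pypi.python.org/pypi/Django/1.3')
# -> a list of dicts, e.g. [{'name': [u'Django 1.3'], 'author': [...]}]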

You can find more info, or try it out, on the scrapely GitHub page.

Scrapy codebase

After these changes, the Scrapy codebase has been reduced by 5457 lines, including blank lines and comments (according to cloc).

Before:

$ cloc /tmp/scrapy2/scrapy
     333 text files.
     332 unique files.                                       
      18 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (628.0 files/s, 66050.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              301      5819      5341     20663 x   4.20 =       86784.60
HTML                 11       117        93       792 x   1.90 =        1504.80
XML                   2         1         0       199 x   1.90 =         378.10
-------------------------------------------------------------------------------
SUM:                314      5937      5434     21654 x   4.09 =       88667.50
-------------------------------------------------------------------------------

After:

$ cloc /tmp/scrapy/scrapy
     308 text files.
     307 unique files.                                       
      14 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (586.0 files/s, 55136.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              284      5206      3801     18242 x   4.20 =       76616.40
XML                   2         1         0       199 x   1.90 =         378.10
HTML                  7        17         0       102 x   1.90 =         193.80
-------------------------------------------------------------------------------
SUM:                293      5224      3801     18543 x   4.16 =       77188.30
-------------------------------------------------------------------------------

Scrapy dependencies

Scrapy 0.14 will depend on w3lib. The current development version (Scrapy 0.13) already depends on it, and w3lib is packaged in the official APT repos (as python-w3lib), so if you're running Scrapy 0.13 on Ubuntu you can upgrade safely. Otherwise, you can always install or upgrade it with easy_install or pip. The stable version (Scrapy 0.12) is not affected by this change at all.
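
For reference, the usual commands (python-w3lib is the APT package name mentioned above):

$ pip install w3lib                    # or: easy_install w3lib
$ sudo apt-get install python-w3lib    # on Ubuntu, from the official APT repos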

If you have any comments or questions feel free to post them in the scrapy-users group.
