Introducing w3lib and scrapely

In an effort to make Scrapy code smaller and more reusable, we’ve been working on splitting the Scrapy codebase into two different modules:

  1. w3lib
  2. scrapely

w3lib

A library with simple, reusable functions for working with URLs, HTML, forms, and HTTP: things that aren't found in the Python standard library. This library has no external dependencies.
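As a rough illustration of the kind of helper w3lib provides, here is a stdlib-only sketch of a query-string cleaner (w3lib's actual implementation and function names differ; this just shows the idea):

```python
# Toy sketch of a w3lib-style URL helper, using only the stdlib.
# This is NOT w3lib's code; it only illustrates the kind of small,
# reusable URL function the library collects.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def clean_query(url, keep):
    """Drop every query parameter whose name is not in `keep`."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return urlunsplit(parts._replace(query=urlencode(params)))

print(clean_query("http://example.com/page?id=5&ref=feed", {"id"}))
# http://example.com/page?id=5
```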

For more info, see the w3lib Github page.

scrapely

Scrapely is a library for extracting structured data from HTML pages. What makes it different from other Python web scraping libraries is that it doesn't depend on lxml or libxml2. Instead, it uses an internal pure-Python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.
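To make the token-id idea concrete, here is a toy tokenizer (not scrapely's actual code) that splits HTML into tag and text tokens and maps each distinct token to a small integer id:

```python
# Toy illustration of converting HTML into an array of token ids,
# the representation described above. Scrapely's real tokenizer is
# more involved; this only sketches the principle.
import re

def tokenize(html):
    """Split HTML into tags and text chunks; map each distinct token
    to an integer id so repeated tokens share the same id."""
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    ids, table = [], {}
    for tok in tokens:
        ids.append(table.setdefault(tok, len(table)))
    return ids, table

ids, table = tokenize("<p>one</p><p>two</p>")
print(ids)  # [0, 1, 2, 0, 3, 2] -- both <p> tags share id 0
```

Matching extraction templates against a page then becomes a comparison of integer arrays rather than strings, which is where numpy helps.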

Scrapely depends on numpy (it uses it to speed up calculations) and w3lib.

You can find more info, or try it out, on the Github page.

Scrapy codebase

After these changes, the Scrapy codebase has been reduced by 4574 lines, including blank lines and comments (according to cloc).

Before:

$ cloc /tmp/scrapy2/scrapy
     333 text files.
     332 unique files.                                       
      18 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (628.0 files/s, 66050.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              301      5819      5341     20663 x   4.20 =       86784.60
HTML                 11       117        93       792 x   1.90 =        1504.80
XML                   2         1         0       199 x   1.90 =         378.10
-------------------------------------------------------------------------------
SUM:                314      5937      5434     21654 x   4.09 =       88667.50
-------------------------------------------------------------------------------

After:

$ cloc /tmp/scrapy/scrapy
     308 text files.
     307 unique files.                                       
      14 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (586.0 files/s, 55136.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              284      5206      3801     18242 x   4.20 =       76616.40
XML                   2         1         0       199 x   1.90 =         378.10
HTML                  7        17         0       102 x   1.90 =         193.80
-------------------------------------------------------------------------------
SUM:                293      5224      3801     18543 x   4.16 =       77188.30
-------------------------------------------------------------------------------

Scrapy dependencies

Scrapy 0.14 will depend on w3lib. Scrapy 0.13 (the current development version) already depends on it, and w3lib is packaged and provided in the official APT repos (package python-w3lib), so if you're using Scrapy 0.13 on Ubuntu you can upgrade safely. Otherwise, you can always install or upgrade it with easy_install or pip. The stable version (Scrapy 0.12) is not affected at all by this change.

If you have any comments or questions feel free to post them in the scrapy-users group.
