
XPath tips from the web scraping trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors. In case you’re looking for a tutorial, here is an XPath tutorial with nice examples.

In this post, we’ll show you some tips we found valuable when using XPath in the trenches, using the Scrapy Selector API for our examples.

Avoid using contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text elements (a node-set). And when a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> xp = lambda x: sel.xpath(x).extract() # let's type this only once
>>> xp('//a//text()') # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> xp('string(//a//text())')  # convert it to a string
[u'Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

So, in general:

GOOD:

>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:

>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:

>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:

>>> xp("substring-after(//a//text(), 'Next ')")
[u'']

You can read more detailed explanations about string values of nodes and node-sets in the XPath spec.

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel=Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]") # get all first LI elements under whatever their parent is
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]") # get the first LI element in the whole document
[u'<li>1</li>']
>>> xp("//ul/li[1]")  # get all first LI elements under an UL parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document
[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents.

(//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.
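
For instance, with a small made-up snippet (the markup below is ours, just for illustration):

>>> sel = Selector(text='<p><a href="#s1">one</a> <a href="#s2">two</a></p><p><a href="#s3">three</a></p>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//a[starts-with(@href, '#')][1]")  # first local anchor under each parent
[u'<a href="#s1">one</a>', u'<a href="#s3">three</a>']
>>> xp("(//a[starts-with(@href, '#')])[1]")  # first local anchor in the whole document
[u'<a href="#s1">one</a>']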

When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
>>> xp = lambda x: sel.xpath(x).extract()

BAD: doesn’t work because there are multiple classes in the attribute

>>> xp("//*[@class='content']")
[]

BAD: gets more than we want

>>> xp("//*[contains(@class,'content')]")
[u'<p class="content-author">Someone</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Read more about what you can do with Scrapy’s Selectors here.

Learn to use all the different axes

It is handy to know how to use the axes; you can work through the examples given in the tutorial to quickly review this.

In particular, you should note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and for ancestor and parent.
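
For example, with a small made-up snippet, following-sibling stays within the current parent while following keeps going through the rest of the document:

>>> sel = Selector(text='<div><p>a</p><p>b</p><span>c</span></div><div><p>d</p></div>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp('//p[.="b"]/following-sibling::*')  # only later siblings of the same parent
[u'<span>c</span>']
>>> xp('//p[.="b"]/following::*')  # every element after it in document order
[u'<span>c</span>', u'<div><p>d</p></div>', u'<p>d</p>']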

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from script and style tags and also skips whitespace-only text nodes. Source: http://stackoverflow.com/a/19350897/2572383
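
For example, applied with a Scrapy Selector to a small made-up snippet:

>>> sel = Selector(text='<div><style>p {color: red}</style><p>Hello</p> <script>var a = 1;</script></div>')
>>> sel.xpath("//*[not(self::script or self::style)]/text()[normalize-space(.)]").extract()
[u'Hello']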

Do you have another XPath tip?

Please, leave us a comment with your tips or questions. :)

And for everybody who contributed tips and reviewed this article, a big thank you!

Introducing Data Reviews

One of the most time-consuming parts of building a spider is reviewing the scraped data and making sure it conforms to the requirements and expectations of your client or team. This process is so time consuming that, in many cases, it ends up taking more time than writing the spider code itself, depending on how well the requirements are written. To make this process more efficient we have introduced the ability to comment on data directly in Dash (the Scrapinghub UI), right next to the data, instead of relying on other channels (like issue trackers, emails or chat).

 


 

With this new feature you can discuss problems with data right where they appear without having to copy/paste data around, and have a conversation with your client or team until the issue is resolved. This reduces the time spent on data QA, making the whole process more productive and rewarding.

So go ahead, start adding comments to your data (you can comment whole items or individual fields) and let the conversation flow around data! You can mark resolved issues by archiving comments, and you will see jobs with unresolved (unarchived) comments directly on the Jobs Dashboard.

Last, but not least, you have the Data Reviews API to insert comments programmatically. This is useful, for example, to report problems in post-processing scripts that analyze the scraped data.

Happy scraping!

Extracting schema.org microdata using Scrapy selectors and XPath

Web pages are full of data; that is what web scraping is mostly about. But often you want more than data: you want meaning. Microdata markup embedded in the HTML source helps machines understand what pages are about: contact information, product reviews, events, etc.

Web authors have several ways to add metadata to their web pages: HTML “meta” tags, microformats, social media meta tags (Facebook’s Open Graph protocol, Twitter Cards).

And then there is Schema.org, an initiative from big players in the search space (Yahoo!, Google, Bing and Yandex) to “create and support a common set of schemas for structured data markup on web pages.”

Here are some stats from Web Data Commons project on the use of microdata in the Common Crawl web corpus:

In summary, we found structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%).

Let’s focus on Schema.org syntax for the rest of this article. The markup looks like this (example from schema.org “Getting started” page):

<div itemscope itemtype ="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

Let’s assume you want to extract this microdata and get the movie item from the snippet above.

Using a Scrapy Selector, let’s first loop over elements with an itemscope attribute (each one represents a container for an item’s properties) using .//*[@itemscope], and for each item, get all properties, i.e. elements that have an itemprop attribute:

>>> from scrapy.selector import Selector
>>> selector = Selector(text="""
... <div itemscope itemtype ="http://schema.org/Movie">
...   <h1 itemprop="name">Avatar</h1>
...   <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
...   <span itemprop="genre">Science fiction</span>
...   <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
... </div>""", type="html")
>>> selector.xpath('.//*[@itemscope]')
[<Selector xpath='.//*[@itemscope]' data=u'<div itemscope itemtype="http://schema.o'>]
>>> 

Let’s print out more interesting stuff, like the item’s itemtype and the properties’ values (the text representation of their HTML element):

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
... 
[u'http://schema.org/Movie']
[u'name'] [u'Avatar']
[u'director'] [u'James Cameron']
[u'genre'] [u'Science fiction']
[u'trailer'] [u'Trailer']
>>> 

Hm. The value for trailer isn’t that interesting. We should have selected the href of <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>.

For the following extended example markup, taken from Wikipedia,

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> 
(born <time itemprop="birthDate" datetime="1954-08-16">August 16, 1954</time>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

we see that it’s not always the href attribute that is interesting; it could be datetime, or src for images, or content for meta elements… any attribute, really.

Therefore we need to get the itemprop text content and the element attributes (we use the @* XPath expression for that).

Let’s do that on the 2nd HTML snippet:

>>> selector = Selector(text="""
... <div itemscope itemtype="http://schema.org/Movie">
...   <h1 itemprop="name">Avatar</h1>
...   <div itemprop="director" itemscope itemtype="http://schema.org/Person">
...   Director: <span itemprop="name">James Cameron</span> 
... (born <time itemprop="birthDate" datetime="1954-08-16">August 16, 1954</time>)
...   </div>
...   <span itemprop="genre">Science fiction</span>
...   <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
... </div>""", type="html")
>>> for item in selector.xpath('.//*[@itemscope]'):
...     print item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract(),
...         print property.xpath('@*').extract()
... 
[u'http://schema.org/Movie']
[u'name'] [u'Avatar'] [u'name']
[u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  '] [u'director', u'', u'http://schema.org/Person']
[u'name'] [u'James Cameron'] [u'name']
[u'birthDate'] [u'August 16, 1954'] [u'birthDate', u'1954-08-16']
[u'genre'] [u'Science fiction'] [u'genre']
[u'trailer'] [u'Trailer'] [u'../movies/avatar-theatrical-trailer.html', u'trailer']
[u'http://schema.org/Person']
[u'name'] [u'James Cameron'] [u'name']
[u'birthDate'] [u'August 16, 1954'] [u'birthDate', u'1954-08-16']
>>> 

Wait a minute! We’re getting only attribute values, not attribute names (and itemprop attribute twice for that matter, once for @itemprop and once for @*).

To get names of attributes, one cannot apparently use name() on an attribute… BUT you can do name(@*[i]), i being the attribute position:

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print "Property:",
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
...         for position, attribute in enumerate(property.xpath('@*'), start=1):
...             print "attribute: name=%s; value=%s" % (
...                 property.xpath('name(@*[%d])' % position).extract(),
...                 attribute.extract())
...         print
...     print
... 
Item: [u'http://schema.org/Movie']
Property: [u'name'] [u'Avatar']
attribute: name=[u'itemprop']; value=name

Property: [u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  ']
attribute: name=[u'itemprop']; value=director
attribute: name=[u'itemscope']; value=
attribute: name=[u'itemtype']; value=http://schema.org/Person

Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16

Property: [u'genre'] [u'Science fiction']
attribute: name=[u'itemprop']; value=genre

Property: [u'trailer'] [u'Trailer']
attribute: name=[u'href']; value=../movies/avatar-theatrical-trailer.html
attribute: name=[u'itemprop']; value=trailer


Item: [u'http://schema.org/Person']
Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16


>>> 

There’s still something wrong, right? Duplicate properties.

It’s because the 2nd HTML snippet is using embedded items. Indeed, James Cameron is the director of “Avatar”, but he’s also a Person (yes, he is!). That’s why the markup uses a <div> with its own itemscope attribute for the director property.

How can we fix that, selecting only the properties at the current scope, and leaving nested properties for when we reach the nested item?

Well, it happens that Scrapy Selectors support some EXSLT extensions, notably the set operations. Here, we’ll use set:difference: we’ll select itemprops under the current itemscope, and then exclude those that are themselves descendants of another itemscope under the current one.

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     for property in item.xpath(
...             """set:difference(.//*[@itemprop],
...                               .//*[@itemscope]//*[@itemprop])"""):
...         print "Property:", property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
...         for position, attribute in enumerate(property.xpath('@*'), start=1):
...             print "attribute: name=%s; value=%s" % (
...                 property.xpath('name(@*[%d])' % position).extract(),
...                 attribute.extract())
...         print
...     print
... 
Item: [u'http://schema.org/Movie']
Property: [u'name'] [u'Avatar']
attribute: name=[u'itemprop']; value=name

Property: [u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  ']
attribute: name=[u'itemprop']; value=director
attribute: name=[u'itemscope']; value=
attribute: name=[u'itemtype']; value=http://schema.org/Person

Property: [u'genre'] [u'Science fiction']
attribute: name=[u'itemprop']; value=genre

Property: [u'trailer'] [u'Trailer']
attribute: name=[u'href']; value=../movies/avatar-theatrical-trailer.html
attribute: name=[u'itemprop']; value=trailer


Item: [u'http://schema.org/Person']
Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16


>>> 

But wouldn’t it be nice to keep a reference to the Person item as property of the Movie director itemprop? We’d need to uniquely identify items in the markup, maybe the position of each itemscope element, or its number.

You can use the XPath count() function for that, counting the other itemscopes that appear before the current one in the document (preceding::*[@itemscope]) or are ancestors of it (ancestor::*[@itemscope]):

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     print "ID:", item.xpath("""count(preceding::*[@itemscope])
...                              + count(ancestor::*[@itemscope])
...                              + 1""").extract()
... 
Item: [u'http://schema.org/Movie']
ID: [u'1.0']
Item: [u'http://schema.org/Person']
ID: [u'2.0']
>>> 

Here’s a cleaned up example routine, referencing items by their ID when itemprops are also itemscopes:

>>> import pprint
>>> items = []
>>> for itemscope in selector.xpath('//*[@itemscope][@itemtype]'):
...     item = {"itemtype": itemscope.xpath('@itemtype').extract()[0]}
...     item["item_id"] = int(float(itemscope.xpath("""count(preceding::*[@itemscope])
...                                                  + count(ancestor::*[@itemscope])
...                                                  + 1""").extract()[0]))
...     properties = []
...     for itemprop in itemscope.xpath("""set:difference(.//*[@itemprop],
...                                                       .//*[@itemscope]//*[@itemprop])"""):
...         property = {"itemprop": itemprop.xpath('@itemprop').extract()[0]}
...         if itemprop.xpath('@itemscope'):
...             property["value_ref"] = {
...                 "item_id": int(float(itemprop.xpath("""count(preceding::*[@itemscope])
...                                                      + count(ancestor::*[@itemscope])
...                                                      + 1""").extract()[0]))
...             }
...         else:
...             value = itemprop.xpath('normalize-space(.)').extract()[0]
...             if value:
...                 property["value"] = value
...         attributes = []
...         for index, attribute in enumerate(itemprop.xpath('@*'), start=1):
...             propname = itemprop.xpath('name(@*[%d])' % index).extract()[0]
...             if propname not in ("itemprop", "itemscope"):
...                 attributes.append((propname, attribute.extract()))
...         if attributes:
...             property["attributes"] = dict(attributes)
...         properties.append(property)
...     item["properties"] = properties
...     items.append(item)
... 
>>> pprint.pprint(items)
[{'item_id': 1,
  'itemtype': u'http://schema.org/Movie',
  'properties': [{'itemprop': u'name', 'value': u'Avatar'},
                 {'attributes': {u'itemtype': u'http://schema.org/Person'},
                  'itemprop': u'director',
                  'value_ref': {'item_id': 2}},
                 {'itemprop': u'genre', 'value': u'Science fiction'},
                 {'attributes': {u'href': u'../movies/avatar-theatrical-trailer.html'},
                  'itemprop': u'trailer',
                  'value': u'Trailer'}]},
 {'item_id': 2,
  'itemtype': u'http://schema.org/Person',
  'properties': [{'itemprop': u'name', 'value': u'James Cameron'},
                 {'attributes': {u'datetime': u'1954-08-16'},
                  'itemprop': u'birthDate',
                  'value': u'August 16, 1954'}]}]
>>> 

This dict output is not quite what W3C details in the microdata specs, but it’s close enough to leave the rest to you as an exercise ;-)

You can find a Github Gist of the above code here.

Announcing Portia, the open source visual web scraper!

We’re proud to announce the developer release of Portia, our new open source visual scraping tool based on Scrapy. Check out this video:

As you can see, Portia allows you to visually configure what’s crawled and extracted in a very natural way. It provides immediate feedback, making the process of creating web scrapers quicker and easier than ever before!

Portia is available to developers on GitHub. We plan to offer a hosted version on Scrapinghub soon, which will be compatible with Autoscraping and fully integrated with our platform.

Please send us your feedback!

Optimizing memory usage of scikit-learn models using succinct Tries

We use the scikit-learn library for various machine-learning tasks at Scrapinghub. For example, for text classification we’d typically build a statistical model using sklearn’s Pipeline, FeatureUnion, some classifier (e.g. LinearSVC) + feature extraction and preprocessing classes. The model is usually trained on a developer’s machine, then serialized (using pickle/joblib) and uploaded to a server where the classification takes place.

Sometimes there can be too little available memory on the server for the classifier. One way to address this is to change the model: use simpler features, do feature selection, change the classifier to a less memory intensive one, use simpler preprocessing steps, etc. It usually means trading accuracy for better memory usage.

For text, it is often CountVectorizer or TfidfVectorizer that consumes the most memory. For the last few months we have been using a trick to make them much more memory efficient in production (50x+) without changing anything from a statistical point of view – this is what this article is about.

Let’s start with the basics. Most machine learning algorithms expect fixed size numeric feature vectors, so text should be converted to this format. Scikit-learn provides CountVectorizer, TfidfVectorizer and HashingVectorizer for text feature extraction (see the scikit-learn docs for more info).

CountVectorizer.transform converts a collection of text documents into a matrix of token counts. The counts matrix has a column for each known token and a row for each document; the value is a number of occurrences of a token in a document.

To create the counts matrix, CountVectorizer must know which column corresponds to which token. The CountVectorizer.fit method basically remembers all the tokens from some collection of documents and stores them in a “vocabulary”. The vocabulary is a Python dictionary: keys are tokens (or n-grams) and values are integer ids (column indices) ranging from 0 to len(vocabulary)-1.
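
For example, here is a tiny illustration on a made-up two-document corpus (not part of the benchmark below):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the dog barks", "the cat meows"])
print(vec.vocabulary_)  # token -> column index: {'barks': 0, 'cat': 1, 'dog': 2, 'meows': 3, 'the': 4}
print(vec.transform(["the dog barks"]).toarray())  # [[1 0 1 0 1]]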

Storing such a vocabulary in a standard Python dict is problematic; it can take a lot of memory even on relatively small data.

Let’s try it! Let’s use the “20 newsgroups” dataset available in scikit-learn. The “train” subset of this dataset has about 11k short documents (average document size is about 2KB, or 300 tokens; there are 130k unique tokens; average token length is 6.5).

Create and persist CountVectorizer:

from sklearn import datasets
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer

newsgroups_train = datasets.fetch_20newsgroups(subset='train')
vec = CountVectorizer()
vec.fit(newsgroups_train.data)
joblib.dump(vec, 'vec_count.joblib')

Load and use it:

from sklearn.externals import joblib
vec = joblib.load('vec_count.joblib')
X = vec.transform(['the dog barks'])

On my machine, the loaded vectorizer uses about 82MB of memory in this case. If we add bigrams (by using CountVectorizer(ngram_range=(1,2))) then it would take about 650MB – and this is for a corpus that is quite small.

There are only 130k unique tokens; it’ll require less than 1MB to store these tokens in a plain text file ((6.5+1) * 130k). Maybe add another megabyte to store column indices if they are not implicit (130k * 8). So the data itself should take only a couple of MBs. We may also have to somehow enumerate tokens and enable fast O(1) access to data, so there would be an overhead, but it shouldn’t take 80+MB – we’d expect 5-10MB at most. The serialized version of our CountVectorizer takes about 6MB on disk without any compression, but it expands to 80+MB when loaded into memory.

Why does it happen? There are two main reasons:

  1. Python objects are created for numbers (column indices) and strings (tokens). Each Python object has a pointer to its type + a reference counter (=> +16 bytes overhead per object on 64bit systems); for strings there are extra fields: length, hash, pointer to the string data, flags, etc. (the string representation is different in Python < 3.3 and Python 3.3+).
  2. Python dict is a hash table and introduces overheads – you have to store the hash table itself, pointers to keys and values, etc. There is a great talk on the Python dict implementation by Brandon Rhodes; check it out if you’re interested in knowing more.

Storing a static string->id mapping in a hash table is not the most efficient way to do it: there are perfect hashes, tries, etc. Add the Python object overhead on top, and here we are.

So I decided to try an alternative storage for vocabulary. MARISA-Trie (via Python wrapper) looked like a suitable data structure, as it:

  • is a heavily optimized succinct trie-like data structure, so it compresses string data well
  • provides a unique id for each key for free, and this id is in range from 0 to len(vocabulary)-1 – we don’t have to store these indices ourselves
  • only creates Python objects (strings, integers) on demand.

MARISA-Trie is not a general replacement for dict: you can’t add a key after building, it requires more time and memory to build, lookups (via Python wrapper) are slower – about 10x slower than dict’s, and it works best for “meaningful” string keys which have common parts (not for some random data).
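
As a quick illustration of the Python wrapper (the keys below are made up):

>>> import marisa_trie
>>> trie = marisa_trie.Trie([u'the', u'dog', u'barks'])
>>> len(trie)
3
>>> key_id = trie[u'dog']      # a unique integer id in range(len(trie)), assigned by the trie
>>> trie.restore_key(key_id)   # ...and it maps back to the original key
u'dog'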

I must admit I don’t fully understand how MARISA-Tries work :) The implementation is available in a folder named “grimoire”, and the only information about the implementation I could find is a set of Japanese slides which are outdated (as library author Susumu Yata says). It seems to be a succinct implementation of a Patricia-Trie which can store references to other MARISA-Tries in addition to text data; this allows it to compress more than just prefixes (as in “standard” tries). “Succinct” means the trie is encoded as a bit array.

You may never have heard of this library, but if you have a recent Android phone it is likely MARISA-Trie is in your pocket – a copy of marisa-trie is in the Android 4.3+ source tree.

Ok, great, but we have to tell scikit-learn to use this data structure instead of a dict for vocabulary storage.

Scikit-learn allows passing a custom vocabulary (a dict-like object) to CountVectorizer. But this won’t help us because MARISA-Trie is not exactly dict-like: it can’t be built and modified like a dict. CountVectorizer should build a vocabulary for us (using its tokenization and preprocessing features), and only then can we “freeze” it into a compact representation.

At first, we were doing it using a hack. The fit and fit_transform methods were overridden: first, they call the parent method to build a vocabulary, then they freeze that vocabulary (i.e. build a MARISA-Trie from it) and trick CountVectorizer into thinking a fixed vocabulary was passed to the constructor, and then the parent method is called once more. Calling fit/fit_transform twice is necessary because the indices learned on the first call and the indices in the frozen vocabulary are different. This quick & dirty implementation is here, and this is what we’re using in production.

I recently improved it and removed this “call fit/fit_transform twice” hack for CountVectorizer, but we haven’t used this implementation yet. See https://gist.github.com/kmike/9750796.
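
To make the idea concrete, here is a rough, simplified sketch of just the “freeze” step (assuming the marisa-trie package and the newsgroups_train data loaded above; the real implementations linked above do this properly inside a vectorizer subclass, and the exact attributes to touch depend on the scikit-learn version):

import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(newsgroups_train.data)   # 1st pass: builds a plain dict vocabulary_

# "Freeze" the vocabulary: replace the dict with a MARISA-Trie over the same tokens.
# The trie assigns its own ids in range(len(vocabulary)), so column indices change,
# which is exactly why fit/fit_transform has to be run a second time in the real classes.
vec.vocabulary_ = marisa_trie.Trie(vec.vocabulary_.keys())

X = vec.transform(newsgroups_train.data)   # counts matrix built against the frozen vocabulary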

The results? For the same dataset, MarisaCountVectorizer uses about 0.9MB for unigrams (instead of 82MB) and about 13.3MB for unigrams+bigrams (instead of 650MB+). This is a 50-90x reduction of memory usage. Tada!


The downside is that MarisaCountVectorizer.fit and MarisaCountVectorizer.fit_transform methods are 10-30% slower than CountVectorizer’s (new version; old version was up to 2x+ slower).


Numbers:

  • CountVectorizer(): 3.6s fit, 5.3s dump, 1.9s transform
  • MarisaCountVectorizer(), new version: 3.9s fit, 0s dump, 2.5s transform
  • MarisaCountVectorizer(), old version: 7.5s fit, 0s dump, 2.6s transform
  • CountVectorizer(ngram_range=(1,2)): 15.2s fit, 52.0s dump, 5.3s transform
  • MarisaCountVectorizer(ngram_range=(1,2)), new version: 18.7s fit, 0.0s dump, 6.8s transform
  • MarisaCountVectorizer(ngram_range=(1,2)), old version: 28.3s fit, 0.0s dump, 6.8s transform

The ‘fit’ method was executed on the ‘train’ subset of the ‘20 newsgroups’ dataset; the ‘transform’ method was executed on the ‘test’ subset.

marisa-trie stores all data in a contiguous memory block, so saving it to disk and loading it from disk is much faster than saving/loading a Python dict serialized using pickle.

Serialized file sizes (uncompressed):

  • CountVectorizer(): 5.9MB
  • MarisaCountVectorizer(): 371KB
  • CountVectorizer(ngram_range=(1,2)): 59MB
  • MarisaCountVectorizer(ngram_range=(1,2)): 3.8MB

TfidfVectorizer is implemented on top of CountVectorizer; it could also benefit from more efficient storage for vocabulary. I tried it, and for MarisaTfidfVectorizer the results are similar. It is possible to optimize DictVectorizer as well.

Note that MARISA-based vectorizers don’t help with memory usage during training. They may help with memory usage when saving models to disk though – pickle allocates big chunks of memory when saving Python dicts.

So when memory usage is an issue, ditch scikit-learn standard vectorizers and use marisa-based variants? Not so fast: don’t forget about HashingVectorizer. It has a number of benefits. Check the docs: HashingVectorizer doesn’t need a vocabulary so it fits and serializes in no time and it is very memory efficient because it is stateless.
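
A minimal usage sketch (the n_features value here is an arbitrary choice you would tune for your data):

from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 18)   # no vocabulary: nothing to fit, nothing to store
X = vec.transform(['the dog barks'])          # works right away, the vectorizer is stateless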

As always, there are some tradeoffs:

  • HashingVectorizer.transform is irreversible (you can’t check which tokens are active) so it is harder to inspect what a classifier has learned from text data.
  • There could be collisions, and with improper n_features it could affect the prediction quality of a classifier.
  • A related disadvantage is that the resulting feature vectors are larger than the feature vectors produced by other vectorizers, unless we allow collisions. The HashingVectorizer.transform result is not useful by itself; it is usually passed to the next step (a classifier or something like PCA), and a larger input dimension could mean that this subsequent step will take more memory and will be slower to save/load. So the memory savings of HashingVectorizer could be cancelled out by the increased memory usage of subsequent steps.
  • HashingVectorizer can’t limit features based on document frequency (min_df and max_df options are not supported).

Of course, all vectorizers have their own advantages and disadvantages, and there are use cases for all of them. You can use e.g. CountVectorizer for development and switch to HashingVectorizer for production, avoiding some of HashingVectorizer downsides. Also, don’t forget about feature selection and other similar techniques. Using succinct Trie-based vectorizers is not the only way to reduce memory usage, and often it is not the best way, but sometimes they are useful; being a drop-in replacement for CountVectorizer and TfidfVectorizer helps.

In our recent project, min_df > 1 was crucial for removing noisy features. Vocabulary wasn’t the only thing that used memory; MarisaTfidfVectorizer instead of TfidfVectorizer (+ MarisaCountVectorizer instead of CountVectorizer) decreased the total classifier memory consumption by about 30%. It is not a brilliant 50x-80x, but it made the difference between “classifier fits into memory” and “classifier doesn’t fit into memory”.

Some links:

There is a ticket to discuss efficient vocabulary storage with scikit-learn developers. Once the discussion settles our plan is to make a PR to scikit-learn to make using such vectorizers easier and/or release an open-source package with MarisaCountVectorizer & friends – stay tuned!

Open Source at Scrapinghub

Here at Scrapinghub we love open source. We love using and contributing to it. Over the years we have open sourced a few projects that we keep using over and over, in the hope that they will make others’ lives easier. Writing reusable code is harder than it sounds, but it enforces good practices such as documenting accurately, testing extensively and worrying about backwards support. In the end it produces better software, and keeps programmers happier. This is why we open source as much as we can and always deliver the complete source code to our clients, so they can run everything on their machines if they ever want or need to do so.

Here is a list of open source projects we currently maintain, most of them born and raised at Scrapinghub:

  • Scrapy is the most popular web crawling framework for Python, used by thousands of companies around the world to power their web crawlers. At Scrapinghub we use it to crawl millions of pages daily. We use Scrapy Cloud for running our Scrapy crawlers without having to manage servers or plan capacity beforehand.

  • Scrapely is a supervised learning library for extracting structured data from HTML pages. You train it with examples and Scrapely automatically extracts all similar pages. It powers the extraction engine of our Autoscraping service.

  • Slybot combines the power of Scrapy and Scrapely into a standalone web crawler application. We are currently working on a new version that will include a fully-featured visual annotation tool (the one used so far by Autoscraping never got open sourced). UPDATE: the new version has been released, see the announcement.

  • Pydepta is used to extract repeated data (such as records in tables) automatically. It’s based on the Web Data Extraction Based on Partial Tree Alignment paper.

  • Webstruct is a framework for creating machine-learning-based named-entity recognition systems that work on HTML data. Trained webstruct models can work on many different websites, while Scrapely shines where you need to extract data from a single website. Webstruct models require much more training than Scrapely ones, but we do it once per task (e.g. “contact extraction”), not per website, so it scales better to a larger number of websites. A big refactoring is in the works and due to be merged soon.

  • Loginform is used for filling in website login forms given just the login page URL, username and password. Which form and fields to submit are inferred automatically.

  • Webpager is used to paginate search results automatically, without having to specify where the “next” button is.

  • Splash is a web service to render pages using JavaScript. It’s quite lightweight compared to running a complete web browser (like Selenium).

If you are working on web data mining, take a moment to review them; there’s a high chance you will need one of those for your next project, and it might not be the kind of wheel you would want to reinvent.

Looking back at 2013

This time last year Pablo and I were chatting about the previous year and what to expect in 2013. I noticed that our team had almost doubled in size in the previous year, and we wondered whether that could possibly continue in 2013.

It turns out it did! We went from 20 team members to almost 40 and the number of projects, customers, etc. all doubled. We’ve become even more distributed as we’ve grown, covering 19 countries:

Join us and put your country on the map!


What the numbers don’t show is that these new hires have become indispensable members of the team and it’s hard to imagine working without them.

Over 2 billion web pages were scraped using our platform in 2013! We’re pleased this was done politely, without receiving a single complaint. More users participated in our beta and we saw a 4x growth in our platform usage[1]:


This would not have been possible without our new architecture, for reasons explained in our most popular blog post of 2013.

Our open source contributions increased – we had 2 large stable Scrapy releases (0.18 and 0.20) and started several new projects (e.g. splash, webstruct, webpager, scrapyjs and loginform). Scrapy usage (GitHub stars, forks, etc.) doubled again this year and is currently number 12 in GitHub’s trending Python repositories this month. We have more exciting releases planned, such as a new open source annotation UI for Autoscraping and Scrapy 1.0!

We are very proud of our team and what we have achieved this year. We would like to thank all our customers and supporters. Happy New Year everyone!


  1. These numbers exclude crawls that do not store data on our servers. Some crawls use S3 or other external data storage.