Incremental Crawls with Scrapy and DeltaFetch

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine, so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.

Scrapy Tips

Scrapy is designed to be extensible and to keep its components loosely coupled. You can easily extend Scrapy’s functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch’s main purpose is to avoid requesting pages that have already been scraped, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs listed in the spider’s start_urls attribute, and to requests generated in the spider’s start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the related request identifier (a.k.a. fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.
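To give you an idea of how such a spider middleware operates, here is a simplified sketch (not the actual DeltaFetch code): it keeps fingerprints in an in-memory set, whereas DeltaFetch persists them in an on-disk key/value database and also lets requests coming from start_urls/start_requests through.

from scrapy import Request
from scrapy.utils.request import request_fingerprint


class DeltaFetchLikeMiddleware(object):
    """Simplified sketch of a DeltaFetch-style spider middleware."""

    def __init__(self):
        # DeltaFetch stores fingerprints in a persistent on-disk database instead
        self.seen_fingerprints = set()

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, Request):
                # drop requests whose pages already produced items in the past
                if request_fingerprint(element) in self.seen_fingerprints:
                    continue
                yield element
            else:
                # an item was scraped: remember the request that produced it
                self.seen_fingerprints.add(request_fingerprint(response.request))
                yield element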

Now let’s see how to set up DeltaFetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

Our example project has a spider that crawls books.toscrape.com. It navigates through all the listing pages and visits every book details page to fetch data such as the book title, description and category. The crawler is executed once a day in order to capture new books that are added to the catalogue. There’s no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn’t change.
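Such a spider could look roughly like this (a sketch with illustrative CSS selectors, not necessarily the exact code from the repository mentioned below):

import scrapy


class ToScrapeSpider(scrapy.Spider):
    name = 'toscrape'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # visit every book details page listed on this page
        for href in response.css('article.product_pod h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_book)
        # follow the pagination
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

    def parse_book(self, response):
        yield {
            'title': response.css('div.product_main h1::text').extract_first(),
            'category': response.css('ul.breadcrumb li:nth-child(3) a::text').extract_first(),
            'description': response.css('#product_description ~ p::text').extract_first(),
        }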

To see DeltaFetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider made 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 page requests generated no items, so those pages will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you’ll see that 1000 requests were skipped because items had been scraped from those pages in a previous crawl. This time the spider didn’t extract any items and made only 51 requests, all of them to listing pages from which no items had been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed based on the canonical URL, HTTP method and request body.
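For instance, since the URL is canonicalized first, reordering query parameters does not change the fingerprint, while a genuinely different URL does. A quick illustration using Scrapy’s request_fingerprint utility:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://www.example.com/product?id=123&ref=home')
r2 = Request('http://www.example.com/product?ref=home&id=123')  # same page, params reordered
r3 = Request('http://www.example.com/deals?id=123')              # different URL, same product

assert request_fingerprint(r1) == request_fingerprint(r2)
assert request_fingerprint(r1) != request_fingerprint(r3)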

Some websites have several URLs for the same data. For example, an e-commerce site could have the following URLs pointing to a single product:

  • www.example.com/product?id=123
  • www.example.com/deals?id=123
  • www.example.com/category/keyboards?id=123
  • www.example.com/category/gaming?id=123

Request fingerprints aren’t suitable in these situations as the canonical URL will differ despite the item being the same. In this example, we could use the product’s ID as the DeltaFetch key.

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from scrapy import Request
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    for product_url in response.css('a.product_listing::attr(href)').extract():
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.

DeltaFetch is quite handy in situations like the ones we’ve just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if those requests were not generated from the spider’s start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on GitHub for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap-up

You can find many interesting Scrapy plugins in the scrapy-plugins page on GitHub, and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you’d like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web, so that the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, the Peruvian Congress provides a website where people can check the list of bills that are being processed, voted on and eventually become law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated as bills are moved between commissions, approved and then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, many laws have been passed that benefit only one specific company or individual.

However, transparency doesn’t guarantee accessibility. The site is very clunky, and the information for each bill is spread across several pages. It displays the bills in a very long list with far too many pages, and until very recently there was no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time consuming as there are around one thousand bills proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all:

The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.

The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.

Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for a national day of peace, of “peace consolidation”, of “peace and reconciliation”, of Peruvian Coffee and Peruvian Cuisine, as well as national days for several kinds of Peruvian produce.

There was even more than one bill proposing the celebration of the same thing, on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. Under Peruvian legislation, the second round of voting should be waived only under exceptional circumstances.

However, the numbers show that one round of voting has become the norm, as 88% of the approved bills passed in a single round. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber where all the decisions are made. It’s also expected that members of Congress use the time between the first and second votes for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having such information in a well-structured, machine-readable format is that we can create cool data visualizations, such as this interactive timeline that shows all the events that happened to a given bill:

Another cool thing is that this data allows us to monitor Congress’ activities. Our web app allows users to subscribe to an RSS feed in order to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when a bill intends to do more wrong than good.

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making transparency laws less useful than they should be. The work of volunteers is key to building tools that turn otherwise clunky content into useful data for journalists, lawyers and regular citizens alike. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love! 🙂

Portia: The Open Source Alternative to Kimono Labs

Attention Kimono users: we’ve created an exporter so you can easily convert your projects from Kimono to Portia!

Imagine your business depended heavily on a third-party tool and one day that company decided to shut down its service with only two weeks’ notice. That, unfortunately, is what happened to users of Kimono Labs yesterday.

And it’s one of the many reasons why we love open source so much.

Portia is an open source visual scraping tool developed by Scrapinghub to make it easier to get data from the web without needing to write a line of code.

You can do anything with Portia that you can with Kimono Labs, but without vendor lock-in. You can run your spiders on our platform, and you always have the option to move to your own infrastructure.

Also, we won’t be leaving you high and dry, Kimono users. We’ve developed an exporter to easily migrate your Kimono projects to Portia. Check it out: kimono.scrapinghub.com

Portia’s Features

  • Open source (you won’t need to worry about us shutting down on you)
    • You can always export all your data and your crawler configurations
  • Support for JavaScript-based websites
    • User interactions (such as clicking, scrolling, waiting and filling in forms) are simulated by recording and replaying user actions on the page
  • Browser-based, so there’s no need for extensions
  • Portia is based on Scrapy, and can be extended and customized further using code

Get started with Portia here. And take a look at our Knowledge Base in case you have any questions.

Kimono and Portia

Take a look at Kimono and Portia in action:

[Animated demo: Kimono]
[Animated demo: Portia]

Portia 2.0

Our upcoming Portia 2.0 release includes:

  • Extracting multiple items from a list
  • Nested items support
  • New, revamped UI based on user experiences

And soon after, we’ll also be adding new features like:

  • A visual method of defining links to follow without the need for regular expressions.
  • A way to download Portia projects as Scrapy projects using CSS and XPath selectors

Scalable Platform

Portia is fully integrated into our platform, Scrapy Cloud, but you can also check out the repository and run it locally or on your own server. The benefits of running Portia on Scrapy Cloud include:

  • Robust scheduling
  • On-demand scaling
  • Monitoring add-on that checks if all the expected items were extracted for each crawl
  • A UI for viewing and comparing the extracted items
  • Built-in add-ons for Crawlera and Splash, along with third party tools

Wrap Up

You can try out Portia here, although you’ll need to register (free!) to deploy your spiders to Scrapy Cloud. Plus, since Portia is open source, we welcome any and all developers who are interested in contributing.

Skinfer: A Tool for Inferring JSON Schemas

Imagine that you have a lot of samples of a certain kind of data in JSON format. Maybe you want to get a better feel for it: know which fields appear in all records, which appear only in some, and what their types are. In other words, you want to know the schema of the data that you have.

We’d like to present skinfer, a tool that we built for inferring a schema from samples in JSON format. Skinfer takes a list of JSON samples and gives you one JSON schema that describes all of them. (For more information about JSON Schema, we recommend the online book Understanding JSON Schema.)

Install skinfer with pip install skinfer, then generate a schema by running the schema_inferer command, passing it a list of JSON samples (either a JSON lines file with all the samples or a list of JSON files given on the command line).

Here is an example of usage with a simple input:

$ cat samples.json
{"name": "Claudio", "age": 29}
{"name": "Roberto", "surname": "Gomez", "age": 72}
$ schema_inferer --jsonlines samples.json
{
    "$schema": "http://json-schema.org/draft-04/schema",
    "required": [
        "age",
        "name"
    ],
    "type": "object",
    "properties": {
        "age": {
            "type": "number"
        },
        "surname": {
            "type": "string"
        },
        "name": {
            "type": "string"
        }
    }
}

Once you’ve generated a schema for your data, you can:

  1. Run it against other samples to see if they share the same schema (see the example after this list)
  2. Share it with anyone who wants to know the structure of your data
  3. Complement it manually, adding descriptions for the fields
  4. Use a tool like docson to generate a nice page documenting the schema of your data (see example here)
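For the first item, for example, you could validate new samples against the generated schema with the jsonschema package (just an illustration, not part of skinfer; the file name is made up):

import json

import jsonschema  # third-party package: pip install jsonschema

with open('schema.json') as f:          # the schema generated by schema_inferer
    schema = json.load(f)

sample = {"name": "Claudio", "age": 29}
jsonschema.validate(sample, schema)     # raises ValidationError if the sample doesn't match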

Another interesting feature of skinfer is that it can also merge a list of schemas, giving you a new schema that describes samples from all of the previously given schemas. For this, use the json_schema_merger command, passing it a list of schemas.

This is cool because you can keep updating a schema even after you’ve generated it: just infer a schema for the new samples and merge it with the one you already have.

Feel free to dive into the code, explore the docs and please file any issues that you have on GitHub. 🙂

Scrapinghub Crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?”

— Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web, that is, web content that is not indexed by traditional search engines such as Google, Bing and others. According to current estimates, this content dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven very hard to track down in a systematic way.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

[Video: Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes]

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to be able to bring our web scraping expertise and open source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!

Optimizing Memory Usage of Scikit-Learn Models Using Succinct Tries

We use the scikit-learn library for various machine-learning tasks at Scrapinghub. For example, for text classification we’d typically build a statistical model using sklearn’s Pipeline, FeatureUnion, some classifier (e.g. LinearSVC) plus feature extraction and preprocessing classes. The model is usually trained on a developer’s machine, then serialized (using pickle/joblib) and uploaded to a server where the classification takes place.

Sometimes there can be too little available memory on the server for the classifier. One way to address this is to change the model: use simpler features, do feature selection, change the classifier to a less memory intensive one, use simpler preprocessing steps, etc. It usually means trading accuracy for better memory usage.

For text, it is often CountVectorizer or TfidfVectorizer that consumes most of the memory. For the last few months we have been using a trick to make them much more memory efficient in production (50x+) without changing anything from a statistical point of view, and this is what this article is about.

Let’s start with the basics. Most machine learning algorithms expect fixed size numeric feature vectors, so text should be converted to this format. Scikit-learn provides CountVectorizer, TfidfVectorizer and HashingVectorizer for text feature extraction (see the scikit-learn docs for more info).

CountVectorizer.transform converts a collection of text documents into a matrix of token counts. The counts matrix has a column for each known token and a row for each document; each value is the number of occurrences of a token in a document.

To create the counts matrix, CountVectorizer must know which column corresponds to which token. The CountVectorizer.fit method basically remembers all tokens from some collection of documents and stores them in a “vocabulary”. The vocabulary is a Python dictionary: keys are tokens (or n-grams) and values are integer ids (column indices) ranging from 0 to len(vocabulary)-1.
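For example, fitting a CountVectorizer on a couple of tiny documents produces a vocabulary like this:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(['the dog barks', 'the cat meows'])
print(vec.vocabulary_)
# {u'barks': 0, u'cat': 1, u'dog': 2, u'meows': 3, u'the': 4}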

Storing such a vocabulary in a standard Python dict is problematic; it can take a lot of memory even on relatively small data.

Let’s try it! Let’s use the “20 newsgroups” dataset available in scikit-learn. The “train” subset of this dataset has about 11k short documents (average document size is about 2KB, or 300 tokens; there are 130k unique tokens; average token length is 6.5).

Create and persist CountVectorizer:

from sklearn import datasets
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer

newsgroups_train = datasets.fetch_20newsgroups(subset='train')
vec = CountVectorizer()
vec.fit(newsgroups_train.data)
joblib.dump(vec, 'vec_count.joblib')

Load and use it:

from sklearn.externals import joblib
vec = joblib.load('vec_count.joblib')
X = vec.transform(['the dog barks'])

On my machine, the loaded vectorizer uses about 82MB of memory in this case. If we add bigrams (by using CountVectorizer(ngram_range=(1,2))) then it would take about 650MB – and this is for a corpus that is quite small.

There are only 130k unique tokens; it would require less than 1MB to store these tokens in a plain text file ((6.5+1) * 130k). Maybe add another megabyte to store column indices if they are not implicit (130k * 8). So the data itself should take only a couple of MBs. We also have to somehow enumerate tokens and enable fast O(1) access to the data, so there will be some overhead, but it shouldn’t take 80+MB; we’d expect 5-10MB at most. The serialized version of our CountVectorizer takes about 6MB on disk without any compression, but it expands to 80+MB when loaded into memory.
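To make the back-of-the-envelope estimate explicit (these lines just restate the arithmetic above):

n_tokens = 130000         # unique tokens in the corpus
avg_token_len = 6.5       # average token length, in characters

plain_text_size = (avg_token_len + 1) * n_tokens   # tokens + separators: ~0.98MB
index_size = 8 * n_tokens                          # one 64-bit integer per token: ~1.04MB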

Why does it happen? There are two main reasons:

  1. Python objects are created for numbers (column indices) and strings (tokens). Each Python object has a pointer to its type plus a reference counter (which adds 16+ bytes of overhead per object on 64-bit systems); for strings there are extra fields: length, hash, pointer to the string data, flags, etc. (the string representation is different in Python < 3.3 and Python 3.3+).
  2. Python dict is a hash table and introduces overheads: you have to store the hash table itself, pointers to keys and values, etc. There is a great talk on the Python dict implementation by Brandon Rhodes; check it out if you’re interested in knowing more.

Storing a static string->id mapping in a hash table is not the most efficient way to do it (there are perfect hashes, tries, etc.); add the Python object overhead on top and that is how we end up with 80+MB.
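You can get a feel for these per-object overheads with sys.getsizeof (the exact numbers depend on the Python version and build; the values in the comments are typical for 64-bit CPython 2.7):

import sys

print(sys.getsizeof(12345))      # ~24 bytes for a single small integer object
print(sys.getsizeof(u'barks'))   # ~60+ bytes for a short unicode string
print(sys.getsizeof({}))         # ~280 bytes for an empty dict, before any entries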

So I decided to try an alternative storage for vocabulary. MARISA-Trie (via Python wrapper) looked like a suitable data structure, as it:

  • is a heavily optimized succinct trie-like data structure, so it compresses string data well
  • provides a unique id for each key for free, and this id is in range from 0 to len(vocabulary)-1 – we don’t have to store these indices ourselves
  • only creates Python objects (strings, integers) on demand.

MARISA-Trie is not a general replacement for dict: you can’t add a key after building, it requires more time and memory to build, lookups (via Python wrapper) are slower – about 10x slower than dict’s, and it works best for “meaningful” string keys which have common parts (not for some random data).
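Here is what the Python wrapper looks like in practice (a tiny example; keys have to be unicode strings):

import marisa_trie

keys = [u'barks', u'cat', u'dog', u'meows', u'the']
trie = marisa_trie.Trie(keys)

print(trie[u'dog'])                    # a unique id in range(len(trie)), assigned by the trie
print(trie.restore_key(trie[u'dog']))  # => u'dog'
print(u'horse' in trie)                # => False; keys can't be added after building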

I must admit I don’t fully understand how MARISA-Tries work 🙂 The implementation is available in a folder named “grimoire”, and the only information about the implementation I could find is a set of Japanese slides which are outdated (as library author Susumu Yata says). It seems to be a succinct implementation of a Patricia Trie which can store references to other MARISA-Tries in addition to text data; this allows it to compress more than just prefixes (as in “standard” tries). “Succinct” means the trie is encoded as a bit array.

You may never have heard of this library, but if you have a recent Android phone, MARISA-Trie is likely in your pocket: a copy of marisa-trie is in the Android 4.3+ source tree.

Ok, great, but we have to tell scikit-learn to use this data structure instead of a dict for vocabulary storage.

Scikit-learn allows passing a custom vocabulary (a dict-like object) to CountVectorizer. But this won’t help us because MARISA-Trie is not exactly dict-like: it can’t be built and modified like a dict. CountVectorizer should build a vocabulary for us (using its tokenization and preprocessing features), and only then can we “freeze” it into a compact representation.

At first, we were doing it using a hack. The fit and fit_transform methods were overridden: first, they call the parent method to build a vocabulary; then they freeze that vocabulary (i.e. build a MARISA-Trie from it) and trick CountVectorizer into thinking a fixed vocabulary was passed to the constructor; then the parent method is called once more. Calling fit/fit_transform twice is necessary because the indices learned on the first call differ from the indices in the frozen vocabulary. This quick & dirty implementation is here, and it is what we’re using in production.

I recently improved it and removed this “call fit/fit_transform twice” hack for CountVectorizer, but we haven’t used this implementation yet. See https://gist.github.com/kmike/9750796.
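Conceptually, the trick boils down to something like the following sketch (simplified, and not the code from the gist above; it only shows why the indices have to be re-learned after freezing):

import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the dog barks', 'the cat meows', 'the dog and the cat']

vec = CountVectorizer()
vec.fit(docs)  # first pass: builds the usual dict-based vocabulary_

# "Freeze" the vocabulary: the trie assigns its own ids in range(len(vocabulary))
frozen = marisa_trie.Trie(vec.vocabulary_.keys())

# The trie ids usually differ from the dict's column indices, which is why the
# vectorizer has to be fitted once more with the frozen vocabulary (the
# "call fit twice" part) before transform() can rely on it.
print(sorted(vec.vocabulary_.items()))
print(sorted((key, frozen[key]) for key in frozen.keys()))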

The results? For the same dataset, MarisaCountVectorizer uses about 0.9MB for unigrams (instead of 82MB) and about 13.3MB for unigrams+bigrams (instead of 650MB+). This is a 50-90x reduction of memory usage. Tada!

The downside is that MarisaCountVectorizer.fit and MarisaCountVectorizer.fit_transform methods are 10-30% slower than CountVectorizer’s (new version; old version was up to 2x+ slower).

Numbers:

  • CountVectorizer(): 3.6s fit, 5.3s dump, 1.9s transform
  • MarisaCountVectorizer(), new version: 3.9s fit, 0s dump, 2.5s transform
  • MarisaCountVectorizer(), old version: 7.5s fit, 0s dump, 2.6s transform
  • CountVectorizer(ngram_range=(1,2)): 15.2s fit, 52.0s dump, 5.3s transform
  • MarisaCountVectorizer(ngram_range=(1,2)), new version: 18.7s fit, 0.0s dump, 6.8s transform
  • MarisaCountVectorizer(ngram_range=(1,2)), old version: 28.3s fit, 0.0s dump, 6.8s transform

The fit method was executed on the ‘train’ subset of the ’20 newsgroups’ dataset; the transform method was executed on the ‘test’ subset.

marisa-trie stores all data in a contiguous memory block, so saving it to disk and loading it back are much faster than saving/loading a Python dict serialized using pickle.

Serialized file sizes (uncompressed):

  • CountVectorizer(): 5.9MB
  • MarisaCountVectorizer(): 371KB
  • CountVectorizer(ngram_range=(1,2)): 59MB
  • MarisaCountVectorizer(ngram_range=(1,2)): 3.8MB

TfidfVectorizer is implemented on top of CountVectorizer; it could also benefit from more efficient storage for vocabulary. I tried it, and for MarisaTfidfVectorizer the results are similar. It is possible to optimize DictVectorizer as well.

Note that MARISA-based vectorizers don’t help with memory usage during training. They may help with memory usage when saving models to disk though – pickle allocates big chunks of memory when saving Python dicts.

So when memory usage is an issue, should you ditch scikit-learn’s standard vectorizers and use the MARISA-based variants? Not so fast: don’t forget about HashingVectorizer. It has a number of benefits (check the docs): it doesn’t need a vocabulary, so it fits and serializes in no time, and it is very memory efficient because it is stateless.
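Using it is as simple as this (there is no vocabulary to learn, so no fit step is required):

from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 18)       # output dimension is fixed up front
X = vec.transform(['the dog barks', 'the cat meows'])
print(X.shape)                                    # (2, 262144), a sparse matrix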

As always, there are some tradeoffs:

  • HashingVectorizer.transform is irreversible (you can’t check which tokens are active), so it is harder to inspect what a classifier has learned from text data.
  • There could be collisions, and with an inappropriate n_features value they could affect the prediction quality of a classifier.
  • A related disadvantage is that the resulting feature vectors are larger than the feature vectors produced by other vectorizers unless we allow collisions. The HashingVectorizer.transform result is not useful by itself, it is usually passed to the next step (classifier or something like PCA), and a larger input dimension could mean that this subsequent step will take more memory and will be slower to save/load, so the memory savings of HashingVectorizer could be compensated by increased memory usage of subsequent steps.
  • HashingVectorizer can’t limit features based on document frequency (min_df and max_df options are not supported).

Of course, all vectorizers have their own advantages and disadvantages, and there are use cases for all of them. You can use e.g. CountVectorizer for development and switch to HashingVectorizer for production, avoiding some of HashingVectorizer’s downsides. Also, don’t forget about feature selection and other similar techniques. Using succinct trie-based vectorizers is not the only way to reduce memory usage, and often it is not the best way, but sometimes they are useful; being a drop-in replacement for CountVectorizer and TfidfVectorizer helps.

In our recent project, min_df > 1 was crucial for removing noisy features. The vocabulary wasn’t the only thing that used memory, but using MarisaTfidfVectorizer instead of TfidfVectorizer (plus MarisaCountVectorizer instead of CountVectorizer) decreased the total classifier memory consumption by about 30%. It is not a brilliant 50x-80x, but it made the difference between “classifier fits into memory” and “classifier doesn’t fit into memory”.

There is a ticket to discuss efficient vocabulary storage with the scikit-learn developers. Once the discussion settles, our plan is to make a PR to scikit-learn to make using such vectorizers easier and/or release an open source package with MarisaCountVectorizer & friends, so stay tuned!

Autoscraping casts a wider net

We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our capacity to accommodate these users.

Natalia made a screencast to help our new users get started; it’s also a great introduction to what this service can do.

We released slybot as an open source integration of the scrapely extraction library and the Scrapy framework. This is the core technology behind the Autoscraping service, and we will make it easy to export Autoscraping spiders from Scrapinghub and run them entirely with slybot, giving our users the flexibility and freedom provided by open source.