Browsed by
Author: Mikhail Korobov

Do Androids Dream of Electric Sheep?

Machine learning has become very easy to do: you install an ML library like scikit-learn or xgboost, choose an estimator, feed it some training data, and get a model that can be used for predictions. OK, but what’s next? How do you know whether it works well? Cross-validation! Good! How do you know that you haven’t messed up the cross-validation? Are there data leaks? If the quality is not good enough, how do you improve it? Are there data preprocessing…

Optimizing Memory Usage of Scikit-Learn Models Using Succinct Tries

We use the scikit-learn library for various machine-learning tasks at Scrapinghub. For example, for text classification we’d typically build a statistical model using sklearn’s Pipeline, FeatureUnion, a classifier (e.g. LinearSVC), and feature extraction and preprocessing classes. The model is usually trained on a developer’s machine, then serialized (using pickle/joblib) and uploaded to a server where the classification takes place. Sometimes the server has too little memory available for the classifier. One way to address this is to…
