Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine, so don’t worry: you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.
Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf.
Attention Kimono users: we've created an exporter so you can easily convert your projects from Kimono to Portia!
Imagine your business depended heavily on a third-party tool and one day that company decided to shut down its service with only two weeks' notice. That, unfortunately, is what happened to users of Kimono Labs yesterday.
And it’s one of the many reasons why we love open source so much.
Imagine that you have a lot of samples for a certain kind of data in JSON format. Maybe you want to get a better feel for it: which fields appear in all records, which appear only in some, and what their types are. In other words, you want to know the schema for the data that you have.
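To make the idea concrete, here is a minimal sketch using only the Python standard library. It is not the tool the post goes on to describe; the `infer_schema` function and the sample records are hypothetical, and it only looks at top-level fields of flat JSON objects.

```python
import json
from collections import defaultdict


def infer_schema(records):
    """Summarize which top-level fields appear in the records and with which types.

    Hypothetical helper for illustration: nested objects are reported simply
    as "dict" and are not descended into.
    """
    total = 0
    seen = defaultdict(lambda: {"count": 0, "types": set()})
    for record in records:
        total += 1
        for key, value in record.items():
            seen[key]["count"] += 1
            seen[key]["types"].add(type(value).__name__)
    return {
        key: {
            # A field is "required" only if every record contained it.
            "required": info["count"] == total,
            "types": sorted(info["types"]),
        }
        for key, info in seen.items()
    }


# Made-up sample data, one JSON document per line.
lines = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Linus", "email": "linus@example.com"}',
    '{"name": "Grace", "age": null}',
]
samples = [json.loads(line) for line in lines]
schema = infer_schema(samples)

for field, info in schema.items():
    print(field, info)
```

Here `name` is reported as required because it appears in every record, while `age` and `email` are optional; `age` also shows two observed types because one sample holds `null`.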
"The easiest way to think about Memex is: How can I make the unseen seen?"
We use the scikit-learn library for various machine-learning tasks at Scrapinghub. For example, for text classification we'd typically build a statistical model using sklearn's Pipeline, FeatureUnion, some classifier (e.g. LinearSVC) + feature extraction and preprocessing classes. The model is usually trained on a developer's machine, then serialized (using pickle/joblib) and uploaded to a server...
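As a rough illustration of that workflow, the sketch below builds a tiny text classifier from a `Pipeline`, a `FeatureUnion` of two `TfidfVectorizer` instances, and `LinearSVC`, then round-trips it through `pickle`. The training texts, labels, and feature choices are assumptions made up for this example, not the setup actually used at Scrapinghub.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
texts = [
    "cheap flights to paris",
    "book a hotel room in rome",
    "python web scraping tutorial",
    "how to parse html with xpath",
]
labels = ["travel", "travel", "tech", "tech"]

# Combine word-level and character-level TF-IDF features.
features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word")),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

model = Pipeline([
    ("features", features),
    ("clf", LinearSVC()),
])
model.fit(texts, labels)

# Serialize on the "developer's machine" ...
blob = pickle.dumps(model)

# ... and load it again, as a server would after the upload.
restored = pickle.loads(blob)
print(restored.predict(["scraping html pages with python"]))
```

Unpickling generally expects the same scikit-learn version that produced the pickle, which is worth keeping in mind when the training and serving machines differ.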