Why MongoDB Is a Bad Choice for Storing Our Scraped Data

MongoDB was used early on at Scrapinghub to store scraped data because it's convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the alternatives available a few years ago) and it worked well for some time.

Usage has grown from a simple store for scraped data used on a few projects to the back end of our Scrapy Cloud platform. Now we are experiencing limitations with our current architecture and, rather than continue to work with MongoDB, we have decided to move to a different technology (more in a later blog post). Many customers are surprised to hear that we are moving away from MongoDB; I hope this blog post helps explain why it didn't work for us.

Locking

We have a large volume of short queries which are mostly writes from web crawls. These rarely cause problems as they are fast to execute and the volumes are quite predictable. However, we have a lower volume of longer running queries (e.g. exporting, filtering, bulk deleting, sorting, etc.) and when a few of these run at the same time we get lock contention. 

Each MongoDB database (each server, prior to version 2.2) has a readers-writer lock. Under lock contention, all the short queries have to wait longer and the long-running queries take much longer still! Short queries take so long that they time out and are retried. Requests from our website (e.g. users browsing data) take so long that all worker threads in our web server get blocked querying MongoDB. Eventually the website and all web crawls stop working!

To address this we:

  • Modified the MongoDB driver to time out operations and retry certain queries with exponential backoff
  • Synced data to our new backend storage and ran some of the bulk queries there
  • Partitioned the data between many separate MongoDB databases
  • Scaled up our servers
  • Delayed implementing (or disabled) features that need to access a lot of fresh data
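
The first item, timing out and retrying with exponential backoff, can be sketched roughly as follows. This is not our actual driver patch: `OperationTimeout`, `retry_with_backoff` and all parameter values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class OperationTimeout(Exception):
    """Hypothetical error raised when a database call exceeds its deadline."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on OperationTimeout with growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OperationTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt (base, 2*base, 4*base, ...), capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many retrying crawlers from waking up in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_insert():
        # Stand-in for a write that times out twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise OperationTimeout("lock wait exceeded")
        return "ok"

    print(retry_with_backoff(flaky_insert, base_delay=0.01))
```

Backing off matters here because immediate retries add load to a server that is already waiting on locks.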

Poor space efficiency

MongoDB does not automatically reclaim disk space used by deleted objects and, because of locking, it is not feasible to reclaim that space manually without substantial downtime. It will attempt to reuse space for newly inserted objects, but we often end up with very fragmented data, which for the same reason we cannot defragment without downtime.

Scraped data often compresses well, but unfortunately there is no built-in compression in MongoDB. It doesn't make sense for us to compress data before inserting because the individual records are often small and we need to search the data.
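
To see why small records gain little from being compressed one at a time, here is a rough sketch using zlib from the Python standard library. The records are made up, and JSON stands in for MongoDB's on-disk format, so the numbers are only indicative.

```python
import json
import zlib

# Hypothetical scraped items: many small records with the same shape.
records = [
    {"url": "http://example.com/product/%d" % i,
     "title": "Product %d" % i,
     "price": "%d.99" % (i % 50),
     "in_stock": i % 3 == 0}
    for i in range(1000)
]
encoded = [json.dumps(r, sort_keys=True).encode("utf-8") for r in records]

raw_size = sum(len(e) for e in encoded)
# Record by record: each record pays zlib's fixed overhead and contains
# little repetition for the compressor to exploit.
per_record_size = sum(len(zlib.compress(e)) for e in encoded)
# One block: repetition across records (field names, URL prefixes) is found.
block_size = len(zlib.compress(b"\n".join(encoded)))

print("raw bytes:                ", raw_size)
print("compressed per record:    ", per_record_size)
print("compressed as one block:  ", block_size)
```

Compressing in blocks gives the better ratio, but a block cannot be searched or indexed by the database without being decompressed first.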

Always storing object field names can be wasteful, particularly when they never change in some collections.
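
As a back-of-the-envelope illustration, the sketch below estimates how many bytes of a small document go to field names. It follows the BSON layout for flat documents holding only strings, booleans and 32-bit integers; the helper `bson_sizes` and the sample item are hypothetical, and the `_id` field MongoDB adds is ignored.

```python
def bson_sizes(doc):
    """Return (name_bytes, total_bytes) for a flat str/bool/int document.

    Only these three value types are handled; nested documents, arrays and
    other BSON types are deliberately left out of this sketch.
    """
    name_bytes = 0
    value_bytes = 0
    for key, value in doc.items():
        name_bytes += len(key.encode("utf-8")) + 1       # name bytes + NUL
        if isinstance(value, bool):                       # bool before int
            value_bytes += 1
        elif isinstance(value, int):
            value_bytes += 4                              # assumes int32 range
        elif isinstance(value, str):
            # int32 length prefix, UTF-8 bytes, trailing NUL
            value_bytes += 4 + len(value.encode("utf-8")) + 1
        else:
            raise TypeError("unsupported type in this sketch: %r" % type(value))
    type_tags = len(doc)                                  # one type byte each
    # int32 size prefix + elements + trailing NUL
    total = 4 + type_tags + name_bytes + value_bytes + 1
    return name_bytes, total


item = {"product_name": "Widget", "price_in_cents": 1999, "in_stock": True}
names, total = bson_sizes(item)
print("field names use %d of %d bytes (%.0f%%)" % (names, total, 100.0 * names / total))
```

For records like this one, the names take up more space than the values, and that cost is paid again in every document of the collection.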

Too many databases

We run too many databases for MongoDB to comfortably handle. Each database has a minimum size allocation so we have wasted space if the size of the data in that DB is small. If no data is in the disk cache (e.g. after a server restart), then it can take a long time to start MongoDB as it needs to check each database. 

Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written. Retrieving data in order requires sorting, which becomes impractical when the number of records gets large.

It is only possible to maintain order in MongoDB if you use capped collections, which are not suitable for crawl output.

Skip + limit queries are slow

There is no limit on the number of items written per crawl job and it's not unusual to see jobs that have a few million items. When reading data from the middle of a crawl job, MongoDB needs to walk the index from the beginning to the offset specified, so browsing deep into a job with a lot of data gets slow.

Users may download job data via our API by paginating results. For large jobs (say, over a million items), it's very slow and some users work around this by issuing multiple queries in parallel, which of course causes high server load and lock contention.
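
A simple cost model shows how quickly this adds up. It assumes, as described above, that every paginated request walks the index from the start of the job to its offset and then reads one page; the function name and the page size are hypothetical.

```python
def entries_walked_with_skip(total_items, page_size):
    """Index entries touched when downloading a whole job with skip + limit."""
    walked = 0
    for offset in range(0, total_items, page_size):
        page = min(page_size, total_items - offset)
        walked += offset + page  # walk to the offset, then read the page
    return walked


total_items = 1000000
walked = entries_walked_with_skip(total_items, page_size=1000)
print("items in job:          ", total_items)
print("index entries walked:  ", walked)
print("amplification factor:  ", walked / total_items)
```

Under this model the work grows with the square of the job size, which is why deep pages are so much slower than the first ones.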

Restrictions

There are some odd restrictions, like the allowed characters in object field names. This is unfortunate, since we lack control over the field names we need to store.
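
One possible workaround is to escape field names before storing them and restore them on the way out. The sketch below assumes the restricted characters are `.` and `$`, which is my understanding of MongoDB's field name rules at the time; the percent-style escaping scheme and all function names are hypothetical.

```python
def escape_key(key):
    # Escape "%" first so the mapping stays reversible.
    return key.replace("%", "%25").replace(".", "%2E").replace("$", "%24")


def unescape_key(key):
    # Undo the escapes in the opposite order, restoring "%" last.
    return key.replace("%24", "$").replace("%2E", ".").replace("%25", "%")


def map_keys(value, key_func):
    """Apply key_func to every dict key, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {key_func(k): map_keys(v, key_func) for k, v in value.items()}
    if isinstance(value, list):
        return [map_keys(v, key_func) for v in value]
    return value


scraped = {"price.usd": 10, "$ref": {"a.b": [{"x.y": 1}]}, "100%": True}
stored = map_keys(scraped, escape_key)
restored = map_keys(stored, unescape_key)
print(sorted(stored))
print(restored == scraped)
```

The cost is that every query and every export has to apply the same mapping, which is easy to forget.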

Impossible to keep the working set in memory

We have many TB of data per node. The frequently accessed parts are small enough that it should be possible to keep them in memory. The infrequently accessed data is often sequentially scanned crawl data.

MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO bound and lock contention becomes an issue.
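
The eviction effect can be reproduced with a toy model. The sketch below uses a plain LRU cache as a stand-in for the operating system's page cache; that is a simplifying assumption, and the cache size, page counts and names are made up.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used page cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)     # mark as most recently used
        else:
            self.misses += 1
            self.pages[page_id] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used


def read_hot_set(cache, hot_pages):
    """Read every hot page once and return how many were already cached."""
    before = cache.hits
    for page in hot_pages:
        cache.read(("hot", page))
    return cache.hits - before


cache = LRUCache(capacity=100)
hot_pages = range(50)
read_hot_set(cache, hot_pages)                  # warm-up: load the hot set
hits_before_scan = read_hot_set(cache, hot_pages)
for page in range(1000):                        # one-off sequential scan
    cache.read(("cold", page))
hits_after_scan = read_hot_set(cache, hot_pages)

print("hot-set hits before scan:", hits_before_scan)
print("hot-set hits after scan: ", hits_after_scan)
```

A single scan larger than the cache pushes out the whole hot set, even though the scanned pages are never read again.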

Data that should be good ends up bad!

After embracing MongoDB, its use spread to many areas, including as a back end for our Django UI. The data stored here should be clean and structured, but MongoDB makes this difficult. Some limitations that affected us are:

  • No transactions - We often need to update a few collections at a time and in the case of failure (server crash, bug, etc.) only some of this data is updated. Of course this leads to inconsistent state. In some cases we apply a mix of batch jobs to fix the data, or various work-arounds in code. Unfortunately, it has become common to just ignore the problem, thinking it might be rare and unimportant (a philosophy encouraged by MongoDB).
  • Silent failures hide errors - It's better to detect errors early, and "let it crash". Instead, MongoDB hides problems (e.g. writing to a non-existent collection) and encourages very defensive programming (Does the collection exist? Is there an index on the field I need? Is the data the type I expect? etc.)
  • Safe mode poorly understood - Often developers don't understand that without safe=True, the data may never get written (e.g. in case of error), or may get written at some later time. We had many problems (such as intermittently failing tests) where developers expected to read back data they had written with safe=False.
  • Lack of a schema or data constraints - Bugs can lead to bad data being inserted in the database and going unnoticed.
  • No joins - Joins are extremely useful, but with MongoDB you're forced to either maintain denormalized data without triggers or transactions, or issue many queries loading reference data.
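
The first point, updates to several collections with no transaction around them, can be illustrated with a toy example. Plain dictionaries stand in for two collections here; the collection names, fields and the `finish_job` helper are all hypothetical.

```python
# Hypothetical in-memory stand-ins for two MongoDB collections.
jobs = {"job-1": {"state": "running"}}
project_stats = {"project-1": {"finished_jobs": 0}}


def finish_job(job_id, project_id, crash_between_writes=False):
    """Update two collections with no transaction around them."""
    jobs[job_id]["state"] = "finished"               # write 1 is applied
    if crash_between_writes:
        raise RuntimeError("simulated server crash")
    project_stats[project_id]["finished_jobs"] += 1  # write 2 may never happen


def is_consistent(project_id):
    """The counter should match the number of finished jobs."""
    finished = sum(1 for job in jobs.values() if job["state"] == "finished")
    return project_stats[project_id]["finished_jobs"] == finished


try:
    finish_job("job-1", "project-1", crash_between_writes=True)
except RuntimeError:
    pass

print("consistent after crash:", is_consistent("project-1"))
```

With a transaction, the first write would be rolled back; without one, something like a batch job has to find and repair the mismatch later.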

Summary

There is a niche where MongoDB can work well. Many customers tell us that they have positive experiences using MongoDB to store crawl data. So did Scrapinghub for a while, but it’s no longer a good fit for our requirements and we cannot easily work around the problems presented in this post.

I'll describe the new storage system in future posts, so please follow @scrapinghub if you are interested!

