Why MongoDB Is a Bad Choice for Storing Our Scraped Data

Why MongoDB Is a Bad Choice for Storing Our Scraped Data

MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the alternatives available a few years ago) and it worked well for some time.

Usage has grown from a simple store for scraped data used on a few projects to the back end of our Scrapy Cloud platform. Now we are experiencing limitations with our current architecture and rather than continue to work with MongoDB, we have decided to move to a different technology (more in a later blog post). Many customers are surprised to hear that we are moving away from MongoDB, I hope this blog post helps explain why it didn’t work for us. 

Locking

We have a large volume of short queries which are mostly writes from web crawls. These rarely cause problems as they are fast to execute and the volumes are quite predictable. However, we have a lower volume of longer running queries (e.g. exporting, filtering, bulk deleting, sorting, etc.) and when a few of these run at the same time we get lock contention. 

Each MongoDB database (server prior to 2.2) has a Readers-Writer lock. Due to lock contention all the short queries need to wait longer and the longer running queries get much longer! Short queries take so long they time out and are retried. Requests from our website (e.g. users browsing data) take so long that all worker threads in our web server get blocked querying MongoDB. Eventually the website and all web crawls stop working!

To address this we:

  • Modified the MongoDB driver to timeout operations and retry certain queries with an exponential backoff
  • Sync data to our new backend storage and run some of the bulk queries there
  • Have many separate MongoDB databases with data partitioned between them
  • Scaled up our servers
  • Delayed implementing (or disabled) features that need to access a lot of fresh data

Poor space efficiency

MongoDB does not automatically reclaim disk space used by deleted objects and it is not feasible (due to locking) to manually reclaim space without substantial downtime. It will attempt to reuse space for newly inserted objects, but we often end up with very fragmented data. Due to locking, it’s not possible for us to defragment without downtime.

Scraped data often compresses well, but unfortunately there is no built in compression in MongoDB. It doesn’t make sense for us to compress data before inserting because the individual records are often small and we need to search the data.

Always storing object field names can be wasteful, particularly when they never change in some collections.

Too Many Databases

We run too many databases for MongoDB to comfortably handle. Each database has a minimum size allocation so we have wasted space if the size of the data in that DB is small. If no data is in the disk cache (e.g. after a server restart), then it can take a long time to start MongoDB as it needs to check each database. 

Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written. Retrieving data in order requires sorting which is impractical when the number of records gets large.

It is only possible to maintain order in MongoDB if you use capped collections, which are not suitable for crawl output.

Skip + Limit Queries are slow

There is no limit on the number of items written per crawl job and it’s not unusual to see jobs that have a few million items. When reading data from the middle of a crawl job, MongoDB needs to walk the index from the beginning to the offset specified. It gets slow browsing deep into a job with a lot of data.

Users may download job data via our API by paginating results. For large jobs (say, over a million items), it’s very slow and some users work around this by issuing multiple queries in parallel, which of course causes high server load and lock contention.

Restrictions

There are some odd restrictions, like the allowed characters in object field names. This is unfortunate, since we lack control over the field names we need to store.

Impossible to keep the working set in memory

We have many TB of data per node. The frequently accessed parts are small enough that it should be possible to keep them in memory. The infrequently accessed data is often sequentially scanned crawl data.

MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO bound and lock contention becomes an issue.

Data that should be good, ends up bad!

After embracing MongoDB, its use spread to many areas, including as a back-end for our django UI. The data stored here should be clean and structured, but MongoDB makes this difficult. Some limitations that affected us are:

  • No transactions – We often need to update a few collections at a time and in the case of failure (server crash, bug, etc.) only some of this data is updated. Of course this leads to inconsistent state. In some cases we apply a mix of batch jobs to fix the data, or various work-arounds in code. Unfortunately, it has become common to just ignore the problem, thinking it might be rare and unimportant (a philosophy encouraged by MongoDB).
  • Silent failures hide errors – It’s better to detect errors early, and “let it crash”. Instead MongoDB hides problems (e.g. writing to non-existing collection) and encourages very defensive programming (does the collection exist? is there an index on the field I need? Is the data the type I expect? etc.)
  • Safe mode poorly understood – Often developers don’t understand that without safe=True, the data may never get written (e.g. in case of error), or may get written at some later time. We had many problems (such as intermittently failing tests) where developers expected to read back data they had written with safe=False.
  • Lack of a schema or data constraints – Bugs can lead to bad data being inserted in the database and going unnoticed.
  • No Joins – Joins are extremely useful, but with MongoDB you’re forced to either maintain denormalized data without triggers or transactions, or issue many queries loading reference data.

Summary

There is a niche where MongoDB can work well. Many customers tell us that they have positive experiences using MongoDB to store crawl data. So did Scrapinghub for a while, but it’s no longer a good fit for our requirements and we cannot easily work around the problems presented in this post.

I’ll describe the new storage system in future posts, so please follow @scrapinghub if you are interested!

Comment here or in HackerNews thread.

Be the first to know. Gain insights. Make better decisions.

Use web data to do all this and more. We’ve been crawling the web since 2010 and can provide you with web data as a service.

Tell me more

30 thoughts on “Why MongoDB Is a Bad Choice for Storing Our Scraped Data

  1. I think these are all valid concerns but a few are things which are known about MongoDB and so should factor into the original decision to use it. In particular, no joins, no transactions and no schema are all features and if you need those then you shouldn’t have chosen MongoDB in the first place.

    “Safe mode” changed in November last year and is now defaulted to on, or acknowledged writes. This helps with the problem you described and is a nice default because it gives reasonable performance + safety. You can dial back safety to get performance or vice versa.

    Locking is probably the most often cited issue and it is better in 2.4, but could (and will) be improved. I think the biggest issue for us is disk space space reuse. The compact command helps but it does require a maintenance window and some time to complete, which is inconvenient.

  2. I ‘ve been working with mongoDB for about a year now and encountered many of these problems and managed to implement workarounds for most, I’ll list a few of these workarounds.

    Poor space efficiency:
    Hard drives are cheap so this was a non-issue

    No transactions:
    we had to implement it ourselves with document based redis lock handled by our software. good thing they are not required everywhere in our use case, far from it. small atomic operations cover most of our cases.

    safe mode:
    I’d also prefer the default to be safe, but well.. this can be solved rather easily by coding some helpers.

    lack of schema/data constraint:
    depending on the language you use some solutions are already there, ex.: mongoose
    that being said even when we use mongoose we have many helpers on top of it so that schema require less code, and respect specific traits.

    but… I do see your point that it might not be worth your time if you need a workaround for many things and you should find a database more fitting to your use case. Good for us, the workarounds we required were negligible overhead.

    1. Very nice inded .
      Who’s going to maintain these ‘workarounds’ in three years if you’ve left this project ?
      Specially the transcation “workaround”…

  3. I am glad to hear there are locking improvements in 2.4. Indeed, 2.2 was a decent improvement for us, but didn’t go far enough. The safe mode defaults are also a welcome change.

    The lack of joins & transactions of course did factor into the original decision. My point (which perhaps could be clearer) was that MongoDB ended up being used outside of the area in which we originally intended to use it. There was some reluctance to add another technology when we could get by with what we had for what was (initially) only a small use. Additionally, some limitations were not always well understood by web developers (who were new to mongo and enthusiastic to try it). I see this as our mistake. With hindsight, it’s clear we should have introduced an RDBMS immediately and kept MongoDB for managing the crawl data.

  4. The ONLY reason to choose MongoDB is because you’re being lazy. Seriously, it has “NoSQL Cool” for people who want to write SQL. Any basic survey of the options out there- Cassandra, Riak, Couchbase, BigCouch, Voldemort, etc … gives you multiple actually distributed, scalable databases.

    Global Write Lock? You’re recommending a database with a global write lock? You’re fired.

    Seriously, when did engineering leave this profession? Where are people’s standards?

    MongoDB is only appropriate if your data fits on a single machine…. and if that’s the case there are many other, more established choices such as MySQL and Postgres.

  5. When we chose MongoDB, we used it for “a simple store for scraped data used on a few projects” back in 2010. It was a useful tool we could use on some consulting projects. We deployed it on an AWS small instance (the 2GB limit was fine for a while, we are self-funded and didn’t want to spend much on hardware) along with other services. The early version of the platform was quick to develop and this is partly thanks to MongoDB.

    Once we realized that we had a useful service and as our data size, traffic and requirements grew, we knew MongoDB wasn’t the best fit.

    MySQL or Postgress would not have worked well for storing scraped data as we need to store arbitrary JSON objects and filter them (e.g. find products in the last crawl job with a price < 20, find blog posts crawled by this spider with more than 10 comments, etc.).

    1. Your query examples suggests to me that elasticsearch might be a good fit. I have used at a smaller scale worked well for me.

  6. for the love of living things, why would a database be used to store large volumes of scraped data, especially for shared service like yours? It sounds like you need hdfs or some special-purpose datastore with good compression support. Relational db’s would be bad for this (ie. slow) in similar way’s mongo is.

    I’m not really defending mongo, but it sounds like you picked the wrong tool for the job in any case.

    1. Actually I can guarantee that RDBMS can handle this. Has been working very well for 14 years and in some cases on ancient 32bit technology and we aren’t talking about trivial traffic either.
      It’s attention to detail and common sense that is needed

  7. Great post indeed regarding your experience. We’re planning to adopt Mongo as well. However, I’ve studied it long enough for all of our use cases (one of which is similar to yours in terms of storing crawled data) so I wasn’t surprised to see the issues you ran ultimately ran into. As your follow up comment suggested, you definitely started using it outside the initial design, stuff that it wasn’t meant to do.

  8. Great post regarding limitations of MongoDB. Nice post. This is useful for all the techies..! Than x for this post..!

  9. Pingback: MongoDB
      1. We use hbase on many billions of records per month – 50 million would be common for a single project. I think hbase will work fine provided each individual record is not huge (it doesn’t do great with very large objects).

  10. Very nice post… We were thinking of using MongoDB for our new project. But after reading this post now we are re-thinking of our decision. Thanks though for the post.

  11. The fast you discovered that mongodb doesn’t support transactions after you implemented your system … says it all.
    For all sake I think your dev/ops guys don’t understand quite well mongodb – see this – Impossible to keep the working set in memory – ok so what database will do that for you ? On a second thought actually you can keep all working set in memory … but you may need to buy half of aws instances 🙂

    One think I can agree however is the fact that just using mongodb will NOT solve all your problems.
    You need to know very well the data structure you want (documents) as well the access/update patterns.

Leave a Reply

Your email address will not be published. Required fields are marked *