Disclaimer: While I’ve experimented with MongoDB, I haven’t deployed it to production. These are my honest reservations that keep me from deploying a production app in Mongo. I welcome comments from those who have spent time in the trenches.
There are features I really enjoy about MongoDB. I love the hierarchical data structure. I love making a single, fast call to get a chunk of JSON that’s stored natively in the shape I need. And I enjoy rapidly prototyping an idea by slapping JSON into Mongo until the ideal schema becomes more apparent.
But there’s a long list of concerns that keep me from choosing it for production.
Schema = Protection
In an RDBMS, the schema dictates data types, nullable fields, maximum lengths, foreign key constraints, etc. This clarifies requirements. It simplifies application code. It also means when someone inevitably sends me a dataset to import, the DB won’t accept malformed data. In contrast, schemaless databases don’t natively protect against bad data. Sure, your application can (and should) enforce rules. When working with MongoDB, the schema can be enforced on the application side via Mongoose. However, there’s a gaping hole in this approach: Developers can insert and update data directly via the database, bypassing the application’s rules entirely. Honestly, have you ever seen a DB where all data was entered and manipulated solely through the application? Ad-hoc data manipulation will happen, so the database should protect itself.
Look guys, the data fits perfectly.
A strategically designed database won’t allow bad data in. Well-designed schemas work like a gated check-in. You’re forced to normalize and clean up your data before import, which is the most logical time to do so. Yes, I recognize MongoDB’s design assumes the lack of a schema is a selling point, but in my 15+ years of application development, I’ve yet to build an app where the structure of the data didn’t matter. A lot. I agree with Sarah Mei:
Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value. If you have an implicit schema — meaning, if there are things you are expecting in that JSON — then MongoDB is the wrong choice.
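To make the gated check-in idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any RDBMS. The owner and vehicle tables, their columns, and the try_insert helper are hypothetical names invented for illustration; they don't come from a real project. The point is only that the rules live in the database, so they apply no matter which application or ad-hoc script does the insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE owner (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL CHECK (length(name) <= 50)
);
CREATE TABLE vehicle (
    id         INTEGER PRIMARY KEY,
    owner_id   INTEGER NOT NULL REFERENCES owner(id),
    model_year INTEGER NOT NULL CHECK (typeof(model_year) = 'integer')
);
""")
conn.execute("INSERT INTO owner (id, name) VALUES (1, 'Ada')")


def try_insert(owner_id, model_year):
    """Hypothetical helper: attempt an insert and report what the DB decided."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO vehicle (owner_id, model_year) VALUES (?, ?)",
                (owner_id, model_year),
            )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert(1, 2015),     # well-formed row
    try_insert(1, None),     # violates NOT NULL
    try_insert(1, 2015.5),   # violates the integer CHECK
    try_insert(99, 2015),    # violates the foreign key (no owner 99)
]
for line in results:
    print(line)
```

Only the first row gets in. The other three are turned away by the database itself, which is the protection an application-side schema can't offer once someone connects directly.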
Complex Ad-hoc Queries
Nearly every app I’ve built eventually requires ad-hoc queries to pull together data for reporting and complex UIs. Sure, Mongo supports querying, but your choice of document structure greatly impacts query performance and feasibility. Mongo is fastest when treated like a simple key/value store for retrieving hierarchical data. Thus, selecting the ideal document structure up-front is critical to ensure you can efficiently access the data you need. And as you’ll see below, the right document structure for today often becomes the wrong structure for tomorrow.
Hierarchical Documents Assume Static Requirements
In a traditional RDBMS, your schema models relationships that are highly unlikely to change. The inherent nature of the data guides you toward a logical normalization path that will support ad-hoc queries and avoid repeating data. In a schemaless DB like MongoDB, selecting a logical document structure requires considering all the ways your data might be used up-front. Making the right call for the long-term isn’t easy.
Ironically, Mongo’s schemaless hierarchical document structure seems best suited for static data retrieval requirements. This means the document structure you initially choose can become a real millstone later. What options do I have when the document structure I selected no longer supports the queries I need? This is exactly what happened to Sarah Mei when building an app for Diaspora. Sure, relational databases require you to make a decision about your data structure up-front as well, but the explicit, normalized, and relational nature of that structure makes it more flexible and versatile. Flexibility is key when requirements change. And they always do.
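Here is a rough sketch of that millstone effect. Plain Python dicts stand in for documents (this is not the MongoDB API), and sqlite3 stands in for a relational DB. The users, posts, and likes data, along with the post and post_like tables, are made up for illustration. The embedded shape is ideal for the original requirement, "show a user's page", but a later requirement, "which posts has bob liked?", cuts across the hierarchy.

```python
import sqlite3

# Document shape chosen for "show a user's page": posts embedded under each user.
users = [
    {"name": "ann", "posts": [
        {"title": "Hello", "likes": ["bob"]},
        {"title": "Again", "likes": []},
    ]},
    {"name": "bob", "posts": [
        {"title": "Hi", "likes": ["ann", "bob"]},
    ]},
]

# Original requirement: one lookup returns everything in the shape needed.
ann_page = next(u for u in users if u["name"] == "ann")


def posts_liked_by(name):
    """New requirement: must walk every user and every post to answer."""
    return sorted(
        p["title"] for u in users for p in u["posts"] if name in p["likes"]
    )


# The same data, normalized: neither access path is privileged.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    id     INTEGER PRIMARY KEY,
    author TEXT NOT NULL,
    title  TEXT NOT NULL
);
CREATE TABLE post_like (
    post_id INTEGER NOT NULL REFERENCES post(id),
    liker   TEXT NOT NULL
);
""")
for u in users:
    for p in u["posts"]:
        cur = conn.execute(
            "INSERT INTO post (author, title) VALUES (?, ?)",
            (u["name"], p["title"]),
        )
        conn.executemany(
            "INSERT INTO post_like (post_id, liker) VALUES (?, ?)",
            [(cur.lastrowid, liker) for liker in p["likes"]],
        )

liked_sql = [row[0] for row in conn.execute(
    "SELECT post.title FROM post "
    "JOIN post_like ON post_like.post_id = post.id "
    "WHERE post_like.liker = ? ORDER BY post.title",
    ("bob",),
)]
print(posts_liked_by("bob"), liked_sql)
```

Both approaches return the same answer on this toy data. The difference is that the normalized tables answer the new question with an ordinary join that can be indexed, while the embedded shape has to be scanned or restructured.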
You Need Two DBs Anyway
For reasons outlined above, you’ll likely find you need two DBs: One for transactions, and another for analytics. Maintaining and syncing two DBs in radically different technologies isn’t trivial. Even Mongo reps acknowledge the common need for separate OLTP and OLAP DBs when working with MongoDB. But if I need two databases, why not simply lean on two relational DBs that are optimized for these two very different roles? Two optimized RDBMSs are likely to offer comparable performance, superior data integrity, and simpler maintenance, as the DBA doesn’t have to learn the intricacies of managing and syncing two radically different DB systems.
Inconsistent Data = Bugs
I’ve never written an app where an inconsistent data structure was considered a feature. Yes, inconsistent data structures are unavoidable in some cases, but when at all possible, I prefer setting explicit requirements and expectations for my data structures. Why? This mindset drives out holes in requirements. Oh, a user can have multiple addresses? A vehicle’s model year may be null or a decimal? Whoa, these things have a big impact and ripple through a system all the way up to the UI design. Without a schema to force these questions, schemaless DBs leave edge cases hidden in your data. Edge cases should be addressed as early as possible, and an explicit schema enforces this critical step. The later you find out about edge cases in your data, the greater the impact to your application (and your timelines).
If this little guy is in my hotel bed, I’d like to know before booking.
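As a small illustration of how such edge cases hide in schemaless data, the sketch below scans a handful of made-up vehicle documents (plain Python dicts, invented for this example) and reports every type each field actually takes. The field_types helper is hypothetical. With an explicit schema the database answers this question up-front; without one, you find out by auditing the data after the fact.

```python
from collections import defaultdict

# Hypothetical documents, as they might accumulate without a schema.
vehicles = [
    {"make": "Ford", "model_year": 2014},
    {"make": "Audi", "model_year": None},
    {"make": "Fiat", "model_year": 2015.5},
    {"make": "Saab"},  # field absent entirely
]


def field_types(docs):
    """Map each field name to the set of type names observed across docs."""
    all_fields = {key for doc in docs for key in doc}
    seen = defaultdict(set)
    for doc in docs:
        for field in all_fields:
            if field in doc:
                seen[field].add(type(doc[field]).__name__)
            else:
                seen[field].add("missing")
    return dict(seen)


report = field_types(vehicles)
for field in sorted(report):
    print(field, sorted(report[field]))
```

Here make is always a string, but model_year turns out to be an int, a float, null, or absent, which is four cases the UI and every query now have to handle.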
Explicit > Implicit
Even if you embrace the wild west mentality that you don’t need a schema, you still have one. It’s just implicit instead of explicit. And that’s a problem because clean code is about writing code for humans. Explicit schemas convey expectations in a clear, standardized, and centralized manner that humans can easily understand.
The argument here parallels considerations in static vs dynamic typing. Dynamic types can help you move faster in the short-term, but you lose the ability to lean on the compiler, and the decisions you make about types become implicit instead of explicit. Thus, in a dynamic language, your co-workers have to read either comments or tests to determine data types and interfaces. In a statically typed language, the interface and expectations are explicit. Bottom line, you have to convey your assumptions at some point, and a schema is a consistent and logical point to do so. Admittedly, you can define a MongoDB schema in your application code via Mongoose, but this is optional and, as we discussed above, fails to adequately protect the data from direct manipulation.
Bottom Line
These are my core reservations, but I’d love to be corrected in the comments. If data integrity truly doesn’t matter to you, then Mongo’s hierarchical data structure, schemaless nature, and high performance may make it a great fit for you. But until I build an app where the data structure truly doesn’t matter, I’m sticking with a traditional RDBMS.
Thought #1: I agree with you on the issues of schema = protection: bad data and ad-hoc data entered directly into the DB.
Thought #2: It would be hard for a SQL DB to beat the flexibility of MongoDB in document handling (i.e. text – huge amounts of searchable text).
All in all, I think I’ll continue using schema-based, structured DBs and continue experimenting with “schemaless” MongoDB.
Nice article.
Though it holds true for all other document databases.
That’s why the 2nd generation of Document Databases supports schema-less, schema-full and schema-hybrid features. Furthermore, Transactions and Relationship concepts are back. The first “Multi-Model” DBMS, as the evolution of the 1st generation of NoSQL, is OrientDB (orientdb.com), but other players are improving their engines to be multi-model and overcome such limitations. I think the 1st generation of NoSQL was useful to break the RDBMS domain in the developer’s heart. Now we need something more. My 0,02.
I like the idea of using PostgreSQL to get the best of both worlds. Postgres supports the JSON datatype, which can be indexed and queried (http://www.postgresql.org/docs/9.4/static/datatype-json.html). This helps cover two very different use cases with a single technology. Nice way to dip our toes into a new methodology.
Thanks for the comment!
MongoDB 3.2 (unreleased future version) will include document validation (not schema validation):
https://jira.mongodb.org/browse/SERVER-18227
It only applies to new documents, but I suppose that would be sufficient for new projects. Or any old projects with a currently consistent schema. Either way, it’s a nice enhancement!
Have a look at http://www.cloudcms.com. They handle the schema consistency issue by allowing you to create “definitions” off collections, i.e. you use a JSON schema document to define a collection’s structure. It also allows you to create a highly relational structure – a bit like a graph database overlaying a document database. Lots of other cool features, like being able to embed elements of one document in another document and keep them synced: when one gets updated, it updates the embedded copies, etc. It’s quite a learning curve – but there is some amazing functionality to solve a lot of the issues highlighted with Mongo above – and a whole lot of other concepts that give loads of power.
I totally disagree that MongoDB is the wrong choice if you have expectations about the shape of the data. I believe that the technology comes with a change in mindset and process. What NoSQL allows is moving the integrity enforcement from the storage to the application. This also means the enforcement needs to be done with new data import processes.
The benefit of NoSQL isn’t simply the ability to mutate the shape of the data, but rather to have data stored in a shape that your application expects. Now I don’t have to spend time denormalizing the data to actually make use of it.
So I have an API that is the gateway to my storage for everything going in and coming out. The API defines the expectation and enforces the integrity. Change happens in one central repository. I don’t have domain knowledge split between my storage and my application.
A change to my data model in SQL means me changing the table, various stored procedures and views, then the various applications that hinge on this data – the model, mappings, repositories and and and…
With NoSQL, I change my model and the change is persisted.
When changing workflows becomes so seamless, the need for ad-hoc data manipulation disappears. When data is denormalized, ad-hoc queries aren’t all that complex.
The way a startup views their data vs a corporate view is very different. At the end of the day, you use the right tool for the job.