I never really "got" the new wave of NoSQL databases. Mongo seemed to be the one I could most easily wrap my head around, but still.
I was never sure, though, if that meant I had never faced a problem suitable for one of these DBMSs or if my mind is just so warped by years of using relational engines (mostly Postgres, or SQLite for simple projects) that I could not think of modeling my data any other way.
Recently, though, I had to get familiar with the database schema of the ERP system we use at work, plus some modifications made to it over the years, and it feels to me like somebody was trying to force a square peg through a round hole (i.e. trying to model data in relational terms while either not fully "getting" the relational model or working with data that simply refuses to be modeled that way).
I sometimes think the people who wrote the ERP system might have enjoyed a NoSQL DBMS. Then again, with a multi-user ERP system, you <i>really</i> want transactions (personally, I feel that ACID-compliant transactions are the single most useful benefit of RDBMS engines), and most NoSQL engines seem to kind of not have them.
(1) Transactional model. Many NoSQL databases are non-ACID, but not all of them (Google's stores all offer some transactional guarantees). Some databases relax their transactional guarantees to gain efficiency; others probably just haven't gotten around to implementing a proper transactional system yet.
(2) Data model. The relational model can be overly restrictive: you cannot easily represent contained, repeated elements within an object without ending up in a crazy join smorgasbord. Note that this doesn't mean you have to be dynamically typed.
(3) Distribution/sharding/clustering. RDBMSes are traditionally single-machine, and getting them to cluster is usually a huge source of pain. NoSQL databases are often built from the ground up for sharding.
I think people go for MongoDB mostly for (2) and ease of use. Very few people have an actual big-data problem where you really need (3), and for reliability there are simpler solutions (hot standby). (1) makes it so much easier to build reliable systems that giving it up would be a real deal breaker for me.
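To make (2) concrete, here's a minimal sketch (the schema and data are invented, and SQLite stands in for a full RDBMS): a document store keeps an order's repeated line items nested inside the order itself, while the relational version needs a child table and a join to read the same data back.

```python
import sqlite3

# Document view: the repeated line items live inside the order itself.
order_doc = {
    "id": 1,
    "customer": "acme",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Relational view: the same data needs a child table and a join to read back.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        sku      TEXT NOT NULL,
        qty      INTEGER NOT NULL
    );
""")
db.execute("INSERT INTO orders VALUES (1, 'acme')")
db.executemany("INSERT INTO order_items VALUES (1, ?, ?)",
               [("A-100", 2), ("B-200", 1)])

rows = db.execute("""
    SELECT o.customer, i.sku, i.qty
    FROM orders o JOIN order_items i ON i.order_id = o.id
    ORDER BY i.sku
""").fetchall()
print(rows)  # [('acme', 'A-100', 2), ('acme', 'B-200', 1)]
```

Neither side is wrong, but if almost all your access is "fetch the whole order", the nested form is the more natural fit.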
I personally don't understand why so many people go for NoSQL. It seems to me that it creates a substantial cost, both in performance and, more importantly, in missing transaction guarantees, with no real benefit, at least none that's obvious to me. MongoDB in particular, with its unapplied writes, no real transactions, and no real distribution story, seems like an odd choice.
Pardon me if I'm just sniping on the word "object" here, but if you think of your data as objects then you will find the relational model restrictive.
In my experience, objects are an application concept, closely coupled to an implementation. If you can conceive of your data in implementation-independent terms, i.e. as entities and relationships, then you can put a RDBMS to effective use.
You don't need big data for (3); we use Mongo sharding primarily to spread database write load. I'm not saying it's the best way to do that, but it's what we use it for, and I doubt we're the only ones.
I see where you are coming from, and my opinion is mostly the same. However, I can see a use case for relaxing constraints in a distributed scenario, as described by the CAP theorem.
In a distributed scenario, when a partition event occurs, relational databases opt for consistency, whereas NoSQL opts for availability. This is formally correct behaviour by relational DBs, but it comes with a cost: a serial performance component that can't be parallelized. Most NoSQL DBs, in this scenario, go for availability, and may thus eschew some consistency-guarantee work, at a cost in data consistency and an advantage in parallel performance.
The trick, as ever, is to use each tool for its intended purpose. NoSQL and relational DBs are wholly different tools, for wholly different problem classes. Using NoSQL where consistency is paramount irks me to no end, and that is the case 80% of the time I see people using NoSQL. On the other hand, in specific cases, NoSQL DBs are a useful new tool in my arsenal.
It's for storing huge amounts of data with dynamic keys.
For example, when a site admin says "I want a new archive that I can fill with items; items will have Id (assigned automatically), Name (string), IsMale (bool)". He also wants to run complex queries on this data. That's where NoSQL comes in to help.
And to answer why exactly MongoDB is so popular: it has awesome driver support for every popular language.
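A toy sketch of the admin-defined-fields scenario above, with plain Python dicts standing in for a document store (all names here are invented for illustration):

```python
# Each "item" is just a dict; the admin-configured fields (Name, IsMale)
# are ordinary keys rather than columns that required a schema migration.
archive = []

def add_item(**fields):
    item = {"Id": len(archive) + 1, **fields}   # Id assigned automatically
    archive.append(item)
    return item

add_item(Name="Alice", IsMale=False)
add_item(Name="Bob", IsMale=True)
add_item(Name="Widget", Color="red")            # later the admin adds a new field

# "Complex queries" are just filters over whatever keys happen to exist.
males = [i["Name"] for i in archive if i.get("IsMale")]
print(males)  # ['Bob']
```

A real document store adds persistence and indexes on top, but the data model is essentially this: records carry whatever keys they carry.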
I don't understand what's so hard to understand here. It's a simple solution to the EAV/null-table nightmare.
We've redone the product catalog for a website using a NoSQL solution (not MongoDB, but we did look at it). Our products fall into many different categories and have vastly different attributes depending on the category. NoSQL solutions are perfect for this. As you point out, it's a simple alternative to deploying an EAV model.
I've only seen EAV used in one system, Magento, but it was a disaster. It's complex and slow, to the point that every product is stored both in the EAV model and as a "flattened product".
For systems dealing with sales and finances in general I would almost always pick an RDBMS; it seems a much more natural fit. The ability to do ad-hoc queries in SQL, rather than map-reduce, is a huge advantage.
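For anyone who hasn't run into it, the EAV pattern being discussed stores each attribute as an (entity, attribute, value) row, so reading an entity back costs one self-join per attribute. A hypothetical catalog, sketched in SQLite:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- EAV: one row per (entity, attribute, value) triple.
    CREATE TABLE product_attr (
        product_id INTEGER REFERENCES product(id),
        attr  TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")
db.execute("INSERT INTO product VALUES (1, 'laptop')")
db.executemany("INSERT INTO product_attr VALUES (1, ?, ?)",
               [("ram_gb", "16"), ("screen_in", "14")])

# Reading two attributes already takes two joins; N attributes take N joins,
# and every value comes back as TEXT regardless of its real type.
row = db.execute("""
    SELECT p.name, ram.value, scr.value
    FROM product p
    JOIN product_attr ram ON ram.product_id = p.id AND ram.attr = 'ram_gb'
    JOIN product_attr scr ON scr.product_id = p.id AND scr.attr = 'screen_in'
""").fetchone()
print(row)  # ('laptop', '16', '14')
```

In a document store the same product is one record with `ram_gb` and `screen_in` as ordinary fields, which is the "simple alternative" the parent comments are describing.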
Quite frankly, no application I have ever worked on has had to deal with "huge" amounts of data by any common definition (a couple of gigabytes at the most).
And like I said, looking at our ERP system's database I am beginning to understand the appeal of a database without a fixed schema. Some of the tables have dozens of columns, with most rows full of NULL values. So I do get that part, but no application I have ever worked on was like that.
This avoids large numbers of rows containing nulls, but it violates a normal form. The mnemonic is that the table must contain the Key, the whole key, and nothing but the key, so help me Codd. Anytime you have a "REFERENCES table(pk) UNIQUE", you violate the "whole key" bit.
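A minimal sketch of the pattern the parent describes (table and column names are invented): optional attributes move into a 1:1 side table keyed by a UNIQUE foreign key, so the NULLs exist only in the reassembled view, not in storage.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Optional attributes split into a 1:1 side table instead of NULL columns.
    -- The "REFERENCES table(pk) UNIQUE" shape is the pattern being discussed.
    CREATE TABLE employee_car (
        employee_id INTEGER NOT NULL UNIQUE REFERENCES employee(id),
        plate TEXT NOT NULL
    );
""")
db.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ann"), (2, "Ben")])
db.execute("INSERT INTO employee_car VALUES (1, 'XYZ-123')")  # only Ann has a car

# A LEFT JOIN reassembles the wide view; Ben's plate is NULL only here,
# it is never stored as a NULL column.
rows = db.execute("""
    SELECT e.name, c.plate
    FROM employee e LEFT JOIN employee_car c ON c.employee_id = e.id
    ORDER BY e.id
""").fetchall()
print(rows)  # [('Ann', 'XYZ-123'), ('Ben', None)]
```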
The nice thing about using an RDBMS with JSON support, rather than a NoSQL solution, is that you can store all the fixed-schema stuff in column as usual, and benefit from the performance, consistency, ease of joins and so on with that, but you can also store your JSON documents alongside that data in the same table, efficiently indexed.
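A rough sketch of that hybrid layout (invented schema; SQLite plus Python's `json` module stand in here, so the document filter runs client-side, whereas Postgres would run it server-side against a `jsonb` column, optionally backed by a GIN index):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Fixed-schema columns and a free-form JSON document, side by side in one table.
db.execute("""
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL,
        attrs TEXT NOT NULL      -- JSON document; jsonb in Postgres
    )
""")
db.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "t-shirt", 9.99, json.dumps({"size": "M", "color": "blue"})),
    (2, "laptop", 899.0, json.dumps({"ram_gb": 16, "screen_in": 14})),
])

# The document filter runs client-side in this sketch; in Postgres you'd write
# WHERE attrs @> '{"color": "blue"}' and let a GIN index serve it.
blue = [name for name, attrs in db.execute("SELECT name, attrs FROM product")
        if json.loads(attrs).get("color") == "blue"]
print(blue)  # ['t-shirt']
```

The fixed columns (`name`, `price`) keep their types, constraints, and joinability; only the genuinely variable attributes go into the document column.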
Yes, but what happens if your JSON data size grows so large that it can't fit on a single machine? Multi-master replication or sharding is a terrible pain in any RDBMS (at least according to my research and trials).
At the end of 2013, Stack Overflow ran on one SQL Server (plus a Redis server for caching). The rest of Stack Exchange runs on another SQL Server.[0]
For the most part, for most projects, worrying about multi-master replication is going to be pointless. You can always put some data in a distributed K/V (or document) store and point to that from your SQL if you need to.
What I meant was a data size too large for one machine. Adding arbitrarily large JSON to your table could expand the data size to be too big for one machine, or even too big for block storage. Plus you might not want it all in block storage. My point is that an RDBMS can be better served storing the relational data alone, with a separate DB engine for the potentially massive JSON data.
How many use cases really exceed a single, multi-terabyte machine? Your argument is true for any data type and format not just JSON. If you have more than a few terabytes then you have "big data" and most solutions, including Mongo, probably won't work.
You're storing exactly the same data whatever you're storing it in. The point is that there's rather few use cases where your primary database is going to be more than a few terabytes, which is usually easy to handle with a single machine + some caching. I pointed to Stack Overflow as an example of a large site which still manages to keep its entire database on one machine.
It has nothing to do with the format. I'm using JSON as an example of something that could be large. I'm only talking about separating the "could become massive" columns/data/whatever from the "small, relational data".
Well, one thing we store is HTML content and "MS Word"-like document data. We also store hundreds of revisions of all of those documents. I wouldn't want to use an RDBMS for this because (a) it's not relational, but also (b) backup/replication/load distribution would be too painful. A system like CouchDB can be spread out over n machines, any of them write-capable.
If I hadn't seen so many dev teams eschew constraints and triggers in favor of broken client code being allowed to screw up the data, I would agree to the political element easily. But in practice, the ignorance is so great, I can't even be sure there is a political choice being made.