I never really "got" the new wave of NoSQL databases. Mongo seemed to be the one I could most easily wrap my head around, but still.
I was never sure, though, if that meant I had never faced a problem suitable for one of these DBMSs or if my mind is just so warped by years of using relational engines (mostly Postgres, or SQLite for simple projects) that I could not think of modeling my data any other way.
Recently, though, I had to get familiar with the database schema of the ERP system we use at work, plus some modifications made to it over the years, and it feels to me like somebody was trying to force a square peg through a round hole (i.e. trying to model data in relational terms while either not fully "getting" the relational model or working with data that simply refuses to be modeled that way).
I sometimes think the people who wrote the ERP system might have enjoyed a NoSQL DBMS. Then again, with a multi-user ERP system, you <i>really</i> want transactions (personally, I feel that ACID-compliant transactions are the single most useful benefit of RDBMS engines), and most NoSQL engines seem to kind of not have them.
(1) Transactional model. Many NoSQL databases are non-ACID, but not all of them (Google's stores all offer some transactional guarantees). Some databases relax their transactional guarantees to gain efficiency; others probably just haven't gotten around to implementing a proper transactional system yet.
(2) Data model. The relational model can be overly restrictive: you cannot easily represent contained, repeated elements within an object without ending up in a crazy join smorgasbord. Note that this doesn't mean you have to be dynamically typed.
(3) Distribution/sharding/clustering. RDBMSes are traditionally single-machine, and getting them to cluster is usually a huge source of pain. NoSQL databases are often built from the ground up for sharding.
I think people go for MongoDB mostly for (2) and ease of use. Very few people have an actual big-data problem where you really need (3), and for reliability there are simpler solutions (hot standby). (1) makes it so much easier to build reliable systems that giving it up would be a real deal breaker for me.
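To make (2) concrete, here's a minimal sketch (the schema and data are invented, and SQLite stands in for a full RDBMS): a document store keeps an order's repeated line items nested inside the order itself, while the relational version needs a child table and a join to read the same data back.

```python
import sqlite3

# Document view: the repeated line items live inside the order itself.
order_doc = {
    "id": 1,
    "customer": "acme",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Relational view: the same data needs a child table and a join to read back.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        sku      TEXT NOT NULL,
        qty      INTEGER NOT NULL
    );
""")
db.execute("INSERT INTO orders VALUES (1, 'acme')")
db.executemany("INSERT INTO order_items VALUES (1, ?, ?)",
               [("A-100", 2), ("B-200", 1)])

rows = db.execute("""
    SELECT o.customer, i.sku, i.qty
    FROM orders o JOIN order_items i ON i.order_id = o.id
    ORDER BY i.sku
""").fetchall()
print(rows)  # [('acme', 'A-100', 2), ('acme', 'B-200', 1)]
```

Neither side is wrong, but if almost all your access is "fetch the whole order", the nested form is the more natural fit.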
I personally don't understand why so many people go for NoSQL. It seems to me that it creates a substantial cost, both in performance and, more importantly, in missing transaction guarantees, with no real benefit, at least none that's obvious to me. MongoDB in particular, with its unapplied writes, no real transactions, and no real distribution story, seems like an odd choice.
Pardon me if I'm just sniping on the word "object" here, but if you think of your data as objects then you will find the relational model restrictive.
In my experience, objects are an application concept, closely coupled to an implementation. If you can conceive of your data in implementation-independent terms, i.e. as entities and relationships, then you can put a RDBMS to effective use.
You don't need big data for (3); we use Mongo sharding primarily to spread database write load. I'm not saying it's the best way to do that, but it's what we use it for, and I doubt we're the only ones.
I see where you are coming from, and my opinion is mostly the same. However, I can see a use case for relaxing constraints in a distributed scenario, as described by the CAP theorem.
In a distributed scenario, when a partition event occurs, relational databases opt for consistency, whereas NoSQL opts for availability. This is formally correct behaviour by relational DBs, but it comes with a cost: a serial performance component that can't be parallelized. Most NoSQL DBs, in this scenario, go for availability, and may thus eschew some consistency-guarantee work, at a cost in data consistency and an advantage in parallel performance.
The trick, as ever, is to use each tool for its intended purpose. NoSQL and relational DBs are wholly different tools, for wholly different problem classes. Using NoSQL where consistency is paramount irks me to no end, and that is the case 80% of the time I see people using NoSQL. On the other hand, in specific cases, NoSQL DBs are a useful new tool in my arsenal.
It's for storing huge amounts of data with dynamic keys.
For example, when a site admin says "I want a new archive that I can fill with items; items will have Id (assigned automatically), Name (string), IsMale (bool)". He also wants to run complex queries on this data. That's where NoSQL comes in to help.
And to answer why exactly MongoDB is so popular: it has awesome driver support for every popular language.
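A toy sketch of the admin-defined-fields scenario above, with plain Python dicts standing in for a document store (all names here are invented for illustration):

```python
# Each "item" is just a dict; the admin-configured fields (Name, IsMale)
# are ordinary keys rather than columns that required a schema migration.
archive = []

def add_item(**fields):
    item = {"Id": len(archive) + 1, **fields}   # Id assigned automatically
    archive.append(item)
    return item

add_item(Name="Alice", IsMale=False)
add_item(Name="Bob", IsMale=True)
add_item(Name="Widget", Color="red")            # later the admin adds a new field

# "Complex queries" are just filters over whatever keys happen to exist.
males = [i["Name"] for i in archive if i.get("IsMale")]
print(males)  # ['Bob']
```

A real document store adds persistence and indexes on top, but the data model is essentially this: records carry whatever keys they carry.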
I don't understand what's so hard to understand here. It's a simple solution to the EAV/null-table nightmare.
We've redone the product catalog for a website using a NoSQL solution (not MongoDB, but we did look at it). Our products fall into many different categories and have vastly different attributes depending on the category. NoSQL solutions are perfect for this. As you point out, it's a simple alternative to deploying an EAV model.
I've only seen EAV used in one system, Magento, but it was a disaster. It's complex and slow, to the point that every product is stored both in the EAV model and as a "flattened product".
For systems dealing with sales and finances in general I would almost always pick an RDBMS; it seems a much more natural fit. The ability to do ad-hoc queries in SQL, rather than map-reduce, is a huge advantage.
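For anyone who hasn't run into it, the EAV pattern being discussed stores each attribute as an (entity, attribute, value) row, so reading an entity back costs one self-join per attribute. A hypothetical catalog, sketched in SQLite:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- EAV: one row per (entity, attribute, value) triple.
    CREATE TABLE product_attr (
        product_id INTEGER REFERENCES product(id),
        attr  TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")
db.execute("INSERT INTO product VALUES (1, 'laptop')")
db.executemany("INSERT INTO product_attr VALUES (1, ?, ?)",
               [("ram_gb", "16"), ("screen_in", "14")])

# Reading two attributes already takes two joins; N attributes take N joins,
# and every value comes back as TEXT regardless of its real type.
row = db.execute("""
    SELECT p.name, ram.value, scr.value
    FROM product p
    JOIN product_attr ram ON ram.product_id = p.id AND ram.attr = 'ram_gb'
    JOIN product_attr scr ON scr.product_id = p.id AND scr.attr = 'screen_in'
""").fetchone()
print(row)  # ('laptop', '16', '14')
```

In a document store the same product is one record with `ram_gb` and `screen_in` as ordinary fields, which is the "simple alternative" the parent comments are describing.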
Quite frankly, no application I have ever worked on has had to deal with "huge" amounts of data by any common definition (a couple of gigabytes at the most).
And like I said, looking at our ERP system's database I am beginning to understand the appeal of a database without a fixed schema. Some of the tables have dozens of columns, with most rows full of NULL values. So I do get that part, but no application I have ever worked on was like that.
This avoids large numbers of rows containing nulls, but it violates a normal form. The mnemonic is that the table must contain the Key, the whole key, and nothing but the key, so help me Codd. Anytime you have a "REFERENCES table(pk) UNIQUE", you violate the "whole key" bit.
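A minimal sketch of the pattern the parent describes (table and column names are invented): optional attributes move into a 1:1 side table keyed by a UNIQUE foreign key, so the NULLs exist only in the reassembled view, not in storage.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Optional attributes split into a 1:1 side table instead of NULL columns.
    -- The "REFERENCES table(pk) UNIQUE" shape is the pattern being discussed.
    CREATE TABLE employee_car (
        employee_id INTEGER NOT NULL UNIQUE REFERENCES employee(id),
        plate TEXT NOT NULL
    );
""")
db.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ann"), (2, "Ben")])
db.execute("INSERT INTO employee_car VALUES (1, 'XYZ-123')")  # only Ann has a car

# A LEFT JOIN reassembles the wide view; Ben's plate is NULL only here,
# it is never stored as a NULL column.
rows = db.execute("""
    SELECT e.name, c.plate
    FROM employee e LEFT JOIN employee_car c ON c.employee_id = e.id
    ORDER BY e.id
""").fetchall()
print(rows)  # [('Ann', 'XYZ-123'), ('Ben', None)]
```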
The nice thing about using an RDBMS with JSON support, rather than a NoSQL solution, is that you can store all the fixed-schema stuff in column as usual, and benefit from the performance, consistency, ease of joins and so on with that, but you can also store your JSON documents alongside that data in the same table, efficiently indexed.
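A rough sketch of that hybrid layout (invented schema; SQLite plus Python's `json` module stand in here, so the document filter runs client-side, whereas Postgres would run it server-side against a `jsonb` column, optionally backed by a GIN index):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Fixed-schema columns and a free-form JSON document, side by side in one table.
db.execute("""
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL,
        attrs TEXT NOT NULL      -- JSON document; jsonb in Postgres
    )
""")
db.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "t-shirt", 9.99, json.dumps({"size": "M", "color": "blue"})),
    (2, "laptop", 899.0, json.dumps({"ram_gb": 16, "screen_in": 14})),
])

# The document filter runs client-side in this sketch; in Postgres you'd write
# WHERE attrs @> '{"color": "blue"}' and let a GIN index serve it.
blue = [name for name, attrs in db.execute("SELECT name, attrs FROM product")
        if json.loads(attrs).get("color") == "blue"]
print(blue)  # ['t-shirt']
```

The fixed columns (`name`, `price`) keep their types, constraints, and joinability; only the genuinely variable attributes go into the document column.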
Yes, but what happens if your JSON data size grows so large that it can't fit on a single machine? Multi-master replication or sharding is a terrible pain in any RDBMS (at least according to my research and trials).
At the end of 2013, Stack Overflow ran on one SQL Server (plus a Redis server for caching). The rest of Stack Exchange runs on another SQL Server.[0]
For the most part, for most projects, worrying about multi-master replication is going to be pointless. You can always put some data in a distributed K/V (or document) store and point to that from your SQL if you need to.
What I meant was a data size too large for one machine. Adding arbitrarily large JSON to your table could expand the data size to be too big for one machine, or even too big for block storage. Plus you might not want it all in block storage. My point is that an RDBMS can be better served storing the relational data alone, with a separate DB engine for the potentially massive JSON data.
How many use cases really exceed a single, multi-terabyte machine? Your argument is true for any data type and format not just JSON. If you have more than a few terabytes then you have "big data" and most solutions, including Mongo, probably won't work.
You're storing exactly the same data whatever you're storing it in. The point is that there's rather few use cases where your primary database is going to be more than a few terabytes, which is usually easy to handle with a single machine + some caching. I pointed to Stack Overflow as an example of a large site which still manages to keep its entire database on one machine.
It has nothing to do with the format. I'm using JSON as an example of something that could be large. I'm only talking about separating the "could become massive" columns/data/whatever from the "small, relational data".
Well, one thing we store is HTML content and "MS Word"-like document data. We also store hundreds of revisions of all of those documents. I wouldn't want to use an RDBMS for this because (a) it's not relational, but also (b) backup/replication/load distribution would be too painful. A system like CouchDB can be spread out over n machines, any of them write-capable.
If I hadn't seen so many dev teams eschew constraints and triggers in favor of broken client code being allowed to screw up the data, I would agree to the political element easily. But in practice, the ignorance is so great, I can't even be sure there is a political choice being made.