MapReduce from the basics to the actually useful (in under 30 minutes)

conesus · on Jan 14, 2011

Am I crazy or was there not a single reduce query in the post? I've been working on map reduce queries for a little while now, aggregating statistics over feeds and stories posted, normalizing over item frequencies, and generally doing simple stuff.

I was very excited to see the second half of MapReduce. The reduce queries are always harder to write. If map is addition, reduce is division. I was hoping for examples of reduce queries that are more advanced than the standard "count the tags in a list of lists". But no reduce, period. So am I missing something?

banjiewen · on Jan 14, 2011

There are reduce views there - both examples 2 and 3 use the built-in "_sum" reduce, which is equivalent to something like:

    function(key, values, rereduce) { return sum(values); }

With regard to more advanced views: you'd be surprised how far you can get with the built-ins (_sum, _count, _stats); I've built a non-trivial data backend (on Cloudant, natch) using pretty much entirely _sum reduces. Abusing the reduce with complex calculations doesn't seem to be worth it from either a disk space or a query performance standpoint.
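For anyone wondering what the built-ins amount to, here is a rough sketch of hand-written reduces that behave like _sum, _count and _stats. It is an approximation for illustration, not the server's actual implementation: the names sumReduce, countReduce and statsReduce are made up, and sum() is redefined locally as a stand-in for the helper the CouchDB JavaScript view server provides.

```javascript
// Stand-in for the sum() helper available inside CouchDB view functions.
function sum(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// Roughly "_sum": adding works the same on the first pass and on rereduce.
function sumReduce(keys, values, rereduce) {
  return sum(values);
}

// Roughly "_count": count the rows on the first pass, add partial counts on rereduce.
function countReduce(keys, values, rereduce) {
  return rereduce ? sum(values) : values.length;
}

// Roughly "_stats": turn each number into a partial summary, then merge summaries.
// Assumes values is non-empty.
function statsReduce(keys, values, rereduce) {
  var parts = rereduce ? values : values.map(function (v) {
    return { sum: v, count: 1, min: v, max: v, sumsqr: v * v };
  });
  return parts.reduce(function (a, b) {
    return {
      sum: a.sum + b.sum,
      count: a.count + b.count,
      min: Math.min(a.min, b.min),
      max: Math.max(a.max, b.max),
      sumsqr: a.sumsqr + b.sumsqr
    };
  });
}
```

In a real design document you would just write "_sum", "_count" or "_stats" as the reduce value rather than shipping code like this.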

mlmilleratmit · on Jan 14, 2011

Plenty of reduce queries, you just didn't need to write the code ;) Good point, though, good material for the next round. I'm open to suggestions if you have an interesting data set or question.

moultano · on Jan 14, 2011

Here's a great example of something that is most natural in mapreduce: http://www.danvk.org/wp/2007-04-06/nebulabrot/

aheilbut · on Jan 14, 2011

Terminology is so badly abused in this article that it almost reads like a parody.

Also, the tasks could have been accomplished in 3 or 4 lines of SQL.

tdavis · on Jan 14, 2011

In what way is terminology badly abused? There isn't a single use of the word "cloud" in the entire article and the only "buzzword" terminology used appears to be used accurately and sparingly. If that isn't your contention, why not explain otherwise? How does the abuse/misuse of terminology detract from the quality of the article or make it inaccurate?

Further, your argument that the same could be accomplished in "3 or 4 lines of SQL" is a straw man. The article never claimed the specific task was shorter/easier to do using MapReduce and Cloudant; the author made an example based on common use cases—one that didn't require architecting a full requirements specification for when one would find Cloudant/MapReduce/Non-relational databases superior to a few lines of SQL for various values of "superior".

Your comment is vague, largely irrelevant, and completely useless regardless of its accuracy since you provided absolutely no arguments to back up your claims. The fact that it has even two points is disheartening.

aheilbut · on Jan 14, 2011

The phrases that irked me were:

"I have yet to meet a database that isn’t a key/value store"

Dimensions that are "somewhat orthogonal"

"It suffices to say that MapReduce is all about giving programmers an efficient way to consume data without needing to know how or where it is actually stored."

I don't think that definition suffices, and it misses (or buries in 'efficient') the rather central point that MapReduce is a programming model for distributing computation.

I'm all for non-relational databases where they're appropriate, and Cloudant sounds like it is doing great things. But I think that there's a risk in presenting toy examples in a way that seems to sell them as the solution to common use cases that really could be solved more easily with old-school tools.

bitdiddle · on Jan 14, 2011

I suppose "somewhat orthogonal" is like being "almost pregnant"; perhaps the author was just speaking loosely here.

I agree that the examples were toy ones, but it's precisely these simple things like projections that we see folks struggling with when coming from the relational world.

The answer to "can't you just do this in SQL?" is always "yes, you can". Things change: non-relational dbs were around before relational ones and are now returning for a variety of reasons.

I'd like to see some follow-on posts that delve into the subtleties of rereduce, another real pain point for new users.
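To make the rereduce pain point concrete, here is a minimal sketch of a reduce that supports computing an average. The catch is that on rereduce the values are outputs of earlier reduce calls, not raw emitted values, so the function has to carry a total and a count instead of returning the mean directly. The name averageReduce and the { total, count } shape are made up for illustration.

```javascript
// Reduce that keeps enough state to compute a mean later.
function averageReduce(keys, values, rereduce) {
  var total = 0;
  var count = 0;
  if (rereduce) {
    // values are partial results returned by earlier calls of this function
    values.forEach(function (p) {
      total += p.total;
      count += p.count;
    });
  } else {
    // values are the raw numbers emitted by the map function
    values.forEach(function (v) { total += v; });
    count = values.length;
  }
  return { total: total, count: count };
}

// Simulate two partial reduces followed by a rereduce.
var partA = averageReduce(null, [1, 2, 3], false);      // { total: 6, count: 3 }
var partB = averageReduce(null, [10], false);           // { total: 10, count: 1 }
var combined = averageReduce(null, [partA, partB], true);
var mean = combined.total / combined.count;             // 16 / 4 = 4
```

Averaging the two partial means instead, (2 + 10) / 2, would give 6, which is the kind of mistake a reduce that ignores the rereduce flag tends to make.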

One thing that is not emphasized enough in my mind is the flexibility that a schema-less database such as BigCouch gives you. Consider the trivial schema one would construct to support the 3 or 4 lines of SQL required for these toy examples and then consider how that schema might evolve as needs for different queries change, as the data grows, as different apps with different O-R mapping issues are brought into play and so on.

A schema-less approach does push more of the complexity into the app layer for sure, but it allows the schema to evolve more naturally. After all, schema-less doesn't mean the schema really goes away conceptually.
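As a small illustration of the schema living in the app layer, here is a sketch of a map function that copes with documents written at different stages of an evolving schema. The field names tag (older documents, a single string) and tags (newer documents, an array) are hypothetical, and emit() is stubbed out so the snippet runs outside the database.

```javascript
// Stub for the emit() function that the view server normally provides.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function tolerant of old and new document shapes (hypothetical fields).
function map(doc) {
  var tags = Array.isArray(doc.tags) ? doc.tags
    : typeof doc.tag === 'string' ? [doc.tag]
    : [];                       // documents without tags emit nothing
  tags.forEach(function (t) { emit(t, 1); });
}

map({ _id: 'old', tag: 'couchdb' });
map({ _id: 'new', tags: ['couchdb', 'mapreduce'] });
map({ _id: 'untagged', title: 'no tags here' });
```

Paired with a "_sum" reduce, the emitted rows give a per-tag count without migrating the older documents first.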

I agree also that Cloudant is doing great things, that's why I joined the team, that and the free coffee :) Thanks for the feedback.

mlmilleratmit · on Jan 14, 2011

Hi Aheilbut, apologies if I over-simplified. Perhaps I also undersold scalability -- that's a key point of this type of approach, but non-obvious on such a small data example.

kordless · on Jan 14, 2011

Your people page on your website isn't loading for me in Chrome on OSX, just FYI.

mlmilleratmit · on Jan 14, 2011

Whoops! Will fix.

js4all · on Jan 14, 2011

Thanks for the article. It is like a continuation from the NoSQL tapes. When I saw the video, I wondered how to use the built-in reduce functions. Now it is clear.

I also like the use of json.tool to format the output at the command line.