mrjob: Yelp open sources its Elastic MapReduce framework for Python (yelp.com)
101 points by pretz on Oct 29, 2010 | 13 comments


This past week I started working on a Python 3 port of this, mostly to learn. No EMR unfortunately, but Hadoop should be possible. I just got back from a trip, so it's still not very far along: it only runs the "local" version so far, but it should get a bit farther next week.

I can confirm that it is a great way to learn about MapReduce.

Link: http://github.com/irskep/mrjob/tree/py3k

I will likely totally restart the py3k port now that I know what I am doing a bit better. I've been writing Python 3 for about, oh, two weeks.


Amazon EMR is an amazing value proposition for virtually any research need, and it's very cool to see wrapper frameworks targeting it directly. Still, for anyone managing their own compute clusters and wanting to do MR in Python, I'd suggest checking out Disco.

Disco (http://discoproject.org) is a really elegant MR framework implemented in Erlang and Python, with additional support for jobs in C and Java. I've used it for a little over a year and am convinced it is the superior MR platform (Hadoop's terasort victories notwithstanding). New features are being integrated very quickly, the core platform is rock solid, management is simple, and it's extremely flexible.


This was a game changer for us -- instead of everyone contending for the Hadoop cluster, each developer has their own personal arsenal of Hadoop clusters. Huge win.


derwiki forgot to add that we need help wrangling our boatloads of data. We're hiring engineers:

http://www.yelp.com/careers


Thanks for showing this to us at CWRU! It has already given me hours of fun. (See my top-level post about a py3k port.)


But then don't you have a lot of CPUs going unused, because you are partitioning your resources?

Is it really difficult to automatically allocate shared resources?


We're allocating EMR clusters as needed. When they're no longer needed, they go away. Waste is minimal.


On this note, does anyone know a good tutorial on MapReduce for experienced programmers? Basically, I want to learn how to frame advanced problems in terms of MR. I am particularly interested in expressing my discrete event simulation in terms of MR.


You want exercises, not a walk-through.

The thing with using higher-order functions isn't learning the definitions (which are rather simple, really), but figuring out how to use those tools.

And for that you need practice. Start by describing trivial problems (word count, for example), and advance to more complicated ones. Any good book on functional programming will have lots of exercises (http://mitpress.mit.edu/sicp/ is probably the most famous, but is surely not the only one).

I grokked functional programming by learning Calculus of variations, but YMMV.
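
For anyone who wants a starting point for the word count exercise mentioned above, here is a rough sketch in plain Python. It is not mrjob's API or Hadoop's; the names mapper, shuffle, reducer, and word_count are made up for illustration, and shuffle just stands in for the grouping a real framework does between the map and reduce steps.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key (hypothetical stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce step: collapse all the counts for one word into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in-process over an iterable of lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Once this feels natural, a reasonable next exercise is to change only mapper and reducer (say, to compute the longest word per starting letter) and leave shuffle alone, since that is roughly the part a framework would own.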


Nice to see one more production use of Cython.


mrjob doesn’t contain any Cython. The author was just stating it was a challenge getting Yelp’s codebase (which contains some Cython) running on EMR.


I understood that from the article. But in light of the recent discussion on Cython, I thought it was interesting to note a "2.0" company like Yelp using Cython.


So does most of your data live in S3 in JSON format?



