
One of the largest challenges in building an exascale cluster is communication. Computing power increases at a higher rate than memory throughput does, and memory throughput increases faster than communication infrastructure advances.

Many argue that an exascale computer can only be cost efficient if the communication capabilities scale highly sublinearly with the computation done in the subsystems [1]. In particular, you can't move the data, and new algorithms are needed that can deal with data that is arbitrarily distributed. This is quite challenging, and unfortunately the theoretical computer science community seems to have decided that distributed-memory algorithms were covered in the 90s and are not worth its time. Yet this ignores the progress that has been made in other models of computation since then, and many algorithmic improvements of the last decades are not applicable. It is high time to develop communication-efficient algorithms for the basic "toolbox".

I guess what I'm trying to say is that you can't just throw MapReduce at an Exascale machine and expect it to perform well. Instead, you need an environment that is rich in primitives that have been implemented in a communication-efficient way. It's faster and cheaper to spend a little more effort on local communication if that allows for reduced communication volume (and/or the number of connections that need to be established!).
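To make the "number of connections" point concrete, here's a toy sketch (purely illustrative, function names are mine) comparing a naive all-to-root reduction with a binomial-tree reduction, the kind of communication-efficient primitive a good collective library provides:

```python
import math

def naive_reduce_links(n):
    # Every non-root node opens a connection straight to node 0:
    # the root must handle n - 1 incoming connections.
    return n - 1

def tree_reduce_rounds(n):
    # Binomial-tree reduction: in each round, half of the remaining
    # nodes send their partial result to a partner and drop out.
    # The root only ever talks to ceil(log2 n) partners.
    return math.ceil(math.log2(n))

n = 1 << 20  # a million nodes
print(naive_reduce_links(n))   # 1048575 connections at the root
print(tree_reduce_rounds(n))   # 20 partners at the root
```

Same result, but the hot spot at the root shrinks from a million connections to twenty, which is exactly the kind of trade a communication-efficient toolbox makes for you.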

The issue I have with the MapReduce approach is that it doesn't particularly care about data locality. Thus it is very hard to achieve communication volume sublinear in the input size, which is absolutely deadly in an exascale setting.
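As a toy illustration of why locality matters for volume (a made-up word-count-style example, not any particular framework's API): pre-aggregating locally before the shuffle makes the data shipped proportional to the number of distinct keys rather than the input size.

```python
from collections import Counter

def shuffle_volume(partitions, combine):
    # Count the (key, count) records each mapper ships across
    # the network to the reducers.
    sent = 0
    for part in partitions:
        if combine:
            sent += len(Counter(part))   # one record per distinct key
        else:
            sent += len(part)            # one record per input item
    return sent

# 4 mappers, each holding 1000 items drawn from only 10 distinct keys.
parts = [[f"key{i % 10}" for i in range(1000)] for _ in range(4)]
print(shuffle_volume(parts, combine=False))  # 4000 records shipped
print(shuffle_volume(parts, combine=True))   # 40 records shipped
```

If the framework doesn't exploit this kind of structure, communication stays linear in the input, and at exascale that's the whole ballgame.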

I also understand the frustration with MPI, it is a very low-level API focused on data movement. It can be rather frustrating to use, but there do exist tools to make it more fun (Boost.MPI with C++11/14 is an excellent example). That said, with a well-engineered set of algorithmic tools, ideally you wouldn't need to use low-level MPI calls at all. However, MPI still remains a useful tool to implement these things.

Exascale computing requires us to rethink a lot of things.

[1] http://www.ipdps.org/ipdps2013/SBorkar_IPDPS_May_2013.pdf Shekhar Borkar (Intel), Keynote presentation at the 2013 IEEE International Parallel & Distributed Processing Symposium



You are incorrect in saying MapReduce isn't locality-aware: Hadoop supports machine-, rack-, row-, and cluster-level locality in its scheduling.

Also, most modern Internet HPC systems dedicate a ton of design and equipment to achieving very high bisection bandwidth, which lets the locality restrictions be relaxed.


Well, my point exactly: that "ton of design and equipment" doesn't scale particularly well, as its cost grows highly super-linearly with the computing power. You need to reduce communication volume to be cost-effective at exascale.


This isn't true. You can build awesome high-bandwidth clusters extremely cheaply. It takes an understanding of Ethernet silicon and TCP implementations, but it can be done. Amazon, for example, recognized that superlinear cost scaling was killing their profits and invested in building newer systems with better designs that solve these problems.

See also this paper http://research.google.com/pubs/pub36740.html

The main challenge is that because these are built with multistage routers, they have fairly high latency. So much of the effort in modern HPC systems used for Hadoopy workloads goes to latency hiding.


You say "awesome high bandwidth", but at 1 Gbit/s per node you're still a long way from an InfiniBand 4X FDR interconnect (54 Gbit/s and sub-microsecond latency, significantly lower than your network's). As you write, these are built with multistage routers, which add even more latency. So in effect they have reduced (but still high) communication capabilities to keep costs manageable, just as I said.


1 Gbit/sec, if you look at the Jupiter paper, was the host speed in 2004. The Jupiter system works with 10G and 40G interfaces on the host.

What's important to recognize is that you simply cannot buy InfiniBand switches that let you connect a large number (10K+) of hosts together. The vendors won't sell you this, they won't do the R&D to make it, and it would cost a fortune anyway.

This is a deliberate choice: for most Internet work, it's better to have really fat bisection bandwidth and non-blocking fabrics, and latency is sacrificed because building a high-radix crossbar that supports all that is prohibitively expensive.
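For a rough sense of why multistage fabrics win on cost: a k-ary fat tree (a folded Clos, the topology family behind designs like Jupiter) delivers full bisection bandwidth using nothing but identical commodity k-port switches, at the price of extra switching stages (and hence latency). The standard counting, sketched here:

```python
def fat_tree(k):
    # k-ary fat tree: k pods, each with k/2 edge and k/2 aggregation
    # switches, plus (k/2)^2 core switches; each edge switch serves
    # k/2 hosts. All switches are identical k-port boxes.
    assert k % 2 == 0
    hosts = k ** 3 // 4                 # k pods * (k/2 edge) * (k/2 ports)
    switches = k * k + (k // 2) ** 2    # pod switches + core switches
    return hosts, switches

hosts, switches = fat_tree(48)          # commodity 48-port silicon
print(hosts)     # 27648 hosts at full bisection bandwidth
print(switches)  # 2880 identical commodity switches
```

No single vendor crossbar comes close to 27K non-blocking ports, which is exactly the point about not being able to buy such an InfiniBand switch.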

Unless you have an algorithm that absolutely requires low latency and simply cannot be fixed, you are almost always better off building a cheaper, fatter fabric and hiring engineers who know how to write latency-tolerant applications.
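What "latency tolerant" buys you can be seen in a back-of-the-envelope pipelining model (my own toy numbers): with double buffering, the transfer for step i+1 overlaps the computation for step i, so the per-step cost is max(compute, comm) instead of compute + comm.

```python
def total_time(steps, compute, comm, overlap):
    # Toy cost model for a pipelined loop of `steps` iterations.
    if overlap:
        # Pay for the first transfer, then the slower of the two
        # activities per step, since they run concurrently.
        return comm + steps * max(compute, comm)
    return steps * (compute + comm)

print(total_time(100, compute=10, comm=8, overlap=False))  # 1800
print(total_time(100, compute=10, comm=8, overlap=True))   # 1008
```

As long as communication is no slower than computation, its latency is almost entirely hidden, which is why the fatter, higher-latency fabric is usually the right trade.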


Here we go, the Jupiter paper is now published: http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....



