
One of the largest challenges in building an exascale cluster is communication. Computing power increases at a higher rate than memory throughput does, and memory throughput increases faster than communication infrastructure advances.

Many argue that an exascale computer can only be cost efficient if the communication capabilities scale highly sublinearly with the computation done in the subsystems [1]. In particular, you can't move the data, and new algorithms are needed that can deal with data that is arbitrarily distributed. This is quite challenging, and unfortunately the theoretical computer science community seems to have decided that distributed-memory algorithms were covered in the 90s and are not worth its time. Yet this ignores the progress that has been made in other models of computation since then, and many algorithmic improvements of the last decades are not applicable. It is high time to develop communication-efficient algorithms for the basic "toolbox".

I guess what I'm trying to say is that you can't just throw MapReduce at an Exascale machine and expect it to perform well. Instead, you need an environment that is rich in primitives that have been implemented in a communication-efficient way. It's faster and cheaper to spend a little more effort on local communication if that allows for reduced communication volume (and/or the number of connections that need to be established!).
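To make the "number of connections" point concrete, here's a toy sketch (purely illustrative, function names are mine) comparing a naive all-to-root reduction with a binomial-tree reduction, the kind of communication-efficient primitive a good collective library provides:

```python
import math

def naive_reduce_links(n):
    # Every non-root node opens a connection straight to node 0:
    # the root must handle n - 1 incoming connections.
    return n - 1

def tree_reduce_rounds(n):
    # Binomial-tree reduction: in each round, half of the remaining
    # nodes send their partial result to a partner and drop out.
    # The root only ever talks to ceil(log2 n) partners.
    return math.ceil(math.log2(n))

n = 1 << 20  # a million nodes
print(naive_reduce_links(n))   # 1048575 connections at the root
print(tree_reduce_rounds(n))   # 20 partners at the root
```

Same result, but the hot spot at the root shrinks from a million connections to twenty, which is exactly the kind of trade a communication-efficient toolbox makes for you.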

The issue I have with the MapReduce approach is that it doesn't particularly care about data locality. Thus it is very hard to achieve communication volume sublinear in the input size, which is absolutely deadly in an exascale setting.
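As a toy illustration of why locality matters for volume (a made-up word-count-style example, not any particular framework's API): pre-aggregating locally before the shuffle makes the data shipped proportional to the number of distinct keys rather than the input size.

```python
from collections import Counter

def shuffle_volume(partitions, combine):
    # Count the (key, count) records each mapper ships across
    # the network to the reducers.
    sent = 0
    for part in partitions:
        if combine:
            sent += len(Counter(part))   # one record per distinct key
        else:
            sent += len(part)            # one record per input item
    return sent

# 4 mappers, each holding 1000 items drawn from only 10 distinct keys.
parts = [[f"key{i % 10}" for i in range(1000)] for _ in range(4)]
print(shuffle_volume(parts, combine=False))  # 4000 records shipped
print(shuffle_volume(parts, combine=True))   # 40 records shipped
```

If the framework doesn't exploit this kind of structure, communication stays linear in the input, and at exascale that's the whole ballgame.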

I also understand the frustration with MPI, it is a very low-level API focused on data movement. It can be rather frustrating to use, but there do exist tools to make it more fun (Boost.MPI with C++11/14 is an excellent example). That said, with a well-engineered set of algorithmic tools, ideally you wouldn't need to use low-level MPI calls at all. However, MPI still remains a useful tool to implement these things.

Exascale computing requires us to rethink a lot of things.

[1] http://www.ipdps.org/ipdps2013/SBorkar_IPDPS_May_2013.pdf Shekhar Borkar (Intel), Keynote presentation at the 2013 IEEE International Parallel & Distributed Processing Symposium



You are incorrect in saying MapReduce isn't locality-aware: Hadoop supports machine-, rack-, row-, and cluster-level locality in its scheduling.

Also, most modern Internet HPC systems dedicate a ton of design and equipment to achieving very high bisection bandwidth, which lets the locality restrictions be relaxed.


Well, my point exactly: that "ton of design and equipment" doesn't scale particularly well, as its cost grows highly super-linearly with the computing power. You need to reduce communication volume to be cost-effective at exascale.


This isn't true. You can build awesome high-bandwidth clusters extremely cheaply. It takes an understanding of Ethernet silicon and TCP implementations, but it can be done. Amazon, for example, recognized that superlinear cost scaling was killing their profits and invested in building newer systems with better designs that solve these problems.

See also this paper http://research.google.com/pubs/pub36740.html

The main challenge is that because these are built with multistage routers, they have fairly high latency. So much of the effort in modern HPC systems used for Hadoopy workloads goes to latency hiding.


You say "awesome high bandwidth", but at 1 Gbit/s per node you're still a long way from an InfiniBand 4X FDR interconnect (54 Gbit/s and sub-microsecond latency, significantly lower than your network's). As you write, these are built with multistage routers, which add even more latency. So in effect they have reduced (but still high) communication capabilities to keep costs manageable, just as I said.


1 Gbit/sec, if you look at the Jupiter paper, was the host speed in 2004. The Jupiter system works with 10G and 40G interfaces on the host.

What's important to recognize is that you simply cannot buy InfiniBand switches that let you connect a large number (10K+) of hosts together. The vendors won't sell you this, they won't do the R&D to make it, and it would cost a fortune anyway.

This is a deliberate choice: for most Internet work, it's better to have really fat bisection bandwidth and non-blocking fabrics, and latency is sacrificed because building a high-radix crossbar that supports all that is prohibitively expensive.
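For a rough sense of why multistage fabrics win on cost: a k-ary fat tree (a folded Clos, the topology family behind designs like Jupiter) delivers full bisection bandwidth using nothing but identical commodity k-port switches, at the price of extra switching stages (and hence latency). The standard counting, sketched here:

```python
def fat_tree(k):
    # k-ary fat tree: k pods, each with k/2 edge and k/2 aggregation
    # switches, plus (k/2)^2 core switches; each edge switch serves
    # k/2 hosts. All switches are identical k-port boxes.
    assert k % 2 == 0
    hosts = k ** 3 // 4                 # k pods * (k/2 edge) * (k/2 ports)
    switches = k * k + (k // 2) ** 2    # pod switches + core switches
    return hosts, switches

hosts, switches = fat_tree(48)          # commodity 48-port silicon
print(hosts)     # 27648 hosts at full bisection bandwidth
print(switches)  # 2880 identical commodity switches
```

No single vendor crossbar comes close to 27K non-blocking ports, which is exactly the point about not being able to buy such an InfiniBand switch.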

Unless you have an algorithm that absolutely requires low latency and simply cannot be fixed, you are almost always better off building a cheaper, fatter fabric and hiring engineers who know how to write latency-tolerant applications.
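What "latency tolerant" buys you can be seen in a back-of-the-envelope pipelining model (my own toy numbers): with double buffering, the transfer for step i+1 overlaps the computation for step i, so the per-step cost is max(compute, comm) instead of compute + comm.

```python
def total_time(steps, compute, comm, overlap):
    # Toy cost model for a pipelined loop of `steps` iterations.
    if overlap:
        # Pay for the first transfer, then the slower of the two
        # activities per step, since they run concurrently.
        return comm + steps * max(compute, comm)
    return steps * (compute + comm)

print(total_time(100, compute=10, comm=8, overlap=False))  # 1800
print(total_time(100, compute=10, comm=8, overlap=True))   # 1008
```

As long as communication is no slower than computation, its latency is almost entirely hidden, which is why the fatter, higher-latency fabric is usually the right trade.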


Here we go, the Jupiter paper is now published: http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....



