209 seconds of time on 910 2ghz nodes gives about 3.8e14 instructions, or about 380 instructions per byte (assuming only one of the dual cores on each Xeon was active). That's quite a lot of overhead especially given that they could have most of the sort in the 7TB of available cluster memory.
They were probably hitting their interconnect limits.
They were probably hitting their interconnect limits.