Putting huge numbers of ordinary CPUs on a chip only helps until the memory bus ...

da_chicken · on July 16, 2015

Die size isn't really about packing more cores into a CPU. it's about packing more CPUs on a wafer. The material cost is the largely the same. Smaller die sizes are about your quality control with masks and deposits.

Given a 300mm diameter wafer and a 10mm square CPU die at 14nm you get about 700 CPUs per wafer (Pi * diameter * diameter) / (4 * die * die) (https://en.wikipedia.org/wiki/Wafer_(electronics)#Analytical...). If you change that 10mm CPU to 10nm, you'll have about a 7.2mm square CPU die. That's nearly 1400 CPUs per wafer. Even if you get 100% of CPUs from the 14nm die, once your yield hits 50% on 10nm the 10nm process produces more for the same cost. Now you can either reap the profits, or reduce costs.

orlp · on July 16, 2015

I'm no expert, but isn't there another technique in play as well?

As you get smaller you can duplicate CPU components to make your chip fabrication more robust against errors. If a component is faulty on the die, the CPU can be patched to use the other (identical) component.

Animats · on July 16, 2015

The Cell did that. They really had 9 Cell processors on the chip, but only 8 of them were enabled on any given PS3.

Patching at a lower level has been tried, but it's usually more trouble in manufacturing than it's worth. There's a long history of workarounds for low yield, but the fab industry has usually been able to fix the fab problems and get the yield up.

Except for memory devices, where patching out bad columns is standard.

daemin · on July 17, 2015

You're off by 1 there, there were 8 SPU's on a Cell chip, 1 was disabled to account for manufacturing defects, and another was reserved for exclusive use of the O. Leaving the developer with 6 SPU's to use.

monocasa · on July 17, 2015

I think he's including the PPE in his count of 9 processors.

TheLoneWolfling · on July 17, 2015

Unfortunately, you end up having to make the interconnects longer to accommodate the additional redundant components, which ends up slowing everything down.

It's a tradeoff that's not always worth it.

monocasa · on July 16, 2015

This is done in quite a few places including GPUs.

bravo22 · on July 16, 2015

That only works for things like on-board SRAM

nickpsecurity · on July 16, 2015

I mainly agree with that sentiment. What do you think about the TOMI tech, though?

http://www.venraytechnology.com/Making%20a%20Frosted%20Cake%...

I think it was a brilliant reframing of the problem: move CPU's to memory with its cost-benefit analysis rather than vice versa (status quo). The design also achieves what some exascale proposals are trying to achieve with R&D in terms of better integrating CPU & memory with lower energy. It's also massively parallel (128 cores) and optimized for big data. Close to your next big thing.

It's main risk right now is that DRAM vendors are more conservative and mass market than most fabs. There's not all this MOSIS, multi-project runs, and so on. Their low-volume cost is currently high (tens of thousands). They might be facing a chicken-and-egg problem in terms of hitting enough volume to get a nice, production deal. I do like their tech and think it has far more potential than what they're doing right now.

minthd · on July 16, 2015

>> It's main risk right now is that DRAM vendors are more conservative and mass market than most fabs.

recently micron released a memory based processor/state-machine architecture called "automata". This might be a good sign that the problem you mention will be solved.

nickpsecurity · on July 17, 2015

That was a really neat processor. That's what happens when hardware vendors look at FSM problems their way instead of support software developer's way (eg C lang). The best bet might for those using memory fab tech to haggle one of the fabs to do MPW's on, say, one production line. Fab as a whole can keep cranking out tons of memory chips, new players can crank out theirs, and any risk is very limited.

The problem has already been solved outside memory fabs several times over. The memory fabs just need to take some steps, themselves. If I were them, I'd push I.P. vendors to follow path of Micron and Venray just to get more fab customers.

ak217 · on July 16, 2015

Except you still need a fast bus for the CPUs to talk to each other and to access shared memory. So for all but the most embarrassingly parallel workloads, you just move the bottleneck from the memory bus to the shared cache bus, do you not?

white-flame · on July 16, 2015

A memory bus has long delays to set up a transfer, is typically only 64 bits wide, and only achieves good bandwidth on large burst operations.

The Venray design allows single-cycle random access to full 4096 bit cache lines, at least as described in the earlier iterations. Contention is far less an issue in this model, for many cores on 1 large memory chip. Multi-chip sticks are then akin to multi-socket motherboards.

nickpsecurity · on July 17, 2015

Good answer.

vardump · on July 16, 2015

Intel CPUs resemble GPUs more and more over time. I think just scatter, GPU style ultra slow (high latency) but wide memory interface and texture lookup is missing in Skylake (Xeon).

Gather was already added in Haswell, although it performs badly so far.

Skylake (Xeon AVX-512) handles 16 float wide vectors (512 bits) and can dual issue per clock, bringing effective rate to 32. That's definitely comparable to modern GPUs.

Wasn't Nvidia WARP just 16 float wide per clock cycle? Or 32? For comparison, high end Nvidia 980 GTX GPU has only 16 of such SIMD execution cores. However, they count those 16 cores as 2048 in their marketing literature.

I do wonder if Intel is planning to unify CPU and GPU in 10 years or less. Things sure seem to be moving that way.

If Intel can add significant amounts of eDRAM in package, x86 CPUs aren't that far from being capable of handling GPU duties as well.

valarauca1 · on July 16, 2015

Vector Instructions != Scalar Instructions

"WARP Scheduler" gives you a hint.

Okay, so how this works is you have a processor that is 16 scalar cores wide. Each scalar core is really just an out of order scheduler, for 32 in-order pipelined, boring, ALU's. These ALU's can each execute the same instruction, together, giving you the illusion that the scalar core is doing vector processing.

The reality is far weirder. I.E.: If you encounter a branch, the scalar processor can, and will execute both branches on different ALU's, and execute the branch statement on another, allowing for a 10 instruction section of code to run in ~3 instructions time. Trying doing that with a vector processor.

Technically in CUDA you can schedule each ALU itself, thus marketing stuff.

Would you like to know more? http://haifux.org/lectures/267/Introduction-to-GPUs.pdf

vardump · on July 17, 2015

> The reality is far weirder. I.E.: If you encounter a branch, the scalar processor can, and will execute both branches on different ALU's

That's not so different than on x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask the results away you don't want. This is typically much faster than having a data dependant, unpredictable branch.

Another way is to SIMD sort data according to criteria to different registers and process them separately. This completely sidesteps having to execute both sides of the branch, although some computational resources are still wasted.

valarauca1 · on July 17, 2015

>That's not so different than on x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask the results away you don't want.

What your talking about is how x86_64 processors can optimize away some branches. Which it does this with the cmov instruction. This has nothing to do with SSE/AVX. Its common to confuse this b/c intel says the branches are executed in parallel (and they often are), just in parallel as the OoO pipeline allows, which is actually quite a few.

Both sides of the branch are pre-computed, then the branch is computed. But its output is sent to a cmov, which just re-assigns a register, instead of jmp into a branch. This avoids pipeline flushes. cmov isn't prefect still costing ~10 cycles, but compared to the ~100 of a pipeline flush its still cheaper.

Provided the same operations are being done on both branches then SSE/AVX can be used. As both branches are just values, and that is literally what vector processors are good at. The chain will end with a cmov.

vardump · on July 18, 2015

It has absolutely nothing to do with CMOV. I'm talking about computing, say, 16 results in parallel in a SIMD register, for both sides of "if"-statement. Then masking unwanted results out. SSE/AVX can simulate "CMOV", but for 128/256/512 bit wide vectors.

To make it even more clear, there's not a single CMOV in my code, anywhere. The data doesn't usually even touch general purpose (scalar) registers, because that'd totally destroy the performance.

What you are talking about is how things were done until 1997-1999 or so. SSE in 1999 and especially SSE2 in 2001 changed radically the way you compute with x86 CPUs.

I'm talking about things like vpcmpw [1] (compare 8/16/32 of 16 bit integers and store mask), vpcompressd (compress floats according to a mask, for example for SIMD "sorting" if and else inputs separately), vpblendmd (blending type combining, this example is for int32), vmovdqu16 (for just selectively moving according to mask).

You can do most operations on 8, 16, 32, 64 unsigned and signed, and of course 32-bit and 64-bit floats. Some restrictions apply especially to 8 and 16 bit operands. When appropriate, it's kind of cool to process 64 bytes in one instruction. :)

[1]: https://software.intel.com/sites/landingpage/IntrinsicsGuide... SSE/AVX instruction and intrinsics guide.

m_mueller · on July 17, 2015

GPUs have evolved with about the same pacing. Nvidia's Kepler architecture has a vector length of 192 (single prec.) per core and up to 15 of these cores on one chip.

The question really is, do you optimize the chip for heavily data parallel problems, saving overhead on schedulers and having a very wide memory bus, or do you optimize for single threaded performance of independent threads and give it some data parallelism (Xeon). As a programmer, when you're actually dealing with data parallel programs, doing so efficiently on a GPU is actually quite a bit easier since you have one less level of parallelism to deal with.

orbifold · on July 17, 2015

Um no 192 = 6 * 32 each streaming multiprocessor operates on warps of size 32, the 6 is the number of different functional units

m_mueller · on July 17, 2015

I think we're mixing up terminologies here. One SMX operates on up to 192 values in parallel (Nvidia calls this 192 "threads" per SMX). Functional units AFAIK is only used in terms of "special functional units" which isn't relevant for this discussion. One SMX has 6 Warp schedulers, but I'm not sure on how independant these can operate. My guess is that branch divergence will only NOP out one whole Warp, but I'm not sure whether the Warps can enter different routines or even kernels (my guess is yes for routines/no for kernels).

orbifold · on July 17, 2015

So the different functional units (this has a specific meaning in hardware design) are 32 wide and indeed if the instructions to be executed can utilize all 6 of them at the same time the smx will operate on 192 values but that wont be the case if you only need to executed a large number of double precision floating point operations.

ak217 · on July 16, 2015

What defines "unusual"? The CPU/GPU split is a distinction without a difference there. NVIDIA and ATI have both been selling massively parallel architectures in their GPUs for most of a decade now, and NVIDIA has great traction in the supercomputing and machine learning space due to its HPC business development, excellent tooling and developer support. Intel is trying to do the same with Xeon Phi, and they're certainly throwing their weight behind it.

Both Intel and NVIDIA are addressing the memory bus bottleneck with chip packages that stack memory chips and shorten/widen the bus. The two are converging to similar designs and I foresee a big fight as they go head to head (remarkable, given NVIDIA's size, but not unprecedented, given how badly the ARM crowd has smoked Intel in mobile).

throwaway2048 · on July 17, 2015

the issue is not so much memory BANDWIDTH as it is an issue of memory latency on heavily branching code.

pmalynin · on July 17, 2015

With CUDA you get 32 banks, all of which can be accessed in parallel.