
The history of "higher level" instructions isn't good. The DEC VAX had an assembly language intended to make life easier for assembly programmers, but it slowed the machine down. The Intel iAPX 432 had lots of bells and whistles, but was really slow. The RISC machines with lots of registers turned out not to be all that useful, and too much register saving and restoring was required. RISC is a win until you want to go superscalar and have more than one instruction per clock. Then it's a lose. Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

A useful near-term feature would be zero-cost hardware exceptions on integer overflow. This is an error in both Java and Rust, and tends to be turned off at compile time because it has a performance penalty. The problem is that people will want to be able to unwind and recover, which means exact exceptions and a lot of compiler support for them.

If you could figure out how to do zero-cost subscript checking, that would be a step forward. That check needs additional info about bounds, which usually means a delay.
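
To make the cost concrete, here's a minimal Rust sketch of what software subscript checking looks like today (the function name and data are purely illustrative; the point is the compare-and-branch on every access):

    // Illustrative only: where today's software bounds check lives.
    // A zero-cost hardware scheme would aim to make the length compare free.
    fn sum_first_n(xs: &[u32], n: usize) -> u32 {
        let mut total = 0u32;
        for i in 0..n {
            // xs[i] compiles to roughly: compare i with xs.len(), branch to a
            // panic path if out of range, then load. That compare-and-branch
            // is the "delay" mentioned above unless the compiler can hoist it.
            total = total.wrapping_add(xs[i]);
        }
        total
    }

    fn main() {
        let data = vec![1u32, 2, 3, 4];
        println!("{}", sum_first_n(&data, 3)); // prints 6
        // sum_first_n(&data, 10) would panic: index out of bounds
    }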

I used to be a fan of schemes for safely calling from one address space to another. i386 almost has this, with call gates, which don't quite do enough to be useful. A few machines have had hardware context switching, but that hasn't been a big performance improvement. All that has to be tightly integrated with the OS or it's a lose. It's an enhancement to Plan 9, not anything anybody uses.

The same is true of fancy schemes for inter-CPU communication, but that probably needs more attention. Like it or not, we have to figure out what to do with large numbers of non-shared-memory CPUs. Some way to set up memory-safe message passing between non-shared-memory CPUs without involving the OS after setup would be useful.

An IOMMU that allows drivers in user space with minimal performance degradation is a good thing. Those exist.



> Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

I've been interested in Forth (and related stack) processors for a while, and my armchair observations over a few months suggest that the (much-vaunted) performance gains associated with such processor designs are apparently not straightforward to realize or take advantage of in practice.

I remember reading (unfortunately I'm not sure where right now) that the GA144 was built at a time when 18-bit memory was the trending novelty, and that it's not really a perfect processor design. I'm still fascinated by it though (sore lack of on-chip memory notwithstanding).

What sort of scale are you referring to when you say "superscalar"? 144 processors? 1000? Do stack-based architectures remain a not-especially-practical-or-competitive novelty, or are they worth pursuing outside of CompSci?

(FWIW, everything else you've written is equally interesting, but slightly over my current experience level)


Pure Forth machines were interesting when the CPU clock and the memory ran at about the same speed, and the number of gates was very limited. Chuck Moore's original Forth machine had a main memory, a data stack memory, and a return stack memory, each of which was cycled on each clock. It took only about 4K gates to make a CPU that way.

Today, the speed ratio between CPU and main memory is several orders of magnitude. The main problem in CPU design today is dealing with that huge mismatch. That's why there are so many levels of cache now, and vast amounts of cleverness and complexity to try to make memory accesses fewer, but wider.

The next step is massive parallelism. GPUs are today's best examples. Dedicated deep learning chips should be available soon, if they're not already. That problem can be expressed as a small inner loop made massively parallel.


> the speed ratio between CPU and main memory is several orders of magnitude

What if you compare scratchpad SRAM and an energy-efficient CPU?


Compare in what way? E.g. the L1 cache is SRAM running at core speed with low latency (4 cycles for Haswell).


A factor of 4 instead of orders of magnitude means a Forth machine might still be worthwhile. No?


Where's the Forth machine getting its operands from?

Sure, if you constrain your programs to use a tiny area of memory you might be able to achieve theoretical speed, but what workloads can you actually run that way?

If you were to write a browser in Forth, presumably it would still have to store all its DOM in DRAM?


You probably want to parallelize the browser for energy efficiency. Then you can distribute the DOM across the scratchpads of hundreds of cores, maybe?


So each core has its own JS engine, or when you iterate across the DOM with a selector you have to query across all the nodes? This doesn't sound great.

(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and they're a right pain to program for)


Superscalar is not about the number of processors, but a single core's ability to run multiple instructions in parallel. The most well known example of this is SIMD.


> to run multiple instructions in parallel.

> The most well known example of this is SIMD.

SIMD == Single Instruction (Multiple Data). Correct explanation, unsuitable illustration.


SIMD is an example of superscalar design. It is running multiple instructions in parallel; they all just happen to be the same instruction. I could have said MIMD, but that is not a term that is well known.


I wouldn't call SIMD superscalar. The complexity of a superscalar design is in being able to track multiple instructions, their dependencies, and their out-of-order completion [1]. Classical SIMD machines run every lane in lockstep.

[1] not OoO issue, that would be a proper Out Of Order CPU.


It's a classic example of how much the definition matters. Though, if we define superscalar to mean the instructions can run independently (i.e. not in lockstep), then I agree that a single SIMD unit is not superscalar. But a design with 2 SIMD units that operate on 2 different data streams independently would be a superscalar design.


I'm no expert, but I think my definition is what historically has been considered superscalar.

Yes, the design you described is definitely superscalar, but the fact that the two streams are SIMD is incidental.


I cannot call myself an expert, but I do have some experience in the domain. The classic example of ILP in a superscalar design is:

   a = b + c
   d = e + f
   g = a + d
Where calculation of a and d is executed in parallel.

Wikipedia is not entirely clear either, but the entire page gives the impression that a superscalar design should send instructions to multiple execution units in parallel, in which case SIMD would be a single execution unit.


> Where calculation of a and d is executed in parallel.

Why not try to reason from the assembler/machine-code standpoint?

The parallel calculation above could be done in different ways:

a) the compiler would emit two "scalar" ADD instructions following one another (allocating registers so that independent execution is possible).

b) the compiler would coalesce both additions b+c and e+f into one vector operation (let's assume the data layout makes such an optimization useful) and emit only one "vector" (SIMD) instruction.

In case a), the two scalar instructions would be fetched "sequentially" by the prefetch unit, but executed in parallel, in two separate instances of the adder ==> "superscalar".

In case b), the vector operation would not be paired with another instruction from the computation you mentioned above.
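
As a hedged sketch of the two shapes in Rust (names and data layout are made up for illustration; the actual instructions depend on the compiler and target):

    // Case a): two independent scalar additions. The compiler typically emits
    // two separate ADD instructions; a superscalar core can issue both in the
    // same cycle on two different ALUs, then add the results.
    fn case_a(b: i32, c: i32, e: i32, f: i32) -> i32 {
        let a = b + c;
        let d = e + f;
        a + d
    }

    // Case b): the same additions coalesced into one conceptual vector add,
    // assuming a data layout that makes it useful. x holds [b, e], y holds
    // [c, f]; an autovectorizer may turn the lane-wise sum into one SIMD ADD.
    fn case_b(x: [i32; 2], y: [i32; 2]) -> i32 {
        let sums = [x[0] + y[0], x[1] + y[1]]; // ideally one vector operation
        sums[0] + sums[1]
    }

    fn main() {
        // Both compute g = 10 for b=1, c=2, e=3, f=4.
        assert_eq!(case_a(1, 2, 3, 4), case_b([1, 3], [2, 4]));
    }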

EDIT: corrected typo in operands of the coalesced additions.


> RISC is a win until you want to go superscalar and have more than one instruction per clock.

Uh? ARM, SPARC, POWER are doing just fine in the superscalar domain (heck, POWER8 is 8-wide!). I don't think they have any particular advantage with regard to CISCy x86 (other than a simpler, more scalable decoder), but they don't have any disadvantage either.


You're right. RISC is more suitable for superscalar processing. According to https://en.wikipedia.org/wiki/Superscalar_processor#History, RISC microprocessors were the first ones to implement superscalar execution.


> A useful near-term feature would be zero-cost hardware exceptions on integer overflow.

I see this mentioned regularly. Has anybody actually measured that performance penalty? I'd really like to see the benchmarks and the code samples (both the high-level code and the machine code). At the machine-code level it's just checking a single flag after any operation that could have overflowed, and the extra exception code only executes if an overflow actually occurs. And nobody should be writing execution paths where the exceptions happen more often than not, right? If there's a need for an explicit check (as when implementing bignum routines), there should be a real language feature for that; that is, something the language designers should do, not the CPU designers.
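
For reference, a hedged sketch of what the software-level options look like in Rust today (on x86-64 a checked add typically compiles to an ADD followed by a conditional jump on the overflow flag; the function names here are just for illustration):

    // Where the per-operation check currently lives, in software.
    fn add_checked(a: i32, b: i32) -> Option<i32> {
        // Roughly: ADD, then branch on the overflow flag to produce a None.
        a.checked_add(b)
    }

    fn add_wrapping(a: i32, b: i32) -> i32 {
        // No check at all; this is what "turned off" effectively means.
        a.wrapping_add(b)
    }

    fn main() {
        assert_eq!(add_checked(2, 3), Some(5));
        assert_eq!(add_checked(i32::MAX, 1), None);      // overflow detected
        assert_eq!(add_wrapping(i32::MAX, 1), i32::MIN); // silently wraps
    }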


The Rust people did; it then became a discussion about sufficiently smart compilers.

I'd imagine that from a compiler engineering standpoint it's pretty hard to make this performance neutral. It makes all the arithmetic loops into non-leaf nodes in the call graph and so has second-order effects in optimization passes.


If your parameters have known ranges, you could elide a lot of checks.

You're not going to overflow multiplying an 8-bit number by a 16-bit number, for instance.
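
A small sketch of that kind of range reasoning in Rust (assuming the product is held in a 32-bit result; the function is hypothetical):

    // Operand ranges are known, so no overflow check is needed at all:
    // 255 * 65535 = 16_711_425, which always fits in a u32.
    fn scale(pixel: u8, gain: u16) -> u32 {
        u32::from(pixel) * u32::from(gain)
    }

    fn main() {
        assert_eq!(scale(255, 65535), 16_711_425); // worst case still fits
    }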


Another thing about the 432 ISA: it was dense, with bit-aligned variable-length instructions ranging from 6 bits to hundreds of bits. This solved a 1970s problem.

A very successful high-level architecture is the IBM System/38. Not as sexy as the 432, but they sold a lot of systems.



