
The history of "higher level" instructions isn't good. The DEC VAX had an assembly language intended to make life easier for assembly programmers, but it slowed the machine down. The Intel iAPX 432 had lots of bells and whistles, but was really slow. The RISC machines with lots of registers turned out not to be all that useful, and too much register saving and restoring was required. RISC is a win until you want to go superscalar and have more than one instruction per clock. Then it's a lose. Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

A useful near-term feature would be zero-cost hardware exceptions on integer overflow. This is an error in both Java and Rust, and tends to be turned off at compile time because it has a performance penalty. The problem is that people will want to be able to unwind and recover, which means exact exceptions and a lot of compiler support for them.

If you could figure out how to do zero-cost subscript checking, that would be a step forward. That check needs additional info about bounds, which usually means a delay.
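
To make the cost concrete, here's a minimal Rust sketch of what software subscript checking looks like today (the function name and data are purely illustrative; the point is the compare-and-branch on every access):

    // Illustrative only: where today's software bounds check lives.
    // A zero-cost hardware scheme would aim to make the length compare free.
    fn sum_first_n(xs: &[u32], n: usize) -> u32 {
        let mut total = 0u32;
        for i in 0..n {
            // xs[i] compiles to roughly: compare i with xs.len(), branch to a
            // panic path if out of range, then load. That compare-and-branch
            // is the "delay" mentioned above unless the compiler can hoist it.
            total = total.wrapping_add(xs[i]);
        }
        total
    }

    fn main() {
        let data = vec![1u32, 2, 3, 4];
        println!("{}", sum_first_n(&data, 3)); // prints 6
        // sum_first_n(&data, 10) would panic: index out of bounds
    }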

I used to be a fan of schemes for safely calling from one address space to another. i386 almost has this, with call gates, which don't quite do enough to be useful. A few machines have had hardware context switching, but that hasn't been a big performance improvement. All that has to be tightly integrated with the OS or it's a lose. It's an enhancement to Plan 9, not anything anybody uses.

The same is true of fancy schemes for inter-CPU communication, but that probably needs more attention. Like it or not, we have to figure out what to do with large numbers of non-shared-memory CPUs. Some way to set up memory-safe message passing between non-shared-memory CPUs without involving the OS after setup would be useful.

An IOMMU that allows drivers in user space with minimal performance degradation is a good thing. Those exist.



> Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

I've been interested in Forth (and related stack) processors for a while, and my armchair observations over a few months suggest that the (much-vaunted) performance gains associated with such processor designs are apparently not straightforward to realize or take advantage of in practice.

I remember reading (unfortunately I'm not sure where right now) that the GA144 was built at a time when 18-bit memory was the trending novelty, and that it's not really a perfect processor design. I'm still fascinated by it though (sore lack of on-chip memory notwithstanding).

What sort of scale are you referring to when you say "superscalar"? 144 processors? 1000? Do stack-based architectures remain a not-especially-practical-or-competitive novelty, or are they worth pursuing outside of CompSci?

(FWIW, everything else you've written is equally interesting, but slightly over my current experience level)


Pure Forth machines were interesting when the CPU clock and the memory ran at about the same speed, and the number of gates was very limited. Chuck Moore's original Forth machine had a main memory, a data stack memory, and a return stack memory, each of which was cycled on each clock. It took only about 4K gates to make a CPU that way.

Today, the speed ratio between CPU and main memory is several orders of magnitude. The main problem in CPU design today is dealing with that huge mismatch. That's why there are so many levels of cache now, and vast amounts of cleverness and complexity to try to make memory accesses fewer, but wider.

The next step is massive parallelism. GPUs are today's best examples. Dedicated deep learning chips should be available soon, if they're not already. That problem can be expressed as a small inner loop made massively parallel.


> the speed ratio between CPU and main memory is several orders of magnitude

What if you compare scratchpad SRAM and an energy-efficient CPU?


Compare in what way? E.g. the L1 cache is SRAM running at core speed with low latency (4 cycles for Haswell).


A factor of 4 instead of orders of magnitude means a Forth machine might still be worthwhile. No?


Where's the Forth machine getting its operands from?

Sure, if you constrain your programs to use a tiny area of memory you might be able to achieve theoretical speed, but what workloads can you actually run that way?

If you were to write a browser in Forth, presumably it would still have to store all its DOM in DRAM?


You probably want to parallelize the browser for energy efficiency. Then you can distribute the DOM across the scratchpads of hundreds of cores, maybe?


So each core has its own JS engine, or when you iterate across the DOM with a selector you have to query across all the nodes? This doesn't sound great.

(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and they're a right pain to program for)


Superscalar is not about the number of processors, but a single core's ability to run multiple instructions in parallel. The most well known example of this is SIMD.


> to run multiple instructions in parallel.

> The most well known example of this is SIMD.

SIMD == Single Instruction (Multiple Data). Correct explanation, unsuitable illustration.


SIMD is an example of superscalar design. It is running multiple instructions in parallel; they all just happen to be the same instruction. I could have said MIMD, but that is not a term that is well known.


I wouldn't call SIMD superscalar. The complexity of a superscalar design is in being able to track multiple instructions, their dependencies, and their out-of-order completion [1]. Classical SIMD machines run every lane in lockstep.

[1] not OoO issue, that would be a proper Out Of Order CPU.


It's a classic example of how much the definition matters. Though, if we define superscalar to mean the instructions can run independently (i.e. not in lockstep), then I agree that a single SIMD unit is not superscalar. But a design with 2 SIMD units that operate on 2 different data streams independently would be a superscalar design.


I'm no expert, but I think my definition is what historically has been considered superscalar.

Yes, the design you described is definitely superscalar, but the fact that the two streams are SIMD is incidental.


I cannot call myself an expert, but I do have some experience in the domain. The classic example of ILP in a superscalar design is:

   a = b + c
   d = e + f
   g = a + d
Where calculation of a and d is executed in parallel.

Wikipedia is not entirely clear either, but the entire page gives the impression that a superscalar design should send instructions to multiple execution units in parallel, in which case SIMD would be a single execution unit.


> Where calculation of a and d is executed in parallel.

Why not try to reason from the assembler/machine-code standpoint?

The parallel calculation above could be done in different ways:

a) the compiler would emit two "scalar" ADD instructions following one another (allocating registers so that independent execution is possible).

b) the compiler would coalesce both additions b+c and e+f into one vector operation (let's assume the data layout makes such an optimization useful) and emit only one "vector" (SIMD) instruction.

In case a), the two scalar instructions would be fetched "sequentially" by the prefetch unit, but executed in parallel, in two separate instances of the adder ==> "superscalar".

In case b), the vector operation would not be paired with another instruction from the computation you mentioned above.
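
As a hedged sketch of the two shapes in Rust (names and data layout are made up for illustration; the actual instructions depend on the compiler and target):

    // Case a): two independent scalar additions. The compiler typically emits
    // two separate ADD instructions; a superscalar core can issue both in the
    // same cycle on two different ALUs, then add the results.
    fn case_a(b: i32, c: i32, e: i32, f: i32) -> i32 {
        let a = b + c;
        let d = e + f;
        a + d
    }

    // Case b): the same additions coalesced into one conceptual vector add,
    // assuming a data layout that makes it useful. x holds [b, e], y holds
    // [c, f]; an autovectorizer may turn the lane-wise sum into one SIMD ADD.
    fn case_b(x: [i32; 2], y: [i32; 2]) -> i32 {
        let sums = [x[0] + y[0], x[1] + y[1]]; // ideally one vector operation
        sums[0] + sums[1]
    }

    fn main() {
        // Both compute g = 10 for b=1, c=2, e=3, f=4.
        assert_eq!(case_a(1, 2, 3, 4), case_b([1, 3], [2, 4]));
    }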

EDIT: corrected typo in operands of the coalesced additions.


> RISC is a win until you want to go superscalar and have more than one instruction per clock.

Uh? ARM, SPARC, POWER are doing just fine in the superscalar domain (heck, POWER8 is 8-wide!). I don't think they have any particular advantage with regard to CISCy x86 (other than a simpler, more scalable decoder), but they don't have any disadvantage either.


You're right. RISC is more suitable for superscalar processing. According to https://en.wikipedia.org/wiki/Superscalar_processor#History, RISC microprocessors were the first ones to implement superscalar execution.


> A useful near-term feature would be zero-cost hardware exceptions on integer overflow.

I see this mentioned regularly. Has anybody actually measured that performance penalty? I'd really like to see the benchmarks and the code samples (both the high-level code and the machine code). At the machine-code level it's just checking a single flag after any operation that could have overflowed, and the extra exception code only executes if an overflow actually occurs. And nobody should be writing execution paths where the exceptions happen more often than not, right? If there's a need for an explicit check (as when implementing bignum routines), there should be a real language feature for that; that is, something the language designers should do, not the CPU designers.
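
For reference, a hedged sketch of what the software-level options look like in Rust today (on x86-64 a checked add typically compiles to an ADD followed by a conditional jump on the overflow flag; the function names here are just for illustration):

    // Where the per-operation check currently lives, in software.
    fn add_checked(a: i32, b: i32) -> Option<i32> {
        // Roughly: ADD, then branch on the overflow flag to produce a None.
        a.checked_add(b)
    }

    fn add_wrapping(a: i32, b: i32) -> i32 {
        // No check at all; this is what "turned off" effectively means.
        a.wrapping_add(b)
    }

    fn main() {
        assert_eq!(add_checked(2, 3), Some(5));
        assert_eq!(add_checked(i32::MAX, 1), None);      // overflow detected
        assert_eq!(add_wrapping(i32::MAX, 1), i32::MIN); // silently wraps
    }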


The Rust people did; it then became a discussion about sufficiently smart compilers.

I'd imagine that from a compiler engineering standpoint it's pretty hard to make this performance neutral. It makes all the arithmetic loops into non-leaf nodes in the call graph and so has second-order effects in optimization passes.


If your parameters have known ranges, you could elide a lot of checks.

You're not going to overflow multiplying an 8-bit number by a 16-bit number, for instance.
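
A small sketch of that kind of range reasoning in Rust (assuming the product is held in a 32-bit result; the function is hypothetical):

    // Operand ranges are known, so no overflow check is needed at all:
    // 255 * 65535 = 16_711_425, which always fits in a u32.
    fn scale(pixel: u8, gain: u16) -> u32 {
        u32::from(pixel) * u32::from(gain)
    }

    fn main() {
        assert_eq!(scale(255, 65535), 16_711_425); // worst case still fits
    }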


Another thing about the 432 ISA: it was dense, with bit-aligned variable-length instructions ranging from 6 bits to hundreds of bits. This solved a 1970s problem.

A very successful high-level architecture is the IBM System/38. Not as sexy as the 432, but they sold a lot of systems.



