The Texas Instruments TMS320C40 digital signal processor had even weirder pipeline issues:
- Branch delay slots (https://en.wikipedia.org/wiki/Delay_slot), where one or more instructions after a branch would be executed before the branch actually took effect.
- Load delay slots, where values loaded into registers weren't guaranteed to appear until some later instruction. I believe the value in the register was undefined for several cycles?
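To make the branch delay slot idea concrete, here is a toy interpreter sketch. It is not C40 code: the instruction format, the `run` function, and the `delay_slots` parameter are all made up for illustration, and I'm not claiming a particular slot count for the real chip.

```python
def run(program, delay_slots=1):
    """Run a toy program and return the names of the instructions executed.

    Hypothetical instruction set: ("emit", name) records name,
    ("jump", index) branches to index after `delay_slots` more instructions.
    A branch placed inside a delay slot is not modelled.
    """
    pc = 0
    executed = []
    pending = None  # [slots_left, target] for a branch that has issued but not landed
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "jump":
            pending = [delay_slots, arg]
        else:
            executed.append(arg)
            if pending is not None:
                pending[0] -= 1  # this instruction filled one delay slot
        if pending is not None and pending[0] <= 0:
            pc = pending[1]  # the branch finally takes effect
            pending = None
    return executed


program = [
    ("emit", "before"),
    ("jump", 4),
    ("emit", "slot"),     # sits in the delay slot
    ("emit", "skipped"),
    ("emit", "target"),
]
```

With `delay_slots=1` the "slot" instruction still runs even though it comes after the jump; with `delay_slots=0` it is skipped like on a conventional machine.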
Writing tightly-optimized assembly code for these chips was pretty horrible, sort of like playing an unusually tasteless Zachtronics clone.
It was also kinda awesome because, as long as you were willing to spend days to optimize one page of code, you could get so much performance out of it.
For example, you could deliberately exploit the fact that a multiply only wrote its result into a register ~6 cycles later: you could use that register for a bunch of other stuff in the meantime, and then on the 6th cycle the result would magically appear.
Basically, for those 6 cycles, the multiplication tied up no registers at all, for either its source operands or its destination.
Obviously this was also pipelinable: you could start more multiplies while the first was still running, using the same source and destination registers, as long as you had meanwhile used other instructions to load new data into the inputs and do something else with the earlier outputs.
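A rough way to picture the multiply trick is a toy simulator like the one below. Everything in it is an assumption for illustration: the `ToyDSP` class, the register names, the one-instruction-per-cycle model, and the latency constant (taken from my "~6 cycles" recollection, not from a datasheet).

```python
MUL_LATENCY = 6  # assumed: the result is visible to the 6th instruction after the multiply


class ToyDSP:
    """Hypothetical machine where a multiply writes its result back late."""

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "r2": 0}
        self.in_flight = []  # (cycles_left, dest, value) for multiplies still in the pipeline

    def _tick(self):
        # Advance one cycle and retire any multiply whose latency has elapsed.
        still_running = []
        for cycles_left, dest, value in self.in_flight:
            if cycles_left == 1:
                self.regs[dest] = value
            else:
                still_running.append((cycles_left - 1, dest, value))
        self.in_flight = still_running

    def mul(self, dest, a, b):
        # Operands are captured at issue time, so the sources are free immediately.
        self.in_flight.append((MUL_LATENCY, dest, self.regs[a] * self.regs[b]))
        self._tick()

    def mov(self, dest, imm):
        self.regs[dest] = imm
        self._tick()

    def nop(self):
        self._tick()


def demo():
    dsp = ToyDSP()
    dsp.mov("r0", 3)
    dsp.mov("r1", 4)
    dsp.mul("r2", "r0", "r1")        # first multiply: 3 * 4
    steps = [
        lambda: dsp.mov("r0", 5),    # reuse the source registers right away
        lambda: dsp.mov("r1", 7),
        lambda: dsp.mul("r2", "r0", "r1"),  # second multiply, same registers: 5 * 7
        lambda: dsp.mov("r2", 99),   # borrow the destination as scratch
        dsp.nop, dsp.nop, dsp.nop, dsp.nop,
    ]
    history = []
    for step in steps:
        step()
        history.append(dsp.regs["r2"])  # value of r2 after each instruction
    return history
```

Calling `demo()` returns `[0, 0, 0, 99, 12, 12, 12, 35]`: `r2` is free for scratch use (the 99), then the first product lands on top of it, and the second product lands three instructions later because it was issued three instructions after the first.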