Classic RISC pipeline

The original MIPS and SPARC designs were classic scalar RISC pipelines. Later, Hennessy and Patterson invented yet another classic RISC, the DLX, for use in their seminal textbook "Computer Architecture: A Quantitative Approach."

Each of these designs fetched and attempted to execute one instruction per cycle. Each design had a five stage execution pipeline. During operation, each pipeline stage would work on one instruction at a time.

Each of these stages consisted of an initial set of flip-flops, and combinatorial logic which operated on the outputs of those flops.
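
The structure of one stage can be pictured with a short sketch. The following is a minimal C model, not taken from any of the actual designs: each stage is a latch (its flip-flops) whose contents are replaced at every clock edge by the output of the stage before it, so that each stage holds exactly one instruction at a time. The combinational logic of each stage is omitted, and all struct and function names are illustrative.

#include <stdint.h>

struct latch { uint32_t instr; };   /* the flip-flops at the head of a stage */

struct pipeline {
    struct latch fetch, decode, execute, access, writeback;
};

/* One clock edge: every latch captures the output of the stage before it.
 * Updating back-to-front mimics all flip-flops clocking at the same instant,
 * so each of the five stages works on one instruction per cycle. */
void clock_edge(struct pipeline *p, uint32_t fetched_instr)
{
    p->writeback   = p->access;
    p->access      = p->execute;
    p->execute     = p->decode;
    p->decode      = p->fetch;
    p->fetch.instr = fetched_instr;
}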

The Classic Five Stage RISC Pipeline

Instruction Fetch

The Instruction Cache on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32 bit instruction was fetched from the cache.

At the same time the instruction was fetched, these machines predicted the address of the next instruction by incrementing the address of the instruction just fetched. This prediction was always wrong in the case of a taken branch, jump, or exception, but the penalty for being wrong was small, and incrementing the address requires very little circuitry. Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.
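
As a rough illustration, the fetch stage's behavior can be sketched in C. The cache access is represented by a hypothetical icache_read() helper (not a real interface), and the next-address guess is simply the current program counter plus 4 bytes, i.e. one 32 bit instruction.

#include <stdint.h>

uint32_t icache_read(uint32_t address);   /* assumed one-cycle instruction cache lookup */

struct fetch_out {
    uint32_t instruction;    /* the 32 bit instruction just fetched     */
    uint32_t predicted_pc;   /* guessed address of the next instruction */
};

struct fetch_out fetch_stage(uint32_t pc)
{
    struct fetch_out out;
    out.instruction  = icache_read(pc);
    out.predicted_pc = pc + 4;   /* wrong whenever a branch is taken, a jump
                                    occurs, or an exception is raised */
    return out;
}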

Decode

All MIPS and SPARC instructions have at most two register inputs. During the decode stage, these two register names were identified within the instruction, and the two registers named were read from the register file. In the MIPS design, the register file had 32 entries.

At the same time the register file was read, instruction issue logic in this stage determined if the pipeline was ready to execute the instruction in this stage. If not, the issue logic would cause both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the stages would prevent their initial flops from accepting new bits.
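
A sketch of this stage in C, using the MIPS I instruction encoding (the two source register names sit in fixed fields of the instruction word: bits 25..21 and 20..16). The register file and the stall test are illustrative placeholders, not the actual issue logic of any of these machines.

#include <stdint.h>
#include <stdbool.h>

uint32_t regfile[32];                 /* the 32-entry MIPS register file */

struct decode_out {
    uint8_t  reg_a, reg_b;            /* the two register names          */
    uint32_t value_a, value_b;        /* the values read from the file   */
};

struct decode_out decode_stage(uint32_t instruction)
{
    struct decode_out out;
    out.reg_a   = (instruction >> 21) & 0x1f;   /* rs field */
    out.reg_b   = (instruction >> 16) & 0x1f;   /* rt field */
    out.value_a = regfile[out.reg_a];
    out.value_b = regfile[out.reg_b];
    return out;
}

/* Issue logic: when this returns true, the fetch and decode stages hold
 * their flip-flops (stall) instead of accepting new bits.  The real
 * hazard checks are omitted here. */
bool must_stall(uint32_t instruction)
{
    (void)instruction;
    return false;
}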

Execute

Instructions on these simple RISC machines can be divided into three latency classes:

  • Single cycle latency. Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.

  • Two cycle latency. All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle. (Both of these fast classes are sketched in code after this list.)

  • Many cycle latency. Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instructions wrote their results to a separate set of registers.
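
The two fast latency classes can be sketched as follows. The ALU operation set and the function names here are illustrative, and the multi-cycle multiply/divide unit is left out.

#include <stdint.h>

enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_SLT };

/* Single cycle class: the ALU produces the result within the execute stage. */
uint32_t alu(enum alu_op op, uint32_t a, uint32_t b)
{
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_SLT: return (int32_t)a < (int32_t)b;   /* set-on-less-than compare */
    }
    return 0;
}

/* Two cycle class (loads): the same adder forms the virtual address from a
 * register and a sign-extended constant offset; the data itself arrives from
 * the cache in the following stage. */
uint32_t load_address(uint32_t base_reg, int16_t offset)
{
    return base_reg + (uint32_t)(int32_t)offset;
}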

Access

During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single and two cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file can be used, and it is always available.

During this stage, load and store instructions access the data cache, either reading (in the case of loads) or writing (in the case of stores). By the end of this stage, load instructions have fetched their data from the cache.
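
A sketch of the access stage in C. The dcache_read() and dcache_write() helpers stand in for a hypothetical one-cycle data cache, and the field names are illustrative.

#include <stdint.h>
#include <stdbool.h>

uint32_t dcache_read(uint32_t address);                 /* assumed data cache */
void     dcache_write(uint32_t address, uint32_t data);

struct access_in {
    bool     is_load, is_store;
    uint32_t alu_result;       /* for loads and stores, this is the address */
    uint32_t store_data;
};

/* Returns the value carried on to writeback: either the forwarded ALU result
 * or the freshly loaded data, so both instruction classes reach the register
 * file's single write port in the same stage. */
uint32_t access_stage(struct access_in in)
{
    if (in.is_store)
        dcache_write(in.alu_result, in.store_data);
    if (in.is_load)
        return dcache_read(in.alu_result);
    return in.alu_result;
}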

Writeback

During this stage, both single cycle and two cycle instructions write their results into the register file.

Bypassing

Suppose the CPU is executing the following piece of code:

SUB r3,r4 -> r10
AND r10,3 -> r11

The instruction fetch and decode stages will send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

            cycle 0   cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
Fetch       SUB       AND
Decode                SUB       AND
Execute                         SUB       AND
Access                                    SUB       AND
Writeback                                           SUB       AND

In cycle 2, the Decode stage fetches r10 from the register file. Because the SUB instruction writing to r10 is simultaneously in the execute stage, the value read from the register file is wrong.

The solution to this problem is a pair of bypass multiplexors. These multiplexors sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexor selects between a register file read port, the current output of the ALU, and the current output of the access stage (which is either a loaded value or a forwarded ALU result).

Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and causes the multiplexors to select the most recent data. These bypass multiplexors make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexor, and a flip-flop. Without the multiplexors, the latency of reading and writing the register file would have to be included in the latency of these instructions.
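
The selection made by one bypass multiplexor can be sketched in C. The decode stage compares its source register number against the destination registers of the instructions currently in the execute and access stages and picks the youngest matching value; register 0 is never bypassed because it is hard-wired to zero on MIPS. The structure and field names are illustrative. In the SUB/AND example above, the AND in decode would take r10 from the execute stage's ALU output rather than from the register file.

#include <stdint.h>
#include <stdbool.h>

struct bypass_sources {
    uint8_t  ex_dest;   bool ex_writes;   uint32_t ex_result;    /* current ALU output          */
    uint8_t  mem_dest;  bool mem_writes;  uint32_t mem_result;   /* current access stage output */
};

/* One bypass multiplexor: choose the most recent value of register r. */
uint32_t bypass_mux(uint8_t r, uint32_t regfile_value,
                    const struct bypass_sources *s)
{
    if (r != 0 && s->ex_writes && s->ex_dest == r)
        return s->ex_result;     /* newest: the instruction in execute */
    if (r != 0 && s->mem_writes && s->mem_dest == r)
        return s->mem_result;    /* next: the instruction in access    */
    return regfile_value;        /* otherwise the register file value  */
}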