Out-of-order execution
Out-of-order execution is a microarchitectural paradigm used in high-performance CPUs.
Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1980s. The Intel Pentium Pro was among the first out-of-order microprocessors to reach volume production. Most high-end processors that followed it also use this paradigm. A notable exception is the SPARC processor line from Sun Microsystems.
The logical complexity of out-of-order schemes is the reason such machines did not reach production until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm because of the large silicon area required to build this class of machine.
Important academic research on this subject was led by Yale Patt, using his HPSm simulator. A paper by J. E. Smith and A. R. Pleszkun, published in 1985, completed the scheme by describing how precise exception behavior could be maintained in out-of-order machines.
In-Order Processors
In earlier processors, instructions are normally processed in these steps:
- instruction fetch
- if the input operands are available, the instruction is dispatched to the appropriate functional unit; otherwise the processor stalls until they are available
- the instruction is executed by the appropriate functional unit
- the functional unit writes the results back to the register file
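The stalling behavior in the steps above can be sketched with a toy cycle counter. This is only an illustrative model under assumptions that are not in the text: at most one instruction is dispatched per cycle, each instruction has a fixed latency, and the names `Instr` and `in_order_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str        # register written by the instruction
    srcs: tuple      # registers read by the instruction
    latency: int     # cycles spent in the functional unit (assumed fixed)


def in_order_cycles(program):
    """Return the cycle in which each instruction starts executing.

    Toy model: one dispatch per cycle, and an instruction whose operands
    are not ready blocks every instruction behind it.
    """
    ready_at = {}    # register -> first cycle its value is available
    cycle = 0
    starts = []
    for ins in program:
        # Stall until every source operand has been written back.
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        cycle = max(cycle, operands_ready)
        starts.append(cycle)
        ready_at[ins.dest] = cycle + ins.latency
        cycle += 1   # the next instruction dispatches no earlier than this
    return starts


program = [
    Instr("r1", ("r0",), 3),   # slow operation producing r1
    Instr("r2", ("r1",), 1),   # needs r1, so it stalls
    Instr("r3", ("r0",), 1),   # independent, but stuck behind the stall
]
print(in_order_cycles(program))   # [0, 3, 4]
```

In this run the third instruction does not depend on the first two, yet it cannot start before cycle 4 because the stalled second instruction is ahead of it.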
Out-of-Order Processors
This new paradigm breaks up the processing of instructions into these steps:
- instruction fetch
- instruction dispatch to an instruction queue (also called instruction buffer or reservation stations)
- the instruction waits in the queue until its input operands are available. The instruction is allowed to leave the queue before older instructions
- the instruction is issued to the appropriate functional unit and executed by that unit
- the results are queued
- once all older results have been written back to the register file, this result is written back as well. This is called the graduation or retire stage.
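The steps above can be illustrated with a small timing model. It is a sketch under simplifying assumptions that are not in the text: one instruction enters the queue per cycle, functional units and queue entries are unlimited, latencies are fixed, and only true data dependencies delay an instruction. The function name `out_of_order_cycles` is hypothetical.

```python
def out_of_order_cycles(program):
    """Return (issue, retire) cycle lists for a toy out-of-order core.

    program: list of (dest, srcs, latency) tuples in program order.
    """
    ready_at = {}    # register -> cycle its newest value becomes available
    issue, finish = [], []
    for i, (dest, srcs, latency) in enumerate(program):
        dispatched = i                      # one instruction enters the queue per cycle
        operands = max((ready_at.get(r, 0) for r in srcs), default=0)
        start = max(dispatched, operands)   # waits only for its own operands
        issue.append(start)
        finish.append(start + latency)
        ready_at[dest] = start + latency

    # Retirement is in program order: a result waits for every older result.
    retire, last = [], 0
    for done in finish:
        last = max(last, done)
        retire.append(last)
    return issue, retire


program = [
    ("r1", ("r0",), 3),   # slow operation producing r1
    ("r2", ("r1",), 1),   # needs r1
    ("r3", ("r0",), 1),   # independent of the two instructions above
]
issue, retire = out_of_order_cycles(program)
print(issue)    # [0, 3, 2]
print(retire)   # [3, 4, 4]
```

Here the third instruction issues at cycle 2, ahead of the older second instruction (cycle 3), but its result is not retired until the second instruction has retired.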
Dispatch and Issue Decoupling allows Out-of-Order issue
One of the differences introduced by the new paradigm is the use of queues, which allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was Decoupled Architecture. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.
The queues allow instructions to be executed in the order in which their input operands become available, as opposed to program order (the order in which the programmer or compiler placed the instructions). That is, the instructions held in the dispatch buffer are executed in dataflow order.
To avoid false operand dependencies, which would reduce how often instructions can be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than the architecture defines. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.
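Register renaming can be sketched as a map from architectural registers to physical register tags. The following is a minimal illustration, not a description of any particular processor: the function name `rename`, the `p0`, `p1`, ... tag format, and the choice never to return physical registers to the free list are all assumptions made for brevity.

```python
def rename(program, arch_regs, num_phys):
    """Rewrite (dest, srcs) instructions to use physical register tags.

    program:   list of (dest, srcs) using architectural register names.
    arch_regs: architectural register names defined by the ISA.
    num_phys:  total number of physical registers (assumed > len(arch_regs)).
    """
    # Initially each architectural register owns one physical register.
    reg_map = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    free = [f"p{i}" for i in range(len(arch_regs), num_phys)]
    renamed = []
    for dest, srcs in program:
        # Sources read the newest version of each architectural register.
        phys_srcs = tuple(reg_map[s] for s in srcs)
        if not free:
            raise RuntimeError("no free physical register; a real core would stall")
        # Every write gets a fresh tag, so older readers keep their version.
        phys_dest = free.pop(0)
        reg_map[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


program = [
    ("r1", ("r2",)),   # r1 = f(r2)
    ("r3", ("r1",)),   # r3 = f(r1): true dependency on the write above
    ("r1", ("r4",)),   # r1 = f(r4): reuses the name r1, a false dependency
]
renamed = rename(program, ["r1", "r2", "r3", "r4"], num_phys=8)
for ins in renamed:
    print(ins)
# ('p4', ('p1',))
# ('p5', ('p4',))
# ('p6', ('p3',))
```

After renaming, the third instruction writes `p6` while the second still reads `p4`, so two versions of `r1` coexist and the third instruction no longer has to wait for the second to read its operand.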
Execute and Writeback Decoupling allows program restart
The results queue is needed to handle events such as branch mispredictions and exceptions or traps. It allows a program to be restarted after an exception, which requires that instructions complete in program order. It also allows results to be discarded when an older branch instruction was mispredicted or an older instruction took an exception.
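A results queue of this kind can be sketched as a first-in, first-out buffer whose entries are filled in any order but committed only from the oldest end. This is a toy model with hypothetical names (`ReorderBuffer`, `allocate`, `complete`, `retire`); it models only exceptions, and it assumes that the faulting instruction and everything younger are simply dropped, leaving restart to a handler that is not shown.

```python
from collections import deque


class ReorderBuffer:
    """Toy results queue: results arrive in any order, commit in program order."""

    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.regfile = {}        # architectural state, updated only at retire

    def allocate(self, tag, dest):
        # Called in program order when the instruction is dispatched.
        self.entries.append(
            {"tag": tag, "dest": dest, "value": None, "done": False, "fault": False}
        )

    def complete(self, tag, value=None, fault=False):
        # Called whenever a functional unit finishes, in any order.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry.update(value=value, done=True, fault=fault)
                return
        raise KeyError(tag)

    def retire(self):
        """Commit finished instructions from the head; return their tags."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["fault"]:
                # Exception on the oldest instruction: discard it and
                # every younger result without touching the register file.
                self.entries.clear()
                break
            self.regfile[head["dest"]] = head["value"]
            retired.append(head["tag"])
            self.entries.popleft()
        return retired


rob = ReorderBuffer()
rob.allocate("i0", "r1")
rob.allocate("i1", "r2")
rob.allocate("i2", "r3")
rob.complete("i2", value=30)      # the youngest instruction finishes first
first = rob.retire()              # nothing retires: i0 is still executing
rob.complete("i0", value=10)
rob.complete("i1", fault=True)    # i1 takes an exception
second = rob.retire()             # i0 commits; i1 and i2 are discarded
print(first, second, rob.regfile)   # [] ['i0'] {'r1': 10}
```

The result of `i2` was computed early but never reached the register file, so the architectural state seen after the exception is exactly the state produced by the instructions older than `i1`.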
Micro-architectural Choices
- Are the instructions dispatched to one centralized queue or to multiple distributed queues?
- IBM PowerPC processors use queues that are distributed among the different functional units, while most other out-of-order processors use a centralized queue.
- Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps which hold the register renaming information for each instruction in flight.
- Early Intel out-of-order processors use a results queue called a reorder buffer, while most later out-of-order processors use register maps.