Talk:Out-of-order execution
This is the talk page for discussing improvements to the Out-of-order execution article. This is not a forum for general discussion of the article's subject.
It is requested that a computing diagram or diagrams be included in this article to improve its quality. Specific illustrations, plots or diagrams can be requested at the Graphic Lab. For more information, refer to discussion on this page and/or the listing at Wikipedia:Requested images.
The contents of Decoupled architecture were merged into Out-of-order execution on 2016-02-13. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists. For the discussion at that location, see its talk page.
branch prediction
How does out-of-order execution fit in with branch prediction? Some people claim that branch prediction requires out-of-order execution, since otherwise the next instruction can only be fetched once the branch has resolved, so prediction would give no benefit at all. — Preceding unsigned comment added by 62.155.231.153 (talk) 21:19, 15 June 2011 (UTC)
- There is a tradition of instruction fetch, instruction decode, instruction execution, and instruction retirement; even without OoO you can still do those in parallel. So, with branch prediction you can still do fetch and decode ahead of time, and then do execution in order. With core memory, this was pretty important. A big part of the 360/91 is 16-way interleaved memory, but it still took a long time. After the 360/91, the 360/85, the first machine with a memory cache, was close to the speed of the 360/91 for many programs, as the cache made up for the difference in processing speed. As well as I remember, the 360/91 doesn't do branch prediction, but it can prefetch on two possible branch paths. The 360/91 has a special 'loop mode' which keeps a cache of instruction words, such that tight loops can run without fetching instructions from core. Not exactly prediction, but it only works for successful back branches. Gah4 (talk) 18:01, 16 November 2015 (UTC)
- You can implement OoO execution without branch prediction, but performance would likely be drastically impaired. The purpose of branch prediction is to enlarge the instruction window under execution (the depth of the pipe), the window being defined as the range of instructions in the dynamic instruction stream from the issue point (next instruction to fetch/decode/issue) through the commit point (next instruction to retire). I have long used the rule of thumb that execution rate in an OoO machine could be first-order modeled using the approximation "IPC = min(I, sqrt(W)/L)", where IPC is instructions per clock (throughput), I is the machine's peak issue rate (e.g., in a 4-issue machine, I = 4), W is the mean window size in instructions, and L is the mean latency to execute an instruction once it is "ready" (all input dependencies satisfied) and "dispatched" (sent to an execution unit). L for example is 1 for a single-cycle ALU, while it could be 2 or more clocks for a load-to-L1D-cache-hit instruction. Assuming a conditional branch density of 20%, the average window size without branch prediction would be quite small. With branch prediction, we can greatly increase the mean window size. This will obviously have a big impact on performance. Mshebanow (talk) 04:17, 31 July 2020 (UTC)
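To make that rule of thumb concrete, here is a minimal Python sketch (the function name and the example numbers are made up for illustration, not taken from any real machine):

```python
import math

def ipc_estimate(issue_width, window_size, mean_latency):
    """First-order OoO throughput model: IPC = min(I, sqrt(W) / L)."""
    return min(issue_width, math.sqrt(window_size) / mean_latency)

# Hypothetical 4-issue machine with a 2-cycle mean execution latency.
print(ipc_estimate(4, 64, 2))   # 4.0 -> limited by the peak issue rate
print(ipc_estimate(4, 16, 2))   # 2.0 -> limited by window size and latency
```

With a small window, as you would get without branch prediction at roughly 20% branch density, the sqrt(W)/L term dominates and throughput falls well below the peak issue rate.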
Queue
Isn't the point of a queue to have something first in, first out (FIFO)? If the instruction is fetched and decoded and stuck in a queue, wouldn't that imply some kind of ordering of instructions? Or is it more like a priority queue? I'm kind of confused on that point, could someone clarify? Epachamo 03:54, 2 March 2006 (UTC)
Conceptually, the queue holds instructions and (once produced) their results. Instructions are kept in order by the queue. When the result of an instruction is computed, it is written to the proper entry in the queue, next to the instruction. When the instruction commits (reaches the head of the queue), its result is read from the queue and written to the programmer-visible registers. In summary: it is a FIFO discipline with respect to instructions, but not results. Results are written into the queue out of order, but extracted in program order. Schiermer 19:53, 30 March 2006 (UTC)
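To make the FIFO-with-out-of-order-results idea concrete, here is a toy Python sketch (my own names and structure, not any particular machine's design): instructions enter and retire in program order, while results may be written back in any order.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: instructions enter in program order, results
    arrive in any order, commits happen strictly FIFO."""

    def __init__(self):
        self.entries = deque()   # each entry: [instr_id, result_or_None]
        self.index = {}          # instr_id -> entry, for result writeback

    def issue(self, instr_id):
        entry = [instr_id, None]
        self.entries.append(entry)   # program order preserved here
        self.index[instr_id] = entry

    def complete(self, instr_id, result):
        self.index[instr_id][1] = result   # result written out of order

    def commit(self):
        """Retire finished instructions from the head only (in order)."""
        retired = []
        while self.entries and self.entries[0][1] is not None:
            instr_id, result = self.entries.popleft()
            del self.index[instr_id]
            retired.append((instr_id, result))   # now architecturally visible
        return retired

rob = ReorderBuffer()
for i in range(3):
    rob.issue(i)
rob.complete(2, "r2")    # completes first, but cannot retire yet
rob.complete(0, "r0")
print(rob.commit())      # [(0, 'r0')] -- only up to the oldest unfinished instruction
```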
I suppose that seems right for in-order retirement. I believe the 360/91 does out-of-order retirement, with register renaming to get the data in the right place. But it is hard to combine OoO retirement with virtual memory, as you don't know when a page fault might occur. The 360/91 has the dreaded "multiple imprecise interrupt" when more than one interrupt happens before the pipeline has been flushed. The next instruction address doesn't tell you where the fault occurred, either. You can't have a page fault that doesn't identify the faulting instruction and allow it to be retried. Gah4 (talk) 18:09, 16 November 2015 (UTC)
History
I believe there are a few inaccuracies in the history section of the article.
Firstly, the article mentions, 'IBM also introduced the first out-of-order microprocessor, the POWER1 (1990) for its RS/6000.'
This is misleading and has potential for confusion, as the POWER1 could only execute floating-point instructions out of order, but the article speaks of it in a way that suggests it was fully out of order. It was not until the POWER3 was released in 1997 that the POWER series had a fully OoO member.
Secondly, the article says, 'It was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream.'
This too is misleading and has potential for confusion, because it sounds as if Intel, by using OoO, prompted other companies such as MTI, HP, HAL and DEC to design equivalent CPUs. In fact, the Pentium Pro was based on the OoO technique of the PA-8000, which Intel had access to thanks to its alliance with HP. Other companies such as MIPS also had OoO CPUs in development at roughly the same time as HP and Intel. Furthermore, the article says that the Pentium Pro brought the technology into the mainstream, which is inaccurate, as the processor was deemed too expensive for the consumer market by Intel late in the design stage and was retargeted towards enterprise markets.
Does anyone have any thoughts or comments? Rilak 17:39, 14 August 2007 (UTC)
- I just find the claim "it was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream" odd, when the Power Macs (sp. models 6100/60, 7100/66, and 8100/80) were introduced in March of 1994. These models were based on the PPC601, and the Motorola & IBM white papers I had read in 1992/1993 stated that the 601 operated "fully" out-of-order (or "out-of-queue" as the authors preferred), as opposed to the out-of-order FPU execution of the POWER1. (I hope I still have those 15-year-old papers) ▪ NeoAmsterdamTalkEdits 00:10, 18 March 2008 (UTC)
- From this paper: The Design Space of Register Renaming Techniques by Dezsö Sima of Budapest Polytechnic, published in the September/October 2000 issue of IEEE Micro, the ES/9000 mainframe of 1992 from IBM was the first processor to implement full out-of-order execution. I'm not sure of the out-of-order capabilities of the PowerPC, but I think that they were limited much like the POWER1 and POWER2. The NextGen Nx586 of 1994, released one year before the Pentium Pro, could execute fixed-point instructions out of order, and the AMD K6, released after the Pentium Pro, had complete capability. Other designs such as the SPARC64 from HAL and the PowerPC chips of 1995 were also fully out of order. I think, with all this, we can conclude that the Pentium Pro was certainly not innovative at all in regard to OoO and was following the general trend like everyone else. Rilak (talk) 06:22, 20 March 2008 (UTC)
- I was the person who wrote the sentence "brought the technology to the mainstream". What I meant is that the Pentium Pro was the first OoO device to be sold in large mass-market volumes, not so much that it caused other design houses to change their microarchitectures. Reading about all of the other CPUs listed on this discussion page, I still believe my statement is an accurate assessment. The only other devices that you might claim sold in large volumes are the PPCs used in the Macs, but by the early 1990s Apple had perhaps 10 percent(?) of the PC market in total, and much of that was still 68K based. As a separate point, the Pentium Pro CPU was the one that ended the architecture wars of the '80s and '90s by removing any hope of a performance advantage to be provided by the simpler RISC architectures. Dyl (talk) 07:19, 20 March 2008 (UTC)
- Maybe the statement could be clarified further? The Pentium Pro was not exactly common; it was only used in expensive (relative to PCs of the time) x86 workstations and entry-level servers. I'm not sure what you mean by the PPro removing any hope of a performance advantage by simpler RISC architectures, but I think it is fair to say that most RISC designs of the same period were still faster, to put it bluntly. I think that the only significance of the PPro is that it was part of the trend to translate x86 instructions into simpler micro-ops that could be executed more effectively by a RISC-like or RISC-inspired core. Rilak (talk) 19:29, 20 March 2008 (UTC)
- My belief is that the volume of x86 workstations/entry-level servers is still higher than Apple's 601-based systems, though I don't have the references to back that up. Perhaps the statement would be less controversial if it said the Pentium Pro and its descendant the Pentium II brought OoO to the masses. I disagree about the performance issue. I was there at Microprocessor Forum '96(?) when the first PPro performance numbers were announced, and the integer numbers were as good as any at the time. Yes, the RISC machines would continue to have better FP numbers for a long while, but for the mass market that's a non-issue. Dyl (talk) 16:34, 21 March 2008 (UTC)
- Yes, PPro workstations and servers were far more widespread than PPC systems, but isn't that the point? The PPro was not a consumer device, e.g. a Pentium in a $1,500 box. I would say that the Pentium II was the processor that really brought OoO to the masses, but I think it would be better if we simply didn't state which chip did this and that first. Instead, by providing the history without any "controversial" words with debatable meanings such as 'mainstream' or 'masses', the article would be a lot more neutral. As for performance, I would say that the PA-8000 and Alpha beat the P6, but that's completely irrelevant to this article. Rilak (talk) 06:52, 22 March 2008 (UTC)
- I changed the "mainstream" comment to be less controversial. Since I didn't add the "first" comments, I'll leave that to others to change or defend. Dyl (talk) 23:35, 22 March 2008 (UTC)
Regarding "POWER1 could only execute floating point instructions out of order": For scientific programming, floating point is the dominant time use, and most fixed-point instructions are fast enough that in-order is fine. The design of the CDC 6600, 7600, and IBM 360/91 was mostly aimed at speeding up floating-point calculations. Gah4 (talk) 02:55, 24 September 2016 (UTC)
Lynn Conway
During the development of ACS-1, Lynn Conway invented Dynamic Instruction Scheduling. Her paper can be found here: http://ai.eecs.umich.edu/people/conway/ACS/DIS/DynInstSched-paper.html. This is likely the most important development in out-of-order execution, but it is nowhere mentioned in the article.
Yale Patt seems to be getting too much credit, as he mostly plagiarized Lynn's work. This is mentioned here: http://ai.eecs.umich.edu/people/conway/Retrospective2.html#anchor305182 — Preceding unsigned comment added by 145.120.12.69 (talk) 22:49, 6 February 2012 (UTC)
- It's an interesting debate. The problem is that Lynn (who wasn't named Lynn at the time, which makes some things more tricky in some cases) published internally only (as I understand it). Obviously Tomasulo is the one who gets the base credit--and it's well deserved--even though his algorithm was really only used for floating point scheduling. I think Lynn's work deserves credit, but it needs some independent reliable sourcing (WP:V) about its impact. The links above probably don't cut it. And I very much doubt the movement of the idea from Lynn to Yale will be able to be shown to have occurred one way or the other.
- Out of curiosity, does Dr. Conway's paper address renaming? If not, I'm not really seeing how this is an improvement over scoreboarding. In fact, after a quick scan, I'm not seeing how name dependencies are resolved/dealt with at all. I assume I'm missing something obvious. While a bit off topic for Wikipedia, I'd love to know what I'm missing. Back to work! Hobit (talk) 01:06, 8 February 2012 (UTC)
- Reading the ACS disclosure now, I can definitely state that in 1984, when the initial concepts of HPS were fleshed out, we were not aware of that paper. So, I think the statement, "As he mostly plagiarized Lynn's work," is not correct. (For the record, Wen-mei Hwu and I started working on HPS in the summer of that year while we were interns at DEC's ERL (Eastern Research Lab) in Hudson, Mass.) Mshebanow (talk) 03:56, 31 July 2020 (UTC)
Merge from Decoupled architecture
Decoupled architecture → Out-of-order execution
- Support. Looks like the contents of Decoupled architecture can be effectively merged into Out-of-order_execution#Dispatch_and_issue_decoupling_allows_out-of-order_issue. ~KvnG 00:00, 3 February 2015 (UTC)
- Support. The Decoupled architecture article is too complicated for some readers and is very short. I think this should be merged. --☣Anarchyte☣ 02:53, 22 May 2015 (UTC)
- Support. I think I understand Decoupled architecture, but it seems to be saying that "decoupled architecture" is another way of saying "an architecture with out-of-order execution". If that isn't the case, then it should be rewritten in a way that explains the distinction.--Wcoole (talk) 19:24, 30 October 2015 (UTC)
Done ~Kvng (talk) 04:59, 14 February 2016 (UTC)
Static vs. Dynamic Scheduling Superscalar
I'm a bit muddled on the distinction between static and dynamic scheduling and how they overlap with OOOE, VLIW, and superscalar. I think I understand, in a general way, the technology involved, but the terminology seems to be a bit variable. The Conway paper uses "Dynamic Instruction Scheduling", but describes something I recognize as OOOE. http://foldoc.org/Very%20Long%20Instruction%20Word says that VLIW is equivalent to "staticly scheduled superscalar". It seems as though "static" in this context means "in-order" and "dynamic" means "out-of-order", but I haven't found a reference that states this clearly. I also haven't found a source that contradicts this interpretation.
I'd be very happy if someone more conversant with this area were to add to an appropriate article, explaining these terms.--Wcoole (talk) 19:38, 30 October 2015 (UTC)
- It isn't so obvious. Seems to me that the idea of dynamic is that the schedule is determined during execution. For example, it could change from run to run, as memory access timing, and especially effects from other tasks running, can be unpredictable.
- Note that Itanium is listed as not OoO. Instead, Itanium depends on the compiler to order things in the optimal order. I would say that, technically, static should refer to a fixed order, not necessarily in-order. Gah4 (talk) 02:30, 24 September 2016 (UTC)
Basic concept
The Basic concept section seems to suggest that the steps (fetch, decode, execute, retire) happen sequentially for in-order execution. As well as I know, they can overlap as long as execute is in order.
Also, for OoO, the section seems to imply in-order retirement. As far as I know, all current processors do that, as it is almost necessary for dynamic memory. The 360/91, as well as I know, allows OoO retirement, consistent with register renaming. With OoO retirement, it is necessary to flush all pipelines before an interrupt can be taken. Gah4 (talk) 02:35, 24 September 2016 (UTC)
- Also, in explaining in-order execution, it says: the instruction is dispatched to the appropriate functional unit. I suppose this is right, but the reason is that such processors tend to have only one functional unit. The idea of more functional units is to allow more parallelism, which increases the need for OoO. Gah4 (talk) 15:10, 8 January 2018 (UTC)
- FWIW, we always defined six major events in the life of an instruction:
- Fetch - an "FPC" (fetch program counter) is used to index memory to get the next instruction to execute. That instruction is read and brought into the machine.
- Decode - the instruction is decoded and classified along with a determination of all operands (to be read and/or written).
- Issue - the instruction is staged for execution. Staging means it is made available for the execution core.
- Dispatch - the instruction is destaged and sent to an execution unit for actual execution.
- Complete - the instruction completes execution and its result is broadcast to other (potentially) waiting, staged instructions.
- Commit (aka retire) - the instruction's side effects are made (permanently) visible outside the machine.
- (Note there is controversy over the definitions of Dispatch versus Issue.) I would define it so that in an in-order machine, at the very least, Fetch/Decode/Issue/Dispatch and Commit must occur in program order; Complete can occur out of program order so long as side effects from completion are not exposed outside the machine (in-order retirement/commit). Whereas in an OoO machine, only Fetch/Decode/Issue and Commit must occur in order; Dispatch and Complete can occur out of program order so long as operand dependency hazards are honored. Mshebanow (talk) 04:31, 31 July 2020 (UTC)
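As a rough illustration of those ordering rules, here is a toy Python check of an event trace against the set of stages that must stay in program order (the stage names follow the list above; everything else is made up for illustration):

```python
# The six stages from the list above: fetch, decode, issue, dispatch,
# complete, commit.  Each machine model lists the stages that must be
# observed in program order.
IN_ORDER_MACHINE = {"fetch", "decode", "issue", "dispatch", "commit"}
OOO_MACHINE = {"fetch", "decode", "issue", "commit"}

def obeys_ordering(trace, ordered_stages):
    """trace: (instruction_number, stage) events in the order they happened.
    Returns True if, for every stage in ordered_stages, instruction numbers
    never go backwards (i.e. that stage occurs in program order)."""
    last_seen = {stage: -1 for stage in ordered_stages}
    for instr, stage in trace:
        if stage in ordered_stages:
            if instr < last_seen[stage]:
                return False
            last_seen[stage] = instr
    return True

# Instruction 1 is dispatched and completes before instruction 0
# (say instruction 0 is a long-latency load), but both commit in order.
trace = [(0, "issue"), (1, "issue"),
         (1, "dispatch"), (0, "dispatch"),
         (1, "complete"), (0, "complete"),
         (0, "commit"), (1, "commit")]
print(obeys_ordering(trace, OOO_MACHINE))       # True: only issue/commit checked
print(obeys_ordering(trace, IN_ORDER_MACHINE))  # False: dispatch out of order
```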
- I suspect that all six are commonly distinguished only in fancier processors, which do some overlapping. Also, on simpler processors those might not all be really separate. Early computers published instruction timings, which were often a requirement when bidding for sales. As processors got more parallel, it wasn't possible to give individual times. That is where the distinctions above become interesting. Early parallel processors used out-of-order retirement (or commit), but that is a lost art. Mostly, virtual addressing complicated things. If an interrupt, especially an exceptional condition in the executing instructions, occurs, the pipeline must be flushed. Translation interrupts can't occur during that time. The IBM 360/91, a popular example in many books on pipelined processors, allows for out-of-order retirement. (There isn't enough state to keep it all in.) One result is the dreaded multiple imprecise interrupt. The IBM PL/I compilers generate a message saying where a problem occurred. On the 360/91, the message says "near" instead. Fetch is important, as that affects self-modifying code. Gah4 (talk) 07:11, 31 July 2020 (UTC)
Pipelining
There isn't much discussion of pipelining and parallel execution. The big advantages of OoO come when you can execute more than one instruction at a time, and especially if you pipeline them. There is a small advantage from overlapping memory access without execution overlap, but not the big advantage that is possible. Gah4 (talk) 03:12, 24 September 2016 (UTC)
although out-of-order execution was limited to floating-point instructions
I am wondering about the meaning of: although out-of-order execution was limited to floating-point instructions. Does that mean that if you erase all the floating point instructions, those that are left are in order? If you exchange the order of a floating point and a fixed point instruction, which one is out of order? But yes, all the OoO hardware on the 360/91, such as register renaming and reservation stations, is for floating point, but that allows faster fixed-point operations to finish earlier. Gah4 (talk) 15:31, 10 April 2017 (UTC)
- The only part of that machine that could execute instructions out of order was the floating point processor. I don't recall how interactions between the integer side and the fp side were handled. Hobit (talk) 18:23, 10 April 2017 (UTC)
- OK, but this is pretty fundamental to the meaning of OoO execution, the subject here. Many processors overlap fetch, decoding, and retirement with execution, but still do execution in order. For the 8086, you can see this by modifying instructions after fetch, where the processor does not detect the change. That is part of the 8086 model. The 360/91 does detect such modification, because S/360 allows for self modifying code. Since fixed point is used for computing addresses, you pretty much have to be able to execute fixed point instructions mixed in with floating point instructions. Next, consider the 8086/8087 where the 8087 executes concurrently with the 8086. (A WAIT instruction is required to resynchronize the instruction stream.) This does not, as far as I know, count as OoO, as fixed point instructions are executed in order, and floating point instructions are in order, even though the two instruction streams can overlap. It seems to me that the article needs to make this distinction clear. Gah4 (talk) 20:16, 10 April 2017 (UTC)
- It would be good to clarify, I agree. I even knew enough details of this processor at one point I could have clarified, but I don't right now. If you have solid insight (and ideally a source) I'd suggest you tackle it in the article. Hobit (talk) 22:02, 10 April 2017 (UTC)
meltdown versus spectre
The revision of 19:45, 4 January 2018, added information about Meltdown (security vulnerability), stating that the vulnerability takes advantage of out-of-order execution. This statement is actually incorrect, in that Meltdown takes advantage of privilege checks that are missing during execution under the Tomasulo algorithm across a security boundary (ring levels), and not of OoOE directly. This vulnerability was also only exploitable against Intel's implementation of their architecture (and not AMD's, which actually performed the proper checks).
I believe the author intended to mention Spectre (security vulnerability) (speculative execution), which is an attack on Robert Tomasulo's Tomasulo algorithm, directly taking advantage of an oversight in out-of-order execution. Despite Meltdown (security vulnerability) being the more practical vulnerability (and thus the more public one), Spectre (security vulnerability) is the vulnerability that directly attacks the speculative execution provided by the Tomasulo algorithm. I've updated the page to reference this. 2605:6000:E890:B100:CD1D:1DAC:CAE2:5F48 (talk) 20:00, 31 January 2018 (UTC)
- Even more, it seems to me that it is speculative execution, and not OoO, that creates the problem. The 360/91, where the Tomasulo algorithm came from, does OoO but not speculative execution. It does speculative prefetch (on two additional paths), though. Also, with no cache, cache-related effects don't occur. But otherwise, it seems to me that this is big enough for an article of its own, mentioned from this page. Gah4 (talk) 21:47, 31 January 2018 (UTC)
- Agreed. It seems that Speculative_execution covers this completely. Should we just remove the mention of these attacks and perhaps just replace them with a link to that article? 2605:6000:E890:B100:CD1D:1DAC:CAE2:5F48 (talk) —Preceding undated comment added 18:24, 9 February 2018 (UTC)
Basic concept - what about writing results back to memory
I feel that a major piece is missing from the description of the basic concept. The described instruction cycle stops after writing operation results back to the register file. But at some point some results must be stored in memory (the so-called side effects of the instructions, observable from the outside).
In a very simple architecture, you could have all operations work on registers and have separate instructions to store registers in memory. Even this simplified case is absent from the article. In most CPUs, however, one instruction can fetch operands from memory, operate on them, and store the result in memory. This has effects on the OoO instruction cycle, and for completeness, this should be described, as well as any observable effects of OoO on the memory data flow. Basically: will memory reads and writes always occur in "program order" or not?
(Pardon me for not being updated on all the correct terminology. Although memory in practice is cached, I assume here a simple memory model without caching or virtual addressing, to simplify this discussion.)
1. Data dependency in memory. In the stage "Instruction dispatch to an instruction queue", if an instruction needs operands from memory, the logic must discover whether an earlier queued instruction will update the same memory. If so, this instruction must (in principle) stall until the previous instruction has finished the memory update cycle, in addition to waiting for the register file to be updated. Here I don't know the technical details: theoretically there is an obvious possibility to bypass memory and just pass the data from instruction A to B via an internal register instead, in particular if instruction B will also update the memory cell. (A sketch of this bypass idea appears after this post.)
2. An instruction that writes to memory is not finished when the register file is updated, it is finished when the write-to-memory cycle is completed. (Memory is most likely cache memory.)
3. The question, as already stated: Will the CPU guarantee that all observable side effects (memory reads and writes) occur in "program order" so that the (assembly-level) programmer of an OoO CPU does not have to worry about it in terms of program correctness? (Performance optimization is another thing.)
3a. Considering only one CPU versus memory.
3b. Including exceptions and interrupts, which is however already discussed in the article and the talk page.
3c. With multiple processing units sharing one memory, or CPU + peripheral. If the answer to 3a is yes, then it appears to me that there are also no OoO-induced synchronization issues when using multiple CPUs. Am I wrong? Probably.
Note, related to 3c: I know that there are many, many issues with multiprocessing, but how much CPU-level OoO contributes to this is part of what I'm trying to get at. Perhaps most or all of the trouble comes from "external" OoO execution caused by compiler optimizations that actually modify the program flow seen by the CPU, as well as cache coherency issues? I'm posting a separate talk subject about the need to distinguish between internal and external OoO.
I apologize for the length of this talk. This looks like a bunch of questions. I know this is not the Stack Overflow forum. My point is that the article should clarify these things in order to address some of these questions.
195.60.183.2 (talk) 12:34, 25 January 2021 (UTC) (Anders Hallström non-registered user)
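Regarding point 1 above, here is a toy Python sketch of the bypass idea, often called store-to-load forwarding (the structure and names are illustrative only, not any particular CPU's design): a load first checks older buffered stores and forwards the youngest matching value; otherwise it reads memory. Buffered stores are later written to memory in program order.

```python
class StoreBuffer:
    """Toy model of the check described in point 1: a later load must
    either wait for, or forward from, an older store to the same address.
    Addresses and values are plain integers here."""

    def __init__(self):
        self.pending = []   # older, not-yet-retired stores: (addr, value)

    def store(self, addr, value):
        self.pending.append((addr, value))   # buffered in program order

    def load(self, addr, memory):
        # Scan from youngest to oldest so the most recent matching store wins.
        for st_addr, st_value in reversed(self.pending):
            if st_addr == addr:
                return st_value      # store-to-load forwarding: bypass memory
        return memory[addr]          # no conflict: read memory directly

    def drain(self, memory):
        """Retire buffered stores to memory in program order."""
        for addr, value in self.pending:
            memory[addr] = value
        self.pending.clear()

memory = {100: 1, 104: 2}
sb = StoreBuffer()
sb.store(100, 42)            # older store, not yet written to memory
print(sb.load(100, memory))  # 42 -- forwarded from the store buffer
print(sb.load(104, memory))  # 2  -- no older store, read memory
sb.drain(memory)
print(memory[100])           # 42 -- stores become visible in program order
```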