Prefetch input queue

Introduction

Pipelining was brought to the forefront of computing architecture design during the 1960's due to the need for faster and more efficient computing. Pipelining is the broader concept and Most modern processors load their instructions some clock cycles before they execute them. This is achieved by pre-loading machine code from memory into a prefetch input queue (PIQ).

This behavior only applies to von Neumann computers (that is, not Harvard architecture computers) that can run self-modifying code and have some sort of instruction pipelining. Nearly all modern high-performance computers fulfill these three requirements.^[1] The microprocessors function as the CPU in the stored program model of the digital computer. Its job is to generate all system timing signals and synchronize the transfer of data between memory, I/O, and itself.^[2]

Usually, the prefetching behavior of the PIQ is invisible to the programming model of the CPU. However, there are some circumstances where the behavior of PIQ is visible, and needs to be taken into account by the programmer.

When the x86-processor changes mode from realmode to protected mode and vice versa, the PIQ has to be flushed, or else the CPU will continue to translate the machine code as if it were written in its last mode. If the PIQ is not flushed, the processor might translate its codes wrong and generate an invalid instruction exception.

When executing self-modifying code, a change in the processor code immediately in front of the current location of execution might not change how the processor interprets the code, as it is already loaded into its PIQ. It simply executes its old copy already loaded in the PIQ instead of the new and altered version of the code in its RAM and/or cache.

This behavior of the PIQ can be used to determine if code is being executed inside an emulator or directly on the hardware of a real CPU. Most emulators will probably never simulate this behavior. If the PIQ-size is zero (changes in the code always affect the state of the processor immediately), it can be deduced that either the code is being executed in an emulator or the processor invalidates the PIQ upon writes to addresses loaded in the PIQ.

Queuing Theory & Performance based Evaluation of different Queuing Models

Queuing Theory is considered as the subdivision of the applied probability theory, where limited resources are being shared simultaneously, which results in waiting queue lines being formed at the resource side. It was A.K Erlang ( 1878-1929 )who first of all thought of a Queue as a solution to Congestion in telephone traffic and thereafter this concept was used widely in various applications ranging from aviation traffic control, road traffic congestion, biology, astronomy to the modern application like nuclear cascade theory, voice/data communication protocols, advanced electronics like microprocessors and other embedded devices etc. Different queuing models are proposed in order to approximately simulate the real time queuing systems so that those can be analysed mathematically for different performance specifications, for example average instruction execution time, arrival rate of queue elements, service rate of queue element etc.

Queuing models can be represented using Kendall's notation : A1/A2/A3/A4/A5/A6

where:

A1 is the distribution of time between two arrivals

A2 is the service time distribution

A3 is the total number of servers

A4 is the capacity of system

A5 is the calling population

A6 is the service discipline assumed

1) M/M/1 Model ( Single Queue Single Server/ Markovian ) : In this model, elements of queue are served on first Come first served basis. If we assume the mean arrival and service rates to be AR and SR respectively, then actual rates are varying around these average values randomly and hence have to be determined by cumulative probability distribution function.

Poisson's Probability distribution Function, which is mostly used for modelling of queuing systems, is used here. This distribution is representing the number of successes of any event when a large number of trials are being performed .Interestingly here, basic event of interest in rarest and has very less chances of occurrence.

2) M/M/r Model : This model is generalization of basic M/M/1 model where multiple servers working in parallel are used . Since the service time distribution is Markovian , each server will have its own independent but identical exponential service time hold distribution and then by differential analysis, state of the queue can be judged for, what are the number of services completed are forming a Poisson's process and time between two successive events .

3) M/M/r/r Model (Erlang's Model) : As stated earlier , it was Erlang to introduce concept of Queue for Telephone systems. So in this model there are multiple number of servers but stil if any user is not getting access to any of server then it immediately leaves the system without waiting for service time delay and act as impatient customer.This happens with our regular land-line systems where if one gets the desired line busy , he/she has to leave that system for some time till that line gets vacant .

4) M/G/1 Model (Takacs' finite input Model) : Hereafter in advanced models, various cases are considered for instance if few server machines fail from time to time alternately, and if they are being repaired by some repairmen less than failed machines . In such cases, various factors affect working efficiency of Queue like time taken to repair a failed machine, exact difference between number of failed machines and repairmen etc.

now as here we can see Service time distribution is no longer Markovian but has changed from M to G , so this model considers number of failed machines more then one but being repaired by single repairman.So it iq quite obvious that service time for any user is going to increase in this case.

These are some of basic Queuing Models which are widely used and form minimal reference set for next advanced and complex Models. Generally in applications like Prefetch input queue, First model that is M/M/1 Model is popularly used because of limited use of Queue features. Since commonly, only single prefetch input queue is used, Single queue/Single server model is quite efficient in saving time by overlapping fetch phase and execution phase.

In this model in accordance with microprocessors, the user will be Execution Unit and server will be Bus Interface Unit.

Instruction Queue

The processor executes a program by fetching the instructions from memory and executing them. Usually the processor execution speed is much faster than the memory access speed. Instruction queue is used to prefetch the next instructions in a separate buffer while the processor is executing the current instruction. A pipelined processor may process each instruction in four steps,

                Fetch:    Read the instruction from memory.
                Decode:   Decode the source operand and fetch the source operands.
                Execute:  Execute the instruction.
                Write:    Store the result in memory.

With a four stage pipeline, the rate at which instructions are executed is four times that of sequential execution. ^[3]

The processor usually has two separate units for fetching the instructions and for executing the instructions. The 8086 CPU is divided into two independent functional parts, the bus interface unit and the execution unit.^[4] ^[5]

The implementation of a pipeline architecture is possible only if the bus interface unit and the execution unit are independent. While the execution unit is decoding or executing an instruction which does not require the use of the data and address buses, the bus interface unit fetches instruction opcodes from the memory.

This process is much faster than sending out an address, reading the opcode and then decoding and executing it. Fetching the next instruction while the current instruction is being decoded or executed is called Pipeline (computing)pipelining.

x86 example code

code_starts_here:
   mov eax, ahead
   mov [eax], 0x9090
ahead: 
   jmp near to_the_end
   ; Some other code
to_the_end:

This self-modifying program will overwrite the jmp to_the_end with two NOPs (which is encoded as 0x9090). The jump jmp near to_the_end is assembled into two bytes of machine code, so the two NOPs will just overwrite this jump and nothing else. (That is, the jump is replaced with a do-nothing-code.)

Because the machine code of the jump is already read into the PIQ, and probably also already executed by the processor (superscalar processors execute several instructions at once, but they "pretend" that they don't because of the need for backward compatibility), the change of the code will not have any change of the execution flow.

Example program to detect the size of the PIQ

This is an example NASM-syntax self-modifying x86-assembly algorithm that determines the size of the PIQ:

code_starts_here:
   xor cx, cx                  ; zero register cx
   xor ax, ax                  ; zero register ax

   mov dx, cs
   mov [code_segment], dx      ; "calculate" codeseg in the far jump below (edx here too)

around:
   cmp ax, 1                   ; check if ax has been alterd
   je found_size

   mov [nop_field+cx], 0x90    ; 0x90 = opcode "nop" (NO oPeration)
   inc cx

   db 0xEA                     ; 0xEA = opcode "far jump"
   dw flush_queue              ; should be followed by offset (rm = "dw", pm = "dd")
code_segment:
   dw 0                        ; and then the code segment (calculated above)
flush_queue:
  
   mov [nop_field+cx], 0x40    ; 0x40 = opcode "inc ax" (INCrease ax)

nop_field:
   nop times 256
   jmp around
found_size:
   ;
   ;    register cx now contains the size of the PIQ
   ;    this code is for real mode and 16-bit protected mode, but it could easily be changed into 
   ;    running for 32-bit protected mode as well. just change the "dw" for 
   ;    the offset to "dd". you need also change dx to edx at the top as
   ;    well. (dw and dx = 16 bit addressing, dd and edx = 32 bit addressing)
   ;

What this code does is basically that it changes the execution flow, and determines by brute force how large the PIQ is. "How far away do I have to change the code in front of me for it to affect me?" If it is too near (it is already in the PIQ) the update will not have any effect. If it is far enough, the change of the code will affect the program and the program has then found the size of the processor's PIQ. If this code is being executed under multitasking OS, the context switch may lead to the wrong value.

References

^ "ARM Information Center". ARM Technical Support Knowledge Articles.
^ "8086 Architecture". The 80x86 IBM PC and Compatible Computers (Vol 1 and Vol 2) Microcomputer Systems: The 8086/8088 Family.
^ Zaky, Safwat (1996). Computer Organization (Fourth ed.). McGraw-Hill. pp. 310–329. ISBN 0-07-114309-2. {{cite book}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
^ "Block diagram of 8086 CPU".
^ Hall, Douglas (2006). Microprocessors and Interfacing. Tata McGraw-Hill. p. 2.12. ISBN 0-07-060167-4.

External links

[1] "ARM Information Center". ARM Technical Support Knowledge Articles.

[2] "8086 Architecture". The 80x86 IBM PC and Compatible Computers (Vol 1 and Vol 2) Microcomputer Systems: The 8086/8088 Family.

[3] Zaky, Safwat (1996). Computer Organization (Fourth ed.). McGraw-Hill. pp. 310–329. ISBN 0-07-114309-2. {{cite book}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)

[4] "Block diagram of 8086 CPU".

[5] Hall, Douglas (2006). Microprocessors and Interfacing. Tata McGraw-Hill. p. 2.12. ISBN 0-07-060167-4.

[1]

[2]

[3]

[4]

[5]