Multithreading (computer architecture)

Hardware techniques used to support multithreading often parallel the software techniques used for computer multitasking of computer programs.

Block Multi-threading

Concept

The simplest type of multi-threading is where one thread runs until it is blocked by an event that normally would create a long latency stall. Such a stall might be a cache-miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of ready-to-run threads.

Conceptually, it is similiar to cooperative multi-tasking used in RTOSes in which tasks voluntarily give up execution time when they need to wait upon some type of event.

Terminology

This type of multithreading is known as Block or Cooperative or Coarse-grained multithreading.

Hardware Cost

The goal of multithreading hardware support is to allow quick switching between a blocked thread and another thread ready to run. To support quick thread switching, the hardware cost is to replicate the program visible registers as well as some processor control registers (such as the program counter). Such additional hardware has these benefit:

the switch can be done in one CPU cycle
It appears to each thread that they are executing alone and not sharing any hardware resources with any

other threads. This minimizes the amount of software changes needed within the application as well as the operating system to support multithreading.

In order to switch efficiently between active threads, each active thread needs to have its own register set. For example, to quickly switch between two threads, the register hardware needs to be replicated twice.

Research Topics

For this type of multithreading, a major area of research is the thread scheduler which must quickly choose among the list of ready-to-run threads to execute next as well as maintain the read-to-run and stalled thread lists. An important topic are the different thread priority schemes that can be used by the scheduler. The thread scheduler might be implemented totally in software or totally in hardware or as a hw/sw combination.

Another area of research is what type of events should cause a thread switch - cache misses, inter-thread communication, DMA completion, etc.

Examples

MIPS architecture Multi-Threaded ASE

Interleaved Multi-threading

See article: barrel processor

Concept

A more complicated type of multithreading is where the processor switches threads every CPU cycle. For example:

in cycle i, it executes an instruction from thread A
in cycle i+1, it executes an instruction from thread B
in cycle i+2, it executes an instruction from thread C
and so on for every cycle.

The purpose of this type of multithreading is to remove all data dependency stalls from the execution pipeline.

Conceptually, it is similiar to pre-exemptive multi-tasking used in operating systems. One can make the analogy that the time-slice given to each active thread is one CPU cycle.

Terminology

This type of multithreading was first called Barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads. Interleaved or Pre-emptive or Fine-grained multithreading are more modern terminology.

Hardware Costs

The additional hardware cost of this type multithreading is that each pipeline stage has to maintain its own register set and control registers. For example, a CPU with 7 pipeline stages needs to have its register hardware replicated 7 times.

Examples

Denelcor Heterogeneous Element Processor
Intel Super-threading
Sun Microsystem Ultrasparc T1

Simultaneous Multi-threading

See main article Simultaneous multithreading

Concept

The most advanced type of multi-threading applies to superscalar processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In a Simultaneous Multi-threading (SMT), the superscalar processor issues one instruction from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction level parallelism, this type of multithreading is trying to exploit available across multiple threads.

Terminology

To distinguish the other flavors of multithreading from SMT, the term Temporal multithreading is used to denote when one only thread can execute at a time.

Hardware Costs

The additional hardware cost of this type of multithreading is that each pipeline stage now has to maintain multiple register sets, one set for each executing thread. For example, for a processor with a 7-stage pipeline that can issue 3 instructions per cycle, the register hardware needs to be replicated 21 times.