Friday, January 18, 2008


Multithreading computers have hardware support to efficiently execute multiple threads.

Block Multi-threading
The simplest type of multi-threading is where one thread runs until it is blocked by an event that normally would create a long latency stall. Such a stall might be a cache-miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of ready-to-run threads.
For example:
Conceptually, it is similar to cooperative multi-tasking used in real-time operating systems in which tasks voluntarily give up execution time when they need to wait upon some type of event.

Cycle i  : instruction j from thread A is issued
Cycle i+1: instruction j+1 from thread A is issued
Cycle i+2: instruction j+2 from thread A is issued, load instruction which misses in all caches
Cycle i+3: thread scheduler invoked, switches to thread B
Cycle i+4: instruction k from thread B is issued
Cycle i+5: instruction k+1 from thread B is issued Concept
This type of multithreading is known as Block or Cooperative or Coarse-grained multithreading.

Terminology
The goal of multithreading hardware support is to allow quick switching between a blocked thread and another thread ready to run. To achieve this goal, the hardware cost is to replicate the program visible registers as well as some processor control registers (such as the program counter). Switching from one thread to another thread means the hardware switches from using one register set to another.
Such additional hardware has these benefit:
In order to switch efficiently between active threads, each active thread needs to have its own register set. For example, to quickly switch between two threads, the register hardware needs to be instantiated twice.

The thread switch can be done in one CPU cycle.
It appears to each thread that they are executing alone and not sharing any hardware resources with any other threads. This minimizes the amount of software changes needed within the application as well as the operating system to support multithreading. Hardware Cost

Many families of microcontrollers and embedded processors have multiple register banks to allow quick context switching for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads. Examples
See article: barrel processor

Interleaved Multi-threading
A higher performance type of multithreading is where the processor switches threads every CPU cycle. For example:
The purpose of this type of multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there's less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline.
Conceptually, it is similar to pre-exemptive multi-tasking used in operating systems. One can make the analogy that the time-slice given to each active thread is one CPU cycle.

Cycle i  : an instruction from thread A is issued
Cycle i+1: an instruction from thread B is issued
Cycle i+2: an instruction from thread C is issued Concept
This type of multithreading was first called Barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads. Interleaved or Pre-emptive or Fine-grained or time-sliced multithreading are more modern terminology.

Terminology
In addition to the hardware costs discussed in the Block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to larger to avoid thrashing between the different threads.

Hardware Costs

Denelcor Heterogeneous Element Processor
Intel Super-threading
Sun Microsystem Ultrasparc T1
Lexra NetVortex
MIPS 34K core which implements the Multi-Threaded ASE
Raza Microelectronics Inc XLR Examples
See main article Simultaneous multithreading

Simultaneous Multi-threading
The most advanced type of multi-threading applies to superscalar processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In Simultaneous Multi-threading (SMT), the superscalar processor issues one instruction from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction level parallelism, this type of multithreading is trying to exploit parallelism available across multiple threads.
For example:

Cycle i  : instruction j from thread A; instruction k from thread B both simultaneously issued
Cycle i+1: instruction k+1 from thread B; instruction m from thread C both simultaneously issued
Cycle i+2: instruction j+1 from thread A; instruction m+1 from thread C both simultaneously issued Concept
To distinguish the other flavors of multithreading from SMT, the term Temporal multithreading is used to denote when only one thread can be issued at a time.

Multithreading (computer hardware) Hardware Costs

Alpha AXP EV8 (uncompleted)
Intel Hyperthreading
IBM Power5
Power Processing Element within the Cell microprocessor
Sun Microsystem UltraSPARC T2