Hyper-Threading technology
Abstract
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
Description of Hyper-Threading technology
The amazing growth of the Internet and telecommunications is powered by ever-faster systems that demand increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. The microarchitecture techniques used to achieve past performance improvements (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, and caches) have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.
Single-Chip Multiprocessor Hardware configurations
In our analysis of SMT and multiprocessing, we focus on a particular region of the MP design space: small-scale, single-chip, shared-memory multiprocessors. As chip densities increase, single-chip multiprocessing will be possible, and some architects have already begun to investigate this use of chip real estate [Olukotun et al. 1996]. An SMT processor and a small-scale, on-chip multiprocessor have many similarities: for example, both have large numbers of registers and functional units, on-chip caches, and the ability to issue multiple instructions each cycle. In this study, we keep these resources approximately similar for the SMT and MP comparisons, and in some cases we give a hardware advantage to the MP. We look at both two- and four-processor multiprocessors, partitioning the scheduling unit resources of the multiprocessor CPUs (the functional units, instruction queues, and renaming registers) differently for each case. In the two-processor MP (MP2), each processor receives half of the on-chip execution resources previously described, so that the total resources relative to an SMT are comparable (Table I). For a four-processor MP (MP4), each processor contains approximately one-fourth of the chip resources.
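The partitioning described above can be illustrated with a small sketch. The resource totals below are hypothetical placeholders chosen for easy arithmetic, not the values from Table I, and the names `SchedulingResources` and `partition` are introduced here only for illustration. An even integer split is also an assumption: the study says the split is approximate and that the MP is sometimes given a hardware advantage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingResources:
    """Scheduling-unit resources named in the text (counts are illustrative)."""
    functional_units: int
    instruction_queue_entries: int
    renaming_registers: int


def partition(total: SchedulingResources, processors: int) -> SchedulingResources:
    """Return the per-processor share when `total` is split evenly.

    Integer division is an assumption; the text only says the split is
    approximately one-half (MP2) or one-fourth (MP4) of the chip resources.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return SchedulingResources(
        functional_units=total.functional_units // processors,
        instruction_queue_entries=total.instruction_queue_entries // processors,
        renaming_registers=total.renaming_registers // processors,
    )


# Hypothetical chip-wide totals that a single SMT processor would own outright.
SMT_TOTAL = SchedulingResources(
    functional_units=8,
    instruction_queue_entries=64,
    renaming_registers=200,
)

MP2_PER_CPU = partition(SMT_TOTAL, 2)  # each of two CPUs gets half
MP4_PER_CPU = partition(SMT_TOTAL, 4)  # each of four CPUs gets one-fourth

if __name__ == "__main__":
    print("SMT:", SMT_TOTAL)
    print("MP2 per CPU:", MP2_PER_CPU)
    print("MP4 per CPU:", MP4_PER_CPU)
```

The point of the sketch is that the chip-wide total stays fixed across configurations: the SMT lets all threads share one pool, while each MP processor can use only its own slice.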
Thread Invocation
Like Multiscalar, both IMT variants invoke threads in program order: a thread predictor predicts the next thread from among the targets of the previous thread, which are specified by its thread descriptor. A descriptor cache (Figure 3) stores recently fetched thread descriptors. Although threads are invoked in program order, IMT may fetch instructions from later threads before it has fetched all instructions of earlier threads, thereby interleaving instructions from multiple threads. To decide which thread to fetch from, IMT consults the fetch policy.
Resource Allocation & Fetch Policy
The N-IMT processor uses an unmodified ICOUNT policy [13], which in every cycle fetches instructions from the thread with the fewest instructions in flight.
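The ICOUNT selection rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the N-IMT hardware: the function names, the tie-break toward the lowest thread id, the fixed fetch width, and the absence of instruction retirement are all simplifications introduced here.

```python
from typing import Dict, List


def icount_select(in_flight: Dict[int, int]) -> int:
    """Return the id of the thread with the fewest instructions in flight.

    Ties are broken toward the lowest thread id (an assumption; the text
    does not specify a tie-break rule).
    """
    if not in_flight:
        raise ValueError("no threads to choose from")
    return min(in_flight, key=lambda tid: (in_flight[tid], tid))


def simulate_fetch(in_flight: Dict[int, int], cycles: int, fetch_width: int) -> List[int]:
    """Apply ICOUNT for `cycles` cycles and return the thread chosen each cycle.

    Simplification: the chosen thread gains `fetch_width` in-flight
    instructions per cycle and nothing retires during the simulation.
    """
    counts = dict(in_flight)  # copy so the caller's dict is not modified
    chosen = []
    for _ in range(cycles):
        tid = icount_select(counts)
        counts[tid] += fetch_width
        chosen.append(tid)
    return chosen


if __name__ == "__main__":
    # Hypothetical in-flight instruction counts for three threads.
    start = {0: 12, 1: 5, 2: 9}
    print(simulate_fetch(start, cycles=3, fetch_width=4))
```

Because the policy always favors the thread with the fewest in-flight instructions, a thread that keeps accumulating unretired instructions is automatically deprioritized until the others catch up.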