INTRODUCTION
PROCESSOR MICROARCHITECTURE
Traditional approaches to processor design have focused on higher
clock speeds, instruction-level parallelism (ILP), and caches. Techniques to
achieve higher clock speeds involve pipelining the microarchitecture to finer
granularities, also called super-pipelining. Higher clock frequencies can
greatly improve performance by increasing the number of instructions that can
be executed each second. Because a super-pipelined microarchitecture has far
more instructions in flight, however, events that disrupt the pipeline, such as
cache misses, interrupts, and branch mispredictions, become more costly to
handle.
Figure 1 shows the relative increase in performance and the costs, such
as die size and power, over the last ten years on Intel processors. In order to
isolate the microarchitecture impact, this comparison assumes that the four
generations of processors are on the same silicon process technology and that
the speed-ups are normalized to the performance of an Intel486 processor.
Although we use Intel’s processor history in this example, other high-
performance processor manufacturers during this time period would have
similar trends. Due to microarchitecture advances alone, Intel processors have
improved integer performance five- or six-fold. Most integer applications have
limited ILP, and their instruction flow can be hard to predict.
Over the same period, the relative die size has gone up fifteen-fold, roughly
three times the rate of the gains in integer performance. Fortunately,
advances in silicon process technology allow more transistors to be packed
into a given amount of die area so that the actual measured die size of each
generation microarchitecture has not increased significantly.
The relative power increased almost eighteen-fold during this period.
Fortunately, a number of known techniques can significantly reduce processor
power consumption, and research in this area is ongoing. However, processor
power dissipation is already at the limit of what desktop platforms can easily
handle, so further performance improvements must be pursued in conjunction
with new techniques for controlling power.
THREAD-LEVEL PARALLELISM
HYPER-THREADING TECHNOLOGY ARCHITECTURE
A second goal was to ensure that when one logical processor is stalled
the other logical processor could continue to make forward progress. A logical
processor may be temporarily stalled for a variety of reasons, including
servicing cache misses, handling branch mispredictions, or waiting for the
results of previous instructions. Independent forward progress was ensured by
managing the buffering queues such that neither logical processor can use all
the entries when two active software threads are executing. This is
accomplished either by partitioning a queue between the two threads or by
limiting the number of active entries each thread can have.
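The intent of this scheme can be shown with a minimal sketch (Python, purely illustrative; the class, method names, and sizes are assumptions for this example, not details of the actual hardware): each logical processor is capped at a share of a queue's entries while two software threads are active, and regains the full capacity when only one thread is running.

from collections import deque

# Illustrative model only: a buffering queue that caps how many entries each
# logical processor (LP) may hold, so a stalled LP can never occupy every
# entry and block the other LP's forward progress.
class LimitedQueue:
    def __init__(self, total_entries, active_threads=2):
        self.total = total_entries
        self.active_threads = active_threads       # number of active software threads
        self.per_lp = (deque(), deque())            # entries currently held by LP0 and LP1

    def limit(self):
        # With two active threads each LP may use at most half the entries;
        # with one active thread the partitions are recombined.
        return self.total if self.active_threads == 1 else self.total // 2

    def try_allocate(self, lp, uop):
        in_use = sum(len(q) for q in self.per_lp)
        if len(self.per_lp[lp]) >= self.limit() or in_use >= self.total:
            return False                            # this LP must stall; the other can proceed
        self.per_lp[lp].append(uop)
        return True

    def release(self, lp):
        if self.per_lp[lp]:
            self.per_lp[lp].popleft()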
A third goal was to allow a processor running only one active software
thread to run at the same speed on a processor with Hyper-Threading
Technology as on a processor without this capability. This means that
partitioned resources should be recombined when only one software thread is
active. A high-level view of the microarchitecture pipeline is shown in Figure
4. As shown, buffering queues separate major pipeline logic blocks. The
buffering queues are either partitioned or duplicated to ensure independent
forward progress through each logic block.
In the following sections we will walk through the pipeline, discuss the
implementation of major functions, and detail several ways resources are
shared or replicated.
FRONT END
Trace cache (TC) entries are tagged with thread information and are
dynamically allocated as needed. The TC is 8-way set associative, and entries
are replaced according to a least-recently-used (LRU) algorithm computed over
the full 8 ways. The shared nature of the TC allows one logical processor to
have more entries than the other if needed.
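As a rough illustration of how a shared, thread-tagged, 8-way set-associative structure with true LRU replacement across all ways behaves, consider the following sketch (the class and its interface are invented for this example; the real TC holds decoded uops and much more state):

# Illustrative sketch: entries are tagged with the logical processor that
# allocated them, and replacement is least-recently-used across the full
# 8 ways of a set, so one LP may hold more ways of a set than the other.
class SetAssociativeTC:
    WAYS = 8

    def __init__(self, num_sets):
        # Each set is a list of (tag, lp) pairs ordered from LRU to MRU.
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, set_index, tag, lp):
        ways = self.sets[set_index]
        for i, (t, owner) in enumerate(ways):
            if t == tag and owner == lp:            # hit only on this LP's own entry
                ways.append(ways.pop(i))            # move the entry to the MRU position
                return True
        return False                                # TC miss

    def fill(self, set_index, tag, lp):
        ways = self.sets[set_index]
        if len(ways) == self.WAYS:
            ways.pop(0)                             # evict the LRU way, whichever LP owns it
        ways.append((tag, lp))                      # insert the new entry as MRU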
Microcode ROM
The ITLBs are duplicated. Each logical processor has its own ITLB and
its own set of instruction pointers to track the progress of instruction fetch for
the two logical processors. The instruction fetch logic in charge of sending
requests to the L2 cache arbitrates on a first-come, first-served basis, while
always reserving at least one request slot for each logical processor. In this
way, both logical processors can have fetches pending simultaneously.
Each logical processor has its own set of two 64-byte streaming buffers
to hold instruction bytes in preparation for the instruction decode stage. The
ITLBs and the streaming buffers are small structures, so the die size cost of
duplicating these structures is very low.
The decode logic takes instruction bytes from the streaming buffers and
decodes them into uops. When both threads are decoding instructions
simultaneously, the streaming buffers alternate between threads so that both
threads share the same decoder logic. The decode logic has to keep two copies
of all the state needed to decode IA-32 instructions for the two logical
processors even though it only decodes instructions for one logical processor
at a time. In general, several instructions are decoded for one logical
processor before switching to the other. The decision to switch between
logical processors at this coarser granularity was made to limit die size and
reduce complexity. Of course, if only one logical
processor needs the decode logic, the full decode bandwidth is dedicated to
that logical processor. The decoded instructions are written into the TC and
forwarded to the uop queue.
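The coarse-grained alternation between the two logical processors' streaming buffers can be sketched as follows (the batch size and the function itself are invented for illustration and are not taken from the actual decoder):

# Hedged sketch of the switching policy: decode several instructions for one
# logical processor before switching, and give a lone LP the full bandwidth.
def decode_stream(streaming_buffers, batch_size=4):
    """streaming_buffers: dict {0: [...], 1: [...]} of undecoded instructions."""
    current = 0
    while any(streaming_buffers.values()):
        if not streaming_buffers[current]:
            current ^= 1                            # only the other LP has work; switch to it
            continue
        # Decode up to batch_size instructions for the current LP before
        # considering a switch (coarse-grained alternation).
        for _ in range(min(batch_size, len(streaming_buffers[current]))):
            yield current, streaming_buffers[current].pop(0)
        if streaming_buffers[current ^ 1]:
            current ^= 1                            # switch only if the other LP has work waiting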
Uop Queue
After uops are fetched from the trace cache or the Microcode ROM, or
forwarded from the instruction decode logic, they are placed in a “uop queue.”
This queue decouples the Front End from the Out-of-order Execution Engine
in the pipeline flow. The uop queue is partitioned such that each logical
processor has half the entries. This partitioning allows both logical processors
to make independent forward progress regardless of front-end stalls (e.g., a
TC miss) or execution stalls.
OUT-OF-ORDER EXECUTION ENGINE

Allocator

If there are uops for both logical processors in the uop queue, the allocator
alternates between the logical processors every clock cycle when selecting
uops and assigning resources. If a logical processor has used its limit of a
needed resource, such as store buffer entries, the allocator signals a "stall"
for that logical processor and continues to assign resources to the other
logical processor. In addition, if the uop queue contains uops for only one
logical processor, the allocator tries to assign resources to that logical
processor every cycle to optimize allocation bandwidth, though the resource
limits are still enforced.
By limiting the maximum resource usage of key buffers, the machine
helps enforce fairness and prevents deadlocks.
The out-of-order execution engine has several buffers to perform its re-
ordering, tracing, and sequencing operations. The allocator logic takes uops
from the uop queue and allocates many of the key machine buffers needed to
execute each uop, including the 126 re-order buffer entries, 128 integer and
128 floating-point physical registers, 48 load and 24 store buffer entries. Some
of these key buffers are partitioned such that each logical processor can use at
most half the entries.
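The allocation policy can be summarized in a small model. The buffer sizes below come from the text; the class structure, method names, and round-robin bookkeeping are assumptions made for illustration, and for simplicity every resource is halved between the logical processors even though the text says only some buffers are partitioned this way.

# Buffer sizes from the text; the policy model itself is illustrative only.
ROB_ENTRIES, LOAD_BUFFERS, STORE_BUFFERS = 126, 48, 24

class AllocatorModel:
    def __init__(self, active_threads=2):
        share = 1 if active_threads == 1 else 2     # recombine resources for a lone thread
        self.limits = {"rob": ROB_ENTRIES // share,
                       "load": LOAD_BUFFERS // share,
                       "store": STORE_BUFFERS // share}
        self.in_use = [dict.fromkeys(self.limits, 0) for _ in range(2)]
        self.last_lp = 1                            # so LP0 is considered first

    def allocate_one(self, uop_queue):
        """uop_queue: dict {0: [...], 1: [...]} of (resource, uop) awaiting allocation."""
        lps = [lp for lp in (0, 1) if uop_queue[lp]]
        if not lps:
            return None
        # Prefer the LP that did not allocate last, which alternates every
        # clock when both LPs have uops; a lone LP is served every cycle.
        for lp in sorted(lps, key=lambda lp: lp == self.last_lp):
            resource, uop = uop_queue[lp][0]
            if self.in_use[lp][resource] >= self.limits[resource]:
                continue                            # signal a "stall" for this LP, serve the other
            uop_queue[lp].pop(0)
            self.in_use[lp][resource] += 1
            self.last_lp = lp
            return lp, uop
        return None                                 # both logical processors are stalled this cycle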
Register Rename
Since each logical processor must maintain and track its own complete
architecture state, there are two register alias tables (RATs), one for each
logical processor. The
register renaming process is done in parallel to the allocator logic described
above, so the register rename logic works on the same uops to which the
allocator is assigning resources.
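A toy model of this arrangement is sketched below, assuming one register alias table per logical processor and a single shared free list of physical registers; the register names and the API are invented for the example:

# Illustrative only: two RATs index into one shared physical register pool,
# so renamed uops from the two logical processors never collide.
class RenameStage:
    def __init__(self, num_physical=128):
        self.free_list = list(range(num_physical))  # shared pool of physical registers
        self.rat = [dict(), dict()]                 # one RAT per logical processor

    def rename(self, lp, dest, sources):
        # Sources read this LP's current mapping (falling back to the
        # architectural name if never renamed here); the destination gets a
        # fresh physical register from the shared free list.
        src_phys = [self.rat[lp].get(s, s) for s in sources]
        new_phys = self.free_list.pop(0)
        self.rat[lp][dest] = new_phys
        return new_phys, src_phys

renamer = RenameStage()
print(renamer.rename(0, "eax", ["ebx"]))            # LP0's eax maps to physical register 0
print(renamer.rename(1, "eax", ["ecx"]))            # LP1's eax maps to a different register, 1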
Instruction Scheduling
Each scheduler has its own scheduler queue of eight to twelve entries
from which it selects uops to send to the execution units. The schedulers
choose uops regardless of whether they belong to one logical processor or the
other. The schedulers are effectively oblivious to logical processor
distinctions. The uops are simply evaluated based on dependent inputs and
availability of execution resources. For example, the schedulers could dispatch
two uops from one logical processor and two uops from the other logical
processor in the same clock cycle. To avoid deadlock and ensure fairness,
there is a limit on the number of active entries that a logical processor can
have in each scheduler’s queue. This limit is dependent on the size of the
scheduler queue.
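The following sketch illustrates a scheduler queue that enforces a per-logical-processor cap on active entries while dispatching purely by readiness; the queue size, cap, and dispatch width are invented values in the range the text describes:

# Illustrative model: insertion enforces a per-LP cap so one logical processor
# cannot fill the queue, while dispatch ignores which LP a uop belongs to.
class SchedulerQueue:
    def __init__(self, size=12, per_lp_limit=8):
        self.size = size
        self.per_lp_limit = per_lp_limit
        self.entries = []                           # each entry is [lp, uop, ready]

    def insert(self, lp, uop, ready=False):
        lp_count = sum(1 for e in self.entries if e[0] == lp)
        if len(self.entries) >= self.size or lp_count >= self.per_lp_limit:
            return False                            # cap reached; this LP cannot add entries now
        self.entries.append([lp, uop, ready])
        return True

    def dispatch(self, width=2):
        # Select up to `width` ready uops purely by readiness, oblivious to
        # which logical processor each one came from.
        chosen = [e for e in self.entries if e[2]][:width]
        for e in chosen:
            self.entries.remove(e)
        return [(lp, uop) for lp, uop, _ in chosen]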
Execution Units
The execution core and memory hierarchy are also largely oblivious to
logical processors. Since the source and destination registers were renamed
earlier to physical registers in a shared physical register pool, uops merely
access the physical register file to get their source operands, and they write
results back to the physical register file. Comparing physical register numbers
enables
the forwarding logic to forward results to other executing uops without having
to understand logical processors.
After execution, the uops are placed in the re-order buffer. The re-order
buffer decouples the execution stage from the retirement stage. The re-order
buffer is partitioned such that each logical processor can use half the entries.
Retirement
Once stores have retired, the store data needs to be written into the
level-one data cache. Selection logic alternates between the two logical
processors to commit store data to the cache.
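A minimal sketch of this alternating commit selection is shown below; the function and the write_to_l1_cache callback are hypothetical stand-ins for the actual selection logic:

# Illustrative only: alternate between the two logical processors when
# committing retired store data to the first-level data cache.
def commit_stores(retired_stores, write_to_l1_cache):
    """retired_stores: dict {0: [...], 1: [...]} of (address, data) pairs."""
    lp = 0
    while retired_stores[0] or retired_stores[1]:
        if retired_stores[lp]:
            address, data = retired_stores[lp].pop(0)
            write_to_l1_cache(address, data)        # commit one store on this LP's turn
        lp ^= 1                                     # alternate logical processors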
MEMORY SUBSYSTEM
DTLB
The L2 and L3 caches are 8-way set associative with 128-byte lines and are
physically addressed. Both logical processors can share all entries in all
three levels of cache, regardless of which logical processor's uops initially
brought the data into the cache.
Because the logical processors share the cache, there is the potential for
cache conflicts, which can result in lower observed performance. However,
there is also the potential for beneficial sharing of data in the cache. For
example, one logical processor may prefetch instructions or data needed by the
other into the cache; this is common in server application code. In a
producer-consumer usage model, one logical processor may produce data that the
other logical processor wants to use. In such cases, there is the potential for
significant performance benefits.
BUS
PERFORMANCE
The Intel Xeon processor family delivers the highest server system
performance of any IA-32 Intel architecture processor introduced to date.
Initial benchmark tests show up to a 65% performance increase on high-end
server applications when compared to the previous-generation Pentium® III
Xeon™ processor on 4-way server platforms. A significant portion of those
gains can be attributed to Hyper-Threading Technology.
All the performance results quoted above are normalized to ensure that
readers focus on the relative performance and not the absolute performance.
CONCLUSION
REFERENCES
ABSTRACT
ACKNOWLEDGEMENT
CONTENTS
1. Introduction
2. Processor Microarchitecture
3. Hyper-Threading Technology Architecture
4. Benefits of Hyper-Threading Technology
5. First implementation on Intel-Xeon processor family
6. Front End
7. Out-Of-Order Execution Engine
8. Memory Subsystem
9. Bus
10. Single-Task and Multi-Task Modes
11. Operating Systems and Applications
12. Performance
13. Conclusion
14. References