AMD Micro Architecture
Jack Huynh
ADVANCED MICRO DEVICES, INC.
One AMD Place
Sunnyvale, CA 94088
Founded in 1969, AMD has shipped more than 190 million PC processors
worldwide. AMD processors are the power behind desktop and mobile PCs, and a new
generation of servers and workstations. Since its introduction in 1999, the award-winning
AMD Athlon™ processor has been known as an industry leader, enabling one of the
highest system performance levels in the PC market. Since its launch in June 2001, the
AMD Athlon MP processor and computer systems based on the AMD Athlon MP
processor have won numerous awards worldwide. In all, the AMD Athlon processor
family and systems based on those processors have won more than 100 awards
worldwide, including PC World’s World Class Award for overall Product of the Year in
2000 and 2002. The AMD Athlon processor family has provided industry-leading
processing power, paving the way to new levels of end-user capability in application
areas ranging from productivity to compute-intensive workstation applications, including
digital content creation and computer-aided design. For the server market, the AMD Athlon MP
processor has also provided the reliability, stability, and performance needed for mission-
critical email, exchange, file, print, and networking applications.
With the increased frequency scalability resulting from 0.13 micron technology
combined with QuantiSpeed architecture and Smart MP technology, AMD continues to
deliver compelling solutions for compute-intensive applications for workstations and
servers, and delivers superb integer, floating point, and 3D multimedia performance for
applications running on x86-based platforms.
With the AMD Athlon MP processor, AMD offers Smart MP technology – a new
multiprocessing architecture enabling exceptionally fast performance and excellent
scalability beyond traditional multiprocessor system architectures. AMD’s innovative
Smart MP technology is designed to optimize the execution of multi-threaded
per clock to drive frequency improvements, deeper pipelines alone translate into less
work per clock cycle. This reduced work per clock cycle, that is, reduced instructions
per clock (IPC), can only be offset by improvements in other areas, such as branch
prediction and cache hit rates.
Taken to the extreme, processor performance can actually be reduced by forcing
frequency improvements at the expense of IPC improvements.
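The trade-off above follows from the common first-order model in which delivered performance is the product of IPC and clock frequency. The sketch below is illustrative only: the function name and the IPC and frequency figures are hypothetical and do not describe any specific processor.

```python
def instructions_per_second(ipc: float, frequency_hz: float) -> float:
    """First-order model: delivered throughput = IPC x clock frequency."""
    return ipc * frequency_hz


# Hypothetical designs, chosen only to illustrate the trade-off.
shallow = instructions_per_second(ipc=1.5, frequency_hz=1.6e9)  # shorter pipeline
deep = instructions_per_second(ipc=1.0, frequency_hz=2.2e9)     # deeper pipeline

# The deeper design runs at a 37.5% higher clock, yet delivers less work
# per second because its IPC fell by a third.
print(shallow > deep)
```

In this model a frequency gain helps only if it is larger than the relative IPC loss that came with it.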
This key point can be illustrated by office applications, which tend to be branch-
intensive. Deeper pipelines pay a much greater performance penalty each time they
must be flushed, so these applications perform worse on them. As reaffirmed in the
Desktop Performance and Optimization for Intel Pentium® 4 Processor paper
(http://developer.intel.com/design/pentium4/papers/249438.htm), “Integer and basic
office productivity applications, such as Word and spreadsheet processing, tend to have
many branches in the code, thus reducing overall IPC capabilities. As a result, the
associated branch penalties and performance on these applications does not generally
scale as well with frequency and are more resistant to improvements in micro-
architectural means, such as deeper pipelines.”
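The effect of branch penalties can be approximated with a standard textbook model in which every mispredicted branch adds a fixed flush penalty to the average cycles per instruction (CPI, the inverse of IPC). This is a sketch under stated assumptions: the function names, branch fraction, misprediction rate, and penalty values are all hypothetical and are not taken from either vendor's documentation.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


def throughput(frequency_hz, cpi):
    """Instructions per second for a given clock and effective CPI."""
    return frequency_hz / cpi


# Hypothetical branch-heavy workload: 20% branches, 10% of them mispredicted.
short_pipe_cpi = effective_cpi(1.0, 0.20, 0.10, flush_penalty_cycles=10)
deep_pipe_cpi = effective_cpi(1.0, 0.20, 0.10, flush_penalty_cycles=20)

# Doubling the flush penalty raises the effective CPI from about 1.2 to about 1.4,
# so the deeper pipeline needs a clock roughly 17% faster just to break even.
break_even_clock_ratio = deep_pipe_cpi / short_pipe_cpi
print(round(break_even_clock_ratio, 3))
```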
[Figure: processor block diagram. Recoverable labels: 2-way, 64KB Instruction Cache; Predecode Cache; Branch Prediction Table; 24-entry L1 TLB/256-entry L2 TLB; Fetch/Decode Control; 3-Way x86 Instruction Decoders.]
on-chip cache architecture includes a dual-ported 128K (two separate 64K) split-L1
cache with separate snoop ports, and an integrated full-speed, 16-way set-associative,
256K L2 cache using a 72-bit (64-bit data + 8-bit ECC) interface. The AMD Athlon MP
processor’s large integrated full-speed L1 cache comprises two separate 64K, two-way
set-associative data and instruction caches, which together are much larger than the Intel
Xeon processor’s L1 cache (128K vs. 8K + 12K µop). Because the L1 cache is larger,
applications running on the AMD Athlon MP processor perform exceptionally fast, since
more instruction and data information is local to the processor. Applications exploit the
larger caches through improved instruction and data locality. The data cache also has
eight banks to provide maximum parallelism for running multiple applications. It supports
concurrent accesses by two 64-bit loads or stores. The instruction cache contains
predecode data to assist multiple, high-performance instruction decoders. Both
instruction and data caches are dual-ported and contain dedicated snoop ports that keep
system coherency traffic, common in systems with many devices, from interfering with
application performance.
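The cache sizes and associativities quoted above determine how an address is split into tag, set index, and byte offset. The following sketch works through that arithmetic. The 64-byte line size is an assumption (the text does not state it), and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int
    ways: int
    line_bytes: int  # ASSUMPTION: line size is not given in the text

    @property
    def num_sets(self) -> int:
        # Each set holds `ways` lines, so sets = size / (ways * line size).
        return self.size_bytes // (self.ways * self.line_bytes)

    def split_address(self, address: int):
        """Return (tag, set_index, byte_offset) for an address."""
        offset = address % self.line_bytes
        set_index = (address // self.line_bytes) % self.num_sets
        tag = address // (self.line_bytes * self.num_sets)
        return tag, set_index, offset


LINE_BYTES = 64  # assumed value, for illustration only
l1_data = CacheGeometry(size_bytes=64 * 1024, ways=2, line_bytes=LINE_BYTES)
l2 = CacheGeometry(size_bytes=256 * 1024, ways=16, line_bytes=LINE_BYTES)

print(l1_data.num_sets, l2.num_sets)
print(l1_data.split_address(0x12345))
```

Under the assumed line size, two addresses 32 KB apart fall into the same L1 set, and the two-way associativity lets both lines stay resident at once.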
The AMD Athlon MP processor cache architecture also supports error correction
code (ECC) protection. With these cache architecture features, the AMD Athlon MP
processor is designed to provide reliable high-performance computing.
The AMD Athlon MP processor offers one of the most powerful, architecturally
advanced floating point units (FPU) delivered in an x86 microprocessor. The
AMD Athlon MP processor’s three-issue, superscalar floating point capability is based on
three pipelined, out-of-order floating point execution units, each with a one-cycle
throughput. Using a data format and single-instruction multiple-data (SIMD) operations
based on the MMX™ instruction model, the AMD Athlon MP processor can deliver as
many as four 32-bit, single-precision floating point results per clock cycle.
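A minimal sketch of the packed data format follows. It rests on two assumptions that the text does not spell out: a 64-bit MMX-style register holds two 32-bit single-precision values, and one packed add and one packed multiply can complete in the same clock. Under those assumptions, two lanes times two operations gives the four results per clock quoted above. The helper names are hypothetical, and Python's struct module only emulates the register layout.

```python
import struct


def pack_pair(a: float, b: float) -> bytes:
    """Pack two single-precision floats into one 64-bit register image."""
    return struct.pack("<2f", a, b)


def packed_add(x: bytes, y: bytes) -> bytes:
    """Lane-wise add of two packed registers (one SIMD instruction)."""
    x0, x1 = struct.unpack("<2f", x)
    y0, y1 = struct.unpack("<2f", y)
    return struct.pack("<2f", x0 + y0, x1 + y1)


def packed_mul(x: bytes, y: bytes) -> bytes:
    """Lane-wise multiply of two packed registers (one SIMD instruction)."""
    x0, x1 = struct.unpack("<2f", x)
    y0, y1 = struct.unpack("<2f", y)
    return struct.pack("<2f", x0 * y0, x1 * y1)


LANES_PER_REGISTER = 2    # assumption: two 32-bit floats per 64-bit register
PACKED_OPS_PER_CLOCK = 2  # assumption: one packed add plus one packed multiply
results_per_clock = LANES_PER_REGISTER * PACKED_OPS_PER_CLOCK

print(struct.unpack("<2f", packed_add(pack_pair(1.0, 2.0), pack_pair(3.0, 4.0))))
print(results_per_clock)
```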
FPU Microarchitecture
Three separate execution units in the AMD Athlon MP processor’s floating point
pipeline support x87 floating point instructions, MMX instructions, and 3DNow!
Professional technology instructions. The three execution units are:
3. Fmul – This is the multiplier pipeline that contains an MMX ALU, MMX
multiplier, reciprocal unit, FP/3DNow! Professional technology
instruction multiplier, and support for FDIV instructions.
two execution units, one for both Fadd and Fmul and one for Fstore. Thus, as an example,
the AMD Athlon MP processor can do one floating point addition AND one
multiplication per clock cycle, while the Xeon processor can only do one multiplication
OR one addition per clock cycle. The seventh-generation FPU of the AMD Athlon MP
processor incorporates other features such as a 36-entry instruction scheduler and an 88-
entry register file for independent, superscalar, out-of-order, speculative execution of
floating point instructions. With three separate execution units, the AMD Athlon MP
processor’s superscalar FPU can boost the performance of floating point-intensive
applications varying from commercial applications such as 3D modeling and CAD to
consumer applications such as digital video and audio editing for workstations.
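The difference between issuing an addition and a multiplication in the same clock and issuing only one or the other can be seen in an idealized cycle count. The sketch below assumes fully independent operations and ignores latency, scheduling limits, and memory stalls. The function names are hypothetical.

```python
def cycles_separate_pipes(num_adds: int, num_muls: int) -> int:
    """One add pipe and one multiply pipe, each accepting one operation per clock."""
    return max(num_adds, num_muls)


def cycles_shared_pipe(num_adds: int, num_muls: int) -> int:
    """A single pipe that accepts either one add or one multiply per clock."""
    return num_adds + num_muls


# A balanced mix, such as a dot product, has one multiply per add.
adds, muls = 1000, 1000
print(cycles_separate_pipes(adds, muls))  # add and multiply overlap
print(cycles_shared_pipe(adds, muls))     # add and multiply take turns
```

The advantage is largest for balanced mixes. A workload made up only of additions would take the same number of cycles in both models.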
Table 1: AMD Processor Support of SIMD Instruction Extensions to the x86 Instruction Set Architecture.
AMD Processor | 3DNow!™ technology version supported | Description of instructions supported
AMD-K6®-2 Processor | 3DNow! technology | Original 3DNow! technology extensions
AMD Athlon™ Processor | Enhanced 3DNow! technology | 3DNow! technology plus 19 MMX™ extensions (part of SSE) plus five DSP / communications extensions
AMD Athlon MP Processor | 3DNow! Professional technology | Enhanced 3DNow! technology plus 51 SSE extensions (completing SSE support)
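The support levels in Table 1 correspond to processor feature flags that software can test at run time when choosing a code path. The sketch below decodes them from CPUID results. The bit positions are the documented EDX feature bits (SSE in function 0x00000001; 3DNow! and its extensions in function 0x80000001). Python cannot execute CPUID directly, so the register values are passed in and the sample values are hypothetical. The function name and the returned labels are illustrative, and real code would also check the vendor string before trusting the extended bits.

```python
# EDX feature bits as documented for the x86 CPUID instruction.
STD_EDX_SSE = 1 << 25        # CPUID function 0x00000001
EXT_EDX_3DNOW_EXT = 1 << 30  # CPUID function 0x80000001
EXT_EDX_3DNOW = 1 << 31      # CPUID function 0x80000001


def simd_support_level(std_edx: int, ext_edx: int) -> str:
    """Map raw CPUID EDX values to the support levels named in Table 1."""
    has_sse = bool(std_edx & STD_EDX_SSE)
    has_3dnow = bool(ext_edx & EXT_EDX_3DNOW)
    has_3dnow_ext = bool(ext_edx & EXT_EDX_3DNOW_EXT)

    if has_3dnow and has_3dnow_ext and has_sse:
        return "3DNow! Professional"
    if has_3dnow and has_3dnow_ext:
        return "Enhanced 3DNow!"
    if has_3dnow:
        return "3DNow!"
    if has_sse:
        return "SSE only"
    return "none"


# Hypothetical register contents, for illustration only.
print(simd_support_level(STD_EDX_SSE, EXT_EDX_3DNOW | EXT_EDX_3DNOW_EXT))
print(simd_support_level(0, EXT_EDX_3DNOW))
```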
Many current software applications that are SIMD-optimized use different code
paths to benefit from 3DNow! technology or SSE, depending on the processor
architecture on which these applications are executed. AMD processor architectures
preceding the AMD Athlon MP processor supported only 3DNow! or Enhanced 3DNow!
technology, which yielded the following three code path scenarios for developers:
Not only is the AMD Athlon MP processor designed to benefit from existing
software applications supporting 3DNow! and SSE technologies, but in the future,
software developers will have the ability to utilize the strength of both 3DNow! and SSE
technology when optimizing code paths for AMD processor architectures that support
3DNow! Professional technology. The AMD Athlon MP processor enables this advanced
To reduce the incidence of TLB entry conflicts, the L1 and L2 TLB structures
adopt an exclusive architecture design. With an exclusive TLB architecture, the L1 TLBs
can contain entries that are not duplicated in the L2 TLBs, enabling the combination of
L1 TLB and L2 TLB sizes for a larger total available entry space on both the instruction
and data TLBs. Because more TLB entries are held within the processor, fewer conflicts
occur, and performance increases on high-end, data-intensive applications, whose
instruction sequences no longer have to wait for TLB entries to be reloaded during
execution.
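The capacity benefit of an exclusive design can be shown with a small model in which a translation lives in either L1 or L2 but never both. The text states only the exclusivity, so the LRU replacement, the promotion of L2 hits into L1, and the demotion of L1 victims into L2 are all assumptions here. The class and method names are hypothetical, and the 24-entry and 256-entry sizes are the L1 and L2 TLB sizes quoted earlier in this paper.

```python
from collections import OrderedDict


class ExclusiveTLB:
    """Two-level TLB in which an entry lives in L1 or in L2, never in both."""

    def __init__(self, l1_entries: int, l2_entries: int):
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries
        self.l1 = OrderedDict()  # virtual page -> physical frame, oldest first
        self.l2 = OrderedDict()

    def lookup(self, page, walk_page_table):
        """Translate a page. Returns (frame, where) with where in L1/L2/miss."""
        if page in self.l1:
            self.l1.move_to_end(page)  # refresh LRU position
            return self.l1[page], "L1"
        if page in self.l2:
            frame = self.l2.pop(page)  # promote: remove from L2 first
            where = "L2"
        else:
            frame = walk_page_table(page)  # full miss: reload the entry
            where = "miss"
        self._insert_l1(page, frame)
        return frame, where

    def _insert_l1(self, page, frame):
        if len(self.l1) >= self.l1_entries:
            victim_page, victim_frame = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_entries:
                self.l2.popitem(last=False)  # drop the oldest L2 entry
            self.l2[victim_page] = victim_frame  # L1 victim moves to L2
        self.l1[page] = frame

    def distinct_entries(self) -> int:
        return len(self.l1) + len(self.l2)


tlb = ExclusiveTLB(l1_entries=24, l2_entries=256)
for page in range(280):
    tlb.lookup(page, walk_page_table=lambda p: p + 1000)  # toy page table
print(tlb.distinct_entries())
```

Because no entry is duplicated, the model holds 24 + 256 = 280 distinct translations. An inclusive design of the same sizes would hold at most 256.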
The TLB structures of the AMD Athlon MP processor can also fill data TLB misses
speculatively. The AMD Athlon MP processor allows TLB entries to be written
speculatively before the first instruction is completed while preserving proper
instruction execution ordering, which removes the serialization effect and
results in improved system performance.
With these key differentiating features of the new AMD Athlon MP processor
with QuantiSpeed architecture…
Smart MP Technology:
• Dual point-to-point high-speed system buses – Allows two processors to
run independently without the overhead of sharing a common system bus
• Innovative bus-snooping capability – Offers high speed communication
between processors in a multiprocessing system
• Optimized MOESI cache-coherency protocol – Reduces memory traffic
and allows faster access to cached data
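The way a MOESI protocol reduces memory traffic can be seen in its state transitions. The sketch below follows the textbook description of the five states (Modified, Owned, Exclusive, Shared, Invalid) and is not a statement of the processor's exact implementation. The function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snoop_read(state: State) -> State:
    """Another processor reads a line that this cache may hold."""
    if state is State.MODIFIED:
        return State.OWNED   # keep the dirty data and supply it cache-to-cache
    if state is State.EXCLUSIVE:
        return State.SHARED  # clean copy is now held by more than one cache
    return state             # O, S and I are unchanged by a remote read


def on_snoop_write(state: State) -> State:
    """Another processor takes write ownership: any local copy is invalidated."""
    return State.INVALID


def on_local_write(state: State) -> State:
    """This processor writes the line, after other copies have been invalidated."""
    return State.MODIFIED


def memory_is_stale(state: State) -> bool:
    """In M and O the cache, not main memory, holds the current data."""
    return state in (State.MODIFIED, State.OWNED)


print(on_snoop_read(State.MODIFIED), memory_is_stale(State.OWNED))
```

The Owned state is where the saving comes from: a modified line can be shared with another processor directly, without first being written back to main memory.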
QuantiSpeed Architecture:
• Nine-issue, superscalar, fully pipelined microarchitecture – Provides a
wide execution bandwidth to improve overall productivity
• Superscalar, fully pipelined FPU – Increases performance of floating
point-intensive applications while offering 3DNow! Professional
technology support
• Hardware data prefetch – Increases performance on high-end software
applications using high-bandwidth system capability, especially with DDR
memory
• TLB enhancements – Increase performance of high-end, data-intensive
applications
With compelling performance across these and a number of other applications, the
AMD Athlon MP processor implemented on 0.13 micron technology and featuring Smart
MP technology continues to increase the performance scalability provided by
QuantiSpeed architecture by delivering higher clock speeds and excellent processor
performance at lower thermal power than previous generations. The AMD Athlon MP
processor continues in the tradition of the AMD Athlon processor family by providing
compelling levels of delivered system performance for today’s and tomorrow’s
applications.
AMD Overview
AMD is a global supplier of integrated circuits for the personal and networked
computer and communications markets with manufacturing facilities in the United States,
Europe, and Asia. AMD produces microprocessors, flash memory devices, and support
circuitry for communications and networking applications. Founded in 1969 and based in
Sunnyvale, California, AMD (NYSE: AMD) had revenues of $3.9 billion in 2001.