Itanium 2 Processor Microarchitecture
Cameron McNairy, Intel
Don Soltis, Hewlett-Packard
The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.
On 8 July 2002, Intel introduced the Itanium 2 processor, the Itanium architecture's second implementation. This event was a milestone in the cooperation between Intel and Hewlett-Packard to establish the Itanium architecture as a key workstation, server, and supercomputer building block. The Itanium 2 processor may appear similar to the Itanium processor, yet it represents significant advances in performance and scalability. (Sharangpani and Arora give an overview of the Itanium processor.1) These advances result from improvements in frequency, pipeline depth, pipeline control, branch prediction, cache design, and system interface. The microarchitecture design enables the processor to effectively address a wide variety of computation needs. Table 1 lists the processor's main features. We obtained the Spec FP2000 and Spec CPU2000 benchmark results from http://www.spec.org on 20 February 2002. We obtained the other benchmarks from http://developer.intel.com/products/server/processors/server/itanium2/index.htm. This site contains relevant information about the measurement circumstances.
Microarchitecture overview
Many aspects of the Itanium 2 processor microarchitecture result from opportunities and requirements associated with Intel's Itanium architecture (formerly called the IA-64 architecture).2 The architecture goes beyond simply defining 64-bit operations and register widths; it defines flexible memory management schemes and several tools that compilers can use to realize performance. It enables parallel instruction execution without resorting to complex out-of-order pipeline designs by explicitly indicating which instructions can issue in parallel without data hazards. To that end, three instructions are statically grouped into 16-byte bundles. Multiple instruction bundles can execute in parallel, or explicit stops can break parallel execution to avoid data hazards. Each bundle encodes a template that indicates which type of execution resource the instructions require: integer (I), memory (M), floating point (F), branch (B), and long extended (LX). Thus, memory, floating-point, and branch operations that can execute in parallel comprise a bundle with an MFB template. The Itanium 2 processor designers took advantage of explicit parallelism to design an in-order, six-instruction-issue, parallel-execution pipeline. The relatively simple pipeline allowed the design team to focus resources on the memory subsystem's performance and to exploit many of the architecture's performance opportunities. Figure 1 shows the
core pipeline and the relationship of some microarchitecture structures to the pipeline. These structures include the instruction buffer, which decouples the front end, where instruction fetch and branch prediction occur, from the back end, where instructions are dispersed and executed. The back-end pipeline renames virtual registers to physical registers, accesses the register files, executes the operation, checks for exceptions, and commits the results.
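To make the bundle format concrete, here is a minimal C sketch (not a production decoder) that extracts the 5-bit template and the three 41-bit instruction slots from a 128-bit bundle. The template names shown are a small subset of the IA-64 encodings; each even value shown has an odd counterpart that adds a stop at the end of the bundle.

```c
#include <stdint.h>

/* A 16-byte bundle: bits 0-4 hold the template, and three 41-bit
   instruction slots occupy bits 5-45, 46-86, and 87-127. */
typedef struct { uint64_t lo, hi; } Bundle;

static unsigned template_of(Bundle b) { return (unsigned)(b.lo & 0x1f); }

static uint64_t slot_of(Bundle b, int i) {
    /* Slot i starts at bit 5 + 41*i of the 128-bit bundle. */
    int start = 5 + 41 * i;
    if (start + 41 <= 64)            /* slot 0: entirely in lo */
        return (b.lo >> start) & ((1ULL << 41) - 1);
    if (start >= 64)                 /* slot 2: entirely in hi */
        return (b.hi >> (start - 64)) & ((1ULL << 41) - 1);
    /* slot 1 straddles the two 64-bit words */
    int low_bits = 64 - start;
    return ((b.lo >> start) | (b.hi << low_bits)) & ((1ULL << 41) - 1);
}

/* A few of the IA-64 template encodings (even values; the odd value
   of each pair adds an explicit stop after the bundle). */
static const char *template_name(unsigned t) {
    switch (t & ~1u) {
    case 0x00: return "MII"; case 0x08: return "MMI";
    case 0x0c: return "MFI"; case 0x0e: return "MMF";
    case 0x10: return "MIB"; case 0x1c: return "MFB";
    default:   return "other";
    }
}
```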
Table 1. Itanium 2 processor features and benchmark results.

Frequency:        1 GHz
Pipeline:         8 stages, in-order
Issue width:      6 instructions
Execution units:  2 integer, 4 memory, 3 branch, 2 floating-point
Process:          180 nm
Transistors:      40 million (CPU core), 180 million (L3 cache)
Die size:         421 mm2

Cache   Size        Latency             Protection
L1I     16 Kbytes   1 cycle             Parity
L1D     16 Kbytes   1 cycle             Parity
L2      256 Kbytes  5, 7, or 9+ cycles  Parity or ECC
L3      3 Mbytes    12+ cycles          ECC

Benchmark        Result
Spec CPU2000     810
Spec FP2000      1,431
TPC-C (32-way)   433,107 transactions per minute
Stream           3,700 Mbytes/s
Linpack 10K      13.94 Gflops
Instruction fetch
The front-end structures fetch instructions for later use by the back end. The front end chooses an instruction pointer (IP) from the next linear IP, branch prediction resteer pointers, or branch misprediction and instruction exception resteer pointers. The front end then presents the IP to the instruction cache and translation look-aside buffer (TLB). These structures are tightly coupled, allowing the processor to determine which cache way, if any, was a hit, and to deliver the cache contents in the next cycle using an innovation called prevalidated tags. This is the same idea presented in other Itanium 2 processor descriptions3 in the context of the first-level data (L1D) cache, but here we discuss it in the context of the instruction cache.
The removal of the physical address from the hit detection critical path is significant. It provides an opportunity for a single-cycle cache, but requires the TLB to be tightly coupled with the cache tags. Another implication is that a miss in the TLB also results in a cache miss, because no match lines will be driven. Moreover, the number of TLB entries determines the number of bits held in each way's tag and might limit the coupled TLB's size. Figure 2 shows how prevalidated tags tied to a 32-entry TLB determine a hit.
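The prevalidated-tag idea reduces hit detection to a bitwise AND: the TLB produces a one-hot match vector over its 32 entries, and each cache way stores, instead of a physical address, a vector naming the TLB entry that translated the line. A minimal C model of that logic (structure names are ours, not the hardware's):

```c
#include <stdint.h>
#include <stdbool.h>

enum { TLB_ENTRIES = 32, WAYS = 4 };

typedef struct {
    uint64_t vpn[TLB_ENTRIES];   /* virtual page numbers */
    bool     valid[TLB_ENTRIES];
} Tlb;

/* Each way's "tag" is a one-hot vector naming the TLB entry that
   translated the line's address when the line was filled. */
typedef struct {
    uint32_t prevalidated_tag[WAYS];
} CacheSet;

/* The TLB produces a one-hot match vector for the lookup VPN.
   No TLB match means no match line is driven, so no cache hit. */
static uint32_t tlb_match(const Tlb *t, uint64_t vpn) {
    uint32_t m = 0;
    for (int e = 0; e < TLB_ENTRIES; e++)
        if (t->valid[e] && t->vpn[e] == vpn)
            m |= 1u << e;
    return m;
}

/* Hit detection never touches a physical address: a way hits when
   its stored entry vector intersects the TLB match vector. */
static int hit_way(const CacheSet *s, uint32_t match) {
    for (int w = 0; w < WAYS; w++)
        if (s->prevalidated_tag[w] & match)
            return w;
    return -1;   /* miss, including the TLB-miss-implies-cache-miss case */
}
```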
Figure 1. Itanium 2 processor core pipeline. Front end: IPG (instruction pointer generation and fetch) and ROT (instruction rotation). Back end: EXP (instruction template decode, expand, and disperse), REN (rename, for register stack and rotating registers, and decode), REG (register file read), EXE (ALU execution), DET (exception detection), and WRB (write back); floating-point operations use pipe stages FP1 through FP4. An instruction buffer decouples the front end from the back end, and supporting structures include the branch prediction and pattern history logic, the L1I and L2 caches and TLBs, the instruction-streaming buffer, the L1D cache, the 32-entry ALAT (advanced-load address table), and the IA-32 engine.
The L1I TLB and the L1I cache are arranged as required for a prevalidated-tag design. The four-way set-associative L1I cache is 16 Kbytes in size, relatively small because of latency and area design constraints but still optimal. An instruction prefetch engine enhances the cache's effective size. The dual-ported tags and TLB resolve demand and prefetch requests without conflict. The page offset of the virtual-address bits selects a set from the tag array and the data array for demand accesses. The upper bits of the virtual address determine which, if any, way holds the requested instructions. The tag and TLB lookup results determine an L1I hit or miss, as described earlier.
The 64-byte L1I cache line holds four instruction bundles. The L1I can sustain a stream of one 32-byte read per cycle to provide two bundles per cycle to the back-end pipeline. The fetched bundles go directly to the dispersal logic or into an instruction buffer for later consumption. If the instruction buffer is full, the front-end pipeline stalls. The L1I TLB directly supports only a 4-Kbyte page size. The L1I TLB indirectly supports larger page sizes by allocating additional entries as each 4-Kbyte segment of the larger page is referenced. An L1I TLB miss implies a miss in the L1I cache and can initiate L2I TLB and second-level (L2) cache accesses, as well as
a transfer of page information to the L1I TLB. The L2I TLB is a 128-entry, fully associative structure with a single port. Each entry can represent all page sizes defined in the architecture, from 4 Kbytes to 4 Gbytes. Up to 64 entries can be pinned as translation registers to ensure that hot pages are always available. In the event of an L2I TLB miss, the L2I TLB requests the hardware page walker (HPW) to fetch a translation from the virtual hashed page table. If a translation is available, the HPW inserts it into the L2I TLB. If a translation is not available or the HPW aborts, an exception occurs and the operating system assumes control to establish a mapping for the reference.
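The control flow of the hardware page walk can be sketched as follows. The hash and entry format here are stand-ins; the architecture's actual virtual hashed page table defines its own short and long entry formats.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: mirrors the control flow in the text, not the
   architected VHPT formats. */
typedef struct { uint64_t tag; uint64_t translation; int valid; } VhptEntry;

typedef enum { WALK_INSERTED, WALK_FAULT } WalkResult;

static WalkResult hpw_walk(const VhptEntry *vhpt, size_t entries,
                           uint64_t va, uint64_t region_id,
                           void (*tlb_insert)(uint64_t translation)) {
    uint64_t h = (va >> 12) ^ region_id;          /* stand-in hash */
    const VhptEntry *e = &vhpt[h % entries];
    if (e->valid && e->tag == ((va >> 12) | (region_id << 52))) {
        tlb_insert(e->translation);               /* HPW fills the L2I TLB */
        return WALK_INSERTED;
    }
    /* Missing translation or aborted walk: raise a TLB-miss fault so
       the operating system can establish the mapping. */
    return WALK_FAULT;
}
```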
Instruction-streaming buffer
The instruction-streaming buffer (ISB) augments the instruction cache. The ISB holds eight L1I cache lines of instructions returned from the L2 or higher cache levels. It also stores virtual addresses that are scanned by the ISB hit detection logic for each IP presented to the L1I cache. An ISB hit has the same one-cycle latency as a normal L1I cache hit. Instructions typically spend little time in the ISB because the L1I cache can usually support reads and fills in the same cycle. The ISB enables branch prediction, instruction demand accesses, and instruction prefetch accesses to occur without conflict.
Figure 2. Prevalidated cache tags tied to the TLB determine a hit. The presented virtual address matches TLB entry 3. The TLB drives a match line indicating the match to the hit comparator, which reads and compares the ways' tags against this match line. The tag in way 2 matches the match line, so way 2 is reported as a hit.
Instruction prefetching
Software can engage the instruction prefetch engine to reduce the instruction cache miss count and the associated penalty. The architecture defines hint instructions that provide the hardware early information about a future branch. In the Itanium 2 processor, these instructions direct the instruction prefetch engine to prefetch one or many L2 cache lines. The virtual address of the desired instructions allocates into the eight-entry prefetch virtual address buffer. Addresses from this buffer access the L1I TLB and L1I cache tags through the prefetch port, keeping prefetch requests from interfering with critical instruction access. If the instructions already exist in the L1I cache, the address is removed from the address buffer. If the instructions are missing, the prefetch engine sends a prefetch request to the L2 cache. The prefetch engine also supports a special prefetch hint on branch instructions to initiate a streaming prefetch. For these hints, the prefetch engine continues to fetch along a linear path, up to four L2 cache lines ahead of demand accesses. Software hints can explicitly stop the current streaming prefetch or engage a new streaming prefetch. The prefetch engine automatically stops prefetching down a path if a mispredicted branch resteers the front end. The prefetch engine avoids cache pollution through software hints, branch-prediction-based cancellation, self-throttle mechanisms, and an L1I cache line replacement algorithm that biases unreferenced instructions for replacement.
Branch prediction
The Itanium 2 processor's branch prediction performance relies on a two-level prediction algorithm and two levels of branch history storage. The first level of branch prediction storage is tightly coupled to the L1I cache. This coupling allows a branch's taken/not-taken history and a predicted target to be delivered with every L1I demand access in one cycle. The branch prediction logic uses the history to access a pattern history table and determine a branch's final taken/not-taken prediction, or trigger, according to the Yeh-Patt algorithm.4 The L2 branch cache saves the histories and triggers of branches evicted from the L1I so that they are available when the branch is revisited, providing the second storage level.
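A minimal C model of the Yeh-Patt two-level scheme: a per-branch history register selects a 2-bit saturating counter in a pattern history table, and the counter's value supplies the trigger. The history length and table size below are placeholders; the article does not give the Itanium 2's exact dimensions.

```c
#include <stdint.h>
#include <stdbool.h>

enum { HIST_BITS = 4, PHT_SIZE = 1 << HIST_BITS };   /* assumed widths */

typedef struct {
    uint8_t history;          /* per-branch taken/not-taken history */
    uint8_t pht[PHT_SIZE];    /* 2-bit saturating counters          */
} TwoLevelPredictor;

static bool predict(const TwoLevelPredictor *p) {
    /* The history pattern selects the counter; counters of 2 or 3
       predict taken. */
    return p->pht[p->history] >= 2;
}

static void update(TwoLevelPredictor *p, bool taken) {
    uint8_t *c = &p->pht[p->history];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    /* Shift the outcome into the first-level history register. */
    p->history = (uint8_t)(((p->history << 1) | taken) & (PHT_SIZE - 1));
}
```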
Table 2. Possible branch prediction penalties and their causes. A correctly predicted taken branch incurs no penalty.
Penalty (cycles)   Cause
1                  Correctly predicted taken IP-relative branch with incorrect target; return branch
2                  Nonreturn indirect branch
6                  Incorrect taken/not-taken prediction or incorrect indirect target
The one-cycle latency provides a zero-penalty resteer for correctly predicted IP-relative branches. The prediction information consists of the prediction history and trigger for every branch instruction, up to three per bundle, and a portion of the predicted target's virtual address for every bundle pair. Because the bundles share the target and the target may not be sufficient to represent the entire span required by the branch, there might be times when the front end is resteered to an incorrect address. The branch prediction logic tracks this situation and provides a corrected IP-relative target one cycle later.
L2 branch cache

The size and organization of the branch prediction information suggest that branch prediction accuracy suffers when the instruction stream revisits a branch that has lost its prediction history because of an eviction. To mitigate the potential loss of branch histories, the L2 branch cache stores the trigger and histories of branches evicted from the first-level storage. The L2B is a 24,000-entry backing store that does not use tags; instead it uses three address-based hashing functions and voting to determine the correct initialization of prediction histories and triggers for L1I fills. Limiting the L2B to prediction history and trigger but not target provides a highly effective and compact design. A branch target can be recalculated, in most cases, before an L1I fill occurs and with little penalty. It is possible that the L2B does not contain any information for the line being filled to the L1I. In that case, the trigger and history bits are initialized according to the branch completers provided in the branch instruction.
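One plausible reading of the tagless, three-hash organization, sketched in C: the same branch address indexes three tables through different hash functions, and a bitwise two-of-three majority vote over the three reads initializes the history and trigger on an L1I fill. The hash functions and table split below are invented for illustration (3 x 8,192 = 24,576, close to the stated 24,000 entries).

```c
#include <stdint.h>

enum { L2B_SIZE = 8192 };   /* assumed per-table size, three tables */

typedef struct { uint8_t table[3][L2B_SIZE]; } L2BranchCache;

static uint32_t hash(int which, uint64_t ip) {
    /* Three invented multiplicative hashes over the branch address. */
    static const uint64_t mult[3] = { 0x9e3779b97f4a7c15ULL,
                                      0xc2b2ae3d27d4eb4fULL,
                                      0x165667b19e3779f9ULL };
    return (uint32_t)((ip * mult[which]) >> 40) % L2B_SIZE;
}

/* On L1I eviction, write the history+trigger byte through all hashes. */
static void l2b_insert(L2BranchCache *c, uint64_t ip, uint8_t hist_trig) {
    for (int i = 0; i < 3; i++)
        c->table[i][hash(i, ip)] = hist_trig;
}

/* With no tags, aliased entries may disagree; bitwise 2-of-3 majority
   voting recovers the likely value for the L1I fill. */
static uint8_t l2b_lookup(const L2BranchCache *c, uint64_t ip) {
    uint8_t a = c->table[0][hash(0, ip)];
    uint8_t b = c->table[1][hash(1, ip)];
    uint8_t d = c->table[2][hash(2, ip)];
    return (uint8_t)((a & b) | (a & d) | (b & d));
}
```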
Instruction buffer
The instruction buffer receives instructions from the L1I or L2 caches and lets the front end fetch instructions ahead of the back-end pipeline's consumption of instructions. This eight-bundle buffer and bundle rotator can present a wide combination of two instruction bundles to the back-end dispersal logic. Thus, no matter how many instructions the back end consumes in a cycle, two bundles of instructions are available. The dispersal logic indicates that zero, one, or two bundles were consumed so that the instruction buffer can free the appropriate entries. If the pipeline is flushed or the instruction buffer is empty, a bundle can bypass the instruction buffer completely.
Branch resolution
All branch predictions are validated in the back-end pipeline. The branch prediction logic allows in-flight branch prediction to determine future branch prediction behavior; however, nonspeculative prediction state is maintained and restored in the case of a misprediction. Table 2 lists the possible branch prediction penalties and their causes.
Instruction dispersal
Figure 3 shows the design of the Itanium 2 processor front end and dispersal logic. The processor can issue and execute two instruction bundles, or six instructions, at a time. These instructions issue to one of 11 issue ports: two integer, four memory, two floating-point, and three branch.
Figure 3. Itanium 2 processor front end and dispersal logic, including the next-IP selection, the branch prediction and pattern history structures, the L1I TLB, tags, and array, the prefetch virtual-address buffer (PVAB), the instruction-streaming buffer and L2 cache paths into the instruction buffer, and the memory, integer, floating-point, and branch issue ports.
These ports allocate instructions to several execution units. Two integer units execute integer operations such as shift and extract; ALU operations such as add, and, and compare; and multimedia ALU operations. Four memory units execute memory operations such as load, store, semaphore, and prefetch, in addition to the ALU and multimedia instructions that the integer units can execute. The four memory units are slightly asymmetric: two are dedicated to integer loads and two to stores. Compared with a two-memory-port implementation, the four memory ports provide a threefold increase in dual-issue template combinations and many other performance improvement opportunities.5 The processor's dispersal logic looks at two bundles of instructions every cycle and assigns as many instructions as possible to execution resources. There are multiple resources for each
template type (I, M, F, B, and LX), and the dispersal logic typically assigns the first I instruction to the first I resource, the second I instruction to the second I resource, and so on until it exhausts the resources or an explicit stop bit breaks up an issue group. If instructions in the two bundles considered require more resources than available, the issue group stops at the oversubscription point, and the remaining instructions wait for dispersal in the next cycle. The instructions in an issue group are determined at dispersal and remain constant through the in-order execution pipeline. The dispersal logic dynamically maps instructions to the most appropriate resource. This is important in cases of limited or asymmetric execution resources. For example, the dispersal logic assigns a load instruction to the first load-capable M port (M0 or M1) and a store to the first store-capable M port (M2 or
M3), even if the store precedes the load in the issue group. In addition, the dispersal logic ignores this asymmetry for floating-point loads so that they issue to any M resource. Dynamic resource mapping also lets instructions typically assigned to I resources issue on M resources. If the template assigns an ALU or multimedia operation to an I resource, but all I resources have been exhausted, the dispersal logic dynamically reassigns the operation to an available M resource. Thus, the processor can often issue a pair of MII bundles despite having only two I resources. These capabilities remove from the code generator the burden of ordering and padding instructions to ensure that they issue to the correct resources.
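The port-assignment rules just described condense into a short C sketch. Real dispersal also honors stop bits and template constraints, which are omitted here; the port and type names are ours.

```c
typedef enum { OP_LOAD, OP_STORE, OP_FP_LOAD, OP_ALU, OP_INT } OpKind;

enum { M0, M1, M2, M3, I0, I1, NO_PORT = -1 };

typedef struct { int used[6]; } Ports;   /* per-cycle port availability */

static int first_free(Ports *p, const int *cands, int n) {
    for (int i = 0; i < n; i++)
        if (!p->used[cands[i]]) { p->used[cands[i]] = 1; return cands[i]; }
    return NO_PORT;
}

/* Returns the issue port, or NO_PORT: the issue group then splits at
   the oversubscription point and the instruction waits a cycle. */
static int disperse(Ports *p, OpKind k) {
    static const int loads[]  = { M0, M1 };             /* load-capable  */
    static const int stores[] = { M2, M3 };             /* store-capable */
    static const int any_m[]  = { M0, M1, M2, M3 };     /* FP loads      */
    static const int ints[]   = { I0, I1, M0, M1, M2, M3 };
    switch (k) {
    case OP_LOAD:    return first_free(p, loads,  2);
    case OP_STORE:   return first_free(p, stores, 2);
    case OP_FP_LOAD: return first_free(p, any_m,  4);
    case OP_ALU:     return first_free(p, ints,   6);  /* I spills to M */
    case OP_INT:     return first_free(p, ints,   2);  /* shift etc.: I only */
    default:         return NO_PORT;
    }
}
```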
Floating-point execution
Each of the two floating-point execution units can execute a fused multiply-add or a miscellaneous floating-point operation. Latency is fixed at four cycles for all floating-point calculations. The units are fully pipelined and bypassed. Eight read and six
write ports access the 128 floating-point registers. Six of the read ports supply operands for calculation; the remaining two read ports are for floating-point store operations. Two of the write ports are for calculation results; the other four provide write paths for floating-point load returns from the L2 cache. The four M resources and the two F resources combined allow two MMF bundles to execute every cycle. This provides the memory and computational bandwidth required for technical computing.5
Pipeline control

The Itanium 2 processor pipeline is fully interlocked, such that a stall in the exception detect (DET) stage propagates to the instruction expand (EXP) stage and suspends instruction advancement. A stall caused by one instruction in the issue group stalls the entire issue group and never causes the core pipeline to flush and replay. The DET-stage stall is the last opportunity for an instruction to halt execution before the pipeline control logic commits it to architectural state. The pipeline control logic also synchronizes the core pipeline and the L1D pipeline at the DET stage. The control logic allows these loosely coupled pipelines to lose synchronization so that the L1I and L2 caches can insert noncore requests into the memory pipeline with minimal impact on core instruction execution. Table 3 lists the stages and causes of potential stalls.

Memory subsystem

The relatively simple nature of the in-order core pipeline allowed the Itanium 2 processor designers to focus on the memory subsystem. The processor implements a full complement of region identifiers and protection keys, along with 64 bits of virtual address and 50 bits of physical address to provide 1,024 Tbytes of addressability. The memory subsystem is a low-latency, high-bandwidth design partitioned and organized to handle integer, floating-point, and enterprise workloads.5 Figure 4 shows a simplified diagram of the memory subsystem and system interface, including some data and control paths and data integrity features.

Figure 4. Itanium 2 processor's memory subsystem and system interface, showing the L1D tags, store tags, and 16-Kbyte data array; the store and fill buffers; the 128-entry L2D TLB; the L2 tags and 256-Kbyte data array; the L3 tags and 3-Mbyte data array; the bus queues; and the parity- and ECC-protected data and address/control paths.
The Itanium architecture lets the compiler hoist a load above a potentially conflicting store as an advanced load, resolving that dependency dynamically. An advanced load allocates an entry in the ALAT, a four-ported, 32-entry, fully associative structure that records the register identifiers and physical addresses of advanced loads. A later store to the same address invalidates all overlapping ALAT entries. Later, when an instruction requires the load's result, the ALAT indicates whether the load is still valid. If so, a use of the load data is allowed in the same cycle as the check without penalty. If a valid entry is not found, the load is automatically reissued and the use is replayed. The Itanium architecture allows scheduling of a use in the same issue group as the check; hence, from the code scheduler's perspective, an ALAT hit has zero latency.
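The ALAT's bookkeeping can be modeled in a few lines of C. The field layout is illustrative, and the replacement policy and four-ported lookup are not modeled.

```c
#include <stdint.h>
#include <stdbool.h>

enum { ALAT_ENTRIES = 32 };

typedef struct {
    bool     valid[ALAT_ENTRIES];
    uint8_t  reg_id[ALAT_ENTRIES];   /* target register of the advanced load */
    uint64_t paddr[ALAT_ENTRIES];
    uint8_t  size[ALAT_ENTRIES];
} Alat;

/* Advanced load: record the register and physical address. */
static void alat_advanced_load(Alat *a, int slot, uint8_t reg,
                               uint64_t paddr, uint8_t size) {
    a->valid[slot] = true; a->reg_id[slot] = reg;
    a->paddr[slot] = paddr; a->size[slot] = size;
}

/* A store invalidates every entry whose bytes it overlaps. */
static void alat_store(Alat *a, uint64_t paddr, uint8_t size) {
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (a->valid[i] &&
            paddr < a->paddr[i] + a->size[i] &&
            a->paddr[i] < paddr + size)
            a->valid[i] = false;
}

/* Check: true means the speculation held and the value is usable in
   the same cycle; false means the load reissues and the use replays. */
static bool alat_check(const Alat *a, uint8_t reg) {
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (a->valid[i] && a->reg_id[i] == reg)
            return true;
    return false;
}
```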
L1D cache
The data TLBs and L1D cache are similar in design to the instruction TLBs and caches; they share key attributes such as size, latency, arrangement, and tight integration with the tags and the first-level TLB. The principle of prevalidated tags enables a one-cycle L1D cache. This feature is essential for a wide in-order microprocessor to achieve high performance in many integer workloads. If the latency were two cycles, the compiler would need to schedule at least five, and often more, instructions to cover the latency. The Itanium 2 processor's single-cycle latency requires only an explicit stop between a load and its use, thus easing the burden on the code generator to extract instruction-level parallelism. The L1D is a multiported, 16-Kbyte, four-way set-associative, physically addressed cache with a 64-byte line protected by parity. Instructions access the L1D in program order; hence, it is an in-order cache. However, the scoreboard logic allows the L1D and other cache levels to be nonblocking. The L1D provides two dedicated load ports and two dedicated store ports. These ports are fixed, but the dispersal logic rearranges loads and stores within an issue group to ensure they reach the appropriate memory resource. The two load requests can hit and return data from the L1D in parallel without conflict. Rotators between the data array and the register file allow integer loads of any unaligned data reference within an 8-byte datum, as well as support for big- or little-endian accesses.
The prevalidated tags and first-level TLB serve only integer loads. Stores access the second-level data (L2D) TLB and use a traditional tagging mechanism. This increases their latency, but store latency is not a performance issue, in part because store-load forwarding is provided in the store data path. The L1D enforces a write-through, no-write-allocate policy such that it passes all stores to the L2 cache, and store misses do not allocate into the L1D. If a store hits in the L1D, the data moves to a store buffer until the data array becomes available to update the L1D. These store buffers can merge store data from other stores and forward their contents to later loads. Integer load and data prefetch misses allocate into the L1D, according to temporal hints and available resources. Up to eight L1D lines can have fill requests outstanding, but the total number of permitted L1D misses is limited only by the scoreboard and the other cache levels. If the L2 cannot accept a request, it applies back pressure and the core pipeline stalls. Before an L1D load miss or store request is dispatched to the L2, it accesses the L2D TLB. The TLB access behavior for loads differs from that of the instruction cache: the L1D and L2D TLBs are accessed in parallel for loads, regardless of an L1D hit or miss. This reduces both L1D and L2 latency. Consequently, the 128-entry, fully associative L2D TLB is fully four-ported to allow the complete issue of every possible combination of four memory operations. The L1D is highly integrated into the integer data path and the L2 tags. All integer loads must go through the L1D to return data to the register file and core bypass network. The L1D pipeline processes all memory accesses and requests that need access to the L2 tags or the integer register file. Accordingly, several types of requests arbitrate for access to the L1D. Some of these requests have higher priority than core requests, and if there are conflicts, the core memory request stalls the core and reissues to the L1D when resources are available.
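A toy C model of one store buffer entry shows the two behaviors described above, merging later stores and forwarding to later loads; entry counts, ports, and the interaction with fill data are not modeled.

```c
#include <stdint.h>
#include <stdbool.h>

enum { LINE = 64 };   /* one L1D line's worth of data */

typedef struct {
    uint64_t line_addr;      /* line-aligned address */
    uint8_t  data[LINE];
    uint64_t byte_valid;     /* bit i set: data[i] holds store data */
} StoreBufEntry;

/* Merge a later store into the buffered line. */
static void sb_merge(StoreBufEntry *e, uint64_t addr,
                     const uint8_t *src, int n) {
    int off = (int)(addr - e->line_addr);
    for (int i = 0; i < n; i++) {
        e->data[off + i] = src[i];
        e->byte_valid |= 1ULL << (off + i);
    }
}

/* Forward to a later load only if the buffer supplies every byte it
   needs; a partial overlap would have to stall or merge with cache data. */
static bool sb_forward(const StoreBufEntry *e, uint64_t addr,
                       uint8_t *dst, int n) {
    if (addr < e->line_addr || addr + n > e->line_addr + LINE)
        return false;
    int off = (int)(addr - e->line_addr);
    uint64_t need = ((n == 64) ? ~0ULL : ((1ULL << n) - 1)) << off;
    if ((e->byte_valid & need) != need)
        return false;
    for (int i = 0; i < n; i++) dst[i] = e->data[off + i];
    return true;
}
```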
L2 cache
The second-level (L2) cache is a unified, 256-Kbyte, eight-way set-associative cache with a 128-byte line size. The L2 tags are true four-ported, with tag and ownership state protected by parity. The tags, accessed as part of the L1D pipeline, provide an early L2 hit or
miss indication. The L2 enforces write-back and write-allocate policies. The L2's integer access latency is five, seven, nine, or more cycles. Floating-point accesses require an additional cycle for converting to the floating-point register format. The L2 cache is nonblocking and out of order. All memory operations that access the L2 (L1D misses and all stores) check the L2 tags and are allocated to a 32-entry queuing structure called the L2 OzQ. All stores require one of the 24 L2 data entries to hold the store until the L2 data array is updated. L1I instruction misses also go to the L2 but are stored in the instruction fetch FIFO (IFF) queue. Requests in the L2 OzQ and the IFF queue arbitrate for access to the data array or the L3 cache and system interface. This arbitration depends on the type of IFF request; instruction demand requests issue before data requests, and data requests issue before instruction prefetch requests. Up to four L2 data operations and one request to the L3 and system interface can issue every cycle. The L2 OzQ maintains all architectural ordering between memory operations, while allowing unordered accesses to complete out of order. This makes specifying a single L2 latency difficult but helps ensure that older memory operations do not impede the progress of younger ones. In many cases, incoming requests bypass allocation to the L2 OzQ and access the data array immediately. This provides the five-cycle latency mentioned earlier. Sometimes the request can bypass the OzQ, but an L2 resource conflict forces the request to have a seven-cycle latency. The minimum latency for a request that issues from the L2 OzQ is nine cycles. Resource conflicts, ordering requirements, or higher-priority operations can extend a request's latency beyond nine cycles. The L2 data array has 16 banks; each bank is 16 bytes wide and ECC-protected. The array allows multiple simultaneous accesses, provided each access is to a different bank. Floating-point loads can bypass or issue from the L2 OzQ, access the L2 data array, complete four requests at a time, and fully utilize the L2's four data paths to the floating-point units and register file. The L2 does not have direct data paths to the integer units and register file; integer loads deliver data via the
L1D, which has two data paths to the integer units and register file. Stores can bypass or issue from the L2 OzQ and access the L2 data array four at a time, provided they access different banks. The fill path width from the L2 to the L1D and the L1I is 32 bytes, requiring two cycles to transfer a 64-byte L1I or L1D line. The fill bandwidth from the L3 or system interface to the L2 is also 32 bytes per cycle. Four 32-byte quantities accumulate in the L2 fill buffers for either the L3 or system interface, allowing the interleaving of system interface and L3 data returns. The 128-byte cache line is written into the L2 in one cycle, updating both tag and data arrays.
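The banking rule is easy to state in code: with 16 banks of 16 bytes each, an access's bank comes from the address bits just above the 16-byte offset (the exact interleaving is our assumption), and a group of accesses can proceed together only if their banks are pairwise distinct.

```c
#include <stdint.h>
#include <stdbool.h>

enum { BANKS = 16, BANK_WIDTH = 16 };

/* Assumes consecutive 16-byte chunks map to consecutive banks. */
static unsigned l2_bank(uint64_t addr) {
    return (unsigned)((addr / BANK_WIDTH) % BANKS);
}

/* True if n queued accesses are pairwise conflict-free and can issue
   to the data array in the same cycle. */
static bool banks_compatible(const uint64_t *addr, int n) {
    uint32_t seen = 0;
    for (int i = 0; i < n; i++) {
        uint32_t bit = 1u << l2_bank(addr[i]);
        if (seen & bit) return false;   /* two accesses hit one bank */
        seen |= bit;
    }
    return true;
}
```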
L3 cache. The 3-Mbyte L3 cache keeps its tag array separate from its data path so that L3 reads and writes can be pipelined for maximum bandwidth. The L3 is nonblocking and has an eight-entry queue to support multiple outstanding requests. This queue orders requests and prioritizes them among tag read or write and data read or write to achieve the highest performance.

System interface. The system interface operates at 200 MHz and includes multiple subbuses for various functions, such as address/request, snoop, response, data, and defer. All buses, except the snoop bus, are protected against errors by parity or ECC. The data bus is 128 bits wide and operates source-synchronously at 400 million data transfers, or 6.4 Gbytes, per second. The system interface seamlessly supports up to four Itanium 2 processors. The system interface control logic contains an in-order queue (IOQ) and an out-of-order queue (OOQ), which track all transactions pending completion on the system interface. The IOQ tracks a request's in-order phases and is identical on all processors and the node controller. The OOQ holds only deferred processor requests. The IOQ can hold eight requests, and the OOQ can hold 18 requests. The system interface logic also contains two 128-byte coalescing buffers to support write-coalescing stores. The buffers can coalesce store requests at byte granularity, and they strive to generate full-line writes for best performance. Writes of 1 to 8 bytes, 16 bytes, or 32 bytes are possible when holes exist in the coalescing buffers. The similarities between the system interfaces of the Itanium 2 and Itanium processors allowed several implementations to leverage their Itanium-based solutions for use with the Itanium 2 processor. However, large, multinode system designs required additional support for high performance and reliability. As a result, the processor's system interface defines a few new transactions. The read current transaction lets the node controller obtain a current copy of data in a processor, while allowing the processor to maintain ownership of the line. The cache line replacement transaction informs a multinode snoop directory that an L3 clean eviction occurred, to remove unnecessary snoop traffic. The cleanse cache transaction pushes a modified cache line out
to system memory. This allows higher-performance processor checkpointing in high-availability systems without forcing the processor to give up ownership of the line.
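A toy C model of one 128-byte coalescing buffer illustrates byte-granularity merging and the full-line test; allocation and eviction policy are simplified.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

enum { WC_LINE = 128 };

typedef struct {
    uint64_t line_addr;
    bool     in_use;
    uint8_t  data[WC_LINE];
    uint8_t  byte_valid[WC_LINE];   /* 1 if the byte holds store data */
} WcBuffer;

/* Merge a write-coalescing store into the buffer at byte granularity. */
static bool wc_merge(WcBuffer *b, uint64_t addr, const uint8_t *src, int n) {
    uint64_t line = addr & ~(uint64_t)(WC_LINE - 1);
    if (!b->in_use) {                 /* allocate on first store */
        b->in_use = true;
        b->line_addr = line;
        memset(b->byte_valid, 0, sizeof b->byte_valid);
    } else if (b->line_addr != line) {
        return false;                 /* different line: caller must flush */
    }
    int off = (int)(addr - line);
    for (int i = 0; i < n && off + i < WC_LINE; i++) {
        b->data[off + i] = src[i];
        b->byte_valid[off + i] = 1;
    }
    return true;
}

/* A full-line write is possible only when every byte is valid; holes
   force the smaller 1-to-8-, 16-, or 32-byte writes the text mentions. */
static bool wc_full_line(const WcBuffer *b) {
    for (int i = 0; i < WC_LINE; i++)
        if (!b->byte_valid[i]) return false;
    return true;
}
```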
Soon after the Itanium 2 processor's introduction, major computer system providers either announced or introduced single- and dual-processor workstations, four- to 128-processor servers, and a 3,300-processor supercomputer, all using the Itanium 2 processor. The operating systems available for these systems include HP-UX, Linux, and Windows .NET, and will eventually include OpenVMS. These systems and operating systems target a diverse set of computing problems and use the processor effectively for workstation, server, and supercomputer workloads. The Itanium 2 processor fits well in such varied environments because of its balanced design from instruction fetch to system interface and its flexible underlying architecture. The design team capitalized on the performance opportunities available in the Itanium architecture to produce a high-performance, in-order implementation and provide computer system developers a powerful and versatile building block.

References
1. H. Sharangpani and K. Arora, "Itanium Processor Microarchitecture," IEEE Micro, vol. 20, no. 5, Sept.-Oct. 2000, pp. 24-43.
2. J. Huck et al., "Introducing the IA-64 Architecture," IEEE Micro, vol. 20, no. 5, Sept.-Oct. 2000, pp. 12-23.
3. D. Bradley, P. Mahoney, and B. Stackhouse, "The 16KB Single-Cycle Read Access Cache on a Next Generation 64b Itanium Microprocessor," Proc. 2002 IEEE Int'l Solid-State Circuits Conf. (ISSCC 02), IEEE Press, 2002, pp. 110-111.
4. T.-Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proc. 19th Int'l Symp. Computer Architecture (ISCA 92), ACM Press, 1992, pp. 124-134.
5. T. Lyon et al., "Data Cache Design Considerations for the Itanium 2 Processor," Proc. 2002 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors (ICCD 02), IEEE Press, 2002, pp. 356-362.
6. J. McCormick and A. Knies, "A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel Itanium 2 Processor," 2002; http://www.hotchips.org/archive/index.html.
7. E.S. Fetzer and J.T. Orton, "A Fully-Bypassed 6-Issue Integer Datapath and Register File on an Itanium Microprocessor," Proc. 2002 IEEE Int'l Solid-State Circuits Conf. (ISSCC 02), IEEE Press, 2002, pp. 420-478.
Cameron McNairy is an Itanium 2 processor microarchitect at Intel. His research interests include high-performance technical computing and large-system design issues. McNairy has a BSEE and an MSEE from Brigham Young University. He is a member of the IEEE.

Don Soltis is an Itanium 2 processor microarchitect at Hewlett-Packard. His research interests include microprocessor cache design and microprocessor verification. Soltis has a BSEE and an MSEE from Colorado State University.

Direct questions or comments about this article to Cameron McNairy, 3400 E. Harmony Road, MS 55, Fort Collins, CO 80526; [email protected].