Chapter 6 Full Updated
6. PENTIUM 4
EVOLUTION OF MICROPROCESSOR
80186 BASIC FEATURES
The 80186 contains a 16-bit data bus
80386                                        8086
1) It has six types of memory segments,      1) It has four types of memory segments,
   i.e. CS, DS, ES, FS, GS and SS               i.e. CS, DS, ES and SS
2) Segment size is variable from 1 byte      2) Segment size is variable from 1 byte
   to 4GB                                       to 64KB
7) 80386 can access 4GB of memory            7) 8086 can access 1MB of memory
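The memory limits in the comparison above follow from the address widths (20 address lines on the 8086, 32 on the 80386). The following is a minimal sketch of that arithmetic, together with the 8086 rule of forming a physical address from a segment and an offset. The function names are illustrative and not taken from the slides.

```python
# Sketch: address-space sizes and 8086 address formation.
# Function names are hypothetical, chosen for illustration only.

MB = 2 ** 20
GB = 2 ** 30


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def real_mode_physical(segment, offset):
    """8086 physical address: 16-bit segment shifted left by 4 bits,
    plus the 16-bit offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


print(addressable_bytes(20) // MB)               # 8086: size in MB
print(addressable_bytes(32) // GB)               # 80386: size in GB
print(hex(real_mode_physical(0x1234, 0x0010)))   # segment:offset example
```

A 16-bit offset can reach at most 64KB within a segment, which is why the 8086 segment size tops out at 64KB, while the 32-bit offsets of the 80386 allow segments of up to 4GB.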
PENTIUM PROCESSOR BASIC FEATURES
The Pentium microprocessor is almost identical to
the earlier 80386 and 80486 microprocessors
[Figure: Pentium II cartridge, containing the Pentium II and a 512K/1M/2M cache connected by the Pentium II internal bus]
A TYPICAL PENTIUM II SYSTEM
[Figure: block diagram of a typical Pentium II system, showing the Pentium II cartridge, chipset, AGP slot, SDRAM or DRAM, PCI bus, USB bus, bridge, and ISA bus]
PENTIUM II (CONT...)
L2 cache is no longer inside the µP IC
But placed very close to µP IC
This change makes the µP less expensive
Various versions of P II are available
Standard P II
L2 cache operates at half the processor speed
Celeron: does not contain L2 cache in the cartridge
Rather, it is on the main board
Operates at processor speed
Xeon: contains up to 2M (512K/1M/2M) of L2 cache
Operates at processor speed
Early P II requires 5.0 V, 3.3 V and a variable-voltage power supply for operation
The variable voltage may range from 3.5 V down to as low as 1.8 V
Requires 8.4 to 14.2 A, depending on operating frequency
PENTIUM II (CONT...) MEMORY SYSTEM
36-bit address
64-bit data
RAM used has an access time of 8 ns to 10 ns
Also includes ECC
Parity checking is available, though it is not used by P II systems
Transfers between the P II and the memory system are controlled by the chipset
In fact, the chipset controls the P II, which is a departure from the
traditional role of the processor
MEMORY MAP OF A PII BASED SYSTEM
Conventional Memory                          0 – 1M
    Application Area                         0 – 640K
    System Area                              640K – 1M
Main Memory                                  1M – 1G
    Optional ISA Memory                      15M – 16M
    Remapped AGP Data
PCI Memory                                   1G – 4G
    AGP Aperture Texture and Instructions
    PCI Access to AGP Frame Buffer
    PCI Access to AGP Registers
For future expansion                         4G – 64G
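The upper limit of 64G in the memory map matches the 36-bit address bus, since 2^36 bytes is 64G. As a rough sketch, the top-level regions can be written as a lookup table. The `REGIONS` list and `region_of` function are illustrative names, and only the four top-level ranges that the map gives explicitly are included.

```python
# Sketch: classify an address into the top-level regions of the
# P II memory map. Names are hypothetical; ranges come from the map.

K, M, G = 2 ** 10, 2 ** 20, 2 ** 30

# (name, start, end) with end exclusive
REGIONS = [
    ("Conventional Memory", 0, 1 * M),
    ("Main Memory", 1 * M, 1 * G),
    ("PCI Memory", 1 * G, 4 * G),
    ("For future expansion", 4 * G, 64 * G),
]


def region_of(address):
    """Return the name of the region that contains the address."""
    for name, start, end in REGIONS:
        if start <= address < end:
            return name
    raise ValueError("address outside the 36-bit address space")


print(2 ** 36 == 64 * G)          # 36 address bits cover the whole map
print(region_of(640 * K))         # start of the System Area
print(region_of(2 * G))           # inside PCI Memory
```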
PENTIUM III
Improved version of the P II, but based on the Pentium Pro
architecture, not on the P II
Two versions of the P III are available
One is packaged in a Slot 1 cartridge, like the P II, instead of
as an IC, with a non-blocking 512K cache running at half
the processor speed
The other, known as Coppermine, is packaged as a 370-pin IC with
a 256K advanced transfer cache within the IC
running at processor speed
It has been observed that increasing the cache size
from 256K to 512K improves performance by
only a few percent
PENTIUM III (CONT...)
Chipset is different from P II
Coppermine increases the bus speed to either
100MHz or 133MHz
Bus speed cannot be increased arbitrarily because of
radiation problems
As with the P II, various versions of the P III are available
Standard P III
Celeron P III uses a 66MHz bus speed
Xeon P III allows a larger cache for server
applications
PENTIUM PROCESSOR FEATURES
PENTIUM IV
PENTIUM IV (CONT...)
MEMORY INTERFACE
Typically uses the Intel 850 chipset
The 850 provides a dual-pipe memory bus to the
processor
Each pipe is interfaced to a 32-bit wide section of
memory
The two pipes function together to form the 64-bit
data bus
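As a small illustration of two 32-bit pipes forming one 64-bit data bus, the sketch below splits a 64-bit word into two halves and joins them again. Which pipe carries the low half is an assumption made for the example, and the function names are hypothetical.

```python
# Sketch: a 64-bit transfer viewed as two 32-bit halves, one per pipe.
# Assumption: the first pipe carries the low 32 bits.

MASK32 = 0xFFFFFFFF


def split_64(word):
    """Split a 64-bit word into (low, high) 32-bit halves."""
    return word & MASK32, (word >> 32) & MASK32


def join_64(low, high):
    """Recombine two 32-bit halves into one 64-bit word."""
    return ((high & MASK32) << 32) | (low & MASK32)


low, high = split_64(0x1122334455667788)
print(hex(low), hex(high))
print(hex(join_64(low, high)))
```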
PENTIUM 4
Still translates from 80x86 instructions to micro-ops
P4 has better branch predictor, more FUs
Instruction cache holds micro-operations rather than 80x86 instructions
No 80x86 decode stages on a cache hit
Called the “trace cache” (TC)
Faster memory bus: 400 MHz v. 133 MHz.
Caches
Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
Clock rates:
Pentium III 1 GHz v. Pentium IV 1.5 GHz
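To put the bus figures above in perspective, peak bandwidth can be estimated as the transfer rate multiplied by the bus width. The sketch below assumes the 64-bit data bus described earlier, treats the quoted 400 MHz and 133 MHz as transfers per second, and counts 1 MB as 10^6 bytes. The function name is illustrative.

```python
# Sketch: peak memory-bus bandwidth, assuming a 64-bit data bus
# and one transfer per quoted bus cycle.

def peak_bandwidth_mb_per_s(transfers_mhz, bus_width_bits=64):
    """Peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return transfers_mhz * bus_width_bits / 8


p3 = peak_bandwidth_mb_per_s(133)   # Pentium III bus
p4 = peak_bandwidth_mb_per_s(400)   # Pentium 4 bus
print(p3, p4, round(p4 / p3, 2))
```

These are upper bounds. Sustained bandwidth is lower, since it also depends on memory latency and access patterns.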
PENTIUM 4 FEATURES
Multimedia instructions 128 bits wide vs. 64 bits wide => 144
new instructions
When will programs use them?
Faster floating point: executes 2 64-bit FP operations per clock
Memory FU: 1 128-bit load, 1 128-bit store per clock to MMX regs
Using RAMBUS DRAM
Bandwidth faster, latency same as SDRAM
Cost 2X-3X vs. SDRAM
ALUs operate at 2X clock rate for many ops
Pipeline doesn’t stall at this clock rate: uops replay
Rename registers: 40 (PIII) vs. 128 (P4); Window: 40 vs. 126
BTB: 512 (PIII) vs. 4096 (P4) entries (Intel: 1/3 improvement)
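The effect of running the ALUs at twice the clock rate can be shown with simple arithmetic. This sketch uses the 1.5 GHz Pentium 4 clock quoted earlier. The function names are hypothetical.

```python
# Sketch: effective ALU rate and per-operation time when the ALUs
# run at 2X the core clock.

def alu_rate_ghz(core_clock_ghz, multiplier=2):
    """Rate at which the double-speed ALUs can start simple operations."""
    return core_clock_ghz * multiplier


def op_time_ns(rate_ghz):
    """Time for one operation at the given rate, in nanoseconds."""
    return 1.0 / rate_ghz


rate = alu_rate_ghz(1.5)
print(rate)                          # GHz
print(round(op_time_ns(rate), 3))    # ns, half of one core clock period
```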
PARAMETER       PENTIUM 2            PENTIUM 3             PENTIUM 4
7 FSB speeds    66 MHz to 100 MHz    100 MHz to 133 MHz    400 MT/s to 1066 MT/s
PENTIUM 4 NETBURST MICROARCHITECTURE
The Pentium 4 NetBurst microarchitecture was introduced by Intel in
2000.
It was designed to deliver high clock speeds and improved
performance compared to its predecessors.
The key features of the NetBurst microarchitecture include:
Rapid Execution Engine (REX): The integer ALUs run at twice the
core clock frequency, so simple integer operations complete in half
a core clock cycle.
Hyper Pipelined Technology: The NetBurst microarchitecture
uses a deeper pipeline (20 stages) than previous architectures,
allowing for higher clock speeds and potentially higher performance.
Advanced Transfer Cache (ATC): The ATC is the on-die L2 cache. It
runs at processor speed and is designed to improve the efficiency of
data transfer between the cache and the processor core.
SSE2 Instructions: The NetBurst microarchitecture introduced the
Streaming SIMD Extensions 2 (SSE2) instruction set, which aimed to
enhance multimedia and floating-point performance.
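One way to see what the move from 64-bit to 128-bit multimedia instructions offers is to count how many data elements fit in one register. The sketch below is plain arithmetic. The dictionary and function names are illustrative.

```python
# Sketch: elements processed per SIMD instruction for 64-bit (MMX)
# and 128-bit (SSE2) registers.

REGISTER_BITS = {"MMX": 64, "SSE2": 128}


def lanes(register_bits, element_bits):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


for name, bits in REGISTER_BITS.items():
    print(name, [lanes(bits, width) for width in (8, 16, 32, 64)])
```

With SSE2, one instruction can operate on two 64-bit double-precision values at once.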
THE NETBURST MICROARCHITECTURE
A fast processor requires balancing and tuning of many
microarchitectural features that compete for processor die
cost and for design and validation efforts.
The figure shows the basic Intel NetBurst microarchitecture
of the Pentium 4 processor.
There are four main sections:
1. IN-ORDER FRONT END,
2. OUT-OF-ORDER EXECUTION ENGINE,
3. INTEGER AND FLOATING-POINT EXECUTION UNITS, AND
4. MEMORY SUBSYSTEM.
ARCHITECTURE OF PENTIUM 4 PROCESSOR
Generally, the architecture of the Pentium 4 processor
consists of a Bus Interface Unit (BIU), Instruction
Fetch and Decoder Unit, Trace Cache (TC),
Microcode ROM, Branch Target Buffer (BTB),
Branch Prediction, Instruction Translation Look-aside
Buffer (ITLB), Execution Unit, and Rapid
Execution Module.
The architecture of the Pentium 4 processor has four
modules: (i) memory subsystem
module, (ii) front-end module, (iii) integer/floating-point
execution unit, and (iv) out-of-order
execution unit.
The memory subsystem module contains a Bus
Interface Unit (BIU) and an optional L3 cache.
The front-end module consists of the instruction
decoder, Trace Cache (TC), microcode ROM,
Branch Target Buffer (BTB) and branch prediction.
The integer/floating-point execution unit has the L1
data cache and execution unit.
The out-of-order execution unit consists of the execution
unit and retirement.
IN-ORDER FRONT END
The NetBurst microarchitecture has an
advanced form of instruction cache called the
Execution Trace Cache.
Unlike conventional instruction caches, the
Trace Cache sits between the instruction
decode logic and the execution core, as shown
in Figure 1.
In this location the Trace Cache is able to store
the already decoded IA-32 instructions, or uops.
Storing already decoded instructions removes
the IA-32 decoding from the main execution
loop.
TRACE CACHE (TC)
After the instruction decoder translates instructions into
micro-operations (μ-ops), the streams of
decoded instructions are fed to an instruction cache,
which is known as the trace cache.
This L1 instruction cache stores only the decoded stream of
instructions, that is, micro-operations (μ-ops).
Hence, the speed of execution increases
significantly, because instructions that hit in the cache
need not be decoded again.
In a Pentium 4 processor, the trace cache can store up
to 12 K μ-ops.
Normally, the cache assembles the decoded μ-ops into
program-ordered sequences called traces.
A single trace contains many trace lines and each trace
line has six μ-ops.
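The capacity figures above imply a line count, which the following sketch works out. It assumes that 12 K means 12 × 1024 and that every line holds a full six μ-ops. Real hardware may end a line early, so treat this as a simplified model. The names are illustrative.

```python
# Sketch: packing a decoded uop stream into trace lines of six uops.
# Assumptions: 12 K = 12 * 1024 and every line is filled completely.

UOPS_PER_LINE = 6
CAPACITY_UOPS = 12 * 1024


def pack_into_trace_lines(uops):
    """Group uops, in program order, into lines of up to six uops."""
    return [uops[i:i + UOPS_PER_LINE]
            for i in range(0, len(uops), UOPS_PER_LINE)]


max_lines = CAPACITY_UOPS // UOPS_PER_LINE
trace = pack_into_trace_lines(["uop%d" % i for i in range(14)])
print(max_lines)                       # trace lines in a full cache
print([len(line) for line in trace])   # line occupancy for 14 uops
```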
IN-ORDER FRONT END
Typically the instructions are decoded
once and placed in the Trace Cache and
then used repeatedly from there like a
normal instruction cache on previous
machines.
The IA-32 instruction decoder is only used
when the machine misses the Trace
Cache and needs to go to the L2 cache to
get and decode new IA-32 instruction
bytes.
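The decode-once, reuse-many behaviour described above can be modelled in a few lines. The class below is a toy model rather than the actual hardware mechanism. `TraceCacheModel`, its `_decode` stand-in, and the instruction addresses are all hypothetical.

```python
# Sketch: a toy trace cache that decodes an instruction only on a miss.

class TraceCacheModel:
    def __init__(self):
        self.traces = {}       # instruction address -> decoded uops
        self.decode_count = 0  # how often the decoder was invoked

    def _decode(self, address):
        # Stand-in for the IA-32 decoder; produces placeholder uops.
        self.decode_count += 1
        return ["uop@%#x.%d" % (address, i) for i in range(2)]

    def fetch(self, address):
        """Return uops for the address, decoding only on a miss."""
        if address not in self.traces:
            self.traces[address] = self._decode(address)
        return self.traces[address]


tc = TraceCacheModel()
for _ in range(3):                       # a loop executed three times
    for addr in (0x1000, 0x1004, 0x1008):
        tc.fetch(addr)
print(tc.decode_count)                   # decodes needed for 9 fetches
```

In this model the decoder runs only on the first pass through the loop. Later iterations are served from the cache, which is the benefit the slide describes.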
OUT-OF-ORDER EXECUTION LOGIC
THE RETIREMENT LOGIC
INTEGER AND FLOATING-POINT EXECUTION UNITS
MEMORY SUBSYSTEM
This includes the L2 cache and the system bus.