
Chapter 6 Full Updated

The document outlines the evolution of microprocessors from the 80186 to the Pentium 4, detailing their basic features, improvements, and architectural changes. Key advancements include increased data bus widths, integrated clock generators, enhanced memory management, and the introduction of dual caches in later models. The comparison highlights significant differences in segmentation, memory addressing, and processing capabilities across various microprocessor generations.

Uploaded by

yashshende802

PRAVIN ROHIDAS PATIL COLLEGE

OF DEGREE ENGG. & TECHNOLOGY

6. PENTIUM 4
EVOLUTION OF MICROPROCESSOR
80186 BASIC FEATURES
 The 80186 contains a 16-bit data bus
 The internal register structure of the 80186 is virtually identical to that of the 8086
 About the only difference is that the 80186 contains additional reserved interrupt vectors and some very powerful built-in I/O features
80186 BASIC FEATURES
Clock Generator:
The internal clock generator replaces the external 8284A clock generator used with the 8086 microprocessor. This reduces the component count in a system
Programmable Interrupt Controller:
The PIC arbitrates all internal and external interrupts and controls up to two external 8259A PICs. When an external 8259A is attached, the 80186 functions as the master and the 8259A functions as the slave
80186 BASIC FEATURES
 Timers:
 The timer section contains three fully programmable 16-bit timers
 Timers 0 and 1 generate waveforms for external use and are driven by either the master clock of the 80186 or an external clock
 The third timer, timer 2, is internal and clocked by the master clock
80186 BASIC FEATURES
Programmable DMA Unit:
The programmable DMA unit contains two DMA channels, or four DMA channels in some models
Each channel can transfer data between memory locations, between memory and I/O, or between I/O devices
80186 BASIC FEATURES
Programmable chip selection unit:
The chip selection unit is a built-in programmable memory and I/O decoder
It has 6 output lines to select memory and 7 lines to select I/O
80186 BASIC FEATURES
Power save/Power Down Feature:
The power save feature allows the system clock to be divided by 4, 8, or 16 to reduce power consumption
The power saving feature is started by software and exited by a hardware event such as an interrupt
80186 BASIC FEATURES
Refresh Control Unit:
The refresh control unit generates the refresh row address at the programmed interval
80286 BASIC FEATURES
 The 80286 microprocessor is an advanced version of the 8086 microprocessor that is designed for multi-user and multitasking environments
 The 80286 addresses 16M bytes of physical memory and 1G bytes of virtual memory by using its memory-management system
 The 80286 is basically an 8086 that is optimized to execute instructions in fewer clocking periods than the 8086
80286 BASIC FEATURES
 The clock is provided by the 82284 clock generator, and the system control signals are provided by the 82288 system bus controller
 The 80286 contains the same instructions as the 8086 except for a handful of additional instructions that control the memory-management unit
80386 BASIC FEATURES
 The 80386 microprocessor is an enhanced version of the 80286 microprocessor and includes a memory-management unit that is enhanced to provide memory paging
 The 80386 also includes 32-bit extended registers and a 32-bit address and data bus
 The 80386 has a physical memory size of 4G bytes that can be addressed as a virtual memory with up to 64T bytes
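The physical and virtual memory sizes quoted for the 80286 and 80386 can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and the count of 16K segments per task (8K global plus 8K local descriptors) is an assumption that the slides do not state.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

def physical_space(address_bits: int) -> int:
    """Bytes addressable with the given number of physical address bits."""
    return 2 ** address_bits

def virtual_space(segment_count: int, max_segment_bytes: int) -> int:
    """Bytes of virtual memory: number of segments times the largest segment."""
    return segment_count * max_segment_bytes

# Assumption (not from the slides): 16K segments per task,
# i.e. 8K global descriptors plus 8K local descriptors.
SEGMENTS_PER_TASK = 16 * KB

# 80286: 24 address bits, segments of up to 64 KB
print(physical_space(24) // MB, "MB physical (80286)")
print(virtual_space(SEGMENTS_PER_TASK, 64 * KB) // GB, "GB virtual (80286)")

# 80386: 32 address bits, segments of up to 4 GB
print(physical_space(32) // GB, "GB physical (80386)")
print(virtual_space(SEGMENTS_PER_TASK, 4 * GB) // TB, "TB virtual (80386)")
```

Under these assumptions the figures come out as 16 MB and 1 GB for the 80286, and 4 GB and 64 TB for the 80386, matching the sizes given on the slides.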
80386 BASIC FEATURES
 When the 80386 is operated in the pipelined mode, it sends the address of the next instruction or memory data to the memory system prior to completing the execution of the current instruction
 This allows the memory system to begin fetching the next instruction or data before the current one is completed
 This increases the time available for each memory access, so slower memory can be used
80386 BASIC FEATURES
 The I/O structure of the 80386 is almost identical to that of the 80286, except that I/O can be inhibited when the 80386 is operated in the protected mode through the I/O permission bit map
 The register set of the 80386 contains extended versions of the 16-bit registers found on the 80286 microprocessor. These extended registers include EAX, EBX, ECX, EDX, EBP, ESP, EDI, ESI, EIP and EFLAGS
 The instruction set of the 80386 is enhanced to include instructions that address the 32-bit extended register set
80386 BASIC FEATURES
 Interrupts in the 80386 microprocessor have been expanded to include additional predefined interrupts in the interrupt vector table
 The 80386 memory manager is similar to that of the 80286, except that the physical addresses generated by the MMU are 32 bits wide instead of 24 bits
 The 80386 is also capable of paging
 The 80386 is operated in the real mode (i.e. 8086 mode) when it is reset
80386 BASIC FEATURES
 The real mode allows the microprocessor to address data in the first 1M byte of memory
 In the protected mode, the 80386 addresses any location in its 4G bytes of physical address space
80486 BASIC FEATURES
 The 80486 microprocessor is an improved version of the 80386 microprocessor that contains an 8K-byte cache and an integrated arithmetic coprocessor equivalent to the 80387; it executes many instructions in one clocking period
 The 80486 microprocessor executes a few new instructions that control the internal cache memory
80486 BASIC FEATURES
 A new feature found in the 80486 is the BIST (built-in self-test) that tests the microprocessor, coprocessor, and cache at reset time
 If the 80486 passes the test, EAX contains a zero
 Additional test registers are added to the 80486 to allow the cache memory to be tested
 These new test registers are TR3 (cache data), TR4 (cache status), and TR5 (cache control)
COMPARISON OF SEGMENTATION IN 80386
WITH 8086
No. | Segmentation in 80386 | Segmentation in 8086
1 | It has six types of memory segments, i.e. CS, DS, ES, FS, GS and SS | It has four types of memory segments, i.e. CS, DS, ES and SS
2 | Size of segments is variable from 1 byte to 4GB | Size of segments is variable from 1 byte to 64KB
3 | Segment selectors, descriptors, offset registers, GDT/LDT, page directory and page tables are used to generate the physical address from the logical address | Segment registers and offset registers are used to generate the physical address from the logical address
4 | Logical address is converted to linear and then to physical address | Logical address is converted to physical address
COMPARISON OF SEGMENTATION IN 80386
WITH 8086
No. | Segmentation in 80386 | Segmentation in 8086
5 | Protection is provided to memory segments by giving different privilege levels from 0 to 3 | Protection is not provided to memory segments
6 | Size of physical address is 32 bits | Size of physical address is 20 bits
7 | 80386 can access 4GB of memory | 8086 can access 1MB of memory
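The two address-generation schemes compared above can be sketched in Python. This is a simplified model rather than a description of the hardware: the `Descriptor` class, the table layout and the function names are hypothetical, privilege checking and paging are omitted, and the selector is assumed to carry its table index in bits 3 to 15.

```python
from dataclasses import dataclass

def real_mode_physical(segment: int, offset: int) -> int:
    """8086: 20-bit physical address = segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

@dataclass
class Descriptor:
    base: int   # 32-bit linear base address of the segment
    limit: int  # highest valid offset inside the segment

def protected_mode_linear(selector: int, offset: int, table: dict) -> int:
    """80386 (simplified): look up the descriptor, check the limit, add the base."""
    index = selector >> 3              # bits 3..15 of the selector pick the descriptor
    descriptor = table[index]
    if offset > descriptor.limit:
        raise ValueError("offset exceeds segment limit")
    return (descriptor.base + offset) & 0xFFFFFFFF

# Hypothetical descriptor table with a single 64 KB segment based at 1 MB.
table = {1: Descriptor(base=0x00100000, limit=0xFFFF)}

print(hex(real_mode_physical(0x1234, 0x0010)))
print(hex(protected_mode_linear(0x0008, 0x0020, table)))
```

With paging disabled, the linear address produced by the second function is also the physical address; with paging enabled it would be translated once more through the page directory and page tables.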
PENTIUM PROCESSOR BASIC
FEATURES
 The Pentium microprocessor is almost identical to the earlier 80386 and 80486 microprocessors
 The main difference is that the Pentium has been modified internally to contain a dual cache (instruction and data) and a dual integer unit
 The Pentium also operates at a higher clock speed of 66 MHz
PENTIUM PROCESSOR BASIC
FEATURES
 The data bus on the Pentium is 64 bits wide and contains eight byte-wide memory banks selected with bank enable signals
 Memory access time, without wait states, is only about 18 ns in the 66 MHz Pentium
 The superscalar structure of the Pentium contains three independent processing units: a floating-point processor and two integer processing units
PENTIUM PROCESSOR BASIC
FEATURES
 A new mode of operation called the System Management Mode (SMM) has been added to the Pentium. It is intended for high-level system functions such as power management and security
 The Built-in Self-test (BIST) allows the Pentium to be tested when power is first applied to the system.
 The Pentium allows 4M-byte memory pages in addition to the 4K-byte pages.
PENTIUM PRO PROCESSOR BASIC
FEATURES
 The Pentium Pro is an enhanced version of the Pentium microprocessor that contains not only the level 1 caches found inside the Pentium, but also the level 2 cache of 256K or 512K that is otherwise found on most main boards
 The Pentium Pro operates using the same 66 MHz bus speed as the Pentium and the 80486
 It uses an internal clock generator to multiply the bus speed by various factors to obtain higher internal execution speeds
PENTIUM PRO PROCESSOR BASIC
FEATURES
 The only significant software difference between the Pentium Pro and earlier microprocessors is the addition of the FCMOV and CMOV instructions
 The only hardware difference between the Pentium Pro and earlier microprocessors is the addition of 2M paging and four extra address lines that allow access to a memory address space of 64G bytes
PENTIUM II
PENTIUM II

Extension to Pro architecture with some differences
 Internal cache in PII has been moved out of the chip
 PII is not available as a single chip
Rather, it is available on a small plug-in circuit board, known as a cartridge, along with the level 2 (L2) cache chip

Various versions are available
 Celeron is a version without L2 cache
 Xeon is enhanced by having up to 2M L2 cache
PENTIUM II INTERNAL
ARCHITECTURE
[Figure: Pentium II cartridge containing the Pentium II and a 512K/1M/2M cache, connected by an internal bus]
A TYPICAL PENTIUM II SYSTEM
[Figure: Pentium II cartridge, chipset, AGP slot, SDRAM or DRAM, PCI bus, USB, bus bridge, ISA bus]
PENTIUM II (CONT...)

L2 cache is no longer inside the µP IC
 But placed very close to µP IC

This change makes the µP less expensive
PENTIUM II (CONT...)

Various versions of P II are available
 Standard P II

L2 cache operates at half the processor speed
 Celeron: does not contain L2 cache in the cartridge

Rather it is in the main board

Operates at processor speed
 Xeon: contains up to 2M (512K/1M/2M) L2 cache

Operates at processor speed
PENTIUM II (CONT...)

Early P II requires
 5.0 V
 3.3 V and
 a variable-voltage power supply for operation

The variable voltage may vary from 3.5 V to as low as 1.8 V

Requires 8.4 to 14.2 A depending on operating frequency
PENTIUM II (CONT...) MEMORY
SYSTEM

36 bit address

64 bit data

RAM used has an access time of 8 ns to 10 ns

Also includes ECC

Though not used by the P II system, parity checking is available

Transfers between the PII and the memory system are controlled by the chipset

In fact, the chipset controls the PII, which is a departure from the traditional use of the processor
PENTIUM II (CONT...)
MEMORY MAP OF A PII BASED SYSTEM

Conventional Memory 0 – 1M
 Application Area 0 – 640K
 System Area 640K – 1M

Main Memory 1M – 1G
 Optional ISA Memory 15M – 16M
 Remapped AGP Data

PCI Memory 1G – 4G
 AGP Aperture Texture and Instructions
 PCI Access to AGP Frame Buffer
 PCI Access to AGP Registers

For future expansion 4G – 64G
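The top-level regions of this memory map can be expressed as a small lookup table. The sketch below is illustrative only: the function and constant names are hypothetical, each range is assumed to be half-open (start included, end excluded), and the nested sub-regions such as the optional ISA memory at 15M to 16M are ignored.

```python
K, M, G = 2**10, 2**20, 2**30

# (start, end, name) for the top-level regions listed in the memory map.
REGIONS = [
    (0,       640 * K, "Application Area"),
    (640 * K, 1 * M,   "System Area"),
    (1 * M,   1 * G,   "Main Memory"),
    (1 * G,   4 * G,   "PCI Memory"),
    (4 * G,   64 * G,  "For future expansion"),
]

def region_of(address: int) -> str:
    """Return the name of the region containing a physical address."""
    for start, end, name in REGIONS:
        if start <= address < end:
            return name
    raise ValueError("address is outside the 36-bit (64G) address space")

print(region_of(0x0009FFFF))   # last byte below 640K
print(region_of(0x000A0000))   # first byte at 640K
print(region_of(2 * G))
```

The upper bound of 64G corresponds to the 36-bit address bus listed for the Pentium II memory system.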
PENTIUM III
PENTIUM III

Improved version of PII, but based on Pro architecture, not on P II

Two versions of P III are available
 Packaged in a slot 1 cartridge instead of an IC chip, like the P II, with a non-blocking 512K cache running at half the speed of the processor
 Packaged in a 370-pin IC, known as Coppermine, with a 256K advanced transfer cache within the IC running at processor speed

It has been observed that increasing the cache size from 256K to 512K improves the performance by only a few percent
PENTIUM III (CONT...)

Chipset is different from P II

Coppermine increases the bus speed to either
100MHz or 133MHz

Bus speed cannot be increased arbitrarily due
to radiation problem
PENTIUM III (CONT...)

Various versions of P III are also available like PII
 Standard P III
 Celeron PIII uses 66MHz bus speed
 Xeon PIII allows larger cache for server
applications
PENTIUM PROCESSOR FEATURES
PENTIUM IV
PENTIUM IV
PENTIUM IV (CONT...)
MEMORY INTERFACE

Typically uses Intel 850 chipset

850 provides a dual-pipe memory bus to the processor
 Each pipe is interfaced to a 32-bit wide section of memory
 The two pipes function together to form the 64-bit data bus
PENTIUM 4

Still translates 80x86 instructions to micro-ops

P4 has a better branch predictor and more functional units (FUs)

Instruction cache holds micro-operations instead of 80x86 instructions
 No 80x86 decode stages on a cache hit; this cache is called the “trace cache” (TC).

Faster memory bus: 400 MHz v. 133 MHz.

Caches
 Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
 Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
 Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock

Clock rates:
 Pentium III 1 GHz v. Pentium IV 1.5 GHz
PENTIUM 4 FEATURES

Multimedia instructions are 128 bits wide vs. 64 bits wide => 144 new instructions
 When will they be used by programs?
 Faster floating point: executes two 64-bit FP operations per clock
 Memory FU: one 128-bit load and one 128-bit store per clock to XMM regs

Using RAMBUS DRAM
 Bandwidth faster, latency same as SDRAM
 Cost 2X-3X vs. SDRAM

ALUs operate at 2X clock rate for many ops

Pipeline doesn’t stall at this clock rate: uops replay

Rename registers: 40 (Pentium III) vs. 128 (Pentium 4); Window: 40 vs. 126

BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
No. | PARAMETER | PENTIUM 2 | PENTIUM 3 | PENTIUM 4
1 | Pipeline stages | 14 (17 with load & store/retire) | 12 (15 with load & store/retire) | 20
2 | Max. clock | 450 MHz | 450~1400 MHz | 800~3000 MHz
3 | Architecture | Intel's sixth-generation | P6 microarchitecture | NetBurst microarchitecture
4 | Clock speeds | 233 MHz to 450 MHz | 1.4 GHz | 1.3 GHz
5 | Processing capabilities | 32-bit | 32-bit |
6 | Instruction set | - | - | SSE2 and SSE3
7 | FSB speeds | 66 MHz to 100 MHz | 100 MHz to 133 MHz | 400 MT/s to 1066 MT/s
PENTIUM 4 NETBURST MICROARCHITECTURE
The Pentium 4 NetBurst microarchitecture was introduced by Intel in 2000.
It was designed to deliver high clock speeds and improved performance compared to its predecessors.
The key features of the NetBurst microarchitecture include:
Rapid Execution Engine (REX): This feature speeds up instruction execution by running the simple integer ALUs at twice the core clock, so that more simple operations complete per clock cycle.
Hyper Pipelined Technology: The NetBurst microarchitecture has a deeper pipeline than previous architectures, allowing for higher clock speeds and potentially higher performance.
Advanced Transfer Cache (ATC): The ATC is the on-die L2 cache, designed to deliver data to the processor core at high bandwidth.
SSE2 Instructions: The NetBurst microarchitecture introduced the Streaming SIMD Extensions 2 (SSE2) instruction set, which aimed to enhance multimedia and floating-point performance.
THE KEY FEATURES OF THE NETBURST
MICROARCHITECTURE INCLUDE:
THE NETBURST
MICROARCHITECTURE
A fast processor requires balancing and tuning of many microarchitectural features that compete for processor die cost and for design and validation efforts.
The figure shows the basic Intel NetBurst microarchitecture of the Pentium 4 processor.
There are four main sections:
1. IN-ORDER FRONT END,
2. OUT-OF-ORDER EXECUTION ENGINE,
3. INTEGER AND FLOATING-POINT EXECUTION UNITS, AND
4. MEMORY SUBSYSTEM.
ARCHITECTURE OF PENTIUM 4
PROCESSOR
 Generally, the architecture of the Pentium 4 processor consists of a Bus Interface Unit (BIU), Instruction Fetch and Decoder Unit, Trace Cache (TC), Microcode ROM, Branch Target Buffer (BTB), Branch Prediction, Instruction Translation Look-aside Buffer (ITLB), Execution Unit, and Rapid Execution Module.
 The architecture of the Pentium 4 processor has four different modules: (i) memory subsystem module, (ii) front-end module, (iii) integer/floating-point execution unit, and (iv) out-of-order execution unit.
ARCHITECTURE OF PENTIUM 4
PROCESSOR
 The memory subsystem module contains a Bus Interface Unit (BIU) and L3 cache (optional).
 The front-end module consists of the instruction decoder, Trace Cache (TC), microcode ROM, Branch Target Buffer (BTB) and branch prediction.
 The integer/floating-point execution unit has the L1 data cache and execution unit.
 The out-of-order execution unit consists of the execution unit and retirement.
IN-ORDER FRONT END
 The NetBurst microarchitecture has an advanced form of an instruction cache called the Execution Trace Cache.
 Unlike conventional instruction caches, the Trace Cache sits between the instruction decode logic and the execution core as shown in Figure 1.
 In this location the Trace Cache is able to store the already decoded IA-32 instructions, or uops.
 Storing already decoded instructions removes the IA-32 decoding from the main execution loop.
TRACE CACHE (TC)
 After the instruction decoder translates instructions into micro-operations (μ-ops), the streams of decoded instructions are fed to an instruction cache, which is known as the trace cache.
 This L1 instruction cache stores only the decoded stream of instructions, which are actually micro-operations (μ-ops). Because cached instructions need not be decoded again, the speed of execution is increased significantly.
 In a Pentium 4 processor, the trace cache can store up to 12 K μ-ops.
 Normally, the cache assembles the decoded μ-ops in program order into sequences called traces.
 A single trace contains many trace lines and each trace line has six μ-ops.
IN-ORDER FRONT END
Typically the instructions are decoded
once and placed in the Trace Cache and
then used repeatedly from there like a
normal instruction cache on previous
machines.
The IA-32 instruction decoder is only used
when the machine misses the Trace
Cache and needs to go to the L2 cache to
get and decode new IA-32 instruction
bytes.
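The decode-once, reuse-many behaviour of the trace cache can be modelled with a toy Python class. This is a sketch under simplifying assumptions, not the real hardware organisation: the class, the `toy_decode` function and the use of an address as lookup key are all hypothetical, capacity limits and eviction are left out, and 12 K is assumed to mean 12 × 1024.

```python
class TraceCache:
    """Toy model: decoded micro-ops are stored and reused on later fetches."""

    def __init__(self, decode):
        self._decode = decode      # stands in for the IA-32 instruction decoder
        self._traces = {}          # address -> list of decoded micro-ops
        self.decoder_uses = 0      # how often the decoder had to run

    def fetch(self, address):
        if address not in self._traces:        # trace cache miss
            self.decoder_uses += 1
            self._traces[address] = self._decode(address)
        return self._traces[address]           # hit: no decoding needed

def toy_decode(address):
    """Hypothetical decoder that turns one instruction into two micro-ops."""
    return [f"uop{i}@{address:#x}" for i in range(2)]

cache = TraceCache(toy_decode)
for _ in range(3):                 # e.g. three iterations of a loop body
    cache.fetch(0x1000)
print(cache.decoder_uses)          # the decoder ran only for the first fetch

# Capacity arithmetic from the slides: 12 K micro-ops, six per trace line.
TRACE_LINES = (12 * 1024) // 6
print(TRACE_LINES)
```

In this model a loop executed many times pays the decoding cost once, which is the effect the slides describe as removing IA-32 decoding from the main execution loop.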
OUT-OF-ORDER EXECUTION LOGIC
THE RETIREMENT LOGIC
INTEGER AND FLOATING-POINT
EXECUTION UNITS
MEMORY SUBSYSTEM
 This includes the L2 cache and the system bus.
 The L2 cache stores both instructions and data that cannot fit in the Execution Trace Cache and the L1 data cache.
 The external system bus is connected to the backside of the second-level cache and is used to access main memory when the L2 cache has a cache miss, and to access the system I/O resources.
NETBURST MICROARCHITECTURE
BUS INTERFACE UNIT (BIU)
The Bus Interface Unit (BIU) is used to communicate with the system bus, cache bus, L2 cache, L1 data cache and L1 code cache.
INSTRUCTION DECODER
The instruction decoder decodes the instructions of the Pentium 4 processor and translates them into micro-operations (μ-ops).
The decoder handles one instruction per clock cycle.
Simple instructions are translated into one μ-op, but other instructions are translated into several μ-ops. A complex instruction usually requires more than four μ-ops; the decoder does not decode such instructions itself and transfers the task to the microcode ROM.
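The split between the hardware decoder and the microcode ROM can be written as a one-line rule. The sketch below is only an illustration of the threshold mentioned above (more than four μ-ops); the function name and the example μ-op counts are hypothetical.

```python
MAX_DECODER_UOPS = 4   # threshold taken from the description above

def uop_source(uop_count: int) -> str:
    """Return which unit supplies the micro-ops for one instruction."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-op")
    return "instruction decoder" if uop_count <= MAX_DECODER_UOPS else "microcode ROM"

# Hypothetical micro-op counts, chosen only to show both paths.
print(uop_source(1))    # a simple instruction
print(uop_source(12))   # a complex instruction such as a string operation
```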
MICROCODE ROM
 Because complex instructions perform string operations, interrupt operations, etc., the trace cache transfers control of complex instructions to the microcode ROM.
 The microcode ROM then generates the micro-operations (μ-ops) of the complex instruction.
 After the micro-operations (μ-ops) are issued by the microcode ROM, control returns to the trace cache.
 Subsequently, the μ-ops delivered by the trace cache and by the microcode ROM are buffered in a queue in program order. The μ-ops are then fed to the execution unit for execution.
BRANCH PREDICTION
BRANCH PREDICTION
INSTRUCTION TRANSLATION LOOK
ASIDE BUFFER (ITLB)
EXECUTION UNIT
ALLOCATOR
 The allocator accepts micro-operations (µ-ops) from the μ-ops queue and allocates the key machine buffers needed to execute them.
 The allocator has 126 re-order buffer entries, 128 integer and 128 floating-point physical registers, 48 load and 24 store buffer entries.
 Since there are two logical processors in a Pentium 4 with hyper-threading, each logical processor can use at most half the entries, that is, 63 re-order buffer entries, 24 load buffer entries and 12 store buffer entries.
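The per-logical-processor limits follow from halving the totals. A minimal sketch of that arithmetic, with hypothetical names for the dictionary and helper function:

```python
# Buffer totals quoted for the allocator.
ALLOCATOR_ENTRIES = {
    "re-order buffer": 126,
    "load buffer": 48,
    "store buffer": 24,
}

def per_logical_processor_limit(entries: dict, logical_processors: int = 2) -> dict:
    """Split each buffer evenly between the logical processors."""
    return {name: total // logical_processors for name, total in entries.items()}

print(per_logical_processor_limit(ALLOCATOR_ENTRIES))
```

The result is 63 re-order buffer entries, 24 load buffer entries and 12 store buffer entries per logical processor.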
REGISTER RENAME
INSTRUCTION SCHEDULERS
 The instruction scheduler is used to schedule micro-operations (μ-ops) to an appropriate execution unit.
 There are five instruction schedulers to schedule micro-operations to different execution units.
 Therefore, multiple μ-ops can be dispatched in each clock cycle.
 A micro-operation can be executed only when its operands are available and the specific execution unit it needs is free.
 In this way, the scheduling strategy dispatches μ-ops as soon as their operands are ready and the execution units are available.
 Each scheduler has its own scheduler queue of eight to twelve entries from which the scheduler selects μ-ops to transmit to the execution units.
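The dispatch rule described above (operands ready and execution unit free) can be illustrated with a toy single-cycle scheduler. Everything in this sketch is an assumption made for illustration: the `Uop` fields, the unit names and the oldest-first selection are not taken from the slides, and the real schedulers are considerably more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Uop:
    name: str
    unit: str          # execution unit this micro-op needs
    sources: tuple     # registers that must be ready before it can run

def dispatch_cycle(queue, ready_registers, free_units):
    """Dispatch, oldest first, every uop whose operands are ready and whose unit is free."""
    dispatched = []
    free = set(free_units)
    for uop in list(queue):                      # iterate over a copy while removing
        operands_ready = all(src in ready_registers for src in uop.sources)
        if operands_ready and uop.unit in free:
            dispatched.append(uop)
            free.remove(uop.unit)                # one uop per unit per cycle
            queue.remove(uop)
    return dispatched

queue = [
    Uop("add",  "alu0", ("r1", "r2")),
    Uop("mul",  "fp",   ("r3",)),    # must wait: r3 is not ready yet
    Uop("load", "agu",  ("r5",)),
]
sent = dispatch_cycle(queue, ready_registers={"r1", "r2", "r5"},
                      free_units={"alu0", "fp", "agu"})
print([u.name for u in sent])    # uops dispatched this cycle
print([u.name for u in queue])   # uops still waiting
```

Note that the younger `load` is dispatched ahead of the waiting `mul`, which is the essence of out-of-order execution.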
RAPID EXECUTION MODULE
There are two ALUs (Arithmetic Logic Units) and two AGUs (Address Generation Units) in a Pentium 4 processor.
The ALUs and AGUs operate at twice the processor speed.
For example, if the processor works at 1.4 GHz, the ALUs can operate at 2.8 GHz.
Hence, each ALU can execute two simple integer operations per processor clock cycle.
All integer calculations such as addition, subtraction, multiplication, division and logical operations are performed in the arithmetic and logic unit.
AGUs are used to resolve indirect modes of memory addressing.
The ALUs and AGUs are very useful for high-speed processing.
MEMORY SUBSYSTEM
 The virtual memory and paging technique are
used in memory-subsystem representation.
 The linear address space can be mapped into
the processor’s physical address space, either
directly or using a paging technique.
 In direct mapping, paging is disabled and each
linear address represents a physical address.
 Then linear address bits are sent out on the
processor’s address lines without translation.
 When the paging mechanism becomes enabled,
the linear address space is divided into pages.
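The two mapping modes can be sketched in Python. The sketch assumes the classic IA-32 layout with 4 KB pages and a 10/10/12-bit split of the 32-bit linear address, which the slides do not spell out; the page tables are modelled as a plain dictionary, and all names are hypothetical.

```python
def split_linear_address(linear: int):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    directory = (linear >> 22) & 0x3FF   # top 10 bits
    table = (linear >> 12) & 0x3FF       # middle 10 bits
    offset = linear & 0xFFF              # low 12 bits: offset inside a 4 KB page
    return directory, table, offset

def translate(linear: int, paging_enabled: bool, page_frames: dict) -> int:
    """Map a linear address to a physical address."""
    if not paging_enabled:
        return linear                    # direct mapping: no translation
    directory, table, offset = split_linear_address(linear)
    frame = page_frames[(directory, table)]   # physical page frame number
    return (frame << 12) | offset

# Hypothetical mapping: the page at directory 1, table 3 lives in frame 0x12345.
page_frames = {(1, 3): 0x12345}

print(split_linear_address(0x00403025))
print(hex(translate(0x00403025, paging_enabled=False, page_frames=page_frames)))
print(hex(translate(0x00403025, paging_enabled=True, page_frames=page_frames)))
```

The offset inside the page is never translated; only the page number is replaced by a frame number.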
MEMORY SUBSYSTEM
BASIC PENTIUM 4 PIPELINE
[Figure: pipeline stage labels TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Queue, Schd, Schd, Schd, Disp, Disp, Reg, Reg, Ex, Flags, Br Chk, Drive]
1-2 trace cache next instruction pointer
3-4 fetch micro-ops from Trace Cache
5 drive micro-ops to alloc
6 alloc resources (ROB, reg, …)
7-8 rename logic reg to 128 physical reg
9 put renamed micro-ops into queue
10-12 write micro-ops into scheduler
13-14 move up to 6 micro-ops to FU
15-16 read registers
17 FU execution
18 compute flags, e.g. for branch instructions
19 check branch output with branch prediction
20 drive branch check result to frontend
HYPER-THREADING
 Over the last few decades, the Internet and telecommunication industries have seen unprecedented growth.
 The traditional microarchitecture is not sufficient to meet the requirements of the upcoming telecommunication industries.
 Therefore, processor designers are looking for another architecture where the ratio between cost and gain is more reasonable.
 Hyper-threading (HT) technology is one solution.
 This technology was first implemented in the Pentium® 4 Xeon processor in 2002.
HYPERTHREADING
 Hyperthreading is a technology introduced by Intel
that allows a single physical processor core to
behave like two logical processors.
 This means that the processor can work on two
sets of tasks simultaneously, potentially improving
overall performance and efficiency.
 Each logical processor has its own architectural
state, allowing it to work independently.
 Hyperthreading can be particularly beneficial in
multitasking environments, where multiple threads
of execution are present.
HYPERTHREADING
 It allows the processor to better utilize its
resources and improve overall throughput.
However, the actual performance benefits of
hyperthreading depend on the specific workload
and the software being used.
 It's important to note that while hyperthreading can
improve performance in certain scenarios, it may
not provide significant benefits in all use cases.
 Additionally, the effectiveness of hyperthreading
can vary depending on the specific processor and
the nature of the tasks being performed.
HYPER-THREADING
ARCHITECTURE STATE (AS)
Hyper-threading technology was introduced on the Intel Pentium 4 Xeon™ processor.
In this processor, there are two logical processors in a single physical processor.
Each logical processor has a complete set of the architecture state.
The architecture state has the following registers:
Registers including the general-purpose registers
Control registers
Advanced Programmable Interrupt Controller (APIC) registers
Machine state registers
Since two architecture states are present in a single physical processor, the processor appears as two processors from the software perspective.
There are three types of resources in hyper-threading technology: replicated resources, shared resources, and shared/replicated resources.
HYPER-THREADING
The features of hyper-threading technology are given below:
HT makes a single physical processor appear as multiple logical processors.
Each logical processor has its own architecture state, while a single set of execution units is shared between the logical processors.
HT allows a single processor to fetch and execute two separate code streams simultaneously.
In most applications, the physical unit is shared by two logical units.
HYPER-THREADING
 Each process has a context that describes all the information related to the current state of execution of the process.
 In any process, the contents of the CPU registers, the program counter and the flag register are used as the context.
 Each process has at least one thread, and sometimes more than one thread is present in a process.
 Each thread has its own local context.
 Sometimes the context of a process is shared by the other threads in that process.
 The common features of threads are as follows:
 The threads can be independent in a process.
 The threads can be bunched together into a process.
 The threads may be simple in structure and can be used to increase the speed of operation of the process.
HYPER-THREADING
 Many processes may run on different processors in a multiprocessor system.
 Different threads of the same process can also run on different processors. Therefore, multiple threads improve the performance of a multiprocessor system.
 Intel’s hyper-threading technology introduces the concept of simultaneous multithreading to the Intel architecture.
HYPER-THREADING
 Presently, the trend is to run multi-threaded applications on multi-processor systems.
 The most common multiprocessing approaches are Symmetric Multi-Processing (SMP) and Chip Multi-Processing (CMP).
 Although a symmetric multi-processor has better performance, the die size is still significantly large, which causes higher cost and power consumption.
 Chip multi-processing puts two processors on a single die. Each processor has a full set of execution and architectural resources.
 The processors can share an on-chip cache. CMP is orthogonal to conventional multiprocessor systems.
 The cost of a CMP processor is still high, as the die size is larger than that of a single-core chip and power consumption is also high.
 A single processor with multi-processing/multi-threading or CMP can be supported in different ways such as Time-Sliced Multi-threading (TSM), Switch-on Event Multi-threading (SEM) and Simultaneous Multi-Threading (SM).
 Time-Sliced Multi-threading (TSM) In time-sliced multi-threading, the processor switches from one task to another after a fixed amount of time has passed. This technique is also called real multitasking. As there is only one processor, there will always be some loss of execution cycles in time-sliced multi-threading. But each thread must get the attention of the processor whenever its turn comes. When there is a cache miss, the processor will switch to another thread automatically.
 Switch-on Event Multi-threading (SEM) The processor could be designed to switch to another task whenever a cache miss occurs.
 Simultaneous Multi-threading (SM) In simultaneous multi-threading or hyper-threading, multiple threads may be executed on a single processor without switching. When multiple threads are executed simultaneously, it leads to better use of resources. Hyper-threading (HT) technology is the implementation of SM in the Intel architecture.
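Time-sliced multi-threading can be illustrated with a small round-robin simulation. This is a sketch, not a model of any real scheduler: the function name, the thread names and the cycle counts are hypothetical, and the cost of each switch is ignored.

```python
from collections import deque

def time_sliced_order(work: dict, time_slice: int) -> list:
    """Run threads round-robin; return (thread, cycles run) for every slice.

    work maps a thread name to the number of cycles it still needs.
    """
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)       # run until the slice ends or the thread finishes
        timeline.append((name, run))
        if remaining > run:                    # not finished: go to the back of the queue
            queue.append((name, remaining - run))
    return timeline

# Thread A needs 5 cycles, thread B needs 2; each slice lasts 2 cycles.
print(time_sliced_order({"A": 5, "B": 2}, time_slice=2))
```

Only one thread runs in any slice here, whereas in simultaneous multi-threading micro-ops from both threads can occupy the execution resources in the same cycle.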
TYPES OF RESOURCES IN HT
 Replicated Resources Each logical processor has its own general-purpose registers, control registers, flags, time stamp counters, and APIC registers. The contents of these registers are replicated resources.
 Shared Resources Memory and range registers can be read and written independently. Therefore, memory, range registers and data buses are used as shared resources.
 Shared/Replicated Resources The caches and queues in the hyper-threading pipeline can be shared or not shared according to the situation.
 Logical processors share resources on the physical processor, such as caches, execution units, branch predictors, control logic, and data buses.
 Each logical processor has its own advanced programmable interrupt controller. Usually, interrupts are sent to a specific logical processor for proper handling.
THREAD-LEVEL PARALLELISM
 In most applications, multiple processes or threads may be executed in parallel. This kind of parallel execution is called thread-level parallelism, and it gives better performance in online applications such as server systems and Internet applications.
 When the time slice of the currently executing process ends in time-sliced multi-threading, its context is saved into memory.
 Whenever the process starts execution again, the context of the process is restored to exactly the same state.
 Therefore, this procedure consists of the following operations:
 Save the context of the currently executing process after the time slice is over.
 Flush the CPU of the same process.
 Load the context of the next process; this is known as a context switch.
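The save, flush and load steps can be sketched with plain dictionaries standing in for the CPU registers and for the saved contexts in memory. The function name, the register names and the process names are hypothetical, chosen only for this illustration.

```python
def context_switch(cpu_registers: dict, saved_contexts: dict, current: str, upcoming: str) -> None:
    """Switch the toy CPU from process `current` to process `upcoming`."""
    saved_contexts[current] = dict(cpu_registers)    # 1. save the current context
    cpu_registers.clear()                            # 2. flush the CPU state
    cpu_registers.update(saved_contexts[upcoming])   # 3. load the next context

cpu = {"pc": 100, "ax": 1, "flags": 0}                  # process A is running
memory = {"B": {"pc": 500, "ax": 9, "flags": 2}}        # B was saved earlier

context_switch(cpu, memory, current="A", upcoming="B")
print(cpu)          # now holds B's context
context_switch(cpu, memory, current="B", upcoming="A")
print(cpu)          # A resumes in exactly the state it was saved in
```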
Streaming SIMD Extension (SSE), Extension 2 (SSE2) and Extension 3 (SSE3)
Instructions
When the MMX instructions are extended to incorporate floating-point instructions, the extended instructions are called Streaming SIMD Extensions (SSE) instructions. SSE instructions were first used in the Pentium III, and the SSE instruction set was further enhanced in the Pentium 4.
The features of SSE instructions are given below:
Streaming SIMD Extensions (SSE) instructions
SSE instructions are SIMD instructions for single-precision floating-point numbers.
SSE instructions can operate on four 32-bit floating-point numbers in parallel.
A set of eight new SIMD floating-point registers is specifically defined for SSE.
The SSE registers are named XMM0 through XMM7.
Each SSE register is 128 bits long, allowing 4 x 32-bit numbers to be handled in parallel.
As different registers have been allocated, it is possible to execute both fixed-point and floating-point operations simultaneously.
The SSE instructions can execute non-SIMD floating-point and SIMD floating-point instructions concurrently.
The SSE instructions can operate on packed data or on scalar data and increase the speed of manipulation of 128-bit SIMD integer operations.
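The idea of four single-precision values packed into one 128-bit register can be modelled with Python's `struct` module. This is a software illustration of the lane-wise behaviour of a packed add (the SSE instruction ADDPS), not real SIMD execution; the helper names are hypothetical.

```python
import struct

def pack_xmm(values):
    """Pack four single-precision floats into 16 bytes (128 bits)."""
    if len(values) != 4:
        raise ValueError("an XMM register holds exactly four 32-bit floats")
    return struct.pack("<4f", *values)

def unpack_xmm(register: bytes):
    """Return the four 32-bit lanes of a packed register as a tuple."""
    return struct.unpack("<4f", register)

def addps(a: bytes, b: bytes) -> bytes:
    """Lane-wise add of two packed registers, modelling a packed single-precision add."""
    return pack_xmm([x + y for x, y in zip(unpack_xmm(a), unpack_xmm(b))])

xmm0 = pack_xmm([1.0, 2.0, 3.0, 4.0])
xmm1 = pack_xmm([0.5, 0.5, 0.5, 0.5])

print(len(xmm0) * 8)                  # register width in bits
print(unpack_xmm(addps(xmm0, xmm1)))  # all four sums from one packed operation
```

A scalar SSE instruction would instead operate on the lowest lane only and leave the other three unchanged.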
Streaming SIMD Extension 2 (SSE2) Instructions
The SSE instructions can be grouped as data-transfer instructions, data-type conversion instructions, arithmetic, logic and comparison instructions, jump or branch instructions, data management and ordering instructions, shuffle instructions, cacheability instructions and state-management instructions.
Streaming SIMD Extension 2 (SSE2) Instructions
In the Pentium 4, the pipeline depth is increased significantly, and the execution rate for all instructions is improved. About 144 new instructions are added to the SSE instruction set, which allow up to 4 Internet/multimedia based operations to be executed simultaneously in the Pentium 4 processor.
These new instructions and the other improvements are called Streaming SIMD Extension 2 (SSE2) instructions.
The SSE2 instructions support new data types, namely double-precision floating-point numbers. The Intel NetBurst microarchitecture extended its SIMD capabilities by adding SSE2.
Streaming SIMD Extension 3 (SSE3)
Instructions
The Streaming SIMD Extensions 3 (SSE3) instructions were introduced in the next-generation Pentium 4 processor.
This version was developed by Intel in 2004, when the latest version of the Pentium 4, the Prescott, was released.
The SSE2 instruction set was extended to SSE3 by adding 13 additional SIMD instructions.
The SSE3 instructions are used for the following operations:
Complex arithmetic operations
Floating-point-to-integer conversion
Video encoding
Thread synchronization
SIMD floating-point operations using array-of-structures format
BLOCK DIAGRAM OF PENTIUM 4
