Microprocessor Theory Part 2

The document provides an in-depth overview of the Intel 80386DX microprocessor, detailing its salient features such as a 32-bit address and data bus, support for virtual memory, and multitasking capabilities. It describes the architecture, including the Bus Interface Unit, Prefetch Unit, Decode Unit, Execution Unit, and Memory Management Unit, along with the operating modes: Real Mode, Protected Mode, and Virtual Mode. Additionally, it explains the protection mechanisms, register organization, and the EFLAGS register, highlighting the address translation mechanism used in the processor.


Microprocessor -2

Chapter 4: Intel 80386DX Processor.

Salient Features of 80386:


1. Address Bus:
• Has a 32-bit address bus, meaning it can access a total of 2^32 = 4 GB of physical memory.
• Though the address bus is 32 bits wide, only 30 lines (A31–A2) are used externally; A1 and A0 are used internally to
produce the bank-enable signals.
• Memory ranges from 0000 0000 H to FFFF FFFF H.
2. Data Bus:
• 80386 has a 32-bit data bus, so it can transfer 32 bits of data at once.
• It also has a 32-bit ALU, meaning it can do operations on 32-bit numbers in one go.
• So, 80386 is called a "32-bit microprocessor".
• 32-bit data is stored in 4 consecutive memory locations.
• To transfer 32-bit data at once, memory is split into 4 banks of 1 GB each, controlled by 4
bank-enable signals.
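The bank selection described above can be sketched as a small model. This is a simplification for illustration: on the real chip the BE0#–BE3# signals are also shaped by the bus-cycle type, and a misaligned access takes extra bus cycles.

```python
def bank_enables(address, size):
    """Return which of the four byte banks (BE0..BE3) are enabled
    for an access of `size` bytes starting at `address`.
    Simplified model: A1:A0 select the starting bank, and only
    accesses that fit within one 4-byte row are modelled."""
    start = address & 3  # A1:A0 pick the starting bank
    assert start + size <= 4, "access crosses a 4-byte boundary"
    return [start <= bank < start + size for bank in range(4)]
```

For example, a 32-bit read at an aligned address enables all four banks, while a single byte at address 0x1001 enables only bank 1.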

3. Address Pipelining:
• 80386 uses address pipelining – it sends the next address during the current cycle (T2
state).
• This hides delays from the decoder and is helpful for slower devices, reducing wait time.

4. Virtual Memory:
• 80386 supports Virtual Memory using Segmentation and Paging.
• It can access up to 64 TB (2^46) of virtual memory.

5. Protection:
• It uses a Protected Mode to safely access memory and I/O.
• There are 4 Privilege Levels to control access.

6. Multitasking:
• 80386 supports Multitasking using time-sharing.
• Multiple tasks can run by taking small turns, improving system performance.

7. I/O Addressing:
• It uses a 16-bit I/O address, allowing access to 65,536 (64 K) I/O ports (from address 0000h
to FFFFh).
Q.1 Explain the Architecture of 80386 with neat block diagram.
Ans.

80386 Microprocessor Architecture – Block Diagram


The architecture of the Intel 80386 microprocessor is divided into five major functional units, each
responsible for specific operations:

1. Bus Interface Unit (BIU)


• The Bus Unit is responsible for transferring data in and out of the processor.
• It is connected to external memory and I/O devices via the system bus.
• It handles instruction fetch requests from the Prefetch Unit and data transfer requests from
the Execution Unit.
• In case of simultaneous requests, preference is given to the Execution Unit.

2. Prefetch Unit
• The Prefetch Unit fetches upcoming instructions to implement pipelining and stores them in
the Prefetch Queue.
• It fetches 16 bytes of the program in advance and refills the queue when at least 4 bytes are
vacant (due to the 32-bit data bus).
• During control transfer instructions (like branches), the pre-fetched instructions become
invalid and are discarded.

3. Decode Unit
• The Decode Unit is responsible for decoding instructions into micro-operations.
• It decodes up to three instructions simultaneously and stores them in a queue in micro-
coded form.
• Like the prefetch queue, on a control transfer instruction the decoded instructions are
discarded as they become invalid.

4. Execution Unit
• This unit performs actual execution of instructions.
• It consists of:
o A 32-bit Arithmetic and Logic Unit (ALU) for arithmetic and logical operations.
o Dedicated circuits for 32-bit multiplication and division.
o A 64-bit Barrel Shifter for fast shifting operations.
o A set of 32-bit General Purpose Registers (GPRs)(e.g., EAX, EBX).
o A 32-bit EFLAGS Register that indicates the status of the processor after operations.

5. Memory Management Unit


The memory unit translates virtual addresses into physical addresses. It is further divided into:
(a) Segment Unit
• Converts Logical Address → Linear Address.
• Uses Segment Registers and a Segment Translator.
• Segmentation is compulsory in 80386.
(b) Page Unit
• Converts Linear Address → Physical Address.
• Uses a Translation Lookaside Buffer (TLB) and a Page Translator.
• Paging is optional, but enables advanced memory management and virtual memory (up to
64 TB).

Q.2 Explain the operating modes of 80386? (5M, May-2021)


Ans.

When a computer starts, the 80386 microprocessor first enters Real


Mode, where it behaves like an older 8086 processor. This mode is
mainly used during booting, to run basic setup programs like BIOS. It
has limited features — only 1 MB of memory, no protection, and
basic 16-bit registers. Once the system is set up, the PE (Protection
Enable) bit is set to 1 in the control register CR0, which switches the
processor to Protected Mode. In Protected Mode, the full features of
80386 become available, such as paging, multitasking, and memory
protection. This mode is secure and allows the user to fully utilize the
capabilities of the processor. If old 8086 programs need to be run
inside this mode, the processor can enter Virtual Mode by setting the
VM bit to 1. To exit Virtual Mode, VM is set back to 0. Unlike the
80286, the 80386 can also switch back from Protected Mode to Real
Mode when required, by clearing the PE bit of CR0.

Operation Modes of 80386 Microprocessor


The Intel 80386 microprocessor operates in three distinct modes:

1. Real Mode (16-bit Mode)


• It is the default mode selected when the processor is reset.
• In this mode, the 80386 behaves like a fast 8086 processor.
• Only 1 MB of memory is accessible with 20-bit physical
addressing.
• Address calculation:
Physical Address = Segment Address × 10H + Offset
Eg: 5142H × 10H + 0006H = 51426H
• Only 16-bit registers (AX, BX, etc.) and segment registers (CS, DS, SS, ES, FS, GS) are used.
• 32-bit registers and advanced features are not used.
• Used to run BIOS and DOS-based systems.
• To enter Protected Mode, set the PE (Protection Enable) bit of CR0 register to 1.
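The real-mode address calculation above is easy to check in a few lines (the function name is illustrative):

```python
def real_mode_physical(segment, offset):
    """Real Mode: Physical Address = Segment Address x 10H + Offset.
    The result is masked to 20 bits, matching the 1 MB address space."""
    return ((segment << 4) + offset) & 0xFFFFF
```

Running it on the example from the text, segment 5142H with offset 0006H yields 51426H.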

2. Protected Mode (32-bit Mode)


• Provides full features of 32-bit architecture, allowing
access to 4 GB memory.
• Supports segmentation + paging for advanced memory
protection and multitasking.
• Allows Virtual Memory up to 64 TB.
• Implements four Privilege Levels (PL0 to PL3) for
process isolation and system security:
o PL0: OS Kernel (most privileged)
o PL1: System Services (e.g., drivers)
o PL2: OS Extensions
o PL3: User Applications (least privileged)
• Enables hardware-based protection, where lower-privilege processes can't access higher-
privilege data/code.
• Supports 32-bit instructions, registers, and advanced control/debug/test registers.

3. Virtual Mode (Virtual 8086 Mode)


• A sub-mode of Protected Mode, used to run 8086 programs in a virtual environment.
• Each task gets a virtual 8086 environment with its own 1 MB address space.
• Useful for running DOS applications in multitasking operating systems (like Windows).
• Provides the benefits of Protected Mode while still supporting Real Mode programs.
Q.3 Explain the protection mode in detail.
Ans.
Q. 4 Explain the protection mechanism in 80386DX with neat diagram.
Ans.
When a memory location is accessed in 80386DX, the first step is address translation, and during this
process, the protection mechanism comes into action to ensure secure and valid access.
80386 performs three types of checks for protection:

1. Limit Check:
• Every segment has a limit (size).
• The offset used to access memory is compared with this limit.
• Depending on the size of the data being accessed, the condition changes:
Data Size  Example Instruction  Condition           Explanation
8-bit      MOV CL, [2000]       Offset ≤ Limit      Only one byte accessed. If the limit is 2000, it's valid.
16-bit     MOV CX, [2000]       Offset ≤ Limit - 1  Two bytes accessed (2000 and 2001). The limit must allow both.
32-bit     MOV ECX, [2000]      Offset ≤ Limit - 3  Four bytes accessed (2000–2003). So the limit must be ≥ 2003.
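All three conditions in the table reduce to one inequality, sketched here as a hypothetical helper (not actual 80386 microcode):

```python
def limit_check(offset, size_bytes, limit):
    """A segment access of `size_bytes` starting at `offset` is valid
    only if every byte touched lies within the segment limit,
    i.e. offset + size_bytes - 1 <= limit."""
    return offset + size_bytes - 1 <= limit
```

With a limit of 2000, a byte access at offset 2000 passes, a word access at 2000 fails, and a doubleword access at 2000 needs the limit to be at least 2003.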

2. Type Check:
• Each segment has a descriptor that defines its type.
• The type field has:
o S = 0 → System segment
o S = 1 → User segment
o E, ED, W bits define if it's code/data and whether it's readable/writable.
• Example:
If S = 1, E = 0, ED = 0, W = 0 → It’s a data segment, read-only, and valid for read operation.
This check ensures the type of operation (read/write) is allowed on the segment.

3. Privilege Check:
• 80386 uses Privilege Levels (PL) from PL0 (highest) to PL3 (lowest).
• A program at a lower privilege level cannot access higher privileged
segments.
There are 3 important terms here:
• CPL (Current Privilege Level): Privilege of the current code
segment.
• RPL (Requested Privilege Level): Privilege requested in the
segment selector (last 2 bits).
• DPL (Descriptor Privilege Level): Privilege of the segment being
accessed.

Access Rule:
To access a segment:
Target DPL ≥ max(CPL, RPL)
If this condition is true, access is allowed.
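The access rule can be expressed directly. This is a sketch of the data-segment privilege check only; the real processor also performs the type and limit checks described above.

```python
def may_access(cpl, rpl, dpl):
    """80386 data-segment privilege check. Numerically lower PL means
    more privileged (PL0 = highest). Access is granted when the
    target's DPL is numerically >= the effective privilege
    max(CPL, RPL)."""
    return dpl >= max(cpl, rpl)
```

For example, kernel code (CPL 0, RPL 0) may access a user data segment (DPL 3), but user code (CPL 3) may not access a kernel segment (DPL 0).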
Q.5 Differentiate between Real Mode, Protected Mode and Virtual mode.
Ans.
Q.6 Explain the Register organization of 80386. [10M].
Ans.
Register Organization of 80386
The 80386 microprocessor is an advanced 32-bit processor that significantly extends the register
architecture of its predecessors like 8086 and 80286. It supports both 16-bit and 32-bit operations,
and introduces several new registers to enhance performance, memory handling, and system
control.

1. General-Purpose Registers
The 80386 has eight general-purpose registers, each 32 bits wide; they can also be accessed as 16-bit
or 8-bit registers when needed. These include:
• EAX, EBX, ECX, EDX – Extended versions of AX, BX, CX, DX.
• ESI, EDI, EBP, ESP – Extended Source Index, Destination Index, Base Pointer, and Stack
Pointer.
These are used for arithmetic, logic, data movement, and stack operations. For example, EAX
can be split as AX, and AX further into AH (high byte) and AL (low byte).
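The register aliasing described above can be illustrated with simple bit masking (the helper name is illustrative):

```python
def split_eax(eax):
    """Show how EAX aliases AX, AH and AL as bit-fields of one
    32-bit value."""
    ax = eax & 0xFFFF      # AX = low 16 bits of EAX
    ah = (ax >> 8) & 0xFF  # AH = high byte of AX
    al = ax & 0xFF         # AL = low byte of AX
    return ax, ah, al
```

So if EAX holds 12345678H, then AX = 5678H, AH = 56H and AL = 78H.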

2. Segment Registers
80386 supports six segment registers to handle memory segmentation:
• CS, DS, ES, SS – Inherited from earlier processors.
• FS and GS – New in 80386, allowing access to extra memory segments without changing
others.
These registers store selectors in protected mode and segment addresses in real mode,
enabling flexible and secure memory management.

3. Instruction Pointer & Flags


• EIP (Extended Instruction Pointer) is a 32-bit register that points to the next instruction. It
replaces the 16-bit IP from earlier processors and allows addressing up to 4 GB of memory.
• EFLAGS is a 32-bit register that holds status and control flags. It includes all previous flags
and adds:
o VM (Virtual Mode): Enables virtual 8086 mode inside protected mode.
o RF (Resume Flag): Used in debugging to control breakpoints.

4. Control Registers (CR0–CR3)


80386 introduces four 32-bit control registers used by the operating system:
• CR0: Controls operating modes like real/protected.
• CR1: Reserved.
• CR2: Holds page fault linear address.
• CR3: Holds page directory base address for paging.
These are vital for memory management, especially in protected mode and virtual memory
systems.

5. Debug Registers (DR0–DR7)


80386 includes eight 32-bit debug registers used for hardware-level debugging:
• DR0–DR3: Store breakpoint addresses.
• DR4, DR5: Reserved.
• DR6, DR7: Hold status and control info for breakpoints.
These help in efficient error tracking during software development.

6. Test Registers (TR6, TR7)


Used for testing cache and paging features:
• TR6: Test Control.
• TR7: Test Status.
These are used internally or by system software for performance tuning.

7. System Address and Segment Registers


80386 uses special registers to manage descriptor tables:
• GDTR, IDTR: Hold base addresses for Global and Interrupt Descriptor Tables.
• LDTR, TR: Refer to Local Descriptor Table and Task Register respectively.
These are essential for task switching, interrupt handling, and segmentation in protected
mode.
Q.7 Explain the EFLAG register of 80386. [10M] (Dec 2023)
Ans.
The EFLAGS register is a 32-bit register in the Intel 80386 microprocessor that holds the current state
of the processor. It stores flags that reflect the outcomes of arithmetic and logical operations, control
certain operations, and indicate system states.
The lower 16 bits of EFLAGS are called FLAGS and are compatible with the 8086/80286 architecture,
enabling backward compatibility. The register is divided into three categories of flags:

1. Status Flags
These indicate the results of arithmetic/logical instructions.
Flag  Full Name             Description

CF    Carry Flag            Set when an operation results in a carry or borrow. Used in multi-byte arithmetic and shift operations.

PF    Parity Flag           Set if the least significant byte of the result has even parity (even number of 1s).

AF    Auxiliary Carry Flag  Set if there is a carry/borrow between the lower and upper nibbles (used in BCD arithmetic).

ZF    Zero Flag             Set if the result of an operation is zero.

SF    Sign Flag             Reflects the sign of the result. Set if the result is negative (MSB = 1).

OF    Overflow Flag         Set if the signed result is too large to fit in the destination register. Useful in signed arithmetic.

Example:
Adding +118 (01110110) and +54 (00110110) gives the bit pattern 10101100. The true sum, +172, exceeds
the 8-bit signed range (+127); interpreted as a signed byte the stored result reads as -84, so OF is set.
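The flag behaviour above can be modelled for an 8-bit add. This sketch computes only CF, ZF, SF and OF; the operands are taken as unsigned byte values 0–255.

```python
def add8_flags(a, b):
    """8-bit add and the status flags it produces (CF, ZF, SF, OF)."""
    result = (a + b) & 0xFF
    cf = (a + b) > 0xFF                 # unsigned carry out of bit 7
    zf = result == 0
    sf = bool(result & 0x80)            # MSB = sign bit
    # signed overflow: both operands share a sign the result lacks
    of = ((a ^ result) & (b ^ result) & 0x80) != 0
    return result, cf, zf, sf, of
```

Running the example from the text, 118 + 54 produces 10101100B (ACH) with OF set and CF clear, since the unsigned sum 172 still fits in 8 bits.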

2. Control Flags
These control CPU instruction behaviors.
Flag  Full Name       Description

DF    Direction Flag  Determines string operation direction. Cleared (0) = forward, Set (1) = backward.

IF    Interrupt Flag  Enables or disables maskable hardware interrupts. Set = enabled, Clear = disabled.

TF    Trap Flag       Enables single-step debugging. Each instruction generates an exception for debugging purposes.
3. System Flags
Used mainly by the operating system.
Flag  Full Name            Description

IOPL  I/O Privilege Level  2-bit field indicating the privilege level required to execute I/O instructions.

NT    Nested Task          Set when one task invokes another (task switching).

RF    Resume Flag          Used to resume execution after handling debug exceptions.

VM    Virtual Mode         Enables Virtual 8086 mode, allowing real-mode programs to run in protected mode.

Q.8 Explain the Address Translation Mechanism implemented on 80386 DX.


Ans.
In 80386, the address translation happens in two main steps:
Step 1: Logical Address → Linear Address (Segmentation)
• A logical address has two parts: Selector + Offset.
• The Selector is used to access a Segment Descriptor from the GDT/LDT (Global/Local
Descriptor Table).
• The Segment Descriptor gives the Base Address of the segment.
• Then, Linear Address = Segment Base Address + Offset.
• This is called Segmentation.
Example:
If Base = 1000H and Offset = 0020H, then Linear Address = 1020H.

Step 2: Linear Address → Physical Address (Paging)


• The 80386 uses a two-level paging mechanism to convert Linear to Physical Address.
• The 32-bit Linear Address is divided into 3 parts:
o Top 10 bits → Page Directory Index (selects 1 of 1024 PDEs)
o Next 10 bits → Page Table Index (selects 1 of 1024 PTEs)
o Last 12 bits → Offset (within 4 KB page)
Step-by-step:
1. CR3 Register (PDBR) gives the base address of Page Directory.
2. First 10 bits select a Page Directory Entry (PDE).
3. PDE gives the base address of a Page Table.
4. Next 10 bits select a Page Table Entry (PTE).
5. PTE gives the base address of the Page Frame in physical memory.
6. The last 12 bits (offset) are added to this to get the Physical Address.
Final Formula:
Physical Address = Page Frame Base + Offset
Extra Features:
• Each PDE and PTE includes control bits:
o P (Present): 1 if the page is in memory
o D (Dirty): 1 if modified
o A (Accessed): 1 if recently accessed
o U/S, R/W: Protection info
• Page Fault: Occurs if P = 0 (page not in memory).
• Page Replacement uses algorithms like LRU or FIFO.
• TLB (Translation Lookaside Buffer) stores recently used entries to speed up access.
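The two-level split of the linear address described in Step 2 can be sketched as follows (illustrative helper):

```python
def split_linear(linear):
    """Split a 32-bit linear address into the three paging fields:
    page-directory index (top 10 bits), page-table index (next 10
    bits), and the byte offset within the 4 KB page (low 12 bits)."""
    pde_index = (linear >> 22) & 0x3FF
    pte_index = (linear >> 12) & 0x3FF
    offset = linear & 0xFFF
    return pde_index, pte_index, offset
```

Each 10-bit index selects one of 1024 entries, and the 12-bit offset covers exactly one 4 KB page.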
Q.9 Explain descriptors and paging mechanism in protected mode of 80386 (10M)
Ans.

Descriptors in Protected Mode:

In 80386 protected mode, memory segmentation is used along with descriptors to define the
characteristics of memory segments. Descriptors are stored in special tables:

• GDT – Global Descriptor Table

• LDT – Local Descriptor Table

Each segment descriptor is 8 bytes (64 bits) and contains:

Field Purpose

Base Address (32 bits) Starting address of the segment

Segment Limit (20 bits) Size of the segment

Type, DPL, S, P Access rights, segment type, privilege level

Flags: G, D/B, AVL Granularity, operation size, software-available

Segments can be code, data, or system segments, and protection is enforced using DPL (Descriptor
Privilege Level).
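As a sketch, the 8-byte descriptor above can be unpacked in software. The field positions follow the standard 80386 descriptor layout (limit bits 15:0 in bytes 0–1, base bits 15:0 in bytes 2–3, base 23:16 in byte 4, access byte in byte 5, limit 19:16 plus flags in byte 6, base 31:24 in byte 7); the helper name and dict form are illustrative.

```python
def parse_descriptor(desc8):
    """Unpack an 8-byte 80386 segment descriptor (little-endian
    byte list) into base, limit, DPL, P and G fields."""
    b = desc8
    limit = b[0] | (b[1] << 8) | ((b[6] & 0x0F) << 16)  # 20-bit limit
    base = b[2] | (b[3] << 8) | (b[4] << 16) | (b[7] << 24)
    access = b[5]
    present = bool(access & 0x80)        # P bit
    dpl = (access >> 5) & 0x3            # descriptor privilege level
    granularity = bool(b[6] & 0x80)      # G bit: limit in 4 KB units
    return {"base": base, "limit": limit, "dpl": dpl,
            "present": present, "granularity": granularity}
```

With G = 1 the 20-bit limit is interpreted in 4 KB units, so a limit of FFFFFH spans the full 4 GB.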
Paging Mechanism in Protected Mode:

In 80386, paging is used to translate a linear address (after segmentation) into a physical address.
The virtual memory is divided into 4 KB blocks called pages, and physical memory is divided into 4 KB
page frames.
Step-by-Step Translation:
1. Linear Address Structure:
o 32-bit linear address is divided into 3 parts:
▪ 10 bits → Page Directory Index (PDE)
▪ 10 bits → Page Table Index (PTE)
▪ 12 bits → Offset within the page
2. CR3 Register (PDBR):
o The Page Directory Base Register (CR3) stores the address of the Page Directory.
3. Page Directory:
o Page Directory has 1024 entries (PDEs). Each entry is 4 bytes.
o The first 10 bits of the linear address select one PDE.
o This PDE gives the base address of a Page Table.
4. Page Table:
o Each Page Table has 1024 entries (PTEs). Each entry is 4 bytes.
o The next 10 bits of the linear address select one PTE.
o This PTE gives the base address of a Page Frame in physical memory.
5. Page Frame + Offset:
o The lowest 12 bits of the linear address are used as offset within the 4 KB page.
o The physical address = Page Frame address + offset.
6. Special Bits in PDE & PTE:
o P (Present): 1 if page is in physical memory
o D (Dirty): 1 if the page was modified
o A (Accessed): 1 if page was accessed
o U/S & R/W: User/Supervisor and Read/Write permissions
7. Page Fault Handling:
o If the page is not present (P=0), a page fault occurs.
o The desired page is fetched from virtual memory and placed in physical memory.
8. Page Replacement:
o If no free page is available, page replacement is done using FIFO, LRU, or LFU
algorithms.
o If Dirty bit = 1, the old page is saved before replacement.
9. TLB (Translation Lookaside Buffer):
o A cache memory that stores recent PDEs & PTEs for faster access.
Chapter 5: Pentium Processor
Features:
• The Pentium microprocessor was initially available with 60–66 MHz clocks. [A higher clock gives
higher speed of program execution, double or triple that of earlier microprocessors.]

• The Pentium microprocessor has two 32-bit integer execution units, each with its own ALU. [Due to
the two execution units it is referred to as a superscalar architecture: double the units, double the
throughput.]

• The Pentium microprocessor has a 64-bit data bus. [A 64-bit data bus needs 8-way memory
banking.]

• The Pentium microprocessor has a 32-bit address bus. [It can address 4 GB of physical memory and
up to 64 TB of virtual memory.]

• The Pentium microprocessor has FIVE-stage pipelining. (There are 2 integer units, each with 5
stages, so there are two 5-stage pipelines, known as the U pipe and the V pipe.)

• The Pentium microprocessor supports integer U-V pipelining.

• The Pentium microprocessor has an on-chip floating point unit, which supports an 8-stage
pipeline.

• The Pentium microprocessor supports a branch prediction algorithm. (The processor guesses the
outcome of a branch (like if-else) before it is known, to keep the pipeline running smoothly and
avoid delays.)

• The Pentium microprocessor has on-chip L1 cache memory. [It has a 16 KB split cache: 8 KB for
code and 8 KB for data.]
Architecture:

Working:
The Pentium processor begins by fetching
instructions from the Instruction Cache using
the Prefetch Buffers, and if a branch
instruction is detected, the Branch Target
Buffer predicts the next instruction to avoid
delays. These instructions are decoded and
passed to the Control Unit, which manages
the flow. The decoded instructions are then
executed in two parallel integer pipelines—
the U-pipe and V-pipe, each with its own ALU
and Address Generator. Data needed for
execution is fetched from the Dual-Access
Data Cache, and results are stored in Integer
Registers. For complex operations like
floating-point math, control passes to the
Floating Point Unit (FPU), which performs
addition, division, and multiplication using its
own dedicated registers and circuits.
Meanwhile, the Bus Interface Unit handles
communication with memory and I/O
through 64-bit data and 32-bit address buses,
and the Page Unit supports memory
management.

BUS Interface Unit:


• The BIU controls all the system buses of the MPU.

• Addressing of memory and I/O is done by the BIU.

• The Pentium has a 64-bit data bus and a 32-bit address bus.

Superscalar Structure:
• Superscalar means this architecture has more than one execution unit.

• Pentium processor contains two integer units:


1. U Pipe
2. V Pipe

• Each of these integer units has a 32-bit ALU.

• Due to the two execution units, the performance of the Pentium is roughly doubled.


Prefetch Unit:
• Prefetch unit is used to implement FIVE stage pipelining.

• It consists of a pair of prefetch buffers, each 32 bytes in size.


• Both buffers operate independently, but only one is active at a time.

• It fetches instructions sequentially from the Code Cache.

Branch Prediction Unit:


• The Branch Prediction Unit plays an essential role in predicting branches.

• Branch instructions cause pipeline stalling.

• Frequent pipeline stalling degrades execution performance.

• So, by branch prediction, we can predict the outcome of a branch instruction in advance.

• Branch prediction is done using the Branch Target Buffer (BTB), which stores 256 entries.

Execution Unit:
• The two execution units execute instructions in parallel.

• Each execution unit has a 32-bit ALU.

• As each instruction executes, the unit updates the register set accordingly.

Cache Memory:
• The Pentium has integrated cache memory, which increases the execution speed of the processor.

• It has a 16 KB split cache memory integrated on chip: 8 KB for data and 8 KB for code.

• The split cache reduces memory conflicts.

Floating Point Unit:


• It is used for execution of floating point instructions.

• Floating point unit also supports 8 stage pipeline.

• Floating point unit enhances the capabilities of microprocessor for graphics & multimedia
applications.

• It has dedicated hardwired circuits for the multiplier, divider & adder.

• The hardwired control improves performance.


• The FPU supports single, double & extended precision floating point operations with eight 80-bit
registers.
Q.1 Explain Integer Pipeline of Pentium Processor.
Ans.
Integer Pipeline of Pentium Processor
The Pentium processor performs integer instructions using a five-stage pipeline. These stages are:
1. Prefetch (PF)
2. Instruction Decode (D1)
3. Address Generate (D2)
4. Execute (EX)
5. Write-Back (WB)

Stage 1: Prefetch (PF)


• Instructions are fetched from the L1 Cache and stored into the Prefetch Queue.
• The Prefetch Queue is 32 bytes, which can hold at least two full instructions.
• Maximum size of an instruction is 15 bytes.
• There are two prefetch queues, but only one is active at a time.
• The second queue is used when the branch prediction logic predicts a branch to be "taken".
• Since the bus width from L1 cache to prefetcher is 256 bits (32 bytes), the entire queue can
be fetched in 1 cycle (T State).

Stage 2: Instruction Decode (D1)


• The opcode of the instruction is decoded.
• Checks for instruction pairing and branch prediction.
• Not all instructions are pairable.
• If the instructions can be paired:
o First instruction goes to the U-pipe
o Second instruction goes to the V-pipe
• If they cannot be paired:
o First instruction goes to the U-pipe
o Second instruction is held back and tried to pair with the next instruction
Instruction Pairing Algorithm:
If all the following conditions are true for two instructions I1 and I2:
• I1 is a simple instruction
• I2 is a simple instruction
• I1 is not a jump instruction
• Destination of I1 ≠ Source of I2
• Destination of I1 ≠ Destination of I2
Then:
Issue I1 to U-pipe and I2 to V-pipe
Else:
Issue I1 to U-pipe only
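The pairing algorithm above can be sketched as follows. The dict representation of an instruction (with `simple`, `jump`, `dest` and `src` keys) is a hypothetical simplification; the real pairing rules also depend on opcode details not modelled here.

```python
def can_pair(i1, i2):
    """Pentium U/V-pipe pairing test following the rules above."""
    return (i1["simple"] and i2["simple"]
            and not i1["jump"]
            and i1["dest"] != i2["src"]
            and i1["dest"] != i2["dest"])

def issue(i1, i2):
    """Return which pipe each instruction is issued to."""
    if can_pair(i1, i2):
        return {"U": i1, "V": i2}
    # i2 is held back and tried to pair with the next instruction
    return {"U": i1, "V": None}
```

For example, two independent register moves pair, but a second instruction that reads the first one's destination register does not.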
Branch Prediction:
• Pentium uses branch prediction logic.
• It helps avoid pipeline flushing during branch instructions.
• If prediction is correct, there is no performance penalty.

Stage 3: Address Generate (D2)


• Generates physical address of the required memory operand.
• Uses segment translation and page translation.
• Performs protection checks.
• Fast address calculation using segment descriptor caches and TLB.
• Address translation is usually completed in 1 cycle.
Stage 4: Execute (EX)
• Execution is done using ALUs.
• U-pipe’s ALU has a barrel shifter, required for instructions like MUL, DIV, SHL, SHR, etc.
• V-pipe does not have a barrel shifter.
• Operands are taken from either registers or the data cache (if it's a cache hit).
• Both U and V pipes can access the data cache simultaneously.
• Stall behavior:
o If U-pipe stalls, V-pipe must also stall.
o If V-pipe stalls, U-pipe can still proceed.

Stage 5: Write-Back (WB)


• The result is written back to the destination register.
• Flags are updated based on the result of the instruction.

Q.2 Explain the floating-point pipeline of pentium processor.


Ans.
The Pentium processor's floating-point pipeline is an 8-stage pipeline, blending integer instruction
stages with dedicated floating-point processing. It ensures efficient execution of complex arithmetic
operations while supporting error handling, rounding, and format conversion for robust floating-
point computation.

The Pentium processor features a superscalar architecture that includes separate pipelines for
integer and floating-point operations. The floating-point pipeline is specifically optimized for
arithmetic operations on real numbers (fractions, decimals, scientific data).

It consists of eight pipeline stages, of which:

• The first four are shared with the integer pipeline.

• The last four are dedicated to the Floating-Point Unit (FPU).

Stage                         Description

1. Prefetch                   Identical to the integer pipeline's prefetch stage. Instructions are fetched from memory.

2. Instruction Decode 1 (D1)  Initial decoding of the instruction is performed. This includes recognizing the opcode and preparing for further decode.

3. Instruction Decode 2 (D2)  More detailed decoding. Determines the type of operation and the operand types.

4. Execution Stage (EX)       Operand values are read from registers or memory. This includes address generation and access.

5. FP Execution 1 (FP Ex1)    The operand is loaded into a floating-point register and converted into floating-point format if needed (e.g., from integer or double).

6. FP Execution 2 (FP Ex2)    The actual floating-point arithmetic operation (like ADD, MUL, DIV, etc.) is performed by the floating-point unit.

7. Write FP Result            The computed result is rounded according to the FPU control word and written to the destination floating-point register.

8. Error Reporting            If any exception or error (like overflow, underflow, divide by zero) occurs, it is flagged and the FPU status word is updated accordingly.

Q.3 Explain the branch prediction mechanism of the Pentium processor. [10M]


Ans.
The Pentium processor includes a sophisticated branch prediction logic to minimize pipeline
flushing, which can degrade performance. This mechanism attempts to predict the outcome of a
branch instruction (whether the branch will be taken or not) before it is actually executed. This way,
the processor can continue fetching and executing instructions without waiting for the branch to
resolve.

If a branch is correctly predicted, then no performance penalty is incurred. The instructions


following the branch (based on the prediction) continue smoothly through the pipeline.
However, if the prediction is incorrect, there is a penalty:

• A three-cycle penalty is incurred if the branch is executed in the U pipeline.

• A four-cycle penalty (3 + 1 extra) may be incurred if the branch is executed in the V pipeline.

Branch Target Buffer (BTB):


The prediction mechanism is implemented using a hardware structure known as the Branch Target
Buffer (BTB).

• The BTB is a 4-way set-associative cache with 256 entries.

• It functions like a look-aside cache, monitoring the instruction stream during the Instruction
Decode (D1) stages of both pipelines.

Each entry in the BTB includes:

1. A valid bit – Indicates whether the entry is currently in use.

2. Two history bits – These bits keep track of the past behavior of the branch (how often the
branch was taken).

3. The memory address of the branch instruction – Used for identifying the instruction.

How Prediction Works:


During the D1 stage, when an instruction is decoded and identified as a branch instruction, its
memory address is checked against the BTB:

• If the address is not found in the BTB (BTB miss), the processor predicts that the branch will
not be taken.
• If the address is found in the BTB (BTB hit), the two history bits are used to decide the
prediction:

History Bits Meaning Prediction

00 Strongly Not Taken Branch will not be taken

01 Weakly Not Taken Branch will not be taken

10 Weakly Taken Branch will be taken

11 Strongly Taken Branch will be taken

Instruction Queues and Control Flow:


Based on the prediction:

• If the branch is predicted to be taken, the current instruction queue is deactivated, and the
prefetcher starts fetching instructions from the target (branch) address. These instructions
are stored in a second queue, which becomes the new active queue.

• If the branch is predicted not to be taken, then nothing changes. The current active queue
continues to fetch instructions from the next sequential address after the branch.

Execution and Feedback:


When the branch instruction reaches the execution stage, the processor now knows the actual
outcome of the branch (taken or not taken). Based on this, the following actions are taken:

1. If the branch was correctly predicted taken:

o The history bits in the BTB entry are upgraded to strengthen the prediction.

o No further action is needed as the correct path is already being followed.

2. If the branch was incorrectly predicted taken:

o The history bits are downgraded.

o The instructions fetched from the wrong path must be flushed.

o The prefetcher switches back to the other queue to fetch the correct instructions.

3. If the branch was correctly predicted not taken:

o If there is a BTB entry, its history bits are downgraded slightly (toward "not taken").

o If there was no BTB entry (miss), nothing is updated.


4. If the branch was incorrectly predicted not taken:

o If a BTB entry exists, its history bits are upgraded.

o If there was no BTB entry, a new entry is created in the BTB and marked as strongly
taken.

History Bit Transition Table:

History Bits   Description          Prediction Made   If Actually Taken   If Actually Not Taken
00             Strongly Not Taken   Not Taken         Becomes 01          Remains 00
01             Weakly Not Taken     Not Taken         Becomes 10          Becomes 00
10             Weakly Taken         Taken             Becomes 11          Becomes 01
11             Strongly Taken       Taken             Remains 11          Becomes 10
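The transition table above can be sketched as a small Python model. This is an illustrative sketch, not Pentium hardware: the class name `TwoBitPredictor` and the dictionary-based BTB are assumptions made for clarity. States 0–3 correspond to the four history-bit patterns 00–11.

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor with a BTB.
# States: 0 = Strongly Not Taken, 1 = Weakly Not Taken,
#         2 = Weakly Taken,       3 = Strongly Taken.

class TwoBitPredictor:
    def __init__(self):
        self.btb = {}  # branch address -> 2-bit history counter

    def predict(self, addr):
        # BTB miss: predict not taken (per the rules above).
        state = self.btb.get(addr, 0)
        return state >= 2  # True means "taken"

    def update(self, addr, taken):
        if addr not in self.btb:
            # A new entry is created only for a branch that was actually
            # taken on a miss, and it is marked strongly taken (11).
            if taken:
                self.btb[addr] = 3
            return
        state = self.btb[addr]
        # Saturating increment/decrement matching the transition table.
        self.btb[addr] = min(state + 1, 3) if taken else max(state - 1, 0)

p = TwoBitPredictor()
print(p.predict(0x400))   # False: BTB miss -> predicted not taken
p.update(0x400, True)     # mispredicted taken: new entry, strongly taken
print(p.predict(0x400))   # True
```

Note how a single opposite outcome only weakens a "strong" state rather than flipping the prediction, which is exactly why the 2-bit scheme tolerates occasional loop exits.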

Conclusion:

The branch prediction logic in the Pentium processor is designed to improve performance by
reducing pipeline stalls during control flow instructions. By using the BTB and 2-bit history-based
prediction, the processor intelligently guesses the most likely path of execution and learns over time.
Correct predictions allow seamless execution, while incorrect ones trigger minimal recovery actions
like flushing and re-fetching.

This dynamic mechanism helps maintain high instruction throughput and overall efficiency in
modern processors.

Q.4 Explain Cache organization of Pentium processor. (10M)


Ans.
Pentium processors are very fast, but accessing main memory (DRAM) is much slower than accessing internal CPU registers. To balance speed and cost, systems use SRAM (fast but small and expensive) as cache memory and DRAM (large and cheap) as main memory.

Cache memory (SRAM) stores recently used instructions and data for faster access. It is small, fast, and often built into the processor (on-chip), making it quicker to access than off-chip memory. It works automatically and exploits locality: programs tend to access the same data and instructions repeatedly. Frequently used code and data can therefore be served from the cache instead of the slower main memory.

Thus, fast SRAM cache is placed between the CPU and the slower DRAM through a cache controller.
SRAM cache is used to hold the most frequently accessed instructions as well as data and make it
available very quickly. The cache controller controls the complete process.
When the Pentium processor wants to read data, it sends out the memory address of the desired data. The cache controller then determines whether that address is present in the SRAM cache or only in main memory. When the data is in the cache, it is called a cache hit, and the data is delivered from the cache without delay.

When the processor sends out the address of data that does not exist in the cache memory, it is called a cache miss, and the cache controller must go out to the slower main memory to fetch it.

Cache lines are usually 32 bytes and are read using burst mode, which transfers multiple bytes in a single bus transaction and thereby boosts performance.

Write Operations in Cache:

1. Write-Through Strategy:
Data is written to both cache and main memory on a cache hit. It ensures memory
consistency but is slower due to frequent memory access. Write buffers can help but only
until they fill up.

2. Write-Back Strategy:
Data is written only to the cache first. Main memory is updated later during specific events
(like executing WBINVD, a cache miss, or a flush). It improves performance but takes longer
during cache replacements.

3. Write-Allocate:
On a write miss, a cache line is allocated and updated so future accesses are faster. But if the
cache is full, this might replace useful old data.
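The contrast between the first two write strategies can be sketched in a few lines of Python. This is an illustrative model, not Pentium hardware: the class name `TinyCache` and the dict-backed "memory" are assumptions, and the `flush()` method stands in for events like `WBINVD` or a line replacement.

```python
# Illustrative sketch contrasting write-through and write-back policies
# on a toy cache backed by a dict that models main memory (DRAM).

class TinyCache:
    def __init__(self, write_back=True):
        self.memory = {}      # models slow DRAM
        self.cache = {}       # models fast SRAM
        self.dirty = set()    # write-back: lines modified but not yet in memory
        self.write_back = write_back

    def write(self, addr, value):
        self.cache[addr] = value
        if self.write_back:
            self.dirty.add(addr)        # defer the main-memory update
        else:
            self.memory[addr] = value   # write-through: update both at once

    def flush(self):
        # e.g. on WBINVD or before external (DMA) access: push dirty lines out
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()

wb = TinyCache(write_back=True)
wb.write(0x10, 42)
print(0x10 in wb.memory)  # False: memory is stale until a flush
wb.flush()
print(wb.memory[0x10])    # 42
```

The `dirty` set is the reason write-back needs explicit invalidation and flushing: until `flush()` runs, the only up-to-date copy of address 0x10 lives in the cache.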

Cache Invalidation & Flush:


If other devices (like DMA controllers) write to main memory, the cache must invalidate its copy to
avoid stale data. If the latest data is only in the cache, it is flushed (written back) to memory before
external access.

Cache Organization in Pentium Processors: A cache line in the Pentium processor stores both the data and tag information identifying where that data resides in main memory. Two common methods of cache organization are the direct-mapped cache and the two-way set-associative cache.

1. Direct-Mapped Cache:

o Each memory address maps to one specific cache location.

o If two addresses share the same cache index, they compete (called contention).

o Fastest cache type since tag and data are accessed in parallel.

o 8KB cache with 16-byte lines uses 512 lines (A12–A4 for indexing, A31–A13 for tag).

o Data is fetched in blocks using burst mode; line size typically matches block size.
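The direct-mapped address split mentioned above (8 KB cache, 16-byte lines, 512 lines) can be shown as a short sketch. The function name `split_direct_mapped` is a hypothetical helper; the bit ranges are the ones given in the bullet list (A3–A0 offset, A12–A4 index, A31–A13 tag).

```python
# Sketch of the address split used by the 8 KB direct-mapped example:
# A3-A0  byte offset within a 16-byte line,
# A12-A4 index (9 bits -> 512 lines),
# A31-A13 tag.

def split_direct_mapped(addr):
    offset = addr & 0xF            # A3-A0
    index  = (addr >> 4) & 0x1FF   # A12-A4
    tag    = addr >> 13            # A31-A13
    return tag, index, offset

tag, index, offset = split_direct_mapped(0x12345678)
print(hex(tag), index, offset)     # 0x91a2 359 8
```

Because the index comes directly from fixed address bits, any two addresses that share bits A12–A4 compete for the same line, which is the contention problem the set-associative design relieves.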

2. Two-Way Set-Associative Cache:

o The cache is divided into sets, and each set contains 2 cache lines. When data needs
to be stored in the cache, it can be placed in either of the two lines in the
corresponding set.
o Each address (from main memory) can be stored in two possible locations in cache
(reduces contention).

o Acts like two direct-mapped caches running in parallel.

o 8KB cache with 16-byte lines split into two halves of 256 lines each.

o Uses A11–A4 for indexing and A31–A12 for tag.

o Slightly slower than direct-mapped due to extra selection (multiplexing) time.

Example of 2 Way set associative:

Let’s take an example:


• Cache Size: 8 KB
• Line Size: 16 bytes
• Number of lines = 8 KB / 16 B = 512 lines
• Since it is two-way, cache is split into 256 sets, each with 2 lines
Now, to map a 32-bit memory address to the cache:

Breakdown of address bits:


Address Bits        Purpose
A31–A12 (20 bits)   Tag – identifies if the block in the cache is the correct one
A11–A4 (8 bits)     Set Index – selects one of the 256 sets
A3–A0 (4 bits)      Byte Offset – selects a byte within a 16-byte cache line

How It Works (Step-by-Step):


1. CPU requests data from memory.
2. Set index (A11–A4) is used to locate the correct set.
3. Inside that set, there are 2 cache lines → both are checked in parallel.
4. Tag (A31–A12) from the address is compared with tags stored in both lines.
o If match found (cache hit): data is returned from cache.
o If no match (cache miss): data is fetched from main memory and placed into one of
the two lines in the set.
5. If both lines are already used, one is replaced (usually using LRU – Least Recently Used
policy).
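The step-by-step lookup above can be sketched as a toy model. This is an illustrative sketch, not the Pentium's actual circuitry: the class name `TwoWayCache` is an assumption, each set is modeled as a short list ordered by recency, and only tags (no data) are tracked.

```python
# Sketch of a two-way set-associative lookup: 256 sets of 2 lines,
# set index = A11-A4, tag = A31-A12, LRU replacement within a set.

class TwoWayCache:
    def __init__(self):
        # Each set holds up to 2 tags; front of the list = most recently used.
        self.sets = [[] for _ in range(256)]

    def access(self, addr):
        index = (addr >> 4) & 0xFF   # A11-A4 selects one of 256 sets
        tag   = addr >> 12           # A31-A12
        ways = self.sets[index]
        if tag in ways:              # both ways are checked "in parallel"
            ways.remove(tag)
            ways.insert(0, tag)      # mark as most recently used
            return "hit"
        if len(ways) == 2:
            ways.pop()               # evict the least recently used line
        ways.insert(0, tag)
        return "miss"

c = TwoWayCache()
print(c.access(0x1000))  # miss: first reference
print(c.access(0x1000))  # hit: same tag, same set
print(c.access(0x2000))  # miss, but same set -> the second way is used
print(c.access(0x1000))  # hit: both lines coexist (no contention)
```

In a direct-mapped cache, 0x1000 and 0x2000 (which share the same index bits) would keep evicting each other; here the second way absorbs the conflict.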
Pre-requisite for the MESI protocol:
The cache consistency problem occurs when data can be modified by more than one source. When a copy of data is held in both main memory and a cache memory and one copy is changed, the other copy becomes stale and system consistency is lost. If there are two caches in the system, the problem becomes even more complicated. Assume a multiprocessor system consists of two processors, Pentium-A and Pentium-B. If the secondary processor (Pentium-B) overwrites a common memory location, the other processor (Pentium-A) must know that this has occurred. The MESI protocol for cache lines is used in the Pentium and more advanced processors to ensure cache consistency. Figure 12.46 shows the bus snooping when two processors, each with a local cache, have access to a common main memory.

Q.5 Explain MESI Protocol (10M).


Ans.
The MESI protocol is a cache coherence protocol used in multiprocessor systems to maintain
consistency between the main memory and local processor caches. It ensures that no processor uses
stale or incorrect data, thus maintaining data integrity across the system.

A. Why the MESI Protocol is Needed


1. Data Consistency:
In a multiprocessor system, several processors may have cached copies of the same memory
block. Without a coherence protocol, different caches could hold different versions of the
same data, leading to incorrect program behavior. MESI prevents this by synchronizing cache
states.
2. Reduced Bus Traffic:
Without cache coherence, every memory access would need to be broadcast to all caches to
maintain accuracy. MESI reduces this bus traffic by maintaining proper state information for
each cache line and only involving the bus when necessary.
3. Reliable Data Storage:
MESI ensures that updates to shared data are done atomically. This allows multiple
processors to safely access and update shared data without conflicts or data corruption.

B. Working of the MESI Protocol


The MESI protocol assigns one of four states to each cache line:
• Modified (M):
The cache line has been changed and is different from main memory. Only this cache holds
the updated data.
• Exclusive (E):
The cache line matches the main memory and no other cache holds this data.
• Shared (S):
The cache line is unchanged from main memory and may exist in multiple caches.
• Invalid (I):
The cache line is invalid or not present.
Operations Based on MESI:
1. Initial State:
All cache lines begin in the Invalid (I) state.
2. Read Operation:
o If a processor (e.g., Processor-A) reads a block not in its cache, it fetches from main
memory (Read Miss).
o If no other cache has this block, it enters Exclusive state.
o If another cache (e.g., Processor-B) already has the block, both caches go to the
Shared state.
o If a block is in Exclusive or Shared state, the processor reads it directly from the
cache.
3. Write Operation:
o If the cache line is Exclusive, the processor can write and the state changes to
Modified.
o If it’s Shared, the processor invalidates all other copies and transitions to Modified.
o If it’s Invalid, the processor fetches the block and directly writes, transitioning to
Modified.
o If already Modified, the processor can continue writing without any change.
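The read and write transitions listed above can be condensed into a short sketch, seen from one processor's cache. This is an illustrative model only: the function names `on_read`/`on_write` are assumptions, and bus transactions (invalidating other copies, snooping) are reduced to a single boolean parameter.

```python
# Illustrative sketch of the MESI state changes for one cache line,
# from the viewpoint of a single processor's cache.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def on_read(state, other_cache_has_copy):
    if state == I:
        # Read miss: fetch from main memory. Exclusive if no other
        # cache holds the block, Shared otherwise.
        return S if other_cache_has_copy else E
    return state    # M/E/S: read directly from this cache, no change

def on_write(state):
    # A write always leaves this cache's copy Modified; in the Shared
    # state, the other copies are invalidated first (bus transaction),
    # and in the Invalid state the block is fetched before writing.
    return M

line = I
line = on_read(line, other_cache_has_copy=False)
print(line)        # Exclusive
line = on_write(line)
print(line)        # Modified
```

The key invariant the sketch preserves is that a Modified line exists in exactly one cache, so only that cache is responsible for writing it back.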

Conclusion:
The MESI protocol maintains cache consistency efficiently using its four-state mechanism. By
minimizing unnecessary memory accesses and avoiding stale data issues, it ensures high
performance and correctness in multiprocessor systems.
Chapter 6: Pentium 4

Q.1 Comparison 80386 ,Pentium 1 ,Pentium 2 and Pentium 3 Processor


Ans.
Q.2 Draw and explain Pentium 4: Net burst microarchitecture. (May 2024) (10M)
Ans.
The Pentium 4 processor, launched by Intel, uses the NetBurst microarchitecture designed for high
performance and higher clock speeds. It introduced many advancements over previous generations
like Pentium III.

Key Features of NetBurst Architecture:


1. Hyper Pipelined Technology
o 20-stage deep pipeline (vs 10 in Pentium III)
o Allows higher clock speeds but increases penalty on mispredicted branches.
2. Faster System Bus
o Operates at 400 MHz for faster data transfer between CPU and memory.
3. Execution Trace Cache (ETC)
o A special Level-1 instruction cache that stores decoded micro-operations (uops), not
raw instructions.
o Eliminates repeated decoding in loops.
4. Advanced Dynamic Execution
o Out-of-order execution for higher efficiency.
o Includes:
▪ Branch prediction logic
▪ Instruction reordering
▪ Speculative execution
5. Rapid Execution Engine
o Two ALUs and two AGUs run at double the core frequency.
o Greatly speeds up integer operations.
6. SSE2 Instructions
o Adds 144 new SIMD instructions.
o Boosts performance in multimedia, scientific, and 3D applications.

Block Diagram (Describe If Asked for It)


Four major sections in the Pentium 4 architecture:
1. In-Order Front End
• Fetches instructions from L2 cache using branch
prediction.
• Decodes IA-32 instructions into uops.
• Stores decoded instructions in the Trace Cache
(12K uops, ~150 KB).
2. Out-of-Order Execution Engine
• Reorders instructions to utilize CPU resources
efficiently.
• Keeps pipeline busy by executing ready uops first.
• Uses retirement logic to commit results in correct
program order.
3. Integer and Floating-Point Units
• Executes actual computations.
• Includes register files, ALUs, FPUs, and L1 Data
Cache.
4. Memory Subsystem
• Includes L1, L2 caches and system bus.
• L2 holds instructions/data not in Trace Cache or L1.
• System bus handles cache misses and I/O operations.
Pipeline Stages Overview:
• TC Nxt IP: Determines next instruction using Branch Target Buffer (2 stages)
• TC Fetch: Fetches from Trace Cache (2 stages)
• Drive & Alloc: Sends uops and allocates resources
• Rename: Maps x86 registers to 128 internal ones (2 stages)
• Queue (Que): Uops wait in their type-specific queue
• Schedule (Sch): Reorders and sends uops to execution units (3 stages)
• Dispatch, Register Read, Execute: Micro-ops executed
• Flags & Branch Check: Updates flags and verifies branch predictions

Summary to Remember:
"Pentium 4 with NetBurst is all about SPEED — deeper pipelines (Hyper Pipelining), smarter
instruction handling (Trace Cache, Out-of-Order Execution), faster buses, and SIMD boosts (SSE2)
— all making it ideal for high-performance tasks."

Q.3 Explain hyper threading technology and its use in Pentium 4. [10M] (May 2023)
Ans.
Hyper-Threading Technology (HTT) is Intel’s simultaneous multithreading (SMT) technology that
allows a single physical processor core to behave like two logical processors, so it can execute two
threads (instruction sequences) at the same time.

Key Concepts:
• Normally, a single core executes one thread at a time.
• With Hyper-Threading, the Pentium 4 can execute two threads simultaneously using the
same core.
• This improves CPU efficiency by utilizing idle resources (like ALUs, FPUs, cache, etc.) when
one thread is waiting (e.g., for memory access).

How Hyper-Threading Works:


• The CPU maintains two sets of architectural states (registers, program counters, etc.) for the
two threads.
• But it shares the execution units, caches, and buses between them.
• The operating system sees two logical processors, and schedules two tasks on them as if it
were a dual-core CPU.

Use in Pentium 4:
• Introduced with Pentium 4 (NetBurst architecture).
• Improves performance by:
o Increasing CPU throughput.
o Reducing idle time of execution units.
o Enhancing multitasking (running multiple apps smoothly).
• Best performance gains seen in multi-threaded applications (like video editing, gaming,
compiling).

Advantages:
• Up to 30% better performance in multithreaded workloads.
• More efficient use of CPU resources.
• Improves responsiveness in multitasking environments.
Limitations:
• Does not double performance; gains depend on workload.
• Performance can degrade slightly in non-optimized or heavily memory-bound applications
due to shared resources.

Easy to Remember:
"One core, two threads = Hyper-Threading. Introduced in Pentium 4 to increase efficiency and
performance without extra hardware."
