Microprocessor Theory Part 2
3. Address Pipelining:
• 80386 uses address pipelining – it sends the next address during the current cycle (T2
state).
• This hides delays from the decoder and is helpful for slower devices, reducing wait time.
4. Virtual Memory:
• 80386 supports Virtual Memory using Segmentation and Paging.
• It can access up to 64 TB (2^46) of virtual memory.
5. Protection:
• It uses a Protected Mode to safely access memory and I/O.
• There are 4 Privilege Levels to control access.
6. Multitasking:
• 80386 supports Multitasking using time-sharing.
• Multiple tasks can run by taking small turns, improving system performance.
7. I/O Addressing:
• It uses a 16-bit I/O address, allowing access to 65,536 I/O devices (from address 0000h
to FFFFh).
Q.1 Explain the Architecture of 80386 with neat block diagram.
Ans.
2. Prefetch Unit
• The Prefetch Unit fetches upcoming instructions to implement pipelining and stores them in
the Prefetch Queue.
• It fetches 16 bytes of the program in advance and refills the queue when at least 4 bytes are
vacant (due to the 32-bit data bus).
• During control transfer instructions (like branches), the pre-fetched instructions become
invalid and are discarded.
3. Decode Unit
• The Decode Unit is responsible for decoding instructions into micro-operations.
• It decodes up to three instructions simultaneously and stores them in a queue in micro-coded form.
• As with the prefetch queue, the decoded instructions are discarded during a control transfer instruction because they become invalid.
4. Execution Unit
• This unit performs actual execution of instructions.
• It consists of:
o A 32-bit Arithmetic and Logic Unit (ALU) for arithmetic and logical operations.
o Dedicated circuits for 32-bit multiplication and division.
o A 64-bit Barrel Shifter for fast shifting operations.
o A set of 32-bit General Purpose Registers (GPRs)(e.g., EAX, EBX).
o A 32-bit EFLAGS Register that indicates the status of the processor after operations.
1. Limit Check:
• Every segment has a limit (size).
• The offset used to access memory is compared with this limit.
• Depending on the size of the data being accessed, the condition changes:
Data Size | Example Instruction | Condition         | Explanation
8-bit     | MOV CL, [2000]      | Offset ≤ Limit     | Only one byte is accessed. If the limit is 2000, the access is valid.
16-bit    | MOV CX, [2000]      | Offset ≤ Limit - 1 | Two bytes are accessed (2000 and 2001); the limit must allow both.
32-bit    | MOV ECX, [2000]     | Offset ≤ Limit - 3 | Four bytes are accessed (2000–2003), so the limit must be ≥ 2003.
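The rule in the table above can be sketched in a few lines of Python. This is an illustration of the check, not 80386 microcode; the function name and parameters are hypothetical:

```python
# Illustrative sketch of the 80386 limit check described above.
# An access of `size_bytes` starting at `offset` touches bytes
# offset .. offset + size_bytes - 1, so the last byte touched must
# not exceed the segment limit: offset <= limit - (size_bytes - 1).
def limit_check_ok(offset: int, limit: int, size_bytes: int) -> bool:
    return offset <= limit - (size_bytes - 1)

# 32-bit access at offset 2000 touches bytes 2000..2003,
# so the limit must be at least 2003:
print(limit_check_ok(2000, 2003, 4))  # True
print(limit_check_ok(2000, 2002, 4))  # False
```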
2. Type Check:
• Each segment has a descriptor that defines its type.
• The type field has:
o S = 0 → System segment
o S = 1 → User segment
o E, ED, W bits define if it's code/data and whether it's readable/writable.
• Example:
If S = 1, E = 0, ED = 0, W = 0 → It’s a data segment, read-only, and valid for read operation.
This check ensures the type of operation (read/write) is allowed on the segment.
3. Privilege Check:
• 80386 uses Privilege Levels (PL) from PL0 (highest) to PL3 (lowest).
• A program at a lower privilege level cannot access higher privileged
segments.
There are 3 important terms here:
• CPL (Current Privilege Level): Privilege of the current code
segment.
• RPL (Requested Privilege Level): Privilege requested in the
segment selector (last 2 bits).
• DPL (Descriptor Privilege Level): Privilege of the segment being
accessed.
Access Rule:
To access a segment:
Target DPL ≥ max(CPL, RPL)
If this condition is true, access is allowed.
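The access rule above can be expressed directly in code. Remember that a numerically smaller level means more privilege (PL0 is highest). This is a hedged sketch; the function name is illustrative, not a real API:

```python
# Sketch of the 80386 data-segment privilege check described above.
# Numerically smaller PL = more privileged (PL0 highest, PL3 lowest).
def may_access(cpl: int, rpl: int, dpl: int) -> bool:
    # The effective privilege is the weaker (numerically larger) of
    # CPL and RPL; the target's DPL must be >= that value.
    return dpl >= max(cpl, rpl)

print(may_access(cpl=0, rpl=0, dpl=3))  # True: PL0 code may access a PL3 data segment
print(may_access(cpl=3, rpl=3, dpl=0))  # False: PL3 code may not access a PL0 segment
```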
Q.5 Differentiate between Real Mode, Protected Mode and Virtual mode.
Ans.
Q.6 Explain the Register organization of 80386. [10M].
Ans.
Register Organization of 80386
The 80386 microprocessor is an advanced 32-bit processor that significantly extends the register
architecture of its predecessors like 8086 and 80286. It supports both 16-bit and 32-bit operations,
and introduces several new registers to enhance performance, memory handling, and system
control.
1. General-Purpose Registers
The 80386 has eight general-purpose registers, each 32-bit wide, and can also be accessed as 16-bit
or 8-bit registers when needed. These include:
• EAX, EBX, ECX, EDX – Extended versions of AX, BX, CX, DX.
• ESI, EDI, EBP, ESP – Extended Source Index, Destination Index, Base Pointer, and Stack
Pointer.
These are used for arithmetic, logic, data movement, and stack operations. For example, EAX
can be split as AX, and AX further into AH (high byte) and AL (low byte).
2. Segment Registers
80386 supports six segment registers to handle memory segmentation:
• CS, DS, ES, SS – Inherited from earlier processors.
• FS and GS – New in 80386, allowing access to extra memory segments without changing
others.
These registers store selectors in protected mode and segment addresses in real mode,
enabling flexible and secure memory management.
1. Status Flags
These indicate the results of arithmetic/logical instructions.
Flag | Full Name            | Description
CF   | Carry Flag           | Set when an operation results in a carry or borrow. Used in multi-byte arithmetic and shift operations.
PF   | Parity Flag          | Set if the least significant byte of the result has even parity (an even number of 1s).
AF   | Auxiliary Carry Flag | Set if there is a carry/borrow between the lower and upper nibbles (used in BCD arithmetic).
ZF   | Zero Flag            | Set if the result of the operation is zero.
SF   | Sign Flag            | Reflects the sign of the result. Set if the result is negative (MSB = 1).
OF   | Overflow Flag        | Set if the signed result is too large to fit in the destination register. Useful in signed arithmetic.
Example:
Adding +118 (01110110) and +54 (00110110) gives 10101100. The true sum, +172, is outside the signed 8-bit range (-128 to +127), so OF is set; interpreted as signed, the stored result reads as -84.
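The overflow condition in the example above can be checked mechanically: OF is set when two operands of the same sign produce a result of the opposite sign. A small sketch (illustrative names, not processor microcode):

```python
# Sketch of how OF is determined for 8-bit signed addition:
# overflow occurs when both operands have the same sign but the
# result has the opposite sign.
def add8_flags(a: int, b: int):
    result = (a + b) & 0xFF          # keep only the low 8 bits

    def sign(x: int) -> int:         # MSB of an 8-bit value
        return (x >> 7) & 1

    of = sign(a) == sign(b) and sign(result) != sign(a)
    return result, of

res, of = add8_flags(0b01110110, 0b00110110)  # +118 + +54
print(f"{res:08b} OF={of}")  # 10101100 OF=True
```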
2. Control Flags
These control CPU instruction behaviors.
Flag | Full Name      | Description
DF   | Direction Flag | Determines string operation direction. Cleared (0) = forward, Set (1) = backward.
IF   | Interrupt Flag | Enables or disables maskable hardware interrupts. Set = enabled, Clear = disabled.
TF   | Trap Flag      | Enables single-step debugging. Each instruction generates an exception for debugging purposes.
3. System Flags
Used mainly by the operating system.
Flag | Full Name           | Description
IOPL | I/O Privilege Level | 2-bit field indicating the privilege level required to execute I/O instructions.
NT   | Nested Task         | Set when one task invokes another (task switching).
VM   | Virtual Mode        | Enables Virtual 8086 mode, allowing real-mode programs to run in protected mode.
In 80386 protected mode, memory segmentation is used along with descriptors to define the
characteristics of memory segments. Descriptors are stored in special tables:
Segments can be code, data, or system segments, and protection is enforced using DPL (Descriptor
Privilege Level).
In 80386, paging is used to translate a linear address (after segmentation) into a physical address.
The virtual memory is divided into 4 KB blocks called pages, and physical memory is divided into 4 KB
page frames.
Step-by-Step Translation:
1. Linear Address Structure:
o 32-bit linear address is divided into 3 parts:
▪ 10 bits → Page Directory Index (PDE)
▪ 10 bits → Page Table Index (PTE)
▪ 12 bits → Offset within the page
2. CR3 Register (PDBR):
o The Page Directory Base Register (CR3) stores the address of the Page Directory.
3. Page Directory:
o Page Directory has 1024 entries (PDEs). Each entry is 4 bytes.
o The first 10 bits of the linear address select one PDE.
o This PDE gives the base address of a Page Table.
4. Page Table:
o Each Page Table has 1024 entries (PTEs). Each entry is 4 bytes.
o The next 10 bits of the linear address select one PTE.
o This PTE gives the base address of a Page Frame in physical memory.
5. Page Frame + Offset:
o The lowest 12 bits of the linear address are used as offset within the 4 KB page.
o The physical address = Page Frame address + offset.
6. Special Bits in PDE & PTE:
o P (Present): 1 if page is in physical memory
o D (Dirty): 1 if the page was modified
o A (Accessed): 1 if page was accessed
o U/S & R/W: User/Supervisor and Read/Write permissions
7. Page Fault Handling:
o If the page is not present (P=0), a page fault occurs.
o The desired page is fetched from virtual memory and placed in physical memory.
8. Page Replacement:
o If no free page is available, page replacement is done using FIFO, LRU, or LFU
algorithms.
o If Dirty bit = 1, the old page is saved before replacement.
9. TLB (Translation Lookaside Buffer):
o A cache memory that stores recent PDEs & PTEs for faster access.
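The 10-10-12 split in steps 1-5 above can be sketched as a short translation routine. The page directory and page tables here are toy dictionaries standing in for the real in-memory structures, and the names are illustrative only:

```python
# Minimal sketch of 80386 linear-to-physical address translation
# using the 10/10/12 bit split described above.
def translate(linear: int, page_directory: dict, page_tables: dict) -> int:
    pde_index = (linear >> 22) & 0x3FF   # top 10 bits select the PDE
    pte_index = (linear >> 12) & 0x3FF   # next 10 bits select the PTE
    offset    = linear & 0xFFF           # low 12 bits: offset within 4 KB page

    page_table_base = page_directory[pde_index]          # PDE -> page table
    page_frame_base = page_tables[page_table_base][pte_index]  # PTE -> frame
    return page_frame_base + offset      # physical address = frame + offset

# One directory entry pointing at one page table, which maps linear
# page 0 to physical frame 0x5000:
pd  = {0: 0x1000}
pts = {0x1000: {0: 0x5000}}
print(hex(translate(0x00000ABC, pd, pts)))  # 0x5abc
```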
Chapter 5: Pentium Processor
Features:
• The Pentium microprocessor was initially available with a 60-66 MHz clock. [A higher clock gives faster program execution; this was double or triple that of earlier microprocessors.]
• The Pentium has two 32-bit integer execution units, each with its own ALU. [Because of the two execution units it is called a superscalar architecture: two units working in parallel give higher throughput.]
• The Pentium has a 64-bit data bus. [A 64-bit data bus requires 8-way memory banking.]
• The Pentium has a 32-bit address bus. [This gives 4 GB of physical address space; virtual memory can be up to 64 TB.]
• The Pentium has five-stage pipelining. (There are actually two execution units, each with 5 stages, so there are two 5-stage pipelines, known as the U pipe and the V pipe.)
• The Pentium has an on-chip floating-point unit, which uses an 8-stage pipeline.
• The Pentium has on-chip L1 cache memory. [16 KB split cache: 8 KB for code and 8 KB for data.]
Architecture:
Working:
The Pentium processor begins by fetching
instructions from the Instruction Cache using
the Prefetch Buffers, and if a branch
instruction is detected, the Branch Target
Buffer predicts the next instruction to avoid
delays. These instructions are decoded and
passed to the Control Unit, which manages
the flow. The decoded instructions are then
executed in two parallel integer pipelines—
the U-pipe and V-pipe, each with its own ALU
and Address Generator. Data needed for
execution is fetched from the Dual-Access
Data Cache, and results are stored in Integer
Registers. For complex operations like
floating-point math, control passes to the
Floating Point Unit (FPU), which performs
addition, division, and multiplication using its
own dedicated registers and circuits.
Meanwhile, the Bus Interface Unit handles
communication with memory and I/O
through 64-bit data and 32-bit address buses,
and the Page Unit supports memory
management.
Superscalar Structure:
• Superscalar means the architecture has more than one execution unit.
• Branch prediction is done using the BTB (Branch Target Buffer), which stores 256 entries.
Execution Unit:
• Two execution units are executing instructions in parallel.
Cache Memory:
• Pentium has integrated cache memory, which increases the execution speed of processor.
• It has 16 KB of split cache memory integrated on chip: 8 KB for data and 8 KB for code.
• Floating point unit enhances the capabilities of microprocessor for graphics & multimedia
applications.
The Pentium processor features a superscalar architecture that includes separate pipelines for
integer and floating-point operations. The floating-point pipeline is specifically optimized for
arithmetic operations on real numbers (fractions, decimals, scientific data).
Stage | Description
1. Prefetch | Identical to the integer pipeline's prefetch stage; instructions are fetched from memory.
2. Instruction Decode 1 (D1) | Initial decoding of the instruction, including recognizing the opcode and preparing for further decode.
3. Instruction Decode 2 (D2) | More detailed decoding; determines the type of operation and the operand types.
4. Execution (Ex) | Operand values are read from registers or memory, including address generation and access.
5. FP Execution 1 (FP Ex1) | The operand is loaded into a floating-point register and converted into floating-point format if needed (e.g., from integer or double).
6. FP Execution 2 (FP Ex2) | The actual floating-point arithmetic operation (ADD, MUL, DIV, etc.) is performed by the floating-point unit.
7. Write FP Result | The computed result is rounded according to the FPU control word and written to the destination floating-point register.
8. Error Reporting | If any exception or error (overflow, underflow, divide by zero) occurs, it is flagged and the FPU status word is updated accordingly.
• A four-cycle penalty (3 + 1 extra) may be incurred if the branch is executed in the V pipeline.
• It functions like a look-aside cache, monitoring the instruction stream during the Decode
Instruction (DI) stages of both pipelines.
Each BTB entry stores:
1. The branch target address – the address from which to fetch if the branch is taken.
2. Two history bits – these keep track of the past behavior of the branch (how often the branch was taken).
3. The memory address of the branch instruction – used for identifying the instruction.
• If the address is not found in the BTB (BTB miss), the processor predicts that the branch will
not be taken.
• If the address is found in the BTB (BTB hit), the two history bits are used to decide the
prediction:
• If the branch is predicted to be taken, the current instruction queue is deactivated, and the
prefetcher starts fetching instructions from the target (branch) address. These instructions
are stored in a second queue, which becomes the new active queue.
• If the branch is predicted not to be taken, then nothing changes. The current active queue
continues to fetch instructions from the next sequential address after the branch.
When the actual branch outcome becomes known, the BTB is updated:
o If the prediction was correct, the history bits in the BTB entry are upgraded to strengthen the prediction.
o If the prediction was wrong, the prefetcher switches back to the other queue to fetch the correct instructions.
o If there is a BTB entry, its history bits are downgraded slightly (toward "not taken").
o If there was no BTB entry, a new entry is created in the BTB and marked as strongly taken.
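The 2-bit history scheme can be modeled as a saturating counter per branch: values 0-1 predict "not taken", values 2-3 predict "taken". The sketch below models this update policy only, under the stated assumptions (a miss predicts not taken; a newly created entry is strongly taken); it is not the Pentium's exact BTB hardware, and all names are hypothetical:

```python
# Sketch of 2-bit (saturating counter) branch prediction as described above.
# Counter values: 0-1 -> predict not taken, 2-3 -> predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.table = {}  # branch address -> 2-bit counter (the "BTB")

    def predict(self, addr: int) -> bool:
        # BTB miss: predict not taken, as the notes above state.
        return self.table.get(addr, 0) >= 2

    def update(self, addr: int, taken: bool):
        if taken:
            if addr not in self.table:
                self.table[addr] = 3           # new entry: strongly taken
            else:
                self.table[addr] = min(3, self.table[addr] + 1)
        elif addr in self.table:
            # downgrade slightly toward "not taken"
            self.table[addr] = max(0, self.table[addr] - 1)

p = TwoBitPredictor()
print(p.predict(0x400))        # False (BTB miss)
p.update(0x400, taken=True)    # entry created, marked strongly taken
print(p.predict(0x400))        # True
```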
Conclusion:
The branch prediction logic in the Pentium processor is designed to improve performance by
reducing pipeline stalls during control flow instructions. By using the BTB and 2-bit history-based
prediction, the processor intelligently guesses the most likely path of execution and learns over time.
Correct predictions allow seamless execution, while incorrect ones trigger minimal recovery actions
like flushing and re-fetching.
This dynamic mechanism helps maintain high instruction throughput and overall efficiency in
modern processors.
Thus, fast SRAM cache is placed between the CPU and the slower DRAM through a cache controller.
SRAM cache is used to hold the most frequently accessed instructions as well as data and make it
available very quickly. The cache controller controls the complete process.
When the Pentium processor wants to read data, it sends out the memory address of desired data.
Then cache controller decides whether the address of data is in the SRAM cache or in the main
memory. When the data is in the cache, it is called a cache hit, and the data is read from the cache without delay.
When the processor sends out an address of data which does not exist in the cache memory, it is
called a cache miss. Subsequently, the cache controller must go out to the main memory.
Cache lines are usually 32 bytes and are read using burst mode, which increases speed by
transferring multiple bytes at once, boosting performance.
1. Write-Through Strategy:
Data is written to both cache and main memory on a cache hit. It ensures memory
consistency but is slower due to frequent memory access. Write buffers can help but only
until they fill up.
2. Write-Back Strategy:
Data is written only to the cache first. Main memory is updated later during specific events
(like executing WBINVD, a cache miss, or a flush). It improves performance but takes longer
during cache replacements.
3. Write-Allocate:
On a write miss, a cache line is allocated and updated so future accesses are faster. But if the
cache is full, this might replace useful old data.
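The difference between the three strategies above can be sketched with plain dictionaries standing in for the cache and main memory. This is a behavioral illustration under those assumptions, not a model of the Pentium's actual cache controller:

```python
# Behavioral sketch of the write strategies described above.
# `cache` and `memory` are plain dicts; `dirty` tracks modified lines.
def write_through(addr, value, cache, memory):
    cache[addr] = value
    memory[addr] = value          # memory always consistent, but slower

def write_back(addr, value, cache, memory, dirty):
    cache[addr] = value
    dirty.add(addr)               # memory updated later, only for dirty lines

def flush(cache, memory, dirty):  # e.g. on WBINVD or line replacement
    for addr in dirty:
        memory[addr] = cache[addr]
    dirty.clear()

cache, memory, dirty = {}, {}, set()
write_back(0x100, 42, cache, memory, dirty)
print(memory.get(0x100))  # None: main memory not yet updated
flush(cache, memory, dirty)
print(memory[0x100])      # 42: written back on the flush
```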
Cache Organization in Pentium Processors: A Cache Memory in Pentium Processor is used to store
both the data and the address where the data is stored in the main memory. There are methods of
cache organization such as direct mapped cache and two-way set-associative cache.
1. Direct-Mapped Cache:
o If two addresses share the same cache index, they compete (called contention).
o Fastest cache type since tag and data are accessed in parallel.
o 8KB cache with 16-byte lines uses 512 lines (A12–A4 for indexing, A31–A13 for tag).
o Data is fetched in blocks using burst mode; line size typically matches block size.
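The direct-mapped address split listed above (8 KB cache, 16-byte lines, 512 lines, index A12-A4, tag A31-A13) can be sketched as bit-field extraction. The function name is illustrative:

```python
# Sketch of the direct-mapped address split described above:
# 8 KB cache with 16-byte lines -> 512 lines,
# offset = A3..A0, index = A12..A4, tag = A31..A13.
def split_address(addr: int):
    offset = addr & 0xF            # byte within the 16-byte line
    index  = (addr >> 4) & 0x1FF   # which of the 512 cache lines
    tag    = addr >> 13            # stored in the cache, compared on lookup
    return tag, index, offset

tag, index, offset = split_address(0x00003A24)
print(tag, index, offset)  # 1 418 4
```

Two addresses with the same index bits but different tags map to the same line, which is exactly the contention case mentioned above.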
2. Two-Way Set-Associative Cache:
o The cache is divided into sets, and each set contains 2 cache lines. When data needs to be stored in the cache, it can be placed in either of the two lines in the corresponding set.
o Each address (from main memory) can be stored in two possible locations in cache
(reduces contention).
o 8KB cache with 16-byte lines split into two halves of 256 lines each.
Conclusion:
The MESI protocol maintains cache consistency efficiently using its four-state mechanism. By
minimizing unnecessary memory accesses and avoiding stale data issues, it ensures high
performance and correctness in multiprocessor systems.
Chapter 6: Pentium 4
Summary to Remember:
"Pentium 4 with NetBurst is all about SPEED — deeper pipelines (Hyper Pipelining), smarter
instruction handling (Trace Cache, Out-of-Order Execution), faster buses, and SIMD boosts (SSE2)
— all making it ideal for high-performance tasks."
Q.3 Explain hyper threading technology and its use in Pentium 4. [10M] (May 2023)
Ans.
Hyper-Threading Technology (HTT) is Intel’s simultaneous multithreading (SMT) technology that
allows a single physical processor core to behave like two logical processors, so it can execute two
threads (instruction sequences) at the same time.
Key Concepts:
• Normally, a single core executes one thread at a time.
• With Hyper-Threading, the Pentium 4 can execute two threads simultaneously using the
same core.
• This improves CPU efficiency by utilizing idle resources (like ALUs, FPUs, cache, etc.) when
one thread is waiting (e.g., for memory access).
Use in Pentium 4:
• Introduced with Pentium 4 (NetBurst architecture).
• Improves performance by:
o Increasing CPU throughput.
o Reducing idle time of execution units.
o Enhancing multitasking (running multiple apps smoothly).
• Best performance gains seen in multi-threaded applications (like video editing, gaming,
compiling).
Advantages:
• Up to 30% better performance in multithreaded workloads.
• More efficient use of CPU resources.
• Improves responsiveness in multitasking environments.
Limitations:
• Does not double performance; gains depend on workload.
• Performance can degrade slightly in non-optimized or heavily memory-bound applications
due to shared resources.
Easy to Remember:
"One core, two threads = Hyper-Threading. Introduced in Pentium 4 to increase efficiency and
performance without extra hardware."