Microprocessor Systems Design

Lecture 2
Lecture Content
 Assembly Language Programming
 Assembly Language Applications
 Integer Storage Sizes
 Evolution of the Microprocessor
 Instruction Execution Cycle in x86 Processors
 Registers in x86 Processors
Assembly Language Applications
 In the early days of programming, most application programs were written partially or entirely in assembly language. Why?
 Because programs had to fit in a small area of memory and had to run as efficiently as possible.
 As computers became more powerful, programs became longer and more complex; this demanded the use of high-level languages.
Assembly Language Applications
(Cont’d)
 It is rare to see large application programs
written completely in assembly language.
Why?
 Because they would take too much time to write
and maintain.

 Instead, assembly language is used to optimize certain sections of application programs for speed and to access computer hardware.
 Assembly language is also used when writing embedded systems programs and device drivers.
Comparison between Assembly Language and High-Level Languages

Type of Application | High-Level Languages | Assembly Language
Business application software, written for a single platform, medium to large size | Formal structures make it easy to organize and maintain large sections of code. | Minimal formal structure, so one must be imposed by programmers who have varying levels of experience. This leads to difficulties maintaining existing code.
Hardware device driver | Language may not provide for direct hardware access. Even if it does, the coding can be awkward. | Hardware access is straightforward and simple. Easy to maintain when programs are short.
Business application written for multiple platforms (different operating systems) | Usually very portable. The source code can be recompiled on each target operating system with minimal changes. | Must be recoded separately for each platform, often using an assembler with a different syntax. Difficult to maintain.
Embedded systems and computer games requiring direct hardware access | Produces too much executable code, and may not run efficiently. | Ideal, because the executable code is small and runs quickly.
Integer Storage Sizes
Signed Integers
 Signed binary integers can be either positive or negative.
 In general, the most significant bit (MSB) indicates the number's sign.
 A value of 0 indicates that the integer is positive, and 1 indicates that it is negative.
Two's Complement Notation
Two's Complement of Hexadecimal
 6A3D --> 95C2 + 1 --> 95C3
 95C3 --> 6A3C + 1 --> 6A3D
 21F0 --> DE0F + 1 --> DE10
 DE10 --> 21EF + 1 --> 21F0
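The hexadecimal examples above can be checked mechanically: invert every bit (equivalently, subtract each hex digit from F) and add 1, keeping only the low 16 bits. This is a minimal Python sketch; the function name `twos_complement16` is hypothetical.

```python
def twos_complement16(value):
    """Return the 16-bit two's complement (negation) of value."""
    inverted = ~value & 0xFFFF        # flip all 16 bits (same as F minus each hex digit)
    return (inverted + 1) & 0xFFFF    # add 1 and discard any carry out of bit 15

for x in (0x6A3D, 0x95C3, 0x21F0, 0xDE10):
    print(f"{x:04X} --> {twos_complement16(x):04X}")
```

Applying the operation twice returns the original value, which is why each pair of lines above mirrors the other.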
Converting Signed Binary to Decimal
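A minimal Python sketch of the usual method, assuming the bit pattern is given as a string of '0' and '1' characters: if the MSB is 0, convert the bits directly; if it is 1, take the two's complement to find the magnitude and attach a minus sign. The function name is hypothetical.

```python
def signed_binary_to_decimal(bits):
    """Interpret a string of '0'/'1' characters as a two's-complement integer."""
    width = len(bits)
    value = int(bits, 2)
    if bits[0] == "0":                  # MSB clear: the number is non-negative
        return value
    # MSB set: the magnitude is the two's complement of the bit pattern
    magnitude = (~value + 1) & ((1 << width) - 1)
    return -magnitude

print(signed_binary_to_decimal("11110000"))  # -16
print(signed_binary_to_decimal("01111111"))  # 127
```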
Maximum and Minimum Values
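For an n-bit integer, the signed (two's-complement) range is -2^(n-1) to 2^(n-1) - 1 and the unsigned range is 0 to 2^n - 1. The sketch below tabulates these for the x86 size names byte, word, doubleword, and quadword; the helper names are hypothetical.

```python
SIZES = {"byte": 8, "word": 16, "doubleword": 32, "quadword": 64}

def signed_range(bits):
    """(min, max) of a two's-complement integer with the given number of bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def unsigned_range(bits):
    """(min, max) of an unsigned integer with the given number of bits."""
    return 0, (1 << bits) - 1

for name, bits in SIZES.items():
    s_min, s_max = signed_range(bits)
    u_min, u_max = unsigned_range(bits)
    print(f"{name:<10} signed {s_min} to {s_max}, unsigned {u_min} to {u_max}")
```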
Evolution of the Microprocessor
 Early Intel Microprocessors
 The IBM-AT (Advanced Technology)
 Intel IA-32 Family (Intel Architecture)
 Intel P6 Family (Intel Pentium)
 CISC and RISC
Early Intel Microprocessors
 Intel 8080 (1974)
 64K addressable RAM
 8-bit registers
 8-inch floppy disks!

 Intel 8086/8088 (1978)
 The IBM-PC and IBM-PC XT (eXtended Technology) used the 8088
 1 MB addressable RAM
 16-bit registers
 16-bit data bus (8-bit for 8088)
 separate floating-point unit (8087)
The IBM-AT
 Intel 80286 (1982)
 16 MB addressable RAM
 Protected memory
 several times faster than 8086
 introduced the 16-bit AT bus architecture (later called ISA)
 80287 floating point unit
Intel IA-32 Family
 Intel386 (1985)
 4 GB addressable RAM, 32-bit
registers, paging (virtual memory)
 Intel486 (1989)
 instruction pipelining
 Pentium (1993)
 superscalar, 32-bit address bus, 64-bit internal data path

Intel P6 Family
 Pentium Pro
 advanced optimization techniques in
microcode
 Pentium II
 MMX (multimedia) instruction set
 Pentium III
 SIMD (streaming extensions) instructions
 Pentium 4 and Xeon
 Intel NetBurst micro-architecture, tuned for
multimedia

CISC
 The earliest Intel processors were based on the Complex Instruction Set Computer (CISC) approach.
 The Intel instruction set includes powerful, high-level complex operations.
 The idea was that compilers would have less work to do if individual machine-language instructions were powerful.
 A major disadvantage of the CISC approach is that complex instructions require a relatively long time for the processor to decode and execute.
 An interpreter program inside the CPU, written in a language called microcode, decodes and executes each machine instruction.
RISC
 A RISC (Reduced Instruction Set Computer) machine language consists of a relatively small number of short instructions that can be executed very quickly.
 Rather than using a microcode interpreter to decode and execute machine instructions, a RISC processor directly decodes and executes instructions using hardware.
 High-speed engineering and graphics workstations have been built using RISC processors for many years. These systems have been expensive because the processors were produced in small quantities.
 Intel recognized many advantages to the RISC approach.
CISC and RISC
 CISC – complex instruction set
 large instruction set
 high-level operations
 requires microcode interpreter
 examples: Intel 80x86 family

 RISC – reduced instruction set
 simple, atomic instructions
 small instruction set
 directly executed by hardware
 examples:
 ARM (Advanced RISC Machines)
 DEC Alpha (later owned by Compaq, then HP; now discontinued)
Basic Microcomputer Design
 clock synchronizes CPU operations
 control unit (CU) coordinates sequence of execution steps
 ALU performs arithmetic and bitwise processing
[Figure: block diagram. The Central Processor Unit (CPU), containing the registers, ALU, CU, and clock, is connected to the Memory Storage Unit, I/O Device #1, and I/O Device #2 by the data bus, control bus, and address bus.]
Clock
 synchronizes all CPU and bus operations
 Machine (clock) cycle measures time of a single operation
 Clock is used to trigger events
[Figure: clock waveform alternating between 1 and 0; one cycle spans one full period.]
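The length of one clock cycle is the reciprocal of the clock frequency. A tiny sketch; the function name is hypothetical, and the 4.77 MHz figure is the clock rate of the original 8088-based IBM-PC.

```python
def cycle_time_ns(frequency_hz):
    """Duration of one clock cycle in nanoseconds: period = 1 / frequency."""
    return 1e9 / frequency_hz

print(cycle_time_ns(1e9))     # 1 GHz clock: 1.0 ns per cycle
print(cycle_time_ns(4.77e6))  # 4.77 MHz clock: roughly 210 ns per cycle
```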
Instruction Execution Cycle
 Before executing, a program is loaded into
memory.
 Executing a machine instruction requires three
basic steps: fetch, decode and execute.
 Two more steps are required when the
instruction uses a memory operand: fetch
operand and store output operand.
Loop
    fetch the next instruction and advance the instruction pointer (IP)
    decode the instruction
    IF memory operand needed, read value from memory (fetch operand)
    execute the instruction
    IF result is memory operand, write result to memory (store output operand)
Continue loop
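The loop above can be sketched as a tiny interpreter. This is a hedged illustration only: it assumes a hypothetical accumulator machine with four made-up instructions (LOAD, ADD, STORE, HALT), not real x86 encoding, with program and data sharing one memory list.

```python
# Hypothetical toy instruction set used only to illustrate the execution cycle.
def run(memory):
    ip = 0                                   # instruction pointer
    acc = 0                                  # one register standing in for the CPU's registers
    while True:
        instruction = memory[ip]             # fetch the next instruction...
        ip += 1                              # ...and advance the instruction pointer
        opcode, *operands = instruction      # decode
        if opcode == "HALT":
            return acc
        if opcode not in ("LOAD", "ADD", "STORE"):
            raise ValueError(f"unknown opcode {opcode!r}")
        address = operands[0]
        if opcode in ("LOAD", "ADD"):
            operand = memory[address]        # fetch operand from memory
        if opcode == "LOAD":                 # execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        else:                                # STORE: store output operand to memory
            memory[address] = acc

# Program at addresses 0-3, data at 4-6: computes memory[6] = memory[4] + memory[5]
memory = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT",), 2, 3, 0]
print(run(memory), memory[6])  # 5 5
```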
Instruction Execution Cycle
 Fetch
 The control unit fetches the instruction from the instruction queue (cache memory) and increments the instruction pointer (IP), also called the program counter.
 Decode
 The control unit decodes the instruction using the instruction decoder.
 Fetch Operand
 If the instruction uses an input operand located in memory, the control unit uses a read operation to retrieve the operand and copy it into internal registers. Internal registers are not visible to user programs.
 Execute
 The ALU executes the instruction, using named registers and internal registers as operands, and sends the result to named registers and/or memory. The ALU also updates the status flags.
 Store Output Operand
 If the output operand is in memory, the control unit uses a write operation to store the data.
Instruction Execution Cycle
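[Figure: the PC points to the next instruction of the program (I-1, I-2, I-3, I-4) in the instruction queue (cache memory). I-1 is fetched into the instruction register and decoded; operands op1 and op2 are read from memory into registers; the ALU executes the instruction, updates the flags, and writes the output back to registers or memory.]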
Pipelined Execution
 Pipelining makes it possible for the processor to overlap the execution of instructions
 Instruction execution is divided into discrete stages
 More efficient use of cycles, greater throughput of instructions:

Cycle | S1  | S2  | S3  | S4  | S5  | S6
1     | I-1 |     |     |     |     |
2     | I-2 | I-1 |     |     |     |
3     |     | I-2 | I-1 |     |     |
4     |     |     | I-2 | I-1 |     |
5     |     |     |     | I-2 | I-1 |
6     |     |     |     |     | I-2 | I-1
7     |     |     |     |     |     | I-2

For k stages and n instructions, the number of required cycles is k + (n - 1).
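The grid above can be generated for any k and n. A minimal sketch, assuming every stage takes exactly one cycle and the pipeline never stalls, so instruction i (0-based) is in stage s (0-based) during cycle i + s + 1; the function name is hypothetical.

```python
def pipeline_grid(k, n):
    """Rows (one per clock cycle) of a k-stage pipeline processing n instructions."""
    total_cycles = k + (n - 1)
    grid = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage in range(k):
            i = cycle - 1 - stage              # index of the instruction in this stage now
            row.append(f"I-{i + 1}" if 0 <= i < n else "")
        grid.append(row)
    return grid

for cycle, row in enumerate(pipeline_grid(6, 2), start=1):
    print(cycle, " ".join(cell or "." for cell in row))
```

The number of rows is k + (n - 1), matching the formula.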
Pipelined Execution
 Each step in the instruction cycle takes at
least one tick of the system clock, called a
clock cycle.
 But this doesn't mean that the processor must
wait until all steps are completed before
beginning to process the next instruction.
 The processor can execute the steps in
parallel, a technique known as pipelining.
 The Intel386 used a six-stage execution
cycle.

 The six stages and the parts of the processor that carry them out are listed here:
Pipelined Execution
 1. Bus Interface Unit (BIU): accesses memory and provides input-output.
 2. Code Prefetch Unit: receives machine instructions from the BIU and inserts them into a holding area named the instruction queue.
 3. Instruction Decode Unit: decodes machine instructions from the prefetch queue and translates them into microcode.
 4. Execution Unit: executes the microcode instructions produced by the instruction decode unit.
 5. Segment Unit: translates logical addresses to linear addresses and performs protection checks.
 6. Paging Unit: translates linear addresses into physical addresses, performs page protection checks, and keeps a list of recently accessed pages.
Pipelined Execution
 Example: Let's assume that each execution stage in the processor requires a single clock cycle.
 Figure One uses a grid to represent a six-stage non-pipelined processor, the type used by Intel prior to the Intel486.
 When instruction I-1 has finished stage S6, instruction I-2 begins. Twelve clock cycles are required to execute the two instructions.
 In other words, for k execution stages, n instructions require (n * k) cycles to process. Of course, Figure One represents a major waste of CPU resources because each stage is used only once every six clock cycles.
Pipelined Execution
 If a processor supports pipelining, as in Figure Two, a new instruction can enter stage S1 during the second clock cycle.
 Meanwhile, the first instruction has entered stage S2.
 This enables the overlapped execution of the two instructions.
 Two instructions, I-1 and I-2, are shown progressing through the pipeline.
 I-2 enters stage S1 as soon as I-1 has moved to stage S2.
 As a result, only seven clock cycles are required to execute the two instructions.
Pipelined Execution
 When the pipeline is full, all six stages are in use all the time.
 In general, for k execution stages, n instructions require k + (n - 1) cycles to process.
 Whereas the non-pipelined processor we showed earlier required 12 cycles to process 2 instructions, the pipelined processor can process 7 instructions in the same amount of time.
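The 12-versus-7 comparison follows directly from the two formulas: set k + (n - 1) equal to the non-pipelined budget of n * k cycles and solve for n. A short arithmetic sketch (function names hypothetical):

```python
def non_pipelined_cycles(k, n):
    """Each instruction passes through all k stages before the next one starts."""
    return n * k

def pipelined_cycles(k, n):
    """The first instruction needs k cycles; each later one finishes one cycle after it."""
    return k + (n - 1)

k = 6
budget = non_pipelined_cycles(k, 2)        # 12 cycles for 2 instructions without pipelining
n = budget - k + 1                         # solve k + (n - 1) = budget for n
print(budget, n, pipelined_cycles(k, n))   # 12 7 12
```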
Superscalar
 A superscalar processor has multiple
execution pipelines.
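 In the following, note that stage S4 has left and right pipelines (u and v), and each needs two clock cycles for S4:

Cycle | S1  | S2  | S3  | S4 (u) | S4 (v) | S5  | S6
1     | I-1 |     |     |        |        |     |
2     | I-2 | I-1 |     |        |        |     |
3     | I-3 | I-2 | I-1 |        |        |     |
4     | I-4 | I-3 | I-2 | I-1    |        |     |
5     |     | I-4 | I-3 | I-1    | I-2    |     |
6     |     |     | I-4 | I-3    | I-2    | I-1 |
7     |     |     |     | I-3    | I-4    | I-2 | I-1
8     |     |     |     |        | I-4    | I-3 | I-2
9     |     |     |     |        |        | I-4 | I-3
10    |     |     |     |        |        |     | I-4

For k stages and n instructions, with odd-numbered instructions in the u-pipeline and even-numbered instructions in the v-pipeline, the number of required cycles is k + n.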
Superscalar
 A superscalar processor has two or more execution pipelines, making it possible for two instructions to be in the execution stage at the same time.
 In order to better understand why a superscalar processor would be useful, let's consider the preceding pipelined example, in which we assumed that the execution stage (S4) required a single clock cycle.
 What would happen if stage S4 required two clock cycles? Then a bottleneck would occur, because each instruction must wait for the one ahead of it to leave S4.
Superscalar
 Instruction I-2 cannot enter stage S4 until I-1 has completed the stage, so I-2 has to wait one more cycle before entering stage S4.
 As more instructions enter the pipeline, wasted cycles occur (shaded in gray).
 In general, for k stages (where one stage requires 2 cycles), n instructions require (k + 2n - 1) cycles to process.
 A superscalar processor design is used to place multiple instructions in the execution stage at the same time.
 For n pipelines, n instructions can execute during the same clock cycle.
Superscalar
 In Figure Two, odd-numbered instructions enter the u-pipeline and even-numbered instructions enter the v-pipeline.
 This removes the wasted cycles, and it is now possible to process n instructions in (k + n) cycles.
 The Intel Pentium, which had two execution pipelines, was the first superscalar processor in the IA-32 family.
 The Pentium Pro processor was the first to use three execution pipelines.
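Both cycle counts, k + 2n - 1 for a single pipeline whose S4 takes two cycles and k + n once S4 is split into u and v pipelines, can be checked with a small model. This is a hedged sketch under stated assumptions: instructions flow in order, each stage copy holds one instruction at a time, an instruction that finishes a stage waits there until the next stage is free, and instructions alternate between the u and v copies. The function name `pipeline_cycles` is hypothetical.

```python
def pipeline_cycles(stages, n):
    """Clock cycles needed to push n >= 1 instructions through an in-order pipeline.

    stages is a list of (duration, copies) pairs: duration is how many cycles the
    stage takes; copies is how many parallel pipelines implement it (2 for u and v).
    Instruction i (0-based) uses copy i % copies, so I-1, I-3, ... share one copy
    and I-2, I-4, ... the other.
    """
    k = len(stages)
    enter = []                      # enter[i][s]: cycle in which instruction i enters stage s
    for i in range(n):
        row = []
        for s, (duration, copies) in enumerate(stages):
            # ready once the previous stage is done (or at cycle 1 for the first stage)
            ready = 1 if s == 0 else row[s - 1] + stages[s - 1][0]
            j = i - copies          # the earlier instruction that used this same copy
            if j >= 0:
                # that instruction frees the copy when it moves on to the next stage
                freed = enter[j][s + 1] if s + 1 < k else enter[j][s] + duration
                ready = max(ready, freed)
            row.append(ready)
        enter.append(row)
    return enter[-1][-1] + stages[-1][0] - 1   # last cycle spent in the final stage

k, n = 6, 4
one_pipeline = [(1, 1)] * 3 + [(2, 1)] + [(1, 1)] * 2   # S4 takes 2 cycles, no u/v split
u_and_v = [(1, 1)] * 3 + [(2, 2)] + [(1, 1)] * 2        # S4 has u and v pipelines
print(pipeline_cycles(one_pipeline, n))  # 13 = k + 2n - 1
print(pipeline_cycles(u_and_v, n))       # 10 = k + n
```

With every stage set to (1, 1), the same model reproduces the basic k + (n - 1) result.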
Reading from Memory
 Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU. The steps are:
 address placed on address bus
 Read Line (RD) set low
 CPU waits one cycle for memory to respond
 Read Line (RD) goes to 1, indicating that the data is on the data bus
[Figure: timing diagram over Cycle 1 to Cycle 4 showing the CLK, ADDR, RD, and DATA signals.]
Cache Memory
 High-speed, expensive static RAM both inside and outside the CPU.
 Level-1 cache: inside the CPU
 Level-2 cache: outside the CPU
 Cache hit: when data to be read is already in cache memory
 Cache miss: when data to be read is not in cache memory.
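To make hits and misses concrete, here is a minimal sketch of a direct-mapped cache, in which each memory block can occupy exactly one cache line. The slide does not specify a cache organization; direct mapping is assumed here because it is the simplest, and the class name and sizes are hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each memory block maps to exactly one cache line."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines    # full block number held by each line (simplified tag)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        block = address // self.block_size    # memory block containing the address
        line = block % self.num_lines         # the only line that block may occupy
        if self.tags[line] == block:
            self.hits += 1                    # cache hit: data already in the cache
            return True
        self.misses += 1                      # cache miss: load the block from main memory
        self.tags[line] = block
        return False

cache = DirectMappedCache(num_lines=4, block_size=16)
for addr in [0, 4, 8, 64, 0, 4]:
    cache.read(addr)
print(cache.hits, cache.misses)  # 3 3 (addresses 0 and 64 compete for line 0)
```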
Multitasking
 OS can run multiple programs at the same
time.
 Multiple threads of execution within the same
program.
 Scheduler utility assigns a given amount of
CPU time to each running program.
 Rapid switching of tasks
 gives illusion that all programs are running at once
 the processor must support task switching.
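One simple way a scheduler can give each program "a given amount of CPU time" is round-robin time slicing. The sketch below assumes that policy (real operating system schedulers are considerably more elaborate); the function name and program names are hypothetical.

```python
from collections import deque

def round_robin(burst_times, quantum):
    """Simulate round-robin time slicing on one CPU.

    burst_times maps a program name to the CPU time it still needs. Each program
    runs for at most `quantum` time units, then the scheduler switches to the next
    one. Returns the sequence of (program, time_run) slices.
    """
    ready = deque(burst_times.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)
        timeline.append((name, run))
        if remaining > run:
            ready.append((name, remaining - run))  # task switch: back of the queue
    return timeline

print(round_robin({"editor": 3, "compiler": 5, "player": 2}, quantum=2))
```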
Addressable Memory
 Protected mode
 4 GB
 32-bit address
 Real-address and Virtual-8086
modes
 1 MB space
 20-bit address
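Both limits follow from the address width: an n-bit address can select 2^n distinct bytes. A one-line check (function name hypothetical):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses an n-bit address can select."""
    return 2 ** address_bits

MB = 2 ** 20
GB = 2 ** 30
print(addressable_bytes(20) // MB, "MB")  # real-address mode: 1 MB
print(addressable_bytes(32) // GB, "GB")  # protected mode: 4 GB
```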
General-Purpose Registers
Named storage locations inside the CPU, optimized for speed.
 32-bit General-Purpose Registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI
 16-bit Segment Registers: CS, SS, DS, ES, FS, GS
 Other registers: EFLAGS, EIP
General-Purpose Registers
 EAX
 It is automatically used by multiplication and division instructions (together with EDX, which holds the upper half of a product or the remainder of a division). It is often called the extended accumulator register.
 ECX
 The CPU automatically uses it as a loop counter (for example, the LOOP instruction decrements ECX), which makes it the natural choice for counted loops.
 ESP
 It addresses data on the stack (a system memory structure).
 It should never be used for ordinary arithmetic or data transfer.
 It is often called the extended stack pointer register.
General-Purpose Registers
 ESI and EDI
 They are used by high-speed memory transfer
instructions.
 They are sometimes called the extended source
index and extended destination index registers.
 EBP
 It is used by high-level languages to reference
function parameters and local variables on
the stack.
 It should not be used for ordinary arithmetic or
data transfer except at an advanced level of
programming.
 It is often called the extended frame pointer
register.
Accessing Parts of Registers
 Use 8-bit name, 16-bit name, or 32-bit name
 Applies to EAX, EBX, ECX, and EDX
[Figure: EAX (32 bits) holds AX (16 bits) in its lower half; AX is split into AH (high 8 bits) and AL (low 8 bits).]
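The relationship between the names can be shown with bit masks: AX is the low 16 bits of EAX, AH is bits 8 to 15, and AL is bits 0 to 7. In 32-bit x86 there is no separate name for the upper 16 bits of EAX. The function name below is hypothetical.

```python
def register_parts(eax):
    """Split a 32-bit EAX value into the names the CPU provides for its parts."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF          # AX: low 16 bits of EAX
    ah = (ax >> 8) & 0xFF      # AH: high byte of AX
    al = ax & 0xFF             # AL: low byte of AX
    return {"EAX": eax, "AX": ax, "AH": ah, "AL": al}

parts = register_parts(0x12345678)
print({name: hex(v) for name, v in parts.items()})
# {'EAX': '0x12345678', 'AX': '0x5678', 'AH': '0x56', 'AL': '0x78'}
```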
Index and Base Registers
 Some registers have only a 16-bit name for their lower half:
 ESI (SI), EDI (DI), EBP (BP), and ESP (SP)
Some Specialized Register Uses (1 of 2)
 General-Purpose
 EAX – Accumulator
 ECX – Loop Counter
 ESP – Stack Pointer
 ESI, EDI – Index Registers
 EBP – extended frame pointer (stack)

 Segment
 CS – code segment
 DS – data segment
 SS – stack segment
 ES, FS, GS - additional segments
Some Specialized Register Uses (2 of 2)
 EIP – Instruction Pointer
 EFLAGS
 status and control flags
 each flag is a single binary bit
Segment Registers
 The segment registers are used as base locations for preassigned memory areas called segments.
 Some segments hold program instructions (code), others hold variables (data), and another segment, called the stack segment, holds local function variables and function parameters.
Instruction Pointer
 The instruction pointer (EIP) register contains the address of the next instruction to be executed.
 Certain machine instructions manipulate this address, causing the program to branch to a new location.
EFLAGS Register
 The EFLAGS (or just Flags) register consists of
individual binary bits that either control the
operation of the CPU or reflect the outcome
of some CPU operation.
 There are machine instructions that can
test and manipulate the processor flags.
 A flag is set when it equals 1; it is clear (or reset) when it equals 0.
EFLAGS Register
 Control Flags
 Individual bits can be set in the EFLAGS register
by the programmer to control the CPU's operation.
 Examples are the Direction and Interrupt flags.
 We will cover these on an as-needed basis later in the book.
 Status Flags
 The Status flags reflect the outcomes of arithmetic and logical operations performed by the CPU.
 They are the Overflow, Sign, Zero, Auxiliary Carry, Parity, and Carry flags.
Status Flags
 Carry (CF)
 CF is set when the result of an unsigned arithmetic
operation is too large to fit into the destination.
 Overflow (OF)
 OF is set when the result of a signed arithmetic operation is either too large or too small to fit into the destination.
 Sign (SF)
 SF is set when the result of an arithmetic or logical operation generates a negative result.
 Zero (ZF)
 ZF is set when the result of an arithmetic or logical operation generates a result of zero.
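A sketch of how these four flags come out of an addition, using 8-bit operands for readability (the same rules apply at bit 15 or bit 31 for 16- and 32-bit operands). The operands are raw bit patterns that can be read as unsigned or as signed two's-complement values; the function name is hypothetical.

```python
def add8_with_flags(a, b):
    """Add two 8-bit values (0..255) and report CF, OF, SF, and ZF."""
    raw = a + b
    result = raw & 0xFF                     # keep only the 8 destination bits
    flags = {
        "CF": raw > 0xFF,                   # unsigned result did not fit in 8 bits
        # signed overflow: both operands share a sign and the result's sign differs
        "OF": bool((a ^ result) & (b ^ result) & 0x80),
        "SF": bool(result & 0x80),          # MSB of the result is 1
        "ZF": result == 0,                  # result is zero
    }
    return result, flags

print(add8_with_flags(0xFF, 0x01))  # (0, {'CF': True, 'OF': False, 'SF': False, 'ZF': True})
print(add8_with_flags(0x7F, 0x01))  # (128, {'CF': False, 'OF': True, 'SF': True, 'ZF': False})
```

The first call is -1 + 1 as a signed sum (no overflow) but 255 + 1 as an unsigned sum (carry); the second is 127 + 1, which overflows the signed range.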
Floating-Point Registers
 The IA-32 has a floating-point unit (FPU) that is used expressly for high-speed floating-point arithmetic.
 At one time a separate coprocessor chip was required for this, but beginning with the Intel486, it was integrated into the main processor chip.
 There are eight floating-point data registers in the FPU, named ST(0), ST(1), ST(2), ST(3), ST(4), ST(5), ST(6), and ST(7).
Floating-Point Registers
MMX, XMM Registers
 Eight 64-bit MMX registers
 Eight 128-bit XMM registers for single-
instruction multiple-data (SIMD) operations
