CSC303 Architecture Lecture Note 2019
By
References
1. Introduction.
3. System Buses.
4. Internal Memory.
5. External Memory.
6. Input/Output.
8. Computer Arithmetic.
INTRODUCTION
Computer systems have conventionally been defined through their interfaces at a number of layered
abstraction levels, each providing functional support to its predecessor. Included among the levels
are the application programs, the high- level languages, and the set of machine instructions. Based on
the interface between different levels of the system, a number of computer architectures can be
defined.
The interface between the application programs and a high- level language is referred to as language
architecture. The instruction set architecture defines the interface between the basic machine
instruction set and the runtime and I/O control.
A different definition of computer architecture is built on four basic viewpoints. These are the
structure, the organization, the implementation, and the performance.
In this definition:
The structure defines the interconnection of various hardware components.
The organization defines the dynamic interplay and management of the various components.
The implementation defines the detailed design of hardware components, and
The Performance specifies the behavior of the computer system.
A computer system, like any system, consists of an interrelated set of components. The system is best
characterized in terms of structure - the way in which components are interconnected - and function -
the operation of the individual components. Furthermore, a computer's organization is hierarchic.
Each major component can be further described by decomposing it into its major subcomponents and
describing their structure and function. For clarity and ease of understanding, this hierarchical
organization is described in this Lecture from the top down:
• Computer system: Major components are processor, memory, I/O.
• Processor: Major components are control unit, registers, ALU, and instruction execution unit.
• Control unit: Major components are control memory, microinstruction sequencing logic, and
registers.
Throughout the discussion, aspects of the system are viewed from the points of view of both
architecture (those attributes of a system visible to a machine language programmer) and organization
(the operational units and their interconnections that realize the architecture).
Organization and Architecture
• Computer Architecture refers to those attributes of a system that have a direct impact on the
logical execution of a program. Examples:
o The instruction set
o The number of bits used to represent various data types
o I/O mechanisms
o Memory addressing techniques
• Computer Organization refers to the operational units and their interconnections that realize
the architectural specifications. Examples are things that are transparent to the programmer:
o control signals
o interfaces between computer and peripherals
o the memory technology being used
• So, for example, the fact that a multiply instruction is available is a computer architecture
issue. How that multiply is implemented is a computer organization issue.
I. Harvard and von Neumann Architectures, which differ based on the way the CPU accesses
main memory.
• Harvard Architecture
The Harvard architecture is a computer architecture with physically separate storage (memory) and
signal pathways for instructions and data
In contrast with the Harvard Architecture, the von Neumann machine uses the same memory for both
instructions and data.
All contemporary computer designs are based on concepts developed by John von Neumann at the
Institute for Advanced Study, Princeton. Such a design is referred to as the von Neumann
architecture and is based on three key concepts (listed under Computer Components later in these notes):
Function
• A functional view of the computer
• Basic functions that a computer can perform:
o Data Processing - a wide variety of forms, but only a few fundamental methods or types
o Data Storage - long-term or short, temporary storage
o Data Movement
▪ Input/Output - when data are received from or delivered to a peripheral, a device
connected directly to the computer
▪ Data Communications - when data is moved over longer distances, to or from a remote
device
• Control - of the above functions, by instructions provided by the user of the computer (i.e.
their programs)
Four (4) possible types of operations with this basic structure:
Device for Moving Data (transferring data from one peripheral or communication line to another)
Device for Storing Data (data transferred from the external environment to storage, and vice versa)
Device for Processing Data in Storage
Device for Processing Data En-route Between the Outside World and Storage
Structure
• Simplest possible view of a computer:
o Storage
o Processing
o Peripherals
o Communication Lines
• Internal Structure of the Computer Itself:
o Central Processing Unit (CPU): Controls the operation of the computer and performs its
data processing functions. Often simply referred to as processor.
o Main Memory: Stores data.
o I/O: Moves data between the computer and its external environment.
o System Interconnection: Some mechanism that provides for communication among CPU,
main memory, and I/O.
• Main Structural components of the CPU:
o Control Unit: Controls the operation of the CPU and hence the computer.
o Arithmetic and Logic Unit (ALU): Performs the computer's data processing functions.
o Registers: Provides storage internal to the CPU.
o CPU Interconnection: Some mechanism that provides for communication among the
control unit, ALU, and registers.
A Brief History of Computers
• 1946: Princeton Institute for Advanced Studies (IAS) computer
o Prototype for all subsequent general-purpose computers. With rare exceptions, all of
today’s computers have this same general structure, and are thus referred to as von
Neumann machines.
Figure 2.2 Expanded Structure of IAS Computer
THE COMPUTER SYSTEM
System Buses
Interconnecting Basic Components
Computer Components
• The von Neumann architecture is based on three key concepts:
o Data and instructions are stored in a single read-write memory
o The contents of this memory are addressable by location, without regard to the type of
data contained there
o Execution occurs in a sequential fashion (unless explicitly modified) from one instruction
to the next
• Two approaches to programming
o Hardwired Programming - Constructing a configuration of hardware logic components
to perform a particular set of arithmetic and logic operations on a set of data
o Software Programming - A sequence of codes or instructions, each of which supplies the
necessary control signals to a general-purpose configuration of control and logic functions
(which may themselves be hardwired programs)
• Other components needed
o I/O Components - a means to:
▪ Accept data and instructions in some form, and convert to an internal form of signals
▪ Report results
o Main Memory
▪ Distinguished from external storage/peripherals
▪ A place to temporarily store both:
• Instructions - Data interpreted as codes for generating control signals
• Data - Data upon which computations are performed
▪ Interactions among Computer Components
o Memory Address Register – specifies address for next read or write
o Memory Buffer Register – contains data to be written into or receives data read from
memory
o I/O address register - Specifies a particular I/O device
o I/O buffer register - used for exchange of data between an I/O module and CPU (or
memory)
o Memory Module - a set of locations
▪ With sequentially numbered addresses
▪ Each holds a binary number that can be either an instruction or data
Computer Function
• Processing required for a single instruction is called an instruction cycle
• Simple POV (Point-Of-View): 2 steps - the processor fetches an instruction from memory, then executes it
• The interconnection structure must support the following types of transfers:
o Memory to CPU
o CPU to Memory
o I/O to CPU
o CPU to I/O
o I/O to or from Memory - using Direct Memory Access (DMA)
Bus Interconnection
• A bus is a shared transmission medium
o Must only be used by one device at a time
o When used to connect major computer components (CPU, memory, I/O) is called a system
bus
• Three functional groups of communication lines
o Data lines (data bus) - move data between system modules
▪ Width is a key factor in determining overall system performance
o Address lines - designate source or destination of data on the data bus
▪ Width determines the maximum possible memory capacity of the system (may be a
multiple of width)
▪ Also used to address I/O ports. Typically:
➢ high-order bits select a particular module
➢ lower-order bits select a memory location or I/O port within the module
o Control lines - control access to and use of the data and address lines. Typical control
lines include:
▪ Memory write: causes data on the bus to be written into the addressed location
▪ Memory read: causes data from the addressed location to be placed on the bus
▪ I/O write: causes data on the bus to be output to the addressed I/O port
▪ I/O read: causes data from the addressed I/O port to be placed on the bus
▪ Transfer ACK: indicates that data have been accepted from or placed on the bus
▪ Bus request: indicates that a module needs to gain control of the bus
▪ Bus grant: indicates that a requesting module has been granted control of the bus
▪ Interrupt request: indicates that an interrupt is pending
▪ Interrupt ACK: acknowledges that the pending interrupt has been recognized
▪ Clock: is used to synchronize operations
▪ Reset: initializes all modules.
• If one module wishes to send data to another, it must:
o Obtain use of the bus
o Transfer data via the bus
• If one module wishes to request data from another, it must:
o Obtain use of the bus
o Transfer a request to the other module over control and address lines
o Wait for second module to send data
• Typical physical arrangement of a system bus
o A number of parallel electrical conductors
o Each system component (usually on one or more boards) taps into some or all of the bus
lines (usually with a slotted connector)
o System can be expanded by adding more boards
o A bad component can be replaced by replacing the board where it resides
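Note: the address-line split described above (high-order bits selecting a module, low-order bits selecting a location or port within it) can be sketched in a few lines of Python. The 16-bit address and the 4-bit module field below are illustrative assumptions, not taken from any particular bus.

```python
# Illustrative only: assume a 16-bit address bus where the top 4 bits
# select a module and the remaining 12 bits select a location within it.
ADDR_BITS = 16
MODULE_BITS = 4
OFFSET_BITS = ADDR_BITS - MODULE_BITS

def decode(address):
    """Split a bus address into (module number, offset within that module)."""
    module = address >> OFFSET_BITS               # high-order bits
    offset = address & ((1 << OFFSET_BITS) - 1)   # low-order bits
    return module, offset

if __name__ == "__main__":
    addr = 0xA123
    module, offset = decode(addr)
    print(f"address {addr:#06x} -> module {module}, offset {offset:#05x}")
```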
Multiple Bus Hierarchies
• A great number of devices on a bus will cause performance to suffer
o Propagation delay - the time it takes for devices to coordinate the use of the bus
o The bus may become a bottleneck as the aggregate data transfer demand approaches the
capacity of the bus (in available transfer cycles/second)
• Traditional Hierarchical Bus Architecture
o Use of a cache structure insulates CPU from frequent accesses to main memory
o Main memory can be moved off local bus to a system bus
o Expansion bus interface
▪ buffers data transfers between system bus and I/O controllers on expansion bus
▪ insulates memory-to-processor traffic from I/O traffic
• Traditional Hierarchical Bus Architecture Example
• High-performance Hierarchical Bus Architecture
o Traditional hierarchical bus breaks down as higher and higher performance is seen in the
I/O devices
o Incorporates a high-speed bus
▪ Specifically designed to support high-capacity I/O devices
▪ Brings high-demand devices into closer integration with the processor and at the same
time is independent of the processor
▪ Changes in processor architecture do not affect the high-speed bus, and vice versa
o Sometimes known as a mezzanine architecture
• High-performance Hierarchical Bus Architecture Example
o Asynchronous Timing
▪ The occurrence of one event on a bus follows and depends on the occurrence of a previous
event
▪ Allows system to take advantage of advances in device performance by having a mixture of
slow and fast devices, using older and newer technology, sharing the same bus
▪ BUT harder to implement and test than synchronous timing
• Bus Width
o Data bus: wider = better performance
o Address bus: wider = more locations can be referenced
• Data Transfer Type
o All buses must support write (master to slave) and read (slave to master) transfers
• Combination operations
o Read-modify-write
▪ A read followed immediately by a write to the same address.
▪ Address is only broadcast once, at the beginning of the operation
▪ Indivisible, to prevent access to the data element by other potential bus masters
▪ Principal purpose is to protect shared memory in a multiprogramming system
o Read-after-write - indivisible operation consisting of a write followed immediately by a
read from the same address (for error checking purposes)
• Block data transfer
o one address cycle followed by n data cycles
o first data item to or from specified address
o remaining data items to or from subsequent addresses
PCI
• PCI = Peripheral Component Interconnect
o High-bandwidth
o Processor independent
o Can function as a mezzanine or peripheral bus
• Current Standard
o up to 64 data lines at 33 MHz
o requires few chips to implement
o supports other buses attached to PCI bus
o public domain, initially developed by Intel to support Pentium-based systems
o supports a variety of microprocessor-based configurations, including multiple processors
o uses synchronous timing and centralized arbitration
• Typical Desktop System
Note: Bridge acts as a data buffer so that the speed of the PCI bus may differ from that of the
processor’s I/O capability.
• Typical Server System
Note: In a multiprocessor system, one or more PCI configurations may be connected by bridges to
the processor’s system bus.
• Bus Structure
o 50 mandatory signal lines, divided into the following groups:
▪ System Pins - includes clock and reset
▪ Address and Data Pins - 32 time-multiplexed lines for addresses and data, plus lines
to interpret and validate these
▪ Interface Control Pins - control timing of transactions and provide coordination among
initiators and targets
▪ Arbitration Pins - not shared, each PCI master has its own pair to connect to PCI bus
arbiter
▪ Error Reporting Pins - for parity and other errors
o 50 optional signal lines, divided into the following groups:
▪ Interrupt Pins - not shared, each PCI device has its own interrupt line or lines to an
interrupt controller
▪ Cache Support Pins
▪ 64-bit Bus Extension Pins - 32 additional time-multiplexed lines for addresses and
data, plus lines to interpret and validate these, and to provide agreement between two
PCI devices on use of these
▪ JTAG/Boundary Scan Pins - support testing procedures defined in IEEE Standard 1149.1
• PCI Commands
o issued by the initiator (the master) to the target (the slave)
o Use the C/BE lines
o Types
- Interrupt Acknowledge
- Special Cycle
- I/O Read
- I/O Write
- Memory Read
- Memory Read Line
- Memory Read Multiple
- Memory Write
- Memory Write and Invalidate
- Configuration Read
- Configuration Write
- Dual Address Cycle
THE COMPUTER SYSTEM
MEMORY
Internal Memory
Characteristics of Computer Memory Systems
• Location
o CPU (registers and L1 cache)
o Internal Memory (main)
o External (secondary)
• Capacity
o Word Size - typically equal to the number of bits used to represent a number and to the
instruction length.
o Number of Words - has to do with the number of addressable units (which are typically
words, but are sometimes bytes, regardless of word size). For addresses of length A (in
bits), the number of addressable units is 2^A (see the short example at the end of this list).
• Unit of Transfer
o Word
o Block
• Access Method
o Sequential Access
▪ information used to separate or identify records is stored with the records
▪ access must be made in a specific linear sequence
▪ the time to access an arbitrary record is highly variable
o Direct Access
▪ individual blocks or records have an address based on physical location
▪ access is by direct access to general vicinity of desired information, then some search
▪ access time is still variable, but not as much as sequential access
o Random Access
▪ each addressable location has a unique, physical location
▪ access is by direct access to desired location
▪ access time is constant and independent of prior accesses
o Associative
▪ desired units of information are retrieved by comparing a sub-part of the unit with a
desired mask -- location is not needed
▪ access time is constant and independent of prior accesses
▪ most useful for searching - a search through N possible locations would take O(N)
with Random Access Memory, but O(1) with Associative Memory
• Performance
o Access Time
o Memory Cycle Time - primarily for random-access memory = access time + additiona l
time required before a second access can begin (refresh time, for example)
o Transfer Rate
▪ Generally measured in bits/second
▪ Inversely proportional to memory cycle time for random access memory
• Physical Type
o Most common - semiconductor and magnetic surface memories
o Others - optical, bubble, mechanical (e.g., paper tape), core, etc.
• Physical Characteristics
o volatile - information decays or is lost when power is lost
o non-volatile - information remains without deterioration until changed -- no electrical
power needed
o non-erasable
▪ information cannot be altered with a normal memory access cycle. As a practical
matter, must be non-volatile
• Organization - the physical arrangement of bits to form words.
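Note: to make the Capacity item above concrete, the following small Python sketch relates address width A to the number of addressable units (2^A); the address widths shown are illustrative examples only.

```python
# Illustrative: with an A-bit address, the number of addressable units is 2**A.
def addressable_units(address_bits):
    return 2 ** address_bits

if __name__ == "__main__":
    for a in (16, 20, 32):
        print(f"A = {a:2d} address bits -> 2**{a} = {addressable_units(a):,} addressable units")
```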
The Memory Hierarchy
• Design Constraints
o How much? “If you build it, they will come.” Applications tend to be built to use any
commonly available amount, so question is open-ended.
o How fast? Must be able to keep up with the CPU -- don’t want to waste cycles waiting for
instructions or operands.
o How expensive? Cost of memory (also associated with “How much?”) must be reasonable
vs. other component costs.
• There are trade-offs between the 3 key characteristics of memory (cost, capacity, and access
time) which yield the following relationships:
o Smaller access time -> greater cost per bit
o Greater capacity -> smaller cost per bit
o Greater capacity -> greater access time
• Contemporary Memory Hierarchy
o Magnetic Tape
o Optical/Magnetic Disk
o Disk Cache
o Main Memory
o Cache
o Registers
Semiconductor or Main Memory
• Types of Random-Access Semiconductor Memory
o RAM - Random Access Memory
▪ possible both to read data from the memory and to easily and rapidly write new data
into the memory
▪ volatile - can only be used for temporary storage (all the other types of random-access
memory are non-volatile)
▪ Types:
✓ Dynamic - stores data as charge on capacitors
➢ charges tend to discharge over time
➢ require a periodic refresh operation (similar to a memory reference)
➢ more dense and less expensive than comparable static RAMs
✓ Static - stores data in traditional flip-flop logic gates
➢ no refresh needed
➢ generally faster than dynamic RAMs
o ROM - Read Only Memory
▪ contains a permanent pattern of data which cannot be changed
▪ data is actually wired-in to the chip as part of the fabrication process
▪ cheaper for high-volume production
o PROM - Programmable Read Only Memory
▪ writing process is performed electrically
▪ may be written after chip fabrication
✓ writing uses different electronics than normal memory writes
o EPROM - Erasable Programmable Read Only Memory
▪ read and written electrically, as with PROM
▪ before a write, all cells must be erased by exposure to UV radiation (erasure takes
about 20 minutes)
✓ writing uses different electronics than normal memory writes
✓ errors can be corrected by erasing and starting over
▪ more expensive than PROM
o EEPROM - Electrically Erasable Programmable Read Only Memory
▪ byte-level writing - any part(s) of the memory can be written at any time
▪ updateable in place - writing uses ordinary bus control, address, and data lines
▪ writing takes much longer than reading
▪ more expensive (per bit) and less dense than EPROM
o Flash Memory
▪ uses electrical erasing technology
▪ allows individual blocks to be erased, but not byte-level erasure; modern flash
memory is updateable in place (some devices may function more like I/O modules)
▪ much faster erasure than EPROM
▪ same density as EPROM
▪ Sometimes refers to other devices, such as battery-backed RAM and tiny hard-
disk drives which behave like flash memory for all intents and purposes.
• Organization
o Typical organization
▪ bits read/written at a time
▪ Logically 4 square arrays of 2048x2048 cells
▪ Horizontal lines connect to Select terminals
▪ Vertical lines connect to Data-In/Sense terminals
▪ Multiple DRAMs must connect to memory controller to read/write an 8 bit word
▪ Illustrates why successive generations grow by a factor of 4 -- each extra pin
devoted to addressing doubles the number of rows and columns
• Chip Packaging
o Typical Pin outs
▪ A0-An: Address of word being accessed; with row/column multiplexing, n address
pins can carry a 2n-bit address
▪ D0-Dn: Data in/out for n bits
▪ Vcc: Power supply
▪ Vss: Ground
▪ CE: Chip enable - allows several chips to use same circuits for everything else, but
only have one chip use them
▪ Vpp: Program Voltage - used for writes to (programming) an EPROM
▪ RAS: Row Address Select
▪ CAS: Column Address Select
▪ W or WE: Write enable
▪ OE: Output enable
• Error Correction Principles
o Hard Failure
▪ A permanent defect
▪ Causes same result all the time, or randomly fluctuating results
o Soft Error - A random, nondestructive event that alters the contents of one or more
memory cells, without damaging the memory. Caused by:
▪ Power supply problems
▪ Alpha particles
o Detection and Correction
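Note: the detection-and-correction item above is typically illustrated with a single-error-correcting Hamming code, of the kind RAID Level 2 also uses. The sketch below is a minimal, generic Hamming(7,4) encoder and corrector in Python; it is a textbook construction for illustration, not a description of any specific memory controller.

```python
# Minimal Hamming(7,4) single-error-correcting code (illustrative sketch).
# Bit positions 1..7; positions 1, 2 and 4 hold even-parity check bits.

def encode(data4):
    """data4: list of 4 data bits -> list of 7 code bits (positions 1..7)."""
    c = [0] * 8                       # index 0 unused
    c[3], c[5], c[6], c[7] = data4
    c[1] = c[3] ^ c[5] ^ c[7]         # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]         # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]         # covers positions 4, 5, 6, 7
    return c[1:]

def correct(code7):
    """Recompute the check bits; the syndrome points at the bit in error."""
    c = [0] + list(code7)
    syndrome = 0
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i & p:
                parity ^= c[i]
        if parity:
            syndrome += p
    if syndrome:                      # non-zero syndrome = position of the flipped bit
        c[syndrome] ^= 1
    return [c[3], c[5], c[6], c[7]]

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    sent = encode(word)
    sent[4] ^= 1                      # soft error: flip one stored bit
    print("recovered:", correct(sent), "original:", word)
```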
Cache Memory
• Principles
o Intended to give memory speed approaching that of fastest memories available but with
large size, at close to price of slower memories.
o Cache is checked first for all memory references.
o If not found, the entire block in which that reference resides in main memory is stored in
a cache slot, called a line.
o Each line includes a tag (usually a portion of the main memory address) which identifies
which particular block is being stored.
o The proportion of memory references, which are found already stored in cache, is called
the hit ratio.
• Elements of Cache Design
o Cache Size
▪ Small enough that overall average cost/bit is close to that of main memory alone
▪ large enough so that overall average access time is close to that of cache alone
▪ large caches tend to be slightly slower than small ones
▪ studies indicate that 1K-512K words is optimum cache size
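Note: the hit ratio defined under Principles above determines the average (effective) access time seen by the CPU. The short Python sketch below uses the standard weighted-average formula; the 10 ns cache and 100 ns main-memory times are assumed example values, not figures from these notes.

```python
# Illustrative: effective access time as a function of cache hit ratio H.
# Assumed timings: cache access T1 = 10 ns, main-memory access T2 = 100 ns.
def effective_access_time(hit_ratio, t_cache_ns=10, t_memory_ns=100):
    """On a hit: just the cache time. On a miss: cache check plus main-memory access."""
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * (t_cache_ns + t_memory_ns)

if __name__ == "__main__":
    for h in (0.50, 0.90, 0.99):
        print(f"hit ratio {h:.2f} -> average access {effective_access_time(h):.1f} ns")
```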
THE COMPUTER SYSTEM
External Memory
Magnetic Disk
A disk is a circular platter constructed of nonmagnetic material, called the substrate, coated with a
magnetizable material. Traditionally, the substrate has been an aluminum or aluminum alloy material;
more recently, glass substrates have been introduced. The glass substrate has a number of benefits, including the following:
• Improvement in the uniformity of the magnetic film surface to increase disk reliability
• A significant reduction in overall surface defects to help reduce read-write errors
• Better stiffness to reduce disk dynamics
• Greater ability to withstand shock and damage
Magnetic Read and Write Mechanisms
• Data are recorded on and later retrieved from the disk via a conducting coil named the head
• During a read or write operation, the head is stationary while the platter rotates beneath it
• The write mechanism exploits the fact that electricity flowing through a coil produces a
magnetic field. Electric pulses are sent to the write head, and the resulting magnetic patterns
are recorded on the surface below, with different patterns for positive and negative currents.
Physical Characteristics
• In a fixed-head disk, there is one read-write head per track. All of the heads are mounted on
a rigid arm that extends across all tracks; such systems are rare today.
• In a movable-head disk, there is only one read-write head. Again, the head is mounted on
an arm.
The disk itself is mounted in a disk drive, which consists of the arm, a spindle that rotates the
disk, and the electronics needed for input and output of binary data.
• A nonremovable disk is permanently mounted in the disk drive; the hard disk in a
personal computer is a nonremovable disk.
• A removable disk can be removed and replaced with another disk.
RAID
Level 0 (Striped, Nonredundant)
• User data are distributed (striped) across all of the disks in the array; no redundant data are stored
• If a single I/O request consists of multiple contiguous strips, up to n strips can be handled in
parallel, greatly reducing I/O transfer time.
Level 1 (Mirrored)
• Only level where redundancy is achieved by simply duplicating all the data
• Data striping is used as in RAID 0, but each logical strip is mapped to two separate physical
disks
• A read request can be serviced by whichever of the two disks involves less seek plus rotational latency
• Write requests require updating 2 disks, but both can be updated in parallel, so no penalty
• When a drive fails, data may be accessed from other drive
• High cost for high performance
o Usually used only for highly critical data.
o Best performance when requests are mostly reads
Level 2 (Redundancy through Hamming Code)
• Uses parallel access – all member disks participate in every I/O request
• Uses small strips, often as small as a single byte or word
• An error-correcting code (usually Hamming) is calculated across corresponding bits on each
data disk, and the bits of the code are stored in the corresponding bit positions on multiple
parity disks.
• Useful in an environment where a lot of disk errors are expected
o Usually expensive overkill.
o Disks are so reliable that this is never implemented
Level 3 (Bit-Interleaved Parity)
• Uses parallel access – all member disks participate in every I/O request
• Uses small strips, often as small as a single byte or word
• Uses only a single parity disk, no matter how large the disk array
o A simple parity bit is calculated and stored
o In the event of a failure in one disk, the data on that disk can be reconstructed from the
data on the others
o Until the bad disk is replaced, data can still be accessed (at a performance penalty) in
reduced mode
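Note: the reconstruction described for Level 3 relies on the parity strip being the bit-wise XOR of the corresponding data strips, so any single lost strip equals the XOR of the surviving strips plus parity. A minimal Python sketch (the two-byte strips are made-up example data, not tied to any real controller):

```python
# Illustrative RAID-3 style parity: parity byte = XOR of corresponding data bytes.
from functools import reduce

def xor_strips(strips):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

if __name__ == "__main__":
    disks = [b"\x10\x22", b"\x0f\x31", b"\xa5\x5a"]   # three data disks
    parity = xor_strips(disks)                        # stored on the parity disk
    lost = disks[1]                                   # pretend disk 1 fails
    rebuilt = xor_strips([disks[0], disks[2], parity])
    assert rebuilt == lost                            # reconstructed from the others
    print("rebuilt strip:", rebuilt.hex())
```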
Level 4 (Block-Level Parity)
• Uses independent access – each disk operates independently, so separate I/O requests can be satisfied in parallel
• Uses relatively large strips; a bit-by-bit parity strip is calculated across corresponding strips
on each data disk and stored on a dedicated parity disk
• Every write must also update the parity disk, which can become a bottleneck
Level 5 (Block-Level Distributed Parity)
• Like Level 4, but distributes parity strips across all disks, removing the parity bottleneck
Level 6 (Dual Redundancy)
• Like Level 5, but provides 2 parity strips for each stripe, allowing recovery from 2
simultaneous disk failures.
SOLID STATE DRIVES
• One of the most significant developments in computer architecture in recent years is the
increasing use of solid state drives (SSDs) to complement or even replace hard disk drives
(HDDs), both as internal and external secondary memory.
• The term solid state refers to electronic circuitry built with semiconductors.
• A solid state drive is a memory device made with solid state components that can be used as
a replacement for a hard disk drive.
Flash Memory
Flash memory is a type of semiconductor memory that has been around for a number of years and is
used in many consumer electronic products, including smart phones, GPS devices, MP3 players,
digital cameras, and USB devices.
• In a flash memory cell, a second gate—called a floating gate, because it is insulated by a thin
oxide layer—is added to the transistor.
• Initially, the floating gate does not interfere with the operation of the transistor; in this state,
the cell is deemed to represent binary 1.
• Applying a large voltage across the oxide layer causes electrons to tunnel through it and
become trapped on the floating gate, where they remain even if the power is disconnected; in
this state, the cell is deemed to represent binary 0.
There are two distinctive types of flash memory, designated as NOR and NAND.
• In NOR flash memory, the basic unit of access is a bit, and the logical organization resembles
a NOR logic device.
• For NAND flash memory, the basic unit is 16 or 32 bits, and the logical organization
resembles NAND devices.
• NOR flash memory provides high-speed random access. It can read and write data to specific
locations, and can reference and retrieve a single byte.
• NOR flash memory is used to store cell phone operating system code and on
Windows computers for the BIOS program that runs at startup.
• NAND reads and writes in small blocks. It is used in USB flash drives, memory cards (in
digital cameras, MP3 players, etc.), and in SSDs
OPTICAL MEMORY
• In 1983, one of the most successful consumer products of all time was introduced, the compact
disk (CD) digital audio system.
• The CD is a nonerasable disk that can store more than 60 minutes of audio information on
one side.
CD Operation
The disk is formed from a resin, such as polycarbonate. Digitally recorded information (either
music or computer data) is imprinted as a series of microscopic pits on the surface of the
polycarbonate. This is done, first of all, with a finely focused, high intensity laser to create a
master disk. The master is used, in turn, to make a die to stamp out copies onto polycarbonate.
The pitted surface is then coated with a highly reflective surface, usually aluminum or gold. This
shiny surface is protected against dust and scratches by a top coat of clear acrylic. Finally, a label
can be silkscreened onto the acrylic.
• Data on the CD-ROM are organized as a sequence of blocks. A typical block format is shown
in Figure 6.13. It consists of the following fields:
• Sync: The sync field identifies the beginning of a block. It consists of a byte of all 0s, 10 bytes
of all 1s, and a byte of all 0s.
• Header: The header contains the block address and the mode byte. Mode 0 specifies a blank
data field; mode 1 specifies the use of an error-correcting code and 2048 bytes of data; mode
2 specifies 2336 bytes of user data with no error-correcting code.
• Data: User data.
• Auxiliary: Additional user data in mode 2. In mode 1, this is a 288-byte error correcting code.
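Note: summing the mode 1 fields above gives the commonly quoted 2352-byte physical block. The 12-byte sync size follows from the description above (1 + 10 + 1 bytes); the 4-byte header (3 address bytes plus the mode byte) is an assumed detail not spelled out in the list.

```python
# Mode 1 CD-ROM block: sync (12) + header (4) + user data (2048) + ECC (288) bytes.
SYNC, HEADER, DATA, ECC = 12, 4, 2048, 288
print(SYNC + HEADER + DATA + ECC)   # 2352 bytes per physical block
```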
MAGNETIC TAPE
Tape systems use the same reading and recording techniques as disk systems. The medium is
flexible polyester (similar to that used in some clothing) tape coated with magnetizable material.
The coating may consist of particles of pure metal in special binders or vapor-plated metal films.
• The tape and the tape drive are analogous to a home tape recorder system.
• Tape widths vary from 0.38 cm (0.15 inch) to 1.27 cm (0.5 inch).
• Tapes used to be packaged as open reels that have to be threaded through a second spindle
for use.
• Data on the tape are structured as a number of parallel tracks running lengthwise.
• Earlier tape systems typically used nine tracks. This made it possible to store data one
byte at a time, with an additional parity bit as the ninth track. This was followed by tape
systems using 18 or 36 tracks, corresponding to a digital word or double word. Recording
data simultaneously across the tracks in this way is referred to as parallel recording, while
most modern systems instead use serial recording.
• The typical recording technique used in serial tapes is referred to as serpentine recording.
• Data are still recorded serially along individual tracks, but blocks in sequence are stored
on adjacent tracks, as suggested by Figure 6.16b.
• A tape drive is a sequential-access device. If the tape head is positioned at record 1, then
to read record N, it is necessary to read physical records 1 through N - 1, one at a time. If
the head is currently positioned beyond the desired record, it is necessary to rewind the
tape a certain distance and begin reading forward.
INPUT/OUTPUT (I/O)
Introduction
• Why not connect peripherals directly to system bus?
o Wide variety of peripherals with various operating methods
o Data transfer rate of peripherals is often much slower/faster than memory or CPU
o Different data formats and word lengths than used by computer
• Major functions of an I/O module
o Interface to CPU and memory via system bus or central switch
o Interface to one or more peripheral devices by tailored data links
External Devices
• External devices, often called peripheral devices or just peripherals, make computer systems
useful.
• Three broad categories of external devices:
o Human-Readable (ex. terminals, printers)
o Machine-Readable (ex. disks, sensors)
o Communication (ex. modems, NIC’s)
• Basic structure of an external device:
o Data - bits sent to or received from the I/O module
o Control signals - determine the function that the device will perform
o Status signals - indicate the state of the device (esp. READY/NOT-READY)
o Control logic - interprets commands from the I/O module to operate the device
o Transducer - converts data from computer-suitable electrical signals to the form of energy
used by the external device.
o Buffer - temporarily holds data being transferred between I/O module and the external
device.
I/O Modules
• An I/O Module is the entity within a computer responsible for:
o control of one or more external devices
o Exchange of data between those devices and main memory and/or CPU registers
• It must have two interfaces:
o Internal, to CPU and main memory
o External, to the device(s)
• Major function/requirement categories
o Control and Timing
▪ Coordinates the flow of traffic between internal resources and external devices
▪ Cooperation with bus arbitration
o CPU Communication
▪ Command Decoding
▪ Data
▪ Status Reporting
▪ Address Recognition.
o Device Communication (see diagram under External Devices)
▪ Commands
▪ Status Information
▪ Data
o Data Buffering
▪ Rate of data transfer to/from CPU is orders of magnitude faster than to/from external
devices
▪ I/O module buffers data so that peripheral can send/receive at its rate, and CPU can
send/receive at its rate
o Error Detection
▪ Must detect and correct or report errors that occur
▪ Types of errors
▪ Mechanical/electrical malfunctions
▪ Data errors during transmission
• I/O Module Structure
Programmed I/O
• With programmed I/O, data is exchanged under complete control of the CPU
o CPU encounters an I/O instruction
o CPU issues a command to appropriate I/O module
o I/O module performs requested action and sets I/O status register bits
o CPU must wait, and periodically check I/O module status until it finds that the operation
is complete
• To execute an I/O instruction, the CPU issues:
o An address, specifying I/O module and external device
o A command, 4 types:
▪ Control - activate a peripheral and tell it what to do
▪ Test - querying the state of the module or one of its external devices
▪ Read - obtain an item of data from the peripheral and place it in an internal buffer (data
register from preceding illustration)
▪ Write - take an item of data from the data bus and transmit it to the peripheral
• Two modes of addressing are possible:
o Memory-mapped I/O
o Isolated I/O
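Note: a minimal sketch of the programmed-I/O busy-wait described above, written as Python pseudocode against a hypothetical io_module object. The names issue_command, read_status and read_data, and the READY/ERROR status bits, are assumed for illustration; they are not a real device API.

```python
# Illustrative programmed I/O: the CPU busy-waits on the module's status register.
READY = 0x01      # assumed status bit meaning "operation complete"
ERROR = 0x80      # assumed status bit meaning "device error"

def read_word_programmed_io(io_module, device_addr):
    io_module.issue_command(device_addr, "READ")       # hypothetical module interface
    while True:                                        # CPU is tied up polling
        status = io_module.read_status(device_addr)
        if status & ERROR:
            raise IOError("device reported an error")
        if status & READY:
            return io_module.read_data(device_addr)    # data register -> CPU register
```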
Interrupt-Driven I/O
• The problem with programmed I/O is that the CPU has to wait for the I/O module to be ready for
either reception or transmission of data, repeatedly taking time to query its status.
• Interrupt-driven I/O is an alternative
o It allows the CPU to go back to doing useful work after issuing an I/O command.
o When the command is completed, the I/O module will signal the CPU that it is ready with
an interrupt.
Direct Memory Access
• Drawbacks of Programmed and Interrupt-Driven I/O
o The I/O transfer rate is limited by the speed with which the CPU can test and service a
device
o The CPU is tied up in managing an I/O transfer; a number of instructions must be executed
for each I/O transfer
• DMA Function
o When CPU wishes to read or write a block of data it issues a command to the DMA
module containing:
▪ Whether a read or write is requested
▪ The address of the I/O device involved
▪ The starting location in memory to read from or write to
▪ The number of words to be read or written
o CPU continues with other work
o DMA module handles entire operation. When memory has been modified as ordered, it
interrupts the CPU
o CPU is only involved at beginning and end of the transfer
o DMA module can force CPU to suspend operation while it transfers a word
▪ called cycle stealing
▪ not an interrupt, just a wait state
▪ slows operation of CPU, but not as badly as non-DMA
• Possible DMA Configurations
o Single Bus, Detached DMA
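Note: the four items the CPU hands to the DMA module (listed under DMA Function above) can be pictured as a small command block. The field names in this Python sketch are invented for illustration only.

```python
# Illustrative DMA command block: the four fields the CPU supplies before resuming work.
from dataclasses import dataclass

@dataclass
class DmaCommand:
    is_read: bool          # read (device to memory) or write (memory to device)
    device_address: int    # which I/O device is involved
    memory_address: int    # starting location in memory
    word_count: int        # number of words to transfer

# The CPU fills one of these, hands it to the DMA module, and carries on with other work;
# the module interrupts the CPU only after word_count words have been moved.
cmd = DmaCommand(is_read=True, device_address=0x3F0, memory_address=0x2000, word_count=512)
print(cmd)
```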
THE COMPUTER SYSTEM
• Process States - for short-term scheduling, a process is understood to be in one of 5 basic
states
o New - admitted by the high-level scheduler, but not yet ready to execute
o Ready - needs only the CPU
o Running - currently executing in the CPU
o Waiting - suspended from execution, waiting for some system resource
o Halted - the process has terminated and will be destroyed by the operating system
Integer Representation
• Sign-Magnitude Representation
o Leftmost bit is sign bit: 0 for positive, 1 for negative
o Remaining bits are magnitude
o Drawbacks
▪ Addition and subtraction must consider both the signs and relative magnitudes -- more
complex
▪ Testing for zero must consider two possible zero representations
• Two’s Complement Representation
o Leftmost bit still indicates sign
o Positive numbers exactly same as sign-magnitude
o Zero has only one representation: all zeros (treated as positive)
o Negative numbers found by taking 2’s complement
▪ Take complement of positive version
▪ Add 1
Integer Arithmetic (8.3)
• 2's complement examples (with 8-bit numbers); a short code sketch after the Division example below checks these
o Getting -55
▪ Start with +55: 00110111
▪ Complement that: 11001000
▪ Add 1: +00000001
▪ Total is -55: 11001001
o Negating -55
▪ Complement -55: 00110110
▪ Add 1: +00000001
▪ Total is 55 (see top): 00110111
o Adding -55 + 58
▪ Start with -55: 11001001
▪ Add 58: +00111010
▪ Result is 3: 00000011
▪ The carry out of the leftmost (sign) bit is ignored
• Overflow Rule - if two numbers are added, and they are both positive or both negative, then
overflow occurs if and only if the result has the opposite sign
• Converting between different bit lengths
o Move sign bit to new leftmost position
o Fill in with copies of the sign bit
o Examples (8 bit -> 16 bit)
▪ +18: 00010010 -> 0000000000010010
▪ -18: 11101110 -> 1111111111101110
• Multiplication
o Repeated Addition
o Unsigned Integers
▪ Generating partial products, shifting, and adding
▪ Just like longhand multiplication
• Two’s Complement Multiplication
o Straightforward multiplication will not work if either the multiplier or multiplicand are
negative
▪ Multiplicand would have to be padded with sign bit into a 2n-bit partial product, so
that the signs would line up
▪ In a negative multiplier, the 1’s and 0’s would no longer correspond to add-shift’s and
shift-only’s
o Simple solution
▪ Convert both multiplier and multiplicand to positive numbers
▪ Perform multiplication
▪ Take 2’s complement of result if and only if the signs of original numbers were differe nt
▪ Other methods do not require this final transformation step
• Booth’s Algorithm
• Why does Booth’s Algorithm work?
o Consider multiplying some multiplicand M by 30: M * (00011110) which would take 4
shift-adds of M (one for each 1)
o That is the same as multiplying M by (32 - 2): M * (00100000 - 00000010) = M *
(00100000) - M * (00000010) which would take:
▪ 1 shift-only on no transition (imagine last bit was 0)
▪ 1 shift-subtract on the transition from 0 to 1
▪ 3 shift-only’s on no transition
▪ 1 shift-add on the transition from 1 to 0
▪ 2 shift-only’s on no transition
• Division
o Unsigned integers, e.g. 10010011 ÷ 1011 (147 ÷ 11):
                 00001101    Quotient
  Divisor 1011 ) 10010011    Dividend
                   1011
                   001110
                     1011
                   001111
                     1011
                      100    Remainder
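As mentioned above, the following Python sketch verifies the 8-bit two's complement examples and the unsigned division example. The helper names are mine, but the arithmetic follows the rules given in this section.

```python
# Verify the 8-bit two's complement examples and the unsigned division above.
BITS = 8
MASK = (1 << BITS) - 1

def to_twos(value):
    """Encode a Python int as an 8-bit two's complement bit pattern."""
    return value & MASK

def from_twos(pattern):
    """Interpret an 8-bit pattern as a signed value (sign bit has weight -128)."""
    return pattern - (1 << BITS) if pattern & (1 << (BITS - 1)) else pattern

def add(a_pattern, b_pattern):
    """Add two patterns; the carry out of the sign bit is simply dropped."""
    return (a_pattern + b_pattern) & MASK

minus_55 = to_twos(-55)
print(f"-55      -> {minus_55:08b}")                       # 11001001
print(f"-(-55)   -> {(~minus_55 + 1) & MASK:08b}")         # complement, add 1 -> 00110111
result = add(minus_55, to_twos(58))
print(f"-55 + 58 -> {result:08b} = {from_twos(result)}")   # 00000011 = 3

# Unsigned division example: 10010011 / 1011 (147 / 11)
dividend, divisor = 0b10010011, 0b1011
print(f"quotient {dividend // divisor:08b}, remainder {dividend % divisor:03b}")
# quotient 00001101, remainder 100
```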
Floating-Point Representation
• Principles
o Using scientific notation, we can store a floating-point number in 3 parts, ±S × B^(±E):
▪ Sign
▪ Significand (or Mantissa)
▪ Exponent
▪ (The Base stays the same, so need not be stored)
o The sign applies to the significand. Exponents use a biased representation, where a fixed
value called the bias is subtracted from the field to get the actual exponent.
• We require that numbers be normalized, so that the radix point in the significand is always in the
same place
o we will choose just to the right of a leading 0
o format will be ±0.1bbb…b × 2^(±E)
o thus, it is unnecessary to store either that leading 0, or the next 1, since all numbers will
have them
• IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754)
o Facilitates portability of programs from one processor to another
o Defines 32-bit single and 64-bit double formats
o The 2008 revision also defines a 128-bit quadruple format
• A small amount of internal memory, called the registers, is needed by the CPU to hold the
instructions and data it is currently working with
Register Organization
• Registers are at top of the memory hierarchy. They serve two functions:
o User-Visible Registers - Enable the machine- or assembly-language programmer to
minimize main-memory references by optimizing use of registers
o Control and Status Registers - used by the control unit to control the operation of the CPU
and by privileged, OS programs to control the execution of programs
• User-Visible Registers
o Categories of Use
▪ General Purpose
▪ Data
▪ Address
▪ Condition Codes
o Design Issues
▪ Completely general-purpose registers or specialized use?
✓ Specialized registers save bits in instructions because their use can be implicit
✓ General-purpose registers are more flexible
✓ Trend is toward use of specialized registers
▪ Number of registers provided?
✓ More registers require more operand specifier bits in instructions
✓ 8 to 32 registers appears optimum (RISC systems use hundreds, but are a completely
different approach)
▪ Register Length?
✓ Address registers must be long enough to hold the largest address
✓ Data registers should be able to hold values of most data types
✓ Some machines allow two contiguous registers for double-length values
▪ Automatic or manual save of condition codes?
✓ Condition restore is usually automatic upon call return
✓ Saving condition code registers may be automatic upon call instruction, or may be
manual
• Control and Status Registers
o Essential to instruction execution
▪ Program Counter (PC)
▪ Instruction Register (IR)
▪ Memory Address Register (MAR) - usually connected directly to address lines of bus
▪ Memory Buffer Register (MBR) - usually connected directly to data lines of bus
o Program Status Word (PSW) - also essential, common fields or flags contained include:
▪ Sign - sign bit of last arithmetic operation
▪ Zero - set when result of last arithmetic operation is 0
▪ Carry - set if last operation resulted in a carry into or borrow out of a high-order bit
▪ Equal - set if a logical compare result is equality
▪ Overflow - set when last arithmetic operation caused overflow
▪ Interrupt Enable/Disable - used to enable or disable interrupts
▪ Supervisor - indicates if privileged ops can be used
o Other optional registers
▪ Pointer to a block of memory containing additional status info (like process control
blocks)
▪ An interrupt vector
▪ A system stack pointer
▪ A page table pointer
▪ I/O registers
o Design issues
▪ Operating system support in CPU
▪ How to divide allocation of control information between CPU registers and first part
of main memory (usual tradeoffs apply)
• Example Microprocessor Register Organization
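Note: as a small illustration of the PSW flags listed above, the Python sketch below models them as a bit field and sets them the way an ALU would after an 8-bit add. The bit positions are made-up example choices, not those of any real processor.

```python
# Illustrative PSW flag word; bit positions are arbitrary example choices.
from enum import IntFlag

class PSW(IntFlag):
    SIGN = 1 << 0        # sign of last arithmetic result
    ZERO = 1 << 1        # last arithmetic result was 0
    CARRY = 1 << 2       # carry/borrow out of the high-order bit
    OVERFLOW = 1 << 3    # last arithmetic operation overflowed
    INT_EN = 1 << 4      # interrupts enabled
    SUPERVISOR = 1 << 5  # privileged operations allowed

def flags_after_add(a, b, bits=8):
    """Set SIGN/ZERO/CARRY/OVERFLOW as an ALU would for an n-bit add of patterns a and b."""
    mask, sign_bit = (1 << bits) - 1, 1 << (bits - 1)
    result = (a + b) & mask
    flags = PSW(0)
    if result & sign_bit:
        flags |= PSW.SIGN
    if result == 0:
        flags |= PSW.ZERO
    if a + b > mask:
        flags |= PSW.CARRY
    if (a & sign_bit) == (b & sign_bit) != (result & sign_bit):
        flags |= PSW.OVERFLOW     # operands had the same sign, result does not
    return flags

print(flags_after_add(0x7F, 0x01))   # 127 + 1 overflows an 8-bit signed add -> SIGN and OVERFLOW
```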
Instruction Pipelining
• Concept is similar to a manufacturing assembly line
o Products at various stages can be worked on simultaneously
o Also referred to as pipelining, because, as in a pipeline, new inputs are accepted at one end
before previously accepted inputs appear as outputs at the other end
• Consider subdividing instruction processing into two stages:
o Fetch instruction
o Execute instruction
• During execution, there are times when main memory is not being accessed.
• During this time, the next instruction could be fetched and buffered (called instruction
prefetch or fetch overlap).
• If the Fetch and Execute stages were of equal duration, the instruction cycle time would be
halved.
• However, this doubling of execution speed is unlikely because:
o Execution time is generally longer than fetch time (it will also involve reading and storing
operands, in addition to operation execution)
o A conditional branch makes the address of the next instruction to be fetched unknown
(although we can minimize this problem by fetching the next sequential instruction anyway)
• To gain further speedup, the pipeline must have more stages. Consider the following
decomposition of instruction processing:
o Fetch Instruction (FI)
o Decode Instruction (DI) - determine opcode and operand specifiers
o Calculate Operands (CO) - calculate effective address of each source operand
o Fetch Operands (FO)
o Execute Instruction (EI)
o Write Operand (WO)
• Timing diagram, assuming 6 stages of fairly equal duration and no branching
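The timing diagram referred to above is missing from this copy. The small Python sketch below prints the kind of space-time diagram it would show, assuming 6 stages of equal duration, no branching, and 9 instructions; the nk/(k + n - 1) speedup expression at the end is the standard ideal-pipeline formula.

```python
# Print the space-time diagram for an ideal 6-stage pipeline (no stalls, no branches).
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def timing_diagram(n_instructions=9):
    total_cycles = len(STAGES) + n_instructions - 1        # k + n - 1 cycles in total
    print(" " * 6 + "".join(f"{t:>4}" for t in range(1, total_cycles + 1)))
    for i in range(n_instructions):
        row = ["    "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:>4}"                     # instruction i is in stage s during cycle i+s+1
        print(f"{'I' + str(i + 1):<6}" + "".join(row))

timing_diagram()
# Ideal speedup for n instructions on a k-stage pipeline: n*k / (k + n - 1)
print("speedup for n=9, k=6:", 9 * 6 / (6 + 9 - 1))
```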
Notes on the diagram
o Each instruction is assumed to use all six stages
▪ Not always true in reality
▪ To simplify pipeline hardware, timing is set up assuming all 6 stages will be used
o It assumes that all stages can be performed in parallel
▪ Not actually true, especially due to memory access conflicts
▪ Pipeline hardware must accommodate exclusive use of memory access lines, so delays
may occur
▪ Often, the desired value will be in cache, or the FO or WO stage may be null, so
pipeline will not be slowed much of the time
• If the six stages are not of equal duration, there will be some waiting involved for shorter
stages
• The CO (Calculate Operands) stage may depend on the contents of a register that could be
altered by a previous instruction that is still in the pipeline
• It may appear that more stages will result in even more speedup
o There is some overhead in moving data from buffer to buffer, which increases with more
stages
o The amount of control logic for dependencies, etc. for moving from stage to stage increases
exponentially as stages are added
• Conditional branch instructions and interrupts can invalidate several instruction fetches
THE CENTRAL PROCESSING UNIT
Reduced Instruction Set Computers (RISCs)
Introduction
• RISC is one of the few true innovations in computer organization and architecture in the last
50 years of computing.
• Key elements common to most designs:
o A limited and simple instruction set
o A large number of general purpose registers, or the use of compiler technology to optimize
register usage
o An emphasis on optimizing the instruction pipeline
Instruction Execution Characteristics
• Overview
o Semantic Gap - the difference between the operations provided in high-level languages and
those provided in computer architecture
o Symptoms of the semantic gap:
▪ Execution inefficiency
▪ Excessive machine program size
▪ Compiler complexity
o New designs had features trying to close gap:
▪ Large instruction sets
▪ Dozens of addressing modes
▪ Various HLL statements in hardware
o Intent of these designs:
▪ Make compiler-writing easier
▪ Improve execution efficiency by implementing complex sequences of operations in
microcode
▪ Provide support for even more complex and sophisticated HLL's
o Concurrently, studies of the machine instructions generated by HLL programs
▪ Looked at the characteristics and patterns of execution of such instructions
▪ Results lead to using simpler architectures to support HLL's, instead of more complex
o To understand the reasoning of the RISC advocates, we look at study results on 3 main aspects
of computation:
▪ Operations performed - the functions to be performed by the CPU and its interaction
with memory.
▪ Operands used - types of operands and their frequency of use. Determine memory
organization and addressing modes.
▪ Execution Sequencing - determines the control and pipeline organization.
o Study results are based on dynamic measurements (during program execution), so that we can
see effect on performance
• Operations
o Simple counting of statement frequency indicates that assignment (data movement)
predominates, followed by selection/iteration.
o Weighted studies show that call/return actually accounts for the most work
o Target architectural organization to support these operations well
o Patterson study also looked at dynamic frequency of occurrence of classes of variables.
Results showed a preponderance of references to highly localized scalars:
▪ Majority of references are to simple scalars
▪ Over 80% of scalars were local variables
▪ References to arrays/structures require a previous ref to their index or pointer, which
is usually a local scalar
• Operands
o Another study found that each instruction (DEC-10 in this case) references 0.5 operands in
memory and 1.4 registers.
o Implications:
▪ Need for fast operand accessing
▪ Need for optimized mechanisms for storing and accessing local scalar variables
• Execution Sequencing
o Subroutine calls are the time-consuming operation in HLL's
o Minimize their impact by
▪ Streamlining the parameter passing
▪ Efficient access to local variables
▪ Support nested subroutine invocation
o Statistics
▪ 98% of dynamically called procedures passed fewer than 6 parameters
▪ 92% use less than 6 local scalar variables
▪ Rare to have long sequences of subroutine calls followed by returns (e.g., a recursive
sorting algorithm)
▪ Depth of nesting was typically rather low
• Implications
o Reducing the semantic gap through complex architectures may not be the most efficient use
of system hardware
o Optimize machine design based on the most time-consuming tasks of typical HLL programs
o Use large numbers of registers
▪ Reduce memory reference by keeping variables close to CPU (more register refs
instead)
▪ Streamlines instruction set by making memory interactions primarily loads and stores
o Pipeline design
▪ Minimize impact of conditional branches
o Simplify instruction set rather than make it more complex
Large Register Files
• How can we make programs use registers more often?
o Software - optimizing compilers
▪ Compiler attempts to allocate registers to those variables that will be used most in a
given time period
▪ Requires sophisticated program-analysis algorithms
o Hardware
▪ Make more registers available, so that they'll be used more often by ordinary compilers
▪ Pioneered at Berkeley; used in the first commercial RISC product, the Pyramid
Reduced Instruction Set Architecture
• Why CISC?
o CISC trends to richer instruction sets
▪ More instructions
▪ More complex instructions
o Reasons
▪ To simplify compilers
▪ To improve performance
• Are compilers simplified?
o Assertion: If there are machine instructions that resemble HLL statements, compiler
construction is simpler
o Counter-arguments:
▪ Complex machine instructions are often hard to exploit because the compiler must
find those cases that fit the construct
▪ Other compiler goals (minimizing code size, reducing instruction execution count, and
enhancing pipelining) are more difficult to achieve with a complex instruction set
▪ Studies show that most instructions actually produced by CISC compilers are the
relatively simple ones
• Is performance improved?
o Assertion: Programs will be smaller and they will execute faster
▪ Smaller programs save memory
▪ Smaller programs have fewer instructions, requiring less instruction fetching
▪ Smaller programs occupy fewer pages in a paged environment, so have fewer page
faults
o Counter-arguments:
▪ Inexpensive memory makes memory savings less compelling
• CISC programs may contain fewer instructions, but each instruction uses more bits, so total memory
used may not be smaller
o Opcodes require more bits
o Operands require more bits because they are usually memory addresses, as opposed to register
identifiers (which are the usual case for RISC)
• The entire control unit must be more complex to accommodate seldom used complex
operations, so even the more often-used simple operations take longer
• The speedup for complex instructions may be mostly due to their implementation as simpler
instructions in microcode, which is similar to the speed of simpler instructions in RISC
(except that the CISC designer must decide a priori which instructions to speed up in this way)
• Characteristics of RISC Architectures
o One instruction per cycle
▪ A machine cycle is defined by the time it takes to fetch two operands from registers,
perform an ALU operation, and store the result in a register
▪ RISC machine instructions should be no more complicated than, and execute about as
fast as microinstructions on a CISC machine
▪ No microcoding needed, and simple instructions will execute faster than their CISC
equivalents due to no access to microprogram control store.
o Register-to-register operations
▪ Only simple LOAD and STORE operations access memory
▪ Simplifies instruction set and control unit
▪ Ex. Typical RISC has 2 ADD instructions
▪ Ex. VAX has 25 different ADD instructions
▪ Encourages optimization of register use
o Simple addressing modes
▪ Almost all instructions use simple register addressing
▪ A few other modes, such as displacement and PC relative, may be provided
▪ More complex addressing is implemented in software from the simpler ones
▪ Further simplifies instruction set and control unit
o Simple instruction formats
▪ Only a few formats are used
▪ Further simplifies the control unit
▪ Instruction length is fixed and aligned on word boundaries
▪ Optimizes instruction fetching
▪ Single instructions don't cross page boundaries
▪ Field locations (especially the opcode) are fixed
▪ Allows simultaneous opcode decoding and register operand access
• Potential benefits
o More effective optimizing compilers
o Simpler control unit can execute instructions faster than a comparable CISC unit
o Instruction pipelining can be applied more effectively with a reduced instruction set
o More responsiveness to interrupts
▪ They are checked between rudimentary operations
▪ No need for complex instruction restarting mechanisms
• VLSI implementation
o Requires less "real estate" for control unit (6% in RISC I vs. about 50% for CISC microcode
store)
o Less design and implementation time
RISC Pipelining
• The simplified structure of RISC instructions allows us to reconsider pipelining
o Most instructions are register-to-register, so an instruction cycle has 2 phases
▪ I: Instruction Fetch
▪ E: Execute (an ALU operation w/ register input and output)
o For load and store operations, 3 phases are needed
▪ I: Instruction fetch
▪ E: Execute (actually memory address calculation)
▪ D: Memory (register-to-memory or memory-to-register)
• Since the E phase usually involves an ALU operation, it may be longer than the other phases.
In this case, we can divide it into 2 sub phases:
o E1: Register file read
o E2: ALU operation and register write
The RISC vs. CISC Controversy
• In spite of the apparent advantages of RISC, it is still an open question whether the RISC
approach is demonstrably better.
• Studies to compare RISC to CISC are hampered by several problems (as of the textbook
writing):
o There is no pair of RISC and CISC machines that are closely comparable
o No definitive set of test programs exist.
o It is difficult to sort out hardware effects from effects due to skill in compiler writing.
• Most of the comparative analysis on RISC has been done on “toy” machines, rather than
commercial products.
• Most commercially available “RISC” machines possess a mixture of RISC and CISC
characteristics.
• The controversy has died down to a great extent
o As chip densities and speeds increase, RISC systems have become more complex
o To improve performance, CISC systems have increased their number of general purpose
registers and increased emphasis on instruction pipeline design.