Computer System Architecture
Just as buildings do, each computer has a visible structure, referred to as its architecture. The
architecture of a building can be examined at various levels of detail: the number of
stories, the size of the rooms, the details of door and window placement, and so on. One can
likewise examine a computer's architecture at various levels of detail of its basic hardware
elements, which in turn depend on the type of computer (personal computer, supercomputer,
etc.) required.
Computer architecture is defined as the science of selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. It can also
be described as the logical structure of the computer system. The computer architecture forms
the backbone for building successful computer systems and largely determines a computer
system's quality attributes, such as performance and reliability.
Learning objectives
After studying this chapter, students should be able to:
• Define architecture and state the difference between computer architecture and
computer organization
• State and describe the abstraction layers of a computer architecture
• Name some examples of architectures and describe the functioning of the von
Neumann architecture
• State the difference between RISC and CISC instruction sets
• State and explain some examples of modern processors
There are four components required for the implementation of a computerized input-process-
output model:
1. The computer hardware, which provides the physical mechanisms to input and
output data, to manipulate and process data, and to electronically control the various
input, output, and storage components.
2. The software, both application and system, which provides instructions that tell the
hardware exactly what tasks are to be performed and in what order.
3. The data that is being manipulated and processed. This data may be numeric, it may
be alphanumeric, it may be graphic, or it may take some other form, but in all cases it
must be representable in a form that the computer can manipulate.
4. The communication component, which consists of hardware and software that
transport programs and data between interconnected computer systems.
The computer's organization consists of the physical resources that realize the architecture;
these include the CPU, the memory and the I/O controllers. They are digital systems with
registers, buses, ALUs, sequencers, etc.
Computer systems span many levels of detail, which in computer science we call levels of
abstraction. Abstractions help us express intangible concepts in visible representations
that can be manipulated. In a layered architecture, complex problems can be segmented
into smaller, more manageable parts, and each layer is specialized for a specific function.
Team development is possible because of this logical segmentation: a team of programmers
can build the system once the work has been sub-divided along clear boundaries.
Figure 1 illustrates another view of a computer system, which comprises different levels
of language and means of translating these languages into lower-level languages. Finally, the
microprogram is loaded onto the hardware.
This layer consists of digital circuits, which form the digital systems of the
microarchitecture level. Digital circuits use two types of components: gates and flip-flops. A
gate outputs 1 or 0 depending on its current input values, i.e. the output now is a function of
the inputs now. The most commonly used gates are AND, OR, NOT, NAND and NOR.
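The behaviour of these gates can be sketched in a few lines of Python. This is a software model, not a circuit description: each gate is simply a function of its current inputs, and gates compose into larger circuits, here an XOR built entirely from NAND gates.

```python
# Basic gates: the output depends only on the current inputs (no memory).
def AND(a, b):  return a & b
def OR(a, b):   return a | b
def NOT(a):     return 1 - a
def NAND(a, b): return NOT(AND(a, b))
def NOR(a, b):  return NOT(OR(a, b))

# Gates compose into larger circuits, e.g. XOR built from four NAND gates.
def XOR(a, b):
    n = NAND(a, b)
    return NAND(NAND(a, n), NAND(b, n))

# Print the XOR truth table.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))
```

Flip-flops differ from gates precisely in that their output depends on past inputs as well, so they cannot be modelled by a pure function like the ones above.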
A very good example of computer architecture is the von Neumann architecture, which is still
used by most types of computers today. It was proposed by the mathematician John von
Neumann in 1945. It comprises the five classical components (input, output,
memory, datapath and control). The processor is divided into an arithmetic logic unit
(ALU) and a control unit, a method of organization that persists to the present. Within the
ALU, an accumulator supports efficient addition or incrementation of values corresponding
to variables such as loop indices.
The Von Neumann Architecture comprises the following components: Central Processing
Unit (CPU), Input Unit, Output Unit, and Storage Unit. The diagram below shows the
logical functioning of all those components.
The von Neumann architecture has a significant disadvantage - its speed is dependent on the
bandwidth or throughput of the datapath between the processor and memory. This is called
the von Neumann bottleneck.
The central processing unit (CPU), also known as the processor, is the brain of the computer
system: it processes data (input) and converts it into meaningful information (output). It is
referred to as the administrative section of the computer system, as it interprets data and
instructions, coordinates operations and supervises the execution of instructions. The CPU
works with data in discrete form, that is, either 1 or 0.
The CPU consists of three main subsystems: the arithmetic/logic unit (ALU), the control unit
(CU) and the registers. These three subsystems work together to provide operational
capabilities to the computer.
This unit performs the arithmetic operations (such as addition and subtraction) and logical
operations (such as AND and OR) on the available data. Whenever an arithmetic or logical
operation is to be performed, the required data are transferred from the memory unit to the
ALU, the operation is performed, and the result is returned to the memory unit. Before
processing is complete, data may need to be transferred back and forth several times between
these two sections. The ALU comprises two units: an arithmetic unit and a logic unit.
Arithmetic Unit: The arithmetic unit contains the circuitry that is responsible for
performing the actual computing and carrying out arithmetic calculations such as
addition, subtraction, multiplication and division.
The control unit can be thought of as the heart of the CPU. It checks the correctness of the
sequence of operations. It fetches the program instructions from the memory unit, interprets
them and ensures correct execution of the program. It also controls the input/output devices
and directs the overall functioning of the other units of the computer.
Figure 5 illustrates how the control unit instructs the other parts of the CPU (i.e. the ALU and
registers) and the I/O devices on what to do and when to do it. In addition, it determines what
data are needed, where they are stored and where to store the results of the operation, and it
sends the control signals to the devices involved in the execution of the instructions. It
administers the movement of the large amounts of instructions and data used by the computer.
To maintain the proper sequence of events required for any processing task, the control unit
uses clock inputs. Thus, the control unit repeats a set of four basic operations: fetching,
decoding, executing and storing.
1. Fetching: It is the process of obtaining a program instruction or data item from the
memory.
2. Decoding: It is the process of translating the instruction into commands the computer
can execute.
3. Executing: It is the process of carrying out the commands.
4. Storing: It is the process of writing the results to the memory.
IV.1.3 Registers:
These are the special-purpose, high-speed temporary memory units that can hold varied
information such as data, instructions, addresses and intermediate results of calculations.
Essentially, they hold the information that the CPU is currently working on. Registers can be
considered the CPU's working memory, an additional storage location that provides the
advantage of speed. Registers work under the direction of the control unit to accept, hold and
transfer instructions or data and perform arithmetic or logical comparisons at high speed. The
control unit uses a data storage register in a similar way a store owner uses a cash register as
a temporary, convenient place to store transactions. As soon as a particular instruction or
piece of data has been processed, the next instruction immediately replaces it, and the
information that results from the processing is returned to main memory. Figure 6 shows
various types
of registers present inside a CPU.
The size or length of each register is determined by its function. For example, the memory
address register, which holds the address of the next location in memory to be accessed, must
have the same number of bits as the memory address. The instruction register holds the next
instruction to be executed and, therefore, should have the same number of bits as the
instruction. (NB: The number and sizes of registers vary from processor to processor.)
Random access memory (RAM) directly provides the required information to the processor.
RAM can be defined as a block of sequential memory locations, each of which has a unique
address and contains a data element. It stores programs and data that are in active use. It is
volatile in nature, which means the information stored in it remains only as long as the power
is switched on. RAM can be further classified into
two categories:
• Dynamic Random Access Memory (DRAM): This type of RAM holds data
dynamically, with the help of refresh circuitry: every second, or even more frequently,
the content of each memory cell is read, and the reading action refreshes the contents
of the memory. DRAMs are made from transistors and capacitors. The capacitor holds
an electrical charge if the bit contains 1, and no charge if the bit is 0. The transistor
reads the contents of the capacitor. The charge is held only for a short period before it
fades away, which is where the refresh circuitry comes in.
• Static Random Access Memory (SRAM): SRAM, along with DRAM, is essential for
a system to run optimally, because it is very fast compared to DRAM. It is effective
because most programs access the same data repeatedly, and keeping this
information in the fast SRAM allows the computer to avoid accessing the slower
DRAM. Data are first written to SRAM on the assumption that they will be used again
soon. SRAM is generally included in a computer system under the name of cache.
As the name suggests, read-only memory (ROM) can only be read, not written. In other
words, the CPU can only read from any location in the ROM but cannot write. The ROM
stores the initial start-up instructions and routines in the BIOS (basic input/output system).
The contents of ROM are not lost even in case of a sudden power failure, thus making it non-
volatile in nature. The instructions in the ROM are built into the electronic circuits of the
chip, which is called firmware. The ROM is also random access in nature. Various types of
ROM, namely, programmable read-only memory (PROM), erasable programmable read-
only memory (EPROM) and electrically erasable programmable read-only memory
(EEPROM) are in existence.
The bus is the means by which the functional units are interconnected to enable data transport
(e.g. writing the content of a CPU register to a certain address in memory). It is a set of
connections between two or more components/devices, designed to transfer several or all bits
of a word from source to destination. A bus consists of multiple paths, also termed lines, and
each line is capable of transferring one bit at a time. A bus can be unidirectional
(data can be transmitted in only one direction) or bidirectional (data can be transmitted
in both directions). A bus that connects all three components (CPU, memory and I/O
devices) is called a system bus.
a) Data lines: Data lines provide a path for moving data between the system modules;
collectively they are known as the data bus. Normally, a data bus consists of 8, 16 or
32 separate lines. The number of lines in a data bus is called the width of the data
bus, and it limits the maximum number of bits that can be transferred
simultaneously between two modules. The width of the data bus thus helps determine
the overall performance of a computer system.
b) Address Lines: Address lines are used to designate the source or destination of the
data on the data bus. Address lines are collectively called the address bus. The width of
the address bus determines the maximum possible memory supported by a system. For
example, if a system has a 16-bit wide address bus, it can support a memory size of
2^16 (or 65536) bytes.
c) Control lines: Control lines are used to control access to the data and address buses,
which is required because the bus is a shared medium. The control lines are collectively
called the control bus. These lines are used for the transmission of commands and
timing signals (which validate data and addresses) between the system modules.
Timing signals indicate whether data and address information is valid, whereas
command signals specify which operations are to be performed. Some control lines
provide clock signals to synchronize operations and reset signals to initialize the
modules. Control lines are also required for reading from and writing to I/O devices
or memory.
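The address-bus arithmetic above is worth making concrete. A minimal sketch in plain Python (no particular hardware assumed):

```python
# Maximum addressable memory for a given address-bus width:
# each of the width_bits lines carries one bit of the address,
# so there are 2**width_bits distinct addresses.
def addressable_bytes(width_bits):
    return 2 ** width_bits

for width in (16, 20, 32):
    print(f"{width}-bit address bus -> {addressable_bytes(width)} bytes")
```

A 16-bit address bus gives 2^16 = 65536 bytes (64 KB), matching the example above; a 32-bit bus gives 4 GB.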
The bus architecture comes with a serious disadvantage: an electronic bus can transfer only
one item at a time (e.g., one data word or one address). The bus transmission speed thus
imposes a limit on the overall performance of the system (a phenomenon known as the bus
bottleneck).
IV.3.2 Cache
A cache is a piece of very fast memory, made from high-speed static RAM that reduces the
time of accessing data. It is very expensive and generally incorporated in the processor,
where valuable data and program segments are kept. This enables the processor to access data
quickly whenever it is needed. The major reason for incorporating cache in the system is that
the CPU is much faster than the DRAM and needs a place to store information that can be
accessed rapidly. The cache enables the system to keep up with the processor's speed: it
fetches frequently used data from the DRAM and buffers (stores) it for further processor
usage. Cache can be further categorized into three levels:
• Level 1 Cache (L1): Level 1 cache, also known as primary cache, is built into the
processor chip. It is a small fast memory area that works together with the Level 2
cache to provide the processor much faster access to important and often used data.
• Level 2 Cache (L2): Level 2 cache, also known as secondary cache, is a collection of
static RAM chips built onto the motherboard. It is a little larger and slower than
L1, but faster than the main memory. L1 and L2 cache are used together for optimal
use of the processor.
• Level 3 Cache (L3): L3 cache memory is an enhanced form of memory present on the
motherboard of the computer. It is an extra cache built into the motherboard between
the processor and the main memory to speed up the processing operations. It reduces
the time gap between the request and the retrieval of the data and instructions, thereby
accessing data much more quickly than the main memory.
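The reason a cache pays off, as described above, is that programs access the same data repeatedly. The sketch below models this in Python; it is an illustrative toy (a single cache level with a naive oldest-first eviction), not any specific hardware design.

```python
# Toy model: a small cache in front of slower "DRAM", counting hits/misses.
class Cache:
    def __init__(self, memory, capacity=4):
        self.memory = memory          # models the slower main memory (DRAM)
        self.capacity = capacity      # number of cached entries
        self.lines = {}               # address -> cached value
        self.hits = self.misses = 0

    def read(self, addr):
        if addr in self.lines:        # fast path: data already in the cache
            self.hits += 1
            return self.lines[addr]
        self.misses += 1              # slow path: fetch from DRAM
        if len(self.lines) >= self.capacity:
            # naive policy: evict the oldest cached entry
            self.lines.pop(next(iter(self.lines)))
        self.lines[addr] = self.memory[addr]
        return self.lines[addr]

dram = list(range(100, 116))          # pretend main memory contents
cache = Cache(dram)
for addr in [0, 1, 0, 1, 2, 0]:       # repeated accesses show locality
    cache.read(addr)
print("hits:", cache.hits, "misses:", cache.misses)
```

With this access pattern only the first touch of each address misses; the repeats are served from the cache, which is exactly the effect the text describes.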
A computer understands instructions only in terms of 0s and 1s, which is called machine
language. To accomplish significant tasks, the processor must have two inputs: instructions
and data. The instructions tell the processor what actions need to be performed on the data.
Each machine language instruction is composed of two parts: the op-code and the operand.
The bit pattern appearing in the op-code field indicates which operation (e.g. STORE, ADD,
SUB and so on) is requested. The bit pattern of the operand field provides further details
about the operation specified by the op-code.
Figure 9 illustrates the format of an instruction for the processor. The first three bits represent
the op-code and the final six bits represent the operand. The middle bit indicates whether the
operand is a memory address or a number; when the bit is set to 1, the operand represents a
number.
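Encoding and decoding such a format is just bit shifting and masking. The sketch below assumes a hypothetical 10-bit word built from the fields described above (3-bit op-code, 1 mode bit, 6-bit operand); the exact instruction width is not stated in the text, so treat the layout as illustrative.

```python
# Hypothetical 10-bit instruction: [op-code:3][mode:1][operand:6].
# mode = 1 means the operand is a number; 0 means it is a memory address.
def encode(opcode, mode, operand):
    return (opcode << 7) | (mode << 6) | operand

def decode(word):
    opcode  = (word >> 7) & 0b111     # top 3 bits
    mode    = (word >> 6) & 0b1       # single mode bit
    operand = word & 0b111111         # bottom 6 bits
    return opcode, mode, operand

word = encode(0b010, 1, 0b000101)     # e.g. op-code 2 with the literal 5
print(f"{word:010b}")                 # -> 0101000101
```

Decoding the word recovers exactly the three fields that went in, which is what the control unit's decoder does in hardware.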
• Data Transfer Instructions: These are used to transfer or copy data from one location
to another either in the registers or in the external main memory.
• Arithmetic Instructions: These instructions are used to perform operations on
numerical data.
• Logical Instructions: These are used to perform Boolean operations on non-
numerical data.
• Program Control Instructions: These are used to change the sequence of a program
execution.
• Input–output Instructions: These are used to transfer data from and to I/O devices.
Now, let us discuss a few very basic instructions in assembly language. These instructions
tell the processor to carry out various operations.

Instruction    Function
ADD            Perform addition
SUB            Perform subtraction
MUL            Perform multiplication
MOV            Move the contents of one location to another
DIV            Perform division
LDA            Load the contents of a variable
JMP            Jump to an instruction
ABS            Calculate the absolute value
Example:
ADD R1, R2 will add the contents of registers R1 and R2.
MOV R1, R2 will move the content of register R2 into R1.
LDA Var1 will load the contents of 'Var1' into the accumulator.
During this cycle, the instruction, which is to be executed next, is fetched from the memory
to the processor. The steps performed during the fetch cycle are as follows:
1. The program counter (PC) keeps track of the memory location of the next instruction.
2. This address is transferred from PC to MAR.
3. The instruction is read from the memory.
4. Then, the PC is incremented by 1 (PC = PC + 1) and instruction so obtained is
transferred to the IR.
5. In the IR, the unique bit patterns that make up machine language are extracted and
sent to the decoder.
The decode cycle is responsible for recognizing the operation that the bit pattern represents
and activating the correct circuitry to perform that operation. The steps performed during the
decode cycle are as follows:
1. The operation code (op-code) of the instruction is read first and then interpreted.
2. The data required by the instruction (operand) are then transferred to the data register
(DR).
Once the instruction has been decoded, the operation specified by the op-code is performed
on user-provided data in ALU. The execution cycle involves following steps:
1. The data is fetched into ALU from the memory location pointed by memory address
register.
2. The operation specified by the decoded op-code is performed on the data in ALU.
After the fetch, decode and execute cycles have executed, the results are ready to be stored.
The steps involved in the store cycle are as follows:
1. The results from the execution cycle are stored in the memory buffer register.
2. Then, the results from the memory buffer register are stored back in the memory.
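The fetch, decode, execute and store cycles described above can be sketched as a toy von Neumann machine. The op-codes and instruction representation below are invented for illustration (instructions are stored as tuples rather than bit patterns, and a single accumulator stands in for the register set); only the cycle structure mirrors the text.

```python
# Hypothetical op-codes for a toy accumulator machine.
LDA, ADD, STA, HLT = 0, 1, 2, 3

def run(memory):
    pc, acc = 0, 0                    # program counter and accumulator
    while True:
        ir = memory[pc]               # fetch: read the instruction at PC
        pc += 1                       # increment the PC
        op, operand = ir              # decode: split op-code and operand
        if op == LDA:                 # execute: load a memory cell
            acc = memory[operand]
        elif op == ADD:               # execute: add a memory cell
            acc += memory[operand]
        elif op == STA:               # store: write the result back
            memory[operand] = acc
        elif op == HLT:
            return memory

# Program: mem[8] = mem[6] + mem[7]; code and data share one memory.
program = [(LDA, 6), (ADD, 7), (STA, 8), (HLT, 0), 0, 0, 20, 22, 0]
print(run(program)[8])                # -> 42
```

Note that program and data live in the same memory array: that shared storage is the defining trait of the von Neumann design, and the single path to it is the bottleneck discussed earlier.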
The 8086 is a case in point: its instructions range from one byte to a maximum of six
bytes in length. Such instructions are called variable-length instructions and are commonly
used on CISC machines. The advantage of such instructions is that each instruction
can use exactly the amount of space it requires, so variable-length instructions
reduce the amount of memory space required for a program.
On the other hand, it is possible to have fixed-length instructions where, as the name
suggests, each instruction has the same length. Fixed-length instructions are commonly used
with RISC processors such as the PowerPC and Alpha processors. Since each instruction
occupies the same amount of space, every instruction must be long enough to specify a
memory operand even if the instruction does not use one. Hence, some memory space is
wasted by this form of instruction.
The processors are built with the ability to execute a limited set of basic operations. The
collections of these operations are known as the processor's instruction set. An instruction
set is necessary so that a user can create machine language programs to perform any logical
and/or mathematical operations. The instruction set is hardwired (embedded) in the
processor, and it determines the machine language for the processor. In general, the more
complicated the instruction set, the slower the processor tends to work.
Processors differ from one another by their instruction sets. If the same program can run on
two different processors, they are said to be compatible. For example, programs written for
IBM computers may not run on Apple computers because these two architectures (different
processors) are not compatible. Since each processor has its unique instruction set, machine
language programs written for one processor will normally not run on a different processor.
Based upon the instruction sets, there are two common types of architectures: complex
instruction set computer (CISC) and reduced instruction set computer (RISC).
A complex instruction set computer (CISC) has a processor with a more extensive and
complex instruction set, which shifts most of the burden of generating machine instructions
to the processor. For example, instead of requiring a compiler to write a long sequence of
machine instructions for calculating a square root, a CISC processor incorporates hardwired
circuitry for performing the square root in a single step. Writing instructions for a CISC
processor is comparatively easy because a single instruction is sufficient to utilize the
built-in ability. Most of the PCs today include a CISC processor.
A reduced instruction set computer (RISC), by contrast, supports only a small set of simple
instructions. As each instruction is executed directly by the processor, no hardwired circuitry
(used for complex instructions) is required. This allows RISC processors to be smaller,
consume less power and run cooler than CISC processors. Due to these advantages, RISC
processors are ideal for embedded applications such as mobile phones, PDAs and digital
cameras. In addition, the simple design of a RISC processor reduces its development time as
compared to a CISC processor.
Hardware interrupts are used by devices to communicate that they require attention from the
operating system. Some common examples are a hard disk signaling that it has read a series
of data blocks, or a network device signaling that it has processed a buffer containing network
packets. Interrupts are also used for asynchronous events, such as the arrival of new data from
an external network. Hardware interrupts are delivered directly to the CPU using a small
network of interrupt management and routing devices. The sections below describe the
different types of interrupt and how they are processed by the hardware and by the operating
system.
Hardware interrupts are referenced by an interrupt number. These numbers are mapped back
to the piece of hardware that created the interrupt. This enables the system to monitor which
device created the interrupt and when it occurred.
In most computer systems, interrupts are handled as quickly as possible. When an interrupt is
received, any current activity is stopped and an interrupt handler is executed. The handler
preempts any other running programs and system activities, which can slow the entire
system down and create latencies.
An interrupt is said to be masked when it has been disabled, or when the CPU has been
instructed to ignore it. A non-maskable interrupt (NMI) cannot be ignored, and is generally
used only for critical hardware errors.
NMIs are normally delivered over a separate interrupt line. When an NMI is received by the
CPU, it indicates that a critical error has occurred, and that the system is probably about to
crash. The NMI is generally the best indication of what might have caused the problem.
Because NMIs cannot be ignored, they are also used by some systems as a hardware
monitor. The device sends a stream of NMIs, which are checked by an NMI handler in the
processor. If certain conditions are met, such as an interrupt not being triggered after a
specified length of time, the NMI handler can produce a warning and debugging information
about the problem. This helps to identify and prevent system lockups.
System management interrupts (SMIs) are used to offer extended functionality, such as
legacy hardware device emulation. They can also be used for system management tasks.
SMIs are similar to NMIs in that they use a special electrical signalling line directly into the
CPU, and are generally not able to be masked.
When an SMI is received, the CPU will enter System Management Mode (SMM). In this
mode, a very low-level handler routine is run to handle the SMIs. The SMM is typically
provided directly from the system management firmware, often the BIOS or the EFI.
VII.5 Polling
Polling, or polled operation, in computer science, refers to actively sampling the status of an
external device by a client program as a synchronous activity. Polling is most often used in
terms of input/output (I/O), and is also referred to as polled I/O or software-driven I/O.
Polling is sometimes used synonymously with busy-wait polling (busy waiting). In this
situation, when an I/O operation is required, the computer does nothing other than check the
status of the I/O device until it is ready, at which point the device is accessed. Polling has the
disadvantage that if there are too many devices to check, the time required to poll them can
exceed the time available to service the I/O device.
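Busy-wait polling can be sketched in a few lines. The `Device` class below is a stand-in that simulates hardware becoming ready after a number of status checks; the point is the shape of the loop, in which the CPU does nothing but sample the status flag.

```python
# Simulated device: its status flag becomes true after `ready_after` polls.
class Device:
    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def status(self):
        self.checks += 1
        return self.checks >= self.ready_after   # ready flag

    def read(self):
        return "data"

def polled_read(device):
    # Busy wait: the CPU burns cycles here until the device is ready.
    while not device.status():
        pass
    return device.read()

dev = Device(ready_after=1000)
print(polled_read(dev), "after", dev.checks, "polls")
```

Contrast this with the interrupt-driven approach above: with interrupts the CPU does useful work until the device signals readiness, whereas here every poll before the last one is wasted time.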
In principle, a bus-based computer solves the von Neumann bottleneck problem by using a
fast bus. In practice, the bus is rarely fast enough to support I/O for the common case (90
percent of practical applications), and bus throughput can be significantly reduced under
large amounts of data.
Recall the old saying, "Many hands make light work." In computers, the use of many
processors together reduces the amount of time required to perform the work of solving a
given problem. Due to I/O and routing overhead, this efficiency is sublinear in the number of
processors. That is, if W(N) [or T(N)] denotes the work [or time to perform the work]
associated with N processors, then the following relationships hold in practice:

W(N) < N · W(1)
T(N) > T(1) / N
The first equation means that the work performed by N processors working on a task, where
each processor performs work W(1) [the work of one processor in a sequential computation
paradigm], will be slightly less than N times W(1). Note that we use "<" instead of "="
because of the I/O and routing overhead required to coordinate the processors.
The second equation means essentially the same thing as the first equation, but the work is
replaced by time. Here, we are saying that if one processor takes time T(1) to solve a
problem, then that same problem solved on an N-processor architecture will take time slightly
greater than T(1)/N, assuming all the processors work together at the same time. As in the
preceding paragraph, this discrepancy is due to the previously-described overhead.
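The sublinear relationship T(N) > T(1)/N can be illustrated with a simple cost model. The overhead term below (a fixed cost per processor) is an invented assumption for illustration; real overheads depend on the machine and the problem.

```python
# Model: parallel time = ideal T(1)/N plus a hypothetical per-processor
# coordination overhead, so T(N) always exceeds T(1)/N.
def parallel_time(t1, n, overhead_per_proc=0.02):
    return t1 / n + overhead_per_proc * n

t1 = 10.0                              # sequential time, arbitrary units
for n in (1, 2, 4, 8):
    tn = parallel_time(t1, n)
    print(n, "processors:", round(tn, 3), "speedup:", round(t1 / tn, 2))
```

In this model the speedup T(1)/T(N) grows with N but always stays below N, and with a fixed overhead per processor it eventually peaks and declines, which is why adding processors does not help indefinitely.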
This is a simple architecture that is useful for solving selected types of compute-intensive
problems. However, if you try to solve data-intensive problems on such an architecture, you
encounter the von Neumann bottleneck trying to read and write the large amount of data from
and to the shared memory.
Such an architecture works well when:
- each CPU can solve part of a problem with part of the problem's data, and
- there is little need for data interchange between processors.
Modern processors overlap these stages in a pipeline, like an assembly line. While one
instruction is executing, the next instruction is being decoded, and the one after that is being
fetched... Pipelining is an implementation technique where multiple instructions are
overlapped in execution.
Once the pipeline is full, the processor completes one instruction every cycle (CPI = 1). With
a four-stage pipeline, this is a four-fold speedup without changing the clock speed at all.
The pipeline designer's goal is to balance the length of each pipeline stage. If the stages are
perfectly balanced, then:

time per instruction (pipelined) = time per instruction (unpipelined) / number of pipe stages
[Figure: non-pipelined versus pipelined execution of three load instructions (lw $1, 100($0);
lw $2, 200($0); lw $3, 300($0)), each passing through Instruction Fetch, REG RD, ALU,
MEM and REG WR stages. Non-pipelined, a new instruction starts every 800 ps; pipelined, a
new instruction starts every 200 ps.]
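The timings in the figure can be checked with a little arithmetic, sketched below. The stage count and timings follow the figure's example (five stages of 200 ps each, 800 ps for a complete non-pipelined instruction); any other pipeline would use its own numbers.

```python
# Non-pipelined: each instruction runs start to finish before the next begins.
def nonpipelined_time(n_instr, instr_ps=800):
    return n_instr * instr_ps

# Pipelined: the first instruction fills the pipe (n_stages cycles),
# then one instruction completes every stage time.
def pipelined_time(n_instr, n_stages=5, stage_ps=200):
    return (n_stages + n_instr - 1) * stage_ps

print(nonpipelined_time(3))   # -> 2400 ps for three instructions
print(pipelined_time(3))      # -> 1400 ps for the same three
```

For just three instructions the gain is modest because filling the pipeline dominates; as the instruction count grows, the pipelined time per instruction approaches the 200 ps stage time, giving the four-fold steady-state speedup over 800 ps.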
Limits of pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
– Structural hazards: two different instructions use the same hardware in the same cycle.
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
– Control hazards: pipelining of branches and other instructions that change the PC.
Since the clock speed is limited by (among other things) the length of the longest stage in the
pipeline, the logic gates that make up each stage can be subdivided, especially the longer
ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter
stages. Then the whole processor can be run at a higher clock speed! Of course, each
instruction will now take more cycles to complete (latency), but the processor will still be
completing 1 instruction per cycle (throughput), and there will be more cycles per second, so
the processor will complete more instructions per second (actual performance)...
Since the execute stage of the pipeline is really a bunch of different functional units, each
doing its own task, it seems tempting to try to execute multiple instructions in parallel, each
in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced
so that they can decode multiple instructions in parallel and send them out to the "execution
resources"... A superscalar CPU architecture implements a form of parallelism called
instruction level parallelism within a single processor. It therefore allows faster CPU
throughput than would otherwise be possible at a given clock rate.
Most computers have just one CPU, but some models have several. There are even computers
with thousands of CPUs. With single-CPU computers, it is possible to perform parallel
processing by connecting the computers in a network. However, this type of parallel
processing requires very sophisticated software called distributed processing software.
Note that parallel processing differs from multitasking, in which a single CPU executes
several programs at once. Parallel processing is also called parallel computing.
Intel has been making class-leading processors for computers for a long time. It overcame a
period in which AMD reigned by releasing its lineup of Core 2 processors in 2006. Intel now
has a line of processors called the Core i series: the i3, the i5 and the i7 are the new kids on
the block. This guide will help you understand which one is right for you.
Core i3 is the basic-level processor type of the new generation launched by Intel.
All Core i3s are dual-core processors, with clock speeds ranging from 2.93 to 3.06 GHz and
3 MB of cache. Although only dual core, the i3 supports Hyper-Threading, so it can actually
serve two threads per core, i.e. four threads in total. Note that the integrated graphics
processor of the i3 is restricted to a maximum clock speed of 1100 MHz. Core i3s are built
on 32 nm silicon (less heat and energy) and are the cheapest of the lot.
Core i5 comes in two categories, dual core and quad core. In a nutshell:
– i5 dual core: 32 nm fabrication and 4 MB of cache, with clock speeds between 3.2 and
3.6 GHz. Like the Core i3, it has Hyper-Threading support and an integrated graphics
processor, and it adds support for the remarkable Turbo Boost technology.
– i5 quad core: clock speeds of 2.4 and 2.66 GHz and 6-8 MB of cache. Turbo Boost is
supported, but these models do not support Hyper-Threading and do not have an
integrated graphics processor.
Core i7 is the high-end processor; these are also the fastest and the most expensive of the lot.
Four cores are present, as they are quad core. Clock speeds range from 1.06 GHz to
3.20 GHz, and 8 MB of cache is provided. Turbo Boost technology is supported, as is
Hyper-Threading, for a total of eight simultaneous threads. The IGP (integrated graphics) on
Core i7 processors can also reach a higher maximum clock speed of 1350 MHz. They are
built on 32-45 nm silicon (less heat and energy).