ADVANCED COMPUTER ARCHITECTURE

Chapter 1: Fundamentals of Computer Design
Introduction
Incredible progress of computer technology, i.e. more performance, more main memory, and more disk storage, due to innovations in:
Technology – performance growth is fairly steady.
Computer design/architectures – performance growth is much less consistent.

Two significant changes in the computer marketplace contributed to the new architectures:

• Virtual elimination of assembly language programming, which reduced the need for object-code compatibility.
• Creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux.
Growth in processor performance since 1980s
Growth in processor performance up to 1980
• 1st 25 years – both innovations (i.e. technology + architecture/design) contributed – about 25% performance growth per year.
• Late 1970s – emergence of the microprocessor – riding IC technology improvements – about 35% growth per year.
• 1980s – RISC architecture/design – exploited and pioneered two design aspects:
1. Instruction level parallelism [ILP](initially through pipelining and later
through multiple instruction issue)
2. Use of caches (initially in simple forms and later using more sophisticated
organizations and optimizations).
RISC-based computers raised the performance bar, forcing prior
architectures to keep up or disappear.
E.g.
 The Digital Equipment VAX could not, and so it was replaced
by a RISC architecture.
 Intel adopted many of the innovations of RISC designs, i.e. it internally translated x86 instructions into RISC-like instructions.
Growth since the mid-1980s up to 20th century: (1986-2002)
Architectural and organizational enhancements led to 16 years of sustained
growth in performance at an annual rate of over 50% —a rate that is
unprecedented in the computer industry.
The effect of this dramatic growth rate in the 20th century is twofold:
i.Capability enhancement: e.g. by the end of the 20th century the highest-performance microprocessors outperformed supercomputers built less than 10 years earlier.
ii.Dominance of microprocessor-based computers across the entire range of the
computer design:
E.g.
Range of computers – Effect:
• PCs and workstations – Microprocessor based.
• Minicomputers – Traditionally made from off-the-shelf logic or from gate arrays; replaced by servers made using microprocessors.
• Mainframes – Replaced with multiprocessors consisting of small numbers of off-the-shelf microprocessors.
• Even high-end supercomputers – Built with collections of microprocessors.
16-year (1986-2002) renaissance in computer design till the 20th century:

Due to the past innovations in
 Computer design/architecture
 Technology and its efficient use.

By 2002, performance growth had compounded due to this renaissance such that high-performance microprocessors were about seven times faster than what would have been obtained by relying solely on technology, including improved circuit design.
Since 2002, processor performance improvement has
dropped to about 20% per year !
Reasons:
o Maximum power dissipation of air-cooled chips
o Little instruction-level parallelism left to exploit efficiently
o Almost unchanged memory latency.
20% performance growth is now achieved with
• Multiple processors per chip rather than via faster uni-processors.

Hence a historic switch from relying solely on
Instruction-level parallelism (ILP) – implicit parallelism exploited by the compiler and hardware,
to introducing
Thread-level parallelism (TLP) and data-level parallelism (DLP) – explicit parallelism, i.e. requiring the programmer to write parallel code to gain performance.
Classes (ranges) of computers:
Year – Type – Description
• 1960s – Mainframe – Large; costing millions of dollars and stored in computer rooms with multiple operators overseeing their support. Typical applications included business data processing and large-scale scientific computing.
• 1970s – Minicomputer – Smaller-sized; applications in scientific laboratories and timesharing systems (i.e. multiple users sharing a computer interactively through independent terminals).
• 1970s – Supercomputer – High-performance computers for scientific computing. Although few in number, they were important historically because they pioneered innovations that later trickled down to less expensive computer classes.
• 1980s – Desktop computer – Based on microprocessors, in the form of both personal computers and workstations.
• 1980s – Servers – Time-sharing systems were replaced by individually owned desktop computers, and servers emerged, i.e. computers that provided larger-scale services such as reliable, long-term file storage and access, larger memory, and more computing power.
• 1990s – Handheld computing devices (personal digital assistants or PDAs) – Driven by the Internet and the World Wide Web, and by high-performance digital consumer electronics (from video games to set-top boxes).
• Since 2000 – Embedded computers – Computers lodged in other devices where their presence is not immediately obvious, e.g. cell phones.
The changes in different classes of computers changed views on:
 Computing Technologies
 Computing applications
 Computer markets

3 different computing markets (3 mainstream computing classes), each characterized by different applications, requirements, and computing technologies:
Desktop Computing
 The first, and still the largest market in dollar terms.

 Spans from low-end systems that sell for under $500 to high-end,
heavily configured workstations that may sell for $5000. Throughout
this range in price and capability, the desktop market tends to be
driven to optimize Price-performance.

 The combination of performance (measured primarily in terms of compute performance and graphics performance) and price of a system is what matters most to customers in this market, and hence to computer designers. As a result, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems.

 Also well characterized in terms of applications and benchmarking, though the increasing use of Web-centric, interactive applications poses new challenges in performance evaluation.
Servers
 Provide larger-scale and more reliable file and computing services.
 Their growth accelerated with the rise of Web servers, which offer Web-based services and form the backbone of large-scale enterprise computing, replacing the traditional mainframe.
 Expected design characteristics of (key metrics for) servers:
1.Dependability: Failure is far more catastrophic than failure of a single desktop, since
these servers must operate seven days a week, 24 hours a day. Even the estimated costs
of an unavailable system are high and such costs do not account for lost employee
productivity or the cost of unhappy customers.
2.Scalability: Ability to scale up the computing capacity, the memory, the storage, and the
I/O bandwidth i.e. ability to grow in response to an increasing demand for the services
they support or an increase in functional requirements.
3.Throughput: Overall performance of the server—in terms of transactions per minute or
Web pages served per second.
4.Responsiveness to an individual request.
5.Overall efficiency and cost-effectiveness: Determined by how many requests can be
handled in a unit time.
Supercomputers

• A category related to servers.
• The most expensive computers (costing tens of millions of dollars).
• Emphasize floating-point performance.
• Clusters of desktop computers have largely overtaken this class of computer. As clusters grow in popularity, the number of conventional supercomputers (and also the number of companies who make them) is shrinking.
Embedded Computers:

• Fastest growing portion of the computer market.
• Widest spread of processing power and cost: 8-bit and 16-bit processors that may cost less than a dime, 32-bit microprocessors that execute 100 MIPS and cost under $5, and high-end processors for the newest video games or network switches that cost $100 and can execute a billion instructions per second.
Although the range of computing power in the
embedded computing market is very large, price is a key factor in
the design of computers for this space. Performance requirements
do exist, of course, but the primary goal is often meeting the
performance need at a minimum price, rather than achieving higher
performance at a higher price.
Performance requirement in an embedded application:

• Real-time (Reactive) execution:


i. A (hard/ Immediate) real-time performance requirement is when a
segment of the application has an absolute maximum execution
time. E.g. in a digital set-top box, the time to process each video
frame is limited, since the processor must accept and process the
next frame shortly.

ii. Soft real-time performance: Constrains the average time for a particular task as well as the number of instances when some maximum time is exceeded. Such approaches—sometimes called soft real-time—arise when it is possible to occasionally miss the time constraint on an event, as long as not too many are missed.

• Real-time performance tends to be highly application dependent.


Other key characteristics in embedded applications:
i. To minimize memory: In many embedded applications, the
memory can be a substantial portion of the system cost, and it is
important to optimize memory size in such cases. Sometimes the
application is expected to fit totally in the memory on the
processor chip; other times the application needs to fit totally in a
small off-chip memory. In any event, the importance of memory
size translates to an emphasis on code size, since data size is
dictated by the application.

ii. To minimize power: Larger memories also mean more power, and
optimizing power is often critical in embedded applications.
Although the emphasis on low power is frequently driven by the
use of batteries, the need to use less expensive packaging—plastic
(versus ceramic)—and the absence of a fan for cooling also limit
total power consumption.
Defining Computer Architecture
Computer Architecture
1. Determines what attributes are important for a new
computer.
2. Designs a computer to maximize performance while staying
within cost, power, and availability constraints(a.c.p):
Aspects in this task include:
i. Instruction set design
ii. Functional organization
iii. Logic design
iv. Implementation – It may encompass integrated
circuit design, packaging, power, and cooling.
Optimizing the design requires familiarity with a very wide
range of technologies, from compilers and operating
systems to logic design and packaging.
Instruction Set Architecture [ISA] and its 7 dimensions
[using examples from MIPS (Microprocessor without Interlocked
Pipeline Stages) and 80x86] :

• Refers to the actual programmer-visible instruction set.
• Serves as the boundary between the software and hardware.
• 7 dimensions:
1. Class of ISA
2. Memory addressing
3. Addressing modes
4. Types and sizes of operands
5. Operations
6. Control flow instructions
7. Encoding an ISA
1. Class of ISA:
Nearly all ISAs today are classified as general-purpose register architectures, where the operands are either registers or memory locations.
Two types of general-purpose register architectures:
• Register-memory ISA – Can access memory as part of many instructions. E.g. the 80x86, which has 16 general-purpose registers and 16 registers that can hold floating-point data.
• Load-store ISA – Can access memory only with load or store instructions; all recent ISAs are load-store. E.g. MIPS, which has 32 general-purpose registers and 32 floating-point registers.
2. Memory addressing: Byte addressing
Virtually all desktop and server computers, including the 80x86 and MIPS, use byte addressing to access memory operands.
Aligned byte addressing: an access to an object of size s bytes at byte address A is aligned if A mod s = 0.
E.g. MIPS requires that objects be aligned. The 80x86 does not require alignment, but accesses are generally faster if operands are aligned.
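A minimal sketch of the alignment rule (Python used only for illustration; the addresses and access size below are hypothetical values, not from the text):

```python
def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes at byte address `address` is aligned
    when the address is a multiple of the access size (A mod s = 0)."""
    return address % size == 0

# Hypothetical 4-byte (word) accesses
print(is_aligned(0x1000, 4))  # True  - 0x1000 is a multiple of 4
print(is_aligned(0x1002, 4))  # False - misaligned word access
```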
3. Addressing modes:
In addition to specifying registers and constant operands, addressing modes specify the address of a memory object.
• MIPS addressing modes – Register, Immediate (for constants), and Displacement (a constant offset is added to a register to form the memory address).
• 80x86 addressing modes – Register, Immediate (for constants), and Displacement, plus three variations of displacement: no register (absolute), two registers (based indexed with displacement), and two registers where one register is multiplied by the size of the operand in bytes (based with scaled index and displacement). It also has modes like the last three but without the displacement field: register indirect, indexed, and based with scaled index.
4. Types and sizes of operands:
Like most ISAs, MIPS and 80x86 support operand sizes of
 8-bit (ASCII character)
 16-bit (Unicode character or half word)
 32-bit (integer or word)
 64-bit (double word or long integer), and
 IEEE 754 floating point in 32-bit (single precision) and 64-bit (double
precision).
The 80x86 also supports 80-bit floating point (extended double precision).
5. Operations:
The general categories of operations are:
Data transfer
Arithmetic
Logical
Control
Floating point.
MIPS is a simple and easy-to-pipeline ISA, and it is representative of
the RISC architectures being used in 2006.
The 80x86 has a much richer and larger set of operations.
Subset of the instructions in MIPS64.
SP = single precision; DP = double precision. For data, the most significant bit number is 0; least is 63.
6. Control flow instructions
Both the 80x86 and MIPS support conditional branches, unconditional jumps, and procedure calls and returns. Both use PC-relative addressing, where the branch address is specified by an address field that is added to the PC.
DIFFERENCES
• 80x86 – Conditional branches (JE, JNE, etc.) test condition code bits set as side effects of arithmetic/logic operations. A procedure call (CALLF) places the return address on a stack in memory.
• MIPS – Conditional branches (BE, BNE, etc.) test the contents of registers. A procedure call (JAL) places the return address in a register.
7. Encoding an ISA:
Two basic choices on encoding:
i. Fixed length and
ii. Variable length
 All MIPS instructions are 32 bits long, which simplifies
instruction decoding.
 The 80x86 encoding is variable length, ranging from 1
to 18 bytes. Variable-length instructions can take less
space than fixed-length instructions, so a program
compiled for the 80x86 is usually smaller than the same
program compiled for MIPS.
 Different choices will affect how the instructions are
encoded into a binary representation.
The Rest of Computer Architecture: Designing the
Organization and Hardware to Meet Goals and Functional Requirements
The implementation of a computer has two components:
i.Organization: includes the high-level aspects of a computer’s design, such as
the memory system, the memory interconnect, and the design of the internal
processor or CPU (central processing unit—where arithmetic, logic,
branching, and data transfer are implemented).
E.g. two processors with the same ISAs but very different organizations are the
AMD Opteron 64 and the Intel Pentium 4. Both processors implement the x86 instruction
set, but they have very different pipeline and cache organizations.
ii.Hardware: refers to the specifics of a computer, including the detailed
logic design and the packaging technology of the computer. Computers
with identical ISAs and nearly identical organizations may differ in the
detailed hardware implementation.
E.g. the Pentium 4 and the Mobile Pentium 4 are nearly identical but offer different clock rates and different memory systems.
“Architecture” means ?
Architecture covers Three aspects of computer design:
1.Instruction set architecture
2.Organization
3.Hardware.
Role of Computer architects:
Determine the requirements: These may be
Specific features inspired by the market: The presence of a large market for a particular class of
applications might encourage the designers to incorporate requirements that would make the
computer competitive in that market.

Driven by the application software which determines how the computer will be used. If a
large body of software exists for a certain ISA, the architect may decide that a new computer
should implement an existing instruction set (to reduce the software size).

Design a computer to meet functional requirements as well as price, power, performance, and
availability goals.

Must be aware of important trends in both the technology and the use of computers, as such
trends not only affect future cost, but also the longevity of an architecture.
Functional requirements to consider in designing a new computer:

Functional requirements – Typical features required or supported
• Application area – Different targets of the computer depending on the application area.
• Level of software compatibility – Determines the amount of existing software for the computer.
• Operating system requirements – Necessary features to support the chosen OS.
• Standards – Certain standards may be required by the marketplace.

Application area – Target of computer
• General-purpose desktop – Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio.
• Scientific desktops and servers – High-performance floating point and graphics.
• Commercial servers – Support for databases and transaction processing; enhancements for reliability and availability; support for scalability.
• Embedded computing – Often requires special support for graphics or video (or other application-specific extensions); power limitations and power control may be required.

Operating system requirements – Necessary features to support chosen OS
• Size of address space – Very important feature; may limit applications.
• Memory management – Required for a modern OS; may be paged or segmented.
• Protection – Different OS and application needs: page vs. segment; virtual machines.
Trends in Technology
Implementation technology:
Success of a computer depends on its lifetime.
Lifetime of a computer depends on its architecture/design i.e. the design
must survive rapid changes in computer technology.
E.g. The core of the IBM mainframe has been in use for more than 40
years.
Designers must be aware of rapid changes in implementation technology.
Four implementation technologies critical to modern implementations:
1.Integrated circuit logic technology—
•Growth rate in transistor count on a chip of about 40% to 55% per year
due to increase in:
i.Transistor density: By about 35% per year, quadrupling in somewhat
over four years.
ii.Die size ranging from 10% to 20% per year (slow and less predictable).
Trends in Technology

Device speed scales more slowly…….


2. Semiconductor DRAM (dynamic random-access memory)—Capacity
increases by about 40% per year, doubling roughly every two
years.
3. Magnetic disk technology—
– Prior to 1990, density increased by about 30% per year, doubling in
three years.
– disks are 50–100 times cheaper per bit than DRAM.
4. Network technology—
Network performance depends both on the performance of
switches and on the performance of the transmission system.
Designers often design for the next technology!

• A computing system cycle (i.e. 2 years of design and 2 to 3 years of production) is comparable to the cycle of key technologies (e.g. DRAM capacity changes roughly every 5 years), so the designer must plan for these changes.

• Traditionally, cost has decreased at about the rate at which density increases.
Technology thresholds have a significant impact on a wide variety of design
decisions !
• Technology improves continuously; whereas the impact of these
improvements can be in discrete leaps when the threshold is reached.
E.g. when MOS technology reached the threshold of 25,000 to 50,000 transistors on a single chip, it enabled:

a) Single-chip, 32-bit microprocessors (in the early 1980s).
b) On-chip first-level caches (by the late 1980s): this improved cost-performance and power-performance by eliminating chip crossings within the processor (single chip) and between the processor and the cache (on chip).
Performance Trends i.e.
Bandwidth (throughput) over Latency (response time):
Bandwidth: Total amount of work done in a given time, such as
megabytes per second for a disk transfer.
Latency/Response time: Time between the start and the
completion of an event, such as milliseconds for a disk access.
Bandwidth improves much more rapidly than latency i.e.
Bandwidth has outpaced latency across the technologies and will likely
continue to do so. E.g.
i.Microprocessors and networks have seen the greatest performance
gains: 1000–2000X in bandwidth and 20–40X in latency.
ii.Memory and disks have seen smaller performance gains: 120–140X in bandwidth and 4–8X in latency.
However, capacity is generally more important than performance for memory and disks, and hence capacity has improved more than performance.
Log-log plot of bandwidth and latency milestones
(Relative improvement in bandwidth and latency for technology milestones for
microprocessors, memory, networks, and disks)
A simple rule of thumb:
Bandwidth grows by at least the square of the improvement in latency.
Computer designers should make plans accordingly.
Scaling of Transistor Performance and Wires
Feature size: is the characteristic of Integrated circuit process. It is the minimum
size of a transistor or a wire in either the x or y dimension.
1. Feature size Vs. Transistor Density:
Density of transistors (i.e. transistor count per square millimeter) increases
quadratically with a linear decrease in feature size. Why?
Because it is determined by the surface area of a transistor.
Feature sizes have decreased from 10 microns in 1971 to 0.09 microns (or 90
nanometers )in 2006

2. Feature size Vs. Transistor Performance:


Complex to define performance. However, to a first approximation, transistor
performance improves linearly with decreasing feature size.
As feature sizes shrink, devices shrink quadratically in the horizontal dimension
and also shrink in the vertical dimension. The shrink in the vertical dimension
requires a reduction in operating voltage to maintain correct operation and
reliability of the transistors.
1 and 2 => transistor count improves quadratically with a linear
improvement in transistor performance.

In the early days of microprocessors, the higher rate of


improvement in density was used to move quickly from 4-bit, to 8-
bit, to 16-bit, to 32-bit microprocessors.

More recently, density improvements have supported the


introduction of 64-bit microprocessors as well as many of the
innovations in pipelining and caches
In general, wire delay (metric of performance of wires) scales
poorly with decrease in feature size
[when compared to transistor performance]

Wires in an integrated circuit do not improve in performance with


decreased feature size !
•As feature size shrinks, wires get shorter, but the resistance and
capacitance per unit length get worse

•the signal delay for a wire increases in proportion to the product


of its resistance and capacitance.

Wire delay has become a more critical design limitation than transistor switching delay for large integrated circuits!
Trends in Power in Integrated Circuits
What are the challenges the power provides as devices are scaled?
i.Bringing in and distributing the power around the chip: e.g. Modern
microprocessors use hundreds of pins and multiple interconnect layers for just
power and ground.
ii.Removal of heat dissipated by the power.
iii.Preventing hot spots.
Different types of power distribution issues:
i.Dynamic power: energy consumed in switching transistors in CMOS chips. It is proportional to the product of the load capacitance of the transistor, the square of the voltage, and the frequency of switching, with watts being the unit:
Power_dynamic = 1/2 × Capacitive load × Voltage² × Frequency switched

ii.Energy: mobile devices care about battery life more than power, so energy is the proper metric, measured in joules:
Energy_dynamic = Capacitive load × Voltage²
(Both relations are sketched in code just after this list.)
iii. Static power: caused by leakage current, which flows even when a transistor is off. Designs increasingly gate the voltage to inactive modules (in which transistors are off) to control the loss due to leakage current.

i.e. Increasing the number of transistors increases power even if they are turned off, and leakage current increases in processors with smaller transistor sizes.
As a result, even very low power systems gate the voltage to inactive modules; in high-performance designs the leakage can be significant (i.e. > 25% of the total power consumption).
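A minimal sketch of the dynamic power and energy relations above (illustrative Python; the capacitance, voltage, and frequency values are made-up assumptions, not figures from the text):

```python
def dynamic_power(cap_load_farads, voltage_volts, freq_hz):
    """Dynamic power in watts: 1/2 x C x V^2 x f."""
    return 0.5 * cap_load_farads * voltage_volts ** 2 * freq_hz

def dynamic_energy(cap_load_farads, voltage_volts):
    """Energy per transition in joules: C x V^2 (independent of frequency)."""
    return cap_load_farads * voltage_volts ** 2

# Hypothetical values: 1 nF of switched capacitance, 1.2 V supply, 3 GHz clock
print(dynamic_power(1e-9, 1.2, 3e9))  # ~2.16 W
print(dynamic_energy(1e-9, 1.2))      # ~1.44e-9 J per transition
```

Note how slowing the clock lowers power but leaves the energy per transition unchanged, matching the discussion that follows.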
PROCESS OF REDUCING DYNAMIC POWER
Varying the different parameters of dynamic power:

• Lowering the voltage:
Greatly reduces dynamic power and energy, so voltages have dropped from 5V to just over 1V in 20 years.

• Decreasing the capacitive load:
It is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of the wires and the transistors.

• Slowing the clock rate/frequency:
For a fixed task, this reduces power, but not energy.
Despite the PROCESS OF REDUCING DYNAMIC POWER,
power/energy consumption is more ! WHY?
Reason:
Increase in the number of transistors switching, and the
frequency with which they switch, dominates the decrease in
load capacitance and voltage, leading to an overall growth in
power consumption and energy.

E.g. The first microprocessors consumed tenths of a watt, while


a 3.2 GHz Pentium 4 Extreme Edition consumes 135 watts.
Removal of heat

Power dissipated as heat is now the major limitation to using transistors.
REMEDY:
i.Air cooling: power dissipated as heat beyond certain limits cannot be removed by air cooling alone.
ii.Temperature diodes: they reduce activity automatically if the chip (say, a microprocessor) gets too hot, e.g. by reducing the voltage and clock frequency or the instruction issue rate.
iii.Turning off the clock of inactive modules in processors: saves energy and dynamic power, e.g. if no floating-point instructions are executing, the clock of the floating-point unit is disabled.
Preventing hot spots

It is better to explore multiple processors on a chip running at lower voltages and clock rates.
Trends in Cost
Cost-sensitive designs

Major theme in the computer industry:


Use of technology improvements to lower cost, as well as
increase performance.
Cost-sensitive designs are of growing significance although in
some computer designs (say supercomputers) costs are less
important.

Understanding of cost and its factors is


essential for designers. Why ?
To make intelligent decisions about whether or not a new
feature should be included in designs where cost is an issue.
Major factors that influence the cost of a computer that
change over time

I. Time
II. Volume
III. Commodification
IMPACT OF TIME OVER COST
The cost of a manufactured computer component decreases over time
even without major improvements in the basic implementation technology.

The underlying principle that drives costs down is the learning curve—
i.The more times a task has been performed, the less time will be required on
each subsequent iteration.
ii.As the quantity of items produced doubles, costs decrease at a predictable
rate.
iii.Manufacturing costs decrease over time.

Yield—the percentage of manufactured devices that survives the testing


procedure during manufacturing.
Processor price trends for Intel microprocessors.

E.g.

Price per megabyte of DRAM has dropped over the long term by 40% per year. Since
DRAMs tend to be priced in close relationship to cost—with the exception of periods
when there is a shortage or an oversupply—price and cost of DRAM track closely.

Microprocessor prices also drop over time, but because they are less standardized
than DRAMs, the relationship between price and cost is more complex. In a period of
significant competition, price tends to track cost closely, although microprocessor
vendors probably rarely sell at a loss.
IMPACT OF VOLUME OVER COST

Volume determines cost.


Increasing volumes affect cost in several ways.

i.Volume is proportional to the number of systems (or chips) manufactured. It


decreases the time needed to get down the learning curve.

ii.Volume increases purchasing and manufacturing efficiency.


As a rule of thumb, some designers have estimated that cost decreases about
10% for each doubling of volume.

iii.Volume decreases the amount of development cost thus allowing cost and
selling price to be closer.
IMPACT OF COMMODITIES OVER COST
 What are commodities? Commodities are products that are sold by
multiple vendors in large volumes and are essentially identical.

E.g. All the products sold on the shelves of grocery stores are commodities, as
are standard DRAMs, disks, monitors, and keyboards.

 How commodity business of computers reduces the cost?


Business is highly competitive: Because many vendors ship virtually
identical products. This competition decreases the
i. Cost : Because commodity market has a clear product definition.
ii. Gap between cost and selling price : Because commodity market has
volume.

Thus the commodity market allows multiple suppliers to compete in building


components for the commodity product.
Thus the low end of the computer business has achieved better price-
performance than other sectors and yielded greater growth at the low
end, although with very limited profits (as is typical in any commodity
business).
Cost of an Integrated Circuit or Chip

Why to study cost of IC? Because they

i.Occupy greater portion of the cost.


ii.Vary between computers.

Thus, computer designers must understand the costs of chips to understand


the costs of current computers.

Although the costs of integrated circuits have dropped exponentially, the


basic process of silicon manufacture is unchanged: A wafer is still tested and
chopped into dies that are packaged.
• The cost of a packaged integrated circuit (cost equation) is:
Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
• The cost of a die is:
Cost of die = Cost of wafer / (Dies per wafer × Die yield)

• The number of dies per wafer is approximately the area of the wafer divided by the area of the die. It can be more accurately estimated by:
Dies per wafer = [π × (Wafer diameter / 2)²] / Die area − [π × Wafer diameter] / sqrt(2 × Die area)

The first term is the ratio of wafer area (πr²) to die area. The second compensates for the “square peg in a round hole” problem—rectangular dies near the periphery of round wafers. Dividing the circumference (πd) by the diagonal of a square die gives approximately the number of dies along the edge.
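A small numeric sketch of the two cost formulas above (illustrative Python; the wafer cost, wafer diameter, die area, and die yield below are hypothetical inputs, not values from the text):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """First term: wafer area / die area; second term subtracts the partial
    dies lost around the circular edge of the wafer."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = (math.pi * wafer_diameter_cm) / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss)

def cost_of_die(wafer_cost, wafer_diameter_cm, die_area_cm2, die_yield):
    """Cost of die = cost of wafer / (dies per wafer x die yield)."""
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield)

# Hypothetical example: $5000 wafer, 30 cm diameter, 1.5 cm^2 die, 50% die yield
print(dies_per_wafer(30, 1.5))          # ~416 dies per wafer
print(cost_of_die(5000, 30, 1.5, 0.5))  # ~$24 per good die
```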

What is the fraction of good dies on a wafer, or the die yield?
A simple model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, leads to the following:

Die yield = Wafer yield × (1 + Defects per unit area × Die area / α)^(−α)

The formula is an empirical model developed by looking at the yield of many manufacturing lines.

Wafer yield: accounts for wafers that are completely bad and so need not be
tested. For simplicity, wafer yield is assumed to be 100%.

Defects per unit area: a measure of the random manufacturing defects that occur. In 2006 this value was typically 0.4 defects per square centimeter for a 90 nm process, as it depends on the maturity of the process.

α: It is a parameter that corresponds roughly to the number of critical


masking levels, a measure of manufacturing complexity. For multilevel metal
CMOS processes in 2006, a good estimate is α = 4.0.
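A sketch of this yield model (illustrative Python; the 0.4 defects/cm² and α = 4.0 come from the text, while the 3 cm² die area is a made-up input):

```python
def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    """Empirical model: wafer_yield x (1 + defect_density x die_area / alpha)^(-alpha)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

# 90 nm process assumptions from the text applied to a hypothetical 3 cm^2 die
print(die_yield(0.4, 3.0))  # ~0.35, i.e. roughly a third of such dies are good
```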
Dependability
Integrated circuits are one of the most reliable components of a computer.

However, decrease in feature sizes to 65 nm and smaller, leads to both transient


faults and permanent faults.

So architects must design systems to cope with these challenges.


Architectural view to find ways to build dependable computers:
•Computers are designed and constructed at different layers of abstraction. We can descend recursively through a computer, seeing components enlarge into full subsystems, until we reach individual transistors.

•Although some faults are widespread, like the loss of power, many can be
limited to a single component in a module. Thus, utter failure of a module at one
level may be considered merely a component error in a higher-level module.
When a system is operating properly?

Difficult to decide as it is a philosophical point.

However, now it became concrete with the popularity of Internet services.

Can be decided with an SLA (Service Level Agreements):

System providers offer Service Level Agreements (SLA) [or Service Level
Objectives (SLO)] to guarantee their service would be dependable.

E.g. the provider would pay the customer a penalty if the service was unavailable for more than some agreed number of hours per month.

Thus, an SLA could be used to decide whether the system was up or down.
When a system is operating properly?

Systems alternate between two states of service with respect to an SLA:

1. Service accomplishment, where the service is delivered as specified.

2. Service interruption, where the delivered service is different from the


SLA. It is measured as mean time to repair (MTTR).

Transitions between these two states are caused by failures.


(from state 1 to state 2) or restorations (2 to 1).

Quantifying these transitions leads to the two main measures of


dependability.

– reliability
– availability
Two main measures (quantitative metrics) of dependability:

• Module reliability – A measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. It has two measures:
1.Mean time to failure (MTTF): the reciprocal of MTTF is a rate of failures, generally reported as failures per billion hours of operation, or FIT (for failures in time). Thus, an MTTF of 1,000,000 hours equals 10⁹/10⁶ or 1000 FIT.
2.Mean time between failures (MTBF): simply the sum MTTF + MTTR. It is widely used.

• Module availability – A measure of the service accomplishment with respect to the alternation between the two states of accomplishment and interruption. For non-redundant systems with repair,
Module availability = MTTF / (MTTF + MTTR)
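A small sketch of these metrics (illustrative Python; the 24-hour MTTR is a made-up assumption):

```python
def fit_rate(mttf_hours):
    """Failures in time (FIT): failures per billion (10^9) hours of operation."""
    return 1e9 / mttf_hours

def availability(mttf_hours, mttr_hours):
    """Availability of a non-redundant module with repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

print(fit_rate(1_000_000))          # 1000 FIT, matching the MTTF example above
print(availability(1_000_000, 24))  # ~0.999976, assuming a 24-hour repair time
```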
Reliability of a system can be estimated quantitatively with some
assumptions

About the reliability of components and


That failures are independent.

How to cope with failure? The primary way is redundancy, either

In time (repeat the operation to see if it still is erroneous) or

In resources (have other components to take over from the one that
failed).

Once the component is replaced and the system fully repaired, the
dependability of the system is assumed to be as good as new.
Example:

Assume a disk subsystem with the following components and MTTF:


o 10 disks, each rated at 1,000,000-hour MTTF
o 1 SCSI controller, 500,000-hour MTTF
o 1 power supply, 200,000-hour MTTF
o 1 fan, 200,000-hour MTTF
o 1 SCSI cable, 1,000,000-hour MTTF
Using the simplifying assumptions that the lifetimes are exponentially
distributed and that failures are independent, compute the MTTF of the
system as a whole.
Answer:

The sum of the failure rates is:
Failure rate of the system = 10 × 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
= (10 + 2 + 5 + 5 + 1) / 1,000,000 = 23 / 1,000,000 failures per hour, i.e. 23,000 FIT.
MTTF of the system = 1 / Failure rate of the system = 1,000,000,000 / 23,000 ≈ 43,500 hours (just under 5 years).

Disk subsystems often have redundant power supplies to
improve dependability.

Using the components and MTTFs from earlier example,


calculate the reliability of a redundant power supply.

Assume
One power supply is sufficient to run the disk subsystem and
One redundant power supply is added.

To simplify the calculations, it is assumed that the lifetimes of


the components are exponentially distributed and that there is
no dependency between the component failures.
The MTTF for the redundant power supplies is the mean time until one power supply fails divided by the chance that the other will fail before the first one is replaced.
Thus, if the chance of a second failure before repair is small, the MTTF of the pair is large.
With two power supplies and independent failures, the mean time until one power supply fails is MTTF power supply / 2.

A good approximation of the probability of a second failure is MTTR over the mean time until the other power supply fails. Hence, a reasonable approximation for a redundant pair of power supplies is:
MTTF power supply pair = (MTTF power supply / 2) / (MTTR / MTTF power supply) = MTTF power supply² / (2 × MTTR)
Using the MTTF numbers above and assuming that it takes on average 24 hours for a human operator to notice that a power supply has failed and replace it, the reliability of the fault-tolerant pair of power supplies is:
MTTF power supply pair = 200,000² / (2 × 24) ≈ 830,000,000 hours

making the pair about 4150 times more reliable than a single power supply.
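The same approximation as a short sketch (illustrative Python, using the 200,000-hour power supply MTTF and the assumed 24-hour MTTR from the example):

```python
def mttf_redundant_pair(mttf_hours, mttr_hours):
    """MTTF of a redundant pair: (mttf / 2) / (mttr / mttf) = mttf^2 / (2 x mttr)."""
    return mttf_hours ** 2 / (2 * mttr_hours)

pair = mttf_redundant_pair(200_000, 24)
print(pair)            # ~8.3e8 hours for the redundant pair
print(pair / 200_000)  # ~4167x more reliable than a single supply
```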
Measuring, Reporting, and
Summarizing Performance
Performance measures/ metrics:
i.Response time (or execution time): Time between the start and
the completion of an event. – Design must reduce this time.
Performance is reciprocal of response time.
E.g. The user of a desktop computer may say a computer
is faster when a program runs in less time.

ii.Throughput: Total amount of work done in a given time.


Design must increase throughput.
E.g. Amazon.com administrator may say a computer is
faster when it completes more transactions per hour.
Why to relate the performance of two different computers ?
To compare design alternatives.
Relating the performance using response time:
The phrase “X is faster than Y” is used here to mean that the response time (or execution time) is lower on X than on Y for the given task. In particular, “X is n times faster than Y” will mean:
n = Execution time Y / Execution time X

Since execution time is the reciprocal of performance, the following relationship holds:
n = Execution time Y / Execution time X = (1 / Performance Y) / (1 / Performance X) = Performance X / Performance Y
Relating the performance using throughput:
The phrase “the throughput of X is 1.3 times higher than Y” signifies here that the number of tasks completed per unit time on computer X is 1.3 times the number of tasks completed on Y.
iii.CPU time:

Need for defining CPU time:


Execution time can be defined in different ways depending on context. The most
straightforward definition of time is called wall-clock time, response time, or elapsed
time, which is the latency to complete a task, including disk accesses, memory
accesses, input/output activities, operating system overhead — everything.

With multiprogramming, the processor works on another program while waiting for
I/O and may not necessarily minimize the elapsed time of one program. Hence, a term
is needed to consider this activity. CPU time recognizes this distinction.

Definition:
CPU time is the time the processor is computing, not including the time
waiting for I/O or running other programs. (Clearly, the response time seen
by the user is the elapsed time of the program, not the CPU time.)
Computer users who routinely run the same programs would be the perfect candidates to evaluate a new computer.

To evaluate a new system such users would simply compare the execution time of their workloads—the mixture of programs and operating system commands that users run on a computer.

Few users are in this situation, however. Most must rely on other methods to evaluate computers, hoping that these methods will predict performance for their usage of the new computer. This reliance has led to misleading claims or even mistakes in computer design.

The most common of these other methods is benchmarks.


Benchmarks:

Standard programs used to measure performance.
Two types:
• Real applications, e.g. compilers.
• Simplified stand-in programs:
 Kernels, which are small, key pieces of real applications;
 Toy programs, which are 100-line programs from beginning programming assignments, such as quick-sort;
 Synthetic benchmarks, which are fake programs invented to try to match the profile and behavior of real applications, e.g. Dhrystone.

The Dhrystone benchmark contains no floating-point operations, thus the name is a pun on the then-popular Whetstone benchmark for floating-point operations. The output from the benchmark is the number of Dhrystones per second (the number of iterations of the main code loop per second).
Drawbacks of using benchmarks:

• Attempts at running programs that are much simpler than a real application have led to performance pitfalls.

• The compiler writer and architect can conspire to make the computer appear faster on these stand-in programs than on real applications.

• Another issue is the conditions under which the benchmarks are run.
Quantitative Principles of Computer Design

• Scalability: The ability of expanding memory and the number


of processors and disks is called scalability.

• Principles of locality: Programs tend to reuse data and


instructions they have used recently.

• Two different types of locality: temporal locality (states that recently accessed items are likely to be accessed in the near future) and spatial locality (says that items whose addresses are near one another tend to be referenced close together in time).
Amdahl’s Law
• Amdahl’s law states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction of the time the
faster mode can be used.
• It defines the speed up that can be gained by using a particular feature.

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

Alternatively,

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible

Speed up tells us how much faster a task will run using the computer
with the enhancement as opposed to the original computer.
Two factors of speed up enhancement
• Fraction enhanced: the fraction of the computation time in the original computer that can be converted to take advantage of the enhancement.
For example: if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value is always less than or equal to 1.
• Speedup enhanced: the improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program.
For example: if the enhanced mode takes 2 seconds for a portion of the program that takes 5 seconds in the original mode, the improvement is 5/2. This value is always greater than 1.
Calculation of Execution time and Speedup Overall

• The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time new = Execution time old × [ (1 – Fraction enhanced) + Fraction enhanced / Speedup enhanced ]

• The overall speedup is the ratio of the execution times:

Speedup overall = Execution time old / Execution time new = 1 / [ (1 – Fraction enhanced) + Fraction enhanced / Speedup enhanced ]
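A compact sketch of Amdahl's law (illustrative Python; the sample values happen to match Example 1 below, where 40% of the time can be sped up by a factor of 10):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the task can use the faster mode."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))    # ~1.56
# Even an unbounded speedup of that 40% is limited by the remaining 60%
print(amdahl_speedup(0.4, 1e12))  # ~1.67
```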

Example1
Suppose that we want to enhance the processor used for Web serving. The new processor
is 10 times faster on computation in the Web serving application than the original
processor. Assuming that the original processor is busy with computation 40% of time and
is waiting for I/O 60% of the time. What is the overall speedup gained by incorporating the
enhancement?
Guess ?!?!
• Hint:
• The overall speedup is the ratio of the execution times:
Speedup overall = Execution time old / Execution time new = 1 / [ (1 – Fraction enhanced) + Fraction enhanced / Speedup enhanced ]
• Fraction enhanced = 0.4; Speedup enhanced = 10;
Speedup overall = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 = 1.56
Example 2
• Suppose that a given architecture does not have hardware support for multiplication,
so multiplications have to be done through repeated addition. If it takes 200 cycles to
perform a multiplication in software, and 4 cycles to perform a multiplication in
hardware, what is the overall speedup from hardware support for multiplication if a
program spends 10% of its time doing multiplications? What about a program that
spends 40% of its time doing multiplications?

• Guess ? ! ? ! ?
Answer:
• In both cases, the speedup when the multiplication hardware is used is 200 / 4 = 50 (the ratio of the time to do a multiplication without the hardware to the time with the hardware). In the case where the program spends 10% of its time doing multiplications, Fraction enhanced = 0.1; i.e., 1 – Fraction enhanced = 0.9.
• By Amdahl’s law, we get,
Speedup Overall = 1 / [ 0.9 + ( 0.1 / 50 ) ] = 1.11
• If the program spends 40% of its time doing multiplications before the addition of
hardware multiplication, then Fraction Enhanced is 0.4;

Hence 1–Fraction Enhanced is 0.6; We get, Speedup = 1 / [ 0.6 + (0.4 / 50) ] = 1.64
Example 3
• If the 1998 version of a computer executes a program in 200 sec and the 2000 version of the computer executes the same program in 150 sec, what is the speedup that the manufacturer has achieved over the two-year period?

• Hint :
Execution Time old
Speedup = -------------------------------
Execution Time new
• Speedup = Execution Time old / Execution Time new = 200 / 150 = 1.33
Example 4
• To achieve a speedup of 3 on a program that originally took 78 sec to execute, to what must the execution time of the program be reduced?
Given Data: Speedup = 3, Execution Time Old = 78 sec

Speedup = (Execution Time Old) / (Execution Time new)

Execution Time New = 26 Sec.


Processor Performance Equation
• All computers are constructed using a clock running at a constant rate.

• These discrete time events are called ticks, clock ticks, clock periods,
clocks, or cycles.

• Time of a clock period is referred by its duration ( ex. 1ns) or by its rate (1
GHz).

• CPU time can be expressed in two ways:

CPU Time = CPU Clock cycles for a program x Clock cycle time

OR

CPU Time = CPU Clock cycles for a program / Clock rate
Clock cycles Per Instruction (CPI)
• In addition to the clock cycles needed to execute a program, we can also count
the number of instructions executed.

• This is also known as instruction path length or Instruction count (IC).

• If we know the number of clock cycles and the instruction count, we can
calculate Clock cycles Per Instructions (CPI). Instruction Per Clock (IPC) is
inverse of CPI.
CPI = CPU Clock cycles for a program / Instruction Count

The total number of clock cycles for a program can be defined as IC x CPI.
Hence CPU Time = Instruction Count x Cycles Per Instruction x Clock Cycle Time
i.e. CPU Time = (Instructions / Program) x (Clock Cycles / Instruction) x (Seconds / Clock Cycle) = Seconds / Program
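A small sketch of the CPU time equation (illustrative Python; the instruction count, CPI, and clock rate are made-up values):

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI x clock cycle time, where cycle time = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1e9 instructions, CPI of 1.5, 2 GHz clock
print(cpu_time_seconds(1_000_000_000, 1.5, 2e9))  # 0.75 seconds
```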
Calculating CPU Clock Cycles

• CPU Clock cycles = Σ (i = 1 to n) ICi x CPIi

where ICi represents the number of times instruction i is executed in a program and CPIi represents the average number of clocks per instruction for instruction i.

• Hence CPU Time is:

CPU Time = [ Σ (i = 1 to n) ICi x CPIi ] x Clock Cycle Time

• Overall CPI is:

CPI = [ Σ (i = 1 to n) ICi x CPIi ] / Instruction Count = Σ (i = 1 to n) (ICi / Instruction Count) x CPIi
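A sketch of the weighted-CPI calculation over instruction classes (illustrative Python; the instruction mix below is hypothetical):

```python
def overall_cpi(mix):
    """mix: list of (IC_i, CPI_i) pairs, one per instruction class."""
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    total_instructions = sum(ic for ic, _ in mix)
    return total_cycles / total_instructions

# Hypothetical mix: 50M ALU ops at CPI 1, 30M loads/stores at CPI 2, 20M branches at CPI 3
print(overall_cpi([(50_000_000, 1), (30_000_000, 2), (20_000_000, 3)]))  # 1.7
```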
Example1
• Suppose we have made the following
measurements:
Frequency of FP operations = 25 %
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2 %
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation.
Answer

CPI Original = Σ (CPIi x (ICi / Instruction count)) = (4 x 25%) + (1.33 x 75%) = 2.0

CPI with new FPSQR = CPI Original – 2% x (CPI Old FPSQR – CPI New FPSQR only) = 2.0 – 2% x (20 – 2) = 1.64

CPI New FP = (2.5 x 25%) + (1.33 x 75%) = 1.625

Speedup New FP = (CPU Time Original) / (CPU Time New FP)
= (IC x Clock Cycle x CPI Original) / (IC x Clock Cycle x CPI New FP)
= (CPI Original) / (CPI New FP) = 2.0 / 1.625 = 1.23

Improving the performance of the FP operations overall is slightly better because of the higher frequency.


Example 2
• When run on a given system, a program takes 1,000,000 cycles. If the system achieves a CPI of 40, how many instructions were executed in running the program?
• Ans: CPI = # Cycles / # Instructions
Instructions = # Cycles / CPI = 1,000,000 cycles / 40 = 25000.
Hence 25000 instructions were executed in running the program.
Example 3
• What is the IPC of a program that executes
35,000 instructions and requires 17,000 cycles
to complete ?

IPC = # Instructions / # Cycles = 35,000 / 17,000 ≈ 2.06
End of Chapter – 1
