Advanced Computer Architecture
ECE 6373
Pauline Markenscoff
N320 Engineering Building 1
E-mail: [email protected]
Introduction
Improvements in Computer Performance
• Advances in technology
• Innovations in computer architecture
Introduction
1945-1970:
• Both forces contributed to performance improvements (25% to 30% per year).
Early 1970s:
• Emergence of microprocessors.
• For minicomputers and mainframes, performance improvements were due mainly to improvements in technology (35% per year).
Introduction
Early 1980s: A new set of architectures
Introduction
• RISC (Reduced Instruction Set Computer) architectures
• Instruction Level Parallelism
- Pipelining
- Multiple Instruction Issue
• Cache Organizations
- Simple
- More Sophisticated
Mid-80s:
• Higher performance growth rates (over 50%/year)
The highest-performance microprocessors outperform the supercomputers of less than 10 years ago.
Dominance of
microprocessor-based computers
• Minicomputers have been replaced by servers made
using microprocessors.
Older architectures, such as the x86 (or IA-32), adopted many of the innovations of the RISC designs.
From the mid-80s to 2002
• A renaissance in computer design (performance improvements of 52% per year) based on
- Both architectural innovation and
- Efficient use of technology improvements.
Since 2002, processor performance improvement has dropped to about 20% per year.
In 2004 Intel canceled its high-performance uniprocessor project.
The road to higher performance would be via
• Multiple processors per chip
and not via
• Faster uniprocessors
Shift from relying solely on instruction-level parallelism (ILP) to thread-level parallelism (TLP) and data-level parallelism (DLP).
Instruction Level Parallelism (ILP)
• Exploited implicitly by the compiler and hardware (no need for the programmer's attention).
Growth in processor performance
Fig. 1.1
The Changing Face of Computing
1960s: Mainframes
1970s: Minicomputers and supercomputers
Three classes of computing systems
Desktop Computers
Servers
Embedded Computers
Characteristics of the three
computing classes
Fig. 1.2
Desktop Computers
• Personal Computers
• Workstations
Servers
Provide large-scale and reliable file and computing services.
Servers
Characteristics:
• Dependability
• Scalability of memory, storage, and I/O bandwidth
• Responsiveness
Supercomputers
Cost tens of millions of dollars.
Emphasize floating-point performance.
Availability vs. Reliability
Cost of unavailability
Fig. 1.3
Embedded Computers
Embedded computers are either
• user programmable, or
• devices where the only programming occurs with the initial loading of the application.
Embedded Computers
What is a real-time performance requirement?
Hard real-time:
• An absolute maximum execution time is allowed for a segment of the application.
Soft real-time:
• The average time of a particular task is constrained, as well as the number of instances when the maximum time is exceeded.
Embedded Computers
Other requirements for some embedded computers:
• Minimize memory
- Sometimes memory is entirely on the processor chip; other times it is in a small off-chip memory.
- Emphasis is on code size (data size is dictated by the application).
• Minimize power
- (Use of batteries, less expensive packaging, absence of a cooling fan)
Embedded Computers
Approaches for the design of embedded systems:
Levels of Computer Design
Instruction Set Architecture (ISA)
The programmer-visible instruction set.
Class of ISA
General-purpose register architectures
• Operands are either registers or memory locations.
• Register-memory architectures
- The 80x86 has 16 general-purpose registers and 16 floating-point registers.
• Load-store architectures
- MIPS has 32 general-purpose registers and 32 floating-point registers.
Memory Addressing
To access memory operands:
• Byte addressing (virtually all desktop and server computers)
Memory Alignment
• MIPS requires that operands be aligned; the 80x86 does not require alignment, but accesses are faster if operands are aligned.
Addressing Modes
MIPS
• Register
• Immediate (constants)
• Displacement (a constant offset is added to a register to form the memory address)
80x86
• 3 variations for Displacement
- No register (absolute)
- Two registers (based indexed with displacement)
- Two registers where one register is multiplied by the size of the operand in bytes (based
with scaled index and displacement)
• 3 variations without Displacement
- Register indirect
- Indexed
- Based with scaled index
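To make the displacement mode concrete, here is a minimal Python sketch (the register and memory contents are hypothetical, for illustration only):

# Hypothetical register file and memory, modeling the addressing modes above.
regs = {"r1": 0x1000}
mem = {0x1000: 7, 0x1008: 42}

def register_mode(reg):
    # Register mode: the operand is the register itself.
    return regs[reg]

def immediate_mode(const):
    # Immediate mode: the operand is a constant encoded in the instruction.
    return const

def displacement_mode(base_reg, disp):
    # Displacement mode: effective address = register contents + constant offset.
    return mem[regs[base_reg] + disp]

print(displacement_mode("r1", 8))  # loads mem[0x1000 + 8] -> 42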
Types and Sizes of Operands
MIPS and 80x86 support
• 8-bit (ASCII character)
• 16-bit (Unicode character or half word)
• 32-bit (integer or word)
• 64-bit (double word or long integer)
• IEEE 754 floating point
- 32-bit (single precision)
- 64-bit (double precision)
• The 80x86 also supports 80-bit floating point (extended double precision)
Operations
• Data transfer
• Arithmetic-Logical
• Floating Point
• Control
Control Flow Instructions
All ISAs (including MIPS and 80x86) provide support for
• Conditional branches
• Unconditional jumps
• Procedure calls and returns
PC-relative addressing
• The branch address is specified by an address field that is added to the PC.
MIPS conditional branches test the contents of registers.
80x86 branches test condition code bits set as side effects of previous arithmetic/logic operations.
The MIPS procedure call (JAL) places the return address in a register.
The 80x86 procedure call (CALLF) places the return address on a stack in memory.
Encoding an ISA
Fixed length versus variable length
• Fixed-length encoding simplifies decoding of instructions.
• Variable-length instructions take less space than fixed-length instructions.
Subset of the
Instructions in
MIPS64
Fig. 1.5
MIPS64 instruction set architecture format
Fig. 1.6
Functional Organization
• Memory System
• Memory Interconnect
• Design of CPU
Two machines can have
• the same instruction set but
• different functional organizations.
- AMD Opteron 64 and Intel Pentium 4
- The embedded processors NEC VR 5432 and NEC VR 4122
The task of the computer designer
Optimize design
• Maximize performance while meeting cost, power and
availability constraints.
• Requires familiarity with a wide range of technologies, from
compilers and OS to logic design and packaging.
Summary of some of the most important
functional requirements
Fig. 1.7
A computer designer must follow
• Technology trends
• Cost trends
Technology trends
A successful architecture must be designed to survive rapid changes in technology.
• Example: the core of the IBM mainframe architecture, in use for more than 40 years.
Technology trends
Implementation technologies
• Semiconductor DRAM
• Magnetic Disk
• Network
Technology trends
Integrated Circuit Technology:
Technology trends
Semiconductor DRAM
Technology trends
Network technology
Technology trends
Technology Thresholds
• Although technology improves fairly continuously, the
impact of the technology improvements can be seen in
discrete steps
• Example:
- When MOS technology reached the point of 25,000-50,000
transistors on a chip in the early 1980s, it became possible to
build a 32-bit microprocessor.
- By the late 1980s, first level caches could go on the chip.
Performance Trends:
Bandwidth over Latency
Bandwidth or throughput
• Total amount of work done in a given time
- Ex: megabytes per second for a disk transfer
Latency or response time
• Time between the start and the completion of an event
- Ex: milliseconds for a disk access
Log-Log Plot of Bandwidth and Latency (Fig. 1.8, Fig. 1.9)
Rule of thumb: Bandwidth grows by at least the square of the improvement in latency.
Scaling of Transistor Performance
Feature size: the minimum size of a transistor or a wire in either the x or y dimension.
• Feature sizes decreased from 10 microns in 1971 to 0.09 microns (90 nanometers) in 2006.
• 65-nanometer processes are underway.
Scaling of Transistor Performance
As feature sizes shrink, devices shrink quadratically in the horizontal dimension and also shrink in the vertical dimension.
Scaling of Transistor Performance
Transistors generally improve in performance with decreased feature size.
As feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse.
Resistance and capacitance depend on
• Detailed aspects of the process
• Geometry of a wire
• Loading on a wire
• Adjacency to other structures
Scaling of Transistor Performance
Wire delay has become a major design limitation for large ICs.
Scaling of Transistor Performance
In 2001 the Pentium 4
• allocated 2 stages of its 20+ stage pipeline just for propagating signals across the chip.
Trends in Power in Integrated Circuits
Mobile devices care more about battery life than power, so energy (measured in joules) is the proper metric:

Energy_dynamic = Capacitive load × Voltage²

Power_dynamic = 1/2 × Capacitive load × Voltage² × Frequency switched

Dynamic power and energy are greatly reduced by lowering the voltage.
• Voltages have dropped from 5V to just over 1V in 20 years.
Example
Some microprocessors today are designed to have adjustable voltage.
• A 15% reduction in voltage may result in a 15% reduction in frequency.
What would the impact be on dynamic power?
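A worked answer, using the dynamic power formula (voltage and frequency both scale by 0.85):

Power_new / Power_old = ((0.85 × Voltage)² × (0.85 × Frequency switched)) / (Voltage² × Frequency switched) = 0.85³ ≈ 0.61

Dynamic power drops to about 61% of its original value, a reduction of roughly 40%.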
Trends in Power in Integrated Circuits
Power_dynamic = 1/2 × Capacitive load × Voltage² × Frequency switched
Trends in Power in Integrated Circuits
• The first microprocessor consumed tenths of a watt.
• A 3.2 GHz Pentium 4 Extreme Edition consumes 135 watts.
Static Power:
• Due to leakage current, which flows even when a transistor is off.

Power_static = Current_static × Voltage
Goal for static power:
• 25% of the total power consumption
Trends in Cost
Supercomputers
• Designed for performance; cost tends to be less important.
The impact of Time, Volume and
Commodification on Cost
The impact of Time on Cost
Learning curve:
Manufacturing costs decrease over time, even without major improvements in the basic implementation technology, because of increases in yield.
The impact of Time on Cost
• Microprocessor prices drop over time, but because microprocessors are less standardized than DRAMs, the relationship between price and cost is more complex.
Fig. 1.10
The price of an Intel Pentium 4 and Pentium M at a given frequency decreases over time as
yield enhancements decrease the cost of a good die and competition forces price reductions.
The impact of Volume on Cost
Volume: A second key factor in determining cost.
The impact of Commodification on Cost
In a commodity market, cost decreases because of
• Volume
• A clear product definition that allows multiple suppliers to compete.
Cost of an Integrated Circuit
IC costs are becoming a greater portion of the cost of the system.
A wafer is tested and chopped into dies that are packaged.
The cost of a packaged IC is:

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Cost of an Integrated Circuit

Cost of die = Cost of wafer / (Dies per wafer × Die yield)
“The square peg in a round hole” problem
• Rectangular dies near the periphery of round wafers are wasted.
Fig. 1.12
Cost of an Integrated Circuit
Example:
Find the number of dies per 300 mm (30cm) wafer for a die that
is 1.5 cm on a side.
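A worked solution, using the standard dies-per-wafer approximation (wafer area divided by die area, minus an edge-loss term for the square-peg-in-a-round-hole problem described above; the formula itself is not on the slide and is quoted from the textbook):

Dies per wafer = (π × (Wafer diameter / 2)²) / Die area − (π × Wafer diameter) / sqrt(2 × Die area)
= (π × 15²) / 2.25 − (π × 30) / sqrt(2 × 2.25) ≈ 314.2 − 44.4 ≈ 270 dies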
Cost of an Integrated Circuit
Die yield:
• What percentage of the dies on a wafer are good?
Empirical model of IC yield:
• It assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process.

Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / a)^(-a)

(The parameter a is a measure of the complexity of the manufacturing process.)
Cost of an Integrated Circuit
Wafer yield: the percentage of good wafers.
• We will assume that wafer yield is 100%.
Defects per unit area:
• In 2006 this value was typically 0.4 defects per cm² for 90 nm technology.
- It depends on the maturity of the process.
Cost of an Integrated Circuit
Assume a defect density of 0.4 per cm² and a = 4.0.
Die yield for dies that are 1.5 cm on a side (2.25 cm²):

Die yield = (1 + (0.4 × 2.25) / 4.0)^(-4) ≈ 0.44

Die yield for dies that are 1 cm on a side (1.00 cm²):

Die yield = (1 + (0.4 × 1.00) / 4.0)^(-4) ≈ 0.68
Cost of an Integrated Circuit
Die size of most 32-bit and 64-bit microprocessors (in a 90 nm technology):
• Between 0.49 cm² and 2.25 cm².
Cost of an Integrated Circuit
Tremendous pressures to lower costs in DRAMs and
SRAMs.
Cost of an Integrated Circuit
Processing a 300 mm (12-inch) diameter wafer in a leading technology with 4 to 6 metal layers cost between $5000 and $6000 in 2006.

Cost of die = Cost of wafer / (Dies per wafer × Die yield)

Assuming a wafer cost of $5,500:
• Cost of a 2.25 cm² die: $46
• Cost of a 1 cm² die: $13
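The cost chain above is easy to reproduce with a short Python sketch (a rough check under the slides' assumptions: 100% wafer yield, a = 4.0, 0.4 defects per cm², and the textbook's dies-per-wafer approximation):

from math import pi, sqrt

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    # Wafer area over die area, minus an edge-loss correction for
    # rectangular dies near the periphery of the round wafer.
    return (pi * (wafer_diam_cm / 2) ** 2) / die_area_cm2 \
        - (pi * wafer_diam_cm) / sqrt(2 * die_area_cm2)

def die_yield(defects_per_cm2, die_area_cm2, a=4.0, wafer_yield=1.0):
    # Empirical yield model from the slides.
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / a) ** -a

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2, defects_per_cm2):
    good_dies = int(dies_per_wafer(wafer_diam_cm, die_area_cm2)) \
        * die_yield(defects_per_cm2, die_area_cm2)
    return wafer_cost / good_dies

print(die_cost(5500, 30, 2.25, 0.4))  # about $46 for the 2.25 cm² die
print(die_cost(5500, 30, 1.00, 0.4))  # about $13 for the 1 cm² die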
Cost of an Integrated Circuit

Cost of die = Cost of wafer / (Dies per wafer × Die yield)

Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / a)^(-a)
Cost of an Integrated Circuit
The manufacturing process dictates
• Wafer cost
• Wafer yield
• Defects per unit area
Cost of an Integrated Circuit
The computer designer controls
• Die area
Hence the designer affects cost through
• What functions are included on or excluded from the die
• The number of I/O pins
Cost of an Integrated Circuit
Cost of high-volume ICs:
• Essentially the variable cost of producing a functional die.
Cost versus Price
The margin between the cost of manufacturing a product and the price at which the product sells has been shrinking.
These margins pay for
• R&D
• Marketing
• Sales
• Manufacturing equipment maintenance
• Building rental
• Cost of financing
• Pretax profits and taxes
Dependability
Integrated circuits were one of the most reliable components of a computer (the error rate inside the chip was very low), but this is changing as we head to feature sizes of 65 nm and smaller.
Dependability
Systems alternate between two states of service with respect to an SLA (service level agreement):
1. Service accomplishment, where the service is delivered as specified.
2. Service interruption, where the delivered service is different from the SLA.
Dependability
Module reliability: a measure of continuous service accomplishment (or, equivalently, of the time to failure).
Dependability
If a collection of modules has exponentially
distributed lifetimes (i.e., the age of a module is not
important in probability of failure), the overall failure
rate of the collection is the sum of the failure rates of
the modules.
Dependability
Module availability: a measure of service accomplishment.
For nonredundant systems with repair:

Module availability = MTTF / (MTTF + MTTR)

(MTTF: mean time to failure; MTTR: mean time to repair)
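A quick numeric illustration (hypothetical values, not from the slide): with MTTF = 10,000 hours and MTTR = 24 hours, Module availability = 10,000 / (10,000 + 24) ≈ 0.9976.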
Dependability
Assumption: failures are independent
Example 1
Assume a disk subsystem with the following components and MTTFs:
• 10 disks, each rated at 1,000,000-hour MTTF
• 1 SCSI controller, 500,000-hour MTTF
• 1 power supply, 200,000-hour MTTF
• 1 fan, 200,000-hour MTTF
• 1 SCSI cable, 1,000,000-hour MTTF
Example 1
The failure rate of the system is the sum of the failure rates of all the components:

Failure rate_system = 10 × (1 / 1,000,000) + 1 / 500,000 + 1 / 200,000 + 1 / 200,000 + 1 / 1,000,000
= (10 + 2 + 5 + 5 + 1) / 1,000,000 = 23 / 1,000,000 failures per hour (23,000 FIT, i.e., failures per billion hours)

System MTTF = 1 / Failure rate_system = 1,000,000 / 23 ≈ 43,500 hours
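The same computation as a short Python check, assuming independent, exponentially distributed lifetimes as stated earlier:

# Component MTTFs in hours: 10 disks, SCSI controller, power supply, fan, cable.
mttf_hours = [1_000_000] * 10 + [500_000, 200_000, 200_000, 1_000_000]
failure_rate = sum(1.0 / m for m in mttf_hours)  # failures per hour
print(failure_rate * 1_000_000)  # -> 23.0 failures per million hours
print(1.0 / failure_rate)        # system MTTF, about 43,478 hours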
Redundancy
Primary way to cope with failure:
• Redundancy in time
- Repeat the operation to see if it is still erroneous
• Redundancy in resources
- Once the component is replaced, the dependability of the system is
assumed to be as good as new.
Example 2
Disk subsystems often have redundant power supplies to improve
dependability.
• Assume one power supply is sufficient to run the disk subsystem and that
we are adding one redundant power supply.
• MTTF for two power supplies is the mean time until one power supply fails
divided by the chance that the other will fail before the first one is replaced.
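Written as a formula (MTTR here denotes the mean time to replace a failed supply; the slide gives no value, so this is only the general form): the mean time until one of the two supplies fails is MTTF_power supply / 2, and the chance the other fails before the first is replaced is MTTR / MTTF_power supply, so

MTTF_power supply pair = (MTTF_power supply / 2) / (MTTR / MTTF_power supply) = MTTF_power supply² / (2 × MTTR)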
Quantitative Approach to Computer Design
Experimentation
Simulation
Measuring, Reporting and
Summarizing Performance
Measuring Performance
A computer user is interested in response time:
• The time between the start and the completion of a job.
Measuring Performance
Computer X is n times faster than Y:

n = Execution time_Y / Execution time_X

Since performance is the reciprocal of execution time,

n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
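As a sanity check, the relation is one line of Python (the execution times are hypothetical):

def times_faster(exec_time_x, exec_time_y):
    # n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
    return exec_time_y / exec_time_x

print(times_faster(10.0, 15.0))  # X is 1.5 times faster than Y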
Measuring Performance
Execution time can be defined in different ways:
• Wall-clock time, response time, or elapsed time
- Latency to complete a task, including disk accesses, memory accesses, input/output activities, and operating system overhead.
- In multiprogramming the processor works on another program while waiting for I/O, which may not minimize the elapsed time of any one program.
• CPU time
- Time the processor spends computing; it does not include time waiting for I/O or running other programs.
• The response time seen by the user is the elapsed time of the program, not the CPU time.
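The wall-clock versus CPU time distinction is easy to see in Python (a small runnable illustration; timings will vary by machine):

import time

start_wall, start_cpu = time.perf_counter(), time.process_time()
sum(i * i for i in range(10**6))  # compute-bound work: both clocks advance
time.sleep(1.0)                   # waiting: only the wall clock advances
print("elapsed:", time.perf_counter() - start_wall)  # a bit over 1 second
print("CPU:    ", time.process_time() - start_cpu)   # well under 1 second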
Comparing Performance
Comparing the performance of computers is not easy! It depends on
• the programs used,
• the experimental environments, and
• the definition of “faster”.
Comparing Performance
• A is 10 times faster than B for program P1; B is 10 times faster than A for program P2.
• B is 2 times faster than C for program P1; C is 5 times faster than B for program P2.
• A is 20 times faster than C for program P1; C is 50 times faster than A for program P2.
Execution times of two programs
on three machines
Choosing Programs to
Evaluate Performance
Kernels
• Small, key pieces of code from real programs, used to evaluate performance.
- For example, Livermore Loops and Linpack.
- They were used to isolate the performance of individual features of a machine, to explain differences in the performance of real programs.
Toy benchmarks
• Typically between 10 and 100 lines of code; they produce a result the user already knows before running the program.
- Examples: Quicksort, Sieve of Eratosthenes, Puzzle, etc.
Synthetic benchmarks
• Fake programs invented to try to match the profile and behavior of real applications, such as Dhrystone.
Choosing Programs to
Evaluate Performance
All three are discredited today.
Attempts at using them have led to performance pitfalls:
• The compiler writer and architect can conspire to make the computer appear faster on these than on real applications.
Benchmark Suites
Desktop Benchmarks
• CPU-intensive benchmarks:
- SPEC CPU2006:
• 12 integer programs
• 17 FP programs
- Real programs that are portable and vary from a C compiler to a chess program to a quantum computer simulation.
- Useful for processor benchmarking for both desktops and single-processor servers.
• Graphics-intensive benchmarks
Evolution of the SPEC benchmarks
Fig. 1.13
Server Benchmarks
The throughput-oriented benchmark SPECrate uses the SPEC CPU2000 benchmarks to construct a throughput measure:
• The processing rate of a multiprocessor is measured by running multiple copies (usually as many as there are CPUs) of each SPEC CPU benchmark and converting the CPU time into a rate.
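A minimal sketch of the rate idea (the function and the numbers are hypothetical, not the official SPEC formula):

def throughput_rate(copies, reference_time_s, elapsed_time_s):
    # Run several copies of a benchmark and convert elapsed time into a rate
    # relative to a reference machine's time for one copy.
    return copies * reference_time_s / elapsed_time_s

print(throughput_rate(4, 1000.0, 500.0))  # 4 copies finishing in 500 s -> 8.0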
Server Benchmarks
Benchmarks for file servers
• SPECSFS: a benchmark for measuring NFS (Network File System) performance using a script of file server requests.
- It tests the performance of the I/O system (both disk and network I/O) and the CPU.
Benchmarks for web servers
• SPECWeb
- It simulates multiple clients requesting pages from a server, as well as clients posting data to the server.
Embedded Benchmarks
Benchmarks for embedded computing systems are in
a far more primitive state than those for either
desktops or servers.
The use of a single benchmark is unrealistic due to:
• the variety in embedded applications as well as
• differences in performance requirements (hard real-time,
soft real-time and overall cost-performance)
Embedded Benchmarks
EEMBC (Embedded Microprocessor Benchmark Consortium)
• A set of 41 kernels used to predict the performance of different embedded applications:
- Automotive/industrial
- Consumer
- Networking
- Office automation
- Telecommunications
EEMBC does not have a reputation of being a good predictor of the relative performance of different embedded computers.
Reporting Performance Results
Guiding principle: reproducibility
Reporting Performance Results
There is tremendous pressure to improve the performance of programs widely used in evaluating machines.
• This has led companies to add optimizations that improve the performance of synthetic programs, toy programs, kernels, or even real programs.
• Adding such optimizations is more difficult for real programs.
• This fact has led benchmark providers to specify the rules under which compilers must operate.
Reporting Performance Results
A system’s software configuration can significantly
affect the performance results for a benchmark.
OS performance and support can be very important
for server benchmarks.
• These benchmarks are sometimes run in single-user mode.
Reporting Performance Results
The impact of compiler technology can be especially
large
• when modification of the source is allowed or
• when a benchmark is particularly susceptible to an
optimization.
It is very important to describe exactly the software
system, as well as any special modifications.
Reporting Performance Results
To customize the software to improve the performance of a benchmark, vendors have used
• Benchmark-specific flags.
These flags often caused transformations that
• would be illegal on many programs or
• would slow down performance on others.
Reporting Performance Results
Baseline performance:
To increase the significance of results, benchmark developers often require the vendor to use one compiler and one set of flags for all the programs in the same language (C or Fortran).
Reporting Performance Results
Key issue