Da Ci
Chapter Objectives
The knowledge provided in this chapter will prove valuable to you, whether you
are tasked with choosing a new system or aiming to enhance the performance
of an existing one.
Additionally, the chapter explores various factors influencing performance.
By the end of this chapter, you will have a clear understanding of what to
examine in system tuning reports and how each piece of information contributes
to the broader perspective of overall system performance.
Introduction
The word architecture covers all three aspects of computer design: software,
instruction set architecture, and hardware.
Optimization Targets
• Software
• Instruction set architecture (ISA)
• Hardware
• Programming language
• Compiler
• Microarchitecture
• Transistor
Functional Requirements
Definition
This refers to the intended functionality and capabilities of the computer system.
Application Area
• Personal mobile device: Real-time performance; graphics, video, and
audio; energy efficiency.
• Desktop computer: Real-time performance; graphics, video, and audio.
• Servers: Support for databases and transaction processing; enhancements
for reliability and availability; support for scalability.
• Cluster computers: Throughput performance for many independent
tasks; error correction for memory; energy proportionality.
• Internet of things / Embedded computing: Special support for
graphics, video, or other application-specific extensions; power limitations
and power control may be required; real-time constraints.
Trends in Technology
Computer architects must stay updated on swiftly changing implementation
technologies, including:
• Integrated circuit logic technology: Transistor density and die size have
both increased over time. However, the resulting growth in transistor count
no longer follows Moore’s law.
• Semiconductor DRAM (dynamic random-access memory).
• Semiconductor Flash (electrically erasable programmable read-only
memory). This nonvolatile semiconductor memory is the standard storage
device in personal mobile devices (PMDs).
• Magnetic disk technology.
• Network technology.
Performance Measurement and Analysis
Question 1
What does it mean when we say that computer X has better performance than
computer Y?
Answer 1
Computer X is faster than computer Y.
Question 2
What does it mean that computer X is faster than computer Y?
Answer 2
It depends on the perspectives of the users and on both external and internal
considerations of the machine.
User Perspective
The user of a desktop computer may say a computer is faster when a program
runs in less time, while a computer center administrator may say a computer is
faster when it completes more transactions per unit time.
Metrics
• Response time (execution time): Defined as the time between the
start and the completion of an event.
• Throughput: Defined as the total amount of work done in a given time.
Important
The primary consistent and reliable measure of performance is the
execution time of real programs.
Clock Signal Generation
• Quartz crystal
• Analog to Digital conversion
Example 1
A 1-GHz processor receives 1 billion clock pulses per second.
The rate of pulses is known as the clock rate, or clock speed (frequency).
One increment, or pulse, of the clock is referred to as a clock tick.
The time between pulses is the cycle time, also called the clock period or
clock cycle.
• $\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time (period)}$
• $\text{CPU time} = \dfrac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$
Definitions
• CPU Time (execution time): This is the total time the CPU spends
executing a specific program. It is often measured in seconds.
• CPU Clock Cycles for a Program: This refers to the number of
clock cycles (periods) the CPU takes to execute all the instructions in the
program.
Example 2
A program P1 consists of 30 instructions.
Clock frequency = 1 GHz
Number of cycles per instruction = 3 cycles
Cycle time = 1/1000 µs = 0.001 µs = 1 ns
CPU time for P1 = Execution time for P1 = 30 × 3 × 1 ns = 90 ns
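The calculation in Example 2 can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `cpu_time_seconds` and its parameter names are illustrative and do not come from the notes, and it assumes that every instruction takes the same number of cycles.

```python
def cpu_time_seconds(instruction_count, cycles_per_instruction, clock_rate_hz):
    """CPU time = (instructions x cycles per instruction) / clock rate."""
    clock_cycles = instruction_count * cycles_per_instruction
    return clock_cycles / clock_rate_hz


# Values from Example 2: 30 instructions, 3 cycles each, 1 GHz clock.
t = cpu_time_seconds(30, 3, 1e9)
print(f"CPU time for P1 = {t * 1e9:.0f} ns")
```

Dividing by the clock rate is equivalent to multiplying by the cycle time, so the same function covers both forms of the CPU time formula.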
Expressing the Initial Formula in Terms of Units of Measurement
• Instructions
• Clock cycles
• Seconds
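One way to see how these three units fit together is the usual decomposition of CPU time into instruction count, cycles per instruction, and clock cycle time. The display below is a reconstruction under that assumption, not a formula copied from the notes:

```latex
\[
\text{CPU time}
= \frac{\text{Seconds}}{\text{Program}}
= \frac{\text{Instructions}}{\text{Program}}
\times \frac{\text{Clock cycles}}{\text{Instruction}}
\times \frac{\text{Seconds}}{\text{Clock cycle}}
\]
```

With the values of Example 2, this reads 30 instructions × 3 cycles per instruction × 1 ns per cycle = 90 ns.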
Remarks
Executing an instruction involves multiple steps, such as retrieving it from
memory, decoding, and performing operations. Thus, most instructions on most
processors require multiple clock cycles to complete. Some instructions may take
only a few cycles, while others require dozens.
On any given processor, the number of clock cycles required varies for
different types of instructions, such as load, store, branch, and so on.
A straight comparison of clock speeds (frequency) on different processors
does not tell the whole story about performance.
• The overall CPI calculation weights the CPI of each instruction class,
$CPI_i$, by its frequency in the program (i.e., $IC_i / \text{Instruction count}$).
• Because it must include pipeline effects, cache misses, and any other
memory system inefficiencies, CPI should be measured and not just calculated
from a table in the back of a reference manual.
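Written out, the weighted form of the CPI calculation is usually given as follows. Here $IC_i$ is taken to be the number of executed instructions of class $i$ and $CPI_i$ the average number of cycles for that class; the formula is a reconstruction under that assumption:

```latex
\[
\text{CPI}
= \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times CPI_i ,
\qquad
\text{CPU clock cycles} = \sum_{i=1}^{n} IC_i \times CPI_i
\]
```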
Example 3
Suppose we made the following measurements:
• Frequency of FP operations: 25%
• Average CPI of FP operations: 4 cycles
• Average CPI of other instructions: 1.33 cycles
What is the overall (global) CPI?
CPI global = 0.25 × 4 + 0.75 × 1.33 ≈ 2 cycles
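The weighted sum in Example 3 can be sketched in Python as below. The function name `overall_cpi` and the list-of-pairs input format are illustrative choices; the weights 0.25 and 0.75 and the per-class CPI values are the ones used in the example's calculation.

```python
def overall_cpi(classes):
    """Weighted CPI: sum of (instruction frequency x class CPI).

    `classes` is a list of (frequency, cpi) pairs; frequencies must sum to 1.
    """
    total_frequency = sum(freq for freq, _ in classes)
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cpi for freq, cpi in classes)


# Example 3: 25% FP operations at CPI 4, 75% other instructions at CPI 1.33.
cpi = overall_cpi([(0.25, 4.0), (0.75, 1.33)])
print(f"Overall CPI = {cpi:.4f}")
```

The exact value is 1.9975, which the example rounds to 2 cycles.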
Performance Comparison
We often compare the performance of two different computers, X and Y, with
the statement "X is faster than Y", which means that the execution time for
the given task is lower on X than on Y.
In particular, "X is n times as fast as Y" means:
$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
Since execution time is the reciprocal of performance, we have the following
relationship:
$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$
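As a small illustration of the ratio, the sketch below computes n from two execution times. The function name `times_as_fast` and the timings (10 s and 15 s) are hypothetical and are not taken from the notes.

```python
def times_as_fast(exec_time_x, exec_time_y):
    """Return n such that X is n times as fast as Y (n = time_Y / time_X)."""
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Hypothetical measurements: X takes 10 s and Y takes 15 s on the same task.
n = times_as_fast(10.0, 15.0)

# Performance is the reciprocal of execution time, so the ratio of
# performances (X over Y) gives the same number.
performance_ratio = (1 / 10.0) / (1 / 15.0)
print(f"X is {n:.2f} times as fast as Y")
```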
Throughput Metric
The execution time can be replaced by the throughput metric to compare the
performance between X and Y in terms of the amount of work done in a given
time.
Example
Saying that the throughput of X is 5.2 times that of Y means that the number of
tasks completed per unit time on computer X is 5.2 times the number completed
on Y.
Remarks
• Execution time is expressed in seconds. Depending on how it is measured, it
may or may not include instruction processing, memory access, I/O,
interrupts, and operating system overhead.
• Throughput is expressed in units such as instructions per second
(for a processor), queries processed per hour (for a server),
MIPS (million instructions per second), or MFLOPS (million floating-point
operations per second).
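For instance, a MIPS figure follows directly from an instruction count and an execution time. The sketch below uses a hypothetical run of 30 instructions in 90 ns; the function name `mips` is illustrative.

```python
def mips(instruction_count, execution_time_s):
    """Million instructions per second for one program run."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical run: 30 instructions completed in 90 ns.
rate = mips(30, 90e-9)
print(f"{rate:.1f} MIPS")
```

Because the instruction count depends on the instruction set, a MIPS value is only comparable between machines that run the same instructions for the same work.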
Benchmarks
Definition
Performance benchmarking involves objectively evaluating the performance of
one system (e.g., computer, software, component) in comparison to another.
Reliable benchmarks play a crucial role in cutting through marketing
exaggerations and statistical manipulations. In essence, effective benchmarks help
pinpoint systems that deliver optimal performance at a reasonable cost.
Benchmark Types
• Kernels: Small, key pieces of real applications, such as Quicksort.
• Synthetic benchmarks: Fake programs invented to imitate
the behavior of real applications, such as Dhrystone.
SPEC: Standard Performance Evaluation Corporation
The most widely recognized standardized benchmark suites are those published
by SPEC (Standard Performance Evaluation Corporation).
The first SPEC benchmark suite was developed in the late 1980s to benchmark
workstations. Today, SPEC benchmarks cover many application
classes. All the SPEC benchmark suites and their reported results are available at
http://www.spec.org.
SPEC Benchmarks
• Cloud: Cloud IaaS 2016
• CPU: CPU2017
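A SPECRatio is the execution time on a reference computer divided by the
execution time on the computer being measured. For example, suppose that the
SPECRatio of computer A on a benchmark is 2.56 times that of computer B; then
we know:
$$2.56 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{Execution time}_{\text{reference}} / \text{Execution time}_A}{\text{Execution time}_{\text{reference}} / \text{Execution time}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{\text{Performance}_A}{\text{Performance}_B}$$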
Geometric Mean
After choosing a benchmark suite, the performance results of the suite are
summarized in a single number: the geometric mean of the SPECRatios of
the programs in the suite.
$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Sample}_i}$$
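A minimal Python sketch of the geometric mean is shown below. The function name `geometric_mean`, the choice to compute it through logarithms (which avoids overflow in the product for long lists), and the four sample ratios are all assumptions made for illustration.

```python
import math


def geometric_mean(samples):
    """n-th root of the product of n positive samples, computed via logs."""
    if not samples or any(s <= 0 for s in samples):
        raise ValueError("samples must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# Hypothetical SPECRatios for a four-program suite (product = 256).
ratios = [2.0, 8.0, 4.0, 4.0]
gm = geometric_mean(ratios)
print(f"Geometric mean = {gm:.3f}")
```

Python 3.8 and later also provide `statistics.geometric_mean`, which could be used instead of the hand-written function.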
Amdahl’s Law
Definition
Amdahl’s Law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster
mode can be used.
Speedup
Amdahl’s Law defines the speedup that can be gained by using a particular
feature. Speedup is the ratio given by:
$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$
Or, as a function of the execution times:
$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$
Amdahl’s Law Factors
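• $\text{Fraction}_{\text{enhanced}}$: The fraction of the computation time in the original
computer that can be converted to take advantage of the enhancement.
• $\text{Speedup}_{\text{enhanced}}$: The improvement gained by the enhanced execution mode.
This value is the time of the original mode over the time of the enhanced mode.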
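For example, with $\text{Fraction}_{\text{enhanced}} = 0.4$ and $\text{Speedup}_{\text{enhanced}} = 10$:
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$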
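The two factors combine as overall speedup = 1 / ((1 − F) + F / S), where F is the enhanced fraction and S the speedup of the enhanced mode. The Python sketch below evaluates this for F = 0.4 and S = 10; the function name `amdahl_speedup` is an illustrative choice. Evaluating 1 / (0.6 + 0.04) gives 1 / 0.64 = 1.5625, or about 1.56.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Values used in the notes: 40% of the time can be enhanced, 10x faster.
overall = amdahl_speedup(0.4, 10)

# Upper bound when the enhanced part becomes infinitely fast: 1 / (1 - F).
limit = 1.0 / (1.0 - 0.4)
print(f"Overall speedup = {overall:.4f}, upper bound = {limit:.4f}")
```

The upper bound shows the point of the law: however large the enhanced speedup, the part of the task that cannot use the enhancement caps the overall gain.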
System Architect Perspective
• Thermal Design Power (TDP): A metric that quantifies the maximum
amount of heat generated through power consumption by a computer
component under normal operating conditions. It is expressed in watts and
serves as a guideline for system designers to understand how much heat
the cooling system must dissipate.
• Energy and Energy Efficiency: Power is energy per unit time: 1 watt
= 1 joule per second. Energy is the better metric for comparison, since it
is tied to a specific task and the time needed to accomplish that task. The
energy to complete a workload is equal to the average power times the
execution time for the workload.
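The relation energy = average power × execution time can be illustrated with a short sketch. The two processors and their figures below are hypothetical: A is assumed to draw 20% more average power than B but to finish the same workload in 70% of the time.

```python
def energy_joules(average_power_watts, execution_time_s):
    """Energy for a workload = average power x execution time."""
    return average_power_watts * execution_time_s


# Hypothetical processors running the same workload.
energy_b = energy_joules(100.0, 10.0)  # B: 100 W for 10 s
energy_a = energy_joules(120.0, 7.0)   # A: 120 W for 7 s

ratio = energy_a / energy_b
print(f"A uses {ratio:.2f} times the energy of B")
```

Under these assumptions A has the higher power draw but the lower energy for the task, which is why energy is the more informative metric for a fixed workload.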
Remarks
• For a specific task, lowering the clock frequency reduces power but not
energy, because the task takes proportionally longer.
• Dynamic power and energy are both reduced by lowering the voltage.
• The capacitive load depends on the number of transistors connected to
an output and on the technology, which determines the capacitance of the
wires and the transistors.
• Dynamic power is the primary source of power dissipation in CMOS;
however, static power is also an important issue because leakage current
flows even when a transistor is off. The static power is given by:
$\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}$
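The remarks on frequency and voltage can be checked numerically with the usual first-order CMOS model, in which dynamic energy per switching event is proportional to C × V² and dynamic power is proportional to ½ × C × V² × f. This model, the function names, and the 15% reduction are assumptions made for illustration; all quantities are normalized and unitless.

```python
def dynamic_energy(capacitive_load, voltage):
    """Dynamic energy per 0->1->0 transition, proportional to C * V^2."""
    return capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power, proportional to 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Normalized baseline.
C, V, F = 1.0, 1.0, 1.0

# Case 1: lower only the frequency by 15%.
power_freq_only = dynamic_power(C, V, 0.85 * F) / dynamic_power(C, V, F)
energy_freq_only = dynamic_energy(C, V) / dynamic_energy(C, V)

# Case 2: lower both voltage and frequency by 15%.
power_both = dynamic_power(C, 0.85 * V, 0.85 * F) / dynamic_power(C, V, F)
energy_both = dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V)

print(f"frequency only: power x{power_freq_only:.2f}, energy x{energy_freq_only:.2f}")
print(f"voltage and frequency: power x{power_both:.2f}, energy x{energy_both:.2f}")
```

In this model the energy term does not contain the frequency at all, so slowing the clock alone leaves the energy per task unchanged, while lowering the voltage reduces both power and energy.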
Examples
• The first microprocessors consumed about 1 watt.
• The Intel Core i9-9900K (9th generation) is rated at 95 watts (168.48 watts
at full workload).
Consequences
• The limits of air cooling are nearly reached.
Remarks
• Today’s microprocessors contain so many transistors that they cannot
all be turned on at the same time: this is the dark silicon phenomenon.
• The importance of power and energy has led to a new metric for evaluation:
tasks per joule or performance per watt, rather than performance per mm²
of silicon as in the past.
Relative Energy Cost
• 8b Add: 0.03 pJ
• 16b Add: 0.05 pJ
• 32b Add: 0.1 pJ
• 16b FP Add: 0.4 pJ