
Da Ci

Chapter 2 discusses comprehensive performance assessment across computer architectures, focusing on optimization targets, functional requirements, and performance measurement. It emphasizes the importance of understanding system tuning reports, benchmarking, and the impact of various factors like clock speed and execution time on performance. Additionally, it introduces Amdahl's Law and energy considerations in system design.

Uploaded by

Ammar Dridi

Chapter 2: Comprehensive Performance Assessment Across Architectures


Dr. A. Djenadi

Chapter Objectives
The knowledge provided in this chapter will prove valuable to you, whether you
are tasked with choosing a new system or aiming to enhance the performance
of an existing one.
Additionally, the chapter explores various factors influencing performance.
By the end of this chapter, you will have a clear understanding of what to
examine in system tuning reports and how each piece of information contributes
to the broader perspective of overall system performance.

Introduction
The word architecture covers all three aspects of computer design: software,
instruction set architecture, and hardware.

Optimization Targets
• Software
• Instruction set architecture (ISA)
• Hardware
• Programming language

• Compiler
• Microarchitecture
• Transistor

Functional Requirements
Definition
This refers to the intended functionality and capabilities of the computer system.

Application Area
• Personal mobile device: Real-time performance, graphics, videos and
audio, energy efficiency.
• Desktop computer: Real-time performance, graphics, videos and audio.
• Servers: Support for databases and transaction processing; enhance-
ments for reliability and availability; support for scalability.
• Cluster computers: Throughput performance for many independent
tasks; error correction for memory; energy proportionality.
• Internet of things / Embedded computing: Special support for
graphics or video or other application-specific extension; power limitations
and power control may be required; real-time constraints.

Level of Software Compatibility


• Operating system requirements (Necessary features to support chosen
OS).
• Certain standards may be required by marketplace.
• Floating-point format and arithmetic: IEEE 754 standard, special
arithmetic for graphics or signal processing.
• I/O interfaces: For I/O devices: Serial ATA, Serial Attached SCSI, PCI
Express.
• Networks: Support required for different networks: Ethernet.
• Programming languages: Languages (ANSI C, C++, Java, Fortran) affect
instruction set.

Trends in Technology
Computer architects must stay updated on swiftly changing implementation
technologies, including:
• Integrated circuit logic technology: Transistor density and die size both
increase over time. However, this increase no longer follows Moore’s law.
• Semiconductor DRAM (dynamic random-access memory).
• Semiconductor Flash (electrically erasable programmable read-only
memory). This nonvolatile semiconductor memory is the standard storage
device in personal mobile devices (PMDs).
• Magnetic disk technology.
• Network technology.

Performance Measurement and Analysis
Question 1
What does it mean when we say that computer X has better performance than
computer Y?

Answer 1
Computer X is faster than computer Y.

Question 2
What does it mean that computer X is faster than computer Y?

Answer 2
It depends on the perspectives of the users and on both external and internal
considerations of the machine.

User Perspective
The user of a desktop computer may say a computer is faster when a program
runs in less time, while a computer center administrator may say a computer is
faster when it completes more transactions per unit time.

Metrics
• Response time (execution time): Defined as the time between the
start and the completion of an event.
• Throughput: Defined as the total amount of work done in a given time.

Important
The primary consistent and reliable measure of performance is the
execution time of real programs.

Time & Computer: The Clock System


The actions carried out by a processor, such as retrieving an instruction, in-
terpreting the instruction, loading and storing data, and executing arithmetic
operations, are controlled by a system clock.
Typically, all operations begin with the pulse of the clock.
At the most fundamental level, the speed of a processor is dictated by the
pulse frequency produced by the clock, measured in cycles per second, or Hertz
(Hz).

Clock Signal Generation
• Quartz crystal
• Analog to Digital conversion

Example 1
A 1-GHz processor receives 1 billion pulses per second.
The rate of pulses is known as the clock rate, or clock speed (frequency).
One increment, or pulse, of the clock is referred to as a clock tick (or clock cycle).
The time between pulses is the cycle time, also called the clock period.

CPU Time (Execution Time): The Processor Performance Equation
CPU time (execution time) for a program can be expressed in seconds in two
ways:

• \[ \text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time (period)} \]
• \[ \text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}} \]

Definitions
• CPU Time (execution time): This is the total time the CPU spends
executing a specific program. It is often measured in seconds.
• CPU Clock Cycles for a Program: This refers to the number of
clock cycles (periods) the CPU takes to execute all the instructions in the
program.

• Clock Cycle Time (period): This is the duration of a single clock
cycle, measured in seconds. It represents the time it takes for the CPU to
complete one clock cycle.
• Clock rate: This is the clock frequency (the number of clock cycles per
second).

Example 2
A program P1 consists of 30 instructions.
Clock frequency = 1 GHz
Number of cycles per instruction = 3 cycles
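\[ \text{Cycle time} = \frac{1}{1000\ \text{MHz}} = 0.001\ \mu\text{s} = 1\ \text{ns} \]
\[ \text{CPU time for P1} = \text{Execution time for P1} = 30 \times 3 \times 1\ \text{ns} = 90\ \text{ns} \]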
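
The arithmetic of Example 2 can be checked with a short Python sketch. The function and variable names are illustrative choices, not part of the course material; the numbers are those of the example. Both forms of the CPU time expression (cycles times period, and cycles divided by clock rate) should give the same result.

```python
def cpu_time_from_period(clock_cycles, cycle_time_s):
    """CPU time = clock cycles for the program x clock cycle time."""
    return clock_cycles * cycle_time_s


def cpu_time_from_rate(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles for the program / clock rate."""
    return clock_cycles / clock_rate_hz


# Values taken from Example 2
instructions = 30
cycles_per_instruction = 3
clock_rate_hz = 1e9                # 1 GHz
cycle_time_s = 1 / clock_rate_hz   # 1 ns

# Total clock cycles needed by program P1
total_cycles = instructions * cycles_per_instruction

t_period = cpu_time_from_period(total_cycles, cycle_time_s)
t_rate = cpu_time_from_rate(total_cycles, clock_rate_hz)

# Report both results in nanoseconds
print(f"{t_period * 1e9:.1f} ns, {t_rate * 1e9:.1f} ns")
```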

Expressing the Initial Formula in Terms of Units
of Measurement
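The units of measurement involved are instructions, clock cycles, and seconds:
\[ \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time} \]

As this formula demonstrates, processor performance is dependent upon
three characteristics: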
• Clock cycle time (period): Hardware technology and organization.
• Clock cycles per instruction (CPI): Organization and instruction set ar-
chitecture.
• Instruction count: Instruction set architecture and compiler technology.

Remarks
Executing an instruction involves multiple steps, such as retrieving it from mem-
ory, decoding, and performing operations. Thus, most instructions on most pro-
cessors require multiple clock cycles to complete. Some instructions may take
only a few cycles, while others require dozens.
On any given processor, the number of clock cycles required varies for dif-
ferent types of instructions, such as load, store, branch, and so on.
A straight comparison of clock speeds (frequency) on different processors
does not tell the whole story about performance.
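
The last remark can be illustrated with a minimal sketch that uses CPU time = instruction count × CPI / clock rate. The two processors and all their numbers are hypothetical, chosen only to show that a lower clock rate can still yield a shorter execution time when the CPI is lower.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical program and processors (assumed values, not measurements)
instruction_count = 1e9

time_a = cpu_time(instruction_count, cpi=2.5, clock_rate_hz=3e9)  # 3 GHz, high CPI
time_b = cpu_time(instruction_count, cpi=1.2, clock_rate_hz=2e9)  # 2 GHz, low CPI

print(f"Processor A: {time_a:.3f} s")
print(f"Processor B: {time_b:.3f} s")
print("B is faster despite its lower clock rate:", time_b < time_a)
```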

The Overall CPI or the Global CPI


• \[ \text{Global CPI} = \frac{\sum_i \left(\text{IC}_i \times \text{CPI}_i\right)}{\text{Instruction count}} \]

• The overall version of the CPI calculation considers each specific CPI and
its frequency in a program (i.e., $\text{IC}_i / \text{Instruction count}$).

• Because it must include pipeline effects, cache misses, and any other mem-
ory system inefficiencies, CPI should be measured and not just calculated
from a table in the back of a reference manual.

Example 3
Suppose we made the following measurements:

• Frequency of floating point (FP) operations: 25%
• Average CPI of FP operations: 4 cycles
• Average CPI of other instructions: 1.33 cycles
What is the global CPI?
\[ \text{Global CPI} = 0.25 \times 4 + 0.75 \times 1.33 = 1.9975 \approx 2\ \text{cycles} \]
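
As a sketch of the weighted-sum calculation, the function below (an illustrative helper, not part of the course material) takes each instruction class as a pair of frequency and CPI and returns the global CPI. The input values are those of Example 3.

```python
def global_cpi(instruction_mix):
    """Weighted CPI for a list of (fraction_of_instructions, cpi) pairs."""
    total_fraction = sum(fraction for fraction, _ in instruction_mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in instruction_mix)


# Example 3: 25% FP operations at CPI 4, 75% other instructions at CPI 1.33
mix = [(0.25, 4.0), (0.75, 1.33)]
cpi = global_cpi(mix)
print(f"{cpi:.4f}")  # 1.9975
```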

Performance Comparison
We often compare the performance of two different computers, X and Y, by
using the assessment "X is faster than Y", which means that execution time is
lower on X than on Y for the given task.
In particular, "X is n times as fast as Y" will mean:
\[ \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \]
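We suppose that the execution time is the reciprocal of performance, thus
we have the following relationship:
\[ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1 / \text{Performance}_Y}{1 / \text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]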
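
A minimal sketch of this relationship, with hypothetical execution times (15 s on Y and 10 s on X are assumed values): the ratio of execution times and the ratio of performances, taken as reciprocals of the times, should agree.

```python
def times_as_fast(exec_time_x, exec_time_y):
    """Return n such that X is n times as fast as Y."""
    return exec_time_y / exec_time_x


# Hypothetical measurements in seconds (assumed values)
exec_time_x = 10.0
exec_time_y = 15.0

n_from_times = times_as_fast(exec_time_x, exec_time_y)

# Performance is taken as the reciprocal of execution time
performance_x = 1.0 / exec_time_x
performance_y = 1.0 / exec_time_y
n_from_performance = performance_x / performance_y

print(n_from_times, n_from_performance)
```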

Throughput Metric
The execution time can be replaced by the throughput metric to compare the
performance between X and Y in terms of the amount of work done in a given
time.

Example
Saying that the throughput of X is 5.2 times that of Y means that the number of
tasks completed per unit time on computer X is 5.2 times the number completed
on Y.

Remarks
• Execution time is expressed in seconds. It may or may not include
instruction processing, memory access, I/O, interrupts, and operating
system overhead.
• Throughput is expressed, for example, in the number of instructions per
second (for a processor), the number of queries processed per hour (for a
server), MIPS (Million Instructions Per Second), or MFLOPS (Million
Floating-point Operations Per Second).

Benchmarks
Definition
Performance benchmarking involves objectively evaluating the performance of
one system (e.g., computer, software, component) in comparison to another.
Reliable benchmarks play a crucial role in cutting through marketing exag-
gerations and statistical manipulations. In essence, effective benchmarks help
pinpoint systems that deliver optimal performance at a reasonable cost.

Benchmark Types
• Kernels: Small, key pieces of real applications, such as Quicksort.
• Synthetic benchmarks: Fake programs invented to imitate
the behavior of real applications, such as Dhrystone.

Flaws and Limitations


• The compiler writer and architect can manipulate the test results by mak-
ing the computer appear faster on these surrogate programs than on real
applications.
• Benchmark-specific compiler flags can be used to improve the performance
of a benchmark. These flags often cause transformations that would be
illegal on many programs or would slow down performance on others.
• Policies on modifying the source code of the benchmarks vary:
– No modifications are allowed.
– Modifications are allowed but are practically impossible to make
(e.g., database benchmarks).
– Source modifications are allowed, as long as the altered version
produces the same output.

Better Benchmarking Solution: Benchmark Suites


An accepted solution for performance assessment is the use of collections of
benchmark applications, called benchmark suites.
A key advantage of such suites is that the weakness of any one benchmark
is lessened by the presence of the other benchmarks.

SPEC: Standard Performance Evaluation Corporation
The most widely recognized standardized benchmark suites are those of
SPEC (Standard Performance Evaluation Corporation).
The first version of the suite was developed in the late 1980s to benchmark
workstations. Currently, there are SPEC benchmarks to cover many application
classes. All the SPEC benchmark suites and their reported results are found at
http://www.spec.org.

SPEC Benchmarks
• Cloud: Cloud_IaaS 2016
• CPU: CPU2017
• Graphics and Workstation: SPECviewperf 12, SPECwpc V2.0, SPECapcSM
for 3ds Max 2015, SPECapcSM for Maya 2012, SPECapcSM for PTC
Creo 3.0, SPECapcSM for Siemens NX 9.0 and 10.0, SPECapcSM for
SolidWorks 2015
• High Performance Computing: ACCEL, MPI2007, OMP2012
• Java client/server: SPECjbb2015
• Power: SPECpower_ssj2008
• Server (SFS): SFS2014, SPECsfs2008
• Virtualization: SPECvirt_sc2013

Reporting Performance Results


The key principle in reporting performance measurements should be
reproducibility: list everything another experimenter needs to replicate the results.
A SPEC benchmark report requires an extensive description of the computer
and the compiler flags, as well as the publication of both the baseline and the
optimized results.
Alongside hardware, software, and baseline tuning details, a SPEC report
includes performance times displayed in tables and graphs.

SPEC Results Comparison: SPECRatio


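SPEC normalizes execution times to a reference computer by dividing
the time on the reference computer by the time on the computer being rated,
yielding a ratio proportional to performance. SPEC calls this ratio the SPECRatio.
For example, suppose that the SPECRatio of computer A on a benchmark
is 2.56 times that of computer B; then we know:
\[ 2.56 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{Execution time}_{\text{reference}} / \text{Execution time}_A}{\text{Execution time}_{\text{reference}} / \text{Execution time}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{\text{Performance}_A}{\text{Performance}_B} \]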

Geometric Mean
After choosing a benchmark suite, the performance results of the suite are
summarized in a single number: the geometric mean of the SPECRatios of
the programs in the suite.
\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{sample}_i} \]

In the case of SPEC, $\text{sample}_i$ is the SPECRatio for program i.
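
The geometric mean can be sketched in a few lines of Python. The SPECRatio values below are hypothetical, chosen so that the result is easy to verify by hand (the cube root of 2 × 8 × 4 = 64 is 4). The sum-of-logarithms form is used here as one possible way to avoid overflow when many ratios are multiplied.

```python
import math


def geometric_mean(samples):
    """n-th root of the product of n positive samples, computed via logarithms."""
    if not samples or any(s <= 0 for s in samples):
        raise ValueError("samples must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# Hypothetical SPECRatios for a three-program suite (assumed values)
spec_ratios = [2.0, 8.0, 4.0]
summary = geometric_mean(spec_ratios)
print(f"{summary:.2f}")  # 4.00
```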

Why Use Geometric Mean


• The geometric mean of the ratios is the same as the ratio of the geometric
means.
• The ratio of the geometric means is equal to the geometric mean of the
performance ratios, which implies that the choice of the reference computer
is irrelevant.
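
The independence from the reference computer can be checked numerically. In the sketch below, all execution times are hypothetical: two computers A and B are rated against two different reference machines, and the ratio of their geometric means should come out the same in both cases and equal the geometric mean of the per-program time ratios.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(reference_times, measured_times):
    """Per-program ratio: time on the reference / time on the rated computer."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


# Hypothetical execution times in seconds for three programs (assumed values)
times_a = [10.0, 40.0, 5.0]
times_b = [20.0, 50.0, 15.0]
reference_1 = [100.0, 200.0, 50.0]
reference_2 = [30.0, 900.0, 7.0]


def relative_performance(reference_times):
    """Ratio of the geometric means of A's and B's SPECRatios."""
    gm_a = geometric_mean(spec_ratios(reference_times, times_a))
    gm_b = geometric_mean(spec_ratios(reference_times, times_b))
    return gm_a / gm_b


r1 = relative_performance(reference_1)
r2 = relative_performance(reference_2)

# Geometric mean of the per-program performance ratios, with no reference at all
direct = geometric_mean([tb / ta for ta, tb in zip(times_a, times_b)])

print(r1, r2, direct)
```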

Performance Enhancement: Amdahl’s Law


Objective
Enhancing the performance by improving a portion of a computer.

Definition
Amdahl’s Law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster
mode can be used.

Speedup
Amdahl’s Law defines the speedup that can be gained by using a particular
feature. Speedup is the ratio given by:
\[ \text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}} \]
Or, as a function of the execution times:
\[ \text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}} \]

Amdahl’s Law Factors
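• $\text{Fraction}_{\text{enhanced}}$: The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement.
• $\text{Speedup}_{\text{enhanced}}$: The improvement gained by the enhanced execution mode. This value is the time of the original mode over the time of the enhanced mode.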

The New Enhanced Execution Time


The execution time using the original computer with the enhanced mode will
be the time spent using the unenhanced portion of the computer plus the time
spent using the enhancement:
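\[ \text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right) \]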

The overall speedup is given by:


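\[ \text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]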

Example: Amdahl’s Law


Suppose that we want to enhance the processor used for web serving. The new
processor is 10 times faster on computation in the web serving application than
the old processor. Assume that the original processor is busy with
computation 40% of the time and is waiting for I/O 60% of the time.
What is the overall speedup gained by incorporating the enhancement?

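\[ \text{Fraction}_{\text{enhanced}} = 0.4, \qquad \text{Speedup}_{\text{enhanced}} = 10 \]
\[ \text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56 \]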
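
Amdahl's Law is easy to evaluate programmatically. The sketch below uses an illustrative function name and applies the law to the web serving example: with 40% of the time enhanced by a factor of 10, the overall speedup evaluates to 1/0.64 = 1.5625. It also shows the upper bound 1/(1 − 0.4) that no amount of computation speedup can exceed.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the time is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Web serving example: 40% computation, made 10 times faster
overall = amdahl_speedup(0.4, 10)
print(f"{overall:.4f}")  # 1.5625

# Upper bound when the enhanced part becomes arbitrarily fast
limit = 1.0 / (1.0 - 0.4)
print(f"{limit:.4f}")
```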

Power and Energy


Introduction
In today’s energy-constrained world, energy is considered the most significant
design constraint for every class of computer. Two main challenges arise from
this aspect:
• Power supply: Power must be efficiently transported in and distributed
around the chip.
• Cooling solutions: The dissipation of power as heat must be effectively
managed and removed.

System Architect Perspective
• Thermal Design Power (TDP): A metric that quantifies the maximum
amount of heat generated through power consumption by a computer
component under normal operating conditions. It is expressed in watts
and serves as a guideline for system designers to understand the amount of
heat dissipation that needs to be managed by the cooling system.
• Energy and Energy Efficiency: Power is energy per unit time: 1 watt
= 1 joule per second. Energy is generally the better metric because it is tied
to a specific task and the time needed to accomplish that task. The energy
to complete a workload is equal to the average power times the execution
time for the workload.
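
A small sketch shows why energy, rather than power, is the fairer basis for comparison. The two processors and their figures are hypothetical: processor A draws 20% more average power than processor B but finishes the same workload in 70% of the time, so it consumes less energy overall.

```python
def energy_joules(average_power_watts, execution_time_s):
    """Energy for a workload = average power x execution time."""
    return average_power_watts * execution_time_s


# Hypothetical processors running the same workload (assumed values)
energy_b = energy_joules(average_power_watts=50.0, execution_time_s=10.0)
energy_a = energy_joules(average_power_watts=60.0, execution_time_s=7.0)

ratio = energy_a / energy_b
print(f"A: {energy_a:.0f} J, B: {energy_b:.0f} J, ratio A/B = {ratio:.2f}")
```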

Energy and Power Within a Microprocessor


For CMOS chips, energy has traditionally been consumed mainly in
switching transistors; this is called dynamic energy. The energy required per
transistor for one pulse of the logic transition 0 → 1 → 0 or 1 → 0 → 1 is given
by:
\[ \text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2 \]

Remarks
• For a specific task, slowing the clock rate reduces power, but not energy.
• The dynamic power and energy are reduced by lowering the voltage.
• The capacitive load is a function of the number of transistors connected to
an output and of the technology, which determines the capacitance of the
wires and the transistors.
• Although dynamic power is the primary source of power dissipation in CMOS,
static power is also an important issue because leakage current flows even
when a transistor is idle. The static power is given by:

\[ \text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage} \]

• The static power is proportional to the number of devices.
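
The effect of voltage on dynamic energy can be sketched with the two formulas above. The capacitance, voltages, and leakage current below are hypothetical values picked for illustration; the point is that a 15% voltage reduction scales dynamic energy by 0.85² ≈ 0.72, independently of the capacitive load.

```python
def dynamic_energy(capacitive_load_farads, voltage_volts):
    """Energy of one 0 -> 1 -> 0 (or 1 -> 0 -> 1) pulse: C x V^2."""
    return capacitive_load_farads * voltage_volts ** 2


def static_power(static_current_amps, voltage_volts):
    """Static (leakage) power: I_static x V."""
    return static_current_amps * voltage_volts


# Hypothetical transistor parameters (assumed values)
capacitive_load = 1e-15   # 1 fF
old_voltage = 1.0         # volts
new_voltage = 0.85        # volts, a 15% reduction

energy_ratio = (dynamic_energy(capacitive_load, new_voltage)
                / dynamic_energy(capacitive_load, old_voltage))
print(f"Dynamic energy ratio: {energy_ratio:.4f}")

# Static power also drops with voltage, here linearly for a fixed leakage current
leakage_ratio = static_power(1e-9, new_voltage) / static_power(1e-9, old_voltage)
print(f"Static power ratio: {leakage_ratio:.2f}")
```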

Energy, Power and Performance Enhancement


Over the evolution of computer architecture, the increase in the number of
transistors and in the frequency has dominated the decrease in load capacitance
and voltage, leading to an overall growth in power consumption and energy.

Examples
• The first microprocessors consumed about 1 watt.
• The Intel Core i9-9900K (9th generation) is rated at 95 watts (168.48 watts
at full workload).

Consequences
• The limits of air cooling have nearly been reached.
• The slowdown in clock rate growth has led to a period of slower
performance improvement.
• Distributing the power, removing the heat, and preventing hot spots have
become increasingly difficult challenges.

Methods for Improving Energy Efficiency


• Do nothing well: Turn off the clock of inactive modules
to save energy and dynamic power. For example, if some cores are idle,
their clocks are stopped.
• Dynamic voltage-frequency scaling (DVFS): Scale down
the working voltage and/or frequency to use less power and energy. For
example: the energy saving mode of a laptop.
• Design for the typical case: Design components with an energy saving
mode. For example: DRAM designed with a low power mode, or disks that
have a mode that spins more slowly when unused to save power. However,
you cannot access DRAMs or disks in these modes, so you must return to
fully active mode to read or write.
• Overclocking (e.g., Intel Turbo mode): The chip runs at a
higher clock rate for a short time. For example: for single-threaded code,
the microprocessor can turn off all cores but one and run it faster.

Remarks
• Today’s microprocessors have so many transistors that they cannot all be
turned on at the same time; this is known as the dark silicon phenomenon.
• The importance of power and energy has led to a new metric for evaluation:
tasks per joule or performance per watt rather than performance per mm²
of silicon as in the past.

Relative Energy Cost
• 8b Add: 0.03 pJ
• 16b Add: 0.05 pJ
• 32b Add: 0.1 pJ
• 16b FP Add: 0.4 pJ
• 32b FP Add: 0.9 pJ
• 8b Mult: 0.2 pJ
• 32b Mult: 3.1 pJ
• 16b FP Mult: 1.1 pJ
• 32b FP Mult: 3.7 pJ
• 32b SRAM Read (8 KB): 5 pJ
• 32b DRAM Read: 640 pJ
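
To make the relative costs concrete, the sketch below normalizes a few of the listed energies to the 32-bit integer add. The dictionary keys are shortened labels chosen for this illustration; the picojoule values are those given in the list.

```python
# Energy per operation in picojoules, taken from the list above
energy_pj = {
    "8b Add": 0.03,
    "32b Add": 0.1,
    "32b Mult": 3.1,
    "32b SRAM Read": 5.0,
    "32b DRAM Read": 640.0,
}

# Express every cost as a multiple of a 32-bit integer add
baseline = energy_pj["32b Add"]
relative_cost = {op: cost / baseline for op, cost in energy_pj.items()}

for op, factor in relative_cost.items():
    print(f"{op}: {factor:.1f}x a 32b add")
```

On these figures, a single 32-bit DRAM read costs as much energy as several thousand 32-bit adds, which is why reducing memory traffic matters so much for energy efficiency.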
