
EECS 112 (Spring 2024)

Organization of Digital Computers

Chapter 01
Computer Abstraction and Technology
Hyoukjun Kwon
[email protected]
EECS 112 (Spring 2024)
Organization of Digital Computers

Section 1. Technology

2
Introduction: The Computer Revolution
§ Three revolutions for civilization:
• Agricultural revolution
• Industrial revolution
• Information revolution
o The computer revolution is its foundation

§ The computer revolution makes novel applications feasible; computers are pervasive:
• Computers in automobiles
• Cell phones
• Human genome project
• World Wide Web
• Search engines
3
Classes of Computers
§ Personal computers (“PC”)
• General purpose, variety of software
• Subject to cost/performance tradeoff

§ Server computers
• Network based
• High capacity, performance, reliability
• Range from small servers to building sized

§ Supercomputers
• High-end scientific and engineering calculations
• Highest capability but represent a small fraction of the overall computer market

§ Embedded computers
• Hidden as components of systems
• Stringent power/performance/cost constraints
4
The Post PC Era

5
The Post PC Era
§ Personal Mobile Device (PMD)
• Battery operated
• Connects to the Internet
• Hundreds of dollars
• Smart phones, tablets, electronic glasses

§ Cloud computing
• Warehouse Scale Computers (WSC)
• Software as a Service (SaaS)
• A portion of the software runs on a PMD, and a portion runs in the Cloud
• Amazon and Google

§ Others:
• AR/VR
• Autonomous driving
• …
6
Opening the Box

Inside iPhone XS Max

7
Opening the Box

Inside iPhone XS Max

8
Inside the Processor (CPU)
§ Datapath: performs operations on data (i.e., computation)

§ Control: sequences datapath, memory, and other components

§ On-chip Memory: Stores data near CPU cores


• Cache memory: Fast and small SRAM memory for immediate access to data

High-level comparison of on-chip and off-chip memory:
• Cache memories require 1-2 cycles for access (but small size: KB to 10s of MB range, subject to the chip size)
• DRAM (off-chip memory) requires 50-60 cycles* (but large size: GB range)

* S. Eyerman et al., “DRAM Bandwidth and Latency Stacks: Visualizing DRAM Bottlenecks.” ISPASS 2022 (Intel Paper) 9
Inside the Processor

Apple A12 System-on-Chip (SoC)

10
Networks
§ Functionalities
• Communication: exchange information
• Resource sharing
o e.g., many users can share one GPU server
• Nonlocal access to remote resources
o e.g., provide access to a computer server

§ Types of Networks
• Local area network (LAN)
o Wired: Ethernet
o Wireless: WiFi
• Wide area network (WAN): the Internet
• Personal area network (PAN)
o Wireless network: mainly Bluetooth

11
Memory and Storage
§ Volatile main memory (DRAM)
• Loses instructions and data when power off
§ Non-volatile secondary memory (storage)
• Flash memory (solid state drive; SSD)
• Magnetic disk (hard disk drive)
• Optical disk (CD-ROM, DVD)

No single option is ideal for all use cases; they all have trade-offs

12
Technology Trends
§ Electronics technology continues to evolve
• Increased capacity and performance
• Reduced cost

Year   Technology                   Relative performance/cost
1951   Vacuum tube                  1
1965   Transistor                   35
1975   Integrated circuit (IC)      900
1995   Very large scale IC (VLSI)   2,400,000
2013   Ultra large scale IC         250,000,000,000

§ Semiconductor Technology
• Built upon silicon
• Add materials (conductors and insulators) to transform properties
• Organize the structure to form transistors
• Transistors work as electrically controlled switches (conduct or insulate under specific conditions)

What does a transistor look like?

13
Modern Transistor: MOSFET
§ Metal Oxide Semiconductor Field-Effect Transistor
§ Insulator: a material where electric current doesn't flow freely (e.g., rubber blocks electricity)

[Figure: cross-sections of an n-channel MOSFET (n-doped source and drain on a p-substrate of silicon) and a p-channel MOSFET (p-doped source and drain on an n-substrate), each with a gate on top of a thin gate-oxide insulator]

• Figure source: M. Riordan et al., "The invention of the transistor." Reviews of Modern Physics 71.2, 1999. 14
Modern Transistor: MOSFET
§ Metal Oxide Semiconductor Field-Effect Transistor

[Figure: silicon lattice diagrams showing a "free" electron around a phosphorus (P) dopant atom and a "hole" around a boron (B) dopant atom, next to the n-channel and p-channel MOSFET cross-sections]

§ N-doped substrate
• Add atoms with five valence electrons (e.g., phosphorus)
• "Free" electrons can flow away
§ P-doped substrate
• Add atoms with three valence electrons (e.g., boron)
• "Holes" can accommodate incoming electrons

15
MOSFET as a Switch
VT: a voltage large enough to create an n-channel
§ How an n-type MOSFET works as a switch
• Gate voltage Vh > VT: electrons are attracted to the surface under the gate and form an "n-channel" between source and drain => "closed" switch (i.e., connected)
• Gate at GND: no n-channel forms => "open" switch (i.e., disconnected)

[Figure: n-channel MOSFET cross-sections with the gate at Vh > VT (channel present) and at GND (no channel)]

A p-type MOSFET works in the opposite way (if GND is applied to the gate, the switch is closed) 16
MOSFET as a Switch
§ "High" voltage (+5 V) on the gate == pushing the switch to close it: an n-channel forms and connects source and drain
§ "Low" voltage (0 V) on the gate == not pushing the switch: no n-channel forms, so the switch stays open

[Figure: N-type MOSFET cross-sections paired with mechanical push-switch analogies for the closed and open cases]

What you need to remember: a MOSFET is an electrically controlled switch 17


Two Types of MOSFET

[Figure: N-type MOSFET (n-doped source/drain on a p-substrate) and P-type MOSFET (p-doped source/drain on an n-substrate)]

<N-type MOSFET>                        <P-type MOSFET>
Gate Voltage   Connectivity            Gate Voltage   Connectivity
"Low"          Disconnected (Open)     "Low"          Connected (Closed)
"High"         Connected (Closed)      "High"         Disconnected (Open)

<Analogy>
N-type: a switch with a "short" spring; we need to push the switch (i.e., apply "high" voltage) to connect it
P-type: a switch with a "long" spring; we need to pull the switch (i.e., apply "high" voltage) to disconnect it

18
Switch Abstraction of MOSFET
VGS: the voltage between gate and source
VT: a threshold voltage for a connection

§ nMOS switch: source and drain connected if VGS > VT
§ pMOS switch: source and drain connected if VGS < VT (== inverted nMOS; the circle on the gate means "inversion")

[Figure: n-channel and p-channel MOSFET cross-sections next to their switch-symbol abstractions]

19
Building Logic Gates with Transistors

[Figure: inverter circuit between Vdd = 5 V and Vss = 0 V, with input In and output Out]

Input            Output
"Low" == 0 V     "High" == 5 V
"High" == 5 V    "Low" == 0 V

The input signal is "inverted" => "Inverter" or "NOT" gate

Take-away: we can build all logic gates using transistors 20
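To make the switch abstraction concrete, here is a minimal C sketch (illustrative, not from the slides) that models the nMOS and pMOS switches above and composes them into a NOT gate:

```c
#include <stdio.h>
#include <stdbool.h>

/* Switch abstraction from the slides: an nMOS switch conducts when its gate
   is "high"; a pMOS switch conducts when its gate is "low". */
static bool nmos_conducts(bool gate_high) { return gate_high; }
static bool pmos_conducts(bool gate_high) { return !gate_high; }

/* CMOS inverter: the pMOS connects the output to Vdd ("high") when the input
   is low; the nMOS connects the output to Vss ("low") when the input is high.
   Exactly one of the two switches is closed for any input. */
static bool not_gate(bool in) {
    if (pmos_conducts(in)) return true;  /* output tied to Vdd */
    return false;                        /* nmos_conducts(in): output tied to Vss */
}

int main(void) {
    printf("In=0 -> Out=%d\n", not_gate(false)); /* Out=1 */
    printf("In=1 -> Out=%d\n", not_gate(true));  /* Out=0 */
    return 0;
}
```

Placing such switch networks in series or in parallel in the same style yields NAND and NOR gates, which is the sense in which all logic gates can be built from transistors.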


Manufacturing ICs

Yield: proportion of working dies per wafer


* Resource: From Sand to Silicon: The Making of a Microchip (by Intel) https://youtu.be/_VMYPLXnd7E?si=IFXOEHxxqL9TvPSn 21
Intel® Core 10th Gen “Ice Lake” CPUs Wafer
§ 300mm wafer, 506 chips, 10nm
technology
§ Each chip is 11.4 x 10.7 mm

22
Integrated Circuit Cost

Cost per die = Cost per wafer / (Dies per wafer × Yield)

Dies per wafer ≈ Wafer area / Die area

Yield = 1 / (1 + Defects per area × Die area)^N

• The yield formula comes from empirical observations of yields at IC factories; N is related to the number of critical processing steps

§ Nonlinear relation to area and defect rate
• Wafer cost and area are fixed
• Defect rate determined by manufacturing process
• Die area determined by architecture and circuit design
=> We should minimize the die area

23
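As a sanity check on these formulas, here is a minimal C sketch; the wafer cost, defect rate, and exponent N below are made-up illustrative numbers, not real process data (only the die size comes from the Ice Lake slide above):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Illustrative inputs (assumed values, not real process data) */
    double wafer_cost      = 10000.0;  /* dollars per wafer */
    double wafer_area      = 70686.0;  /* mm^2 (300 mm wafer: pi * 150^2) */
    double die_area        = 122.0;    /* mm^2 (~11.4 x 10.7 mm, slide 22) */
    double defects_per_mm2 = 0.001;    /* defects per unit area */
    double n               = 2.0;      /* process-complexity exponent N */

    double dies_per_wafer = wafer_area / die_area;
    double yield = 1.0 / pow(1.0 + defects_per_mm2 * die_area, n);
    double cost_per_die = wafer_cost / (dies_per_wafer * yield);

    printf("Dies per wafer: %.0f\n", dies_per_wafer);
    printf("Yield:          %.3f\n", yield);
    printf("Cost per die:   $%.2f\n", cost_per_die);
    return 0;
}
```

Doubling the die area in this sketch both halves the dies per wafer and lowers the yield, which is the nonlinear cost relation the slide refers to.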
EECS 112 (Spring 2024)
Organization of Digital Computers

Section 2. Abstraction

24
Seven Great Ideas in Computer Architecture

§ Use abstraction to simplify design
§ Make the common case fast
§ Performance via parallelism
§ Performance via pipelining
§ Performance via prediction
§ Hierarchy of memories
§ Dependability via redundancy

We will focus on these ideas in EECS 112

25
Below Your Program
§ Application software
• Written in high-level language (HLL)
§ System software
• Compiler: translates HLL code to machine code
• Operating System: service code
o Handling input/output
o Managing memory and storage
o Scheduling tasks & sharing resources
§ Hardware
• Processor, memory, I/O controllers

26
Levels of Program Code
§ High-level language (Python, C, C++, …)
• Level of abstraction closer to problem domain
• Provides for productivity and portability
§ Assembly language
• Textual representation of instructions
§ Hardware representation
• Binary digits (bits)
• Encoded instructions and data

Thanks to the multiple levels of abstraction, we don't have to deal with low-level details like raw bit encodings

27
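To illustrate the three levels, here is a minimal sketch of one C statement together with an assumed RISC-V-style translation in the comments (the exact assembly and bit encoding depend on the compiler and ISA, and the register assignments are hypothetical):

```c
#include <stdio.h>

int main(void) {
    int a = 3, b = 4;

    /* High-level language (C): close to the problem domain */
    int sum = a + b;

    /* Assembly language: a textual representation of the single instruction
       a compiler might emit for the addition (RISC-V, assuming a in x6,
       b in x7, and sum in x5):
           add x5, x6, x7
       Hardware representation: the 32 bits that encode that instruction:
           0000000 00111 00110 000 00101 0110011
           (funct7 rs2   rs1   f3  rd    opcode)                            */

    printf("sum = %d\n", sum);
    return 0;
}
```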
Components of a Computer
§ Same components for all kinds of computer
• Desktop, server, embedded, …

§ Input/output includes
• User-interface devices
o Display, keyboard, mouse, touch screen
• Storage devices
o Hard disk, CD/DVD, flash
• Network adapters
o For communicating with other computers
§ Processor
• Control + Datapath + on-chip memory (cache)

29
Components of a Computer

Thanks to the multiple levels of abstraction, programmers do not have to worry about hardware-level details.

However, to be a strong programmer who can write high-performance code, you should understand how the underlying hardware (CPU, GPU, etc.) works! 30
EECS 112 (Spring 2024)
Organization of Digital Computers

Section 3. Performance

31
Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

32
Defining Performance
§ Which airplane has the best performance?

33
Performance Metrics: Response Time and Throughput
§ Response time (i.e., latency)
• How long it takes to complete a task
Vs.
§ Throughput
• Total work done per unit time
o e.g., tasks/transactions/… per hour

§ Latency- and throughput-oriented optimization strategies
• Latency-oriented (e.g., CPU): replace the processor with a faster version
• Throughput-oriented (e.g., GPU): add many cores (each simpler than a regular processor core); individual cores can be slower

NVIDIA's video on throughput:
https://youtu.be/-P28LKWTzrI?si=z3_I8AV0TG-fHhh1

§ We’ll focus on response time for now


34
Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

35
Relative Performance
§ Define Performance = 1 / Execution Time
§ "X is n times faster than Y" means:

Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

§ Example: time taken to run a program
• 10s on A, 15s on B
• Execution Time_B / Execution Time_A = 15s / 10s = 1.5
• So A is 1.5 times faster than B

36
Measuring Execution Time
§ Elapsed time (aka wall clock time or response time)
• Total response time, including all aspects (i.e., “end-to-end” latency)
o Includes processing (computation), I/O (data movement), OS overhead, idle time
(“Stalls”; to be discussed later in the lecture), and so on
• Determines system performance

§ CPU time
• Time spent processing a given job on a CPU
o Discounts I/O time, other jobs’ shares
• Consists of user CPU time (time spent on user-defined programs) and
system CPU time (time spent on OS/system services for running the program)
• Different programs are affected differently by CPU and system performance

Example: what if the OS takes too long for dynamic memory allocation (e.g., malloc)? 37
Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

38
CPU Clocking
§ Operation of digital hardware is governed by a constant-rate clock

[Figure: clock waveform showing the clock period; data transfer and computation happen within a cycle, and state is updated at the cycle boundary]

§ Clock period (clock cycle time): duration of a clock cycle
• e.g., 250 ps = 0.25 ns = 250 × 10^-12 s

§ Clock frequency (clock rate): number of cycles per second
• e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz

39
39
Clock Frequency and Cycle
§ Frequency: how many times does a signal oscillate per second?
• Once a second => 1 Hz; the duration of one clock signal is 1 second (1 second / 1 clock = 1)
• Twice a second => 2 Hz; the duration of one clock signal is 0.5 second (1 second / 2 clocks = 0.5)

[Figure: 1 Hz and 2 Hz square-wave signals over a 1-second window]

Hz ("Hertz"): the number of oscillations per second (i.e., frequency)
Clock Cycle: the duration of one clock signal

40
40
Clock Frequency and Cycle
§ Frequency: how many times does a signal oscillate per second?
• N times per second => N Hz; the duration of one clock signal (clock cycle) is 1/N second (1 second / N clocks = 1/N)

[Figure: N Hz square-wave signal over a 1-second window]

Key Idea: the clock cycle must be longer than the critical path delay, because we sample values at the end of each clock cycle

41
Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

42
CPU Time

CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
  (e.g., a 1 GHz clock rate corresponds to a 1 ns clock cycle time)

Clock Cycle Time (clock period) = 1 / Clock Rate

§ Performance can be improved by
• Reducing the number of clock cycles (i.e., decreasing the numerator above)
• Increasing the clock rate (i.e., increasing the denominator above)

Hardware designers often need to trade off clock rate against cycle count

43
43
CPU Time Example
§ Computer A: 2 GHz clock, 10 s CPU time
§ Designing Computer B
• Aim for 6 s CPU time
• Can use a faster clock, but that causes 1.2 × the clock cycles
§ How fast must the Computer B clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2 GHz = 20 × 10^9

Clock Rate_B = (1.2 × 20 × 10^9) / 6s = (24 × 10^9) / 6s = 4 GHz
44
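A minimal sketch reproducing this calculation:

```c
#include <stdio.h>

int main(void) {
    double rate_a       = 2.0e9; /* Computer A: 2 GHz */
    double time_a       = 10.0;  /* Computer A: 10 s CPU time */
    double time_b       = 6.0;   /* target CPU time for Computer B */
    double cycle_growth = 1.2;   /* B needs 1.2x the clock cycles */

    double cycles_a = time_a * rate_a;                 /* 20e9 cycles */
    double rate_b   = cycle_growth * cycles_a / time_b;
    printf("Required clock rate for B: %.1f GHz\n", rate_b / 1e9); /* 4.0 */
    return 0;
}
```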
Instruction Count and CPI
§ Instruction Count for a program
• Determined by program, ISA, and compiler
o ISA (Instruction Set Architecture): defines the supported instructions, data types, registers, etc.
§ CPI: Clock cycles Per Instruction
• Average number of clock cycles each instruction takes when executing a program
• Determined by CPU hardware
• Each instruction has a different CPI: the average CPI is affected by the instruction mix
o e.g., 20% load/store + 80% compute vs. 30% load/store + 5% conditional + 65% compute

#Clock cycles = Instruction Count × Cycles per Instruction (CPI)

CPU Time = #Clock cycles × Clock Cycle Time
         = Instruction Count × CPI × Clock Cycle Time
         = (Instruction Count × CPI) / Clock Rate

45
Example: Using CPI to Compute CPU Time
§ Computer A: Cycle Time = 250 ps, CPI = 2.0
§ Computer B: Cycle Time = 500 ps, CPI = 1.2
§ Same ISA
§ Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250 ps = I × 500 ps
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500 ps = I × 600 ps

CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2

=> A is faster, by 1.2×

46
CPI in More Detail
§ If different instruction classes (e.g., integer add vs. floating-point add) take different numbers of cycles:

Clock Cycles = Σ_{i=1}^{n} (CPI_i × Instruction Count_i)

§ Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σ_{i=1}^{n} (CPI_i × (Instruction Count_i / Instruction Count))

The term (Instruction Count_i / Instruction Count) reflects the relative occurrence frequency of instruction class i

47
CPI Example
§ Alternative compiled code sequences using instructions in classes A, B, C

Instruction Class    A   B   C
CPI for class        1   2   3
IC in sequence 1     2   1   2
IC in sequence 2     4   1   1

§ Sequence 1: IC = 5
• Clock Cycles = 2×1 + 1×2 + 2×3 = 10
• Avg. CPI = 10/5 = 2.0
§ Sequence 2: IC = 6
• Clock Cycles = 4×1 + 1×2 + 1×3 = 9
• Avg. CPI = 9/6 = 1.5

* IC: Instruction Count 48
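A minimal sketch implementing the weighted-CPI formula from the previous slide for these two sequences:

```c
#include <stdio.h>

#define NUM_CLASSES 3

/* Average CPI given per-class CPIs and per-class instruction counts (IC). */
static double avg_cpi(const int cpi[], const int ic[]) {
    int cycles = 0, insts = 0;
    for (int i = 0; i < NUM_CLASSES; i++) {
        cycles += cpi[i] * ic[i]; /* Clock Cycles = sum(CPI_i * IC_i) */
        insts  += ic[i];
    }
    return (double)cycles / insts;
}

int main(void) {
    int cpi[NUM_CLASSES]  = {1, 2, 3}; /* classes A, B, C */
    int seq1[NUM_CLASSES] = {2, 1, 2};
    int seq2[NUM_CLASSES] = {4, 1, 1};
    printf("Sequence 1: avg CPI = %.1f\n", avg_cpi(cpi, seq1)); /* 2.0 */
    printf("Sequence 2: avg CPI = %.1f\n", avg_cpi(cpi, seq2)); /* 1.5 */
    return 0;
}
```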


Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

49
Performance Summary

CPU Time = Seconds / Program
         = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• (Instructions / Program): determined by the algorithm, programming language, compiler, and ISA
• (Clock cycles / Instruction): the CPI
• (Seconds / Clock cycle): the clock cycle time

§ Performance depends on
• Algorithm: affects Instruction Count (IC), possibly CPI
• Programming language: affects IC, CPI
• Compiler: affects IC, CPI
• Instruction Set Architecture (ISA): affects IC, CPI, clock rate
+) Microarchitecture (hardware implementation details)

50
Topics in Performance
§ Definition of performance in computer system
§ Relative performance based on the execution time
§ Clock frequency and period (clock cycle time)
§ CPU time and CPI
§ Performance Formula
§ Factors affecting the performance

51
Understanding Factors Affecting “Performance”
§ Algorithm
Ø What is the problem-solving strategy (Mathematics level)?

§ Programming Language, Compiler, and Architecture


Ø How will we generate low-level (very detailed) instructions for our computer to run
the algorithm?
Ø What kind of hardware modules do we have?

§ Microarchitecture: Processor and Memory System


Ø How are underlying hardware modules implemented?

§ Input and Output (I/O): Hardware and Software


Ø How fast can we move data into / out of the processor?

52
Understanding Factors Affecting “Performance”
§ Algorithm
• What it means
Ø What is the problem-solving strategy (Mathematics level)?

• Example: add the integers from 0 to 100
o Algorithm choice 1: add the individual numbers from 0 to 100 => 100 adds
o Algorithm choice 2: use the mathematical formula Sum(n) = n(n+1)/2 => 1 add, 1 mult, and 1 div

Which algorithm is better (lighter weight; faster)?

From a mathematics perspective, choice 2

What if we have a very efficient addition engine that can handle 128 adds every cycle?

We need to understand the underlying hardware architecture to precisely analyze performance and optimize our program

53
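A minimal sketch of the two algorithm choices (illustrative code, not from the slides):

```c
#include <stdio.h>

int main(void) {
    int n = 100;

    /* Choice 1: add the individual numbers (n adds) */
    int sum_loop = 0;
    for (int i = 0; i <= n; i++) sum_loop += i;

    /* Choice 2: closed-form formula (1 add, 1 mult, 1 div) */
    int sum_formula = n * (n + 1) / 2;

    printf("loop: %d, formula: %d\n", sum_loop, sum_formula); /* both 5050 */
    return 0;
}
```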
Understanding Factors Affecting “Performance”
§ Programming Language, Compiler, and Architecture
• What it means
Ø How will we generate instructions for our computer to run the algorithm?
• High-level Example
• You want to purchase stamps and a burger.
• Choice 1
(1) Get to the post office and purchase stamps
(2) Go back home and put your stamps on your desk
(3) Get to the Xn-N-Xout to get a burger
(4) Go back home and enjoy the burger
• Choice 2
(1) Get to the post office and purchase stamps
(2) Get to the Xn-N-Xout to get a burger
(3) Go back home, put your stamps on your desk, and enjoy the burger

Which choice is better?

What if the post office and Xn-N-Xout are in the opposite direction?
54
Understanding Factors Affecting “Performance”
§ Microarchitecture: Processor and Memory System
• What it means
Ø How are underlying hardware modules implemented?
• High-level Example
• CPU 1
• Data load from on-chip memory: 10 cycles
• Summing up 128 integers with a special instruction: 1 cycle
• Summing up 128 integers without the special instruction: 128 cycles
• CPU 2
• Data load from on-chip memory: 1 cycle
• Summing up 128 integers: 128 cycles

Which choice is better?


It depends on the problem, algorithm, and instruction choice

55
Understanding Factors Affecting “Performance”
§ Input and Output (I/O): Hardware and Software
• What it means
Ø How fast can we move data into / out of the processor?
• High-level example: the bottleneck is cooking speed
o Ingredient delivery to the restaurant: avg 20 ingredients / minute
o Kitchen: avg 1 burger / minute
o Output: avg 1 burger / minute
o Assumption: one burger needs five ingredients 56
Understanding Factors Affecting “Performance”
§ Input and Output (I/O): Hardware and Software
• What it means
Ø How fast can we move data into / out of the processor?
• High-level example: the bottleneck is ingredient delivery (i.e., I/O)
o Ingredient delivery to the restaurant: avg 20 ingredients / minute
o Kitchen: avg 6 burgers / minute
o Output: avg 4 burgers / minute (delivery-limited: 20 ingredients / 5 per burger)

57
EECS 112 (Spring 2024)
Organization of Digital Computers

Section 4. Power

58
Static and Dynamic Power in CMOS Technology
§ Static Power
• Mainly due to leakage: a small current flows through a transistor even when it is inactive (activity: a change in the value)

§ Dynamic Power
• Power consumed when we change the value of a transistor

Dynamic Power ∝ (1/2) × C × f × V^2

• C: capacitance. Depends on the technology node (e.g., 45 nm, 16 nm, 7 nm, ...) and fanout (i.e., how many transistors are connected downstream)
• f: transition frequency (i.e., how often do we change between 0 and 1?)
• V: voltage

f is related to the clock rate (the higher the clock rate, the higher f)
Power and Clock Frequency Trend over 30 Years

[Figure: clock rate and power over three decades; technology innovations (smaller transistors) and architectural optimizations (e.g., multi-core) shaped the trend]

Dynamic Power ∝ (1/2) × C × f × V^2

We reached the "power wall": the clock frequency cannot be increased further due to the heat
How Severe was the Power Wall?

Video: CPU overclocking (over 9 GHz) using liquid nitrogen

* Source: Sung Hwan Kim, "Germanium-Source Tunnel Field Effect Transistors for Ultra-Low Power Digital Logic." University of

Uniprocessor Performance

[Figure: single-processor performance over time]

Constrained by power, instruction-level parallelism, and memory latency 62
Reducing Power
§ Example: a new CPU
• 85% of the capacitive load of the old CPU (e.g., 40 nm -> 32 nm technology)
• 15% voltage reduction and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)^2 × (F_old × 0.85) / (C_old × V_old^2 × F_old)
             = 0.85^4 = 0.52

§ The power wall
• We can't reduce voltage further (lowering the voltage further makes the transistors too leaky)
• We can't remove more heat efficiently

Power is a challenge for integrated circuits:
• Power must be brought in and distributed around the chip
• Power is dissipated as heat and must be removed

§ What else can we do to improve performance?

63
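A minimal sketch checking the power-scaling arithmetic above:

```c
#include <stdio.h>

int main(void) {
    double c_scale = 0.85; /* capacitive load scaling */
    double v_scale = 0.85; /* voltage scaling */
    double f_scale = 0.85; /* frequency scaling */

    /* Dynamic power is proportional to C * V^2 * f, so the constant 1/2
       and the old-CPU values cancel out in the ratio. */
    double power_ratio = c_scale * v_scale * v_scale * f_scale;
    printf("P_new / P_old = %.2f\n", power_ratio); /* 0.52 */
    return 0;
}
```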
Multiprocessors
§ Multicore microprocessors
• More than one processor per chip

§ To fully utilize the performance potential, explicit parallel


programming is required
• Hard to do:
o Programming for performance
o Load balancing
o Optimizing communication and synchronization

Often referred to as “CMP” : Chip MultiProcessor

64
FYI: TDP vs Power Consumption
§ TDP: Thermal Design Power
• How much heat dissipation can the target cooling system (i.e., the default cooler) manage?

Common error: TDP == actual power (related, but not the same!)
1. If the load is not 100%, the power consumption is less than the TDP
2. When the load is 100%, if your cooling solution is better than the default one, your system may draw more power than the TDP

* NVIDIA, “GPU Power Primer.” 2019 65


FYI: TDP vs Power Consumption

Example: a CPU consumed at most 221.87 W under the Prime95 benchmark, while its TDP is 170 W

Note: this is provided as an example. Exact power consumption depends on many factors (e.g., computer case, room temperature, system configuration, etc.)

* AnandTech, "The AMD Ryzen 9 7950X3D Review." 2023. 66


EECS 112 (Spring 2024)
Organization of Digital Computers

Section 5. Performance Optimization Pitfalls

67
Pitfall 01: Amdahl's Law
§ Improving one aspect of a computer and expecting a proportional improvement in overall performance

T_improved = T_affected / improvement factor + T_unaffected

§ Example: multiply accounts for 80s out of 100s
• How much improvement in multiply performance is needed to get 5× overall?

20 = 80/n + 20  => Can't be done!

§ Amdahl's Law:

S_overall = 1 / (p/s + (1 - p))

• S_overall: theoretical speedup of the whole task
• s: speedup of the part of the task that benefits from improvements
• p: proportion of execution time that the part benefiting from improvements originally occupied

§ Corollary: make the common case fast
68
Deep Dive into Amdahl's Law

• f: fraction of a computation that will get a speedup from optimization
• S: the amount of speedup
• Speedup_enhanced(f, S): overall (end-to-end) speedup given f and S

* M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, pp. 33-38, July 2008. 69
Amdahl's Law: Example
• f: fraction of a computation that will get a speedup from optimization
• S: the amount of speedup
• Speedup_enhanced(f, S): overall (end-to-end) speedup given f and S

Before parallelization (1.0s total):
• 0.4s cannot be parallelized (1 - f = 40%)
• 0.6s can be parallelized (f = 60%)

Parallelize on a dual-core CPU (assumption: perfect parallelization; 2× speedup)

After parallelization (0.7s total):
• 0.4s cannot be parallelized
• 0.3s for the parallelized portion

70
Amdahl's Law: Example
• f: fraction of a computation that will get a speedup from optimization
• S: the amount of speedup
• Speedup_enhanced(f, S): overall (end-to-end) speedup given f and S

After parallelization: 0.4s (cannot be parallelized) + 0.3s (parallelized)

Speedup_enhanced(f = 0.6, S = 2) = 1 / ((1 - f) + f/S)
                                 = 1 / ((1 - 0.6) + 0.6/2)
                                 = 1 / (0.4 + 0.3)
                                 = 1 / 0.7 = 1.43

Only a 43% effective speedup with a 2× speedup on the parallelizable portion

71
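A minimal sketch of the formula, reproducing this example and showing the cap imposed by the serial fraction:

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the work is sped up by S */
static double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("f=0.6, S=2:   %.2f\n", amdahl_speedup(0.6, 2.0)); /* 1.43 */
    /* Even with a near-infinite speedup, the serial 40% caps us at 2.5x: */
    printf("f=0.6, S=1e9: %.2f\n", amdahl_speedup(0.6, 1e9)); /* ~2.50 */
    return 0;
}
```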
Amdahl's Law: Easier Version
• f: fraction of a computation that will get a speedup from optimization
• S: the amount of speedup
• Speedup_enhanced(f, S): overall (end-to-end) speedup given f and S

After parallelization: 0.4s (cannot be parallelized) + 0.3s (parallelized)

Speedup_enhanced(f = 0.6, S = 2) = Speed_after / Speed_before
                                 = (#Computation / Latency_after) / (#Computation / Latency_before)
                                 = Latency_before / Latency_after
                                 = (0.4 + 0.6) / (0.4 + 0.6/2)
                                 = 1.0 / 0.7 = 1.43

2× speedup == half the latency
72
Amdahl's Law: Understanding the Original Version

Speedup_enhanced(f, S) = 1 / ((1 - f) + f/S)

• The numerator (1) is the latency before optimization (normalized to 1)
• The denominator ((1 - f) + f/S) is the latency after optimization if we view the original latency as 1 (i.e., normalized)
• f: fraction of a computation that will get a speedup from optimization
• S: the amount of speedup
• Speedup_enhanced(f, S): overall (end-to-end) speedup given f and S

73
Amdahl's Law: Implication
§ Question: what do we want to optimize (when investing engineering effort)?

Before optimization (1.0s total): 0.2s on opt. candidate 1 + 0.8s on opt. candidate 2

§ Best case for candidate 1 (infinite speedup): 0.0s + 0.8s => 0.8s total

§ Best case for candidate 2 (infinite speedup): 0.2s + 0.0s => 0.2s total

Implication: let's optimize something significant, not minor aspects 74


Pitfall 02: Using a Subset of the Performance Equation (e.g., MIPS)
§ MIPS: Millions of Instructions Per Second
• Doesn't account for
o Differences in ISAs between computers
o Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10^6)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)
     = Clock rate / (CPI × 10^6)

• CPI varies between programs on a given CPU

75
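A minimal sketch with made-up numbers (hypothetical machines, not from the slides) showing how MIPS can mislead: the machine with the higher MIPS rating takes longer to run the same program:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical machines running the same program (illustrative numbers):
       X executes more, simpler instructions; Y executes fewer, slower ones. */
    double ic_x = 10e9, cpi_x = 1.0, rate_x = 2.0e9; /* X: 2 GHz */
    double ic_y = 6e9,  cpi_y = 2.0, rate_y = 3.0e9; /* Y: 3 GHz */

    double time_x = ic_x * cpi_x / rate_x; /* 5.0 s */
    double time_y = ic_y * cpi_y / rate_y; /* 4.0 s */
    double mips_x = ic_x / (time_x * 1e6); /* 2000 MIPS */
    double mips_y = ic_y / (time_y * 1e6); /* 1500 MIPS */

    /* X has the higher MIPS rating, but Y finishes the program sooner. */
    printf("X: %.0f MIPS, %.1f s\n", mips_x, time_x);
    printf("Y: %.0f MIPS, %.1f s\n", mips_y, time_y);
    return 0;
}
```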
Concluding Remarks
§ Cost/performance is improving
• Due to underlying technology development
§ Hierarchical layers of abstraction
• In both hardware and software
§ Instruction set architecture
• The hardware/software interface
§ Execution time: A useful performance measure
§ Power is a limiting factor
• Use parallelism to improve performance

76
