Basics Computer Architecture by Pooyan Jamshidi 1731311297


Lecture 1: Introduction and Basics

Week 1: January 9 & 11, 2024

CSCE 212: Introduction to Computer Architecture | Spring 2024 | https://pooyanjamshidi.github.io/csce212/


[Slides are primarily based on those of Onur Mutlu for the Computer Architecture Course at CMU]
Basic Goals & Structure of the
Computer Architecture Course

14
What Will We Learn in This Course?

How Computers Work


(from the ground up)

15
We Will Study How Something Like This Works
[Figure: Apple M1 Ultra System (2022). Sensors, storage, and main memory surround an SoC with lots of compute & caches.]
16
https://www.gsmarena.com/apple_announces_m1_ultra_with_20core_cpu_and_64core_gpu-news-53481.php
Major High-Level Goals of This Course
• In Introduction to Computer Architecture
• Understand the basics
• Understand the principles (of design)
• Understand the precedents

• Based on such understanding:
  - learn how a modern computer works underneath
  - evaluate tradeoffs of different designs and ideas
  - implement a principled design (a simple microprocessor)
  - learn to systematically debug increasingly complex systems
  - hopefully enable you to develop novel, out-of-the-box designs

• The focus is on basics, principles, precedents, and how to use them to create/implement good designs
17
Why These Goals?
• Because you are here for a Computer Science degree!

• Regardless of your future direction, learning the principles of computer architecture will be useful to
  - design better hardware
  - design better software
  - design better systems
  - make better tradeoffs in design
  - understand why computers behave the way they do
  - solve problems better
  - think “in parallel”
  - think critically
  - …
18
Course Components
• Lectures (understanding concepts)
• Readings (reinforcing & going deeper)
• Homework (problem-solving & preparation)
• Project (hands-on experience in some concepts)
• Exam (test of understanding)

• In all, you have the freedom to adapt to your learning style

• My advice: Focus on learning & scholarship & understanding

19
Learning & Exam

• We will enable you to learn + prepare you for the exam

• My suggestions:
  - focus on understanding, learning, mastering the material
    • lectures, readings, labs, HWs all enable this and prepare you
  - reinforce problem solving skills with homeworks
  - do not worry about the exam while listening to lectures
    • most of you will pass this course (historically >80%)

• We will release a lot of material to help you with the exam
  - Problem solving sessions
  - Exam guidance
  - All past exams (and basic solutions) are already online
20
Summary

n Learning is for life (never ends)

Focus on
learning and scholarship
21
How to Approach This Course

Learning experience
Long-term tradeoff
analysis
Critical thinking &
decision making
22
How to Approach This Course

Your mindset
will determine
what you
get out of the course
23
How to Approach This Course

Find and choose


the learning style
that works best for you

24
What Will We Learn in This Course?

25
Answer

How Computers Work


(from the ground up)

26
Answer Continued

And Why We Care

27
Why Do We Have Computers?

28
Why Do We Do Computing?

29
Answer

To Solve Problems

30
Answer Reworded

To Gain Insight

Hamming, “Numerical Methods for Scientists and Engineers,” 1962.
31


Answer Extended

To Enable
a Better Life & Future

32
How Does a Computer
Solve Problems?

33
Answer

Orchestrating Electrons

In today’s dominant technologies


34
How Do Problems
Get Solved by Electrons?

35
So, I Hope You Are Here for This
“C” as a model of computation (CSCE 145/206)
Programmer’s view of how a computer system works

• How does an assembly program end up executing as digital logic?
• What happens in-between?
• How is a computer designed using logic gates and wires to satisfy specific goals?

Architect/microarchitect’s view: How to design a computer that meets system design goals. Choices critically affect both the SW programmer and the HW designer.

HW designer’s view of how a computer system works
Digital logic as a model of computation (CSCE 211)
36
The Transformation Hierarchy

Problem
Algorithm
Program/Language
System Software
SW/HW Interface
Micro-architecture
Logic
Devices
Electrons

[Figure brackets: Computer Architecture (expanded view) on the left and Computer Architecture (narrow view) on the right, both drawn at the SW/HW Interface level]

37
Levels of Transformation
“The purpose of computing is [to gain] insight” (Richard Hamming)
We gain and generate insight by solving problems
How do we ensure problems are solved by electrons?

Levels: Problem, Algorithm, Program/Language, Runtime System (VM, OS, MM), ISA (Architecture), Microarchitecture, Logic, Devices, Electrons

Algorithm: Step-by-step procedure that is guaranteed to terminate where each step is precisely stated and can be carried out by a computer
- Finiteness
- Definiteness
- Effective computability
Many algorithms for the same problem

ISA (Instruction Set Architecture): Interface/contract between SW and HW. What the programmer assumes hardware will satisfy.

Microarchitecture: An implementation of the ISA

Digital logic circuits: Building blocks of micro-arch (e.g., gates)
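As a small illustration of the three algorithm properties listed above, here is a minimal sketch using Euclid's greatest-common-divisor algorithm. The choice of algorithm and the function name `gcd` are ours for illustration and do not come from the slides.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    if a < 0 or b < 0:
        raise ValueError("inputs must be non-negative")
    # Finiteness: b strictly decreases toward 0 on every iteration,
    # so the loop is guaranteed to terminate.
    while b != 0:
        # Definiteness: each step is one precisely stated operation.
        # Effective computability: a remainder is something a computer can carry out.
        a, b = b, a % b
    return a
```

The same problem also admits other algorithms (for example, subtraction-based GCD), which is the point of "many algorithms for the same problem".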
38
Computer Architecture
• is the science and art of designing computing platforms (hardware, interface, system SW, and programming model)

• to achieve a set of design goals
  - E.g., highest performance on earth on workloads X, Y, Z
  - E.g., longest battery life at a form factor that fits in your pocket with cost < $$$ CHF
  - E.g., best average performance across all known workloads at the best performance/cost ratio
  - …

  - Designing a supercomputer is different from designing a smartphone → But, many fundamental principles are similar
39
Axiom
To achieve the highest energy efficiency and performance:
we must take the expanded view of computer architecture

Problem
Algorithm
Program/Language
System Software
SW/HW Interface
Micro-architecture
Logic
Devices
Electrons

Co-design across the hierarchy: Algorithms to devices
Specialize as much as possible within the design goals
40
Different Platforms, Different Goals

41
Source: http://www.sia-online.org (Semiconductor Industry Association)
Different Platforms, Different Goals

Source: https://iq.intel.com/5-awesome-uses-for-drone-technology/
42
Different Platforms, Different Goals

Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg
43
Different Platforms, Different Goals

Source: http://sm.pcmag.com/pcmag_uk/photo/g/google-self-driving-car-the-guts/google-self-driving-car-the-guts_dwx8.jpg
44
Different Platforms, Different Goals

Source: http://datacentervoice.com/wp-content/uploads/2015/10/data-center.jpg
45
Different Platforms, Different Goals

Source: https://fossbytes.com/wp-content/uploads/2015/06/Supercomputer-TIANHE2-china.jpg
46
Different Platforms, Different Goals

Source: https://www.itmagazine.ch/artikel/72401/Fugaku_Der_schnellste_Supercomputer_der_Welt.html
47
Different Platforms, Different Goals

Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.

48
Different Platforms, Different Goals

250 TFLOPS per chip in 2021
vs 90 TFLOPS in TPU3

New ML applications (vs. TPU3):
• Computer vision
• Natural Language Processing (NLP)
• Recommender system
• Reinforcement learning that plays Go

1 ExaFLOPS per board
https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests
49
Different Platforms, Different Goals
• ML accelerator: 260 mm², 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
• Two redundant chips for better safety.

50
https://www.youtube.com/watch?v=j0z4FweCy4M
Different Platforms, Different Goals
• Tesla Dojo Chip & System

51
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Different Platforms, Different Goals
• Tesla Dojo Chip & System

52
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Different Platforms, Different Goals
• Tesla Dojo Chip & System

53
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Different Platforms, Different Goals

NVIDIA is claiming a 7x improvement in dynamic programming algorithm (DPX instructions) performance on a single H100 versus naïve execution on an A100.

54
https://www.nvidia.com/en-us/data-center/h100/
Different Platforms, Different Goals

• The largest ML accelerator chip (2021)
• 850,000 cores

              Cerebras WSE-2    Largest GPU (NVIDIA Ampere GA100)
Transistors   2.6 Trillion      54.2 Billion
Die area      46,225 mm²        826 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
55
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Different Platforms, Different Goals
Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu
“Accelerating Genome Analysis: A Primer on an Ongoing Journey” IEEE Micro, August 2020.

MinION from ONT

SmidgION from ONT


56
Different Platforms, Different Goals
[Figure: UPMEM-based PIM system. A host with two CPUs (CPU 0 and CPU 1) connects to conventional main memory built from DRAM chips (labeled x2) and to PIM-enabled memory built from PIM chips (labeled x10).]

57
https://arxiv.org/pdf/2105.03814.pdf
Axiom
To achieve the highest energy efficiency and performance:
we must take the expanded view of computer architecture

Problem
Algorithm
Program/Language
System Software
SW/HW Interface
Micro-architecture
Logic
Devices
Electrons

Co-design across the hierarchy: Algorithms to devices
Specialize as much as possible within the design goals
58
What is Computer Architecture?

• The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.

59
Why Study Computer Architecture?
• Enable better systems: make computers faster, cheaper, smaller, more reliable, …
  - By exploiting advances and changes in underlying technology/circuits

• Enable new applications
  - Life-like 3D visualization 20 years ago? Virtual reality?
  - Self-driving cars?
  - Personalized genomics? Personalized medicine?

• Enable better solutions to problems
  - Software innovation is built on trends and changes in computer architecture
    • > 50% performance improvement per year has enabled this innovation

• Understand why computers work the way they do
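To get a feel for what a sustained improvement of more than 50% per year adds up to, the sketch below simply compounds an annual gain. The function name `compound_speedup` is hypothetical, and 0.50 is only the slide's lower bound used as an assumption, not a measured figure.

```python
def compound_speedup(annual_gain: float, years: int) -> float:
    """Total speedup after `years` of compounding `annual_gain` (0.50 means +50% per year)."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 + annual_gain) ** years


if __name__ == "__main__":
    # Assumed rate: the slide's lower bound of 50% per year.
    for years in (1, 5, 10):
        print(f"{years:2d} years -> {compound_speedup(0.50, years):.1f}x")
```

Under that assumption a decade of compounding yields a speedup of well over an order of magnitude, which is why software could keep growing in ambition without changing its algorithms.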


60
Computer Architecture Today (I)
• Today is a very exciting time to study computer architecture

• Industry is in a large paradigm shift (to novel architectures)
  - many different potential system designs possible

• Many difficult problems motivating and caused by the shift
  - Huge hunger for data and new data-intensive applications
  - Power/energy/thermal constraints
  - Complexity of design
  - Difficulties in technology scaling
  - Memory bottleneck
  - Reliability problems
  - Programmability problems
  - Security and privacy issues
• No clear, definitive answers to these problems
61
Computer Architecture Today (II)
• Computing landscape is very different from 10-20 years ago

• Applications and technology both demand novel architectures

[Figure labels: Hybrid Main Memory; Persistent Memory/Storage; Heterogeneous Processors and Accelerators; General Purpose GPUs]

Every component and its interfaces, as well as entire system designs are being re-examined
62
Historical: Opportunities at the Bottom

https://en.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_the_Bottom
65
Historical: Opportunities at the Bottom (II)

https://en.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_the_Bottom
66
Historical: Opportunities at the Top

67
https://www.science.org/doi/10.1126/science.aam9744
Axiom, Revisited

There is plenty of room both at the top and at the bottom

but much more so

when you

communicate well between and optimize across

the top and the bottom

68
Hence the Expanded View

Problem
Algorithm
Program/Language
System Software
SW/HW Interface
[Figure bracket: Computer Architecture (expanded view)]
Micro-architecture
Logic
Devices
Electrons

69
Computer Architecture
Why Is It So Exciting Today?

70
Many Interesting Things
Are Happening Today
in Computer Architecture

71
Many Interesting Things
Are Happening Today
in Computer Architecture

Performance
Energy Efficiency
Sustainability
72
Many Interesting Things
Are Happening Today
in Computer Architecture

Reliability
Safety
Security
Privacy 73
Many Interesting Things
Are Happening Today
in Computer Architecture

More Demanding Workloads

74
Many Interesting Things
Are Happening Today
in Computer Architecture

New (Device) Technologies

75
Many Interesting Things
Are Happening Today
in Computer Architecture

76
Many Interesting Things
Are Happening Today
in Computer Architecture

Performance
Energy Efficiency
Sustainability
77
Do We Want This?

Source: V. Milutinovic 78
Or This?

Source: V. Milutinovic 79
Challenge and Opportunity for Future

High Performance,
Energy Efficient,
Sustainable
80
Many Difficult Problems: Climate

Source: https://farm9.staticflickr.com/8571/16376102935_8628150df8_o.jpg
81
Many Difficult Problems: Intelligence

Source: http://spectrum.ieee.org/image/MjYzMzAyMg.jpeg
82
Many Difficult Problems: Intelligence

Source: http://spectrum.ieee.org/image/MjYzMzAyMg.jpeg
Source: https://www.forbes.com/sites/robtoews/2020/06/17/deep-learnings-climate-change-problem/
83
Source: https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
Many Difficult Problems: Congestion

Source: https://blogs-images.forbes.com/jimgorzelany/files/2015/10/China-G4-backup-this-oct-reuters.jpg
84
Many Difficult Problems: Public Health

Source: https://blog.wego.com/7-crowded-places-and-events-that-you-will-love/
85
Many Difficult Problems: Genome Analysis
[Figure: Number of Genomes Sequenced, annotated with the development of high-throughput sequencing (HTS) technologies]

http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped
86
Huge Demand for Performance & Efficiency

Source: https://youtu.be/Bh13Idwcb0Q?t=283
87
Computation vs. Data Storage Dichotomy
[Figure: Apple M1 Ultra System (2022). Sensors, storage, and main memory surround an SoC with lots of compute & caches.]
88
https://www.gsmarena.com/apple_announces_m1_ultra_with_20core_cpu_and_64core_gpu-news-53481.php
Data Movement vs. Computation Energy

Dally, HiPEAC 2015

A memory access consumes ~100-1000X the energy of a complex addition
89
Data Movement vs. Computation Energy
Energy for a 32-bit Operation (figure: bar chart, log scale)

Operation        Energy (pJ)
ADD (int)        0.1
ADD (float)      0.9
Register File    1
MULT (int)       3.1
MULT (float)     3.7
SRAM Cache       5
DRAM             640

Relative cost of DRAM vs. ADD (int): 6400X

A memory access consumes 6400X the energy of a simple integer addition


90
Han+, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” ISCA 2016.
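The 6400X figure follows directly from the per-operation energies in the chart (640 pJ for a DRAM access against 0.1 pJ for a 32-bit integer add). The sketch below reproduces that ratio and estimates the energy of a kernel from operation counts. The helper names and the example operation counts are hypothetical; only the picojoule values come from the slide.

```python
# Energy per 32-bit operation in picojoules, as read from the slide's chart.
ENERGY_PJ = {
    "ADD (int)": 0.1,
    "ADD (float)": 0.9,
    "Register File": 1.0,
    "MULT (int)": 3.1,
    "MULT (float)": 3.7,
    "SRAM Cache": 5.0,
    "DRAM": 640.0,
}


def relative_cost(op: str, baseline: str = "ADD (int)") -> float:
    """Energy of `op` expressed as a multiple of the baseline operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[baseline]


def total_energy_pj(op_counts: dict) -> float:
    """Total energy in pJ for a mapping {operation name: number of operations}."""
    return sum(ENERGY_PJ[op] * count for op, count in op_counts.items())


if __name__ == "__main__":
    print(f"DRAM vs ADD (int): {relative_cost('DRAM'):.0f}X")
    # Hypothetical kernel: 1000 integer adds whose operands come from DRAM or from SRAM.
    from_dram = total_energy_pj({"ADD (int)": 1000, "DRAM": 1000})
    from_sram = total_energy_pj({"ADD (int)": 1000, "SRAM Cache": 1000})
    print(f"operands from DRAM: {from_dram:.0f} pJ, from SRAM: {from_sram:.0f} pJ")
```

In this toy accounting the additions themselves are a rounding error; nearly all of the energy goes to moving the data, which motivates the data-centric designs discussed next.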
Challenge and Opportunity for Future

Computing Architectures
with
Minimal Data Movement

91
UPMEM Processing-in-DRAM Engine (2019)
• Processing in DRAM Engine
• Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

• Replaces standard DIMMs
  - DDR4 R-DIMM modules
    • 8GB+128 DPUs (16 PIM chips)
    • Standard 2x-nm DRAM process
  - Large amounts of compute & memory bandwidth

https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem
92
https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/
UPMEM Memory Modules

• E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
• P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz

www.upmem.com
2,560-DPU Processing-in-Memory System
[Figure: UPMEM-based PIM system. A host with two CPUs (CPU 0 and CPU 1) connects to conventional main memory built from DRAM chips (labeled x2) and to PIM-enabled memory built from PIM chips (labeled x10).]

94
https://arxiv.org/pdf/2105.03814.pdf
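The numbers on the UPMEM slides can be cross-checked with a little arithmetic. This sketch assumes the 2,560-DPU system is built entirely from the DIMMs described earlier (8 GB and 128 DPUs across 16 PIM chips per DIMM); that assumption and all variable names are ours.

```python
# Figures taken from the slides.
DIMM_CAPACITY_GB = 8
DPUS_PER_DIMM = 128
PIM_CHIPS_PER_DIMM = 16
SYSTEM_DPUS = 2560

# Derived quantities (assuming every DIMM in the system is of the type above).
dpus_per_chip = DPUS_PER_DIMM // PIM_CHIPS_PER_DIMM        # DPUs in each PIM chip
mb_per_dpu = DIMM_CAPACITY_GB * 1024 // DPUS_PER_DIMM      # DRAM capacity per DPU, in MB
dimms_in_system = SYSTEM_DPUS // DPUS_PER_DIMM             # PIM DIMMs needed
pim_capacity_gb = dimms_in_system * DIMM_CAPACITY_GB       # total PIM-enabled memory

if __name__ == "__main__":
    print(f"{dpus_per_chip} DPUs per chip, {mb_per_dpu} MB per DPU")
    print(f"{dimms_in_system} DIMMs, {pim_capacity_gb} GB of PIM-enabled memory")
```

Under this assumption the DIMM count is consistent with a two-CPU host that has ten PIM DIMMs attached to each CPU.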
FPGA-based Processing Near Memory
• Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu,
"FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications"
IEEE Micro (IEEE MICRO), to appear, 2021.

95
Samsung Function-in-Memory DRAM (2021)

96
https://news.samsung.com/global/samsung-develops-industrys-first-high-bandwidth-memory-with-ai-processing-power
Samsung Function-in-Memory DRAM (2021)

97
Samsung Function-in-Memory DRAM (2021)

98
Samsung Function-in-Memory DRAM (2021)

99
Samsung Function-in-Memory DRAM (2021)

100
Samsung AxDIMM (2021)
• DDRx-PIM
  - Deep learning recommendation system

[Figure: Baseline System compared with AxDIMM System]

101
Ke et al. "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM", IEEE Micro (2021)
SK Hynix Accelerator-in-Memory (2022)

102
https://news.skhynix.com/sk-hynix-develops-pim-next-generation-ai-accelerator/
AliBaba PIM Recommendation System (2022)

103
PIM Review and Open Problems

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun,
"A Modern Primer on Processing in Memory"
Invited Book Chapter in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, Springer, to be published in 2021.

107
https://arxiv.org/pdf/1903.03988.pdf
Cerebras’s Wafer Scale ML Engine (2019)

• The largest ML accelerator chip
• 400,000 cores

              Cerebras WSE      Largest GPU (NVIDIA TITAN V)
Transistors   1.2 Trillion      21.1 Billion
Die area      46,225 mm²        815 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
112
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Cerebras’s Wafer Scale ML Engine-2 (2021)

• The largest ML accelerator chip (2021)
• 850,000 cores

              Cerebras WSE-2    Largest GPU (NVIDIA Ampere GA100)
Transistors   2.6 Trillion      54.2 Billion
Die area      46,225 mm²        826 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
113
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Challenge and Opportunity for Future

Computing Architectures
with
Minimal Data Movement

114
Challenge and Opportunity for Future

Fundamentally
Energy-Efficient
(Data-Centric)
Computing Architectures
115
Challenge and Opportunity for Future

Fundamentally
High-Performance
(Data-Centric)
Computing Architectures
116
Many Interesting Things
Are Happening Today
in Computer Architecture

Performance
Energy Efficiency
Sustainability
Specialized Accelerators 117
Apple M1 System on Chip (2021)

118
Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested
Apple M1 Max System on Chip (2021)

119
Source: https://www.anandtech.com/show/17024/apple-m1-max-performance-review
Bigger and More Powerful Systems (2021)

120
Source: https://www.golem.de/news/m1-pro-max-dieses-apple-silicon-ist-gigantisch-2110-160415.html
Bigger and More Powerful Systems (2022)

121
https://www.anandtech.com/show/17431/apple-announces-m2-soc-apple-silicon-updated-for-2022
Google’s Video Coding Unit (2021)

122
Source: https://dl.acm.org/doi/pdf/10.1145/3445814.3446723
Google’s Video Coding Unit (2021)

Source: https://dl.acm.org/doi/pdf/10.1145/3445814.3446723
123
Source: https://arstechnica.com/gadgets/2021/04/youtube-is-now-building-its-own-video-transcoding-chips/
TESLA Full Self-Driving Computer (2019)
• ML accelerator: 260 mm², 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
• Two redundant chips for better safety.

124
https://youtu.be/Ucp0TTmvqOE?t=4236
Tesla Dojo ML Training Chip (2021)
• Tesla Dojo Chip

125
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Tesla Dojo ML Training System (2021)
• Tesla Dojo System

126
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Tesla Dojo ML Training System (2021)
• Tesla Dojo Chip & System

127
https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s
Cerebras’s Wafer Scale ML Engine (2019)

• The largest ML accelerator chip
• 400,000 cores

              Cerebras WSE      Largest GPU (NVIDIA TITAN V)
Transistors   1.2 Trillion      21.1 Billion
Die area      46,225 mm²        815 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
128
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Cerebras’s Wafer Scale ML Engine-2 (2021)

• The largest ML accelerator chip (2021)
• 850,000 cores

              Cerebras WSE-2    Largest GPU (NVIDIA Ampere GA100)
Transistors   2.6 Trillion      54.2 Billion
Die area      46,225 mm²        826 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
129
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Google Tensor Processing Unit (~2016)

Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.

130
Google TPU Generation II (2017)

4 TPU chips vs 1 chip in TPU1
High Bandwidth Memory vs DDR3
Floating point operations vs FP16
45 TFLOPS per chip vs 23 TOPS
Designed for training and inference vs only inference

https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/
131
Google TPU Generation III

More High Bandwidth Memory
More Systolic Arrays
132
https://cloud.google.com/tpu/docs/system-architecture
Google TPU Generation IV (2021)

250 TFLOPS per chip in 2021
vs 90 TFLOPS in TPU3

New ML applications (vs. TPU3):
• Computer vision
• Natural Language Processing (NLP)
• Recommender system
• Reinforcement learning that plays Go

1 ExaFLOPS per board
https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests
133
An Example Modern Systolic Array: TPU (II)

Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.
134
An Example Modern Systolic Array: TPU (III)

135
Many (Other) AI/ML Chips
• Alibaba
• Amazon
• Facebook
• Google
• Huawei
• Intel
• Microsoft
• NVIDIA
• Tesla
• Many Others and Many Startups…

• Many More to Come…


136
Many (Other) AI/ML Chips (2021)
• Alibaba
• Amazon
• Facebook
• Google
• Huawei
• Microsoft
• NVIDIA
• Tesla
• Many Startups…

• Many More to Come…

137
https://basicmi.github.io/AI-Chip/
Recall Our Axiom
To achieve the highest energy efficiency and performance:
we must take the expanded view of computer architecture

Problem
Algorithm
Program/Language
System Software
SW/HW Interface
Micro-architecture
Logic
Devices
Electrons

Co-design across the hierarchy: Algorithms to devices
Specialize as much as possible within the design goals
138
Many Interesting Things
Are Happening Today
in Computer Architecture

Reliability
Safety
Security
Privacy 139
Collapse of the “Galloping Gertie”

Source: AP
140
http://www.wsdot.wa.gov/tnbhistory/connections/connections3.htm
Another View

Source: AP
Source: http://www.seattlepi.com/science/article/A-Tacoma-Narrows-Galloping-Gertie-bridge-6617030.php
141


How Secure Are These People?

Security is about preventing unforeseen consequences


Source: https://s-media-cache-ak0.pinimg.com/originals/48/09/54/4809543a9c7700246a0cf8acdae27abf.jpg
142
How Safe & Secure Is This Platform?

Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg
143


Security: RowHammer (2014)

144
The Story of RowHammer
• One can predictably induce bit flips in commodity DRAM chips
  - All tested DRAM chips are vulnerable

• First example of how a simple hardware failure mechanism can create a widespread system security vulnerability

145
Modern DRAM is Prone to Disturbance Errors

[Figure: a DRAM array of rows of cells, each row driven by a wordline. The hammered row is repeatedly opened and closed, toggling its wordline between V_HIGH and V_LOW, which disturbs the adjacent victim rows.]

Repeatedly reading a row enough times (before memory gets refreshed) induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)
146
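The mechanism above can be pictured with a deliberately simplified toy model: count row activations since the last refresh and, once a row has been activated often enough, flip a bit in its physical neighbors. This is a teaching sketch only. The class `ToyDram`, its `flip_threshold` parameter, and the rule that exactly one bit flips are hypothetical, and every read is treated as a fresh activation; real devices behave statistically and depend on the chip and its data pattern.

```python
class ToyDram:
    """Toy model of DRAM disturbance errors; not a model of any real chip."""

    def __init__(self, num_rows: int, flip_threshold: int):
        self.rows = [0xFF] * num_rows          # one byte of stored data per row
        self.flip_threshold = flip_threshold   # hypothetical activation count
        self.activations = [0] * num_rows      # activations since last refresh

    def read(self, row: int) -> int:
        """Activate (open) a row and return its data."""
        self.activations[row] += 1
        if self.activations[row] == self.flip_threshold:
            # Disturb the physically adjacent (victim) rows.
            for victim in (row - 1, row + 1):
                if 0 <= victim < len(self.rows):
                    self.rows[victim] &= 0xFE  # lowest bit flips from 1 to 0
        return self.rows[row]

    def refresh(self) -> None:
        """Refresh restores cell charge, so accumulated disturbance starts over."""
        self.activations = [0] * len(self.rows)


if __name__ == "__main__":
    dram = ToyDram(num_rows=4, flip_threshold=5)
    for _ in range(5):
        dram.read(1)                           # hammer row 1
    print([hex(value) for value in dram.rows])
```

Note that the victim rows are corrupted without ever being accessed themselves, and that refreshing before the threshold is reached prevents the flips in this model.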
Most DRAM Modules Are Vulnerable
                     A company      B company      C company
Vulnerable modules   86% (37/43)    83% (45/54)    88% (28/32)
Errors               Up to 1.0×10^7 Up to 2.7×10^6 Up to 3.3×10^5

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)
147
One Can Take Over an Otherwise-Secure System

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn+, 2015)

148
Security: RowHammer (2014)

149
More Security Implications (II)
“Can gain control of a smart phone deterministically”

Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS’16

153
Source: https://fossbytes.com/drammer-rowhammer-attack-android-root-devices/
More Security Implications (VI)
• IEEE S&P 2020
More Security Implications (VII)

• USENIX Security 2019

More Security Implications (VIII)

• USENIX Security 2020


Can We Truly Depend on Computers?

Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg
157


Security: Meltdown and Spectre (2018)

158
Source: J. Masters, Redhat, FOSDEM 2018 keynote talk.
Silent Data Corruption In-the-Field (2021)

159
https://www.youtube.com/watch?v=QMF3rqhjYuM
Silent Data Corruption In-the-Field (2021)

160
https://www.youtube.com/watch?v=QMF3rqhjYuM
Many Interesting Things
Are Happening Today
in Computer Architecture

More Demanding Workloads

161
Huge Demand for Performance & Efficiency

Source: https://youtu.be/Bh13Idwcb0Q?t=283
162


Increasingly Demanding Applications

Dream

and, they will come


As applications push boundaries, computing platforms will become increasingly strained.

163
New Genome Sequencing Technologies

Oxford Nanopore MinION

Senol Cali+, “Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions,” Briefings in Bioinformatics, 2018.
[Preliminary arxiv.org version]

Data → performance & energy bottleneck

164
Why Do We Care? An Example

165
Source: https://nanoporetech.com/about-us/news/200-oxford-nanopore-sequencers-have-left-uk-china-support-rapid-near-sample
Population-Scale Microbiome Profiling

https://blog.wego.com/7-crowded-places-and-events-that-you-will-love/
166
City-Scale Microbiome Profiling

Afshinnekoo+, "Geospatial Resolution of Human and Bacterial Diversity with City-Scale Metagenomics", Cell Systems, 2015
167
Example: Rapid Surveillance of Ebola Outbreak

Quick+, “Real-time, portable genome sequencing for Ebola surveillance”, Nature, 2016
168
High-Throughput Genome Sequencers
[Figure: sequencers pictured: Oxford Nanopore PromethION; Pacific Biosciences Sequel II; Illumina MiSeq; Oxford Nanopore MinION; Oxford Nanopore SmidgION; Illumina NovaSeq 6000; Pacific Biosciences RS II]

… and more! All produce data with different properties.
169
High-Throughput Genome Sequencers
Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu
“Accelerating Genome Analysis: A Primer on an Ongoing Journey” IEEE Micro, August 2020.

MinION from ONT

SmidgION from ONT


170
The Genomic Era
[Figure: Number of Genomes Sequenced, annotated with the development of high-throughput sequencing (HTS) technologies]

http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped
171
Genome Analysis

[Figure: genome analysis pipeline. (1) Sequencing produces billions of short reads; (2) Read Mapping aligns each short read against the reference genome (read alignment, illustrated with a dynamic-programming matrix); (3) Variant Calling; (4) Scientific Discovery.]

Data → performance & energy bottleneck
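The dynamic-programming matrix in the read alignment step can be sketched with the classic edit-distance recurrence. This is a minimal textbook version (unit cost for substitutions, insertions, and deletions), not the specific aligner or scoring scheme used in the work the slides cite; the function name `edit_distance` is ours.

```python
def edit_distance(read: str, ref: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning `read` into `ref`."""
    # prev[j] holds the distance between the first i-1 characters of `read`
    # and the first j characters of `ref`; only two rows are kept in memory.
    prev = list(range(len(ref) + 1))
    for i, read_base in enumerate(read, start=1):
        curr = [i]
        for j, ref_base in enumerate(ref, start=1):
            mismatch = 0 if read_base == ref_base else 1
            curr.append(min(
                prev[j] + 1,             # delete a base from the read
                curr[j - 1] + 1,         # insert a base into the read
                prev[j - 1] + mismatch,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

The work grows with the product of the two lengths for every candidate location, so repeating it for billions of reads is what turns read mapping into a performance and energy bottleneck.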


We Need Faster & Scalable Genome Analysis

• Understanding genetic variations, species, evolution, …
• Predicting the presence and relative abundance of microbes in a sample
• Rapid surveillance of disease outbreaks
• Developing personalized medicine

And, many, many other applications …
173
Our Dream (circa 2007)

• An embedded device that can perform comprehensive genome analysis in real time (within a minute)
  - Which of these DNAs does this DNA segment match with?
  - What is the likely genetic disposition of this patient to this drug?
  - What disease/condition might this particular DNA/RNA piece be associated with?
  - What potential viruses & variants might be lurking around?
  - ...

174
Software Acceleration: Eliminate Useless Work

n Download the source code and try for yourself
q Download link to FastHASH

175
Hardware Acceleration: Vectorizable Algorithms
https://github.com/CMU-SAFARI/Shifted-Hamming-Distance

Xin+, "Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter to Accelerate Alignment Verification in Read Mapping", Bioinformatics 2015.

176
GateKeeper: FPGA-Based Acceleration

[Figure: read mapping pipeline with a pre-alignment filter. (1) High-throughput DNA sequencing (HTS) technologies produce billions of short reads. (2) Read Pre-Alignment Filtering, the first FPGA-based alignment filter (a hardware accelerator), is fast with a low false positive rate and reduces the candidate mappings from the order of 10^12 to the order of 10^3. (3) Read Alignment is slow with zero false positives. Filter design points range from low speed and high accuracy, through medium speed and medium accuracy, to high speed and low accuracy.]
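
The idea behind pre-alignment filtering can be sketched in software. The example below is a deliberately simple stand-in, not GateKeeper, Shifted Hamming Distance, or any other published filter: it uses base counts to compute a cheap lower bound on edit distance and rejects a candidate pair only when that bound already exceeds the allowed threshold. The names `base_count_lower_bound` and `passes_filter` are hypothetical.

```python
from collections import Counter


def base_count_lower_bound(read: str, ref: str) -> int:
    """Cheap lower bound on the edit distance between two strings."""
    a, b = Counter(read), Counter(ref)
    surplus = sum((a - b).values())  # bases the read has in excess
    deficit = sum((b - a).values())  # bases the read is missing
    # A single edit lowers the surplus by at most one and the deficit by
    # at most one, so at least max(surplus, deficit) edits are needed.
    return max(surplus, deficit)


def passes_filter(read: str, ref: str, max_edits: int) -> bool:
    """Return False only if the pair provably needs more than max_edits."""
    return base_count_lower_bound(read, ref) <= max_edits


if __name__ == "__main__":
    print(passes_filter("AAAA", "CCCC", max_edits=2))  # provably too different
    print(passes_filter("ACGT", "TGCA", max_edits=0))  # passes, yet not identical
```

Pairs that pass still go to full alignment, so a filter like this can let false positives through (the second call above), but it never rejects a pair that is within the threshold. That is the division of labor shown in the figure: a fast, approximate filter in front of a slow, exact aligner.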
177
GateKeeper: FPGA-Based Acceleration
n Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur
Mutlu, and Can Alkan
"GateKeeper: A New Hardware Architecture for
Accelerating Pre-Alignment in DNA Short Read Mapping"
Bioinformatics, [published online, May 31], 2017.
[Source Code]
[Online link at Bioinformatics Journal]

178
In-Memory DNA Sequence Analysis
n Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan
Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu,
"GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-
Memory Technologies"
BMC Genomics, 2018.
Proceedings of the 16th Asia Pacific Bioinformatics Conference (APBC), Yokohama, Japan, January
2018.
[Slides (pptx) (pdf)]
[Source Code]
[arxiv.org Version (pdf)]
[Talk Video at AACBB 2019]

179
Shouji (障子) [Alser+, Bioinformatics 2019]
Mohammed Alser, Hasan Hassan, Akash Kumar, Onur Mutlu, and Can Alkan,
"Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment"
Bioinformatics, [published online, March 28], 2019.
[Source Code]
[Online link at Bioinformatics Journal]

180
SneakySnake [Alser+, Bioinformatics 2020]
Mohammed Alser, Taha Shahroodi, Juan Gomez-Luna, Can Alkan, and Onur Mutlu,
"SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment
Filter for CPUs, GPUs, and FPGAs"
Bioinformatics, to appear in 2020.
[Source Code]
[Online link at Bioinformatics Journal]

181
GenASM Framework [MICRO 2020]
n Damla Senol Cali, Gurpreet S. Kalsi, Zulal Bingol, Can Firtina, Lavanya Subramanian, Jeremie S.
Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand,
Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu,
"GenASM: A High-Performance, Low-Power Approximate String Matching
Acceleration Framework for Genome Sequence Analysis"
Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual,
October 2020.
[Lightning Talk Video (1.5 minutes)]
[Lightning Talk Slides (pptx) (pdf)]
[Talk Video (18 minutes)]
[Slides (pptx) (pdf)]

182
SeGraM Framework [ISCA 2022]
n Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zulal Bingol, Gurpreet S.
Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi,
Gagandeep Singh, Juan Gomez-Luna, Nour Almadhoun Alserr, Mohammed Alser,
Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu,
"SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph
and Sequence-to-Sequence Mapping"
Proceedings of the 49th International Symposium on Computer Architecture (ISCA), New
York, June 2022.
[arXiv version]

https://arxiv.org/pdf/2205.05883.pdf
183
FPGA-based Near-Memory Analytics
n Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios
Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu,
"FPGA-based Near-Memory Acceleration of Modern Data-Intensive
Applications"
IEEE Micro (IEEE MICRO), 2021.

184
In-Storage Genome Filtering [ASPLOS 2022]
n Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid
Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata
Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu,
"GenStore: A High-Performance and Energy-Efficient In-Storage Computing
System for Genome Sequence Analysis"
Proceedings of the 27th International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), Virtual, February-March
2022.
[Lightning Talk Slides (pptx) (pdf)]

185
Future of Genome Sequencing & Analysis
Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu
“Accelerating Genome Analysis: A Primer on an Ongoing Journey” IEEE Micro, August 2020.

MinION from ONT

SmidgION from ONT


186
COVID-19 Nanopore Sequencing (I)

• From ONT (https://nanoporetech.com/covid-19/overview)


187
COVID-19 Nanopore Sequencing (II)

• From ONT (https://nanoporetech.com/covid-19/overview)


188
Accelerating Genome Analysis: Overview
n Mohammed Alser, Zulal Bingol, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can
Alkan, and Onur Mutlu,
"Accelerating Genome Analysis: A Primer on an Ongoing Journey"
IEEE Micro (IEEE MICRO), Vol. 40, No. 5, pages 65-75, September/October 2020.
[Slides (pptx)(pdf)]
[Talk Video (1 hour 2 minutes)]

189
Beginner Reading on Genome Analysis
Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao,
Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu
“From Molecules to Genomic Variations to Scientific Discovery:
Intelligent Algorithms and Architectures for Intelligent Genome Analysis”
Computational and Structural Biotechnology Journal, 2022
[Source code]

https://arxiv.org/pdf/2205.07957.pdf
190
192
Many Interesting Things
Are Happening Today
in Computer Architecture

More Demanding Workloads

193
The Problem

Computing
is Bottlenecked by Data

194
Data is Key for AI, ML, Genomics, …

n Important workloads are all data intensive

n They require rapid and efficient processing of large amounts of data

n Data is increasing
q We can generate more than we can process

195
Data is Key for Future Workloads

In-memory Databases [Mao+, EuroSys’12; Clapp+ (Intel), IISWC’15]
Graph/Tree Processing [Xu+, IISWC’12; Umuroglu+, FPL’15]
In-Memory Data Analytics [Clapp+ (Intel), IISWC’15; Awan+, BDCloud’15]
Datacenter Workloads [Kanev+ (Google), ISCA’15]
Data Overwhelms Modern Machines

In-memory Databases [Mao+, EuroSys’12; Clapp+ (Intel), IISWC’15]
Graph/Tree Processing [Xu+, IISWC’12; Umuroglu+, FPL’15]
In-Memory Data Analytics [Clapp+ (Intel), IISWC’15; Awan+, BDCloud’15]
Datacenter Workloads [Kanev+ (Google), ISCA’15]

Data → performance & energy bottleneck
Data is Key for Future Workloads

Chrome: Google’s web browser
TensorFlow Mobile: Google’s machine learning framework
Video Playback: Google’s video codec
Video Capture: Google’s video codec
Data Overwhelms Modern Machines

Chrome: Google’s web browser
TensorFlow Mobile: Google’s machine learning framework
Video Playback: Google’s video codec
Video Capture: Google’s video codec

Data → performance & energy bottleneck
Data Movement Overwhelms Modern Machines
n Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul
Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu,
"Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks"
Proceedings of the 23rd International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018.

62.7% of the total system energy is spent on data movement

200
Data Movement Overwhelms Accelerators
n Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira,
Xiaoyu Ma, Eric Shiu, and Onur Mutlu,
"Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine
Learning Inference Bottlenecks"
Proceedings of the 30th International Conference on Parallel Architectures and Compilation
Techniques (PACT), Virtual, September 2021.
[Slides (pptx) (pdf)]
[Talk Video (14 minutes)]

> 90% of the total system energy is spent on memory in large ML models

201
Data Movement vs. Computation Energy

Dally, HiPEAC 2015

A memory access consumes ~100-1000X the energy of a complex addition
202
Data Movement vs. Computation Energy
[Chart: Energy for a 32-bit Operation, in pJ (log scale), with cost relative to ADD (int)]
ADD (int): 0.1 pJ
ADD (float): 0.9 pJ
Register File: 1 pJ
MULT (int): 3.1 pJ
MULT (float): 3.7 pJ
SRAM Cache: 5 pJ
DRAM: 640 pJ (6400X)

A memory access consumes 6400X the energy of a simple integer addition


203
Han+, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” ISCA 2016.
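
The relative costs quoted on this slide follow directly from the per-operation energies. A small sketch using the values shown on the chart (the dictionary name `ENERGY_PJ` and the helper `relative_cost` are illustrative, not from the source):

```python
# Energy per 32-bit operation in picojoules, as listed on the chart.
ENERGY_PJ = {
    "ADD (int)": 0.1,
    "ADD (float)": 0.9,
    "Register File": 1.0,
    "MULT (int)": 3.1,
    "MULT (float)": 3.7,
    "SRAM Cache": 5.0,
    "DRAM": 640.0,
}


def relative_cost(op: str, baseline: str = "ADD (int)") -> float:
    """Energy of `op` expressed as a multiple of the baseline operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[baseline]


if __name__ == "__main__":
    # Print every operation as a multiple of an integer addition.
    for op in ENERGY_PJ:
        print(f"{op:14s} {relative_cost(op):8.0f}x")
```

Changing the baseline (for example to "SRAM Cache") shows how much more a DRAM access costs than an on-chip cache access under the same numbers.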
Many Interesting Things
Are Happening Today
in Computer Architecture

204
Many Novel Concepts Investigated Today
n New Computing Paradigms (Rethinking the Full Stack)
q Processing in Memory, Processing Near Data
q Neuromorphic Computing, Quantum Computing
q Fundamentally Secure and Dependable Computers

n New Accelerators & Systems (Algorithm-Hardware Co-Designs)
q Artificial Intelligence & Machine Learning
q Graph & Data Analytics, Vision, Video
q Genome Analysis

n New Memories, Storage Systems, Interconnects, Devices
q Non-Volatile Main Memory, Intelligent Memory Systems, Quantum
q High-Speed Interconnects, Disaggregated Systems
205
Increasingly Demanding Applications

Dream

and, they will come


As applications push boundaries, computing platforms will become increasingly strained.

206
Increasingly Diverging/Complex Tradeoffs
[Chart: Energy for a 32-bit Operation, in pJ (log scale), with cost relative to ADD (int)]
ADD (int): 0.1 pJ
ADD (float): 0.9 pJ
Register File: 1 pJ
MULT (int): 3.1 pJ
MULT (float): 3.7 pJ
SRAM Cache: 5 pJ
DRAM: 640 pJ (6400X)

A memory access consumes 6400X the energy of a simple integer addition


207
Han+, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” ISCA 2016.
Increasingly Complex Systems

Past systems

Microprocessor Main Memory Storage (SSD/HDD)

208
Increasingly Complex Systems
Modern systems
Heterogeneous Processors and Accelerators
FPGAs
(General Purpose) GPUs
Hybrid Main Memory
Persistent Memory/Storage


209
Increasingly Complex Systems on Chip

210
Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested
Bigger and More Powerful Systems (2021)

211
Source: https://www.golem.de/news/m1-pro-max-dieses-apple-silicon-ist-gigantisch-2110-160415.html
Computer Architecture Today
n Computing landscape is very different from 10-20 years ago

n Applications and technology both demand novel architectures

Every component and its interfaces, as well as entire system designs, are being re-examined
[Figure labels: Heterogeneous Processors and Accelerators; General Purpose GPUs; Hybrid Main Memory; Persistent Memory/Storage]
212
Computer Architecture Today (II)
n You can revolutionize the way computers are built, if you
understand both the hardware and the software (and
change each accordingly)

n You can invent new paradigms for computation, communication, and storage

n Recommended book: Thomas Kuhn, “The Structure of Scientific Revolutions” (1962)
q Pre-paradigm science: no clear consensus in the field
q Normal science: dominant theory used to explain/improve
things (business as usual); exceptions considered anomalies
q Revolutionary science: underlying assumptions re-examined
213
Takeaways
n It is an exciting time to be understanding and designing
computing architectures

n Many challenging and exciting problems in platform design
q That no one has tackled (or thought about) before
q That can have huge impact on the world’s future

n Driven by huge hunger for data (Big Data), new applications (ML/AI, graph analytics, genomics), ever-greater realism, …
q We can easily collect more data than we can analyze/understand

n Driven by significant difficulties in keeping up with that hunger at the technology layer
q Five walls: Energy, reliability, complexity, security, scalability

215
Let’s Start with Some Puzzles

a.k.a. Computer Architecture resembles Building Architecture

216
What Is This?

217
Source: https://www.flickr.com/photos/tambako/2286064777/in/photostream/
What About This?

Source: By Toni_V, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=4087256


218
What Do the Following
Have in Common?

219
Gare do Oriente, Lisbon

220
Source: By Martín Gómez Tagle - Lisbon, Portugal, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=13764903
Milwaukee Art Museum

221
Source: By Andrew C. from Flagstaff, USA - Flickr, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=379223
Athens Olympic Stadium

222
Source: By Spyrosdrakopoulos - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=16172519
City of Arts and Sciences, Valencia

223
Source: CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=172107
Florida Polytechnic University (I)

224
Source: http://www.architectmagazine.com/design/buildings/florida-polytechnic-university-designed-by-santiago-calatrava_o
Oculus, New York City

225
Source: https://www.dezeen.com/2016/08/29/santiago-calatrava-oculus-world-trade-center-transportation-hub-new-york-photographs-hufton-crow/
What do All Those Have in Common
with Bahnhof Stadelhofen?

226
Answer: All Designed by a Famous Architect
n ETH Alumnus, PhD Civil Engineering

n “The train station has several of the features that became signatures of his work; straight lines and right angles are rare.”

Source: By 準建築人手札網站 Forgemind ArchiMedia - Flickr: IMG_2489.JPG, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=31493356, https://en.wikipedia.org/wiki/Santiago_Calatrava
227
Your First Comp. Architecture Assignment
n Go and find the closest Calatrava building to this classroom
q For those who like a challenge, find the Calatrava-designed building furthest from this classroom :)

n Appreciate the beauty & out-of-the-box and creative thinking
n Think about tradeoffs in the design
q Strengths, weaknesses, goals of design
n Derive principles on your own for good design and innovation

n Due date: Any time during or after this course
q Later during the course is better
q Apply what you have learned in this course
q Think out-of-the-box
228
But First, Today’s First Assignment

229
Find The Differences of
This and That

230
This

231
Source: By Toni_V from Zurich, Switzerland - Stadelhofen2, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=4087256
That

232
Source: http://cookiemagik.deviantart.com/art/Train-station-207266944 - Göttingen, DE
Many Tradeoffs Between Two Designs
n You can list them after you complete the first assignment…

233
Aside: Evaluation Criteria for the Designs
n Functionality (Does it meet the specification?)
n Reliability
n Space requirement
n Cost
n Expandability
n Comfort level of users
n Happiness level of users
n Aesthetics
n Security
n …

n How to evaluate goodness of design is always a critical question → “Performance” evaluation and metrics
234
A Key Question
n How was Calatrava able to design his buildings, especially his key ones?
n Can have many guesses
q (Very) hard work, perseverance, dedication (over decades)
q Experience
q Creativity, Out-of-the-box thinking
q A good understanding of past designs
q Good judgment and intuition
q Strong skill combination (math, architecture, art, engineering, …)
q Funding ($$$$), luck, initiative, entrepreneurialism
q Strong understanding of and commitment to fundamentals
q Principled design
q …

n You will be exposed to and hopefully develop/enhance many of these skills in this course
235
Principled Design
n “To me, there are two overriding principles to be found in
nature which are most appropriate for building:
q one is the optimal use of material,
q the other the capacity of organisms to change shape, to grow,
and to move.”
q Santiago Calatrava

n “Calatrava's constructions are inspired by natural forms like plants, bird wings, and the human body.”

Source: http://www.arcspace.com/exhibitions/unsorted/santiago-calatrava/
236
Gare do Oriente, Lisbon, Revisited

Source: By Martín Gómez Tagle - Lisbon, Portugal, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=13764903
237
Source: http://www.arcspace.com/exhibitions/unsorted/santiago-calatrava/
A Principled Design

238
What Does This Remind You Of?

239
Source: https://www.dezeen.com/2016/08/29/santiago-calatrava-oculus-world-trade-center-transportation-hub-new-york-photographs-hufton-crow/
The Architect’s Answer

240
Source: https://en.wikipedia.org/wiki/World_Trade_Center_station_(PATH)
Strengths and Praise

241
Source: https://en.wikipedia.org/wiki/World_Trade_Center_station_(PATH)
Design Constraints and Criticism

242
Source: https://en.wikipedia.org/wiki/World_Trade_Center_station_(PATH)
Source: https://en.wikipedia.org/wiki/Stegosaurus

Susannah Maidment et al. & Natural History Museum, London - Maidment SCR, Brassey C, Barrett PM (2015) The Postcranial Skeleton of an Exceptionally Complete Individual of the Plated Dinosaur Stegosaurus stenops (Dinosauria: Thyreophora) from the Upper Jurassic Morrison Formation of Wyoming, U.S.A. PLoS ONE 10(10): e0138352. doi:10.1371/journal.pone.0138352
243
Design Constraints: No One is Immune

244
Source: https://en.wikipedia.org/wiki/World_Trade_Center_station_(PATH)
The Lecture Was Slightly Different
When I Was at CMU

245
What Is This?

246
Source: https://roadtrippers.com/stories/falling-water
Answer: Masterpiece of A Famous Architect

Source: https://en.wikipedia.org/wiki/Fallingwater
247


Find The Differences of
This and That

248
This

Source: http://www.fallingwater.org/
249


This

250
That

251
A Key Question
n How was Wright able to design his masterpiece?
n Can have many guesses
q (Very) hard work, perseverance, dedication (over decades)
q Experience
q Creativity, Out-of-the-box thinking
q A good understanding of past designs
q Good judgment and intuition
q Strong skill combination (math, architecture, art, engineering, …)
q Funding ($$$$), luck, initiative, entrepreneurialism
q Strong understanding of and commitment to fundamentals
q Principled design
q …

n You will be exposed to and hopefully develop/enhance many of these skills in this course
252
A Quote from The Architect Himself
n “architecture […] based upon principle, and not upon
precedent”

253
Source: http://www.fallingwater.org/
A Principled Design

254
Takeaways

n It all starts from the basic building blocks and design principles

n And, knowledge of how to use, apply, enhance them

n Underlying technology might change (e.g., steel vs. wood)
q but methods of taking advantage of technology bear resemblance
q methods used for design depend on the principles employed

256
The Same Applies to Processor Chips
n There are basic building blocks and design principles

AMD Barcelona: 4 cores
Intel Core i7: 8 cores
IBM Cell BE: 8+1 cores
IBM POWER7: 8 cores
Sun Niagara II: 8 cores
Nvidia Fermi: 448 “cores”
Intel SCC: 48 cores, networked
Tilera TILE Gx: 100 cores, networked
257
The Same Applies to Computing Systems
n There are basic building blocks and design principles

258
Source: http://www.sia-online.org (Semiconductor Industry Association)
The Same Applies to Computing Systems
n There are basic building blocks and design principles

Source: http://datacentervoice.com/wp-content/uploads/2015/10/data-center.jpg
259
Different Platforms, Different Goals

Source: https://iq.intel.com/5-awesome-uses-for-drone-technology/
260
Different Platforms, Different Goals

Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg
261

Different Platforms, Different Goals

Source: http://sm.pcmag.com/pcmag_uk/photo/g/google-self-driving-car-the-guts/google-self-driving-car-the-guts_dwx8.jpg
262

Different Platforms, Different Goals

Source: https://fossbytes.com/wp-content/uploads/2015/06/Supercomputer-TIANHE2-china.jpg
263

Different Platforms, Different Goals

Source: https://www.itmagazine.ch/artikel/72401/Fugaku_Der_schnellste_Supercomputer_der_Welt.html
264

Apple M1 Max System on Chip (2021)

265
Source: https://www.anandtech.com/show/17024/apple-m1-max-performance-review
Google Tensor Processing Unit (~2016)

Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.

266
Google TPU Generation IV (2021)

250 TFLOPS per chip in 2021 vs. 90 TFLOPS in TPU3
1 ExaFLOPS per board

New ML applications (vs. TPU3):
• Computer vision
• Natural Language Processing (NLP)
• Recommender system
• Reinforcement learning that plays Go
https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests
267
TESLA Full Self-Driving Computer (2019)
n ML accelerator: 260 mm2, 6 billion transistors,
600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
n Two redundant chips for better safety.

268
https://youtu.be/Ucp0TTmvqOE?t=4236
Cerebras’s Wafer Scale ML Engine-2 (2021)

n The largest ML accelerator chip (2021)
n 850,000 cores

Cerebras WSE-2: 2.6 Trillion transistors, 46,225 mm2
Largest GPU (NVIDIA Ampere GA100): 54.2 Billion transistors, 826 mm2
https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
269
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Google’s Video Coding Unit (2021)

270
Source: https://dl.acm.org/doi/pdf/10.1145/3445814.3446723
UPMEM Processing-in-DRAM Engine (2019)
n Processing in DRAM Engine
n Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.
n Replaces standard DIMMs
q DDR4 R-DIMM modules
n 8GB+128 DPUs (16 PIM chips)
n Standard 2x-nm DRAM process
q Large amounts of compute & memory bandwidth

https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem
271
https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/
Different Platforms, Different Goals
[Figure: a PIM-enabled system. A host with two CPUs (CPU 0, CPU 1) is attached to conventional main memory built from DRAM chips (labeled x2) and to PIM-enabled memory built from PIM chips (labeled x10).]

272
https://arxiv.org/pdf/2105.03814.pdf
Samsung Function-in-Memory DRAM (2021)

273
Samsung AxDIMM (2021)
Baseline System
n DDRx-PIM
q Deep learning recommendation system

AxDIMM System

274
Ke et al. "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM", IEEE Micro (2021)
Alibaba PIM Recommendation System (2022)

275
Recall: Takeaways

n It all starts from the basic building blocks and design principles

n And, knowledge of how to use, apply, enhance them

n Underlying technology might change (e.g., steel vs. wood)
q but methods of taking advantage of technology bear resemblance
q methods used for design depend on the principles employed

276
Basic Building Blocks
n Electrons
n Transistors
n Logic Gates
n Combinational Logic Circuits
n Sequential Logic Circuits
q Storage Elements and Memory
n …
n Cores
n Caches
n Interconnect
n Memories
n ...
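
To make the layering concrete, here is a small software model, offered as an illustrative sketch rather than a hardware description: a single NAND gate is used to build XOR, XOR and NAND build a one-bit full adder, and full adders are chained into a ripple-carry adder. All function names and the 8-bit default width are assumptions chosen for the example.

```python
def nand(a: int, b: int) -> int:
    """A single NAND gate on 1-bit inputs."""
    return 1 - (a & b)


def xor(a: int, b: int) -> int:
    """XOR built from four NAND gates."""
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder; returns (sum, carry_out)."""
    partial = xor(a, b)
    total = xor(partial, cin)
    # carry = (a AND b) OR (partial AND cin), expressed with NAND gates only
    carry = nand(nand(a, b), nand(partial, cin))
    return total, carry


def ripple_add(x: int, y: int, width: int = 8) -> int:
    """Add two unsigned integers bit by bit, least significant bit first."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result  # the carry out of the top bit is dropped


if __name__ == "__main__":
    print(ripple_add(100, 27))
    print(ripple_add(255, 1))  # wraps around at 8 bits
```

Each function only uses the ones defined above it, which mirrors how gates compose into combinational logic circuits and, further up, into larger blocks.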
277
Reading Assignments for This Week

n Chapter 1 in Harris & Harris

n Supplementary Lecture Slides on Binary Numbers

n Chapters 1-2 in Patt and Patel

278
Recall: High-Level Goals of This Course
n In Digital Design & Computer Architecture

n Understand the basics
n Understand the principles (of design)
n Understand the precedents

n Based on such understanding:
q learn how a modern computer works underneath
q evaluate tradeoffs of different designs and ideas
q implement a principled design (a simple microprocessor)
q learn to systematically debug increasingly complex systems
q Hopefully enable you to develop novel, out-of-the-box designs

n The focus is on basics, principles, precedents, and how to use them to create/implement good designs
279
Recall: Why These Goals?
n Because you are here for a Computer Science degree

n Regardless of your future direction, learning the principles of digital design & computer architecture will be useful to
q design better hardware
q design better software
q design better systems
q make better tradeoffs in design
q understand why computers behave the way they do
q solve problems better
q think “in parallel”
q think critically
q …
280
Course Info and Logistics

281
If You Need Help
■ Post your question on Moodle Q&A Forum
❑ https://moodle-app2.let.ethz.ch/course/view.php?id=19395
❑ We will create a forum on Moodle for each activity
❑ Preferred for technical questions

■ Write an e-mail to: [email protected]
❑ The instructor and all assistants will receive this e-mail

■ Come to office hours
❑ We will provide office locations & Zoom links
❑ TBD

282
Where to Get Up-to-date Course Info?
■ Website:
❑ https://pooyanjamshidi.github.io/csce212/
❑ Lecture slides and (videos)
❑ Readings
❑ Course schedule, handouts, FAQs
❑ Software
❑ Any other useful information for the course
❑ Check frequently for announcements and due dates
❑ This is your single point of access to all resources

■ TA

283
Reading Assignments for This Week

n Chapter 1 in Harris & Harris

n Chapters 1-2 in Patt and Patel (encouraged)

284
Reading Assignments for Next Week
n Combinational Logic chapters from both books
q Harris and Harris, Chapter 2
q Patt and Patel, Chapter 3

n Check the course website for all future readings
q Required
q Recommended
q Mentioned

285
Future Lectures and Assignments
■ You can also anticipate (and plan for) future lectures and
assignments based on Spring 2023 schedule:
❑ https://pooyanjamshidi.github.io/csce212/lectures/

286
287
Takeaways
n It is an exciting time to be understanding and designing
computing architectures

n Many challenging and exciting problems in platform design
q That no one has tackled (or thought about) before
q That can have huge impact on the world’s future

n Driven by huge hunger for data (Big Data), new applications (ML/AI, graph analytics, genomics), ever-greater realism, …
q We can easily collect more data than we can analyze/understand

n Driven by significant difficulties in keeping up with that hunger at the technology layer
q Five walls: Energy, reliability, complexity, security, scalability

289
Major High-Level Goals of This Course
In Computer Architecture
n Understand the basics

n Understand the principles (of design)

n Understand the precedents

n Based on such understanding:
q learn how a modern computer works underneath
q evaluate tradeoffs of different designs and ideas
q implement a principled design (a simple microprocessor)
q Hopefully enable you to develop novel, out-of-the-box designs

n The focus is on basics, principles, precedents, and how to use them to create/implement good designs; tradeoffs are important!
291
Why These Goals?
n Because you are here for a Computer Science degree

n Regardless of your future direction, learning the principles of computer architecture will be useful to
q design better systems (software + hardware)
q make better tradeoffs in design
q understand why computers behave the way they do
q solve problems better
q think “in parallel”
q think critically
q …

292
I presume you all know the number systems?

n Binary Numbers
n Hexadecimal Numbers
n Bits, Bytes, Words
n least significant bit (lsb), most significant bit (msb)
n Least Significant Byte (LSB), Most Significant Byte (MSB)
n KB, MB, GB, TB
n Binary Addition
n Signed Binary Numbers
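
As a quick self-check on these topics, the sketch below converts between binary, hexadecimal, and decimal and encodes signed values. It assumes that "signed binary numbers" means two's complement, which the slide does not state, and the helper names `to_twos_complement` and `from_twos_complement` are made up for the example.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Bit pattern of a signed integer in `bits`-bit two's complement."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Masking maps negative values onto their two's-complement pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Signed value of a two's-complement bit string (msb first)."""
    raw = int(pattern, 2)
    # A leading 1 (the msb) means the value is negative.
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw


if __name__ == "__main__":
    n = 0b1011_0110                # a binary literal
    print(bin(n), hex(n), n)       # the same number in base 2, 16, and 10
    print(to_twos_complement(-5))  # 8-bit pattern for -5
    print(from_twos_complement("11111011"))
    # Binary addition on 8-bit patterns: keep only the low 8 bits.
    print(format((0b0110_0100 + 0b0001_1011) & 0xFF, "08b"))
```

Two's-complement addition uses the same adder as unsigned addition; only the interpretation of the most significant bit changes.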

293
