
00_CourseIntroduction

An explanation of the course introduction

Uploaded by

polix36021

SIMD – Single Instruction, Multiple Data
TPU – Tensor Processing Unit


• Single instruction, multiple data (SIMD) programming on classical CPUs
• Common GPU architectures
• Multiple data precisions: double (FP64), single (FP32), half (FP16, BF16), int8
• Data locality and new programming paradigms
• Hardware-specific language: CUDA (NVIDIA)
• OpenACC
• Kokkos (C++)
• TensorFlow (Google)
• Optimization
• GPUs with Python 3
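To get a feel for the precision levels listed above, here is a minimal host-side sketch (plain C, also valid CUDA host code compiled with nvcc); it only illustrates the general rounding behavior, not any course material:

```cuda
#include <stdio.h>

int main(void) {
    // The decimal value 0.1 is not exactly representable in binary
    // floating point; wider formats simply round it more finely.
    double d = 0.1;           // FP64: ~16 significant decimal digits
    float  f = 0.1f;          // FP32: ~7  significant decimal digits
    printf("FP64: %.17f\n", d);
    printf("FP32: %.17f\n", (double)f);
    // FP16 (~3 digits), BF16 (~2-3 digits but FP32-sized exponent range)
    // and int8 trade even more precision for memory bandwidth and
    // arithmetic throughput on GPUs and TPUs.
    return 0;
}
```

The printed FP32 value differs from the FP64 one after about the 8th digit, which is why mixed-precision codes must choose where reduced precision is acceptable.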
• Computer labs 1/2 (due September 15)
• Computer labs 3/4 (due September 22)
• Computer labs 5/6 (due September 28)
• Computer labs 7/8 (due October 10)

• We have a project for the course, UPPMAX 2023/2-23
• You need to set up an account on SUPR and on UPPMAX now; see the basic
  information at https://www.uppmax.uu.se/support/getting-started/ and specifically
  step 1 on https://www.uppmax.uu.se/support/getting-started/applying-for-a-user-account/
  (if you do not have an account yet)
• Then apply for an UPPMAX account
• We MUST follow the user rules for UPPMAX, see
  https://www.uppmax.uu.se/support/user-guides/snowy-user-guide/
• Specific information for the GPU nodes:
  https://www.uppmax.uu.se/support/user-guides/using-the-gpu-nodes-on-snowy/

HOMEWORK
Week | Day | Date | Time | Room (Ångström) | Type | Content
w35 | Tuesday | 2023-08-29 | 10:15-12:00 | 101130 | Lecture | Lecture 1: Syllabus, introduction, GPU hardware
w35 | Thursday | 2023-08-31 | 10:15-12:00 | 101130 | Lecture | Lecture 2: Background, computer architecture
w36 | Monday | 2023-09-04 | 15:15-17:00 | 4103/4102, computer room | Lab | Lab 1: Lab assignment 1
w36 | Tuesday | 2023-09-05 | 08:15-10:00 | 101130 | Lecture | Lecture 3: CUDA programming + NVIDIA whitepaper
w36 | Thursday | 2023-09-07 | 10:15-12:00 | 101130 | Lecture | Lecture 4: CUDA programming + final project
w37 | Tuesday | 2023-09-12 | 08:15-10:00 | 4103/4102, computer room | Lab | Lab 2: Lab assignment 1/2
w37 | Thursday | 2023-09-14 | 15:15-17:00 | 6K1101/6K1107, computer room | Lab | Lab 3
w37 | Friday | 2023-09-15 | 13:15-15:00 | 101130 | Lecture | Lecture 5: Neural networks + TensorFlow
w38 | Monday | 2023-09-18 | 10:15-12:00 | 10K1203, computer room | Lab | Lab 4
w38 | Friday | 2023-09-22 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 5
w39 | Monday | 2023-09-25 | 13:15-15:00 | 101130 | Lecture | Lecture 6: Kokkos
w39 | Tuesday | 2023-09-26 | 15:15-17:00 | 4103/4104 | Lab | Lab 6
w39 | Friday | 2023-09-29 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 7
w40 | Monday | 2023-10-02 | 15:15-17:00 | 4103/4102, computer room | Lab | Lab 8
w40 | Tuesday | 2023-10-03 | 13:15-15:00 | 2004, Ångström | Lecture | Lecture 7: Kokkos
w40 | Friday | 2023-10-06 | 13:15-15:00 | 10K1203, computer room | Lab | Lab 9: Project assignment
w41 | Wednesday | 2023-10-11 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 10: Time to work on the project on your own
w41 | Friday | 2023-10-13 | 13:15-15:00 | 10K1203, computer room | Lab | Lab 11: Project assignment
w43 | Wednesday | 2023-10-25 | 08:15-17:00 | 101125 | Oral exam | Discussion of the project
• GPU hardware (today)
• CPU hardware (Thursday)

• GPUs (graphics processing units) were originally built for 2D rendering of a
  3D scene (transform, clipping and lighting)
• Question: can the extended computing capabilities be used for something else?
• Programmable stages in the rendering pipeline
  • Vertex and pixel shaders
  • Vectors (colors, positions) and matrices handled natively
  • Very messy!
• NVIDIA CUDA: makes the GPU programmable

• Divisions and square roots run slower than additions and multiplications
• Trigonometric functions are sometimes implemented in software by many (20-100)
  floating point operations, sometimes in hardware
• Memory cost can differ between read and write operations
• Memory cost differs between fast cache memory and slow main memory

Device | Compute | Memory BW | Power | What
NVIDIA Ampere A100 GPU | 9.7 TFlop/s | 1.6 TB/s | 400 W | HPC-oriented GPU
Intel Xeon Platinum (Ice Lake), 40 cores, 2.3 GHz | 2.9 TFlop/s | 205 GB/s | 270 W | generic CPU
AMD Epyc 7713, 64 cores, 2.0 GHz | 2.0 TFlop/s | 205 GB/s | 225 W | generic CPU
Fujitsu A64FX, 48 cores, 2.2 GHz | 3.1 TFlop/s | 900 GB/s | 130 W | special-purpose HPC
Source: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
• CPU must be good at everything
• GPU focuses on massively parallel computations with independent work
• Less flexible and more specialized
• CPU: a lot of chip area devoted to extracting "parallelism" on the fly
  (pipelining, out-of-order execution, speculative execution)
• High clock frequencies
• Execution units with low latency but more power draw ("bigger" transistors)
• GPU: provides lots of resources to keep many threads in flight, e.g. registers
  to hold operands
• Multithreading can hide latency → limited-sized caches suffice
• Control logic is shared across many threads
1. SIMD width is not wide enough to amortize the control logic
2. SIMD execution units work with caches primarily designed for latency
   tradeoffs
3. Traditional CPU programs expose less parallelism, and the out-of-order
   mechanism does not reach far enough
4. Memory access latency is covered by prefetching, which is more power-hungry
   than the actual access
5. SIMD execution units "extend" the latency-oriented hardware → they have lower
   latency and thus bigger, more power-hungry transistors
6. Memory outside the caches (RAM) uses the less parallel but lower-latency DDR
   architecture rather than GDDR or HBM (high-bandwidth memory)
7. Higher clock frequencies (NVIDIA A100: ~1.4 GHz, CPUs: >2 GHz all-core)
• Classical CPU combined with a GPU
• The GPU is an accelerator for suitable tasks (parallel, compute-heavy)
• The CPU runs the remaining tasks (bookkeeping, setup, IO, ...)
• CPU and GPU might also share parallel tasks
• Typical hardware is a hybrid of accelerator and host
• Case for the UPPMAX GPU nodes: Intel host CPU and NVIDIA T4 accelerator
• CUDA core, or Streaming Processor (SP)
• Streaming Multiprocessor (SM): a collection of CUDA cores (8 / 16 / 32 / 192)
• All cores in one SM run the same instructions
• Has some fast, shared cache memory
• Can synchronize
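The fast shared memory and synchronization inside an SM are exposed in CUDA as `__shared__` arrays and `__syncthreads()`. A hedged sketch (names and the block size of 256 are illustrative assumptions, not from the slides) of a per-block sum:

```cuda
#include <cuda_runtime.h>

// Each block sums 256 input elements using the SM's on-chip shared memory.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];             // fast memory shared by one block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * 256 + t];
    __syncthreads();                       // wait until every thread has written

    // Tree reduction: halve the number of active threads each step.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();                   // all threads must reach the barrier
    }
    if (t == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}
```

Synchronization only works within one block on one SM; coordinating across blocks requires separate kernel launches or atomics.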

• A collection of SMs + memory
• 2x2 8-core SMs, 32 cores

Abbreviations:
C-cache – constant cache, for fast broadcast of reads from constant memory
ROP – render output unit
SFU – special function unit
MT – multithreading
I-cache – instruction cache
• A collection of SMs + memory

• 8x2 8-core SMs, 128 cores

• A collection of SMs + memory

• 10x3 8-core SMs, 240 cores

• CUDA cores
(INT/FP32/FP64)
• LD/ST (Load/Store)
• Special function units
• Tensor Core
• Register file
• Warp scheduler
• Data caches
• Instruction buffers/caches
• Texture units

Source: nvidia-ampere-architecture-whitepaper.pdf
• The US supercomputer Frontier (the first to reach 1 Exaflop/s in double
  precision) uses AMD GPUs, and is also among the greenest supercomputers
  (TFlop/s per Watt)

System | Year | PFlop/s | Architecture | Language
Fugaku | 2020 | 500 | Fujitsu ARM | C++/Fortran
Summit | 2019 | 200 | IBM CPUs + NVIDIA GPUs | CUDA
Perlmutter | 2021 | 100 | AMD CPUs + NVIDIA GPUs | CUDA
Aurora | 2021 | 1000 | Intel CPUs + Intel GPUs | SYCL
Frontier | 2021 | 1500 | AMD CPUs + AMD GPUs | HIP
El Capitan | 2023 | 2000 | AMD CPUs + AMD GPUs | HIP
• Can use pointers on both host and device transparently
• But how do we make sure the data is where we want it to be?
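One answer is CUDA unified (managed) memory, where the runtime migrates pages between host and device on demand. A minimal sketch (the kernel name `increment` is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main() {
    const int n = 1024;
    int *x;
    cudaMallocManaged(&x, n * sizeof(int));  // one pointer, valid on host AND device
    for (int i = 0; i < n; ++i) x[i] = i;    // host writes; pages live on the CPU

    increment<<<4, 256>>>(x, n);             // kernel access migrates pages to the GPU
    cudaDeviceSynchronize();                 // required before the host reads again

    printf("x[10] = %d\n", x[10]);           // pages migrate back on host access
    cudaFree(x);
}
```

When placement matters for performance, `cudaMemPrefetchAsync` can move the data to a chosen device ahead of time instead of paying for on-demand page faults.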

The host code manages:
• both host and device memory;
• data transfer between host and device;
• starting device kernels.

Device kernels:
• consist of huge numbers of parallel threads running at the same time
• divide the data-parallel workload among these threads
• the hardware switches execution between groups of threads to hide memory
  latency
