
00_CourseIntroduction

An explanation of the course introduction

Uploaded by

polix36021

SIMD – Single Instruction, Multiple Data
TPU – Tensor Processing Unit


• Single instruction, multiple data (SIMD) programming on classical CPUs
• Common GPU architectures
• Multiple data precisions: double (FP64), single (FP32), half (FP16, BF16), int8
• Data locality and new programming paradigms
• Hardware-specific language: CUDA (NVIDIA)
• OpenACC
• Kokkos (C++)
• TensorFlow (Google)
• Optimization
• GPUs with Python 3
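To get a feel for the precision levels listed above, here is a minimal host-side sketch (plain C, also valid CUDA host code compiled with nvcc); it only illustrates the general rounding behavior, not any course material:

```cuda
#include <stdio.h>

int main(void) {
    // The decimal value 0.1 is not exactly representable in binary
    // floating point; wider formats simply round it more finely.
    double d = 0.1;           // FP64: ~16 significant decimal digits
    float  f = 0.1f;          // FP32: ~7  significant decimal digits
    printf("FP64: %.17f\n", d);
    printf("FP32: %.17f\n", (double)f);
    // FP16 (~3 digits), BF16 (~2-3 digits but FP32-sized exponent range)
    // and int8 trade even more precision for memory bandwidth and
    // arithmetic throughput on GPUs and TPUs.
    return 0;
}
```

The printed FP32 value differs from the FP64 one after about the 8th digit, which is why mixed-precision codes must choose where reduced precision is acceptable.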
• Computer labs 1/2 (due September 15)
• Computer labs 3/4 (due September 22)
• Computer labs 5/6 (due September 28)
• Computer labs 7/8 (due October 10)

• We have a project for the course, UPPMAX 2023/2-23
• You need to set up an account on SUPR and on UPPMAX now; see the basic
  information at https://www.uppmax.uu.se/support/getting-started/ and specifically
  step 1 on https://www.uppmax.uu.se/support/getting-started/applying-for-a-user-account/
  (if you do not have an account yet)
• Then apply for an UPPMAX account
• We MUST follow the user rules for UPPMAX, see
  https://www.uppmax.uu.se/support/user-guides/snowy-user-guide/
• Specific information for the GPU nodes:
  https://www.uppmax.uu.se/support/user-guides/using-the-gpu-nodes-on-snowy/

HOMEWORK
Week | Day | Date | Time | Room (Ångström) | Type | Content
w35 | Tuesday | 2023-08-29 | 10:15-12:00 | 101130 | Lecture | Lecture 1: Syllabus, introduction, GPU hardware
w35 | Thursday | 2023-08-31 | 10:15-12:00 | 101130 | Lecture | Lecture 2: Background, computer architecture
w36 | Monday | 2023-09-04 | 15:15-17:00 | 4103/4102, computer room | Lab | Lab 1: Lab assignment 1
w36 | Tuesday | 2023-09-05 | 08:15-10:00 | 101130 | Lecture | Lecture 3: CUDA programming + NVIDIA whitepaper
w36 | Thursday | 2023-09-07 | 10:15-12:00 | 101130 | Lecture | Lecture 4: CUDA programming + final project
w37 | Tuesday | 2023-09-12 | 08:15-10:00 | 4103/4102, computer room | Lab | Lab 2: Lab assignment 1/2
w37 | Thursday | 2023-09-14 | 15:15-17:00 | 6K1101/6K1107, computer room | Lab | Lab 3
w37 | Friday | 2023-09-15 | 13:15-15:00 | 101130 | Lecture | Lecture 5: Neural networks + TensorFlow
w38 | Monday | 2023-09-18 | 10:15-12:00 | 10K1203, computer room | Lab | Lab 4
w38 | Friday | 2023-09-22 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 5
w39 | Monday | 2023-09-25 | 13:15-15:00 | 101130 | Lecture | Lecture 6: Kokkos
w39 | Tuesday | 2023-09-26 | 15:15-17:00 | 4103/4104 | Lab | Lab 6
w39 | Friday | 2023-09-29 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 7
w40 | Monday | 2023-10-02 | 15:15-17:00 | 4103/4102, computer room | Lab | Lab 8
w40 | Tuesday | 2023-10-03 | 13:15-15:00 | 2004, Ångström | Lecture | Lecture 7: Kokkos
w40 | Friday | 2023-10-06 | 13:15-15:00 | 10K1203, computer room | Lab | Lab 9: Project assignment
w41 | Wednesday | 2023-10-11 | 15:15-17:00 | 10K1203, computer room | Lab | Lab 10: Time to work on the project on your own
w41 | Friday | 2023-10-13 | 13:15-15:00 | 10K1203, computer room | Lab | Lab 11: Project assignment
w43 | Wednesday | 2023-10-25 | 08:15-17:00 | 101125 | Oral exam | Discussion of the project
• GPU hardware (today)
• CPU hardware (Thursday)

• GPUs (graphics processing units) were originally built for 2D rendering of a
  3D scene (transform, clipping and lighting)
• Question: can the extended computing capabilities be used for something else?
• Programmable stages in the rendering pipeline
  • Vertex and pixel shaders
  • Vectors (colors, positions) and matrices handled natively
  • Very messy!
• NVIDIA CUDA: makes the GPU programmable

• Divisions and square roots run slower than additions and multiplications
• Trigonometric functions are sometimes implemented in software by many (20-100)
  floating point operations, sometimes in hardware
• Memory cost can differ between read and write operations
• Memory cost differs between fast cache memory and slow main memory

Device | Compute | Memory BW | Power | What
NVIDIA Ampere A100 GPU | 9.7 TFlop/s | 1.6 TB/s | 400 W | HPC-oriented GPU
Intel Xeon Platinum (Ice Lake), 40 cores, 2.3 GHz | 2.9 TFlop/s | 205 GB/s | 270 W | generic CPU
AMD Epyc 7713, 64 cores, 2.0 GHz | 2.0 TFlop/s | 205 GB/s | 225 W | generic CPU
Fujitsu A64FX, 48 cores, 2.2 GHz | 3.1 TFlop/s | 900 GB/s | 130 W | special-purpose HPC
Source: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
• CPU must be good at everything
• GPU focuses on massively parallel computations with independent work
• Less flexible and more specialized
• CPU: a lot of chip area devoted to extracting "parallelism" on the fly
  (pipelining, out-of-order execution, speculative execution)
• High clock frequencies
• Execution units with low latency but more power draw ("bigger" transistors)
• GPU: provides lots of resources to keep many threads in flight, e.g. registers
  to hold operands
• Multithreading can hide latency → limited-sized caches suffice
• Control logic is shared across many threads
1. SIMD width is not wide enough to amortize the control logic
2. SIMD execution units work with caches primarily designed for latency
   tradeoffs
3. Traditional CPU programs expose less parallelism, and the out-of-order
   mechanism does not reach far enough
4. Memory access latency is covered by prefetching, which is more power-hungry
   than the actual access
5. SIMD execution units "extend" the latency-oriented hardware → they have lower
   latency and thus bigger, more power-hungry transistors
6. Memory outside the caches (RAM) uses the less parallel but lower-latency DDR
   architecture rather than GDDR or HBM (high-bandwidth memory)
7. Higher clock frequencies (NVIDIA A100: ~1.4 GHz, CPUs: >2 GHz all-core)
• Classical CPU combined with a GPU
• The GPU is an accelerator for suitable tasks (parallel, compute-heavy)
• The CPU runs the remaining tasks (bookkeeping, setup, IO, ...)
• CPU and GPU might also share parallel tasks
• Typical hardware is a hybrid of accelerator and host
• Case for the UPPMAX GPU nodes: Intel host CPU and NVIDIA T4 accelerator
• CUDA core, or Streaming Processor (SP)
• Streaming Multiprocessor (SM): a collection of CUDA cores (8 / 16 / 32 / 192)
• All cores in one SM run the same instructions
• Has some fast, shared cache memory
• Can synchronize
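The fast shared memory and synchronization inside an SM are exposed in CUDA as `__shared__` arrays and `__syncthreads()`. A hedged sketch (names and the block size of 256 are illustrative assumptions, not from the slides) of a per-block sum:

```cuda
#include <cuda_runtime.h>

// Each block sums 256 input elements using the SM's on-chip shared memory.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];             // fast memory shared by one block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * 256 + t];
    __syncthreads();                       // wait until every thread has written

    // Tree reduction: halve the number of active threads each step.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();                   // all threads must reach the barrier
    }
    if (t == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}
```

Synchronization only works within one block on one SM; coordinating across blocks requires separate kernel launches or atomics.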

• A collection of SMs + memory
• 2x2 8-core SMs, 32 cores

Abbreviations:
C-cache – constant cache, for fast broadcast of reads from constant memory
ROP – render output unit
SFU – special function unit
MT – multithreading
I-cache – instruction cache
• A collection of SMs + memory

• 8x2 8-core SMs, 128 cores

• A collection of SMs + memory

• 10x3 8-core SMs, 240 cores

• CUDA cores
(INT/FP32/FP64)
• LD/ST (Load/Store)
• Special function units
• Tensor Core
• Register file
• Warp scheduler
• Data caches
• Instruction buffers/caches
• Texture units

Source: nvidia-ampere-architecture-whitepaper.pdf
• The US supercomputer Frontier (the first to reach 1 Exaflop/s in double
  precision) uses AMD GPUs, and is also among the greenest supercomputers
  (TFlop/s per Watt)

System | Year | PFlop/s | Architecture | Language
Fugaku | 2020 | 500 | Fujitsu ARM | C++/Fortran
Summit | 2019 | 200 | IBM CPUs + NVIDIA GPUs | CUDA
Perlmutter | 2021 | 100 | AMD CPUs + NVIDIA GPUs | CUDA
Aurora | 2021 | 1000 | Intel CPUs + Intel GPUs | SYCL
Frontier | 2021 | 1500 | AMD CPUs + AMD GPUs | HIP
El Capitan | 2023 | 2000 | AMD CPUs + AMD GPUs | HIP
• Can use pointers on both host and device transparently
• But how do we make sure the data is where we want it to be?
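One answer is CUDA unified (managed) memory, where the runtime migrates pages between host and device on demand. A minimal sketch (the kernel name `increment` is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main() {
    const int n = 1024;
    int *x;
    cudaMallocManaged(&x, n * sizeof(int));  // one pointer, valid on host AND device
    for (int i = 0; i < n; ++i) x[i] = i;    // host writes; pages live on the CPU

    increment<<<4, 256>>>(x, n);             // kernel access migrates pages to the GPU
    cudaDeviceSynchronize();                 // required before the host reads again

    printf("x[10] = %d\n", x[10]);           // pages migrate back on host access
    cudaFree(x);
}
```

When placement matters for performance, `cudaMemPrefetchAsync` can move the data to a chosen device ahead of time instead of paying for on-demand page faults.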

The host code manages:
• both host and device memory;
• data transfer between host and device;
• starting device kernels.

Device kernels:
• consist of huge numbers of parallel threads running at the same time
• divide the data-parallel workload among these threads
• the hardware switches execution between groups of threads to hide memory
  latency
