Lecture 4 – Parallel and Scalable Machine Learning with High Performance Computing (Part 1)
COMPUTER MARKET
IT Environments
Embedded system
System that you don’t use directly (not a stand-alone computer with its own display and input devices)
Combination of a computer processor, memory, and input/output peripheral devices
Has a dedicated function within a larger mechanical or electrical system (e.g., washing machine)
Desktop Computing
Home computers
Performance/price ratio as the major issue
APPLICATIONS
Some problems always benefit from faster processing
Simulation and modeling (climate, earthquakes, airplane design, car design, vehicle traffic patterns, …)
Next-generation medicine
DNA sequencing, simulation of drug effects
[5] HPC and Supercomputing
RUNNING APPLICATIONS
How can you make applications run faster?
[6] WT 2013/14
MACHINE MODEL
Von Neumann Architecture
Today’s machines rely on optimization tricks (caching, buffering, etc.) to deal with the memory bottleneck of this architecture
However, this issue is becoming relevant again with multi-core and many-core systems
HOW TO DO ANYTHING FASTER
Three Ways
Main solution from 1950 to 2000: optimize the computing time per instruction (Instruction-Level Parallelism (ILP), code optimization, caching)
“… the number of transistors that can be inexpensively placed on integrated circuits is increasing exponentially, doubling approximately every two years. …” (Gordon Moore, 1965)
Moore’s Law is only about the number of transistors that can be placed on a die
Die: small block of semiconducting material on which a given functional circuit is fabricated
Transistors have reached the size of a few nanometers
E.g., 1.2 trillion transistors on a die with an area of 46,000 mm²
[8] Cerebras Wafer Scale Engine
THE INDUSTRY PROBLEM
Processor Speed Development
Companies have been shrinking the technology to try to follow Moore’s Law
By doubling the semiconductor density, they were getting performance and power improvements
Work smarter: Instruction-Level Parallelism (ILP)
Investment costs to keep up with performance demands are increasingly high; only a few fab vendors can keep up
THE FREE LUNCH IS OVER
Paradigm Shift
PARALLEL COMPUTING
Concepts
The problem should be solved in less time than with a single compute resource
HARDWARE LEVELS OF PARALLELISM
In-core parallelism: elements of vectors are processed in parallel (SIMD vector instructions; see the sketch after this list)
Process of a CPU splitting each of its physical cores into virtual cores (threads)
Multiple threads with multiple tasks are executed simultaneously on one CPU core
Although running on the same core, they are completely separated from each other
Similar in concept to preemptive multitasking but is implemented at the thread level
Intel branded this process as hyper-threading
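A minimal C++ sketch of the in-core (vector) level, with illustrative names: with optimization enabled, the compiler maps this loop onto SIMD vector instructions so several array elements are processed per instruction; the OpenMP simd pragma (compile, e.g., with g++ -O2 -fopenmp) makes the request explicit.

#include <cstdio>
#include <vector>

// Illustrative kernel: y = alpha * x + y, element-wise.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp simd  // ask the compiler to vectorize this loop
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i] + y[i];
}

int main() {
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    saxpy(3.0f, x, y);
    std::printf("y[0] = %f\n", y[0]);  // prints 5.0
    return 0;
}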
HARDWARE LEVELS OF PARALLELISM
In-processor parallelism
CPU VS. GPU
Overview
A matter of specialization
[Figure: CPU vs. GPU chip layout comparison]
[11] Andreas Herten
GPU multiprocessor
PROCESSING FLOW
CPU ⇒ GPU ⇒ CPU
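A minimal CUDA sketch of this flow, with illustrative names: data is copied from the host (CPU) to the device (GPU), a kernel is launched on the GPU, and the results are copied back to the host.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scale(float* x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // compute on GPU

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(d);
    std::printf("h[0] = %f\n", h[0]);  // prints 2.0
    delete[] h;
    return 0;
}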
SHARED MEMORY SYSTEM
Single Computer
System where a number of CPUs work on a common, shared physical address space
Programming using OpenMP (a set of compiler directives to ‘mark parallel regions’; see the sketch below)
Immediate access to all data by all processors without explicit communication
Enabled by significant advances in CPUs (microprocessor chips)
[14] OpenMP API Specification
Multi-core CPU chips have four, six, or in general n processing cores on one chip
Clock rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
o Reached a limit due to physical constraints (~5 GHz)
[Diagram: two multi-core CPU chips (CORE 1, CORE 2, …, CORE N each) sharing L3 cache / DRAM]
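A minimal OpenMP sketch of shared-memory parallelism, with illustrative names (compile, e.g., with g++ -fopenmp): one directive marks the parallel region, the loop iterations are divided among threads, and all threads read and write the same shared address space without explicit communication.

#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // The directive marks the parallel region; each thread works on a
    // chunk of the iterations, all accessing the same shared arrays.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0f * x[i];

    std::printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
    return 0;
}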
DISTRIBUTED-MEMORY SYSTEM
Multiple Computers
Programming model: MPI (Message Passing Interface; see the sketch below)
Architecture: shared-memory building blocks (nodes) interconnected with a fast communication network (e.g., InfiniBand)
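A minimal MPI sketch of distributed-memory parallelism, with illustrative names: each process owns private memory, and data is combined only through explicit messages over the communication network (here a reduction). Launched as several processes, e.g., mpirun -np 4 ./a.out, which may be spread across nodes.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    // Each process computes a private partial result in its own memory ...
    double local = rank + 1.0, total = 0.0;

    // ... and results are combined only via explicit communication.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum over %d processes = %f\n", size, total);
    MPI_Finalize();
    return 0;
}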
HIGH PERFORMANCE COMPUTING
Definition
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware, such as high-performance CPU/core interconnections.
HPC: network interconnection is important; HPC is the focus in this lecture
High Throughput Computing (HTC) is based on commonly available computing resources, such as commodity PCs and small clusters, that enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPUs/cores.
HTC: network interconnection is less important!
PARALLEL ARCHITECTURES
In Deep Learning
[Charts from [17] T. Ben-Nun and T. Hoefler (2018): breakdowns of 147 and 73 DL papers out of the 240 reviewed]
Different network technologies provide different performance [17]
Both modern Ethernet and InfiniBand provide high bandwidth
But InfiniBand has significantly lower latencies (delays) and higher message rates
TOP 500 LIST
November 2019
#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)
#1 HPC MACHINE SUMMIT
Node
Each node: two 22-core Power9 IBM CPUs
Each node: six(!) NVIDIA Tesla V100 GPUs
(i.e., 4608 nodes x 6 GPUs = 27648 GPUs)
[23] ORNL
Each V100 GPU: 5,120 CUDA cores
MULTI-GPU COMMUNICATION
HPC Systems - NVLink and NVSwitch
NVIDIA TESLA VOLTA (V100)
Selected Facts
Equipped with over 21 billion transistors, 5,120 CUDA cores, and 640 tensor cores
A tensor core is optimized for deep learning workloads (see the sketch after this list)
o Acceleration of large matrix operations
o Performs mixed-precision matrix multiply and accumulate calculations in a single operation
Predecessor of the NVIDIA V100 was the NVIDIA Tesla Pascal (P100)
Predecessor of the NVIDIA P100 was the NVIDIA Tesla Kepler (e.g., K80 or K40)
ResNet-50: over 23 million trainable parameters
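A sketch of the fused operation behind a tensor core, assuming the 4×4 tile size described in NVIDIA’s Volta material (symbols illustrative): the matrix multiply takes FP16 inputs while the accumulation is kept in FP32, all in a single operation.

\[
\underbrace{D}_{\mathrm{FP32}} = \underbrace{A}_{\mathrm{FP16}} \times \underbrace{B}_{\mathrm{FP16}} + \underbrace{C}_{\mathrm{FP32}}, \qquad A, B, C, D \in \mathbb{R}^{4 \times 4}
\]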
[28] NVIDIA V100 Tensor Core GPU
#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)
[29] Summit
THE DEEP PROJECTS
Research & innovation projects co-funded by the European Union
DEEP-EST PROJECT
Applications
DEEP-EST PROJECT
Modular Supercomputing Architecture (MSA)
MODULAR SUPERCOMPUTING ARCHITECTURE (MSA)
Scaling with GPUDirect Implementation
The problem:
EXTREME SCALE BOOSTER (ESB)
GPU-centric Programming
[32] SLURM
REFERENCES
[31] Estela Suarez, The DEEP evolution, Seminar Modular Architectures, Jülich Supercomputing Centre (JSC)
[32] SLURM: Support for Multi-core/Multi-thread Architectures, Online: https://slurm.schedmd.com/mc_support.html