
SCALABLE MACHINE LEARNING FOR REMOTE SENSING BIG DATA

Lecture 4 – Parallel and Scalable Machine Learning with High Performance Computing (part 1)

DR.-ING. GABRIELE CAVALLARO


HIGH PRODUCTIVITY DATA PROCESSING RESEARCH GROUP
JUELICH SUPERCOMPUTING CENTRE (JSC)
OUTLINE

 The Free Lunch is Over
 Moore's Law
 Work Harder and Work Smarter
 Get Help: Many-Core Era
 Hardware Levels of Parallelism
 In-Core, In-Processor, Single and Multiple Computers
 Graphics Processing Units (GPUs)
 High Performance Computing (HPC)
 TOP500
 Architectures of HPC Systems
 Modular Supercomputing Architecture
 The DEEP Projects
COMPUTER MARKET
IT Environments
 Embedded systems
 A system that you don't use directly (not a computer by itself, with a display and input devices)
 A combination of a computer processor, memory, and input/output peripheral devices
 Has a dedicated function within a larger mechanical or electrical system (e.g., a washing machine)

 Embedded computing: real-time behavior
 The system guarantees to finish at a defined point (the task is done in a specific amount of time)
 Fast computing is not the main interest
 Power consumption and price are the major issues

[1] Embedded system

[2] Embedded system example


COMPUTER MARKET
IT Environments

 Desktop Computing
 Home computers
 Performance/price ratio is the major issue

[3] Desktop Computer

 Servers
 The key: maximum performance, maintainability, and reliability
 Business service provisioning as the major goal
 Web servers, banking back-ends, order processing, …

[4] Microsoft opens UK data centres

APPLICATIONS
Some problems always benefit from faster processing

 Simulation and modeling (climate, earthquakes, airplane design, car design, vehicle traffic patterns, …)

 Machine (deep) learning with big data

 Web search, social networks

 Modern computer games

 Next-generation medicine
 DNA sequencing, simulation of drug effects
[5] HPC and Supercomputing

 Business data processing

 Graphic effects on consumer devices

RUNNING APPLICATIONS
How can you become faster?

 Split the application into instructions (generated by the compiler)
 These instructions are then given to the system
 We want to execute these instructions as fast as possible

[6] WT 2013/14

MACHINE MODEL
Von Neumann Architecture

 First computers had fixed programs (e.g., electronic calculators)

 Von Neumann Architecture (1945, for the EDVAC project)
 Radical idea: load code into memory (i.e., code becomes data)
 Use one machine for doing several things (general-purpose computer)
 Von Neumann bottleneck (bus): the memory is accessed at least 3 times per instruction
 (1) Read the instruction, (2) read the input data and compute, (3) write the result
 The instruction set for control flows is stored in memory
 The program is treated as data, which allows the exchange of code during runtime and self-modification [6]

 Today's machines rely on optimization tricks (caching, buffering, etc.) to deal with this problem
 However, this issue is becoming relevant again with multi-core and many-core systems
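As a minimal illustration (a hypothetical mapping, not a real instruction trace), even a one-line statement triggers the three bus accesses listed above:

```cpp
// Sketch: one statement on a Von Neumann machine and its bus traffic.
int a = 0, b = 2, c = 3;   // variables live in the same memory as the code

int main() {
    // (1) the ADD instruction itself is read from memory (code is data),
    // (2) the operands b and c are read from memory and the sum computed,
    // (3) the result is written back to memory (variable a).
    a = b + c;
    return a;
}
```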
HOW TO DO ANYTHING FASTER
Three Ways

 E.g., washing your car, doing homework, processing instructions in a computer

 In hardware terminology, there are three ways:

 1. Work harder: increase the frequency (speed) of the hardware
 Main solution from 1950 to 2000

 2. Work smarter: optimize the computing time per instruction
 Instruction-Level Parallelism (ILP), code optimization, caching

 3. Get help: use more than one processing unit
 An option that was never really exploited until lately (nowadays),
 since the first two options were good enough

[6] WT 2013/14
MOORE’S LAW

 “… the number of transistors that can be inexpensively placed on integrated circuits is increasing
exponentially, doubling approximately every two years. …” (Gordon Moore, 1965)

 Rule of exponential growth

 Applied to many IT hardware developments

 Sometimes misinterpreted as performance indication


[7] Moore's law

 Meanwhile a self-fulfilling prophecy

 May still hold for the next 10-20 years

 It is only about the number of transistors that can be placed on a die
 Die: small block of semiconducting material on which a given functional circuit is fabricated
 Transistors have reached the size of a few nanometers
 E.g., 1.2 trillion transistors on a die with an area of 46,000 mm²
[8] Cerebras Wafer Scale Engine
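Since the law is an exponential-growth formula, a small sketch can make it concrete (assumed baseline: the 1971 Intel 4004 with roughly 2,300 transistors):

```cpp
#include <cmath>
#include <cstdio>

// Transistors per die, doubling every two years from the assumed baseline.
double transistors(int year) {
    const double n0 = 2300.0;                          // Intel 4004, 1971
    return n0 * std::pow(2.0, (year - 1971) / 2.0);    // one doubling per 2 years
}

int main() {
    // Projects to the order of 10^11 transistors around 2022, which is the
    // same order of magnitude as today's largest dies.
    std::printf("2022: %.2e transistors\n", transistors(2022));
}
```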
THE INDUSTRY PROBLEM
Processor Speed Development

 Companies have been shrinking the technology to try to follow Moore's Law
 By doubling the density of semiconductors, they were getting performance and power improvements

 After a while the improvements started plateauing:
 Silicon clock frequency is flattening out ('work harder' no longer pays off)
 Power is flattening out with silicon
 Performance is flattening out with silicon ('work smarter', i.e., Instruction-Level Parallelism (ILP), no longer pays off)

 Investment costs to keep up with performance demands:
 Increasingly high
 Only a few fab vendors can keep up

[9] Herb Sutter

THE FREE LUNCH IS OVER
Paradigm Shift

 Clock speed curve flattened in 2003 (constraints in physics):
 Power wall
 Instruction-Level Parallelism (ILP) wall
 Memory wall

 Speeding up the serial instruction execution through clock speed improvements
 ('work harder') no longer works

 We stumbled into the Many-Core Era
 Use the additional transistors to build more cores
 Get more performance (speed-up) with many cores
 The task of reaching better performance is now a responsibility of software developers

[9] Herb Sutter

PARALLEL COMPUTING
Concepts

 Simultaneous use of multiple compute resources to solve a computational problem:
 Break the problem into discrete parts that can be solved concurrently
 Each part is further broken down into a series of instructions
 Instructions from each part execute simultaneously on different processors
 An overall control/coordination mechanism is employed
 'Compute elements' (e.g., cores) solve the problem in a cooperative way

[10] Introduction to Parallel Computing

 The problem should be solved in less time than with a single compute resource

HARDWARE LEVELS OF PARALLELISM
In-core parallelism

 Single Instruction Multiple Data (SIMD)
 Elements of vectors are processed in parallel
 Each CPU core has its own independent SIMD execution units (parallel execution units within each CPU core)
 Exploits data-level parallelism, but not concurrency: simultaneous (parallel) computations, but only a single process (instruction)
 Helps to improve the performance of multimedia
 o E.g., the same operation applied to every pixel of an image at a given moment
 Register sizes: e.g., 64, 128, 256 and 512 bits
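A minimal sketch of the image example above (assuming an auto-vectorizing compiler, e.g. g++ -O3 -mavx2): the loop applies one operation to many data elements, which the compiler can map onto SIMD registers:

```cpp
// Brighten every pixel of a grayscale image by the same amount.
// With auto-vectorization, the compiler emits SIMD instructions that
// process several pixels per instruction (e.g., 32 uint8 lanes with AVX2).
#include <cstdint>
#include <cstddef>

void brighten(std::uint8_t* pixels, std::size_t n, std::uint8_t delta) {
    for (std::size_t i = 0; i < n; ++i) {
        pixels[i] = static_cast<std::uint8_t>(pixels[i] + delta);  // same op, many data
    }
}
```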


 Simultaneous Multithreading (SMT) [11] Andreas Herten

 Process of a CPU splitting each of its physical cores into virtual cores (threads)
 Multiple threads with multiple tasks are executed simultaneously on one CPU core
 Although running on the same core, they are completely separated from each other
 Similar in concept to preemptive multitasking but is implemented at the thread level
 Intel branded this process as hyper-threading
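As a small sketch, the C++ standard library can report the number of logical cores, which on an SMT (hyper-threading) CPU is typically a multiple of the physical core count:

```cpp
#include <iostream>
#include <thread>

int main() {
    // On a CPU with 2-way SMT (hyper-threading), this typically reports
    // twice the number of physical cores (e.g., 8 on a quad-core CPU).
    std::cout << "logical cores: " << std::thread::hardware_concurrency() << '\n';
}
```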

HARDWARE LEVELS OF PARALLELISM
In-processor parallelism

 Single Instruction Multiple Threads (SIMT = SIMD + SMT)
 SIMD is combined with multithreading (introduced by NVIDIA)
 The hardware groups threads that execute the same instruction into warps
 Several warps constitute a thread block

 Warp (working unit):
 Executed physically in parallel on a multiprocessor
 Its threads issue instructions in lock-step (as with SIMD)

[12] Peter Messmer
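A minimal CUDA sketch of the SIMT model (a hypothetical kernel, not from the lecture): every thread runs the same instruction stream on its own data element, and the hardware schedules the threads in warps of 32:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Every thread executes this same code on a different element;
// threads 0-31 of a block form one warp and issue in lock-step.
__global__ void scale(float* x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;
    cudaMallocManaged((void**)&x, n * sizeof(float));  // unified memory for brevity
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // 256 threads per block = 8 warps per block.
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    std::printf("x[0] = %f\n", x[0]);                  // 2.0
    cudaFree(x);
}
```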


GRAPHICS PROCESSING UNITS (GPUS)
Processors

 Use of many simple cores executing threads rather slowly

 High-throughput-oriented computing architecture
 Uses massive parallelism by executing a lot of concurrent threads
 Handles an ever-increasing number of instruction threads

 Great for data and task parallelism
 Applications that use vector/matrix multiplication (e.g., deep learning algorithms)

[11] Andreas Herten

CPU VS. GPU
Overview

 A matter of specialities
 CPU: transporting one (a few heavyweight threads)
 GPU: transporting many (many lightweight threads)

 Chip comparison: CPU vs. GPU

[11] Andreas Herten

CPU VS. GPU
Overview
 CPU core ≊ GPU multiprocessor (i.e., Streaming Multiprocessor (SM))

[11] Andreas Herten


CPU VS. GPU
Performance

GFLOP/s = 10⁹ floating-point operations per second


[11] Andreas Herten [13] Karl Rupp
GPU MEMORY
Different than CPU Memory

 GPU: accelerator / extension card

 Separate device from the CPU
 Separate memory
 Memory transfers need special consideration
 Do as few as possible
 Formerly: explicitly copy data from/to the GPU
 Now: done automatically (at what performance cost?)

PROCESSING FLOW
CPU ⇒ GPU ⇒ CPU

 1. Transfer the input data from CPU memory to GPU memory
 2. Load the GPU program and execute it, caching data on-chip for performance
 3. Transfer the results from GPU memory back to CPU memory

[11] Andreas Herten
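A hedged CUDA sketch of this three-step flow with explicit transfers (in contrast to the unified-memory example earlier):

```cpp
#include <vector>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1024;
    std::vector<float> host(n, 0.0f);
    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));

    // 1. CPU -> GPU: copy the input data to device memory.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Execute the GPU program on the data.
    add_one<<<(n + 255) / 256, 256>>>(dev, n);

    // 3. GPU -> CPU: copy the results back (implicitly synchronizes).
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("host[0] = %f\n", host[0]);  // 1.0
    cudaFree(dev);
}
```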


HARDWARE LEVELS OF PARALLELISM
Single Computer and Multiple Computers

 Exposed to the programmer in the form of multi-core systems

 Parallel programming: 'programming models'
 (A) Single machine (shared memory): OpenMP
 Multiple threads (the programmer works on the parallelism, leaving the data shuffling to the hardware system)
 (B) Multiple machines (distributed memory): Message Passing Interface (MPI)
 Multiple processes (forces the programmer to consider the distribution of the data as a first-class concern)

[14] OpenMP API Specification [15] MPI Standard

SHARED MEMORY SYSTEM
Single Computer

 System where a number of CPUs work on a common, shared physical address space
 Programming using OpenMP (a set of compiler directives to 'mark parallel regions') [14]
 Immediate access to all data by all processors without explicit communication
 Significant advances in CPUs (microprocessor chips)
 Multi-core CPU chips have quad, six, or n processing cores on one chip
 Clock rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
 o Reached a limit due to constraints in physics (~5 GHz)

 Hierarchy of caches (on/off chip), per multicore CPU chip (socket):
 L1 cache is private to each core; on-chip
 L2 cache is shared by the cores of one chip; on-chip
 L3 cache or dynamic random access memory (DRAM); off-chip, reached via the chipset and shared between the sockets
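As a minimal sketch of this model (assuming a compiler with OpenMP support, e.g. g++ -fopenmp), one directive parallelizes a loop over the shared address space:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1000000;
    static double a[n];

    // The threads share the address space: each one works on a chunk of
    // the same array 'a' without any explicit communication.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * i;
    }

    std::printf("threads available: %d\n", omp_get_max_threads());
    std::printf("a[42] = %f\n", a[42]);  // 84.0
}
```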

DISTRIBUTED-MEMORY SYSTEM
Multiple Computers

 No remote memory access on distributed-memory systems
 A process cannot access another process' memory directly
 Enables explicit message passing as communication between processors [15]
 Requires 'sending messages' back and forth between processes
 Programming is tedious and complicated, but the most flexible method

 Programming model: MPI

 Processors communicate via Network Interfaces (NI)
 The NI mediates the connection to the communication network
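A minimal MPI sketch (assuming an MPI installation; build with mpicxx, run with mpirun -np 2): the two processes have separate memories and communicate only via explicit messages:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Rank 0 cannot write into rank 1's memory; it must send a message.
        double value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value = 0.0;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
}
```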
HIGH PERFORMANCE COMPUTING
Mixture of shared-memory and distributed-memory systems

 Wikipedia: ‘redirects from HPC to Supercomputer’


 Computer at the frontline of contemporary processing capacity with particularly high speed of calculation

[16] Wikipedia ‘Supercomputer’ Online


 HPC includes work on ‘four basic building blocks’:
 Theory (numerical laws, physical models, speed-up performance, etc.)
 Technology (multi-core, supercomputers, networks, storages, etc.)
 Architecture (shared-memory, distributed-memory, interconnects, etc.)
 Software (libraries, schedulers, monitoring, applications, etc.)

 Architecture: shared-memory building blocks (nodes) interconnected with a fast network (e.g., InfiniBand)

HIGH PERFORMANCE COMPUTING
Definition

 High Performance Computing (HPC) is based on computing resources that enable the efficient use of
 parallel computing techniques through specific support with dedicated hardware, such as high-performance
 CPU/core interconnections
 Network interconnection: important (HPC is the focus of this lecture)

 High Throughput Computing (HTC) is based on commonly available computing resources, such as
 commodity PCs and small clusters, that enable the execution of 'farming jobs' without providing a
 high-performance interconnection between the CPUs/cores
 Network interconnection: less important!
PARALLEL ARCHITECTURES
In Deep Learning

 Clear trend towards GPUs (they dominate the publications from 2013 on)
 However, even accelerated nodes are not sufficient for the large computational workload
 Multi-node parallelism is growing quickly

 147 DL papers out of the 240 reviewed papers
[17] T. Ben-Nun and T. Hoefler (2018)

 Distributed-memory architectures with accelerators such as GPUs
 The default option for machine learning today
PARALLEL ARCHITECTURES
In Deep Learning

 Training deep learning models is a very compute-intensive task
 Single machines are often not capable of finishing this task in a desired time frame

 Accelerate the training by distributing the computation
 Across multiple machines connected by a network (HPC systems)

 Different network technologies provide different performance
 73 DL papers out of the 240 reviewed papers [17] T. Ben-Nun and T. Hoefler (2018)
 Both modern Ethernet and InfiniBand provide high bandwidth
 But InfiniBand has significantly lower latencies (delays) and higher message rates

TOP 500 LIST
November 2019

 The measure of speed in HPC matters

 Common measure for parallel computers established by the TOP500 list
 Based on the LINPACK benchmark for ranking the best 500 computers worldwide
[18] TOP 500 supercomputing sites

 LINPACK solves a dense system of linear equations of unspecified size
 It covers only a single architectural aspect ('critics exist')
[19] LINPACK Benchmark implementation
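For reference, a sketch of what the benchmark measures (using the standard HPL operation count as an assumption): LINPACK solves the dense system and reports the sustained floating-point rate:

```latex
% HPL solves A x = b for a dense n x n matrix A via LU factorization.
% The assumed operation count and the reported performance R_max are:
\[
  \mathrm{ops}(n) \;\approx\; \tfrac{2}{3} n^{3} + 2 n^{2},
  \qquad
  R_{\max} \;=\; \frac{\mathrm{ops}(n)}{t_{\mathrm{solve}}}
  \quad \text{[FLOP/s]}
\]
```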

 Alternative realistic applications, benchmark suites, and criteria exist
[20] HPC Challenge Benchmark Suite [21] JUBE Benchmark Suite [22] The GREEN500

#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)

 Hosted at Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL)


 Space: as big as two tennis courts
[23] ORNL

 Total number of nodes: 4608
 Nodes linked together with a Mellanox dual-rail EDR InfiniBand network

 Power consumption: 13 megawatts

 Total system memory: >10 PetaBytes (PB)
 Per node: 512 GigaBytes (GB) DDR4 + 96 GB HBM2 + ~1.6 TeraBytes (TB) non-volatile memory (NVM)

 File system: 250 PB (IBM), transferring data at 2.5 TB/s

#1 HPC MACHINE SUMMIT
Node
 Each node: two 22-core IBM Power9 CPUs
 22 cores per CPU (4 hardware threads/core)
 Each node: six(!) NVIDIA Tesla V100 GPUs
 5,120 CUDA cores per GPU
 (i.e., 4608 nodes x 6 GPUs = 27648 GPUs)

[23] ORNL
MULTI-GPU COMMUNICATION
HPC Systems - NVLink and NVSwitch

 NVLink: enables high-speed direct GPU-to-GPU interconnects
 It supports 6 NVLink connections per NVIDIA Tesla Volta V100 GPU
 It can interconnect up to 8 NVIDIA Tesla Volta V100 GPUs
 NVSwitch: enables all-to-all GPU communication within a single node
 Incorporates multiple NVLinks
 NVLink/NVSwitch are considered 'islands', since they do not scale workloads
 to a full HPC machine in the way that the GPUDirect interface enables

[24] NVLink and NVSwitch [25] Mellanox OFED GPUDirect RDMA
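A hedged CUDA sketch of direct GPU-to-GPU transfers within such an 'island' (assuming a node with at least two peer-capable GPUs, e.g. connected via NVLink):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);  // can GPU 0 reach GPU 1?
    std::printf("peer access 0 -> 1: %d\n", can_access);

    float *buf0 = nullptr, *buf1 = nullptr;
    const size_t bytes = 1 << 20;

    cudaSetDevice(0);
    cudaMalloc((void**)&buf0, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc((void**)&buf1, bytes);

    // Direct device-to-device copy; with peer access over NVLink this
    // bypasses host memory entirely.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
}
```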

NVIDIA TESLA VOLTA (V100)
Selected Facts
 Equipped with over 21 billion transistors, 5,120 CUDA cores, and 640 tensor cores
 A tensor core is optimized for deep learning workloads
 o Acceleration of large matrix operations
 o Performs mixed-precision matrix multiply-and-accumulate calculations in a single operation (see the sketch below)
 Predecessor of the NVIDIA V100 was the NVIDIA Tesla Pascal (P100)
 Predecessor of the NVIDIA P100 was the NVIDIA Tesla Kepler (e.g., K80 or K40)
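As a sketch of that single fused operation (based on NVIDIA's publicly documented 4×4 tensor-core primitive; exact shapes per instruction vary by generation):

```latex
% One tensor-core operation: a small matrix multiply-accumulate, with
% half-precision inputs and (typically) single-precision accumulation.
\[
  D \;=\; A \, B + C,
  \qquad A, B \in \mathrm{FP16}^{4 \times 4},
  \qquad C, D \in \mathrm{FP32}^{4 \times 4}
\]
```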

ResNet-50: over 23 million trainable parameters

[26] ResNet in Keras [27] K. He, X. Zhang, et al. (2016) [28] NVIDIA V100 TENSOR CORE GPU
#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)

 Summit Early Science Video Series: Enabling Scientific Innovation

[29] Summit

THE DEEP PROJECTS
Research & innovation projects co-funded by the European Union

 Paving the road towards Exascale computing

 EU Exascale projects: 27 partners, total budget 44 M€, EU funding 30 M€
 One of only two Exascale project series funded

 DEEP (2012-2015)
 o Cluster/Booster concept

 DEEP-ER (Extended Reach, 2013-2017)
 o Scalable I/O and resiliency

 DEEP-EST (Extreme Scale Technologies, 2017-2020)
 o Generalized Modular Supercomputing Architecture,
   support for data analytics and machine learning

[30] The DEEP projects

DEEP-EST PROJECT
Applications

 University of Iceland – Earth Science

 CERN – High Energy Physics

 NEST – Brain Simulation

 ASTRON – Radio Astronomy

 KU Leuven – Space Weather

 GROMACS – Molecular Dynamics

 Hardware design driven by software needs


[30] The DEEP projects

DEEP-EST PROJECT
Modular Supercomputing Architecture (MSA)

 DEEP-EST prototype includes three compute modules


 CLUSTER Module
 Extreme Scale Booster (ESB)
 Data Analytics Module (DAM)

 Storage module handles workflow data


 Local storage and the network attached memory (NAM) for hot application data
 Global communication engine (GCE) accelerates MPI collective operations
 Fast network federation binds all parts together

[30] The DEEP projects

MODULAR SUPERCOMPUTING ARCHITECTURE (MSA)
Scaling with GPUDirect Implementation

 Innovative GPU interconnects are realized via GPUDirect implementations
 Go beyond the current limits of the NVLink/NVSwitch 'islands'
 Ongoing research in the DEEP-EST EU project

 The problem: NVLink/NVSwitch interconnects do not reach beyond a single node

[30] The DEEP projects

EXTREME SCALE BOOSTER (ESB)
GPU-centric Programming

 Enables scalability based on GPUs

 Reduces the load on host CPUs by enabling direct transfer of data through MPI to the GPUs
 o (i.e., the host CPU becomes little more than a slim network endpoint)

[31] Estela Suarez
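A hedged sketch of this idea using CUDA-aware MPI (an assumption: the MPI library must be built with CUDA support so that device pointers can be passed directly): with GPUDirect, the data can move from GPU memory to the network without a host-memory detour:

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* dev_buf = nullptr;
    cudaMalloc((void**)&dev_buf, n * sizeof(float));  // buffer lives in GPU memory

    // A CUDA-aware MPI accepts the device pointer directly: with GPUDirect,
    // the transfer goes GPU -> network -> GPU, bypassing host memory.
    if (rank == 0) {
        MPI_Send(dev_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(dev_buf);
    MPI_Finalize();
}
```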


HARDWARE LEVELS OF PARALLELISM
Summary

 Best performance is achieved with a combination of them!

 SIMD: in-core parallelism
 SIMT: in-processor parallelism (many threads on many cores)
 SMT: single computer (Simultaneous Multithreading; cross-core, cross-socket; OpenMP, pthreads)
 MPI: multiple ''computers'' (tightly-coupled supercomputing)
 MPI + MSA: multiple HPC systems (tightly-coupled heterogeneous hardware)

[11] Andreas Herten [12] Peter Messmer [32] SLURM
REFERENCES

 [1] Embedded system
Online: https://en.wikipedia.org/wiki/Embedded_system
 [2] Embedded system example
Online: https://www.electronicshub.org/basics-of-embedded-c-program/
 [3] Desktop Computer
Online: https://www.datamate.org/buddy/
 [4] Microsoft opens UK data centres
Online: https://www.theinquirer.net/inquirer/news/2470021/microsoft-opens-three-uk-data-centres-for-azure-and-office-365
 [5] HPC and Supercomputing at GTC 2017
Online: https://www.youtube.com/watch?v=Usl_TCUTWD8
 [6] Parallel Programming Concepts (WT 2013/14): Introduction
Online: https://www.tele-task.de/lecture/video/4135/
 [7] Moore's law
Online: https://en.wikipedia.org/wiki/Moore%27s_law
 [8] Cerebras Wafer Scale Engine
Online: https://www.cerebras.net/
 [9] Herb Sutter, ''The Free Lunch Is Over''
Online: http://www.gotw.ca/publications/concurrency-ddj.htm
 [10] Introduction to Parallel Computing
Online: https://computing.llnl.gov/tutorials/parallel_comp/

REFERENCES

 [11] Andreas Herten, ''GPU Accelerators at JSC''
Online: https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/supercomputer-ressources-2019-11/12-sc-gpu.html?nn=2363978
 [12] Peter Messmer, CUDA Overview, New Features and Optimization
Online: file:///C:/Users/caval/Downloads/Nvidia_CUDAIntro.pdf
 [13] Karl Rupp, CPU, GPU and MIC Hardware Characteristics over Time
Online: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
 [14] The OpenMP API specification for parallel programming
Online: http://openmp.org/wp/openmp-specifications/
 [15] The MPI Standard
Online: http://www.mpi-forum.org/docs/
 [16] Wikipedia 'Supercomputer'
Online: http://en.wikipedia.org/wiki/Supercomputer
 [17] T. Ben-Nun and T. Hoefler (2018). Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Online: http://arxiv.org/abs/1802.09941
 [18] TOP500 Supercomputing Sites
Online: http://www.top500.org/
 [19] LINPACK Benchmark
Online: http://www.netlib.org/benchmark/hpl/
 [20] HPC Challenge Benchmark Suite
Online: http://icl.cs.utk.edu/hpcc/

REFERENCES

 [21] JUBE Benchmark Suite
Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html
 [22] The GREEN500
Online: https://www.top500.org/green500/
 [23] ORNL develops, deploys AI capabilities across research portfolio
Online: https://www.olcf.ornl.gov/wp-content/uploads/2019/05/Summit_System_Overview_20190520.pdf
 [24] NVLink and NVSwitch
Online: https://www.nvidia.com/en-us/data-center/nvlink/
 [25] Mellanox OFED GPUDirect RDMA
Online: https://www.mellanox.com/related-docs/prod_software/PB_GPUDirect_RDMA.PDF
 [26] Understanding and Coding a ResNet in Keras
Online: https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33
 [27] K. He, X. Zhang, S. Ren and J. Sun (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Online: https://doi.org/10.1109/CVPR.2016.90
 [28] NVIDIA V100 TENSOR CORE GPU
Online: https://www.nvidia.com/en-us/data-center/v100/
 [29] Summit Early Science Video Series: Enabling Scientific Innovation
Online: https://www.youtube.com/watch?v=xYsUH3JbOTA
 [30] The DEEP projects
Online: https://www.deep-projects.eu/

REFERENCES

 [31] Estela Suarez, The DEEP evolution, Seminar Modular Architectures, Jülich Supercomputing Centre (JSC)
 [32] SLURM: Support for Multi-core/Multi-thread Architectures
Online: https://slurm.schedmd.com/mc_support.html

