Lecture 4 – Parallel and Scalable Machine Learning with High Performance Computing (Part 1)
COMPUTER MARKET
IT Environments
Embedded system
System that you don’t use directly (not a stand-alone computer with its own display and input devices)
Combination of a computer processor, memory, and input/output peripheral devices
Has a dedicated function within a larger mechanical or electrical system (e.g., washing machine)
Desktop Computing
Home computers
Performance/price ratio as the major issue
APPLICATIONS
Some problems always benefit from faster processing
Simulation and modeling (climate, earthquakes, airplane design, car design, vehicle traffic patterns, …)
Next-generation medicine
DNA sequencing, simulation of drug effects
[5] HPC and Supercomputing
RUNNING APPLICATIONS
How can you make applications run faster?
[6] WT 2013/14
MACHINE MODEL
Von Neumann Architecture
Today’s machines rely on optimization tricks (caching, buffering, etc.) to deal with the memory bottleneck of this architecture
However, this issue is becoming relevant again with multi-core and many-core systems
HOW TO DO ANYTHING FASTER
Three Ways
Main solution from 1950 to 2000: optimize the computing time per instruction (Instruction-Level Parallelism (ILP), code optimization, caching)
“… the number of transistors that can be inexpensively placed on integrated circuits is increasing exponentially, doubling approximately every two years. …” (Gordon Moore, 1965)
Moore’s Law is only about the number of transistors that can be placed on a die
Die: small block of semiconducting material on which a given functional circuit is fabricated
Transistors have reached the size of a few nanometers
E.g., 1.2 trillion transistors on a die with an area of 46,000 mm²
[8] Cerebras Wafer Scale Engine
THE INDUSTRY PROBLEM
Processor Speed Development
Companies have been shrinking the technology to try to follow Moore’s Law
By doubling the semiconductor density, they were getting performance and power improvements
Work smarter: Instruction-Level Parallelism (ILP)
Investment costs to keep up with performance demands are increasingly high; only a few fab vendors can keep up
THE FREE LUNCH IS OVER
Paradigm Shift
PARALLEL COMPUTING
Concepts
The problem should be solved in less time than with a single compute resource
HARDWARE LEVELS OF PARALLELISM
In-core parallelism: elements of vectors are processed in parallel (SIMD vector instructions; see the sketch after this list)
Process of a CPU splitting each of its physical cores into virtual cores (threads)
Multiple threads with multiple tasks are executed simultaneously on one CPU core
Although running on the same core, they are completely separated from each other
Similar in concept to preemptive multitasking but is implemented at the thread level
Intel branded this process as hyper-threading
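A minimal C++ sketch of the in-core (vector) level, with illustrative names: with optimization enabled, the compiler maps this loop onto SIMD vector instructions so several array elements are processed per instruction; the OpenMP simd pragma (compile, e.g., with g++ -O2 -fopenmp) makes the request explicit.

#include <cstdio>
#include <vector>

// Illustrative kernel: y = alpha * x + y, element-wise.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp simd  // ask the compiler to vectorize this loop
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i] + y[i];
}

int main() {
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    saxpy(3.0f, x, y);
    std::printf("y[0] = %f\n", y[0]);  // prints 5.0
    return 0;
}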
HARDWARE LEVELS OF PARALLELISM
In-processor parallelism
CPU VS. GPU
Overview
A matter of specialization
[Figure: CPU vs. GPU chip layout comparison]
[11] Andreas Herten
GPU multiprocessor
PROCESSING FLOW
CPU ⇒ GPU ⇒ CPU
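A minimal CUDA sketch of this flow, with illustrative names: data is copied from the host (CPU) to the device (GPU), a kernel is launched on the GPU, and the results are copied back to the host.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scale(float* x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // compute on GPU

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(d);
    std::printf("h[0] = %f\n", h[0]);  // prints 2.0
    delete[] h;
    return 0;
}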
SHARED MEMORY SYSTEM
Single Computer
System where a number of CPUs work on a common, shared physical address space
Programming using OpenMP (a set of compiler directives to ‘mark parallel regions’; see the sketch below)
Immediate access to all data by all processors without explicit communication
Enabled by significant advances in CPUs (microprocessor chips)
[14] OpenMP API Specification
Multi-core CPU chips have four, six, or in general n processing cores on one chip
Clock rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
o Reached a limit due to physical constraints (~5 GHz)
[Diagram: two multi-core CPU chips (CORE 1, CORE 2, …, CORE N each) sharing L3 cache / DRAM]
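A minimal OpenMP sketch of shared-memory parallelism, with illustrative names (compile, e.g., with g++ -fopenmp): one directive marks the parallel region, the loop iterations are divided among threads, and all threads read and write the same shared address space without explicit communication.

#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // The directive marks the parallel region; each thread works on a
    // chunk of the iterations, all accessing the same shared arrays.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0f * x[i];

    std::printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
    return 0;
}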
DISTRIBUTED-MEMORY SYSTEM
Multiple Computers
Programming model: MPI (Message Passing Interface; see the sketch below)
Architecture: shared-memory building blocks (nodes) interconnected with a fast communication network (e.g., InfiniBand)
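A minimal MPI sketch of distributed-memory parallelism, with illustrative names: each process owns private memory, and data is combined only through explicit messages over the communication network (here a reduction). Launched as several processes, e.g., mpirun -np 4 ./a.out, which may be spread across nodes.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    // Each process computes a private partial result in its own memory ...
    double local = rank + 1.0, total = 0.0;

    // ... and results are combined only via explicit communication.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum over %d processes = %f\n", size, total);
    MPI_Finalize();
    return 0;
}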
HIGH PERFORMANCE COMPUTING
Definition
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware, such as high-performance CPU/core interconnections.
HPC: network interconnection is important; HPC is the focus in this lecture
High Throughput Computing (HTC) is based on commonly available computing resources, such as commodity PCs and small clusters, that enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPUs/cores.
HTC: network interconnection is less important!
PARALLEL ARCHITECTURES
In Deep Learning
[Charts from [17] T. Ben-Nun and T. Hoefler (2018): breakdowns of 147 and 73 DL papers out of the 240 reviewed]
Different network technologies provide different performance [17]
Both modern Ethernet and InfiniBand provide high bandwidth
But InfiniBand has significantly lower latencies (delays) and higher message rates
TOP 500 LIST
November 2019
#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)
#1 HPC MACHINE SUMMIT
Node
Each node: two 22-core Power9 IBM CPUs
Each node: six(!) NVIDIA Tesla V100 GPUs
(i.e., 4608 nodes x 6 GPUs = 27648 GPUs)
[23] ORNL
Each V100 GPU: 5,120 CUDA cores
MULTI-GPU COMMUNICATION
HPC Systems - NVLink and NVSwitch
NVIDIA TESLA VOLTA (V100)
Selected Facts
Equipped with over 21 billion transistors, 5,120 CUDA cores, and 640 tensor cores
A tensor core is optimized for deep learning workloads (see the sketch after this list)
o Acceleration of large matrix operations
o Performs mixed-precision matrix multiply and accumulate calculations in a single operation
Predecessor of the NVIDIA V100 was the NVIDIA Tesla Pascal (P100)
Predecessor of the NVIDIA P100 was the NVIDIA Tesla Kepler (e.g., K80 or K40)
ResNet-50: over 23 million trainable parameters
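A sketch of the fused operation behind a tensor core, assuming the 4×4 tile size described in NVIDIA’s Volta material (symbols illustrative): the matrix multiply takes FP16 inputs while the accumulation is kept in FP32, all in a single operation.

\[
\underbrace{D}_{\mathrm{FP32}} = \underbrace{A}_{\mathrm{FP16}} \times \underbrace{B}_{\mathrm{FP16}} + \underbrace{C}_{\mathrm{FP32}}, \qquad A, B, C, D \in \mathbb{R}^{4 \times 4}
\]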
[28] NVIDIA V100 Tensor Core GPU
#1 HPC MACHINE SUMMIT
TOP 500 (November 2019)
[29] Summit
THE DEEP PROJECTS
Research & innovation projects co-funded by the European Union
DEEP-EST PROJECT
Applications
DEEP-EST PROJECT
Modular Supercomputing Architecture (MSA)
MODULAR SUPERCOMPUTING ARCHITECTURE (MSA)
Scaling with GPUDirect Implementation
The problem:
EXTREME SCALE BOOSTER (ESB)
GPU-centric Programming
[32] SLURM
REFERENCES
[31] Estela Suarez, The DEEP evolution, Seminar Modular Architectures, Jülich Supercomputing Centre (JSC)
[32] SLURM: Support for Multi-core/Multi-thread Architectures, Online: https://slurm.schedmd.com/mc_support.html