
NVIDIA FOR MACHINE LEARNING

João Paulo Navarro – Solutions Architect


[email protected]
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES: Speech, Translate, Recommender, Healthcare, Manufacturing, Finance, Molecular Simulations, Weather Forecasting, Seismic Mapping, Creative & Technical Workers, Knowledge Workers
(Consumer Internet & Industry Applications | Scientific Applications | Virtual Graphics)

APPS & FRAMEWORKS: Amber, NAMD and 600+ applications spanning Machine Learning, Deep Learning, HPC and Virtual GPU

CUDA-X & NVIDIA SDKs: cuDF, cuML, cuGraph, cuDNN, CUTLASS, TensorRT, OpenACC, cuFFT, vDWS, vPC, vApps

CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, every OEM, every major cloud
2
PLATFORM BUILT FOR DL
Accelerating Every Framework And Fueling Innovation

All Use-cases: Speech, Video, Translation, Personalization
All Major Frameworks
Volta Tensor Cores, NVLink, NVSwitch
3
TESLA PLATFORM ENABLES DRAMATIC
REDUCTION IN TIME TO TRAIN
Relative Time to Train Improvements (ResNet-50)

• 2x CPU: 25 days
• Single node, 1x P100: 4.8 days
• Single node, 1x V100: 30 hours
• DGX-1, 8x V100: 4 hours
• At scale, 256x V100: 14 minutes

ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140

4
TRADITIONAL HYPERSCALE CLUSTER
300 Dual-CPU Servers | 180 kW

NVIDIA DGX-2 FOR DEEP LEARNING
1 DGX-2 | 10 kW

1/8 the Cost | 1/60 the Space | 1/18 the Power
MACHINE LEARNING WITH
NVIDIA RAPIDS
7
Open Source Data Science Ecosystem
Familiar Python APIs

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: Pandas
• Machine Learning: Scikit-Learn
• Graph Analytics: NetworkX
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: Matplotlib/Seaborn

All in CPU memory

8
RAPIDS
End-to-End Accelerated GPU Data Science

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: cuXfilter <> pyViz

All in GPU memory

9
GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL as opposed to training models.
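
As a hedged illustration of what GPU-accelerated ETL looks like (the file name 'transactions.csv' and the 'amount' and 'store_id' columns below are made-up placeholders), a typical pandas-style snippet ports to cuDF with little more than an import change:

# Illustrative sketch only: file and column names are hypothetical.
import cudf

df = cudf.read_csv('transactions.csv')            # load straight into GPU memory
df = df[df['amount'] > 0]                         # filter rows on the GPU
summary = df.groupby('store_id')['amount'].sum()  # aggregate without leaving the GPU
print(summary.head())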

10
Benchmarks: single-GPU Speedup vs. Pandas
cuDF v0.9, Pandas 0.24.2

Running on NVIDIA DGX-1:
• GPU: NVIDIA Tesla V100 32GB
• CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

Benchmark Setup:
• DataFrames: 2x int32 key columns, 3x int32 value columns
• Merge: inner
• GroupBy: count, sum, min, max calculated for each value column
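
A minimal sketch of the kind of workload being measured, not the benchmark harness itself; the row count and key ranges below are arbitrary choices for illustration:

# Sketch of the benchmarked operations (inner merge + groupby aggregations), not the benchmark code.
import cudf
import numpy as np

n = 100_000
def make_df():
    return cudf.DataFrame({
        'key0': np.random.randint(0, 10_000, n).astype('int32'),
        'key1': np.random.randint(0, 10_000, n).astype('int32'),
        'val0': np.random.randint(0, 1000, n).astype('int32'),
        'val1': np.random.randint(0, 1000, n).astype('int32'),
        'val2': np.random.randint(0, 1000, n).astype('int32'),
    })

left, right = make_df(), make_df()

merged = left.merge(right, on=['key0', 'key1'], how='inner')                   # inner merge on the key columns
grouped = left.groupby(['key0', 'key1']).agg(['count', 'sum', 'min', 'max'])   # per value column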
11
cuML

12
Machine Learning
More models more problems

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: cuXfilter <> pyViz

All in GPU memory

13
Problem
Data sizes continue to grow

Typical workflow on a massive dataset (time increases with each step):
• Histograms / Distributions
• Dimension Reduction
• Feature Selection
• Remove Outliers
• Sampling

Better to start with as much data as possible and explore / preprocess to scale to performance needs.
Iterate. Cross Validate & Grid Search. Iterate some more. Hours? Days?
Meet reasonable speed vs accuracy tradeoff.


14
ML Technology Stack
• Python: Dask cuML, Dask cuDF, cuDF, NumPy
• Cython
• cuML Algorithms
• cuML Prims
• CUDA Libraries: cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBLAS, Thrust, Cub
• CUDA
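
At the top of this stack, Dask cuDF exposes the familiar pandas/Dask DataFrame API while partitioning work across GPUs. Below is a minimal, hedged sketch of that layer in action; the LocalCUDACluster setup is standard dask_cuda usage, but the file pattern 'data_*.csv' and the 'key' column are hypothetical placeholders.

# Hedged sketch: 'data_*.csv' and the 'key' column are placeholders, not real inputs.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()                  # one Dask worker per local GPU
client = Client(cluster)

ddf = dask_cudf.read_csv('data_*.csv')        # partitions are cuDF DataFrames in GPU memory
result = ddf.groupby('key').sum().compute()   # distributed aggregation, gathered at the end
print(result)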

15
Algorithms
GPU-accelerated Scikit-Learn
• Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
• Inference: Random Forest / GBDT inference
• Clustering: K-Means, DBSCAN, Spectral Clustering
• Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
• Time Series: Holt-Winters, Kalman Filtering
• Cross Validation
• Hyper-parameter Tuning

Key: ● Preexisting ● NEW for 0.9 - More to come!
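
As a hedged illustration of the scikit-learn-style API these algorithms share, the sketch below fits a random forest from cuml.ensemble; the synthetic dataset and hyperparameters are arbitrary, chosen only for the example.

# Hedged sketch: synthetic data and arbitrary hyperparameters, for illustration only.
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(10000, 20).astype(np.float32)   # cuML generally expects float32 features
y = (X[:, 0] > 0.5).astype(np.int32)                # toy binary labels

clf = RandomForestClassifier(n_estimators=100, max_depth=8)
clf.fit(X, y)
y_pred = clf.predict(X)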

16
RAPIDS matches common Python APIs
CPU-Based Clustering

from sklearn.datasets import make_moons
import pandas

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)

X = pandas.DataFrame({'fea%d' % i: X[:, i]
                      for i in range(X.shape[1])})

Find Clusters

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)

dbscan.fit(X)

y_hat = dbscan.labels_

17
RAPIDS matches common Python APIs
GPU-Accelerated Clustering

from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)

X = cudf.DataFrame({'fea%d' % i: X[:, i]
                    for i in range(X.shape[1])})

Find Clusters

from cuml import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)

dbscan.fit(X)

y_hat = dbscan.labels_

18
Benchmarks: single-GPU cuML vs scikit-learn

1x V100 vs 2x 20-core CPU

19
GPU-ACCELERATED
XGBOOST
20
XGBOOST: THE WORLD’S MOST POPULAR
MACHINE LEARNING ALGORITHM
Versatile and High Performance

• The leading algorithm for tabular data
• Outperforms most ML algorithms on regression, classification and ranking
• Winner of many data science Kaggle competitions
• InfoWorld Technology of the Year Award, 2019
• Well known in the data science community and widely used for forecasting, fraud detection, recommender engines, and much more

21
HOW CAN XGBOOST BE IMPROVED?
XGBoost Performance is Constrained by CPU Limitations

• CPU processing is slow, creating issues for large data sets or when timeliness is crucial (e.g. intraday requirements for financial services)
• Hyperparameter search is very slow, often making a thorough search infeasible
• Prediction speed limits the depth and number of trees usable in time-sensitive applications

22
GPU-ACCELERATED XGBOOST
Unleashing the Power of NVIDIA GPUs for Users of XGBoost

Faster Time to Insight
XGBoost training on GPUs is significantly faster than on CPUs, completely transforming the timescales of machine learning workflows.

Better Predictions, Sooner
Work with larger datasets and perform more model iterations without spending valuable time waiting.

Lower Costs
Reduce infrastructure investment and save money with improved business forecasting.

Easy to Use
Works seamlessly with the RAPIDS open source data processing and machine learning libraries and ecosystem for end-to-end GPU-accelerated workflows.

23
LOADING DATA INTO A GPU DATAFRAME
USE WITH MINIMAL CODE CHANGES
GPU-Acceleration with the same XGBoost Usage

BEFORE (CPU)

import xgboost as xgb

params = {'max_depth': 3,
          'learning_rate': 0.1}

dtrain = xgb.DMatrix(X, y)
bst = xgb.train(params, dtrain)

AFTER (GPU-accelerated)

import xgboost as xgb

params = {'tree_method': 'gpu_hist',
          'max_depth': 3,
          'learning_rate': 0.1}

dtrain = xgb.DMatrix(X, y)
bst = xgb.train(params, dtrain)
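
The snippets above assume X and y already exist in memory. As a hedged, end-to-end variant that also covers loading the data into a GPU DataFrame, the sketch below reads the features with cuDF and hands them to XGBoost; 'train.csv' and the 'target' column are hypothetical placeholders, and passing cuDF objects to DMatrix assumes a reasonably recent XGBoost build.

# Hedged sketch: 'train.csv' and 'target' are placeholders; requires a recent XGBoost.
import cudf
import xgboost as xgb

df = cudf.read_csv('train.csv')          # load directly into GPU memory
y = df['target']
X = df.drop(columns=['target'])

dtrain = xgb.DMatrix(X, label=y)         # recent XGBoost accepts cuDF inputs directly
bst = xgb.train({'tree_method': 'gpu_hist',
                 'max_depth': 3,
                 'learning_rate': 0.1}, dtrain, num_boost_round=100)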

24
XGBOOST: GPU VS. CPU
Tremendous Performance Improvements and Better Accuracy

Take advantage of parallel processing with multiple GPUs

Scale to multiple nodes

GPU implementation is more memory efficient (half of CPU)

Improved accuracy by allowing time for more iterations, ability to leverage hyperparameter
search, and reduced scale out needs

A single DGX-2 with GPU-accelerated XGBoost is 10x Faster than 100 CPU nodes

25
TRADITIONAL
DATA SCIENCE
CLUSTER
Workload Profile:
Fannie Mae Mortgage Data:
• 192GB data set
• 16 years, 68 quarters
• 34.7 Million single family mortgage loans
• 1.85 Billion performance records
• XGBoost training set: 50 features

300 Servers | $3M | 180 kW

26
GPU-ACCELERATED
DATA SCIENCE
CLUSTER
GPU-accelerated XGBoost
with DGX-2
1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space
1/18 the Power
End-to-end runtime compared across 20, 30, 50 and 100 CPU nodes, a single DGX-2, and 5x DGX-1.

27
DISTRIBUTED XGBOOST
GPU-Accelerated XGBoost for Large Scale Workloads

GPU-acceleration for XGBoost with Apache Spark and Dask

Multiple nodes and multiple GPUs per node

Explore and prototype models on a PC, workstation, server, or cloud instance and scale to two or more
nodes for production training

An ideal solution for GPU-accelerated clusters and enterprise scale workloads

Try out Dask support immediately using Google Cloud Dataproc

Download for on-prem and cloud deployments
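
A minimal, hedged sketch of what multi-GPU training looks like with the Dask integration (the xgboost.dask module in recent XGBoost releases); the cluster here is a single multi-GPU machine via dask_cuda, and 'train_*.csv' with its 'target' column is a hypothetical partitioned dataset.

# Hedged sketch: assumes a recent XGBoost with the dask module and local GPUs via dask_cuda.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import xgboost as xgb

cluster = LocalCUDACluster()                     # one worker per GPU on this machine
client = Client(cluster)

ddf = dask_cudf.read_csv('train_*.csv')          # hypothetical partitioned input
y = ddf['target']
X = ddf.drop(columns=['target'])

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client,
                        {'tree_method': 'gpu_hist'},
                        dtrain,
                        num_boost_round=100)
booster = output['booster']                      # trained model, usable like any Booster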

28
LEARN MORE ABOUT
GPU-ACCELERATED XGBOOST

rapids.ai/xgboost.html rapids.ai/dask.html

29
SOFTWARE - NGC
30
NGC: GPU-OPTIMIZED SOFTWARE HUB
Ready-to-run GPU Optimized Software, Anywhere

• 50+ containers: DL, ML, HPC
• 15+ model training scripts: NLP, image classification, object detection & more
• 60 pre-trained models: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
• Runs anywhere: on-prem, cloud, hybrid cloud, multi-cloud

31
SIMPLIFYING APPLICATION DEPLOYMENTS
Driving Productivity and Faster Discoveries

• Superior Performance - continuous optimizations
• Pre-trained Models & Scripts - speed up AI workflows
• On-demand Software - higher productivity
• Scalable - on multi-GPU, multi-node systems
• Run Anywhere - on-prem, cloud, hybrid
• Designed for Enterprise & HPC - Docker & Singularity

For Data Scientists & Developers and Sysadmins & DevOps
32
NVIDIA PLATFORM FOR AI
João Paulo Navarro – Solutions Architect
[email protected]
