
NVIDIA FOR MACHINE LEARNING

João Paulo Navarro – Solutions Architect


[email protected]
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES: Speech, Translate, Recommender, Healthcare, Manufacturing, Finance, Molecular Simulations, Weather Forecasting, Seismic Mapping, Creative & Technical Workers, Knowledge Workers
(Consumer Internet & Industry Applications | Scientific Applications | Virtual Graphics)

APPS & FRAMEWORKS: Amber, NAMD and 600+ applications spanning Machine Learning, Deep Learning, HPC and Virtual GPU

CUDA-X & NVIDIA SDKs: cuDF, cuML, cuGraph, cuDNN, CUTLASS, TensorRT, OpenACC, cuFFT, vDWS, vPC, vApps

CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, every OEM, every major cloud
2
PLATFORM BUILT FOR DL
Accelerating Every Framework And Fueling Innovation

All Use-cases: Speech, Video, Translation, Personalization
All Major Frameworks
Volta Tensor Cores, NVLink, NVSwitch
3
TESLA PLATFORM ENABLES DRAMATIC
REDUCTION IN TIME TO TRAIN
Relative Time to Train Improvements (ResNet-50)

• 2x CPU: 25 days
• Single node, 1x P100: 4.8 days
• Single node, 1x V100: 30 hours
• DGX-1, 8x V100: 4 hours
• At scale, 256x V100: 14 minutes

ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140

4
TRADITIONAL HYPERSCALE CLUSTER
300 Dual-CPU Servers | 180 kW

NVIDIA DGX-2 FOR DEEP LEARNING
1 DGX-2 | 10 kW

1/8 the Cost | 1/60 the Space | 1/18 the Power
MACHINE LEARNING WITH
NVIDIA RAPIDS
7
Open Source Data Science Ecosystem
Familiar Python APIs

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: Pandas
• Machine Learning: Scikit-Learn
• Graph Analytics: NetworkX
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: Matplotlib/Seaborn

All in CPU memory

8
RAPIDS
End-to-End Accelerated GPU Data Science

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: cuXfilter <> pyViz

All in GPU memory

9
GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL as opposed to training models.
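
As a hedged illustration of what GPU-accelerated ETL looks like (the file name 'transactions.csv' and the 'amount' and 'store_id' columns below are made-up placeholders), a typical pandas-style snippet ports to cuDF with little more than an import change:

# Illustrative sketch only: file and column names are hypothetical.
import cudf

df = cudf.read_csv('transactions.csv')            # load straight into GPU memory
df = df[df['amount'] > 0]                         # filter rows on the GPU
summary = df.groupby('store_id')['amount'].sum()  # aggregate without leaving the GPU
print(summary.head())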

10
Benchmarks: single-GPU Speedup vs. Pandas
cuDF v0.9, Pandas 0.24.2

Running on NVIDIA DGX-1:
• GPU: NVIDIA Tesla V100 32GB
• CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

Benchmark Setup:
• DataFrames: 2x int32 key columns, 3x int32 value columns
• Merge: inner
• GroupBy: count, sum, min, max calculated for each value column
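
A minimal sketch of the kind of workload being measured, not the benchmark harness itself; the row count and key ranges below are arbitrary choices for illustration:

# Sketch of the benchmarked operations (inner merge + groupby aggregations), not the benchmark code.
import cudf
import numpy as np

n = 100_000
def make_df():
    return cudf.DataFrame({
        'key0': np.random.randint(0, 10_000, n).astype('int32'),
        'key1': np.random.randint(0, 10_000, n).astype('int32'),
        'val0': np.random.randint(0, 1000, n).astype('int32'),
        'val1': np.random.randint(0, 1000, n).astype('int32'),
        'val2': np.random.randint(0, 1000, n).astype('int32'),
    })

left, right = make_df(), make_df()

merged = left.merge(right, on=['key0', 'key1'], how='inner')                   # inner merge on the key columns
grouped = left.groupby(['key0', 'key1']).agg(['count', 'sum', 'min', 'max'])   # per value column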
11
cuML

12
Machine Learning
More models more problems

Data Preparation | Model Training | Visualization (orchestrated with Dask)

• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: cuXfilter <> pyViz

All in GPU memory

13
Problem
Data sizes continue to grow

Typical workflow on a massive dataset (time increases with each step):
• Histograms / Distributions
• Dimension Reduction
• Feature Selection
• Remove Outliers
• Sampling

Better to start with as much data as possible and explore / preprocess to scale to performance needs.
Iterate. Cross Validate & Grid Search. Iterate some more. Hours? Days?
Meet reasonable speed vs accuracy tradeoff.


14
ML Technology Stack
• Python: Dask cuML, Dask cuDF, cuDF, NumPy
• Cython
• cuML Algorithms
• cuML Prims
• CUDA Libraries: cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBLAS, Thrust, Cub
• CUDA
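
At the top of this stack, Dask cuDF exposes the familiar pandas/Dask DataFrame API while partitioning work across GPUs. Below is a minimal, hedged sketch of that layer in action; the LocalCUDACluster setup is standard dask_cuda usage, but the file pattern 'data_*.csv' and the 'key' column are hypothetical placeholders.

# Hedged sketch: 'data_*.csv' and the 'key' column are placeholders, not real inputs.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()                  # one Dask worker per local GPU
client = Client(cluster)

ddf = dask_cudf.read_csv('data_*.csv')        # partitions are cuDF DataFrames in GPU memory
result = ddf.groupby('key').sum().compute()   # distributed aggregation, gathered at the end
print(result)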

15
Algorithms
GPU-accelerated Scikit-Learn
• Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
• Inference: Random Forest / GBDT inference
• Clustering: K-Means, DBSCAN, Spectral Clustering
• Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
• Time Series: Holt-Winters, Kalman Filtering
• Cross Validation
• Hyper-parameter Tuning

Key: ● Preexisting ● NEW for 0.9 - More to come!
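
As a hedged illustration of the scikit-learn-style API these algorithms share, the sketch below fits a random forest from cuml.ensemble; the synthetic dataset and hyperparameters are arbitrary, chosen only for the example.

# Hedged sketch: synthetic data and arbitrary hyperparameters, for illustration only.
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(10000, 20).astype(np.float32)   # cuML generally expects float32 features
y = (X[:, 0] > 0.5).astype(np.int32)                # toy binary labels

clf = RandomForestClassifier(n_estimators=100, max_depth=8)
clf.fit(X, y)
y_pred = clf.predict(X)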

16
RAPIDS matches common Python APIs
CPU-Based Clustering

from sklearn.datasets import make_moons
import pandas

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)

X = pandas.DataFrame({'fea%d' % i: X[:, i]
                      for i in range(X.shape[1])})

Find Clusters

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)

dbscan.fit(X)

y_hat = dbscan.labels_

17
RAPIDS matches common Python APIs
GPU-Accelerated Clustering

from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)

X = cudf.DataFrame({'fea%d' % i: X[:, i]
                    for i in range(X.shape[1])})

Find Clusters

from cuml import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)

dbscan.fit(X)

y_hat = dbscan.labels_

18
Benchmarks: single-GPU cuML vs scikit-learn

1x V100 vs 2x 20-core CPU

19
GPU-ACCELERATED
XGBOOST
20
XGBOOST: THE WORLD’S MOST POPULAR
MACHINE LEARNING ALGORITHM
Versatile and High Performance

• The leading algorithm for tabular data
• Outperforms most ML algorithms on regression, classification and ranking
• Winner of many data science Kaggle competitions
• InfoWorld Technology of the Year Award, 2019
• Well known in the data science community and widely used for forecasting, fraud detection, recommender engines, and much more

21
HOW CAN XGBOOST BE IMPROVED?
XGBoost Performance is Constrained by CPU Limitations

• CPU processing is slow, creating issues for large data sets or when timeliness is crucial (e.g. intraday requirements for financial services)
• Hyperparameter search is very slow, often making a thorough search infeasible
• Prediction speed limits the depth and number of trees usable in time-sensitive applications

22
GPU-ACCELERATED XGBOOST
Unleashing the Power of NVIDIA GPUs for Users of XGBoost

Faster Time to Insight
XGBoost training on GPUs is significantly faster than on CPUs, completely transforming the timescales of machine learning workflows.

Better Predictions, Sooner
Work with larger datasets and perform more model iterations without spending valuable time waiting.

Lower Costs
Reduce infrastructure investment and save money with improved business forecasting.

Easy to Use
Works seamlessly with the RAPIDS open source data processing and machine learning libraries and ecosystem for end-to-end GPU-accelerated workflows.

23
LOADING DATA INTO A GPU DATAFRAME
USE WITH MINIMAL CODE CHANGES
GPU-Acceleration with the same XGBoost Usage

BEFORE (CPU)

import xgboost as xgb

params = {'max_depth': 3,
          'learning_rate': 0.1}

dtrain = xgb.DMatrix(X, y)
bst = xgb.train(params, dtrain)

AFTER (GPU-accelerated)

import xgboost as xgb

params = {'tree_method': 'gpu_hist',
          'max_depth': 3,
          'learning_rate': 0.1}

dtrain = xgb.DMatrix(X, y)
bst = xgb.train(params, dtrain)
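
The snippets above assume X and y already exist in memory. As a hedged, end-to-end variant that also covers loading the data into a GPU DataFrame, the sketch below reads the features with cuDF and hands them to XGBoost; 'train.csv' and the 'target' column are hypothetical placeholders, and passing cuDF objects to DMatrix assumes a reasonably recent XGBoost build.

# Hedged sketch: 'train.csv' and 'target' are placeholders; requires a recent XGBoost.
import cudf
import xgboost as xgb

df = cudf.read_csv('train.csv')          # load directly into GPU memory
y = df['target']
X = df.drop(columns=['target'])

dtrain = xgb.DMatrix(X, label=y)         # recent XGBoost accepts cuDF inputs directly
bst = xgb.train({'tree_method': 'gpu_hist',
                 'max_depth': 3,
                 'learning_rate': 0.1}, dtrain, num_boost_round=100)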

24
XGBOOST: GPU VS. CPU
Tremendous Performance Improvements and Better Accuracy

Take advantage of parallel processing with multiple GPUs

Scale to multiple nodes

GPU implementation is more memory efficient (half of CPU)

Improved accuracy by allowing time for more iterations, ability to leverage hyperparameter
search, and reduced scale out needs

A single DGX-2 with GPU-accelerated XGBoost is 10x Faster than 100 CPU nodes

25
TRADITIONAL
DATA SCIENCE
CLUSTER
Workload Profile:
Fannie Mae Mortgage Data:
• 192GB data set
• 16 years, 68 quarters
• 34.7 Million single family mortgage loans
• 1.85 Billion performance records
• XGBoost training set: 50 features

300 Servers | $3M | 180 kW

26
GPU-ACCELERATED
DATA SCIENCE
CLUSTER
GPU-accelerated XGBoost
with DGX-2
1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space
1/18 the Power
End-to-end runtime compared across 20, 30, 50 and 100 CPU nodes, a single DGX-2, and 5x DGX-1.

27
DISTRIBUTED XGBOOST
GPU-Accelerated XGBoost for Large Scale Workloads

GPU-acceleration for XGBoost with Apache Spark and Dask

Multiple nodes and multiple GPUs per node

Explore and prototype models on a PC, workstation, server, or cloud instance and scale to two or more
nodes for production training

An ideal solution for GPU-accelerated clusters and enterprise scale workloads

Try out Dask support immediately using Google Cloud Dataproc

Download for on-prem and cloud deployments
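
A minimal, hedged sketch of what multi-GPU training looks like with the Dask integration (the xgboost.dask module in recent XGBoost releases); the cluster here is a single multi-GPU machine via dask_cuda, and 'train_*.csv' with its 'target' column is a hypothetical partitioned dataset.

# Hedged sketch: assumes a recent XGBoost with the dask module and local GPUs via dask_cuda.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import xgboost as xgb

cluster = LocalCUDACluster()                     # one worker per GPU on this machine
client = Client(cluster)

ddf = dask_cudf.read_csv('train_*.csv')          # hypothetical partitioned input
y = ddf['target']
X = ddf.drop(columns=['target'])

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client,
                        {'tree_method': 'gpu_hist'},
                        dtrain,
                        num_boost_round=100)
booster = output['booster']                      # trained model, usable like any Booster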

28
LEARN MORE ABOUT
GPU-ACCELERATED XGBOOST

rapids.ai/xgboost.html rapids.ai/dask.html

29
SOFTWARE - NGC
30
NGC: GPU-OPTIMIZED SOFTWARE HUB
Ready-to-run GPU Optimized Software, Anywhere

• 50+ containers: DL, ML, HPC
• 15+ model training scripts: NLP, image classification, object detection & more
• 60 pre-trained models: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
• Runs anywhere: on-prem, cloud, hybrid cloud, multi-cloud

31
SIMPLIFYING APPLICATION DEPLOYMENTS
Driving Productivity and Faster Discoveries

• Superior Performance - continuous optimizations
• Pre-trained Models & Scripts - speed up AI workflows
• On-demand Software - higher productivity
• Scalable - on multi-GPU, multi-node systems
• Run Anywhere - on-prem, cloud, hybrid
• Designed for Enterprise & HPC - Docker & Singularity

For Data Scientists & Developers and Sysadmins & DevOps
32
NVIDIA PLATFORM FOR AI
João Paulo Navarro – Solutions Architect
[email protected]
