NVIDIA RAPIDS
CUSTOMER USE CASES
Speech, Translation, Recommenders, Healthcare, Manufacturing, Finance, Molecular Simulations, Weather Forecasting, Seismic Mapping, Creative & Technical Workers, Knowledge Workers

TESLA GPUs & SYSTEMS
Tesla GPU, NVIDIA DGX Family, NVIDIA HGX, every OEM, every major cloud
PLATFORM BUILT FOR DL
Accelerating Every Framework and Fueling Innovation
Workloads: Speech, Video, Translation, Personalization
Hardware: Tensor Cores, NVLink, NVSwitch
TESLA PLATFORM ENABLES DRAMATIC REDUCTION IN TIME TO TRAIN
Relative time-to-train improvements (ResNet-50):
• 2x CPU: 25 days
• 1x P100 (single node): 4.8 days
• 1x V100 (single node): 30 hours
• 8x V100 (DGX-1): 4 hours
• 256x V100 (at scale): 14 minutes
ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140
TRADITIONAL HYPERSCALE CLUSTER
300 dual-CPU servers | 180 kW

NVIDIA DGX-2 FOR DEEP LEARNING
1 DGX-2 | 10 kW
[Diagram: the conventional open-source data science stack, with Dask orchestrating work over CPU memory]
RAPIDS
End-to-End Accelerated GPU Data Science
• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MxNet
• Visualization: cuXfilter <> pyViz
Everything runs in GPU memory, with Dask scaling workloads across GPUs and nodes.
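A minimal sketch of how the pieces fit together, using the cuDF and dask_cudf APIs; the file paths and column names below are illustrative assumptions, not from the deck.

import cudf
import dask_cudf

# Read a CSV directly into GPU memory (path is illustrative)
gdf = cudf.read_csv('transactions.csv')

# Pandas-like operations execute on the GPU
totals = gdf.groupby('user_id').agg({'amount': 'sum'})

# The same workload partitions across many GPUs via Dask
ddf = dask_cudf.read_csv('transactions-*.csv')
totals_dist = ddf.groupby('user_id').amount.sum().compute()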
GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL, as opposed to training models.
Benchmarks: single-GPU Speedup vs. Pandas
[Chart: cuDF v0.9 vs. pandas 0.24.2; benchmark setup: inner merge]
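The benchmarked merge is the same call in both libraries; a minimal sketch with illustrative column values:

import cudf

left = cudf.DataFrame({'key': [0, 1, 2, 3], 'x': [1.0, 2.0, 3.0, 4.0]})
right = cudf.DataFrame({'key': [1, 2, 3, 4], 'y': [10.0, 20.0, 30.0, 40.0]})

# Identical API to pandas.DataFrame.merge, executed on the GPU
joined = left.merge(right, on='key', how='inner')

Swapping cudf for pandas yields the CPU baseline with no other changes.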
Machine Learning
More models, more problems.
[The RAPIDS stack diagram again, highlighting cuML for machine learning alongside cuDF/cuIO analytics, cuGraph graph analytics, PyTorch/Chainer/MxNet deep learning, and cuXfilter <> pyViz visualization, all on GPU memory with Dask]
Problem: data sizes continue to grow.
It is better to start with as much data as possible and explore / preprocess to scale to performance needs.
The typical iterative workflow over a massive dataset, where time increases at every step (hours? days?):
• Histograms / distributions
• Dimension reduction
• Feature selection
• Remove outliers
• Sampling
• Iterate. Cross-validate & grid search. Iterate some more.
Algorithms
GPU-accelerated Scikit-Learn:
• Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
• Clustering: K-Means, DBSCAN, Spectral Clustering
• Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
• Time Series: Holt-Winters, Kalman Filtering
• Cross Validation and Hyper-parameter Tuning
A mix of preexisting algorithms and ones NEW for 0.9, with more to come.
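A brief sketch chaining two of the algorithms above through cuML's scikit-learn-style API; the data and parameter values are illustrative assumptions:

import numpy as np
import cudf
from cuml import KMeans, PCA

# Random feature matrix held in GPU memory (illustrative data)
X = cudf.DataFrame({'f%d' % i: np.random.rand(1000) for i in range(8)})

# Dimensionality reduction, scikit-learn style
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering on the reduced features
labels = KMeans(n_clusters=5).fit_predict(X_2d)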
RAPIDS matches common Python APIs
CPU-Based Clustering

from sklearn.datasets import make_moons
import pandas

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
X = pandas.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find Clusters

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
# scikit-learn's DBSCAN has no standalone predict(); fit_predict() fits and returns labels
y_hat = dbscan.fit_predict(X)
RAPIDS matches common Python APIs
GPU-Accelerated Clustering

from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find Clusters

from cuml import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
# cuML mirrors the scikit-learn API; fit_predict() returns the cluster labels
y_hat = dbscan.fit_predict(X)
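The only differences from the CPU version are the imports and the DataFrame type. Assuming the labels come back as a cuDF Series on the GPU, one extra line moves them to the host when a CPU library needs them:

# Copy GPU-resident labels back to host memory as a pandas Series
y_host = y_hat.to_pandas()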
Benchmarks: single-GPU cuML vs. scikit-learn
[Chart: 1x V100 vs. 2x 20-core CPU]
GPU-ACCELERATED
XGBOOST
XGBOOST: THE WORLD’S MOST POPULAR MACHINE LEARNING ALGORITHM
Versatile and High Performance
HOW CAN XGBOOST BE IMPROVED?
XGBoost Performance is Constrained by CPU Limitations
GPU-ACCELERATED XGBOOST
Unleashing the Power of NVIDIA GPUs for Users of XGBoost
Lower Costs: reduce infrastructure investment and save money with improved business forecasting.
Easy to Use: works seamlessly with the RAPIDS open-source data processing and machine learning libraries and ecosystem for end-to-end GPU-accelerated workflows.
USE WITH MINIMAL CODE CHANGES
GPU-Acceleration with the Same XGBoost Usage

BEFORE (CPU):
import xgboost as xgb
dtrain = xgb.DMatrix(X, y)
bst = xgb.train(params, dtrain)

AFTER (GPU):
import xgboost as xgb
dtrain = xgb.DMatrix(X, y)
params['tree_method'] = 'gpu_hist'  # select XGBoost's GPU training algorithm
bst = xgb.train(params, dtrain)
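A self-contained version of the AFTER column, with synthetic data; the dataset and parameter values are assumptions for illustration, and 'gpu_hist' requires an XGBoost build with GPU support:

import numpy as np
import xgboost as xgb

# Synthetic regression problem (illustrative only)
X = np.random.rand(10000, 50).astype(np.float32)
y = X.sum(axis=1) + np.random.normal(scale=0.1, size=10000)

dtrain = xgb.DMatrix(X, y)
params = {
    'objective': 'reg:squarederror',
    'max_depth': 8,
    'tree_method': 'gpu_hist',  # change to 'hist' for the CPU baseline
}
bst = xgb.train(params, dtrain, num_boost_round=100)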
XGBOOST: GPU VS. CPU
Tremendous Performance Improvements and Better Accuracy
Faster training improves accuracy by leaving time for more iterations and for hyperparameter search, while reducing scale-out needs.
A single DGX-2 with GPU-accelerated XGBoost is 10x faster than 100 CPU nodes.
TRADITIONAL DATA SCIENCE CLUSTER
Workload profile, Fannie Mae mortgage data:
• 192 GB data set
• 16 years, 68 quarters
• 34.7 million single-family mortgage loans
• 1.85 billion performance records
• XGBoost training set: 50 features
GPU-ACCELERATED DATA SCIENCE CLUSTER
GPU-accelerated XGBoost with DGX-2
1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space | 1/18 the Power
[Chart: end-to-end runtime for 20, 30, 50, and 100 CPU nodes vs. 1 DGX-2 and 5x DGX-1; horizontal axis 0 to 10,000]
DISTRIBUTED XGBOOST
GPU-Accelerated XGBoost for Large Scale Workloads
Explore and prototype models on a PC, workstation, server, or cloud instance, and scale to two or more nodes for production training.
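A sketch of what scaling out can look like with XGBoost's Dask integration; the LocalCUDACluster, file pattern, and column name are illustrative assumptions (a multi-node deployment would point the Client at a distributed scheduler instead):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import xgboost as xgb

# One Dask worker per local GPU; a production cluster would span nodes
client = Client(LocalCUDACluster())

# Partitioned CSVs loaded directly into GPU memory (paths are illustrative)
ddf = dask_cudf.read_csv('mortgage-*.csv')
X = ddf.drop(columns=['label'])
y = ddf['label']

# Distributed DMatrix and training via the xgboost.dask module
dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {'objective': 'reg:squarederror', 'tree_method': 'gpu_hist'},
    dtrain,
    num_boost_round=100,
)
booster = result['booster']  # the trained model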
LEARN MORE ABOUT GPU-ACCELERATED XGBOOST
rapids.ai/xgboost.html | rapids.ai/dask.html
SOFTWARE - NGC
NGC: GPU-OPTIMIZED SOFTWARE HUB
Ready-to-run GPU-optimized software, anywhere
Designed for enterprise & HPC, with Docker & Singularity support for sysadmins & DevOps
NVIDIA PLATFORM FOR AI
João Paulo Navarro – Solutions Architect
[email protected]