Out-of-Core GPU Gradient Boosting
Rong Ou
NVIDIA
Santa Clara, CA, USA
[email protected]
ABSTRACT

GPU-based algorithms have greatly accelerated many machine learning methods; however, GPU memory is typically smaller than main memory, limiting the size of the training data. In this paper, we describe an out-of-core GPU gradient boosting algorithm implemented in the XGBoost library. We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time. To the best of our knowledge, this is the first out-of-core GPU implementation of gradient boosting. Similar approaches can be applied to other machine learning algorithms.

CCS CONCEPTS

• Computing methodologies → Machine learning; Graphics processors; • Information systems → Hierarchical storage management.

KEYWORDS

GPU, out-of-core algorithms, gradient boosting, machine learning

1 INTRODUCTION

Gradient boosting [7] is a popular machine learning method for supervised learning tasks such as classification, regression, and ranking. A prediction model is built sequentially out of an ensemble of weak prediction models, typically decision trees. With bigger datasets and deeper trees, training time can become substantial.

Graphics Processing Units (GPUs), originally designed to speed up the rendering of display images, have proven to be powerful accelerators for many parallel computing tasks, including machine learning. GPU-based implementations [4, 6, 15] exist for several open-source gradient boosting libraries [3, 10, 14] and significantly lower training time.

Because GPU memory has higher bandwidth and lower latency, it tends to cost more and is therefore typically smaller than main memory. For example, on Amazon Web Services (AWS), a p3.2xlarge instance has one NVIDIA Tesla V100 GPU with 16 GiB of memory, but 61 GiB of main memory. On Google Cloud Platform (GCP), a similar instance can have as much as 78 GiB of main memory. Training on large datasets can therefore cause GPU out-of-memory errors even when there is plenty of main memory available.

XGBoost, a widely used gradient boosting library, has experimental support for external memory [5], which allows training on datasets that do not fit in main memory.¹ Building on top of this feature, we designed and implemented out-of-core GPU algorithms that extend XGBoost's external memory support to GPUs. This is challenging because GPUs are typically connected to the rest of the computer system through the PCI Express (PCIe) bus, which has lower bandwidth and higher latency than the main memory bus. A naive approach that constantly swaps data in and out of GPU memory would cause too much slowdown, negating the performance gain from GPUs.

By carefully structuring the data access patterns, and by leveraging gradient-based sampling to reduce the working memory size, we were able to significantly increase the size of training data accommodated by a given GPU, with minimal impact on model accuracy and training time.

¹In this paper, "out-of-core" and "external memory" are used interchangeably.

2 BACKGROUND

In this section we review the gradient boosting algorithm as implemented by XGBoost, its GPU variant, and the previous CPU-only external memory support. We also describe the sampling approaches used to reduce the memory footprint.

2.1 Gradient Boosting

Given a dataset with n samples {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^m is a vector of m input features and y_i ∈ R is the label, a decision tree model predicts the label:

\[ \hat{y}_i = F(x_i) = \sum_{k=1}^{K} f_k(x_i), \tag{1} \]

where f_k ∈ F, the space of regression trees, and K is the number of trees. To learn a model, we minimize the following regularized objective:

\[ L(F) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \tag{2} \]

\[ \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2. \tag{3} \]

Here l is a differentiable loss function, and Ω is a regularization term that penalizes the number of leaves T and the leaf weights w of a tree, controlled by the two hyperparameters γ and λ.

The model is trained sequentially. Let \( \hat{y}_i^{(t)} \) be the prediction at the t-th iteration; we need to find the tree f_t that minimizes

\[ L^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t). \tag{4} \]

Its second-order Taylor expansion is

\[ L^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t), \tag{5} \]

where g_i and h_i are the first- and second-order gradients of the loss function with respect to \( \hat{y}^{(t-1)} \). For a given tree structure q(x), let \( I_j = \{ i \mid q(x_i) = j \} \) be the set of samples that fall into leaf j. The optimal weight \( w_j^* \) of leaf j can be computed as

\[ w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \tag{6} \]
and the corresponding optimal loss value is

\[ \tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \tag{7} \]

When constructing an individual tree, we start from a single leaf and greedily add branches to the tree. Let I_L and I_R be the sets of samples that fall into the left and right child nodes after a split; the loss reduction for the split is then

\[ L_{split} = \frac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma, \tag{8} \]

where I = I_L ∪ I_R.
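To make these formulas concrete, the following NumPy sketch evaluates the optimal leaf weight of Eq. (6) and the split gain of Eq. (8) on toy data. It illustrates the math only, not XGBoost's implementation; the squared-error gradients (g_i = ŷ_i − y_i, h_i = 1) are an assumption chosen for the example.

    import numpy as np

    def leaf_weight(g, h, lam):
        # Optimal leaf weight, Eq. (6): -sum(g) / (sum(h) + lambda).
        return -g.sum() / (h.sum() + lam)

    def split_gain(g, h, left_mask, lam, gamma):
        # Loss reduction of a candidate split, Eq. (8).
        def score(gs, hs):
            return gs.sum() ** 2 / (hs.sum() + lam)
        return 0.5 * (score(g[left_mask], h[left_mask])
                      + score(g[~left_mask], h[~left_mask])
                      - score(g, h)) - gamma

    # Toy data under squared-error loss: g_i = yhat_i - y_i, h_i = 1.
    y = np.array([1.0, 1.2, 3.0, 3.1])
    yhat = np.zeros_like(y)                  # predictions before this tree
    g, h = yhat - y, np.ones_like(y)
    left = np.array([True, True, False, False])
    print(split_gain(g, h, left, lam=1.0, gamma=0.0))  # ~0.12, positive: split helps
    print(leaf_weight(g[left], h[left], lam=1.0))      # ~0.73, pulls leaf toward labels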
2.2 GPU Tree Construction

The GPU tree construction algorithm in XGBoost [11, 12] relies on a two-step process. First, in a preprocessing step, each input feature is divided into quantiles and put into bins (max_bin defaults to 256). The bin indices are then compressed into ELLPACK format, greatly reducing the size of the training data. This step is time consuming, so it is done only once, at the beginning of training.
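To illustrate the quantile binning step, the sketch below bins one feature column with NumPy. This is a simplification under stated assumptions: XGBoost computes its cuts with a sketching algorithm in C++/CUDA and packs the resulting bin indices into the compressed ELLPACK matrix, which plain NumPy does not capture.

    import numpy as np

    def quantile_bin(feature, max_bin=256):
        # Interior cut points at evenly spaced quantiles of the feature.
        cuts = np.quantile(feature, np.linspace(0, 1, max_bin + 1)[1:-1])
        cuts = np.unique(cuts)               # drop duplicate cut values
        return np.digitize(feature, cuts)    # bin index in [0, len(cuts)]

    x = np.random.default_rng(0).normal(size=10_000)
    bins = quantile_bin(x)
    assert bins.max() < 256                  # each index fits in one byte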
Algorithm 1: GPU Tree Construction

Input: X: training examples
Input: g: gradient pairs for training examples
Output: tree: set of output nodes
tree ← { }
queue ← InitRoot()
while queue is not empty do
    entry ← queue.pop()
    tree.insert(entry)
    // Sort samples into leaf nodes
    RepartitionInstances(entry, X)
    // Build gradient histograms
    BuildHistograms(entry, X, g)
    // Find the optimal split for children
    left_entry ← EvaluateSplit(entry.left_histogram)
    right_entry ← EvaluateSplit(entry.right_histogram)
    queue.push(left_entry)
    queue.push(right_entry)

2.3 External Memory

In XGBoost's existing CPU-only external memory mode [5], the training data is split into pages that reside on disk; during tree construction, the data pages are streamed from disk via a multi-threaded pre-fetcher.
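The following Python sketch shows the pre-fetching pattern: a background thread fills a bounded queue with pages while the consumer works on the current page. The real pre-fetcher is implemented in C++ inside XGBoost; load_page and the single worker thread here are illustrative assumptions.

    import queue
    import threading

    def stream_pages(page_paths, load_page, capacity=2):
        # Bounded queue: at most `capacity` pages are held in memory at once.
        buf = queue.Queue(maxsize=capacity)
        sentinel = object()

        def worker():
            for path in page_paths:
                buf.put(load_page(path))     # blocks while the buffer is full
            buf.put(sentinel)                # signal end of data

        threading.Thread(target=worker, daemon=True).start()
        while (page := buf.get()) is not sentinel:
            yield page                       # consume this page while the next loads

    # Usage sketch:
    # for page in stream_pages(paths, load_page):
    #     build_histograms(page)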
2.4 Sampling

In its default setting, gradient boosting is a batch algorithm: the whole dataset needs to be read and processed to construct each tree. Different sampling approaches have been proposed, mainly as an additional regularization factor to get better generalization performance, but they can also reduce the computation needed, leading to faster training.

2.4.1 Stochastic Gradient Boosting (SGB). Shortly after introducing gradient boosting, Friedman [8] proposed an improvement: at each iteration, a subsample of the training data is drawn at random without replacement from the full training dataset. This randomly selected subsample is then used in place of the full sample to construct the decision tree and compute the model update for the current iteration. This sampling approach was shown to improve model accuracy; however, the sampling ratio f needs to stay relatively high, 0.5 ≤ f ≤ 0.8, for the improvement to occur.
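As a sketch, one SGB iteration just draws f · n row indices without replacement (illustrative NumPy, not XGBoost's sampler):

    import numpy as np

    rng = np.random.default_rng(0)
    n, f = 1_000_000, 0.5                                 # f is the sampling ratio
    rows = rng.choice(n, size=int(f * n), replace=False)  # rows used for one tree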
2.4.2 Gradient-based One-Side Sampling (GOSS). Ke et al. proposed a sampling strategy weighted by the absolute value of the gradients [10]. At the beginning of each iteration, the top a × 100% of training instances with the largest gradients are selected; from the rest of the data, a random sample of b × 100% instances is then drawn. The sampled instances are scaled by (1 − a)/b to keep the gradient statistics unbiased. Compared to SGB, GOSS can sample more aggressively, using only 10%–20% of the data to achieve similar model accuracy.
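A minimal sketch of the GOSS selection step (illustrative NumPy; LightGBM's actual implementation differs in detail):

    import numpy as np

    def goss_sample(g, a, b, rng):
        n = len(g)
        order = np.argsort(-np.abs(g))            # rows sorted by |gradient|
        top = order[:int(a * n)]                  # keep the large-gradient rows
        sampled = rng.choice(order[int(a * n):], size=int(b * n), replace=False)
        weights = np.ones(len(top) + len(sampled))
        weights[len(top):] = (1 - a) / b          # rescale to keep statistics unbiased
        return np.concatenate([top, sampled]), weights

    rng = np.random.default_rng(0)
    idx, w = goss_sample(rng.normal(size=1000), a=0.1, b=0.1, rng=rng)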
2.4.3 Minimal Variance Sampling (MVS). Ibragimov and Gusev [9] proposed another gradient-based sampling approach that aims to minimize the variance of the model. At each iteration, the whole dataset is sampled with probability proportional to the regularized absolute value of the gradients:

\[ \hat{g}_i = \sqrt{g_i^2 + \lambda h_i^2}, \tag{9} \]

where g_i and h_i are the first- and second-order gradients, and λ can be either a hyperparameter or estimated from the squared mean of the initial leaf value. MVS was shown to perform better than both SGB and GOSS, with sampling rates as low as 10%.
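A sketch of the sampling probabilities implied by Eq. (9). The scaling to hit an expected sample rate and the inverse-probability weights are illustrative assumptions; MVS computes an exact threshold instead.

    import numpy as np

    def mvs_probabilities(g, h, lam, sample_rate):
        g_hat = np.sqrt(g ** 2 + lam * h ** 2)     # regularized gradient, Eq. (9)
        p = g_hat * (sample_rate * len(g) / g_hat.sum())
        return np.minimum(p, 1.0)                  # probabilities are capped at 1

    rng = np.random.default_rng(0)
    g, h = rng.normal(size=1000), np.ones(1000)
    p = mvs_probabilities(g, h, lam=0.1, sample_rate=0.1)
    keep = rng.random(1000) < p                    # rows kept for this tree
    weights = 1.0 / p[keep]                        # reweight to keep gradient sums unbiased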
Algorithm 7: Out-of-Core GPU Tree Construction with Sampling

Input: X: training examples
Input: g: gradient pairs for training examples
Output: tree: set of output nodes
g′ ← Sample(g)
AllocateOnGPU(sampled_page)
foreach ellpack_page in X do
    Compact(sampled_page, ellpack_page)
// Use the in-core algorithm
tree ← BuildTree(sampled_page, g′)
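In the XGBoost Python API, out-of-core GPU training with gradient-based sampling is enabled roughly as follows (a sketch assuming a GPU-enabled XGBoost build, version 1.1 or later; the file name and cache prefix are placeholders):

    import xgboost as xgb

    # The '#' suffix asks XGBoost to cache the data in on-disk pages.
    dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')

    params = {
        'tree_method': 'gpu_hist',            # GPU tree construction
        'sampling_method': 'gradient_based',  # gradient-based sampling (Section 2.4.3)
        'subsample': 0.1,                     # sampling ratio f
    }
    booster = xgb.train(params, dtrain, num_boost_round=500)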
4.1 Data Size

A synthetic dataset with 500 columns is generated using Scikit-learn [13]. The measurements are done on a Google Cloud Platform (GCP) instance with an NVIDIA Tesla V100 GPU (16 GiB). Table 1 shows the maximum number of rows that can be accommodated in each mode before hitting an out-of-memory error.

Table 1: Maximum Data Size

    Mode                          # Rows
    In-core GPU                   9 million
    Out-of-core GPU               13 million
    Out-of-core GPU, f = 0.1      85 million

Combined with gradient-based sampling, the out-of-core mode allows an order-of-magnitude bigger dataset to be trained on a given GPU. For reference, the 85-million-row, 500-column dataset is 903 GiB on disk in LibSVM format [2], and can be trained successfully on a single 16 GiB GPU using a sampling ratio of 0.1.
4.2 Model Accuracy

When not sampling the data, the out-of-core GPU algorithm is equivalent to the in-core version. With sampling, the size of the data that can fit on a given GPU is increased; ideally, this should not change the generalization performance of the trained model. Figure 1 shows the training curves on the Higgs dataset [1]. Models with different sampling rates performed similarly, with accuracy dropping only slightly at f = 0.1. For a more detailed evaluation of MVS, see [9].
4.3 Training Time

For end-to-end training time, the Higgs dataset is used, split randomly 0.95/0.05 for training and evaluation. All XGBoost parameters use their default values, except that max_depth is increased to 8 and learning_rate is lowered to 0.1. Training is run for 500 iterations. The hardware is a desktop computer with an Intel Core i7-5820K processor, 32 GB of main memory, and an NVIDIA Titan V with 12 GiB of memory.
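The following sketch mirrors that configuration (synthetic data stands in for the Higgs dataset; the objective and eval metric are assumptions for the binary task, since only the non-default parameters are stated above):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 28))           # stand-in for the 28-feature Higgs data
    y = (rng.random(100_000) < 0.5).astype(int)

    split = int(0.95 * len(y))                   # random 0.95/0.05 train/eval split
    dtrain = xgb.DMatrix(X[:split], label=y[:split])
    deval = xgb.DMatrix(X[split:], label=y[split:])

    params = {
        'tree_method': 'gpu_hist',               # 'hist' for the CPU runs
        'objective': 'binary:logistic',          # assumption for the binary task
        'eval_metric': 'auc',
        'max_depth': 8,                          # raised from the default of 6
        'learning_rate': 0.1,                    # lowered from the default of 0.3
    }
    booster = xgb.train(params, dtrain, num_boost_round=500, evals=[(deval, 'eval')])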
Table 2 shows the training time and evaluation AUC for the different modes. Although out-of-core GPU training is slower than the in-core version when sampling is enabled, it is still significantly faster than the CPU-based algorithm.

Table 2: Training Time on the Higgs Dataset

    Mode                          Time (seconds)    AUC
    CPU In-core                   1309.64           0.8393
    CPU Out-of-core               1228.53           0.8393
    GPU In-core                   241.52            0.8398
    GPU Out-of-core, f = 1.0      211.91            0.8396
    GPU Out-of-core, f = 0.5      427.41            0.8395
    GPU Out-of-core, f = 0.3      421.59            0.8399
5 DISCUSSION

Faced with the explosive growth of data, GPUs have proven to be an excellent choice for speeding up machine learning tasks. However, the relatively small size of GPU memory puts a constraint on how much data can be handled on a single GPU. To train on larger datasets, distributed algorithms can be used to share the workload among multiple machines with multiple GPUs. But setting up and managing a distributed GPU cluster is expensive, both in hardware and networking cost and in system administration overhead. It is therefore desirable to relax the GPU memory constraint on a single machine, to allow for easier experimentation with larger datasets.

Because of the PCIe bottleneck, out-of-core GPU computation remains a challenge. A naive implementation that simply spills data over to main memory or disk would likely be too slow to be useful; an out-of-core GPU algorithm that is slower than the CPU version has little point. Only by pursuing algorithmic changes, as we have done here with gradient-based sampling, can out-of-core GPU computation become competitive. The sampling approach may be applicable to other machine learning algorithms; this is left as possible future work.

Working with XGBoost also presented unique software engineering challenges. It is a popular open-source project with many contributors, ranging from students and data scientists to machine learning software engineers, and code quality varies between different parts of the code base. To support existing users, many of whom run XGBoost in production, care must be taken to preserve current behavior and to plan breaking changes carefully. Much of the effort in this project was spent on refactoring the code to make it easier to add new behaviors.

6 CONCLUSION

In this paper we presented the first out-of-core GPU gradient boosting implementation. This approach greatly expands the size of training data that can fit on a given GPU, without sacrificing model accuracy or training time. The source code changes are merged into the open-source XGBoost library, and are available for production use and further research.

ACKNOWLEDGMENTS

We would like to thank Rory Mitchell and Jiaming Yuan for helpful design discussions and careful code reviews. Special thanks to Sriram Chandramouli for helping with the implementation, and to Philip Hyunsu Cho for maintaining XGBoost's continuous build system.
[Figure 1: Evaluation AUC over 500 training iterations on the Higgs dataset, for sampling ratios f = 0.1 through 1.0.]
REFERENCES

[1] P. Baldi, P. Sadowski, and D. Whiteson. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5 (2014), 4308. https://doi.org/10.1038/ncomms5308 arXiv:hep-ph/1402.4735
[2] C. Chang and C. Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 1–27.
[3] T. Chen and C. Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[4] Microsoft Corporation. 2020. LightGBM GPU tutorial. Retrieved February 8, 2020 from https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html
[5] XGBoost developers. 2020. Using XGBoost external memory version (beta). Retrieved February 8, 2020 from https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html
[6] XGBoost developers. 2020. XGBoost GPU support. Retrieved February 8, 2020 from https://xgboost.readthedocs.io/en/latest/gpu/
[7] J. H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 5 (2001), 1189–1232. https://doi.org/10.1214/aos/1013203451
[8] J. H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.
[9] B. Ibragimov and G. Gusev. 2019. Minimal variance sampling in stochastic gradient boosting. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 15061–15071. http://papers.nips.cc/paper/9645-minimal-variance-sampling-in-stochastic-gradient-boosting.pdf
[10] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 3146–3154. http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
[11] R. Mitchell, A. Adinets, T. Rao, and E. Frank. 2018. XGBoost: Scalable GPU accelerated learning. CoRR abs/1806.11248 (2018). arXiv:1806.11248 http://arxiv.org/abs/1806.11248
[12] R. Mitchell and E. Frank. 2017. Accelerating the XGBoost algorithm using GPU computing. PeerJ Computer Science 3 (2017), e127.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[14] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. 2018. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 6638–6648. http://papers.nips.cc/paper/7898-catboost-unbiased-boosting-with-categorical-features.pdf
[15] CatBoost team. 2020. Training on GPU. Retrieved February 8, 2020 from https://catboost.ai/docs/features/training-on-gpu.html