
XGBoost: A Scalable Tree Boosting System
Tianqi Chen and Carlos Guestrin, University of Washington
XGBoost

• eXtreme Gradient Boosting
• Of the 29 Kaggle challenges with winners in 2015:
  • 17 used XGBoost
  • 8 of these solely used XGBoost; the others combined XGBoost with DNNs
• KDDCup 2015
  • Every single top-10 finisher used XGBoost
XGBoost Applications

• Store sales prediction
• High-energy physics event classification
• Web text classification
• Customer behavior prediction
• Motion detection
• Ad click-through rate prediction
• Malware classification
• Product categorization
• Hazard risk prediction
• Massive online course dropout rate prediction
Properties of XGBoost

• Single most important factor in its success: scalability
• Due to several important systems and algorithmic optimizations:

1. Highly scalable end-to-end tree boosting system
2. Theoretically justified weighted quantile sketch for efficient proposal calculation
3. Novel sparsity-aware algorithm for parallel tree learning
4. Effective cache-aware block structure for out-of-core tree learning
What is “tree boosting”?

• Given a dataset (n examples, m features)
• A tree ensemble uses K additive functions to predict the output
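
Written out (following the formulation in the paper linked on the final slide), the ensemble prediction is:

    % Tree ensemble: K additive regression trees f_k from the function space F
    \hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}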
What is “gradient boosting”?

• Regularized objective function
• Second-order approximation of the objective
• Remove constant terms
• Scoring function to evaluate the quality of a tree structure

Regularized objective function
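
The equations on these slides were images in the original deck; the following LaTeX reconstruction follows the derivation in the linked paper (g_i and h_i denote the first- and second-order gradients of the loss):

    % Regularized objective: training loss plus a complexity penalty per tree
    % (T = number of leaves, w = vector of leaf weights)
    \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
    \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

    % Second-order approximation at step t, after removing constant terms
    \tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t),
    \quad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}),
    \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

    % Scoring function for a fixed tree structure q, with I_j the set of examples in leaf j
    \tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T}
        \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T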
Split-finding algorithms

• Exact
  • Computationally demanding
  • Enumerates all possible splits for continuous features (a sketch of the gain computation follows this list)
• Approximate
  • Proposes candidate splits according to percentiles of the feature distributions
  • Maps continuous features to buckets split by the candidate points
  • Aggregates statistics and finds the best solution among the proposals
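
To make the exact method concrete, here is a minimal Python sketch of greedy split finding on a single feature using the scoring function above; the function name and toy data are illustrative, not part of XGBoost's API:

    import numpy as np

    def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
        # Enumerate every candidate split of one feature (exact greedy method).
        # x: feature values; g, h: first- and second-order gradient statistics.
        order = np.argsort(x)                    # exact method scans sorted feature values
        xs, gs, hs = x[order], g[order], h[order]
        G, H = gs.sum(), hs.sum()                # gradient totals for the current node
        G_L = H_L = 0.0
        best_gain, best_threshold = 0.0, None
        for i in range(len(xs) - 1):
            G_L += gs[i]; H_L += hs[i]           # move example i into the left child
            if xs[i] == xs[i + 1]:               # cannot split between equal values
                continue
            G_R, H_R = G - G_L, H - H_L
            gain = 0.5 * (G_L**2 / (H_L + lam)
                          + G_R**2 / (H_R + lam)
                          - G**2 / (H + lam)) - gamma
            if gain > best_gain:
                best_gain = gain
                best_threshold = (xs[i] + xs[i + 1]) / 2.0
        return best_gain, best_threshold

    # Toy usage: for squared-error loss, g = prediction - y and h = 1
    x = np.array([1.0, 2.0, 3.0, 10.0])
    y = np.array([1.0, 1.0, 1.0, 5.0])
    g = -y                                       # initial prediction of 0 for every example
    h = np.ones_like(y)
    print(best_split_exact(x, g, h))

The approximate method evaluates the same gain, but only at the proposed candidate points instead of at every distinct feature value.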
Comparison of split-finding

• Two variants of the approximate algorithm
  • Global: candidate splits are proposed once, during the initial phase of tree construction, and reused at every level
  • Local: candidates are re-proposed after each split
Shrinkage and column subsampling

• Shrinkage
  • Scales newly added weights by a factor η
  • Reduces the influence of each individual tree
  • Leaves space for future trees to improve the model
  • Similar to a learning rate in stochastic optimization
• Column subsampling
  • Subsamples the features
  • Used in Random Forests
  • Prevents overfitting more effectively than row subsampling

Both techniques map onto ordinary training parameters; see the sketch after this list.
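
As a usage sketch (assuming the standard xgboost Python package; the parameter values are illustrative, not recommendations), shrinkage corresponds to eta and column subsampling to colsample_bytree:

    import numpy as np
    import xgboost as xgb

    # Illustrative random data, not from the slides
    rng = np.random.default_rng(0)
    X = rng.random((1000, 10))
    y = rng.integers(0, 2, size=1000)

    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "objective": "binary:logistic",
        "eta": 0.1,               # shrinkage factor applied to each new tree's leaf weights
        "colsample_bytree": 0.8,  # column subsampling: fraction of features sampled per tree
        "max_depth": 6,
    }
    booster = xgb.train(params, dtrain, num_boost_round=100)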
Sparsity-aware split finding

• Treats missing values as sparsity
• Defines a “default” direction for missing values at each node, learned from the observed (non-missing) data (a sketch follows below)
• Compare to the “dense” method
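
A minimal sketch of the idea, under the assumption that missing entries are encoded as NaN (illustrative only, not XGBoost's implementation): enumerate splits over the observed values and, for each candidate, choose the default direction for missing examples that yields the higher gain.

    import numpy as np

    def best_split_with_default(x, g, h, lam=1.0):
        # Sketch of sparsity-aware split finding: enumerate splits over the
        # non-missing values only; for each candidate, try routing the missing
        # examples left and right and keep the direction with the higher gain.
        missing = np.isnan(x)
        G_miss, H_miss = g[missing].sum(), h[missing].sum()
        xs, gs, hs = x[~missing], g[~missing], h[~missing]
        order = np.argsort(xs)
        xs, gs, hs = xs[order], gs[order], hs[order]
        G, H = g.sum(), h.sum()

        best = (0.0, None, None)                 # (gain, threshold, default direction)
        G_L = H_L = 0.0
        for i in range(len(xs) - 1):
            G_L += gs[i]; H_L += hs[i]
            if xs[i] == xs[i + 1]:
                continue
            for direction in ("left", "right"):
                gl = G_L + (G_miss if direction == "left" else 0.0)
                hl = H_L + (H_miss if direction == "left" else 0.0)
                gr, hr = G - gl, H - hl
                gain = 0.5 * (gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam))
                if gain > best[0]:
                    best = (gain, (xs[i] + xs[i + 1]) / 2.0, direction)
        return best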
How does this work?

• Features need to be in sorted order to determine splits
• Concept of blocks
  • Compressed column (CSC) format
  • Each column sorted by the corresponding feature value
• Exact greedy algorithm: all the data in a single block
• Data are sorted once before training and reused in this format (a small CSC illustration follows below)
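
A small illustration of the block idea using SciPy's CSC format (a simplification of XGBoost's internal layout): each column is sorted once by feature value, together with its row indices, and that order is reused for every subsequent split search.

    import numpy as np
    from scipy.sparse import csc_matrix

    # Small illustrative feature matrix; zero entries are simply absent in CSC storage
    X = csc_matrix(np.array([[1.0, 0.0, 3.0],
                             [4.0, 5.0, 0.0],
                             [0.0, 2.0, 6.0]]))

    # Sort each column once by feature value, keeping the row indices alongside;
    # later split searches reuse this presorted layout.
    sorted_columns = []
    for j in range(X.shape[1]):
        start, end = X.indptr[j], X.indptr[j + 1]
        rows, vals = X.indices[start:end], X.data[start:end]
        order = np.argsort(vals)
        sorted_columns.append((rows[order], vals[order]))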
Feature transformations in blocks
More on blocks

• Data is stored in multiple blocks, and these blocks are stored on disk
• Independent threads pre-fetch specific blocks into memory to prevent cache misses
• Block compression
  • Each column is compressed before being written to disk, and decompressed on-the-fly when read from disk into a prefetch buffer
  • Cuts down on disk I/O
• Block sharding
  • Data is split across multiple disks (i.e. on a cluster)
  • A pre-fetcher is assigned to each disk to read data into memory
Cache-aware access

• Exact greedy algorithm
  • Allocate an internal buffer in each thread
  • Fetch gradient statistics into the buffer
  • Perform accumulation in mini-batches
  • Reduces runtime overhead when the number of rows is large
• Approximate algorithms
  • Choice of block size is critical
  • A small block size results in small workloads for each thread
  • A large block size results in cache misses, as gradient statistics do not fit in the cache
Cache-aware access: results for the exact and approximate algorithms [figures]
Results: out of core
Results: distributed
Results: scalability
Demonstration

https://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
Conclusions

• Novel sparsity-aware algorithm for handling sparse data
• Theoretical guarantees for weighted quantile sketching for approximate learning
• Cache access patterns, data compression, and data sharding techniques

http://arxiv.org/abs/1603.02754
