XGBoost
Collecting Data
First of all, just like what you do with any other case, you are going to import the Boston
Housing dataset and store it in a variable called boston. To import it from scikit-learn you will
need to run this snippet.
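A minimal snippet for this looks like the following (note that load_boston shipped with the scikit-learn versions this tutorial was written against; it has since been deprecated and removed in newer releases):

from sklearn.datasets import load_boston

boston = load_boston()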
The boston variable itself is a dictionary, so you can check for its keys using the .keys()
method.
print(type(boston))
## <class 'sklearn.utils.Bunch'>
print(boston.keys())
You can easily check for its shape by using the boston.data.shape attribute, which will return
the size of the dataset.
print(boston.data.shape)
## (506, 13)
As you can see, it returned (506, 13), which means there are 506 rows of data with 13 columns.
Now, if you want to know what the 13 columns are, you can simply use the .feature_names
attribute and it will return the feature names.
print(boston.feature_names)
## ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
## 'B' 'LSTAT']
The description of the dataset is available in the dataset itself. You can take a look at it using
.DESCR.
print(boston.DESCR)
## .. _boston_dataset:
##
## Boston house prices dataset
## ---------------------------
##
## **Data Set Characteristics:**
##
##     :Number of Instances: 506
##
##     :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
##
##     :Attribute Information (in order):
##         - CRIM     per capita crime rate by town
##         - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
##         - INDUS    proportion of non-retail business acres per town
##         - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
##         - NOX      nitric oxides concentration (parts per 10 million)
##         - RM       average number of rooms per dwelling
##         - AGE      proportion of owner-occupied units built prior to 1940
##         - DIS      weighted distances to five Boston employment centres
##         - RAD      index of accessibility to radial highways
##         - TAX      full-value property-tax rate per $10,000
##         - PTRATIO  pupil-teacher ratio by town
##         - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
##         - LSTAT    % lower status of the population
##         - MEDV     Median value of owner-occupied homes in $1000's
##
##     :Missing Attribute Values: None
##
##     :Creator: Harrison, D. and Rubinfeld, D.L.
##
## This is a copy of UCI ML housing dataset.
## https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
##
## This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
##
## The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
## prices and the demand for clean air', J. Environ. Economics & Management,
## vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
## ...', Wiley, 1980. N.B. Various transformations are used in the table on
## pages 244-261 of the latter.
##
## The Boston house-price data has been used in many machine learning papers that address regression problems.
##
## .. topic:: References
##
##    - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
##    - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Now let's convert it into a pandas DataFrame! For that you need to import the pandas library and call the DataFrame() function, passing boston.data as the argument. To label the columns, set the .columns attribute of the pandas DataFrame to boston.feature_names.
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
type(boston.data)
## <class 'numpy.ndarray'>
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
Explore the top 5 rows of the dataset by using head() method on your pandas DataFrame.
data.head()
You’ll notice that there is no column called PRICE in the DataFrame. This is because the target
column is available in another attribute called boston.target. Append boston.target to your
pandas DataFrame.
data['PRICE'] = boston.target
Run the .info() method on your DataFrame to get useful information about the data.
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 506 entries, 0 to 505
## Data columns (total 14 columns):
## CRIM 506 non-null float64
## ZN 506 non-null float64
## INDUS 506 non-null float64
## CHAS 506 non-null float64
## NOX 506 non-null float64
## RM 506 non-null float64
## AGE 506 non-null float64
## DIS 506 non-null float64
## RAD 506 non-null float64
## TAX 506 non-null float64
## PTRATIO 506 non-null float64
## B 506 non-null float64
## LSTAT 506 non-null float64
## PRICE 506 non-null float64
## dtypes: float64(14)
## memory usage: 55.4 KB
Note that describe() only gives summary statistics of columns which are continuous in nature
and not categorical.
data.describe()
If you plan to use XGBoost on a dataset which has categorical features you may want to
consider applying some encoding (like one-hot encoding) to such features before training the
model. Also, if you have missing values such as NA in the dataset, you may or may not need to treat them separately, because XGBoost is capable of handling missing values internally. You can check out this link if you wish to know more on this.
Without delving into more exploratory analysis and feature engineering, you will now focus on
applying the algorithm to train the model on this data.
You will build the model using Trees as base learners (which are the default base learners)
using XGBoost's scikit-learn compatible API. Along the way, you will also learn some of the common tuning parameters that XGBoost provides to improve the model's performance, and you will use the root mean squared error (RMSE) metric to check the performance of the trained model on the test set. Root mean squared error is the square root of the mean of the squared differences between the actual and the predicted values. As usual,
you start by importing the library xgboost and other important libraries that you will be using
for building the model.
Note that you can install Python libraries like xgboost on your system using pip install xgboost from the command line.
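The imports used in the rest of this tutorial are roughly the following (xgboost plus a few numpy and scikit-learn helpers):

import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error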
Separate the target variable and rest of the variables using .iloc to subset the data.
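Assuming PRICE is the last column of the DataFrame, one way to do this is:

X, y = data.iloc[:, :-1], data.iloc[:, -1]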
Now you will convert the dataset into an optimized data structure called DMatrix, which XGBoost supports and which gives it its acclaimed performance and efficiency gains. You will use this later in the tutorial.
data_dmatrix = xgb.DMatrix(data=X,label=y)
XGBoost’s hyperparameters
At this point, before building the model, you should be aware of the tuning parameters that
XGBoost provides. Well, there are a plethora of tuning parameters for tree-based learners in
XGBoost and you can read all about them here. But the most common ones that you should
know are:
max_depth: determines how deeply each tree is allowed to grow during any boosting round.
subsample: percentage of samples used per tree. Low value can lead to underfitting.
colsample_bytree: percentage of features used per tree. High value can lead to overfitting.
objective: determines the loss function to be used, e.g. reg:linear for regression problems, reg:logistic for classification problems when you want only the decision (class label), and binary:logistic for classification problems when you want the probability.
XGBoost also supports regularization parameters to penalize models as they become more
complex and reduce them to simple (parsimonious) models.
gamma: controls whether a given node will split based on the expected reduction in loss after
the split. A higher value leads to fewer splits. Supported only for tree-based learners.
It’s also worth mentioning that though you are using trees as your base learners, you can also
use XGBoost’s relatively less popular linear base learners and one other tree learner known as
dart. All you have to do is set the booster parameter to gbtree (default), gblinear, or dart.
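For instance, a sketch of a regressor that uses the linear base learner (the xg_lin name and the n_estimators value are just illustrative) would be:

xg_lin = xgb.XGBRegressor(objective='reg:linear', booster='gblinear', n_estimators=10)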
The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor()
class from the XGBoost library with the hyper-parameters passed as arguments. For
classification problems, you would have used the XGBClassifier() class.
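A sketch of the train/test split and the regressor instantiation (the hyper-parameter values here are illustrative rather than tuned) might look like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)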
Fit the regressor to the training set and make predictions on the test set using the familiar .fit()
and .predict() methods.
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
Compute the RMSE by invoking the mean_squared_error function from sklearn's metrics module.
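For example:

rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))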
## RMSE: 10.517005
Well, you can see that your RMSE for the price prediction came out to be around 10.5 per $1000.
early_stopping_rounds: finishes training of the model early if the hold-out metric (“rmse” in our
case) does not improve for a given number of rounds.
This time you will create a hyper-parameter dictionary params which holds all the hyper-parameters and their values as key-value pairs, but will exclude n_estimators from the dictionary because you will use num_boost_round instead.
You will use these parameters to build a 3-fold cross validation model by invoking XGBoost’s
cv() method and store the results in a cv_results DataFrame. Note that here you are using the
DMatrix object you created before.
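A sketch of the dictionary and the cross-validation call (the parameter values are illustrative) could be:

params = {'objective': 'reg:linear', 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics='rmse', as_pandas=True, seed=123)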
cv_results contains train and test RMSE metrics for each boosting round.
cv_results.head()
cv_results.tail()
print((cv_results["test-rmse-mean"]).tail(1))
## 49 3.862102
## Name: test-rmse-mean, dtype: float64
You can see that your RMSE for the price prediction has reduced compared to last time and came out to be around 3.86 per $1000. You can reach an even lower RMSE for a different set
of hyper-parameters. You may consider applying techniques like Grid Search, Random Search
and Bayesian Optimization to reach the optimal set of hyper-parameters.
Another way to visualize your XGBoost models is to examine the importance of each feature
column in the original dataset within the model.
One simple way of doing this involves counting the number of times each feature is split on
across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph,
with the features ordered according to how many times they appear. XGBoost has a
plot_importance() function that allows you to do exactly this.
import matplotlib.pyplot as plt

xgb.plot_importance(xg_reg)
plt.rcParams['figure.figsize'] = [5, 5]
plt.show()
As you can see the feature RM has been given the highest importance score among all the
features. Thus XGBoost also gives you a way to do Feature Selection. Isn’t this brilliant?
Conclusion
You have reached the end of this tutorial. I hope it has helped you, or will help you, in some way or another. You started off with understanding how Boosting works in general and then
narrowed down to XGBoost specifically. You also practiced applying XGBoost on an open
source dataset and along the way you learned about its hyper-parameters, doing cross-
validation, visualizing the trees and in the end how it can also be used as a Feature Selection
technique. Whoa!! that’s something for starters, but there is so much to explore in XGBoost
that it can’t be covered in a single tutorial. If you would like to learn more, be sure to take a
look at our Extreme Gradient Boosting with XGBoost course on DataCamp.