Predicting Continuous Target Variables with Regression Analysis
Throughout the previous chapters, you learned a lot about the main concepts
behind supervised learning and trained many different models for classification tasks
to predict group memberships or categorical variables. In this chapter, we will take
a dive into another subcategory of supervised learning: regression analysis.
In this chapter, we will discuss the main concepts of regression models, covering topics such as exploring and visualizing a dataset, fitting linear regression models, dealing with outliers, evaluating regression models, and modeling nonlinear relationships.
Simple (univariate) linear regression models the relationship between a single explanatory variable x and a continuous target variable y via a linear equation:

$$y = w_0 + w_1 x$$

Here, the weight $w_0$ represents the y axis intercept and $w_1$ is the coefficient of
the explanatory variable. Our goal is to learn the weights of the linear equation to
describe the relationship between the explanatory variable and the target variable,
which can then be used to predict the responses of new explanatory variables that
were not part of the training dataset.
Based on the linear equation that we defined previously, linear regression can be
understood as finding the best-fitting straight line through the sample points, as
shown in the following figure:
This best-fitting line is also called the regression line, and the vertical lines from the
regression line to the sample points are the so-called offsets or residuals—the errors
of our prediction.
The special case of one explanatory variable is also called simple linear regression,
but of course we can also generalize the linear regression model to multiple
explanatory variables. Hence, this process is called multiple linear regression:
$$y = w_0 x_0 + w_1 x_1 + \ldots + w_m x_m = \sum_{i=0}^{m} w_i x_i = \mathbf{w}^T \mathbf{x}$$
The examples in this chapter are based on the Housing Dataset, which contains information about houses in the suburbs of Boston. Its 506 samples are described by 13 explanatory variables and the median house price (MEDV), as summarized in the dataset description hosted in the UCI Machine Learning Repository.
For the rest of this chapter, we will regard the housing prices (MEDV) as our
target variable—the variable that we want to predict using one or more of the 13
explanatory variables. Before we explore this dataset further, let's fetch it from the
UCI repository into a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-'
...                  'databases/housing/housing.data',
...                  header=None, sep='\s+')
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
... 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
... 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
>>> df.head()
To confirm that the dataset was loaded successfully, we displayed the first five lines
of the dataset, as shown in the following screenshot:
First, we will create a scatterplot matrix that allows us to visualize the pair-wise
correlations between the different features in this dataset in one place. To plot the
scatterplot matrix, we will use the pairplot function from the seaborn library
(http://stanford.edu/~mwaskom/software/seaborn/), which is a Python library
for drawing statistical plots based on matplotlib:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set(style='whitegrid', context='notebook')
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> sns.pairplot(df[cols])
>>> plt.show()
As we can see in the following figure, the scatterplot matrix provides us with a
useful graphical summary of the relationships in a dataset:
Due to space constraints and for purposes of readability, we only plotted five
columns from the dataset: LSTAT, INDUS, NOX, RM, and MEDV. However,
you are encouraged to create a scatterplot matrix of the whole DataFrame to
further explore the data.
Using this scatterplot matrix, we can now quickly eyeball how the data is distributed
and whether it contains outliers. For example, we can see that there is a linear
relationship between RM and the housing prices MEDV (the fifth column of the
fourth row). Furthermore, we can see in the histogram (the lower right subplot in
the scatter plot matrix) that the MEDV variable seems to be normally distributed
but contains several outliers.
To quantify the linear relationship between the features, we will now create a
correlation matrix. A correlation matrix is closely related to the covariance matrix
that we have seen in the section about principal component analysis (PCA) in
Chapter 4, Building Good Training Sets – Data Preprocessing. Intuitively, we can
interpret the correlation matrix as a rescaled version of the covariance matrix.
In fact, the correlation matrix is identical to a covariance matrix computed from
standardized data.
The correlation matrix is a square matrix that contains the Pearson product-moment
correlation coefficients (often abbreviated as Pearson's r), which measure the linear
dependence between pairs of features. The correlation coefficients are bounded to the range [-1, 1]. Two features have a perfect positive correlation if $r = 1$, no correlation if $r = 0$, and a perfect negative correlation if $r = -1$. As
mentioned previously, Pearson's correlation coefficient can simply be calculated as
the covariance between two features x and y (numerator) divided by the product
of their standard deviations (denominator):
$$r = \frac{\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right)}{\sqrt{\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)^2}\sqrt{\sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
The covariance between two standardized features is, in fact, equal to their linear correlation coefficient:

$$\sigma'_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x^{(i)} - \mu_x}{\sigma_x}\right)\left(\frac{y^{(i)} - \mu_y}{\sigma_y}\right) = \frac{1}{n \cdot \sigma_x \sigma_y}\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right) = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = r$$
In the following code example, we will use NumPy's corrcoef function on the five
feature columns that we previously visualized in the scatterplot matrix, and we will
use seaborn's heatmap function to plot the correlation matrix array as a heat map:
>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm,
... cbar=True,
... annot=True,
... square=True,
... fmt='.2f',
... annot_kws={'size': 15},
... yticklabels=cols,
... xticklabels=cols)
>>> plt.show()
As we can see in the resulting figure, the correlation matrix provides us with another
useful summary graphic that can help us to select features based on their respective
linear correlations:
To fit a linear regression model, we are interested in those features that have a high
correlation with our target variable MEDV. Looking at the preceding correlation
matrix, we see that our target variable MEDV shows the largest correlation with
the LSTAT variable (-0.74). However, as you might remember from the scatterplot
matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the
other hand, the correlation between RM and MEDV is also relatively high (0.70) and
given the linear relationship between those two variables that we observed in the
scatterplot, RM seems to be a good choice for an explanatory variable to introduce
the concepts of a simple linear regression model in the following section.
To fit a linear regression line, we will use the ordinary least squares (OLS) method to estimate the weights that minimize the sum of squared vertical distances (residuals) between the regression line and the sample points. This sum of squared errors (SSE) is our cost function:

$$J(w) = \frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
Here, $\hat{y}$ is the predicted value, $\hat{y} = \mathbf{w}^T \mathbf{x}$ (note that the term 1/2 is just used for
convenience to derive the update rule of GD). Essentially, OLS linear regression
can be understood as Adaline without the unit step function so that we obtain
continuous target values instead of the class labels -1 and 1. To demonstrate the
similarity, let's take the GD implementation of Adaline from Chapter 2, Training
Machine Learning Algorithms for Classification, and remove the unit step function to
implement our first linear regression model:
class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set
    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self
    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]
    def predict(self, X):
        return self.net_input(X)
If you need a refresher about how the weights are being updated—taking a step in
the opposite direction of the gradient—please revisit the Adaline section in Chapter 2,
Training Machine Learning Algorithms for Classification.
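The code that standardizes the RM and MEDV columns and trains the model is not reproduced on this page; the variables X_std, y_std, and lr, as well as the scalers sc_x and sc_y used further below, are assumed to be prepared along the following lines (a minimal sketch using scikit-learn's StandardScaler):

>>> from sklearn.preprocessing import StandardScaler
>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()  # StandardScaler expects a 2D array
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)
>>> plt.plot(range(1, lr.n_iter + 1), lr.cost_)  # cost per epoch
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()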
As we can see in the following plot, the GD algorithm converged after the fifth epoch:
Next, let's visualize how well the linear regression line fits the training data. To do
so, we will define a simple helper function that will plot a scatterplot of the training
samples and add the regression line:
>>> def lin_regplot(X, y, model):
... plt.scatter(X, y, c='blue')
... plt.plot(X, model.predict(X), color='red')
... return None
Now, we will use this lin_regplot function to plot the number of rooms against
house prices:
>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()
As we can see in the following plot, the linear regression line reflects the general
trend that house prices tend to increase with the number of rooms:
Although this observation makes intuitive sense, the data also tells us that the
number of rooms does not explain the house prices very well in many cases. Later
in this chapter, we will discuss how to quantify the performance of a regression
model. Interestingly, we also observe a curious line y = 3 , which suggests that the
prices may have been clipped. In certain applications, it may also be important to
report the predicted outcome variables on their original scale. To scale the predicted
price outcome back on the Price in $1000's axes, we can simply apply the
inverse_transform method of the StandardScaler:
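The snippet being referred to is not shown on this page; a minimal sketch of the step, assuming the sc_x and sc_y scalers and the lr model from the earlier training sketch, could look like this:

>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))          # standardize the input (5 rooms)
>>> price_std = lr.predict(num_rooms_std)
>>> price = sc_y.inverse_transform(price_std.reshape(-1, 1))   # back to the original $1000's scale
>>> print("Price in $1000's: %.3f" % price[0, 0])
Price in $1000's: 10.840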
In the preceding code example, we used the previously trained linear regression
model to predict the price of a house with five rooms. According to our model,
such a house is worth $10,840.
On a side note, it is also worth mentioning that we technically don't have to update
the weights of the intercept if we are working with standardized variables since the
y axis intercept is always 0 in those cases. We can quickly confirm this by printing
the weights:
>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000
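The scikit-learn based fit that the next paragraph refers to is not reproduced here; a minimal sketch, assuming the unstandardized RM feature matrix X and MEDV target y from above, might look like this:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
>>> print('Intercept: %.3f' % slr.intercept_)
>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()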
Now, when we plot the training data and our fitted model by executing the code
above, we can see that the overall result looks identical to our GD implementation:
As a side note, instead of using an iterative optimization algorithm such as GD, there is also a closed-form solution for the OLS problem, obtained by solving the so-called normal equation:

$$\mathbf{w} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$
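As a sketch of how one might compute this closed-form solution directly with NumPy (adding a column of ones for the intercept term; the variable names here are illustrative):

>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))    # prepend a bias (intercept) column
>>> w = np.linalg.solve(Xb.T.dot(Xb), Xb.T.dot(y))   # solve the normal equation
>>> print('Slope: %.3f' % w[1])
>>> print('Intercept: %.3f' % w[0])

Using np.linalg.solve rather than explicitly inverting $\mathbf{X}^T\mathbf{X}$ is numerically more stable.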
Let's now wrap our linear model in the RANSAC algorithm using scikit-learn's
RANSACRegressor object:
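The instantiation itself is not reproduced on this page; given the parameter values described in the following paragraph, a sketch of the call could look like this (note that the residual_metric parameter has since been deprecated in newer scikit-learn releases, which use an absolute-residual criterion by default):

>>> from sklearn.linear_model import RANSACRegressor, LinearRegression
>>> ransac = RANSACRegressor(LinearRegression(),
...                          max_trials=100,          # maximum number of iterations
...                          min_samples=50,          # minimum number of randomly chosen samples
...                          residual_metric=lambda x: np.sum(np.abs(x), axis=1),
...                          residual_threshold=5.0,  # maximum vertical distance for inliers
...                          random_state=0)
>>> ransac.fit(X, y)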
We set the maximum number of iterations of the RANSACRegressor to 100, and using
min_samples=50, we set the minimum number of the randomly chosen samples to
be at least 50. Using the residual_metric parameter, we provided a callable lambda
function that simply calculates the absolute vertical distances between the fitted line
and the sample points. By setting the residual_threshold parameter to 5.0, we
only allowed samples to be included in the inlier set if their vertical distance to the
fitted line is within 5 distance units, which works well on this particular dataset. By
default, scikit-learn uses the MAD estimate to select the inlier threshold, where MAD
stands for the Median Absolute Deviation of the target values y. However, the choice
of an appropriate value for the inlier threshold is problem-specific, which is one
disadvantage of RANSAC. Many different approaches have been developed in recent years to select a good inlier threshold automatically. You can find a detailed discussion in R. Toldo and A. Fusiello's Automatic Estimation of the Inlier Threshold in Robust Multiple Structures Fitting (in Image Analysis and Processing – ICIAP 2009, pages 123–131, Springer, 2009).
After we have fitted the RANSAC model, let's obtain the inliers and outliers from the
fitted RANSAC linear regression model and plot them together with the linear fit:
>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask],
... c='blue', marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
... c='lightgreen', marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='red')
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()
As we can see in the following scatterplot, the linear regression model was fitted on
the detected set of inliers shown as circles:
When we print the slope and intercept of the model executing the following code,
we can see that the linear regression line is slightly different from the fit that we
obtained in the previous section without RANSAC:
>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
Slope: 9.621
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)
Intercept: -37.137
Using RANSAC, we reduced the potential effect of the outliers in this dataset, but we don't know whether this approach will have a positive effect on the predictive performance on unseen data. Thus, in the next section, we will look at different approaches for evaluating the performance of regression models, which is a crucial part of building systems for predictive modeling.
As we remember from Chapter 6, Learning Best Practices for Model Evaluation and
Hyperparameter Tuning, we want to split our dataset into separate training and
test datasets where we use the former to fit the model and the latter to evaluate its
performance to generalize to unseen data. Instead of proceeding with the simple
regression model, we will now use all variables in the dataset and train a multiple
regression model:
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LinearRegression
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)
Since our model uses multiple explanatory variables, we can't visualize the linear
regression line (or hyperplane to be precise) in a two-dimensional plot, but we
can plot the residuals (the differences or vertical distances between the actual and
predicted values) versus the predicted values to diagnose our regression model.
Those residual plots are a commonly used graphical analysis for diagnosing
regression models to detect nonlinearity and outliers, and to check if the errors
are randomly distributed.
Using the following code, we will now plot a residual plot where we simply subtract
the true target variables from our predicted responses:
>>> plt.scatter(y_train_pred, y_train_pred - y_train,
... c='blue', marker='o', label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
... c='lightgreen', marker='s', label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()
After executing the code, we should see a residual plot with a line passing through
the x axis origin as shown here:
In the case of a perfect prediction, the residuals would be exactly zero, which we will
probably never encounter in realistic and practical applications. However, for a good
regression model, we would expect that the errors are randomly distributed and
the residuals should be randomly scattered around the centerline. If we see patterns
in a residual plot, it means that our model is unable to capture some explanatory
information, which is leaked into the residuals as we can slightly see in our preceding
residual plot. Furthermore, we can also use residual plots to detect outliers, which are
represented by the points with a large deviation from the centerline.
Another useful quantitative measure of a model's performance is the mean squared error (MSE), which is simply the averaged value of the SSE cost function that we minimized to fit the linear regression model:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
We will see that the MSE on the training set is 19.96, and the MSE of the test set is
much larger with a value of 27.20, which is an indicator that our model is overfitting
the training data.
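The computation itself is not shown on this page; it can be reproduced with scikit-learn's mean_squared_error function, for example:

>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))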
Sometimes it may be more useful to report the coefficient of determination ($R^2$), which can be understood as a standardized version of the MSE:

$$R^2 = 1 - \frac{SSE}{SST}$$

Here, SSE is the sum of squared errors and SST is the total sum of squares, $SST = \sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2$; in other words, SST is simply the variance of the response.
Let's quickly show that $R^2$ is indeed just a rescaled version of the MSE:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2} = 1 - \frac{MSE}{Var(y)}$$
For the training dataset, $R^2$ is bounded between 0 and 1, but it can become negative for the test set. If $R^2 = 1$, the model fits the data perfectly with a corresponding $MSE = 0$.

Evaluated on the training data, the $R^2$ of our model is 0.765, which doesn't sound too bad. However, the $R^2$ on the test dataset is only 0.673, which we can compute by executing the following code:
>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' %
... (r2_score(y_train, y_train_pred),
... r2_score(y_test, y_test_pred)))
Ridge regression is an L2 penalized model where we simply add the squared sum of
the weights to our least-squares cost function:
$$J(w)_{Ridge} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda \left\lVert w \right\rVert_2^2$$

Here:

$$L2: \quad \lambda \left\lVert w \right\rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2$$
An alternative approach that can lead to sparse models is the LASSO. Depending
on the regularization strength, certain weights can become zero, which makes the
LASSO also useful as a supervised feature selection technique:
$$J(w)_{LASSO} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda \left\lVert w \right\rVert_1$$

Here:

$$L1: \quad \lambda \left\lVert w \right\rVert_1 = \lambda \sum_{j=1}^{m} \left| w_j \right|$$
A compromise between Ridge regression and the LASSO is the Elastic Net, which combines an L1 penalty to generate sparsity with an L2 penalty:

$$J(w)_{ElasticNet} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda_1 \sum_{j=1}^{m} w_j^2 + \lambda_2 \sum_{j=1}^{m} \left| w_j \right|$$
Those regularized regression models are all available via scikit-learn, and their usage is similar to the regular regression model except that we have to specify the regularization strength via the parameter $\lambda$, for example, optimized via k-fold cross-validation.

Note that the regularization strength is regulated by the parameter alpha, which is similar to the parameter $\lambda$. Likewise, we can initialize a LASSO regressor from the linear_model submodule:
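The instantiation code is not reproduced on this page; a minimal sketch of how these regularized models can be initialized via scikit-learn's linear_model submodule (the alpha and l1_ratio values below are just illustrative choices) could look like this:

>>> from sklearn.linear_model import Ridge, Lasso, ElasticNet
>>> ridge = Ridge(alpha=1.0)     # alpha plays the role of the regularization parameter lambda
>>> lasso = Lasso(alpha=1.0)
>>> elanet = ElasticNet(alpha=1.0, l1_ratio=0.5)  # l1_ratio balances the L1 and L2 penalties
>>> ridge.fit(X_train, y_train)
>>> lasso.fit(X_train, y_train)
>>> elanet.fit(X_train, y_train)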
A linear relationship between explanatory and response variables will not always hold; one way to account for such a violation of the linearity assumption is to use a polynomial regression model that adds polynomial terms:

$$y = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d$$
Here, $d$ denotes the degree of the polynomial. Although we can use polynomial regression to model a nonlinear relationship, it is still considered a multiple linear regression model because of the linear regression coefficients $w$.

We will now discuss how to use the PolynomialFeatures transformer class from scikit-learn to add a quadratic term ($d = 2$) to a simple regression problem with one explanatory variable, and compare the polynomial to the linear fit. The steps are as follows:
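The numbered steps and the toy data of the original example are not reproduced on this page. As an illustration of the mechanics only (using synthetic placeholder data, so the MSE and $R^2$ figures quoted further below refer to the book's own example, not to this sketch), the quadratic expansion could look like this:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> import numpy as np
>>> X = np.arange(1.0, 11.0)[:, np.newaxis]        # placeholder explanatory variable
>>> y = 0.5 * X.flatten()**2 + X.flatten() + 2.0   # placeholder response with a quadratic trend
>>> lr = LinearRegression().fit(X, y)              # plain linear fit
>>> quadratic = PolynomialFeatures(degree=2)       # adds a bias column and an x^2 column
>>> X_quad = quadratic.fit_transform(X)
>>> pr = LinearRegression().fit(X_quad, y)         # fit on the expanded (quadratic) features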
In the resulting plot, we can see that the polynomial fit captures the relationship
between the response and explanatory variable much better than the linear fit:
As we can see after executing the preceding code, the MSE decreased from 570 (linear fit) to 61 (quadratic fit), and the coefficient of determination reflects the closer fit of the quadratic model ($R^2 = 0.982$) compared to the linear fit ($R^2 = 0.832$) in this particular toy problem.
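The next code block models the relationship between house prices (MEDV) and LSTAT with polynomial fits of degree 1, 2, and 3. Its preparation step is not reproduced on this page; a minimal sketch, with the variable names (regr, quadratic, cubic, X_quad, X_cubic) inferred from the code that follows, could be:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import r2_score
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()
# create quadratic and cubic feature transformers
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)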
# linear fit
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))
# quadratic fit
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))
# cubic fit (completing the step elided here; cubic and X_cubic are assumed from the setup sketch above)
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))
# plot results
>>> plt.scatter(X, y,
... label='training points',
... color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear (d=1), $R^2=%.2f$'
... % linear_r2,
... color='blue',
... lw=2,
... linestyle=':')
>>> plt.plot(X_fit, y_quad_fit,
... label='quadratic (d=2), $R^2=%.2f$'
... % quadratic_r2,
... color='red',
... lw=2,
... linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit,
... label='cubic (d=3), $R^2=%.2f$'
... % cubic_r2,
... color='green',
... lw=2,
... linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()
As we can see in the resulting plot, the cubic fit captures the relationship between
the house prices and LSTAT better than the linear and quadratic fit. However, we
should be aware that adding more and more polynomial features increases the
complexity of a model and therefore increases the chance of overfitting. Thus, in
practice, it is always recommended that you evaluate the performance of the model
on a separate test dataset to estimate the generalization performance.
In addition, polynomial features are not always the best choice for modeling nonlinear
relationships. For example, just by looking at the MEDV-LSTAT scatterplot, we could
propose that a log transformation of the LSTAT feature variable and the square root of
MEDV may project the data onto a linear feature space suitable for a linear regression
fit. Let's test this hypothesis by executing the following code:
# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)
# fit features
>>> X_fit = np.arange(X_log.min()-1,
... X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))
# plot results
>>> plt.scatter(X_log, y_sqrt,
... label='training points',
... color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear (d=1), $R^2=%.2f$' % linear_r2,
... color='blue',
... lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')
>>> plt.ylabel('sqrt(Price in $1000\'s [MEDV])')  # assumed label for the square-root-transformed target
>>> plt.legend(loc='lower left')
>>> plt.show()
After transforming the explanatory variable onto the log space and taking the square root of the target variable, we were able to capture the relationship between the two variables with a linear regression line that seems to fit the data better ($R^2 = 0.69$) than any of the previous polynomial feature transformations:
A decision tree is grown by iteratively splitting its nodes so as to maximize the information gain (IG), which is defined as follows for a binary split:

$$IG(D_p, x_i) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$
Here, $x_i$ is the feature on which the split is performed, $N_p$ is the number of samples in the parent node, $I$ is the impurity function, $D_p$ is the subset of training samples in the parent node, and $D_{left}$ and $D_{right}$ are the subsets of training samples in the left and right child nodes after the split. Remember that our goal is to find the feature split that maximizes the information gain; in other words, we want to find the feature split that reduces the impurities in the child nodes the most. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion for classification. To use a decision tree for regression, we replace entropy as the impurity measure of a node $t$ with the MSE:
$$I(t) = MSE(t) = \frac{1}{N_t}\sum_{i \in D_t}\left(y^{(i)} - \hat{y}_t\right)^2$$

Here, $N_t$ is the number of training samples at node $t$, $D_t$ is the training subset at node $t$, $y^{(i)}$ is the true target value, and $\hat{y}_t$ is the predicted target value (the sample mean):

$$\hat{y}_t = \frac{1}{N_t}\sum_{i \in D_t} y^{(i)}$$
In the context of decision tree regression, the MSE is often also referred to as
within-node variance, which is why the splitting criterion is also better known
as variance reduction. To see what the line fit of a decision tree looks like, let's use
the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear
relationship between the MEDV and LSTAT variables:
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
>>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()
As we can see from the resulting plot, the decision tree captures the general
trend in the data. However, a limitation of this model is that it does not capture
the continuity and differentiability of the desired prediction. In addition, we
need to be careful about choosing an appropriate value for the depth of the tree
to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:
In the next section, we will take a look at a more robust way for fitting regression
trees: random forests.
Now, let's use all the features in the Housing Dataset to fit a random forest
regression model on 60 percent of the samples and evaluate its performance
on the remaining 40 percent. The code is as follows:
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
... train_test_split(X, y,
... test_size=0.4,
... random_state=1)
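The fitting and evaluation code that produces the numbers discussed next is not reproduced on this page; a minimal sketch (the ensemble size of 1,000 trees is an assumption) could look like this:

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> forest = RandomForestRegressor(n_estimators=1000,   # assumed number of trees
...                                random_state=1,
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...       r2_score(y_train, y_train_pred),
...       r2_score(y_test, y_test_pred)))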
Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well ($R^2 = 0.871$ on the test dataset).
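The residual plot discussed next is produced with code that is not reproduced here; it follows the same pattern as the residual plot for the linear model earlier in this chapter, for example:

>>> plt.scatter(y_train_pred, y_train_pred - y_train,
...             c='blue', marker='o', label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
...             c='lightgreen', marker='s', label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()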
As the $R^2$ coefficient already summarized, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all of the explanatory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter: