
Predicting Continuous Target Variables with Regression Analysis
Throughout the previous chapters, you learned a lot about the main concepts
behind supervised learning and trained many different models for classification tasks
to predict group memberships or categorical variables. In this chapter, we will take
a dive into another subcategory of supervised learning: regression analysis.

Regression models are used to predict target variables on a continuous scale,
which makes them attractive for addressing many questions in science as well as
applications in industry, such as understanding relationships between variables,
evaluating trends, or making forecasts. One example would be predicting the sales
of a company in future months.

In this chapter, we will discuss the main concepts of regression models and cover
the following topics:

• Exploring and visualizing datasets
• Looking at different approaches to implement linear regression models
• Training regression models that are robust to outliers
• Evaluating regression models and diagnosing common problems
• Fitting regression models to nonlinear data

Introducing a simple linear regression model
The goal of simple (univariate) linear regression is to model the relationship between
a single feature (explanatory variable x) and a continuous valued response (target
variable y). The equation of a linear model with one explanatory variable is defined
as follows:

y = w_0 + w_1 x

Here, the weight w_0 represents the y-axis intercept and w_1 is the coefficient of
the explanatory variable. Our goal is to learn the weights of the linear equation to
describe the relationship between the explanatory variable and the target variable,
which can then be used to predict the responses of new explanatory variables that
were not part of the training dataset.

Based on the linear equation that we defined previously, linear regression can be
understood as finding the best-fitting straight line through the sample points, as
shown in the following figure:

This best-fitting line is also called the regression line, and the vertical lines from the
regression line to the sample points are the so-called offsets or residuals—the errors
of our prediction.


The special case of one explanatory variable is also called simple linear regression,
but of course we can also generalize the linear regression model to multiple
explanatory variables. Hence, this process is called multiple linear regression:

y = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x

Here, w_0 is the y-axis intercept with x_0 = 1.
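
As a quick numerical illustration of this vector notation, with made-up weights and feature values (x_0 = 1 accounts for the intercept):

>>> import numpy as np
>>> w = np.array([1.0, 2.0, 3.0])     # w_0, w_1, w_2
>>> x = np.array([1.0, 0.5, -1.5])    # x_0 = 1, x_1, x_2
>>> w.T.dot(x)                        # same as w_0*1 + w_1*0.5 + w_2*(-1.5)
-2.5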

Exploring the Housing Dataset


Before we implement our first linear regression model, we will introduce a new
dataset, the Housing Dataset, which contains information about houses in the
suburbs of Boston collected by D. Harrison and D.L. Rubinfeld in 1978. The Housing
Dataset has been made freely available and can be downloaded from the UCI machine
learning repository at https://archive.ics.uci.edu/ml/datasets/Housing.

The features of the 506 samples may be summarized as shown in the excerpt of the
dataset description:

• CRIM: This is the per capita crime rate by town
• ZN: This is the proportion of residential land zoned for lots larger than
25,000 sq.ft.
• INDUS: This is the proportion of non-retail business acres per town
• CHAS: This is the Charles River dummy variable (this is equal to 1 if tract
bounds river; 0 otherwise)
• NOX: This is the nitric oxides concentration (parts per 10 million)
• RM: This is the average number of rooms per dwelling
• AGE: This is the proportion of owner-occupied units built prior to 1940
• DIS: This is the weighted distances to five Boston employment centers
• RAD: This is the index of accessibility to radial highways
• TAX: This is the full-value property-tax rate per $10,000
• PTRATIO: This is the pupil-teacher ratio by town
• B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of
people of African American descent by town
• LSTAT: This is the percentage lower status of the population
• MEDV: This is the median value of owner-occupied homes in $1000s


For the rest of this chapter, we will regard the housing prices (MEDV) as our
target variable—the variable that we want to predict using one or more of the 13
explanatory variables. Before we explore this dataset further, let's fetch it from the
UCI repository into a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-'
...                  'databases/housing/housing.data',
...                  header=None, sep='\s+')
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
...               'NOX', 'RM', 'AGE', 'DIS', 'RAD',
...               'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
>>> df.head()

To confirm that the dataset was loaded successfully, we displayed the first five lines
of the dataset, as shown in the following screenshot:

Visualizing the important characteristics of a dataset
Exploratory Data Analysis (EDA) is an important and recommended first step prior
to the training of a machine learning model. In the rest of this section, we will use
some simple yet useful techniques from the graphical EDA toolbox that may help
us to visually detect the presence of outliers, the distribution of the data, and the
relationships between features.

First, we will create a scatterplot matrix that allows us to visualize the pair-wise
correlations between the different features in this dataset in one place. To plot the
scatterplot matrix, we will use the pairplot function from the seaborn library
(http://stanford.edu/~mwaskom/software/seaborn/), which is a Python library
for drawing statistical plots based on matplotlib:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set(style='whitegrid', context='notebook')


>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> sns.pairplot(df[cols], size=2.5)
>>> plt.show()

As we can see in the following figure, the scatterplot matrix provides us with a
useful graphical summary of the relationships in a dataset:

Importing the seaborn library modifies the default aesthetics of
matplotlib for the current Python session. If you do not want to
use seaborn's style settings, you can reset the matplotlib settings
by executing the following command:
>>> sns.reset_orig()


Due to space constraints and for purposes of readability, we only plotted five
columns from the dataset: LSTAT, INDUS, NOX, RM, and MEDV. However,
you are encouraged to create a scatterplot matrix of the whole DataFrame to
further explore the data.

Using this scatterplot matrix, we can now quickly eyeball how the data is distributed
and whether it contains outliers. For example, we can see that there is a linear
relationship between RM and the housing prices MEDV (the fifth column of the
fourth row). Furthermore, we can see in the histogram (the lower right subplot in
the scatter plot matrix) that the MEDV variable seems to be normally distributed
but contains several outliers.

Note that, in contrast to common belief, training a linear regression model
does not require that the explanatory or target variables are normally
distributed. The normality assumption is only a requirement for certain
statistical tests and hypothesis tests that are beyond the scope of this book
(Montgomery, D. C., Peck, E. A., and Vining, G. G. Introduction to Linear
Regression Analysis. John Wiley and Sons, 2012, pp. 318–319).

To quantify the linear relationship between the features, we will now create a
correlation matrix. A correlation matrix is closely related to the covariance matrix
that we have seen in the section about principal component analysis (PCA) in
Chapter 4, Building Good Training Sets – Data Preprocessing. Intuitively, we can
interpret the correlation matrix as a rescaled version of the covariance matrix.
In fact, the correlation matrix is identical to a covariance matrix computed from
standardized data.

The correlation matrix is a square matrix that contains the Pearson product-moment
correlation coefficients (often abbreviated as Pearson's r), which measure the linear
dependence between pairs of features. The correlation coefficients are bounded
to the range −1 to 1. Two features have a perfect positive correlation if r = 1,
no correlation if r = 0, and a perfect negative correlation if r = −1. As
mentioned previously, Pearson's correlation coefficient can simply be calculated as
the covariance between two features x and y (numerator) divided by the product
of their standard deviations (denominator):

r = \frac{\sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right)}{\sqrt{\sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right)^2} \sqrt{\sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}


Here, \mu denotes the sample mean of the corresponding feature, \sigma_{xy} is the
covariance between the features x and y, and \sigma_x and \sigma_y are the features'
standard deviations, respectively.

We can show that the covariance between standardized features is in
fact equal to their linear correlation coefficient.
Let's first standardize the features x and y to obtain their z-scores,
which we will denote as x' and y', respectively:

x' = \frac{x - \mu_x}{\sigma_x}, \quad y' = \frac{y - \mu_y}{\sigma_y}

Remember that we calculate the (population) covariance between two
features as follows:

\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right)

Since standardization centers a feature variable at mean 0, we can now
calculate the covariance between the scaled features as follows:

\sigma'_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( x'^{(i)} - 0 \right) \left( y'^{(i)} - 0 \right)

Through resubstitution, we get the following result:

\sigma'_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x^{(i)} - \mu_x}{\sigma_x} \right) \left( \frac{y^{(i)} - \mu_y}{\sigma_y} \right) = \frac{1}{n \cdot \sigma_x \sigma_y} \sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right)

We can simplify this as follows:

\sigma'_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
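
A quick numerical sanity check of this identity, using two small made-up feature vectors (any values will do):

>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
>>> x_z = (x - x.mean()) / x.std()    # z-scores (population standard deviation)
>>> y_z = (y - y.mean()) / y.std()
>>> # covariance of the standardized features vs. NumPy's Pearson's r
>>> print(np.mean(x_z * y_z), np.corrcoef(x, y)[0, 1])

Both printed values should be identical, confirming that the covariance of the z-scored features is exactly the correlation coefficient.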


In the following code example, we will use NumPy's corrcoef function on the five
feature columns that we previously visualized in the scatterplot matrix, and we will
use seaborn's heatmap function to plot the correlation matrix array as a heat map:
>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm,
... cbar=True,
... annot=True,
... square=True,
... fmt='.2f',
... annot_kws={'size': 15},
... yticklabels=cols,
... xticklabels=cols)
>>> plt.show()

As we can see in the resulting figure, the correlation matrix provides us with another
useful summary graphic that can help us to select features based on their respective
linear correlations:

To fit a linear regression model, we are interested in those features that have a high
correlation with our target variable MEDV. Looking at the preceding correlation
matrix, we see that our target variable MEDV shows the largest correlation with
the LSTAT variable (-0.74). However, as you might remember from the scatterplot
matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the
other hand, the correlation between RM and MEDV is also relatively high (0.70) and
given the linear relationship between those two variables that we observed in the
scatterplot, RM seems to be a good choice for an explanatory variable to introduce
the concepts of a simple linear regression model in the following section.

Implementing an ordinary least squares linear regression model
At the beginning of this chapter, we discussed that linear regression can be
understood as finding the best-fitting straight line through the sample points of
our training data. However, we have neither defined the term best-fitting nor have
we discussed the different techniques of fitting such a model. In the following
subsections, we will fill in the missing pieces of this puzzle using the Ordinary
Least Squares (OLS) method to estimate the parameters of the regression line that
minimizes the sum of the squared vertical distances (residuals or errors) to the
sample points.

Solving regression for regression parameters with gradient descent
Consider our implementation of the ADAptive LInear NEuron (Adaline) from
Chapter 2, Training Machine Learning Algorithms for Classification; we remember that
the artificial neuron uses a linear activation function and we defined a cost function
J(·), which we minimized to learn the weights via optimization algorithms, such as
Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function
in Adaline is the Sum of Squared Errors (SSE). This is identical to the OLS cost
function that we defined:

J(w) = \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2

Here, \hat{y} is the predicted value \hat{y} = w^T x (note that the term 1/2 is just used for
convenience to derive the update rule of GD). Essentially, OLS linear regression
can be understood as Adaline without the unit step function so that we obtain
continuous target values instead of the class labels -1 and 1. To demonstrate the
similarity, let's take the GD implementation of Adaline from Chapter 2, Training
Machine Learning Algorithms for Classification, and remove the unit step function to
implement our first linear regression model:
class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # gradient descent update: take a step in the opposite
            # direction of the gradient of the SSE cost function
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        # unlike Adaline, we return the continuous net input directly
        return self.net_input(X)

If you need a refresher about how the weights are being updated—taking a step in
the opposite direction of the gradient—please revisit the Adaline section in Chapter 2,
Training Machine Learning Algorithms for Classification.

To see our LinearRegressionGD regressor in action, let's use the RM (number of
rooms) variable from the Housing Dataset as the explanatory variable to train a
model that can predict MEDV (the housing prices). Furthermore, we will standardize
the variables for better convergence of the GD algorithm. The code is as follows:
>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)


We discussed in Chapter 2, Training Machine Learning Algorithms for Classification,
that it is always a good idea to plot the cost as a function of the number of epochs
(passes over the training dataset) when we are using optimization algorithms, such
as gradient descent, to check for convergence. To cut a long story short, let's plot the
cost against the number of epochs to check if the linear regression has converged:
>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()

As we can see in the following plot, the GD algorithm converged after the fifth epoch:

Next, let's visualize how well the linear regression line fits the training data. To do
so, we will define a simple helper function that will plot a scatterplot of the training
samples and add the regression line:
>>> def lin_regplot(X, y, model):
... plt.scatter(X, y, c='blue')
... plt.plot(X, model.predict(X), color='red')
... return None

Now, we will use this lin_regplot function to plot the number of rooms against
house prices:
>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()


As we can see in the following plot, the linear regression line reflects the general
trend that house prices tend to increase with the number of rooms:

Although this observation makes intuitive sense, the data also tells us that the
number of rooms does not explain the house prices very well in many cases. Later
in this chapter, we will discuss how to quantify the performance of a regression
model. Interestingly, we also observe a curious line y = 3 , which suggests that the
prices may have been clipped. In certain applications, it may also be important to
report the predicted outcome variables on their original scale. To scale the predicted
price outcome back on the Price in $1000's axes, we can simply apply the
inverse_transform method of the StandardScaler:

>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000's: %.3f" % \
...       sc_y.inverse_transform(price_std))
Price in $1000's: 10.840

In the preceding code example, we used the previously trained linear regression
model to predict the price of a house with five rooms. According to our model,
such a house is worth $10,840.


On a side note, it is also worth mentioning that we technically don't have to update
the weights of the intercept if we are working with standardized variables since the
y axis intercept is always 0 in those cases. We can quickly confirm this by printing
the weights:
>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000

Estimating the coefficient of a regression model via scikit-learn
In the previous section, we implemented a working model for regression
analysis. However, in a real-world application, we may be interested in more
efficient implementations, for example, scikit-learn's LinearRegression object,
which relies on a highly optimized least-squares solver that also works well
with unstandardized variables. This is sometimes desirable for
certain applications:
>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see by executing the preceding code, scikit-learn's LinearRegression
model fitted with the unstandardized RM and MEDV variables yielded different
model coefficients. Let's compare it to our own GD implementation by plotting
MEDV against RM:
>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()


Now, when we plot the training data and our fitted model by executing the code
above, we can see that the overall result looks identical to our GD implementation:

As an alternative to using machine learning libraries, there is also
a closed-form solution for solving OLS involving a system of linear
equations that can be found in most introductory statistics textbooks:

w = (X^T X)^{-1} X^T y

We can implement it in Python as follows:

# adding a column vector of "ones"
>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))
>>> w = np.zeros(X.shape[1])
>>> z = np.linalg.inv(np.dot(Xb.T, Xb))
>>> w = np.dot(z, np.dot(Xb.T, y))
>>> print('Slope: %.3f' % w[1])
Slope: 9.102
>>> print('Intercept: %.3f' % w[0])
Intercept: -34.671

The advantage of this method is that it is guaranteed to find the optimal
solution analytically. However, if we are working with very large
datasets, it can be computationally too expensive to invert the matrix in
this formula (sometimes also called the normal equation), or the sample
matrix may be singular (non-invertible), which is why we may prefer
iterative methods in certain cases.
If you are interested in more information on how to obtain the normal
equations, I recommend you take a look at Dr. Stephen Pollock's chapter,
The Classical Linear Regression Model, from his lectures at the University
of Leicester, which are available for free at
http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.
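
As a side remark, explicitly inverting X^T X can become numerically unstable if the matrix is ill-conditioned. A minimal sketch of an alternative, assuming the same Xb and y arrays as above, solves the problem via NumPy's SVD-based pseudoinverse instead of an explicit inverse:

>>> # np.linalg.pinv computes the Moore-Penrose pseudoinverse via an SVD
>>> w_pinv = np.dot(np.linalg.pinv(Xb), y)
>>> print('Slope: %.3f' % w_pinv[1])
>>> print('Intercept: %.3f' % w_pinv[0])

The slope and intercept obtained this way should match the values from the normal equation above.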


Fitting a robust regression model using RANSAC
Linear regression models can be heavily impacted by the presence of outliers.
In certain situations, a very small subset of our data can have a big effect on the
estimated model coefficients. There are many statistical tests that can be used to
detect outliers, which are beyond the scope of the book. However, removing
outliers always requires our own judgment as a data scientist, as well as our
domain knowledge.

As an alternative to throwing out outliers, we will look at a robust method of
regression using the RANdom SAmple Consensus (RANSAC) algorithm,
which fits a regression model to a subset of the data, the so-called inliers.

We can summarize the iterative RANSAC algorithm as follows (a simplified
from-scratch sketch of these steps is shown right after the list):

1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points
   that fall within a user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined
   threshold or if a fixed number of iterations has been reached; go back to
   step 1 otherwise.
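
To make these steps concrete, the following is a minimal, simplified from-scratch sketch of the loop in plain NumPy and scikit-learn. The helper name ransac_sketch, the fixed residual tolerance, and the best-consensus stopping rule are illustrative assumptions and do not mirror scikit-learn's internal implementation exactly:

import numpy as np
from sklearn.linear_model import LinearRegression

def ransac_sketch(X, y, min_samples=50, residual_threshold=5.0,
                  max_trials=100, random_state=0):
    # Simplified RANSAC loop; returns the best model and its inlier mask
    rng = np.random.RandomState(random_state)
    best_model, best_inliers, best_n_inliers = None, None, -1
    for _ in range(max_trials):
        # 1. select a random subset of samples and fit the model
        subset = rng.choice(len(y), size=min_samples, replace=False)
        model = LinearRegression().fit(X[subset], y[subset])
        # 2. mark all points within the residual tolerance as inliers
        inliers = np.abs(y - model.predict(X)) < residual_threshold
        # skip degenerate trials with too small a consensus set
        if inliers.sum() < min_samples:
            continue
        # 3. refit the model on all current inliers
        model.fit(X[inliers], y[inliers])
        # 4./5. keep the consensus set with the most inliers; a fixed
        # number of trials serves as the termination criterion here
        if inliers.sum() > best_n_inliers:
            best_model, best_inliers = model, inliers
            best_n_inliers = inliers.sum()
    return best_model, best_inliers

In practice, we would also guard against degenerate random subsets and score the refitted model more carefully; scikit-learn's RANSACRegressor, which we will use next, takes care of these details for us.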

Let's now wrap our linear model in the RANSAC algorithm using scikit-learn's
RANSACRegressor object:

>>> from sklearn.linear_model import RANSACRegressor
>>> ransac = RANSACRegressor(LinearRegression(),
...                          max_trials=100,
...                          min_samples=50,
...                          residual_metric=lambda x: np.sum(np.abs(x), axis=1),
...                          residual_threshold=5.0,
...                          random_state=0)
>>> ransac.fit(X, y)


We set the maximum number of iterations of the RANSACRegressor to 100, and using
min_samples=50, we set the minimum number of the randomly chosen samples to
be at least 50. Using the residual_metric parameter, we provided a callable lambda
function that simply calculates the absolute vertical distances between the fitted line
and the sample points. By setting the residual_threshold parameter to 5.0, we
only allowed samples to be included in the inlier set if their vertical distance to the
fitted line is within 5 distance units, which works well on this particular dataset. By
default, scikit-learn uses the MAD estimate to select the inlier threshold, where MAD
stands for the Median Absolute Deviation of the target values y. However, the choice
of an appropriate value for the inlier threshold is problem-specific, which is one
disadvantage of RANSAC. Many different approaches have been developed over the
recent years to select a good inlier threshold automatically. You can find a detailed
discussion in R. Toldo and A. Fusiello's Automatic Estimation of the Inlier Threshold in
Robust Multiple Structures Fitting (in Image Analysis and Processing – ICIAP 2009,
pages 123–131, Springer, 2009).

After we have fitted the RANSAC model, let's obtain the inliers and outliers from the
fitted RANSAC linear regression model and plot them together with the linear fit:
>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask],
... c='blue', marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
... c='lightgreen', marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='red')
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the following scatterplot, the linear regression model was fitted on
the detected set of inliers shown as circles:

When we print the slope and intercept of the model executing the following code,
we can see that the linear regression line is slightly different from the fit that we
obtained in the previous section without RANSAC:
>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
Slope: 9.621
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)
Intercept: -37.137

Using RANSAC, we reduced the potential effect of the outliers in this dataset,
but we don't know if this approach has a positive effect on the predictive
performance for unseen data. Thus, in the next section we will discuss how to
evaluate a regression model for different approaches, which is a crucial part of
building systems for predictive modeling.


Evaluating the performance of linear regression models
In the previous section, we discussed how to fit a regression model on training data.
However, you learned in previous chapters that it is crucial to test the model on data
that it hasn't seen during training to obtain an unbiased estimate of its performance.

As we remember from Chapter 6, Learning Best Practices for Model Evaluation and
Hyperparameter Tuning, we want to split our dataset into separate training and
test datasets where we use the former to fit the model and the latter to evaluate its
performance to generalize to unseen data. Instead of proceeding with the simple
regression model, we will now use all variables in the dataset and train a multiple
regression model:
>>> from sklearn.cross_validation import train_test_split
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)

Since our model uses multiple explanatory variables, we can't visualize the linear
regression line (or hyperplane to be precise) in a two-dimensional plot, but we
can plot the residuals (the differences or vertical distances between the actual and
predicted values) versus the predicted values to diagnose our regression model.
Those residual plots are a commonly used graphical analysis for diagnosing
regression models to detect nonlinearity and outliers, and to check if the errors
are randomly distributed.


Using the following code, we will now plot a residual plot where we simply subtract
the true target variables from our predicted responses:
>>> plt.scatter(y_train_pred, y_train_pred - y_train,
... c='blue', marker='o', label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
... c='lightgreen', marker='s', label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

After executing the code, we should see a residual plot with a line passing through
the x axis origin as shown here:

In the case of a perfect prediction, the residuals would be exactly zero, which we will
probably never encounter in realistic and practical applications. However, for a good
regression model, we would expect that the errors are randomly distributed and
the residuals should be randomly scattered around the centerline. If we see patterns
in a residual plot, it means that our model is unable to capture some explanatory
information, which is leaked into the residuals as we can slightly see in our preceding
residual plot. Furthermore, we can also use residual plots to detect outliers, which are
represented by the points with a large deviation from the centerline.


Another useful quantitative measure of a model's performance is the so-called
Mean Squared Error (MSE), which is simply the average value of the SSE cost
function that we minimize to fit the linear regression model. The MSE is useful
for comparing different regression models or for tuning their parameters via
a grid search and cross-validation:

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2

Execute the following code:

>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))

We will see that the MSE on the training set is 19.96, and the MSE of the test set is
much larger with a value of 27.20, which is an indicator that our model is overfitting
the training data.
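
The MSE is also handy for comparing models via k-fold cross-validation, as mentioned earlier. Here is a minimal sketch, assuming the slr model and the old-style sklearn.cross_validation module used in this chapter (newer scikit-learn versions import cross_val_score from sklearn.model_selection and use the scoring string 'neg_mean_squared_error'):

>>> from sklearn.cross_validation import cross_val_score
>>> mse_scores = cross_val_score(slr, X_train, y_train,
...                              scoring='mean_squared_error', cv=5)
>>> # the scorer follows the greater-is-better convention and returns
>>> # negated MSE values, so we flip the sign before reporting
>>> print('CV MSE: %.3f +/- %.3f' % (-mse_scores.mean(), mse_scores.std()))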

Sometimes it may be more useful to report the coefficient of determination (R^2), which
can be understood as a standardized version of the MSE, for better interpretability of
the model performance. In other words, R^2 is the fraction of response variance that is
captured by the model. The R^2 value is defined as follows:

R^2 = 1 - \frac{SSE}{SST}

Here, SSE is the sum of squared errors and SST is the total sum of squares,
SST = \sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2; in other words, it is simply the variance of the response.

Let's quickly show that R^2 is indeed just a rescaled version of the MSE:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2}{\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2} = 1 - \frac{MSE}{Var(y)}

For the training dataset, R^2 is bounded between 0 and 1, but it can become
negative for the test set. If R^2 = 1, the model fits the data perfectly with a
corresponding MSE = 0.
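
To make this identity concrete, here is a minimal sketch that computes R^2 directly from SSE and SST on the training data of the multiple regression model (assuming the y_train and y_train_pred arrays from the previous code); it should agree with scikit-learn's r2_score used below:

>>> sse = np.sum((y_train - y_train_pred)**2)
>>> sst = np.sum((y_train - np.mean(y_train))**2)
>>> # the 1/n factors cancel, so this also equals 1 - MSE / Var(y)
>>> print('R^2 train (manual): %.3f' % (1 - sse / sst))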

Evaluated on the training data, the R^2 of our model is 0.765, which doesn't sound
too bad. However, the R^2 on the test dataset is only 0.673, which we can compute
by executing the following code:
>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' %
... (r2_score(y_train, y_train_pred),
... r2_score(y_test, y_test_pred)))

Using regularized methods for regression


As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using
Scikit-learn, regularization is one approach to tackle the problem of overfitting by
adding additional information, and thereby shrinking the parameter values of the
model to induce a penalty against complexity. The most popular approaches to
regularized linear regression are the so-called Ridge Regression, Least Absolute
Shrinkage and Selection Operator (LASSO) and Elastic Net method.

Ridge regression is an L2 penalized model where we simply add the squared sum of
the weights to our least-squares cost function:

J(w)_{Ridge} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \left\| w \right\|_2^2

Here:

L2: \quad \lambda \left\| w \right\|_2^2 = \lambda \sum_{j=1}^{m} w_j^2

By increasing the value of the hyperparameter \lambda, we increase the regularization
strength and shrink the weights of our model. Please note that we don't regularize
the intercept term w_0.


An alternative approach that can lead to sparse models is the LASSO. Depending
on the regularization strength, certain weights can become zero, which makes the
LASSO also useful as a supervised feature selection technique:

J(w)_{LASSO} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \left\| w \right\|_1

Here:

L1: \quad \lambda \left\| w \right\|_1 = \lambda \sum_{j=1}^{m} \left| w_j \right|

However, a limitation of the LASSO is that it selects at most n variables if m > n. A
compromise between Ridge regression and the LASSO is the Elastic Net, which has
an L1 penalty to generate sparsity and an L2 penalty to overcome some of the
limitations of the LASSO, such as the number of selected variables:

J(w)_{ElasticNet} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda_1 \sum_{j=1}^{m} w_j^2 + \lambda_2 \sum_{j=1}^{m} \left| w_j \right|

Those regularized regression models are all available via scikit-learn, and the
usage is similar to the regular regression model except that we have to specify the
regularization strength via the parameter λ , for example, optimized via k-fold
cross-validation.

A Ridge Regression model can be initialized as follows:

>>> from sklearn.linear_model import Ridge
>>> ridge = Ridge(alpha=1.0)

Note that the regularization strength is regulated by the parameter alpha, which is
similar to the parameter λ . Likewise, we can initialize a LASSO regressor from the
linear_model submodule:

>>> from sklearn.linear_model import Lasso
>>> lasso = Lasso(alpha=1.0)

Lastly, the ElasticNet implementation allows us to vary the L1 to L2 ratio:

>>> from sklearn.linear_model import ElasticNet
>>> elanet = ElasticNet(alpha=1.0, l1_ratio=0.5)


For example, if we set l1_ratio to 1.0, the ElasticNet regressor would be
equal to LASSO regression. For more detailed information about the different
implementations of linear regression, please see the documentation at
http://scikit-learn.org/stable/modules/linear_model.html.
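
As a sketch of how the regularization strength can be tuned via k-fold cross-validation, as mentioned above (the alpha grid is an arbitrary illustration; the old-style sklearn.grid_search module matches the scikit-learn version used in this chapter, while newer versions provide GridSearchCV in sklearn.model_selection):

>>> from sklearn.grid_search import GridSearchCV
>>> param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
>>> # for regressors, the default scoring is the R^2 on the held-out folds
>>> gs = GridSearchCV(Ridge(), param_grid, cv=5)
>>> gs.fit(X_train, y_train)
>>> print(gs.best_params_)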

Turning a linear regression model into a curve – polynomial regression
In the previous sections, we assumed a linear relationship between explanatory and
response variables. One way to account for a violation of the linearity assumption is
to use a polynomial regression model by adding polynomial terms:

y = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d

Here, d denotes the degree of the polynomial. Although we can use polynomial
regression to model a nonlinear relationship, it is still considered a multiple
linear regression model because it is linear in the regression coefficients w.

We will now discuss how to use the PolynomialFeatures transformer class from
scikit-learn to add a quadratic term ( d = 2 ) to a simple regression problem with
one explanatory variable, and compare the polynomial to the linear fit. The steps
are as follows:

1. Add a second degree polynomial term:
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.array([258.0, 270.0, 294.0,
...               320.0, 342.0, 368.0,
...               396.0, 446.0, 480.0,
...               586.0])[:, np.newaxis]
>>> y = np.array([236.4, 234.4, 252.8,
...               298.6, 314.2, 342.2,
...               360.8, 368.0, 391.2,
...               390.8])
>>> lr = LinearRegression()
>>> pr = LinearRegression()
>>> quadratic = PolynomialFeatures(degree=2)
>>> X_quad = quadratic.fit_transform(X)


2. Fit a simple linear regression model for comparison:
>>> lr.fit(X, y)
>>> X_fit = np.arange(250, 600, 10)[:, np.newaxis]
>>> y_lin_fit = lr.predict(X_fit)

3. Fit a multiple regression model on the transformed features for
   polynomial regression:
>>> pr.fit(X_quad, y)
>>> y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))

Plot the results:
>>> plt.scatter(X, y, label='training points')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear fit', linestyle='--')
>>> plt.plot(X_fit, y_quad_fit,
...          label='quadratic fit')
>>> plt.legend(loc='upper left')
>>> plt.show()

In the resulting plot, we can see that the polynomial fit captures the relationship
between the response and explanatory variable much better than the linear fit:

>>> y_lin_pred = lr.predict(X)
>>> y_quad_pred = pr.predict(X_quad)
>>> print('Training MSE linear: %.3f, quadratic: %.3f' % (
...       mean_squared_error(y, y_lin_pred),
...       mean_squared_error(y, y_quad_pred)))
Training MSE linear: 569.780, quadratic: 61.330
>>> print('Training R^2 linear: %.3f, quadratic: %.3f' % (
...       r2_score(y, y_lin_pred),
...       r2_score(y, y_quad_pred)))
Training R^2 linear: 0.832, quadratic: 0.982

As we can see after executing the preceding code, the MSE decreased from 570
(linear fit) to 61 (quadratic fit), and the coefficient of determination reflects a closer
fit of the quadratic model (R^2 = 0.982) as opposed to the linear fit (R^2 = 0.832) in
this particular toy problem.

Modeling nonlinear relationships in the Housing Dataset
After we discussed how to construct polynomial features to fit nonlinear relationships
in a toy problem, let's now take a look at a more concrete example and apply those
concepts to the data in the Housing Dataset. By executing the following code, we will
model the relationship between house prices and LSTAT (percent lower status of the
population) using second degree (quadratic) and third degree (cubic) polynomials
and compare it to a linear fit.

The code is as follows:

>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()

# create polynomial features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)

# linear fit
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))

# quadratic fit
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))

# cubic fit
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))

# plot results
>>> plt.scatter(X, y,
... label='training points',
... color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear (d=1), $R^2=%.2f$'
... % linear_r2,
... color='blue',
... lw=2,
... linestyle=':')
>>> plt.plot(X_fit, y_quad_fit,
... label='quadratic (d=2), $R^2=%.2f$'
... % quadratic_r2,
... color='red',
... lw=2,
... linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit,
... label='cubic (d=3), $R^2=%.2f$'
... % cubic_r2,
... color='green',
... lw=2,
... linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()

As we can see in the resulting plot, the cubic fit captures the relationship between
the house prices and LSTAT better than the linear and quadratic fit. However, we
should be aware that adding more and more polynomial features increases the
complexity of a model and therefore increases the chance of overfitting. Thus, in
practice, it is always recommended that you evaluate the performance of the model
on a separate test dataset to estimate the generalization performance:


In addition, polynomial features are not always the best choice for modeling nonlinear
relationships. For example, just by looking at the MEDV-LSTAT scatterplot, we could
propose that a log transformation of the LSTAT feature variable and the square root of
MEDV may project the data onto a linear feature space suitable for a linear regression
fit. Let's test this hypothesis by executing the following code:
# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)

# fit features
>>> X_fit = np.arange(X_log.min()-1,
... X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))

# plot results
>>> plt.scatter(X_log, y_sqrt,
... label='training points',
... color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear (d=1), $R^2=%.2f$' % linear_r2,
... color='blue',
... lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')


>>> plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
>>> plt.legend(loc='lower left')
>>> plt.show()

After transforming the explanatory variable onto the log space and taking the square
root of the target variable, we were able to capture the relationship between the two
variables with a linear regression line that seems to fit the data better (R^2 = 0.69)
than any of the previous polynomial feature transformations:

Dealing with nonlinear relationships using random forests
In this section, we are going to take a look at random forest regression, which is
conceptually different from the previous regression models in this chapter. A random
forest, which is an ensemble of multiple decision trees, can be understood as the sum
of piecewise linear functions in contrast to the global linear and polynomial regression
models that we discussed previously. In other words, via the decision tree algorithm,
we are subdividing the input space into smaller regions that become more manageable.


Decision tree regression


An advantage of the decision tree algorithm is that it does not require any
transformation of the features if we are dealing with nonlinear data. We remember
from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we grow
a decision tree by iteratively splitting its nodes until the leaves are pure or a stopping
criterion is satisfied. When we used decision trees for classification, we defined
entropy as a measure of impurity to determine which feature split maximizes the
Information Gain (IG), which can be defined as follows for a binary split:

IG(D_p, x_i) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})

Here, x_i is the feature to perform the split, N_p is the number of samples in the
parent node, I is the impurity function, D_p is the subset of training samples in the
parent node, and D_{left} and D_{right} are the subsets of training samples in the left and
right child nodes after the split. Remember that our goal is to find the feature split
that maximizes the information gain, or in other words, we want to find the feature
split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine
Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity,
which is a useful criterion for classification. To use a decision tree for regression,
we will replace entropy as the impurity measure of a node t by the MSE:

I(t) = MSE(t) = \frac{1}{N_t} \sum_{i \in D_t} \left( y^{(i)} - \hat{y}_t \right)^2

Here, N_t is the number of training samples at node t, D_t is the training subset
at node t, y^{(i)} is the true target value, and \hat{y}_t is the predicted target value
(the sample mean):

\hat{y}_t = \frac{1}{N_t} \sum_{i \in D_t} y^{(i)}
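
To make this criterion concrete, here is a minimal sketch, using a made-up one-dimensional toy sample, that evaluates the MSE impurity of a parent node and the weighted impurity decrease for one candidate split threshold (the quantity referred to as variance reduction in the next paragraph); the toy arrays and the threshold are purely illustrative:

>>> def mse_impurity(y_node):
...     # MSE impurity of a node: mean squared deviation from the node's mean
...     return np.mean((y_node - y_node.mean())**2)
>>> x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
>>> y_toy = np.array([1.2, 1.0, 1.5, 3.8, 4.1, 4.3])
>>> left, right = y_toy[x_toy <= 3.5], y_toy[x_toy > 3.5]   # split at x <= 3.5
>>> ig = (mse_impurity(y_toy)
...       - len(left) / float(len(y_toy)) * mse_impurity(left)
...       - len(right) / float(len(y_toy)) * mse_impurity(right))
>>> print('Variance reduction: %.3f' % ig)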


In the context of decision tree regression, the MSE is often also referred to as
within-node variance, which is why the splitting criterion is also better known
as variance reduction. To see what the line fit of a decision tree looks like, let's use
the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear
relationship between the MEDV and LSTAT variables:
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
>>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()

As we can see from the resulting plot, the decision tree captures the general
trend in the data. However, a limitation of this model is that it does not capture
the continuity and differentiability of the desired prediction. In addition, we
need to be careful about choosing an appropriate value for the depth of the tree
to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:


In the next section, we will take a look at a more robust way for fitting regression
trees: random forests.

Random forest regression


As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using
Scikit-learn, the random forest algorithm is an ensemble technique that combines
multiple decision trees. A random forest usually has a better generalization
performance than an individual decision tree due to randomness that helps to
decrease the model variance. Other advantages of random forests are that they are
less sensitive to outliers in the dataset and don't require much parameter tuning.
The only parameter in random forests that we typically need to experiment with
is the number of trees in the ensemble. The basic random forests algorithm for
regression is almost identical to the random forest algorithm for classification that
we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.
The only difference is that we use the MSE criterion to grow the individual decision
trees, and the predicted target variable is calculated as the average prediction over all
decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest
regression model on 60 percent of the samples and evaluate its performance
on the remaining 40 percent. The code is as follows:
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
... train_test_split(X, y,
... test_size=0.4,
... random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(n_estimators=1000,
...                                criterion='mse',
...                                random_state=1,
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)


>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
... r2_score(y_train, y_train_pred),
... r2_score(y_test, y_test_pred)))
MSE train: 1.642, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data.
However, it's still able to explain the relationship between the target and
explanatory variables relatively well (R^2 = 0.871 on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,
... y_train_pred - y_train,
... c='black',
... marker='o',
... s=35,
... alpha=0.5,
... label='Training data')
>>> plt.scatter(y_test_pred,
... y_test_pred - y_test,
... c='lightgreen',
... marker='s',
... s=35,
... alpha=0.7,
... label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()


As already summarized by the R^2 coefficient, we can see that the model
fits the training data better than the test data, as indicated by the outliers in the
y-axis direction. Also, the distribution of the residuals does not seem to be completely
random around the zero center point, indicating that the model is not able to
capture all the explanatory information. However, the residual plot indicates a
large improvement over the residual plot of the linear model that we plotted
earlier in this chapter:

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn,
we also discussed the kernel trick that can be used in combination
with a support vector machine (SVM) for classification, which is useful
if we are dealing with nonlinear problems. Although a discussion is
beyond the scope of this book, SVMs can also be used in nonlinear
regression tasks. The interested reader can find more information
about support vector machines for regression in an excellent report
by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification
and Regression (ISIS technical report, 14, 1998). An SVM regressor is
also implemented in scikit-learn, and more information about its usage
can be found at http://scikit-learn.org/stable/modules/
generated/sklearn.svm.SVR.html#sklearn.svm.SVR.
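
As a quick pointer, here is a minimal sketch of fitting scikit-learn's SVR on the same training and test split used for the random forest above (the kernel choice and the C and gamma values are illustrative, untuned assumptions):

>>> from sklearn.svm import SVR
>>> svr = SVR(kernel='rbf', C=100.0, gamma=0.1)
>>> svr.fit(X_train, y_train)
>>> # the score method of a regressor reports R^2 on the given data
>>> print('R^2 test: %.3f' % svr.score(X_test, y_test))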
