Linear Regression: What Is Regression Analysis?
Linear Regression: What Is Regression Analysis?
Linear Regression: What Is Regression Analysis?
Linear Regression
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the
data points, in such a manner that the differences between the distances of data points from the curve or line
is minimized. I’ll explain this in more details in coming sections.
Let’s say, you want to estimate growth in sales of a company based on current economic conditions. You
have the recent company data which indicates that the growth in sales is around two and a half times the
growth in the economy. Using this insight, we can predict future sales of the company based on current &
past information.
There are multiple benefits of using regression analysis. They are as follows:
It indicates the significant relationships between dependent variable and independent variable.
It indicates the strength of impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such as
the effect of price changes and the number of promotional activities. These benefits help market researchers
/ data analysts / data scientists to eliminate and evaluate the best set of variables to be used for building
predictive models.
Linear Regression
It is one of the most widely known modeling technique. Linear regression is usually among the first few topics
which people pick while learning predictive modeling. In this technique, the dependent variable is continuous,
independent variable(s) can be continuous or discrete, and nature of regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or more independent
variables (X) using a best fit straight line (also known as regression line).
It is represented by an equation Y=a+b*X + e, where a is intercept, b is slope of the line and e is error term.
This equation can be used to predict the value of target variable based on given predictor variable(s).
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [19]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
In [20]:
Out[20]:
TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
Radio: advertising dollars spent on Radio
Newspaper: advertising dollars spent on Newspaper
In [21]:
Out[21]:
(200, 4)
There are 200 observations, and thus 200 markets in the dataset.
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [22]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37b3f59b38>
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
y = β0 + β1 x
y is the response
x is the feature
β 0 is the intercept
Together, β 0 and β 1 are called the model coefficients. To create your model, you must "learn" the values of
these coefficients. And once we've learned these coefficients, we can use the model to predict Sales!
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Take a quick look at the plot we created. Now consider each point, and know that they each have a
coordinate in the form (X,Y). Now draw an imaginary line between each point and our current "best-fit" line.
We'll call the distanace between each point and our current best-fit line, D. To get a quick image of what
we're currently trying to visualize, take a look at the picture below:
Now as before, we're labeling each green line as having a distance D, and each red point as having a
coordinate of (X,Y). Then we can define our best fit line as the line having the property were:
2 2 2 2 2
D + D + D + D +. . . . +D
1 2 3 4 N
So how do we find this line? The least-square line approximating the set of points:
(X, Y ) 1 , (X, Y ) 2 , (X, Y ) 3 , (X, Y ) 4 , (X, Y ) 5 ,
this is basically just a rewritten form of the standard equation for a line:
Y = mx + b
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
We can solve for these constants a0 and a1 by simultaneously solving these equations:
ΣY = a0 N + a1 ΣX
2
ΣXY = a0 ΣX + a1 ΣX
These are called the normal equations for the least squares line. There are further steps that can be taken in
rearranging these equations to solve for y, but we'll let scikit-learn do the rest of the heavy lifting here.
scikit-learn
Introduction
Since its release in 2007, scikit-learn has become one of the most popular open source machine learning
libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification,
regression, dimensionality reduction, and clustering. It also provides modules for extracting features,
processing data, and evaluating models.
Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy
and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional
matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing.
scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile
API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of
the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM
and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also
includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning
data.
Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without
restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-
learn is noted for its reliability; much of the library is covered by automated tests.
In [23]:
# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales
7.03259354913
[ 0.04753664]
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Note that if an increase in TV ad spending was associated with a decrease in sales, β 1 would be negative.
y = 7.032594 + 0.047537 × 50
In [24]:
Out[24]:
9.409444
In [25]:
# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})
X_new.head()
Out[25]:
TV
0 50
In [26]:
Out[26]:
array([ 9.40942557])
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [27]:
Out[27]:
TV
0 0.7
1 296.4
In [28]:
Out[28]:
In [29]:
Out[29]:
[<matplotlib.lines.Line2D at 0x7f37b47945c0>]
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low
variance), but the average of those models won't do a great job capturing the true relationship (high bias).
Note that low variance is a useful characteristic when you don't have a lot of training data!
A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our
model coefficients, which are interpreted as follows: If the population from which this sample was drawn was
sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
In [30]:
Out[30]:
0 1
TV 0.042231 0.052843
Keep in mind that we only have a single sample of data, and not the entire population of data. The "true"
coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the
coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the
coefficient is probably within.
Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals
(which will be more narrow), 99% confidence intervals (which will be wider), or whatever intervals you like.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative
hypothesis may indeed be true, except that you just don't have enough data to show that.)
null hypothesis: There is no relationship between TV ads and Sales (and thus β 1 equals zero)
alternative hypothesis: There is a relationship between TV ads and Sales (and thus β 1 is not
equal to zero)
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95%
confidence interval does not include zero. Conversely, the p-value represents the probability that the
coefficient is actually zero:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [31]:
Out[31]:
Intercept 1.406300e-35
TV 1.467390e-42
dtype: float64
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the
95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than
0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again,
using 0.05 as the cutoff is just a convention.)
In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between
TV ads and Sales.
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the
model. Here's an example of what R-squared "looks like":
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
You can see that the blue line explains some of the variance in the data (R-squared=0.54), the green line
explains more of the variance (R-squared=0.64), and the red line fits the training data even further (R-
squared=0.66). (Does the red line look like it's overfitting?)
Let's calculate the R-squared value for our simple linear model:
In [32]:
Out[32]:
0.61187505085007099
Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely
on the domain. Therefore, it's most useful as a tool for comparing different models.
y = β 0 + β 1 x 1 +. . . +β n x n
Each x represents a different feature, and each feature has its own coefficient. In this case:
y = β 0 + β 1 × T V + β 2 × Radio + β 3 × N ewspaper
In [34]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales
lm = LinearRegression()
lm.fit(X, y)
2.93888936946
[ 0.04576465 0.18853002 -0.00103749]
How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an
increase of $1000 in TV ad spending is associated with an increase in Sales of 45.765 widgets.
A lot of the information we have been reviewing piece-by-piece is available in the model summary output:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [35]:
Out[35]:
Df Model: 3
TV and Radio have significant p-values, whereas Newspaper does not. Thus we reject the null
hypothesis for TV and Radio (that there is no association between those features and Sales), and
fail to reject the null hypothesis for Newspaper.
TV and Radio ad spending are both positively associated with Sales, whereas Newspaper ad
spending is slightly negatively associated with Sales. (However, this is irrelevant since we have
failed to reject the null hypothesis for Newspaper.)
This model has a higher R-squared (0.897) than the previous model, which means that this model
provides a better fit to the data than a model that only includes TV.
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Feature Selection
How do I decide which features to include in a linear model? Here's one idea:
Try different models, and only keep predictors in the model if they have small p-values.
Check whether the R-squared value goes up when you add new predictors.
Linear models rely upon a lot of assumptions (such as the features being independent), and if
those assumptions are violated (which they usually are), R-squared and p-values are less reliable.
Using a p-value cutoff of 0.05 means that if you add 100 predictors to a model that are pure noise,
5 of them (on average) will still be counted as significant.
R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-
squared value will generalize. Below is an example:
In [36]:
Out[36]:
0.89719426108289557
In [37]:
# add Newspaper to the model (which we believe has no association with Sales)
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared
Out[37]:
0.89721063817895219
R-squared will always increase as you add more features to the model, even if they are unrelated to the
response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the
best linear model.
There is alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for
overfitting), but it generally under-penalizes complexity (http://scott.fortmann-
roe.com/docs/MeasuringError.html).
So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of
out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-
sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods
for searching different sets of parameters and different models. Importantly, cross-validation can be applied
to any model, whereas the methods described above only apply to linear models.
Let's create a new feature called Size, and randomly assign observations to be small or large:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [38]:
import numpy as np
Out[38]:
For scikit-learn, we need to represent all data numerically. If the feature only has two categories, we can
simply create a dummy variable that represents the categories as a binary value:
In [39]:
Out[39]:
Let's redo the multiple linear regression and include the IsLarge predictor:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [40]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge']
X = data[feature_cols]
y = data.Sales
# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)
# print coefficients
zip(feature_cols, lm.coef_)
Out[40]:
<zip at 0x7f37b3ef8cc8>
How do we interpret the IsLarge coefficient? For a given amount of TV/Radio/Newspaper ad spending,
being a large market is associated with an average increase in Sales of 57.42 widgets (as compared to a
Small market, which is called the baseline level).
What if we had reversed the 0/1 coding and created the feature 'IsSmall' instead? The coefficient would be
the same, except it would be negative instead of positive. As such, your choice of category for the baseline
does not matter, all that changes is your interpretation of the coefficient.
In [41]:
Out[41]:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban
because that would imply an ordered relationship between suburban and urban (and thus urban is
somehow "twice" the suburban category).
In [42]:
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means row
s, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()
Out[42]:
Why do we only need two dummy variables, not three? Because two dummies captures all of the
information about the Area feature, and implicitly defines rural as the baseline level. (In general, if you have a
categorical feature with k levels, you create k-1 dummy variables.)
If this is confusing, think about why we only needed one dummy variable for Size (IsLarge), not two dummy
variables (IsSmall and IsLarge).
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [44]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban']
X = data[feature_cols]
y = data.Sales
# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)
# print coefficients
print(feature_cols, lm.coef_)
Holding all other variables fixed, being a suburban area is associated with an average decrease in
Sales of 106.56 widgets (as compared to the baseline level, which is rural).
Being an urban area is associated with an average increase in Sales of 268.13 widgets (as
compared to rural).
A final note about dummy encoding: If you have categories that can be ranked (i.e., strongly disagree,
disagree, neutral, agree, strongly agree), you can potentially use a single dummy variable and represent the
categories numerically (such as 1, 2, 3, 4, 5).
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
A note about sample size. In Linear regression the sample size rule of thumb is that the regression analysis
requires at least 20 cases per independent variable in the analysis.
First, linear regression needs the relationship between the independent and dependent variables to be linear.
It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity
assumption can best be tested with scatter plots, the following two examples depict two cases, where no and
little linearity is present.
Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can
best be checked with a histogram or a Q-Q-Plot. Normality can be checked with a goodness of fit test, e.g.,
the Kolmogorov-Smirnov test. When the data is not normally distributed a non-linear transformation (e.g., log-
transformation) might fix this issue.
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs
when the independent variables are too highly correlated with each other. Multicollinearity may be tested with
three central criteria:
1. Correlation matrix – when computing the matrix of Pearson's Bivariate Correlation among all
independent variables the correlation coefficients need to be smaller than 1.
2. Tolerance – the tolerance measures the influence of one independent variable on all other
independent variables; the tolerance is calculated with an initial linear regression analysis.
Tolerance is defined as T = 1 – R² for these first step regression analysis. With T < 0.1 there might
be multicollinearity in the data and with T < 0.01 there certainly is.
3. Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as
VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100
there is certainly multicollinearity among the variables.
If multicollinearity is found in the data, centering the data (that is deducting the mean of the variable from
each score) might help to solve the problem. However, the simplest way to address the problem is to remove
independent variables with high VIF values. Fourth, linear regression analysis requires that there is little or no
autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other.
For instance, this typically occurs in stock prices, where the price is not independent from the previous price.
4) Condition Index – the condition index is calculated using a factor analysis on the independent variables.
Values of 10-30 indicate a mediocre multicollinearity in the linear regression variables, values > 30 indicate
strong multicollinearity.
If multicollinearity is found in the data centering the data, that is deducting the mean score might help to solve
the problem. Other alternatives to tackle the problems is conducting a factor analysis and rotating the factors
to insure independence of the factors in the linear regression analysis. Fourthly, linear regression analysis
requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not
independent from each other. In other words when the value of y(x+1) is not independent from the value of
y(x).
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
The last assumption of the linear regression analysis is homoscedasticity. The scatter plot is good way to
check whether the data are homoscedastic (meaning the residuals are equal across the regression line). The
following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic):
The Goldfeld-Quandt Test can also be used to test for heteroscedasticity. The test splits the data into two
groups and tests to see if the variances of the residuals are similar across the groups. If homoscedasticity is
present, a non-linear correction might fix the problem.
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
5/15/2018 Linear regression
In [ ]:
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false