
Linear Regression

What is Regression Analysis?


Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. The technique is used for forecasting, time series modelling, and estimating causal relationships between variables. For example, the relationship between rash driving and the number of road accidents caused by a driver is best studied through regression.

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve or line to the data points in such a way that the overall distance of the data points from the curve or line is minimized. I'll explain this in more detail in the coming sections.

Why do we use Regression Analysis?


As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:

Let's say you want to estimate the growth in sales of a company based on current economic conditions. You have recent company data which indicates that the growth in sales is around two and a half times the growth in the economy. Using this insight, we can predict the company's future sales based on current and past information.

There are multiple benefits of using regression analysis. They are as follows:

It indicates significant relationships between the dependent variable and the independent variables.
It indicates the strength of the impact of multiple independent variables on the dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers, data analysts, and data scientists evaluate and select the best set of variables for building predictive models.

Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics people pick up while learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).

It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s).
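
As a tiny worked illustration (the values of a and b below are hypothetical, not estimated from any data), a fitted line is used for prediction by plugging a value of X into the equation:

In [ ]:

# hypothetical intercept and slope, chosen only to illustrate Y = a + b*X
a = 7.0    # intercept: the predicted Y when X = 0
b = 0.05   # slope: the change in Y for a one-unit increase in X

X = 50
Y_pred = a + b * X   # the error term e is unknown for new data, so the prediction omits it
print(Y_pred)        # 9.5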


In [19]:

# imports
import pandas as pd
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook


%matplotlib inline

Example: Advertising Data


Let's take a look at some data, ask some questions about that data, and then use linear regression to answer
those questions!

In [20]:

# read data into a DataFrame


data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

Out[20]:

TV Radio Newspaper Sales

1 230.1 37.8 69.2 22.1

2 44.5 39.3 45.1 10.4

3 17.2 45.9 69.3 9.3

4 151.5 41.3 58.5 18.5

5 180.8 10.8 58.4 12.9

What are the features?

TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
Radio: advertising dollars spent on Radio
Newspaper: advertising dollars spent on Newspaper

What is the response?

Sales: sales of a single product in a given market (in thousands of widgets)

In [21]:

# print the shape of the DataFrame


data.shape

Out[21]:

(200, 4)

There are 200 observations, and thus 200 markets in the dataset.


In [22]:

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])

Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37b3f59b38>

Questions About the Advertising Data


Let's pretend you work for the company that manufactures and markets this widget. The company might ask
you the following: On the basis of this data, how should we spend our advertising money in the future?

This general question might lead you to more specific questions:

1. Is there a relationship between ads and sales?
2. How strong is that relationship?
3. Which ad types contribute to sales?
4. What is the effect of each ad type on sales?
5. Given ad spending in a particular market, can sales be predicted?

We will explore these questions below!


Simple Linear Regression


Simple linear regression is an approach for predicting a quantitative response using a single feature (or
"predictor" or "input variable"). It takes the following form:

y = β0 + β1x

What does each term represent?

y is the response
x is the feature
β0 is the intercept
β1 is the coefficient for x

Together, β0 and β1 are called the model coefficients. To create your model, you must "learn" the values of these coefficients. Once we've learned these coefficients, we can use the model to predict Sales!

Estimating ("Learning") Model Coefficients


Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors"):

What elements are present in the diagram?

The black dots are the observed values of x and y.


The blue line is our least squares line.
The red lines are the residuals, which are the distances between the observed values and the least
squares line.

How do the model coefficients relate to the least squares line?

β0 is the intercept (the value of y when x = 0)
β1 is the slope (the change in y divided by the change in x)

Here is a graphical depiction of those calculations:


The mathematics behind the Least Squares Method.

Take a quick look at the plot we created. Now consider each point, and note that each one has a coordinate of the form (X, Y). Now draw an imaginary vertical line between each point and our current "best-fit" line. We'll call the distance between each point and our current best-fit line D. To get a quick image of what we're trying to visualize, take a look at the picture below:

Now as before, we're labeling each green line as having a distance D, and each red point as having a coordinate of (X, Y). Then we can define our best-fit line as the line having the property that

D1² + D2² + D3² + D4² + ... + DN²

is as small as possible.

So how do we find this line? The least-squares line approximating the set of points

(X1, Y1), (X2, Y2), (X3, Y3), ..., (XN, YN)

has the equation

Y = a0 + a1X

which is basically just a rewritten form of the standard equation for a line:

Y = mx + b


We can solve for these constants a0 and a1 by simultaneously solving these equations:
ΣY = a0N + a1ΣX
ΣXY = a0ΣX + a1ΣX²

These are called the normal equations for the least squares line. There are further steps that can be taken in rearranging these equations to solve for a0 and a1, but we'll let scikit-learn do the rest of the heavy lifting here.
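
As a brief aside (not part of the original notebook), the normal equations can also be solved directly with NumPy. The sketch below assumes the advertising DataFrame data loaded earlier and should reproduce, up to rounding, the intercept and slope that scikit-learn finds below:

In [ ]:

import numpy as np

# build the 2x2 linear system from the normal equations for the TV -> Sales line:
#   ΣY  = a0*N  + a1*ΣX
#   ΣXY = a0*ΣX + a1*ΣX²
X = data.TV.values
Y = data.Sales.values
N = len(data)

A = np.array([[N, X.sum()],
              [X.sum(), (X ** 2).sum()]])
b = np.array([Y.sum(), (X * Y).sum()])

a0, a1 = np.linalg.solve(A, b)
print(a0, a1)   # approximately 7.03 and 0.0475, matching the scikit-learn fit below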

scikit-learn

Introduction
Since its release in 2007, scikit-learn has become one of the most popular open source machine learning
libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification,
regression, dimensionality reduction, and clustering. It also provides modules for extracting features,
processing data, and evaluating models.

Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy
and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional
matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing.

scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile
API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of
the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM
and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also
includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning
data.

Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without
restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-
learn is noted for its reliability; much of the library is covered by automated tests.

In [23]:

# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales

# follow the usual sklearn pattern: import, instantiate, fit


from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients


print(lm.intercept_)
print(lm.coef_)

7.03259354913
[ 0.04753664]


Interpreting Model Coefficients


How do we interpret the TV coefficient (β1)?

A "unit" increase in TV ad spending is associated with a 0.047537 "unit" increase in Sales.


Or more clearly: An additional $1,000 spent on TV ads is associated with an increase in sales of
47.537 widgets.

Note that if an increase in TV ad spending were associated with a decrease in sales, β1 would be negative.

Using the Model for Prediction


Let's say that there was a new market where the TV advertising spend was $50,000. What would we predict
for the Sales in that market?
y = β0 + β1 x

y = 7.032594 + 0.047537 × 50

In [24]:

# manually calculate the prediction


7.032594 + 0.047537*50

Out[24]:

9.409444

Thus, we would predict Sales of 9,409 widgets in that market.

In [25]:

# create a DataFrame, since the model expects a 2-D input with the same column used for training
X_new = pd.DataFrame({'TV': [50]})
X_new.head()

Out[25]:

TV

0 50

In [26]:

# use the model to make predictions on a new value


lm.predict(X_new)

Out[26]:

array([ 9.40942557])

Plotting the Least Squares Line


Let's make predictions for the smallest and largest observed values of x, and then use the predicted
values to plot the least squares line:


In [27]:

# create a DataFrame with the minimum and maximum values of TV


X_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
X_new.head()

Out[27]:

TV

0 0.7

1 296.4

In [28]:

# make predictions for those x values and store them


preds = lm.predict(X_new)
preds

Out[28]:

array([ 7.0658692 , 21.12245377])

In [29]:

# first, plot the observed data


data.plot(kind='scatter', x='TV', y='Sales')

# then, plot the least squares line


plt.plot(X_new, preds, c='red', linewidth=2)

Out[29]:

[<matplotlib.lines.Line2D at 0x7f37b47945c0>]


Confidence in our Model


Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?

Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low
variance), but the average of those models won't do a great job capturing the true relationship (high bias).
Note that low variance is a useful characteristic when you don't have a lot of training data!

A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our
model coefficients, which are interpreted as follows: If the population from which this sample was drawn was
sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.

In [30]:

import statsmodels.formula.api as smf


lm = smf.ols(formula='Sales ~ TV', data=data).fit()
lm.conf_int()

Out[30]:

0 1

Intercept 6.129719 7.935468

TV 0.042231 0.052843

Keep in mind that we only have a single sample of data, and not the entire population of data. The "true"
coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the
coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the
coefficient is probably within.

Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals
(which will be more narrow), 99% confidence intervals (which will be wider), or whatever intervals you like.
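
As a quick sketch of that point (assuming the statsmodels fit lm from above is still in scope), conf_int accepts an alpha argument that controls the confidence level:

In [ ]:

# 90% intervals (alpha=0.10) are narrower; 99% intervals (alpha=0.01) are wider
print(lm.conf_int(alpha=0.10))
print(lm.conf_int(alpha=0.01))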

Hypothesis Testing and p-values


Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null
hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data
supports rejecting the null hypothesis or failing to reject the null hypothesis.

(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative
hypothesis may indeed be true, except that you just don't have enough data to show that.)

As it relates to model coefficients, here is the conventional hypothesis test:

null hypothesis: There is no relationship between TV ads and Sales (and thus β1 equals zero)
alternative hypothesis: There is a relationship between TV ads and Sales (and thus β1 is not equal to zero)

How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Conversely, the p-value tells us how likely it is that we would see an association this strong if the true coefficient were actually zero:


In [31]:

# print the p-values for the model coefficients


lm.pvalues

Out[31]:

Intercept 1.406300e-35
TV 1.467390e-42
dtype: float64

If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the
95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than
0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again,
using 0.05 as the cutoff is just a convention.)

In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between
TV ads and Sales.

Note that we generally ignore the p-value for the intercept.

How Well Does the Model Fit the Data?


The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is
the proportion of variance explained, meaning the proportion of variance in the observed data that is
explained by the model, or the reduction in error over the null model. (The null model just predicts the mean
of the observed response, and thus it has an intercept and no slope.)
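
As a quick sketch of that idea (assuming the Sales ~ TV statsmodels fit lm from above is still in scope), R-squared can be computed by hand as the reduction in squared error relative to the null model:

In [ ]:

import numpy as np

# squared error of our model (the sum of squared residuals)
ss_res = np.sum(lm.resid ** 2)

# squared error of the null model, which always predicts the mean of Sales
ss_tot = np.sum((data.Sales - data.Sales.mean()) ** 2)

# proportion of variance explained; this should match lm.rsquared (≈0.612) shown below
print(1 - ss_res / ss_tot)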

R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the
model. Here's an example of what R-squared "looks like":


You can see that the blue line explains some of the variance in the data (R-squared=0.54), the green line explains more of the variance (R-squared=0.64), and the red line fits the training data even more closely (R-squared=0.66). (Does the red line look like it's overfitting?)

Let's calculate the R-squared value for our simple linear model:

In [32]:

# print the R-squared value for the model


lm.rsquared

Out[32]:
0.61187505085007099

Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely
on the domain. Therefore, it's most useful as a tool for comparing different models.

Multiple Linear Regression


Simple linear regression can easily be extended to include multiple features. This is called multiple linear
regression:

y = β0 + β1x1 + ... + βnxn

Each x represents a different feature, and each feature has its own coefficient. In this case:

y = β0 + β1 × TV + β2 × Radio + β3 × Newspaper

Let's use scikit-learn to estimate these coefficients:

In [34]:

# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales

lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients


print(lm.intercept_)
print(lm.coef_)

2.93888936946
[ 0.04576465 0.18853002 -0.00103749]

How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an
increase of $1000 in TV ad spending is associated with an increase in Sales of 45.765 widgets.

A lot of the information we have been reviewing piece-by-piece is available in the model summary output:


In [35]:

lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()


lm.conf_int()
lm.summary()

Out[35]:

OLS Regression Results

Dep. Variable: Sales R-squared: 0.897

Model: OLS Adj. R-squared: 0.896

Method: Least Squares F-statistic: 570.3

Date: Thu, 07 Sep 2017 Prob (F-statistic): 1.58e-96

Time: 09:23:48 Log-Likelihood: -386.18

No. Observations: 200 AIC: 780.4

Df Residuals: 196 BIC: 793.6

Df Model: 3

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 2.9389 0.312 9.422 0.000 2.324 3.554

TV 0.0458 0.001 32.809 0.000 0.043 0.049

Radio 0.1885 0.009 21.893 0.000 0.172 0.206

Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011

Omnibus: 60.414 Durbin-Watson: 2.084

Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241

Skew: -1.327 Prob(JB): 1.44e-33

Kurtosis: 6.332 Cond. No. 454.

What are a few key things we learn from this output?

TV and Radio have significant p-values, whereas Newspaper does not. Thus we reject the null
hypothesis for TV and Radio (that there is no association between those features and Sales), and
fail to reject the null hypothesis for Newspaper.
TV and Radio ad spending are both positively associated with Sales, whereas Newspaper ad
spending is slightly negatively associated with Sales. (However, this is irrelevant since we have
failed to reject the null hypothesis for Newspaper.)
This model has a higher R-squared (0.897) than the previous model, which means that this model
provides a better fit to the data than a model that only includes TV.


Feature Selection
How do I decide which features to include in a linear model? Here's one idea:

Try different models, and only keep predictors in the model if they have small p-values.
Check whether the R-squared value goes up when you add new predictors.

What are the drawbacks to this approach?

Linear models rely upon a lot of assumptions (such as the features being independent), and if
those assumptions are violated (which they usually are), R-squared and p-values are less reliable.
Using a p-value cutoff of 0.05 means that if you add 100 predictors to a model that are pure noise,
5 of them (on average) will still be counted as significant.
R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-
squared value will generalize. Below is an example:

In [36]:

# only include TV and Radio in the model


lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared

Out[36]:

0.89719426108289557

In [37]:

# add Newspaper to the model (which we believe has no association with Sales)
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared

Out[37]:

0.89721063817895219

R-squared will always increase as you add more features to the model, even if they are unrelated to the
response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the
best linear model.

There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity (http://scott.fortmann-roe.com/docs/MeasuringError.html).

So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of
out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-
sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods
for searching different sets of parameters and different models. Importantly, cross-validation can be applied
to any model, whereas the methods described above only apply to linear models.
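
As a minimal sketch of that idea (assuming a scikit-learn version that provides sklearn.model_selection), 10-fold cross-validation can be used to compare the model with and without Newspaper:

In [ ]:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

y = data.Sales

# average 10-fold mean squared error for each feature set
# (the scorer returns negative MSE, so negate the mean)
for cols in [['TV', 'Radio'], ['TV', 'Radio', 'Newspaper']]:
    scores = cross_val_score(LinearRegression(), data[cols], y, cv=10,
                             scoring='neg_mean_squared_error')
    print(cols, -scores.mean())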

Handling Categorical Predictors with Two Categories


Up to now, all of our predictors have been numeric. What if one of our predictors was categorical?

Let's create a new feature called Size, and randomly assign observations to be small or large:


In [38]:

import numpy as np

# set a seed for reproducibility


np.random.seed(12345)

# create a Series of booleans in which roughly half are True


nums = np.random.rand(len(data))
mask_large = nums > 0.5

# initially set Size to small, then change roughly half to be large


data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
data.head()

Out[38]:

TV Radio Newspaper Sales Size

1 230.1 37.8 69.2 22.1 large

2 44.5 39.3 45.1 10.4 small

3 17.2 45.9 69.3 9.3 small

4 151.5 41.3 58.5 18.5 small

5 180.8 10.8 58.4 12.9 large

For scikit-learn, we need to represent all data numerically. If the feature only has two categories, we can
simply create a dummy variable that represents the categories as a binary value:

In [39]:

# create a new Series called IsLarge


data['IsLarge'] = data.Size.map({'small':0, 'large':1})
data.head()

Out[39]:

TV Radio Newspaper Sales Size IsLarge

1 230.1 37.8 69.2 22.1 large 1

2 44.5 39.3 45.1 10.4 small 0

3 17.2 45.9 69.3 9.3 small 0

4 151.5 41.3 58.5 18.5 small 0

5 180.8 10.8 58.4 12.9 large 1

Let's redo the multiple linear regression and include the IsLarge predictor:


In [40]:

# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge']
X = data[feature_cols]
y = data.Sales

# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)

# print coefficients (wrap zip in list() so the pairs are displayed in Python 3)
list(zip(feature_cols, lm.coef_))

How do we interpret the IsLarge coefficient? For a given amount of TV/Radio/Newspaper ad spending,
being a large market is associated with an average increase in Sales of 57.42 widgets (as compared to a
Small market, which is called the baseline level).

What if we had reversed the 0/1 coding and created the feature 'IsSmall' instead? The coefficient would have the same magnitude, but it would be negative instead of positive. As such, your choice of baseline category does not matter; all that changes is your interpretation of the coefficient.
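
A quick sketch to verify that claim (the IsSmall column here is a hypothetical helper built on a copy of the data, so the DataFrame used in the rest of the notebook is untouched):

In [ ]:

# flip the 0/1 coding on a copy of the data and refit
tmp = data.copy()
tmp['IsSmall'] = 1 - tmp.IsLarge

feature_cols_small = ['TV', 'Radio', 'Newspaper', 'IsSmall']
lm_small = LinearRegression()
lm_small.fit(tmp[feature_cols_small], tmp.Sales)

# the IsSmall coefficient should have the same magnitude as the IsLarge coefficient, with the opposite sign
print(list(zip(feature_cols_small, lm_small.coef_)))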

Handling Categorical Predictors with More than Two Categories

Let's create a new feature called Area, and randomly assign observations to be rural, suburban, or urban:

In [41]:

# set a seed for reproducibility


np.random.seed(123456)

# assign roughly one third of observations to each group


nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()

Out[41]:

TV Radio Newspaper Sales Size IsLarge Area

1 230.1 37.8 69.2 22.1 large 1 rural

2 44.5 39.3 45.1 10.4 small 0 urban

3 17.2 45.9 69.3 9.3 small 0 rural

4 151.5 41.3 58.5 18.5 small 0 urban

5 180.8 10.8 58.4 12.9 large 1 suburban


We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban
because that would imply an ordered relationship between suburban and urban (and thus urban is
somehow "twice" the suburban category).

Instead, we create dummy variables:

In [42]:

# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()

Out[42]:

TV Radio Newspaper Sales Size IsLarge Area Area_suburban Area_urban

1 230.1 37.8 69.2 22.1 large 1 rural 0 0

2 44.5 39.3 45.1 10.4 small 0 urban 0 1

3 17.2 45.9 69.3 9.3 small 0 rural 0 0

4 151.5 41.3 58.5 18.5 small 0 urban 0 1

5 180.8 10.8 58.4 12.9 large 1 suburban 1 0

Here is how we interpret the coding:

rural is coded as Area_suburban=0 and Area_urban=0


suburban is coded as Area_suburban=1 and Area_urban=0
urban is coded as Area_suburban=0 and Area_urban=1

Why do we only need two dummy variables, not three? Because two dummies captures all of the
information about the Area feature, and implicitly defines rural as the baseline level. (In general, if you have a
categorical feature with k levels, you create k-1 dummy variables.)

If this is confusing, think about why we only needed one dummy variable for Size (IsLarge), not two dummy
variables (IsSmall and IsLarge).

Let's include the two new dummy variables in the model:


In [44]:

# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban']
X = data[feature_cols]
y = data.Sales

# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)

# print coefficients
print(feature_cols, lm.coef_)

['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban'] [ 0.04574401  0.1878667  -0.0010877   0.07739661 -0.10656299  0.26813802]

How do we interpret the coefficients?

Holding all other variables fixed, being a suburban area is associated with an average decrease in
Sales of 106.56 widgets (as compared to the baseline level, which is rural).
Being an urban area is associated with an average increase in Sales of 268.13 widgets (as
compared to rural).

A final note about dummy encoding: If you have categories that can be ranked (e.g., strongly disagree, disagree, neutral, agree, strongly agree), you can potentially represent them with a single numeric variable that encodes the ordering (such as 1, 2, 3, 4, 5).
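
A small hypothetical sketch of that approach (the Agreement column and its values below are invented for illustration and are not part of the advertising data):

In [ ]:

# map an ordered categorical feature to a single numeric column
agreement_levels = {'strongly disagree': 1, 'disagree': 2, 'neutral': 3,
                    'agree': 4, 'strongly agree': 5}

survey = pd.DataFrame({'Agreement': ['neutral', 'agree', 'strongly disagree']})
survey['AgreementScore'] = survey.Agreement.map(agreement_levels)
survey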


Assumptions of Linear Regression


Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions:

Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity

A note about sample size: a common rule of thumb is that linear regression requires at least 20 cases per independent variable in the analysis.

First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots; the two examples shown here depict a case with no linearity and a case with little linearity.

Secondly, linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
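
A sketch of those checks (assuming scipy is available; the model is refit with Statsmodels here so that we have a results object whose residuals we can inspect):

In [ ]:

import scipy.stats as stats

# refit the full model to get the residuals
lm_full = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
residuals = lm_full.resid

# histogram of the residuals
plt.figure()
residuals.hist(bins=20)

# Q-Q plot of the residuals against a normal distribution
plt.figure()
stats.probplot(residuals, dist='norm', plot=plt)

# Kolmogorov-Smirnov goodness-of-fit test on the standardized residuals
print(stats.kstest((residuals - residuals.mean()) / residuals.std(), 'norm'))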


Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. Multicollinearity may be tested with the following criteria:

1. Correlation matrix – when computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than 1.
2. Tolerance – the tolerance measures the influence of one independent variable on all other independent variables; it is calculated with an initial linear regression analysis and is defined as T = 1 – R² for these first-step regressions. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
3. Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity among the variables.

4. Condition Index – the condition index is calculated using a factor analysis on the independent variables. Values of 10–30 indicate moderate multicollinearity among the linear regression variables; values > 30 indicate strong multicollinearity.

If multicollinearity is found in the data, centering the data (that is, subtracting the mean of a variable from each of its scores) might help to solve the problem. The simplest remedy, however, is to remove independent variables with high VIF values; another alternative is to conduct a factor analysis and rotate the factors to ensure the independence of the factors entering the linear regression analysis.

Fourth, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other; in other words, the value of y(x+1) is not independent from the value of y(x). This typically occurs in time series such as stock prices, where a price is not independent from the previous price.
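
A sketch of the multicollinearity and autocorrelation checks described above, using two statsmodels helpers (variance_inflation_factor and durbin_watson; a constant column is added to mirror the model's intercept):

In [ ]:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# variance inflation factor for each advertising predictor
predictors = sm.add_constant(data[['TV', 'Radio', 'Newspaper']])
for i, name in enumerate(predictors.columns):
    if name != 'const':
        print(name, variance_inflation_factor(predictors.values, i))

# Durbin-Watson statistic for the residuals of the full model
# (values near 2 suggest little autocorrelation)
lm_full = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
print(durbin_watson(lm_full.resid))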


The last assumption of the linear regression analysis is homoscedasticity. A scatter plot is a good way to check whether the data are homoscedastic (meaning the residuals have equal variance across the regression line). The following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic).

The Goldfeld-Quandt Test can also be used to test for heteroscedasticity. The test splits the data into two groups and tests whether the variances of the residuals are similar across the groups. If heteroscedasticity is present, a non-linear correction might fix the problem.
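
A minimal sketch of that test (using statsmodels' diagnostic module; het_goldfeldquandt returns an F statistic, a p-value, and the ordering used for the alternative):

In [ ]:

from statsmodels.stats.diagnostic import het_goldfeldquandt

# compare the residual variance of the full model across two halves of the data
lm_full = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
fval, pval, ordering = het_goldfeldquandt(lm_full.model.endog, lm_full.model.exog)
print(fval, pval)   # a large p-value means we cannot reject homoscedasticity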

