Linear Regression: What Is Regression Analysis?

5/15/2018 Linear regression
Linear Regression
What is Regression Analysis?

Regression analysis is a form of predictive modelling technique which investigates the relationship between a
dependent (target) and independent variable (s) (predictor). This technique is used for forecasting, time
series modelling and finding the causal effect relationship between the variables. For example, relationship
between rash driving and number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the
data points, in such a manner that the differences between the distances of data points from the curve or line
is minimized. I’ll explain this in more details in coming sections.
Why do we use Regression Analysis?

As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current economic conditions. You
have the recent company data which indicates that the growth in sales is around two and a half times the
growth in the economy. Using this insight, we can predict future sales of the company based on current &
past information.
There are multiple benefits of using regression analysis. They are as follows:
It indicates the significant relationships between dependent variable and independent variable.
It indicates the strength of impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such as
the effect of price changes and the number of promotional activities. These benefits help market researchers
/ data analysts / data scientists to eliminate and evaluate the best set of variables to be used for building
predictive models.
Linear Regression
It is one of the most widely known modeling technique. Linear regression is usually among the first few topics
which people pick while learning predictive modeling. In this technique, the dependent variable is continuous,
independent variable(s) can be continuous or discrete, and nature of regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or more independent
variables (X) using a best fit straight line (also known as regression line).
It is represented by an equation Y=a+b*X + e, where a is intercept, b is slope of the line and e is error term.
This equation can be used to predict the value of target variable based on given predictor variable(s).
http://localhost:8888/nbconvert/html/Desktop/IPYNB_Files/masteringdatascience_good_images_pdf/ML2/Linear%20regression.ipynb?download=false
In [19]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
# this allows plots to appear directly in the notebook

%matplotlib inline
Example: Advertising Data

Let's take a look at some data, ask some questions about that data, and then use linear regression to answer
those questions!
In [20]:
# read data into a DataFrame

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()
Out[20]:
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
What are the features?
TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
Radio: advertising dollars spent on Radio
Newspaper: advertising dollars spent on Newspaper
What is the response?
Sales: sales of a single product in a given market (in thousands of widgets)
In [21]:
# print the shape of the DataFrame

data.shape
Out[21]:
(200, 4)
There are 200 observations, and thus 200 markets in the dataset.
In [22]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37b3f59b38>
Questions About the Advertising Data

Let's pretend you work for the company that manufactures and markets this widget. The company might ask
you the following: On the basis of this data, how should we spend our advertising money in the future?
This general question might lead you to more specific questions:
1. Is there a relationship between ads and sales?

2. How strong is that relationship?
3. Which ad types contribute to sales?
4. What is the effect of each ad type of sales?
5. Given ad spending in a particular market, can sales be predicted?
We will explore these questions below!
Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response using a single feature (or
"predictor" or "input variable"). It takes the following form:
y = β0 + β1 x
What does each term represent?
y is the response
x is the feature
β 0 is the intercept
β 1 is the coefficient for x
Together, β 0 and β 1 are called the model coefficients. To create your model, you must "learn" the values of
these coefficients. And once we've learned these coefficients, we can use the model to predict Sales!
Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we are find
the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors"):
What elements are present in the diagram?
The black dots are the observed values of x and y.

The blue line is our least squares line.
The red lines are the residuals, which are the distances between the observed values and the least
squares line.
How do the model coefficients relate to the least squares line?
β0 is the intercept (the value of y when x =0)

β1 is the slope (the change in y divided by change in x )
Here is a graphical depiction of those calculations:
The mathematics behind the Least Squares Method.
Take a quick look at the plot we created. Now consider each point, and know that they each have a
coordinate in the form (X,Y). Now draw an imaginary line between each point and our current "best-fit" line.
We'll call the distanace between each point and our current best-fit line, D. To get a quick image of what
we're currently trying to visualize, take a look at the picture below:
Now as before, we're labeling each green line as having a distance D, and each red point as having a
coordinate of (X,Y). Then we can define our best fit line as the line having the property were:
2 2 2 2 2
D + D + D + D +. . . . +D
1 2 3 4 N
So how do we find this line? The least-square line approximating the set of points:
(X, Y ) 1 , (X, Y ) 2 , (X, Y ) 3 , (X, Y ) 4 , (X, Y ) 5 ,
has the equation:

Y = a0 + a1 X
this is basically just a rewritten form of the standard equation for a line:
Y = mx + b
We can solve for these constants a0 and a1 by simultaneously solving these equations:
ΣY = a0 N + a1 ΣX
2
ΣXY = a0 ΣX + a1 ΣX
These are called the normal equations for the least squares line. There are further steps that can be taken in
rearranging these equations to solve for y, but we'll let scikit-learn do the rest of the heavy lifting here.
scikit-learn
Introduction
Since its release in 2007, scikit-learn has become one of the most popular open source machine learning
libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification,
regression, dimensionality reduction, and clustering. It also provides modules for extracting features,
processing data, and evaluating models.
Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy
and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional
matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing.
scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile
API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of
the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM
and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also
includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning
data.
Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without
restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-
learn is noted for its reliability; much of the library is covered by automated tests.
In [23]:
# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales
# follow the usual sklearn pattern: import, instantiate, fit

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients

print(lm.intercept_)
print(lm.coef_)
7.03259354913
[ 0.04753664]
Interpreting Model Coefficients

How do we interpret the TV coefficient (β 1 )?
A "unit" increase in TV ad spending is associated with a 0.047537 "unit" increase in Sales.

Or more clearly: An additional $1,000 spent on TV ads is associated with an increase in sales of
47.537 widgets.
Note that if an increase in TV ad spending was associated with a decrease in sales, β 1 would be negative.
Using the Model for Prediction

Let's say that there was a new market where the TV advertising spend was $50,000. What would we predict
for the Sales in that market?
y = β0 + β1 x
y = 7.032594 + 0.047537 × 50
In [24]:
# manually calculate the prediction

7.032594 + 0.047537*50
Out[24]:
9.409444
Thus, we would predict Sales of 9,409 widgets in that market.
In [25]:
# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})
X_new.head()
Out[25]:
TV
0 50
In [26]:
# use the model to make predictions on a new value

lm.predict(X_new)
Out[26]:
array([ 9.40942557])
Plotting the Least Squares Line

Let's make predictions for the smallest and largest observed values of x, and then use the predicted
values to plot the least squares line:
In [27]:
# create a DataFrame with the minimum and maximum values of TV

X_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
X_new.head()
Out[27]:
TV
0 0.7
1 296.4
In [28]:
# make predictions for those x values and store them

preds = lm.predict(X_new)
preds
Out[28]:
array([ 7.0658692 , 21.12245377])
In [29]:
# first, plot the observed data

data.plot(kind='scatter', x='TV', y='Sales')
# then, plot the least squares line

plt.plot(X_new, preds, c='red', linewidth=2)
Out[29]:
[<matplotlib.lines.Line2D at 0x7f37b47945c0>]
Confidence in our Model

Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?
Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low
variance), but the average of those models won't do a great job capturing the true relationship (high bias).
Note that low variance is a useful characteristic when you don't have a lot of training data!
A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our
model coefficients, which are interpreted as follows: If the population from which this sample was drawn was
sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
In [30]:
import statsmodels.formula.api as smf

lm = smf.ols(formula='Sales ~ TV', data=data).fit()
lm.conf_int()
Out[30]:
0 1
Intercept 6.129719 7.935468
TV 0.042231 0.052843
Keep in mind that we only have a single sample of data, and not the entire population of data. The "true"
coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the
coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the
coefficient is probably within.
Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals
(which will be more narrow), 99% confidence intervals (which will be wider), or whatever intervals you like.
Hypothesis Testing and p-values

Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null
hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data
supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative
hypothesis may indeed be true, except that you just don't have enough data to show that.)
As it relates to model coefficients, here is the conventional hypothesis test:
null hypothesis: There is no relationship between TV ads and Sales (and thus β 1 equals zero)
alternative hypothesis: There is a relationship between TV ads and Sales (and thus β 1 is not
equal to zero)
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95%
confidence interval does not include zero. Conversely, the p-value represents the probability that the
coefficient is actually zero:
In [31]:
# print the p-values for the model coefficients

lm.pvalues
Out[31]:
Intercept 1.406300e-35
TV 1.467390e-42
dtype: float64
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the
95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than
0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again,
using 0.05 as the cutoff is just a convention.)
In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between
TV ads and Sales.
Note that we generally ignore the p-value for the intercept.
How Well Does the Model Fit the data?

The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is
the proportion of variance explained, meaning the proportion of variance in the observed data that is
explained by the model, or the reduction in error over the null model. (The null model just predicts the mean
of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the
model. Here's an example of what R-squared "looks like":
You can see that the blue line explains some of the variance in the data (R-squared=0.54), the green line
explains more of the variance (R-squared=0.64), and the red line fits the training data even further (R-
squared=0.66). (Does the red line look like it's overfitting?)
Let's calculate the R-squared value for our simple linear model:
In [32]:
# print the R-squared value for the model

lm.rsquared
Out[32]:
0.61187505085007099
Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely
on the domain. Therefore, it's most useful as a tool for comparing different models.
Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features. This is called multiple linear
regression:
y = β 0 + β 1 x 1 +. . . +β n x n
Each x represents a different feature, and each feature has its own coefficient. In this case:
y = β 0 + β 1 × T V + β 2 × Radio + β 3 × N ewspaper
Let's use Statsmodels to estimate these coefficients:
In [34]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
y = data.Sales
lm.fit(X, y)
# print intercept and coefficients

print(lm.intercept_)
print(lm.coef_)
2.93888936946
[ 0.04576465 0.18853002 -0.00103749]
How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an
increase of $1000 in TV ad spending is associated with an increase in Sales of 45.765 widgets.
A lot of the information we have been reviewing piece-by-piece is available in the model summary output:
In [35]:
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()

lm.conf_int()
lm.summary()
Out[35]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Thu, 07 Sep 2017 Prob (F-statistic): 1.58e-96
Time: 09:23:48 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
What are a few key things we learn from this output?
TV and Radio have significant p-values, whereas Newspaper does not. Thus we reject the null
hypothesis for TV and Radio (that there is no association between those features and Sales), and
fail to reject the null hypothesis for Newspaper.
TV and Radio ad spending are both positively associated with Sales, whereas Newspaper ad
spending is slightly negatively associated with Sales. (However, this is irrelevant since we have
failed to reject the null hypothesis for Newspaper.)
This model has a higher R-squared (0.897) than the previous model, which means that this model
provides a better fit to the data than a model that only includes TV.
Feature Selection
How do I decide which features to include in a linear model? Here's one idea:
Try different models, and only keep predictors in the model if they have small p-values.
Check whether the R-squared value goes up when you add new predictors.
What are the drawbacks to this approach?
Linear models rely upon a lot of assumptions (such as the features being independent), and if
those assumptions are violated (which they usually are), R-squared and p-values are less reliable.
Using a p-value cutoff of 0.05 means that if you add 100 predictors to a model that are pure noise,
5 of them (on average) will still be counted as significant.
R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-
squared value will generalize. Below is an example:
In [36]:
# only include TV and Radio in the model

lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared
Out[36]:
0.89719426108289557
In [37]:
# add Newspaper to the model (which we believe has no association with Sales)
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared
Out[37]:
0.89721063817895219
R-squared will always increase as you add more features to the model, even if they are unrelated to the
response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the
best linear model.
There is alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for
overfitting), but it generally under-penalizes complexity (http://scott.fortmann-
roe.com/docs/MeasuringError.html).
So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of
out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-
sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods
for searching different sets of parameters and different models. Importantly, cross-validation can be applied
to any model, whereas the methods described above only apply to linear models.
Handling Categorical Predictors with Two Categories

Up to now, all of our predictors have been numeric. What if one of our predictors was categorical?
Let's create a new feature called Size, and randomly assign observations to be small or large:
In [38]:
import numpy as np
# set a seed for reproducibility

np.random.seed(12345)
# create a Series of booleans in which roughly half are True

nums = np.random.rand(len(data))
mask_large = nums > 0.5
# initially set Size to small, then change roughly half to be large

data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
data.head()
Out[38]:
TV Radio Newspaper Sales Size
1 230.1 37.8 69.2 22.1 large
2 44.5 39.3 45.1 10.4 small
3 17.2 45.9 69.3 9.3 small
4 151.5 41.3 58.5 18.5 small
5 180.8 10.8 58.4 12.9 large
For scikit-learn, we need to represent all data numerically. If the feature only has two categories, we can
simply create a dummy variable that represents the categories as a binary value:
In [39]:
# create a new Series called IsLarge

data['IsLarge'] = data.Size.map({'small':0, 'large':1})
data.head()
Out[39]:
TV Radio Newspaper Sales Size IsLarge
1 230.1 37.8 69.2 22.1 large 1
2 44.5 39.3 45.1 10.4 small 0
3 17.2 45.9 69.3 9.3 small 0
4 151.5 41.3 58.5 18.5 small 0
5 180.8 10.8 58.4 12.9 large 1
Let's redo the multiple linear regression and include the IsLarge predictor:
In [40]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge']
y = data.Sales
# instantiate, fit
lm.fit(X, y)
# print coefficients
zip(feature_cols, lm.coef_)
Out[40]:
<zip at 0x7f37b3ef8cc8>
How do we interpret the IsLarge coefficient? For a given amount of TV/Radio/Newspaper ad spending,
being a large market is associated with an average increase in Sales of 57.42 widgets (as compared to a
Small market, which is called the baseline level).
What if we had reversed the 0/1 coding and created the feature 'IsSmall' instead? The coefficient would be
the same, except it would be negative instead of positive. As such, your choice of category for the baseline
does not matter, all that changes is your interpretation of the coefficient.
Handling Categorical Predictors with More than Two

Categories
Let's create a new feature called Area, and randomly assign observations to be rural, suburban, or urban:
In [41]:
# set a seed for reproducibility

np.random.seed(123456)
# assign roughly one third of observations to each group

nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()
Out[41]:
TV Radio Newspaper Sales Size IsLarge Area
1 230.1 37.8 69.2 22.1 large 1 rural
2 44.5 39.3 45.1 10.4 small 0 urban
3 17.2 45.9 69.3 9.3 small 0 rural
4 151.5 41.3 58.5 18.5 small 0 urban
5 180.8 10.8 58.4 12.9 large 1 suburban
We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban
because that would imply an ordered relationship between suburban and urban (and thus urban is
somehow "twice" the suburban category).
Instead, we create another dummy variable:
In [42]:
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means row
s, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()
Out[42]:
TV Radio Newspaper Sales Size IsLarge Area Area_suburban Area_
1 230.1 37.8 69.2 22.1 large 1 rural 0 0
2 44.5 39.3 45.1 10.4 small 0 urban 0 1
3 17.2 45.9 69.3 9.3 small 0 rural 0 0
4 151.5 41.3 58.5 18.5 small 0 urban 0 1
5 180.8 10.8 58.4 12.9 large 1 suburban 1 0
Here is how we interpret the coding:
rural is coded as Area_suburban=0 and Area_urban=0

suburban is coded as Area_suburban=1 and Area_urban=0
urban is coded as Area_suburban=0 and Area_urban=1
Why do we only need two dummy variables, not three? Because two dummies captures all of the
information about the Area feature, and implicitly defines rural as the baseline level. (In general, if you have a
categorical feature with k levels, you create k-1 dummy variables.)
If this is confusing, think about why we only needed one dummy variable for Size (IsLarge), not two dummy
variables (IsSmall and IsLarge).
Let's include the two new dummy variables in the model:
In [44]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban']
y = data.Sales
# instantiate, fit
lm.fit(X, y)
# print coefficients
print(feature_cols, lm.coef_)
['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban'] [

0.04574401 0.1878667 -0.0010877 0.07739661 -0.10656299 0.26813802]
How do we interpret the coefficients?
Holding all other variables fixed, being a suburban area is associated with an average decrease in
Sales of 106.56 widgets (as compared to the baseline level, which is rural).
Being an urban area is associated with an average increase in Sales of 268.13 widgets (as
compared to rural).
A final note about dummy encoding: If you have categories that can be ranked (i.e., strongly disagree,
disagree, neutral, agree, strongly agree), you can potentially use a single dummy variable and represent the
categories numerically (such as 1, 2, 3, 4, 5).
Assumptions of Linear Regression

Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
A note about sample size. In Linear regression the sample size rule of thumb is that the regression analysis
requires at least 20 cases per independent variable in the analysis.
First, linear regression needs the relationship between the independent and dependent variables to be linear.
It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity
assumption can best be tested with scatter plots, the following two examples depict two cases, where no and
little linearity is present.
Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can
best be checked with a histogram or a Q-Q-Plot. Normality can be checked with a goodness of fit test, e.g.,
the Kolmogorov-Smirnov test. When the data is not normally distributed a non-linear transformation (e.g., log-
transformation) might fix this issue.
Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs
when the independent variables are too highly correlated with each other. Multicollinearity may be tested with
three central criteria:
1. Correlation matrix – when computing the matrix of Pearson's Bivariate Correlation among all
independent variables the correlation coefficients need to be smaller than 1.
2. Tolerance – the tolerance measures the influence of one independent variable on all other
independent variables; the tolerance is calculated with an initial linear regression analysis.
Tolerance is defined as T = 1 – R² for these first step regression analysis. With T < 0.1 there might
be multicollinearity in the data and with T < 0.01 there certainly is.
3. Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as
VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100
there is certainly multicollinearity among the variables.
If multicollinearity is found in the data, centering the data (that is deducting the mean of the variable from
each score) might help to solve the problem. However, the simplest way to address the problem is to remove
independent variables with high VIF values. Fourth, linear regression analysis requires that there is little or no
autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other.
For instance, this typically occurs in stock prices, where the price is not independent from the previous price.
4) Condition Index – the condition index is calculated using a factor analysis on the independent variables.
Values of 10-30 indicate a mediocre multicollinearity in the linear regression variables, values > 30 indicate
strong multicollinearity.
If multicollinearity is found in the data centering the data, that is deducting the mean score might help to solve
the problem. Other alternatives to tackle the problems is conducting a factor analysis and rotating the factors
to insure independence of the factors in the linear regression analysis. Fourthly, linear regression analysis
requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not
independent from each other. In other words when the value of y(x+1) is not independent from the value of
y(x).
The last assumption of the linear regression analysis is homoscedasticity. The scatter plot is good way to
check whether the data are homoscedastic (meaning the residuals are equal across the regression line). The
following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic):
The Goldfeld-Quandt Test can also be used to test for heteroscedasticity. The test splits the data into two
groups and tests to see if the variances of the residuals are similar across the groups. If homoscedasticity is
present, a non-linear correction might fix the problem.
In [ ]:

Linear Regression: What Is Regression Analysis?

Uploaded by

Copyright:

Available Formats

Linear Regression: What Is Regression Analysis?

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Linear Regression: What Is Regression Analysis?

Uploaded by

Copyright:

Available Formats

5/15/2018 Linear regression

What is Regression Analysis?

Why do we use Regression Analysis?

# this allows plots to appear directly in the notebook

Example: Advertising Data

# read data into a DataFrame

TV Radio Newspaper Sales

1 230.1 37.8 69.2 22.1

2 44.5 39.3 45.1 10.4

3 17.2 45.9 69.3 9.3

4 151.5 41.3 58.5 18.5

5 180.8 10.8 58.4 12.9

What are the features?

What is the response?

Sales: sales of a single product in a given market (in thousands of widgets)

# print the shape of the DataFrame

Questions About the Advertising Data

This general question might lead you to more specific questions:

1. Is there a relationship between ads and sales?

We will explore these questions below!

Simple Linear Regression

What does each term represent?

β 1 is the coefficient for x

Estimating ("Learning") Model Coefficients

What elements are present in the diagram?

The black dots are the observed values of x and y.

How do the model coefficients relate to the least squares line?

β0 is the intercept (the value of y when x =0)

Here is a graphical depiction of those calculations:

The mathematics behind the Least Squares Method.

has the equation:

# follow the usual sklearn pattern: import, instantiate, fit

# print intercept and coefficients

Interpreting Model Coefficients

A "unit" increase in TV ad spending is associated with a 0.047537 "unit" increase in Sales.

Using the Model for Prediction

# manually calculate the prediction

Thus, we would predict Sales of 9,409 widgets in that market.

# use the model to make predictions on a new value

Plotting the Least Squares Line

# create a DataFrame with the minimum and maximum values of TV

# make predictions for those x values and store them

array([ 7.0658692 , 21.12245377])

# first, plot the observed data

# then, plot the least squares line

Confidence in our Model

import statsmodels.formula.api as smf

Intercept 6.129719 7.935468

Hypothesis Testing and p-values

As it relates to model coefficients, here is the conventional hypothesis test:

# print the p-values for the model coefficients

Note that we generally ignore the p-value for the intercept.

How Well Does the Model Fit the data?

# print the R-squared value for the model

Multiple Linear Regression

Let's use Statsmodels to estimate these coefficients:

# print intercept and coefficients

lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()

OLS Regression Results