1 linear regression

Machine Learning
Chapter -0 Linear Regression
Learning Goals
What is regression?
Why regression?
Scatter plot
Measures of association
Correlation coefficient
Simple linear regression
Fitting a regression line

Regression Analysis
Regression Analysis + Correlation = Predict future performance using past results
Correlation measures the degree of linear relationship between two variables; regression describes that relationship more precisely, as an equation.
Regression analysis is a tool that uses data on relevant variables to develop a prediction equation, or model.
The equation describes the statistical relationship between one or more predictors and the response variable, and it can be used to predict new observations.
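
To make the contrast between correlation and regression concrete, the sketch below computes both on a small hypothetical data set (the numbers are made up for illustration and are not from these slides). It assumes NumPy is available. The correlation coefficient only reports how strong the linear association is, while the regression line gives an equation that can be used for prediction.

```python
import numpy as np

# Hypothetical illustration data (an assumption, not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation: strength of the linear association, unitless, in [-1, 1].
r = np.corrcoef(x, y)[0, 1]

# Least-squares line; np.polyfit returns coefficients highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)

# The two are linked: slope = r * (std of y / std of x).
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"r = {r:.4f}")
print(f"y_hat = {intercept:.3f} + {slope:.3f} * x")
```

The same value of r is compatible with many different slopes, which is why the regression equation carries more information than the correlation alone.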

www.learninglabb.com Page 1
Linear regression has five key assumptions:
Linear relationship.
Multivariate normality.
No or little multicollinearity.
Homoscedasticity: the variance of the residual, or error term, in a regression model is constant.
No or little autocorrelation: autocorrelation is present when the value of y(x+1) is not independent of the value of y(x).

Linear Relationship
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also
important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be
tested with scatter plots.
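
The sensitivity to outliers mentioned above can be seen in a small sketch. The data here are hypothetical (points lying exactly on y = 2x + 1), and NumPy is assumed to be available. Moving a single observation is enough to shift the fitted slope noticeably.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1 (an assumption for illustration).
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation at the edge of the x range.
y_outlier = y.copy()
y_outlier[-1] += 30.0

slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with one outlier: {slope_out:.3f}")
```

An outlier at the edge of the x range has more leverage than one near the middle, which is one reason a scatter plot is worth inspecting before fitting.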

Multivariate Normality
Multivariate normality is the second assumption of linear regression. Linear regression analysis requires the variables to be multivariate normal; in practice this is usually checked on the residuals, which should be approximately normally distributed.

To check normality, use a Q-Q plot, which compares the quantiles of the data with those of a normal distribution. If the data are normally distributed, the points fall close to a straight line. If they are not, the points deviate from that line.
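
A Q-Q plot can also be summarized numerically. The sketch below builds the Q-Q points by hand and measures how straight they are, using the correlation between theoretical and observed quantiles. It assumes NumPy and Python 3.8+ (for `statistics.NormalDist`). The helper names `qq_points` and `qq_straightness`, the plotting-position formula, and the simulated samples are all illustrative choices, not something the slides prescribe.

```python
import numpy as np
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    observed = np.sort(np.asarray(sample, dtype=float))
    n = observed.size
    # Plotting positions (i - 0.5) / n: one common convention among several.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed


def qq_straightness(sample):
    """Correlation of the Q-Q points; values near 1 mean a nearly straight line."""
    theoretical, observed = qq_points(sample)
    return np.corrcoef(theoretical, observed)[0, 1]


rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50.0, scale=5.0, size=500)   # simulated, normal
skewed_sample = rng.exponential(scale=5.0, size=500)        # simulated, skewed

normal_r = qq_straightness(normal_sample)
skewed_r = qq_straightness(skewed_sample)
print(f"normal sample: {normal_r:.4f}")
print(f"skewed sample: {skewed_r:.4f}")
```

Passing the two arrays from `qq_points` to any plotting library gives the usual visual Q-Q plot; the correlation is only a rough companion to looking at the picture.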

No or low Multicollinearity
No or low multicollinearity is the third assumption of linear regression. Multicollinearity refers to a situation where several independent variables in a multiple regression model are closely correlated with one another.
It generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict another. This creates redundant information, which makes the coefficient estimates unstable and hard to interpret.

Two methods to check multicollinearity:

Correlation coefficients
Variance Inflation Factor (VIF)
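
Both checks can be sketched in a few lines. The VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The function name `variance_inflation_factors` and the simulated predictors are assumptions made for this illustration, and NumPy is assumed to be available.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (rows = observations, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
print(np.corrcoef(X, rowvar=False).round(2))  # pairwise correlation coefficients
print(vifs.round(1))
```

A VIF near 1 indicates little collinearity. Values above roughly 5 or 10 are often treated as a warning sign, although that cutoff is a rule of thumb rather than a strict threshold.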

Homoscedasticity
Homoscedasticity is the fourth assumption of linear regression. It describes a situation in which the variance of the error term (the "noise" or random disturbance in the relationship between the independent variables and the target) is the same across all values of the independent variables. A scatter plot of residual values vs predicted values is a good way to check for homoscedasticity.
If the residuals are spread evenly around the zero line across all predicted values, the data are said to be homoscedastic.
If the spread of the residuals is unequal across predicted values, the data are said to be heteroscedastic. In this case, the residuals can form a bow-tie, an arrow, or any other non-symmetric shape.
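
As a numeric companion to the residual plot, the sketch below compares the residual variance for high fitted values against that for low fitted values. This is a crude split-sample ratio in the spirit of the Goldfeld-Quandt test, not the formal test itself. The function name `residual_variance_ratio` and the simulated data are assumptions for illustration; NumPy is assumed to be available.

```python
import numpy as np


def residual_variance_ratio(x, y):
    """Fit y on x, then return var(residuals | high fitted) / var(residuals | low fitted)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    order = np.argsort(fitted)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.var(ddof=1) / low.var(ddof=1)


# Simulated data (hypothetical): same line, two different noise patterns.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 400)
y_const = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)     # constant noise
y_fan = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=x.size)   # noise grows with x

ratio_const = residual_variance_ratio(x, y_const)
ratio_fan = residual_variance_ratio(x, y_fan)
print(f"constant noise: {ratio_const:.2f}")
print(f"growing noise:  {ratio_fan:.2f}")
```

A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 suggests the spread changes with the predicted value. The ratio only detects spread that grows or shrinks monotonically, so a bow-tie pattern would still need the plot.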

Little or No autocorrelation
No or low autocorrelation is the fifth assumption of linear regression. Linear regression analysis requires that there is little or no autocorrelation in the data.
In other words, the value of y(x+1) should be independent of the value of y(x).
If the values of a column or feature are correlated with other values of that same column, the column is said to be autocorrelated. In other words, autocorrelation is correlation within a column.
The residuals in the linear regression model are assumed to be independently and identically distributed.
Autocorrelation occurs when the residuals are not independent of each other.
Plot the ACF (autocorrelation function) of the residuals to check for autocorrelation in the data.
If the residual graph looks cyclic, the residuals contain positive autocorrelation. If it alternates between positive and negative values, the residuals contain negative autocorrelation.
The graph on this slide shows positive autocorrelation because it looks cyclic.
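
The values behind an ACF plot are straightforward to compute. The sketch below does so for two simulated residual series and also reports the Durbin-Watson statistic, a common complementary check that these slides do not mention. The function names and the simulated AR(1) series are assumptions for illustration; NumPy is assumed to be available.

```python
import numpy as np


def acf(residuals, max_lag=10):
    """Sample autocorrelation of a residual series for lags 0..max_lag."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    denom = np.sum(e ** 2)
    return np.array([np.sum(e[lag:] * e[:e.size - lag]) / denom
                     for lag in range(max_lag + 1)])


def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no lag-1 autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


rng = np.random.default_rng(0)
noise = rng.normal(size=500)

# Independent residuals (simulated).
independent = noise

# AR(1) residuals e_t = 0.8 * e_(t-1) + noise_t: positively autocorrelated (simulated).
positive = np.zeros(500)
for t in range(1, 500):
    positive[t] = 0.8 * positive[t - 1] + noise[t]

acf_independent = acf(independent)
acf_positive = acf(positive)
print("lag-1 ACF, independent:", round(acf_independent[1], 3))
print("lag-1 ACF, AR(1):      ", round(acf_positive[1], 3))
print("Durbin-Watson, independent:", round(durbin_watson(independent), 3))
print("Durbin-Watson, AR(1):      ", round(durbin_watson(positive), 3))
```

Durbin-Watson values well below 2 point to positive autocorrelation and values well above 2 point to negative autocorrelation.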

Simple Linear Regression
Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor, or independent variable, and the other is the response, or dependent variable.

It looks for a statistical relationship, not a deterministic one.

For example, the relationship between height and weight.

In simple linear regression, a single variable "X" is used to define/predict Y.


E.g. Used car cost = B1 + (B2) x (Miles driven) + E (error)
Simple Regression Equation: Y = B1 + (B2)*(X) + E (error)

Regression

Business Case : The Newspaper Data
In order to investigate the feasibility of starting a Sunday edition for a large metropolitan newspaper, information was
obtained from a sample of 34 newspapers concerning their daily and Sunday circulations (in thousands)

Scatter Plot : The Newspaper Data

Which straight line?...
The Newspaper Data

The Best line : Least Squares Method
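The line of interest is the fitted line Y = B1 + (B2)*(X), where B1 and B2 are chosen to minimize the sum of squared differences between the observed and fitted values of Y.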
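
The least squares method has a closed-form solution for a single predictor: the slope is the sum of cross-deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The sketch below applies it to hypothetical circulation figures. These are NOT the 34-newspaper sample, which is not reproduced in these slides, and the function name `fit_line` is an illustrative choice. NumPy is assumed to be available.

```python
import numpy as np


def fit_line(x, y):
    """Return (b1, b2): intercept and slope minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    b2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b1 = y_bar - b2 * x_bar
    return b1, b2


# Hypothetical circulations in thousands (made up, not the real newspaper data).
daily = np.array([200.0, 350.0, 500.0, 650.0, 800.0])
sunday = np.array([280.0, 470.0, 690.0, 880.0, 1080.0])

b1, b2 = fit_line(daily, sunday)
predicted = b1 + b2 * 400.0  # predicted Sunday circulation for a daily circulation of 400

print(f"Sunday = {b1:.2f} + {b2:.3f} * Daily")
print(f"prediction at Daily = 400: {predicted:.1f}")
```

`np.polyfit(daily, sunday, deg=1)` returns the same slope and intercept (slope first), so the hand-written formula is mainly useful for seeing what the fit computes.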

Coefficient of Determination R^2
Proportion of variation in Y "explained" by the regression on X

$r^2 = \dfrac{\text{explained variation}}{\text{total variation}} = \dfrac{SSR}{SST}, \qquad 0 \le r^2 \le 1$

Measure of variation
Sums of squares
Total sum of squares = Regression sum of squares + Error sum of squares
Total variation = Explained variation + Unexplained variation

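Total sum of squares (Total Variation): $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Regression sum of squares (Explained Variation): $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$

Error sum of squares (Unexplained Variation): $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Here $Y_i$ is the observed value, $\hat{Y}_i$ is the value fitted by the regression line, and $\bar{Y}$ is the mean of the observed values.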
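
The decomposition of total variation into explained and unexplained parts can be verified numerically. The sketch below fits a least-squares line to a small hypothetical data set (made up for illustration) and computes the three sums of squares. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical illustration data (an assumption, not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 3.1, 4.8, 5.1, 6.9])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x   # fitted values
y_bar = y.mean()                # mean of the observed values

sst = np.sum((y - y_bar) ** 2)      # total variation
ssr = np.sum((y_hat - y_bar) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)      # unexplained variation

r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"r^2 = {r_squared:.4f}")
```

The identity SST = SSR + SSE holds for a least-squares fit that includes an intercept; for other kinds of predictions the two sides generally differ.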

r2_score
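Two common evaluation metrics are r2_score and mean_squared_error. For a least-squares line evaluated on the data it was fitted to, the r2_score ranges from 0 to 1, where 1 signifies a perfect fit and 0 signifies a model that does no better than predicting the mean; on other data it can be negative. The MSE, or mean squared error, is the average of the squared errors, and the lower the MSE, the better the model fits.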
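
Both metrics can be written out in a few lines, which makes their definitions explicit. The functions `mse` and `r2` below are hand-written sketches that follow the same definitions as `mean_squared_error` and `r2_score` in `sklearn.metrics`; they are not the library code. The arrays are hypothetical, and NumPy is assumed to be available. Note that the score is not bounded below by 0 in general.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst


# Hypothetical values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.8, 5.1, 7.2, 8.9])   # close to the truth
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])    # trend reversed
mean_pred = np.full(4, y_true.mean())        # always predicts the mean

for name, pred in [("good", good_pred), ("bad", bad_pred), ("mean", mean_pred)]:
    print(f"{name}: mse = {mse(y_true, pred):.3f}, r2 = {r2(y_true, pred):.3f}")
```

Predicting the mean gives a score of exactly 0, and predictions worse than the mean give a negative score.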

Let's implement in Python
Simple Linear Regression: The Newspaper data
Regression Output

Waist Circumference Adipose Tissue
The Waist Circumference — Adipose Tissue business problem
Studies have shown that individuals with excess adipose tissue (AT) in the abdominal region have a higher risk of cardiovascular diseases.
Computed Tomography, commonly called the CT scan, is the only technique that allows for the precise and reliable measurement of the AT (at any site in the body).
The problems with using the CT scan are:
Many physicians do not have access to this technology
Irradiation of the patient (suppresses the immune system)
Expensive
Is there a simpler yet reasonably accurate way to predict the AT area? i.e.
Easily available
Risk free
Inexpensive
A group of researchers conducted a study with the aim of predicting abdominal AT area using simple anthropometric measurements, i.e. measurements on the human body.
The Waist Circumference — Adipose Tissue data is a part of this study wherein the aim is to study how well waist
circumference(WC) predicts the AT area.
The Waist Circumference Adipose Tissue data
