
Lecture 3

Simple Linear Regression


Corporate Gurukul – Data Analytics using Deep Learning
June 2024

Lecturer: A/P TAN Wee Kek


Email: [email protected] :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
• At the end of this lecture, you should understand:
  • Structure of regression models.
  • Simple linear regression.
  • Validation of regression models.



Overview of Regression Analysis
• In data mining, we are interested in predicting the value of a target variable from the value of one or more explanatory variables.
  • Example – Predict a child's weight based on his/her height.
  • Weight is the target variable.
  • Height is the explanatory variable.
• Regression analysis builds statistical models that characterize relationships among numerical variables.
• Two broad categories of regression models:
  • Cross-sectional data – Focus of this lecture.
  • Time-series data – Focus of a subsequent lecture: the independent variables are time or some function of time.
Structure of Regression Models
• The purpose of regression models is to identify a functional relationship between the target variable and a subset of the remaining variables in the data.
• The goal of regression models is twofold:
  • Highlight and interpret the dependency of the target variable on the other variables.
  • Predict future values of the target variable based upon the functional relationship identified and future values of the explanatory variables.
• The target variable is also known as the dependent, response or output variable.
• Explanatory variables are also known as independent or predictor variables.
Structure of Regression Models (cont.)
• Suppose a dataset D is composed of m observations, a target variable and n explanatory variables:
  • The explanatory variables of each observation may be represented by a vector xᵢ, i ∈ M, in the n-dimensional space ℝⁿ.
  • The target variable is denoted by yᵢ.
  • The m observation vectors are written as a matrix X of dimension m × n.
  • The target variable is written as y = (y₁, y₂, ..., yₘ).
  • Let Y be the random variable representing the target attribute and Xⱼ, j ∈ N, the random variables associated with the explanatory variables.



Structure of Regression Models (cont.)
• Regression models conjecture the existence of a function f: ℝⁿ → ℝ that expresses the relationship between the target variable Y and the n explanatory variables Xⱼ:

    Y = f(X₁, X₂, ..., Xₙ)



Linear Regression Models
• If we assume that the functional relationship f: ℝⁿ → ℝ is linear, we have linear regression models.
• This assumption may be restrictive, but many nonlinear relationships can be reduced to a linear one by applying an appropriate preliminary transformation, as shown in the sketch below:
  • A quadratic relationship of the form

    Y = b + wX + dX²

    can be linearized through the transformation Z = X² into a linear relationship with two explanatory variables:

    Y = b + wX + dZ
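A minimal sketch of this linearization in Python; the synthetic data and the coefficient values are illustrative assumptions, not the lecture's dataset:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(0, 1, size=100)  # Y = b + wX + dX^2 + noise

# Introduce Z = X^2 so the model is linear in its two features (X, Z).
features = np.column_stack([x, x**2])
model = LinearRegression().fit(features, y)
print(model.intercept_, model.coef_)  # estimates of b and (w, d)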



Linear Regression Models (cont.)
• An exponential relationship of the form

    Y = e^(b + wX)

  can be linearized through a logarithmic transformation Z = log Y, which converts it into the linear relationship:

    Z = b + wX

• A simple linear regression model with one explanatory variable is of the form:

    Y = α + βX + ε



Simple Linear Regression
• Bivariate linear regression models a random variable Y as a linear function of another random variable X:

    Y = α + βX + ε

  • ε is a random variable, referred to as the error, which indicates the discrepancy between the response Y and the prediction f(X) = α + βX.
  • When the regression coefficients are determined by minimizing the sum of squared errors SSE, ε is assumed to follow a normal distribution with mean 0 and standard deviation σ:

    E(εᵢ | Xᵢ) = 0
    var(εᵢ | Xᵢ) = σ²

Note: The standard deviation is the square root of the variance, and the variance is the average of the squared differences from the mean.
Simple Linear Regression (cont.)
• The preceding model is known as simple linear regression, where there is only one explanatory variable.
• When there are multiple explanatory variables, the model would be a multiple linear regression model of the form:

    Y = α + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε



Simple Linear Regression (cont.)
• For a simple linear regression model of the form

    Y = α + βX + ε

  given the data samples (x₁, y₁), (x₂, y₂), ..., (xₛ, yₛ):
  • The error for the prediction is:

    εᵢ = yᵢ − f(xᵢ) = yᵢ − α − βxᵢ = yᵢ − ŷᵢ

  • The regression coefficients α and β can be computed by the method of least squares, which minimizes the sum of the squared errors SSE:

    SSE = Σᵢ εᵢ²  (sum over i = 1, ..., s)



Simple Linear Regression (cont.)
• To find the regression coefficients α and β that minimize SSE (a NumPy sketch follows):

    β = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

    α = ȳ − βx̄

  where the sums run over i = 1, ..., S and

    x̄ = (Σᵢ xᵢ) / S
    ȳ = (Σᵢ yᵢ) / S
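A minimal sketch of these formulas with NumPy; the x and y arrays are illustrative assumptions, not the lecture's dataset:

import numpy as np

x = np.array([57.3, 59.8, 62.5, 65.3, 66.5, 69.0])    # assumed height samples
y = np.array([83.0, 84.5, 99.0, 102.5, 112.0, 112.5])  # assumed weight samples

x_bar, y_bar = x.mean(), y.mean()
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha = y_bar - beta * x_bar
print(alpha, beta)  # least-squares estimates of the coefficients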
Simple Linear Regression (cont.)
• Suppose we have a linear equation y = 2 + 3x for which SSE = 0: every observation falls exactly on the fitted line.



Example of Simple Linear Regression
• Predict a child's weight based on height:
  • The dataset contains 19 observations.
  • There are four variables altogether – Name, Weight, Height and Age.
• Note that for linear regression, we are using both Scikit-Learn and StatsModels.
  • StatsModels provides more summary statistics than Scikit-Learn.
  • We could also manually calculate the required statistics...



Example of Simple Linear Regression (cont.)

[Figure: code listing src01 – fitting the regression with Scikit-Learn and StatsModels; a hedged sketch follows.]
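A hedged sketch of what a script like src01 might contain; the file name children.csv and the column names are assumptions, not the lecture's actual files:

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

df = pd.read_csv("children.csv")  # assumed file; columns Name, Weight, Height, Age
X, y = df[["Height"]], df["Weight"]

# Scikit-Learn: point estimates only.
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_)

# StatsModels: full summary statistics (R-squared, t-tests, F-test, ...).
sm_model = sm.OLS(y, sm.add_constant(X)).fit()
print(sm_model.summary())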



Example of Simple Linear Regression (cont.)

[Figure: Scikit-Learn's output (left) and StatsModels' regression summary (right).]



Example of Simple Linear Regression (cont.)
• The regression equation is y = 42.5701 + 0.1976x + ε.
• β = 0.1976: a one-unit increase in height leads to an expected increase of 0.1976 unit in weight.
• α = 42.5701: when x = 0, the expected y value is 42.5701 (danger of extrapolation).
• N = 19 is the number of observations.
• Most of the dots, i.e., the actual (xᵢ, yᵢ) values, are close to the fitted line.
Validation of Model – Coefficient of Determination
• R-Square = R² = 0.7705: 77.05% of the variation in yᵢ is explained by the model.
• This value is the proportion of total variance explained by the predictive variable(s); a sketch follows:

    R² = Model Sum of Squares / Corrected Total Sum of Squares
       = 1 − Error Sum of Squares / Corrected Total Sum of Squares

    Model Sum of Squares = Σᵢ (ŷᵢ − ȳ)²
    Error Sum of Squares = Σᵢ (yᵢ − ŷᵢ)²
    Corrected Total Sum of Squares = Σᵢ (yᵢ − ȳ)²

    (sums over i = 1, ..., S)

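A minimal sketch computing the three sums of squares and R² by hand; y and y_hat are assumed arrays of actual and fitted values:

import numpy as np

y = np.array([50.5, 77.0, 83.0, 84.5, 112.0])      # assumed actual values
y_hat = np.array([55.0, 74.2, 85.1, 86.0, 108.7])  # assumed fitted values

y_bar = y.mean()
ss_model = np.sum((y_hat - y_bar) ** 2)  # Model Sum of Squares
ss_error = np.sum((y - y_hat) ** 2)      # Error Sum of Squares
ss_total = np.sum((y - y_bar) ** 2)      # Corrected Total Sum of Squares
# The two forms of R^2 coincide when y_hat comes from an OLS fit with intercept.
print(ss_model / ss_total, 1 - ss_error / ss_total)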


Validation of Model – Coefficient of Determination (cont.)
• R² near zero indicates very little of the variability in yᵢ is explained by the linear relationship of Y with X.
• R² near 1 indicates almost all of the variability in yᵢ is explained by the linear relationship of Y with X.
• R² is known as the coefficient of determination or multiple R-Squared.
• Root Mean Squared Error (RMSE):

    RMSE = √MSE = √(SSE / S) = √(Σᵢ eᵢ² / S) = √(Σᵢ (yᵢ − ŷᵢ)² / S)

• Recall that in linear regression, the goal is to minimize SSE.
  • So a smaller value of RMSE, i.e., close to 0.0, is better.
  • A smaller RMSE indicates a model with better fit.

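A short sketch of the RMSE formula on the same assumed arrays:

import numpy as np

y = np.array([50.5, 77.0, 83.0, 84.5, 112.0])      # assumed actual values
y_hat = np.array([55.0, 74.2, 85.1, 86.0, 108.7])  # assumed fitted values
rmse = np.sqrt(np.sum((y - y_hat) ** 2) / len(y))  # sqrt(SSE / S)
print(rmse)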


Validation of Model – Coefficient of Determination (cont.)
• R-Square = Model SS / Corrected Total SS = 0.771.
  • 77.1% of the variance in weight can be explained by the simple linear model with height as the independent variable.
• Adjusted R-Square = 1 − (1 − R²)(m − 1)/(m − n − 1) = 1 − (1 − 0.771)(19 − 1)/(19 − 1 − 1) = 0.757.
• R-Square always increases when a new term is added to a model, but Adjusted R-Square increases only if the new term improves the model more than would be expected by chance.
• Root MSE = √MSE(Error) = 2.3906.
  • This is an estimate of the standard error of the residuals, σ.
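A quick arithmetic check of the Adjusted R-Square formula, using the numbers from the slide:

r2, m, n = 0.7705, 19, 1  # R^2, observations, explanatory variables
adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n - 1)
print(round(adj_r2, 3))  # 0.757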



Validation of Model – Coefficient of Determination (cont.)
• DF Model = 1: there is only one independent variable in this model.
• DF Corrected Total = S − 1 = 18, because

    Σᵢ₌₁¹⁹ (yᵢ − ȳ) = 0

  • Knowing 18 of the differences, we also know the value of the 19th.

Analysis of Variance
• F-value = Mean Square (Model) / Mean Square (Error) = 57.08.
• The F-statistic has n and m − n − 1 degrees of freedom.
• The corresponding p-value is < .0001, indicating that at least one of the independent variables is useful for predicting the dependent variable.
• In this case, there is only one independent variable: the value of height is useful for predicting the value of weight.
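A sketch checking the F-test's p-value with SciPy; the numbers are taken from the slide (F = 57.08 with 1 and 17 degrees of freedom):

from scipy import stats

f_value, df_model, df_error = 57.08, 1, 17
p_value = stats.f.sf(f_value, df_model, df_error)  # upper-tail probability
print(p_value)  # < 0.0001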



Validation of Model – Significance of Regression Coefficient
• Is the value of β significantly different from zero?
• Hypothesis test: H₀: β = 0 versus Hₐ: β ≠ 0.
• The t-value for the test is 0.1976/0.026 = 7.555, with a corresponding p-value of < 0.0001.
• Since the p-value is lower than 5%, we may reject H₀ (that β is 0) with 95% confidence.
• Note for simple linear regression models: the t-value of the β parameter is the square root of the F-value. In this example, 7.555 × 7.555 ≈ 57.08.
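A sketch verifying the t-statistic arithmetic with SciPy; the standard error 0.026 is read off the slide's regression output:

from scipy import stats

beta, se, df = 0.1976, 0.026, 17
t_value = beta / se
p_value = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value
print(t_value, p_value, t_value**2)  # t^2 approximately equals F = 57.08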
Validation of Model – Significance of Regression Coefficient (cont.)
• The area under the curve to the left of −7.555 and to the right of +7.555 is less than 0.0001.
• We reject the null hypothesis and conclude that the slope β is not 0, i.e., the variable height is useful for predicting the dependent variable weight.
Validation of Model – Coefficient of Linear Correlation
• In a simple linear regression model, the coefficient of determination equals the square of the coefficient of linear correlation between X and Y.
• In our example: X = Height; Y = Weight.
  • r = 0.877785
  • R² = 0.7705 = 0.877785 × 0.877785
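A sketch relating r and R² with NumPy; the height and weight arrays are illustrative assumptions:

import numpy as np

height = np.array([51.3, 56.3, 57.3, 62.5, 66.5, 72.0])   # assumed values
weight = np.array([50.5, 77.0, 83.0, 84.0, 112.0, 150.0])
r = np.corrcoef(height, weight)[0, 1]  # coefficient of linear correlation
print(r, r**2)  # r^2 equals R^2 in simple linear regression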



Assumptions of Linear Regression
• Linear regression has five key assumptions:
  • Linear relationship – The relationship between the independent and dependent variables is linear.
  • Homoscedasticity – The residuals have equal variance across the regression line.
  • No auto-correlation – The residuals must be independent of each other.
  • Multivariate normality – The residuals must be normally distributed.
  • No or little multicollinearity – The independent variables are not correlated with each other.
Evaluating the Assumptions of Linear Regression
• Linearity:
  • The relationship between the independent and dependent variables is linear.
  • The linearity assumption can best be tested with scatter plots.
  • Recall the scatter and line plot of the linear regression line that we created earlier.

[Figure: scatter plot with fitted line – the (xᵢ, yᵢ) values appear to be linear.]



Evaluating the Assumptions of Linear Regression (cont.)
• Homoscedasticity:
  • The residuals have equal variance across the regression line.
  • Scatter plots of residuals against predicted values are used to confirm this assumption.
  • Any pattern would violate this assumption and point toward a poorly fitting model.
  • See the sample script in src02, sketched below.

[Figure: residuals vs. predicted values – in the child's weight example, no regular pattern/trend is observed.]
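A hedged sketch of a residuals-vs-predicted scatter plot such as src02 might produce; the file and column names are assumptions carried over from the earlier sketch:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()

plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()  # no visible pattern suggests homoscedasticity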



Evaluating the Assumptions of Linear Regression (cont.)
• We can also check the normality of the residuals using a Q-Q plot.
• See the sample script in src03, sketched below.

[Figure: Q-Q plot of the residuals in the child's weight example – data points must fall (approximately) on a straight line for a normal distribution.]
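A sketch of the Q-Q plot as src03 might draw it, under the same assumptions:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()

sm.qqplot(results.resid, line="s")  # 's' adds a standardized reference line
plt.show()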



Evaluating the Assumptions of Linear Regression (cont.)
• Auto-correlation:
  • The residuals must be independent of each other.
  • The residuals are randomly distributed with no pattern in the scatter plot from src02.
  • We can also use the Durbin-Watson test to test the null hypothesis that the residuals are not linearly auto-correlated:
    • While d can assume values between 0 and 4, values around 2 indicate no autocorrelation.
    • As a rule of thumb, values of 1.5 < d < 2.5 show that there is no auto-correlation in the data.
  • StatsModels reports the Durbin-Watson d value, which is 2.643 in the child's weight example; a sketch follows.
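A sketch of the Durbin-Watson statistic via StatsModels, under the same assumptions:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()
print(durbin_watson(results.resid))  # ~2 (1.5 to 2.5) suggests no autocorrelation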
Evaluating the Assumptions of Linear Regression (cont.)
• Multivariate normality:
  • The residuals must be normally distributed.
  • We can perform a visual/graphical test to check for normality using a Q-Q plot and also a histogram.
  • See the sample script in src04 (similar to src03), sketched below.

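A sketch of the residual histogram as src04 might draw it, under the same assumptions:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()

plt.hist(results.resid, bins=10)  # roughly bell-shaped for normal residuals
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()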


Evaluating the Assumptions of Linear Regression (cont.)
• No or little multicollinearity:
  • The independent variables are not correlated with each other.
  • For simple linear regression, with only one independent variable, this is obviously not a problem.
  • We will revisit the multicollinearity assumption in the multiple linear regression model.
• Child's weight example:
  • We may conclude that the residuals are normal and independent.
  • The linear regression model fits the data well.



Confidence in the Linear Regression Model
• Linear regression is considered a low variance/high bias model:
  • Under repeated sampling, the line will stay roughly in the same place (low variance).
  • But the average of those models will not do a great job of capturing the true relationship (high bias).
  • Note that low variance is a useful characteristic when you do not have a lot of training data.
• A closely related concept is confidence intervals:
  • StatsModels calculates 95% confidence intervals for our model coefficients, as sketched below.
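A sketch of coefficient confidence intervals from StatsModels, under the same assumptions:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()
print(results.conf_int(alpha=0.05))  # 95% CIs for the intercept and slope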



Confidence in the Linear Regression Model (cont.)
• We can interpret the confidence intervals as follows:
  • If the population from which this sample was drawn were sampled 100 times, approximately 95 of the resulting confidence intervals would contain the "true" coefficient.



Confidence in the Linear Regression Model (cont.)
• We can compare the true relationship to the predictions by using StatsModels to calculate the confidence intervals of the predictions:
  • See the sample script in src05, sketched below.

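A hedged sketch of prediction intervals as src05 might compute them, under the same assumptions; the new Height values are illustrative:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("children.csv")  # assumed file name
results = sm.OLS(df["Weight"], sm.add_constant(df["Height"])).fit()

new_X = sm.add_constant(pd.DataFrame({"Height": [55.0, 60.0, 65.0]}))
pred = results.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05))  # mean CI and observation prediction interval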


Can We Assess the Accuracy of a Linear Regression Model?
• A linear regression model is intended to perform point predictions:
  • It is difficult to make an exact point prediction of the actual continuous numerical value.
  • Thus, it is not viable to assess accuracy in the conventional way.
• Other than the various measures of goodness, a model with a tight 95% confidence interval is preferred.
• But we can perform split validation to assess model overfitting:
  • The model evaluated on the testing data should return comparable values for the measures of goodness; see the sketch below.
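A sketch of split validation with Scikit-Learn, under the same data assumptions; comparable train and test scores suggest the model is not overfitting:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("children.csv")  # assumed file name
X, y = df[["Height"]], df["Weight"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)
    print(name, r2_score(ys, pred), mean_squared_error(ys, pred) ** 0.5)  # R^2, RMSE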

