
CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)


School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21



Summary of Facilitator’s Profile

Education

B.E., Electronics and Communication Engineering, Anna University, Tamil Nadu, India - April 2008.
M.Sc., Signal Processing, Nanyang Technological University (NTU), Singapore - May 2011.
Ph.D., Signal/Image Compression, NTU, Singapore - August 2016.

Experience

Post-doctoral experience: Research Fellow, NTU, Singapore - October 2016 to April 2018.
Teaching experience: Assistant Professor (Senior Grade), VIT, Chennai - June 2018 onwards.



Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Ethem Alpaydin, “Introduction to Machine Learning”, 3rd Edition, PHI Learning Private Limited, 2019.



Contents

1 Module 1: Regression Analysis



Module 1: Regression Analysis

Topics to be covered in Module-1

The Advertising Dataset and Problem Statement
Simple Linear Regression
Multiple Linear Regression
Model Estimation and Evaluation
Correlation
Time Series Forecasting
Autocorrelation
ANOVA - Analysis of Variance



The Advertising Dataset and Problem Statement

Figure 1: Sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.




The plot in Figure 1 displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets.
In each plot, a simple least squares fit of sales to that variable is shown. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales.
What information would be useful in order to provide such a recommendation?





A few questions that we might seek to address:

Is there a relationship between advertising budget and sales? If yes, how strong is that relationship?
Is the relationship linear?
How accurately can we estimate the effect of each medium on sales?
How accurately can we predict future sales?
Which media contribute to sales?
Which media generate the biggest boost in sales?
How much increase in sales is associated with a given increase in TV advertising?



Simple Linear Regression

Simple Linear Regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. Mathematically, this linear relationship can be expressed as

Y ≈ β0 + β1 X

where β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model.
For example, X may represent TV advertising and Y may represent sales. Then we can regress sales onto TV by fitting the model

sales ≈ β0 + β1 TV

Together, β0 and β1 are known as the model coefficients or parameters. We must use training data/samples to estimate these coefficients.
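As a concrete illustration (not from the slides), the sketch below fits this model in R with lm(); the CSV file name and column names are assumptions about how the Advertising data might be stored.

```r
# Minimal sketch: fitting sales ~ TV with lm().
# Assumes Advertising.csv (hypothetical file) has columns TV and sales.
ads <- read.csv("Advertising.csv")
fit <- lm(sales ~ TV, data = ads)
coef(fit)   # estimated intercept (beta0_hat) and slope (beta1_hat)
```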
Once we produce the estimates β̂0 and β̂1 using the training data, we can predict y given x:

ŷ ≈ β̂0 + β̂1 x.

Let ŷi ≈ β̂0 + β̂1 xi be the prediction for the i-th value of y based on the i-th value of x. Then

ei = yi − ŷi

represents the i-th residual: the difference between the i-th observed response value and the i-th predicted response value.
The residual sum of squares (RSS) is defined as

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2$$

where n is the number of predictions or simply, the number of samples in the training data.

A random pattern in the residual plot is an indication that a linear model provides a decent fit to the data.
The least squares approach chooses β̂0 and β̂1 to minimize the RSS. Using calculus, one can show that the minimizers are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where x̄ = (1/n) Σᵢ xᵢ and ȳ = (1/n) Σᵢ yᵢ are the sample means. These β̂0 and β̂1 are the least squares coefficient estimates for simple linear regression, and they give the best linear fit on the given training data.
Figure 2 shows the simple linear regression fit to the Advertising data,
where β̂0 = 7.03 and β̂1 = 0.0475.
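A minimal sketch of these closed-form estimates in R, on made-up training data (the x and y values below are illustrative, not the Advertising data):

```r
# Closed-form least squares estimates, computed directly.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

y_hat <- beta0_hat + beta1_hat * x   # fitted values
rss   <- sum((y - y_hat)^2)          # residual sum of squares

coef(lm(y ~ x))   # lm() returns the same estimates
```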

Figure 2: Simple linear regression fit to the Advertising data.



Question 1.1

Which of the following statements is true about linear regression regarding outliers?

(a) Linear regression is sensitive to outliers.
(b) Linear regression is not sensitive to outliers.
(c) The impact of outliers on linear regression depends upon the data.



Question 1.2
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, parameterized by (a, b). Using squared error as the loss function, which of the following parameters would you use to model this function?

(a) (4 3)
(b) (5 3)
(c) (5 1)
(d) (1 5)



Question 1.3

For the five training examples given in Question 1.2,

(i) Find the best linear fit.
(ii) Determine the minimum RSS.
(iii) Draw the residual plot for the best linear fit and comment on the suitability of the linear model to this training data.



Multiple Linear Regression
Although simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable, in practice more than one predictor variable will be available, and hence simple linear regression can be extended to multiple linear regression.
Continuing with the same sales prediction example, the advertising data also records the amounts spent on radio and newspaper advertising. Therefore, we can regress sales onto TV, radio and newspaper by fitting the model

sales ≈ β0 + β1 TV + β2 radio + β3 newspaper

where β0, β1, β2, and β3 are the model coefficients or parameters.
Predicting a quantitative response Y on the basis of multiple predictor variables X1, X2, ..., Xp can be expressed as

Y ≈ β0 + β1 X1 + β2 X2 + ... + βp Xp

where p is the number of distinct predictor variables.
Upon estimating β0, β1, ..., βp using training data/samples, we can predict y as follows:

ŷ ≈ β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp.

The regression model can be re-stated in matrix form as

X B = Y

where X = [1 X1 X2 ... Xp] and B = [β̂0 β̂1 β̂2 ... β̂p]ᵀ is the (column) vector of model coefficients to be estimated. Note that Y, X1, X2, ..., Xp are training samples of dimension n × 1.
As in the case of simple linear regression, the least squares approach can be used to determine the coefficients. The solution is given by

B = X† Y

where X† = (XᵀX)⁻¹ Xᵀ is the pseudo-inverse of X.
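A sketch of this matrix solution in R, on made-up data; lm() computes the same coefficients (internally via a QR decomposition rather than an explicit inverse):

```r
# Least squares via the pseudo-inverse: B = (X^T X)^{-1} X^T Y.
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 5)
Y  <- c(3.1, 3.9, 7.8, 8.2, 11.1)

X <- cbind(1, x1, x2)                  # design matrix [1 X1 X2]
B <- solve(t(X) %*% X) %*% t(X) %*% Y  # coefficient vector
B

coef(lm(Y ~ x1 + x2))   # same coefficients from lm()
```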
Question 1.4

When you perform multiple linear regression, which among the following are questions you will be interested in?

(a) Is at least one of the predictors useful in predicting the response?
(b) Do all the predictors help to explain Y, or is only a subset of the predictors useful?
(c) How well does the model fit the data?
(d) Given a set of predictor values, what response value should we predict, and how accurate is our prediction?



Model Estimation and Evaluation
Assume that the true relationship between X and Y takes the form Y = f(X) + ε for some unknown function f(X), where ε is a mean-zero random error term. If f(X) is to be approximated by a simple linear function, then this linear relationship can be expressed as

Y = β0 + β1 X + ε.

In the case of Y being a random variable, how accurate is the sample mean (µ̂) of Y as an estimate of its population mean (µ)? In general, this question is answered by computing the standard error of µ̂, expressed as SE(µ̂):

$$\mathrm{SE}(\hat{\mu}) = \sqrt{\mathrm{Var}(\hat{\mu})} = \frac{\sigma}{\sqrt{n}}$$

where n is the size of the training set and σ = √Var(ε) is the standard deviation of each of the realizations yi of Y.
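For instance, a short sketch estimating SE(µ̂) from a made-up sample:

```r
# SE of the sample mean: sigma / sqrt(n), with sigma estimated by sd().
set.seed(1)
y <- rnorm(100, mean = 10, sd = 2)   # made-up sample
sd(y) / sqrt(length(y))              # estimated SE(mu_hat)
```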
Assuming the errors εi for each observation are uncorrelated with common variance σ², the standard errors associated with β̂0 and β̂1 can be expressed as

$$\mathrm{SE}(\hat{\beta}_0) = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

and

$$\mathrm{SE}(\hat{\beta}_1) = \sigma\sqrt{\frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$

In general, σ = √Var(ε) is not known, but can be estimated from the data. This estimate is known as the residual standard error (RSE), and is expressed as

$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}.$$
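A sketch computing the RSE and these standard errors by hand in R (made-up data); the values should match the Std. Error column of summary(fit):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

n   <- length(x)
rse <- sqrt(sum(resid(fit)^2) / (n - 2))   # residual standard error

sxx   <- sum((x - mean(x))^2)
se_b0 <- rse * sqrt(1/n + mean(x)^2 / sxx) # SE(beta0_hat)
se_b1 <- rse * sqrt(1 / sxx)               # SE(beta1_hat)
c(se_b0, se_b1)   # compare with coef(summary(fit))[, "Std. Error"]
```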
Standard errors can be used to compute confidence intervals.
For simple linear regression, the 95% confidence interval for β0 approximately takes the form

β̂0 ± 2 SE(β̂0).

That is, there is approximately a 95% probability that the interval

[β̂0 − 2 SE(β̂0) , β̂0 + 2 SE(β̂0)]

will contain the true value of β0. Similarly, there is approximately a 95% probability that the interval

[β̂1 − 2 SE(β̂1) , β̂1 + 2 SE(β̂1)]

will contain the true value of β1.
The word ‘approximately’ is included mainly because: (i) the errors are assumed to be Gaussian; and (ii) the factor ‘2’ in front of the SE(·) terms varies slightly with ‘n’ in linear regression.
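A sketch contrasting the approximate ±2·SE intervals with R's exact t-based intervals from confint(), on the same made-up data:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

est <- coef(summary(fit))   # estimates and standard errors
cbind(lower = est[, "Estimate"] - 2 * est[, "Std. Error"],
      upper = est[, "Estimate"] + 2 * est[, "Std. Error"])

confint(fit, level = 0.95)  # exact intervals using t quantiles (d.o.f = n - 2)
```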

The RSE provides an absolute measure of lack of fit of the model to the data. A small RSE indicates that the model fits the data well whereas a large RSE indicates that the model doesn’t fit the data well. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
The R² statistic provides an alternative measure of fit. It takes the form of a proportion of variance, expressed as

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where TSS = Σᵢ(yᵢ − ȳ)² is the total sum of squares. Note that the R² statistic is independent of the scale of Y, and it always takes a value between 0 and 1.
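A sketch computing R² by hand and checking it against lm()'s summary (made-up data):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss            # R^2 from its definition
summary(fit)$r.squared   # same value reported by lm()
```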


TSS = Σᵢ(yᵢ − ȳ)² measures the total variance in the response variable Y, and can be interpreted as the amount of variability inherent in the response before the regression is performed.
TSS − RSS = Σᵢ{(yᵢ − ȳ)² − (yᵢ − ŷᵢ)²} measures the amount of variability in the response that is removed by performing the regression, and therefore R² measures the proportion of variability in Y that can be explained using X.
An R² statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number close to 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error σ² is high, or both.
The R² statistic is also a measure of the linear relationship between X and Y, and it is closely related to the correlation between X and Y.
Question 1.5
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, parameterized by (a, b).

(a) Find the best linear fit.
(b) Evaluate the standard errors associated with â and b̂.
(c) Determine the 95% confidence intervals for a and b.
(d) Compute the R² statistic.




Bias-Variance Tradeoff

Bias is the error resulting from simplifying assumptions made by the model to make the target function easier to approximate.
Variance is the amount that the estimate of the target function will change given different training data.
Underfitted models have high bias and low variance.
Overfitted models have low bias and high variance.
With an increase in model complexity, bias decreases and variance increases.



Correlation
When comparing two random variables, say x1 and x2, covariance Cov(x1, x2) is used to determine how much the two vary together, whereas correlation Corr(x1, x2) is used to determine whether a change in one variable is associated with a change in the other.
For multiple data points, the covariance matrix is given by

$$C = \frac{(X - m)(X - m)^T}{n}$$

where X = [x1 x2 ...] is the data matrix with n columns (each column is one data point) and m is the mean vector of the data points.
Correlation, a normalized version of the covariance, is expressed as

$$\mathrm{Corr}(x_1, x_2) = \frac{\mathrm{Cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}.$$
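A sketch in R on made-up data; note that R's cov() and cor() use the sample (n − 1) denominator rather than the n shown above:

```r
x1 <- c(2, 4, 6, 8, 10)
x2 <- c(1, 3, 2, 5, 4)

cov(x1, x2)                       # how the two vary together
cov(x1, x2) / (sd(x1) * sd(x2))   # normalized: the correlation
cor(x1, x2)                       # same value directly
```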




Both covariance and correlation measure linear relationships between variables. Examples: the relationship between height and weight of children, the relationship between speed and weight of cars, etc.
Since covariance is affected by a change in scale, it can take values between −∞ and ∞. However, the correlation coefficient always lies between -1 and 1, so it can be compared meaningfully across different pairs of variables.
When the correlation coefficient is positive, an increase in one variable is accompanied by an increase in the other. When the correlation coefficient is negative, an increase in one variable is accompanied by a decrease in the other (i.e. the change happens in the opposite direction). A zero correlation coefficient indicates there is no linear relationship between the two variables. Figure 3 shows these three types of relationship.



In some scenarios, the correlation measure may be misleading due to the existence of a spurious relationship (two variables have no real relationship, but one is wrongly inferred due to either coincidence or the presence of an unseen factor known as a confounding factor/lurking variable).

Figure 3: Four-quadrant scatterplots showing 3 types of relationship between 2 random variables. Source: https://acadgild.com/blog/covariance-and-correlation



Time Series Forecasting

Time series modeling deals with time-based data, where time can be measured in years, days, hours, minutes, etc.
Time series forecasting involves fitting a model on time-based data and using it to predict future observations.
Time series forecasting serves two purposes: understanding the pattern/trend in the time series data and forecasting/extrapolating its future values. The forecast package in R contains functions which serve these purposes.
In time series forecasting, the AutoRegressive Integrated Moving Average (ARIMA) model is fitted to the time series data either to better understand the data or to predict future points in the series.
Components of a time series are level, trend, seasonal, cyclical and noise/irregular (random) variations.



Figure 4 shows the forecast of 4 future values of the ’AirPassengers’ data using an ARIMA model (available in the forecast package).

Figure 4: Forecast from ARIMA(3,1,3) - ’AirPassengers’ data
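A sketch of how a forecast like Figure 4 could be produced with the forecast package; the exact call used for the figure is not shown in the slides, so the ARIMA(3,1,3) order below is taken from the caption:

```r
library(forecast)
fit <- Arima(AirPassengers, order = c(3, 1, 3))  # ARIMA(3,1,3), as in Figure 4
fc  <- forecast(fit, h = 4)                      # forecast 4 future values
plot(fc)
```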



Autocorrelation
As correlation measures the linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series variable. The term ’lag’ refers to ’time delay’.
Figure 5 shows the autocorrelation plot of the ’AirPassengers’ data obtained using the Acf() function (available in the forecast package).

Figure 5: ACF plot - ’AirPassengers’ data
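A sketch of how a plot like Figure 5 could be generated:

```r
library(forecast)
Acf(AirPassengers)   # autocorrelation of the series at increasing lags
```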


ANOVA - Analysis of Variance

Analysis of Variance (ANOVA) is a statistical technique for comparing the means of more than 2 sample groups and deciding whether they are drawn from the same population or not.
The hypothesis is stated as follows:

H0 : µ1 = µ2 = µ3 = ...
Ha : not all µj are equal (at least one group mean differs)

ANOVA also allows comparison of more than 2 populations.
Assumptions made:
(i) Samples are independent and randomly drawn from the respective populations,
(ii) Populations are normally distributed, and
(iii) Variances of the populations are equal.
Let X denote the data matrix consisting of samples from r groups such that each column corresponds to one group, X̄ denote the mean of all the entries in X, x̄j denote the mean of all entries in column-j, and nj denote the number of samples in column-j.
To establish a comparison between groups, three variances are considered: Sum-of-Squares-Total (SST), Sum-of-Squares-TReatments (SSTR) and Sum-of-Squares-Error (SSE):

$$\mathrm{SST} = \sum_{j}\sum_{i} (X_{i,j} - \bar{X})^2$$

$$\mathrm{SSTR} = \sum_{j} n_j (\bar{x}_j - \bar{X})^2$$

$$\mathrm{SSE} = \sum_{j}\sum_{i} (X_{i,j} - \bar{x}_j)^2.$$
SST gives the overall variance in the data, SSTR gives the part of the variation within the data due to differences among the groups, and SSE gives the part of the variation within the data due to error. Note that SST = SSTR + SSE.
The ANOVA F-statistic is defined as

$$F = \frac{\mathrm{MSTR}}{\mathrm{MSE}}$$

where MSTR = SSTR/d.o.f = SSTR/(r − 1) and MSE = SSE/d.o.f = SSE/(n − r). Note that n = Σⱼ nⱼ is the total number of samples.
If the F-statistic is greater than the critical value, then the null hypothesis is rejected. The critical value is obtained from the F-distribution table using parameters such as the significance level (α) and the degrees of freedom (d.o.f) of SSTR and SSE.
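A sketch computing SSTR, SSE and the F-statistic by hand on made-up groups, checked against R's aov(); the data are illustrative, not the Question 1.6 table:

```r
g1 <- c(5, 7, 6, 8); g2 <- c(9, 11, 10, 12); g3 <- c(4, 6, 5, 7)
x     <- c(g1, g2, g3)
group <- factor(rep(c("A", "B", "C"), each = 4))
grand <- mean(x)                                  # grand mean (X bar)

sstr <- sum(tapply(x, group, function(g) length(g) * (mean(g) - grand)^2))
sse  <- sum(tapply(x, group, function(g) sum((g - mean(g))^2)))
r <- nlevels(group); n <- length(x)
(sstr / (r - 1)) / (sse / (n - r))   # F = MSTR / MSE

summary(aov(x ~ group))              # same F from R's aov()
```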
Question 1.6
Assume there are 3 canteens in a college and the sales of an item in those canteens during the first week of February-2021 are as follows:

Table 1: Data for Question 1.6

Canteen A   Canteen B   Canteen C
40          30          50
60          30          60
70          10          30
30          70          20
50          60          20

Is there a significant difference between the mean sales of the item, at α = 0.05?
Module-1 Summary

The Advertising dataset example and problem statements
Simple Linear Regression and Multiple Linear Regression
Simple Linear Regression Model - Estimation and Evaluation
Correlation: measures the linear relationship between 2 variables
Time Series Forecasting: analysis and prediction of time-based data
Autocorrelation: measures the linear relationship between lagged values
ANOVA: compares more than 2 populations (uses the F-statistic)
