
CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)


School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21



Summary of Facilitator’s Profile

Education

B.E., Electronics and Communication Engineering, Anna University, Tamil Nadu, India - April 2008.
M.Sc., Signal Processing, Nanyang Technological University (NTU), Singapore - May 2011.
Ph.D., Signal/Image Compression, NTU, Singapore - August 2016.

Experience

Post-doctoral experience: Research Fellow, NTU, Singapore - October 2016 to April 2018.
Teaching experience: Assistant Professor (Senior Grade), VIT, Chennai - June 2018 onwards.



Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Ethem Alpaydin, “Introduction to Machine Learning”, 3rd Edition, PHI Learning Private Limited, 2019.



Contents

1 Module 1: Regression Analysis



Module 1: Regression Analysis

Topics to be covered in Module-1

The Advertising Dataset and Problem Statement
Simple Linear Regression
Multiple Linear Regression
Model Estimation and Evaluation
Correlation
Time Series Forecasting
Autocorrelation
ANOVA - Analysis of Variance



The Advertising Dataset and Problem Statement

Figure 1: Sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.




The plot in Figure 1 displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets.
In each plot, a simple least squares fit of sales to that variable is shown. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales.
What information would be useful in order to provide such a recommendation?





A few questions that we might seek to address:

Is there a relationship between advertising budget and sales? If yes, how strong is that relationship?
Is the relationship linear?
How accurately can we estimate the effect of each medium on sales?
How accurately can we predict future sales?
Which media contribute to sales?
Which media generate the biggest boost in sales?
How much increase in sales is associated with a given increase in TV advertising?



Simple Linear Regression

Simple Linear Regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. Mathematically, this linear relationship can be expressed as

Y ≈ β0 + β1 X

where β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model.
For example, X may represent TV advertising and Y may represent sales. Then we can regress sales onto TV by fitting the model

sales ≈ β0 + β1 TV

Together, β0 and β1 are known as the model coefficients or parameters. We must use training data/samples to estimate these coefficients.
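As a concrete illustration (not from the slides), the sketch below fits this model in R with lm(); the CSV file name and column names are assumptions about how the Advertising data might be stored.

```r
# Minimal sketch: fitting sales ~ TV with lm().
# Assumes Advertising.csv (hypothetical file) has columns TV and sales.
ads <- read.csv("Advertising.csv")
fit <- lm(sales ~ TV, data = ads)
coef(fit)   # estimated intercept (beta0_hat) and slope (beta1_hat)
```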
Once we produce the estimates β̂0 and β̂1 using the training data, we can predict y given x:

ŷ ≈ β̂0 + β̂1 x.

Let ŷi ≈ β̂0 + β̂1 xi be the prediction for the i-th value of y based on the i-th value of x. Then

ei = yi − ŷi

represents the i-th residual: the difference between the i-th observed response value and the i-th predicted response value.
The residual sum of squares (RSS) is defined as

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2$$

where n is the number of predictions or simply, the number of samples in the training data.

A random pattern in the residual plot is an indication that a linear model provides a decent fit to the data.
The least squares approach chooses β̂0 and β̂1 to minimize the RSS. Using calculus, one can show that the minimizers are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where x̄ = (1/n) Σᵢ xᵢ and ȳ = (1/n) Σᵢ yᵢ are the sample means. These β̂0 and β̂1 are the least squares coefficient estimates for simple linear regression, and they give the best linear fit on the given training data.
Figure 2 shows the simple linear regression fit to the Advertising data,
where β̂0 = 7.03 and β̂1 = 0.0475.
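A minimal sketch of these closed-form estimates in R, on made-up training data (the x and y values below are illustrative, not the Advertising data):

```r
# Closed-form least squares estimates, computed directly.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

y_hat <- beta0_hat + beta1_hat * x   # fitted values
rss   <- sum((y - y_hat)^2)          # residual sum of squares

coef(lm(y ~ x))   # lm() returns the same estimates
```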

Figure 2: Simple linear regression fit to the Advertising data.



Question 1.1

Which of the following statements is true about linear regression regarding outliers?

(a) Linear regression is sensitive to outliers.
(b) Linear regression is not sensitive to outliers.
(c) The impact of outliers on linear regression depends upon the data.



Question 1.2
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, parameterized by (a, b). Using squared error as the loss function, which of the following parameters would you use to model this function?

(a) (4 3)
(b) (5 3)
(c) (5 1)
(d) (1 5)



Question 1.3

For the five training examples given in Question 1.2,

(i) Find the best linear fit.
(ii) Determine the minimum RSS.
(iii) Draw the residual plot for the best linear fit and comment on the suitability of the linear model to this training data.



Multiple Linear Regression
Although simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable, in practice more than one predictor variable will be available, and hence simple linear regression can be extended to multiple linear regression.
Continuing with the same sales prediction example, the advertising data also records the amounts spent on radio and newspaper advertising. Therefore, we can regress sales onto TV, radio and newspaper by fitting the model

sales ≈ β0 + β1 TV + β2 radio + β3 newspaper

where β0, β1, β2, and β3 are the model coefficients or parameters.
Predicting a quantitative response Y on the basis of multiple predictor variables X1, X2, ..., Xp can be expressed as

Y ≈ β0 + β1 X1 + β2 X2 + ... + βp Xp

where p is the number of distinct predictor variables.
Upon estimating β0, β1, ..., βp using training data/samples, we can predict y as follows:

ŷ ≈ β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp.

The regression model can be re-stated in matrix form as

X B = Y

where X = [1 X1 X2 ... Xp] and B = [β̂0 β̂1 β̂2 ... β̂p]ᵀ is the (column) vector of model coefficients to be estimated. Note that Y, X1, X2, ..., Xp are training samples of dimension n × 1.
As in the case of simple linear regression, the least squares approach can be used to determine the coefficients. The solution is given by

B = X† Y

where X† = (XᵀX)⁻¹ Xᵀ is the pseudo-inverse of X.
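A sketch of this matrix solution in R, on made-up data; lm() computes the same coefficients (internally via a QR decomposition rather than an explicit inverse):

```r
# Least squares via the pseudo-inverse: B = (X^T X)^{-1} X^T Y.
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 5)
Y  <- c(3.1, 3.9, 7.8, 8.2, 11.1)

X <- cbind(1, x1, x2)                  # design matrix [1 X1 X2]
B <- solve(t(X) %*% X) %*% t(X) %*% Y  # coefficient vector
B

coef(lm(Y ~ x1 + x2))   # same coefficients from lm()
```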
Question 1.4

When you perform multiple linear regression, which among the following are questions you will be interested in?

(a) Is at least one of the predictors useful in predicting the response?
(b) Do all the predictors help to explain Y, or is only a subset of the predictors useful?
(c) How well does the model fit the data?
(d) Given a set of predictor values, what response value should we predict, and how accurate is our prediction?



Model Estimation and Evaluation
Assume that the true relationship between X and Y takes the form Y = f(X) + ε for some unknown function f(X), where ε is a mean-zero random error term. If f(X) is to be approximated by a simple linear function, then this linear relationship can be expressed as

Y = β0 + β1 X + ε.

In the case of Y being a random variable, how accurate is the sample mean (µ̂) of Y as an estimate of its population mean (µ)? In general, this question is answered by computing the standard error of µ̂, expressed as SE(µ̂):

$$\mathrm{SE}(\hat{\mu}) = \sqrt{\mathrm{Var}(\hat{\mu})} = \frac{\sigma}{\sqrt{n}}$$

where n is the size of the training set and σ = √Var(ε) is the standard deviation of each of the realizations yi of Y.
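For instance, a short sketch estimating SE(µ̂) from a made-up sample:

```r
# SE of the sample mean: sigma / sqrt(n), with sigma estimated by sd().
set.seed(1)
y <- rnorm(100, mean = 10, sd = 2)   # made-up sample
sd(y) / sqrt(length(y))              # estimated SE(mu_hat)
```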
Assuming the errors εi for each observation are uncorrelated with common variance σ², the standard errors associated with β̂0 and β̂1 can be expressed as

$$\mathrm{SE}(\hat{\beta}_0) = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

and

$$\mathrm{SE}(\hat{\beta}_1) = \sigma\sqrt{\frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$

In general, σ = √Var(ε) is not known, but can be estimated from the data. This estimate is known as the residual standard error (RSE), and is expressed as

$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}.$$
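A sketch computing the RSE and these standard errors by hand in R (made-up data); the values should match the Std. Error column of summary(fit):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

n   <- length(x)
rse <- sqrt(sum(resid(fit)^2) / (n - 2))   # residual standard error

sxx   <- sum((x - mean(x))^2)
se_b0 <- rse * sqrt(1/n + mean(x)^2 / sxx) # SE(beta0_hat)
se_b1 <- rse * sqrt(1 / sxx)               # SE(beta1_hat)
c(se_b0, se_b1)   # compare with coef(summary(fit))[, "Std. Error"]
```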
Standard errors can be used to compute confidence intervals.
For simple linear regression, the 95% confidence interval for β0 approximately takes the form

β̂0 ± 2 SE(β̂0).

That is, there is approximately a 95% probability that the interval

[β̂0 − 2 SE(β̂0) , β̂0 + 2 SE(β̂0)]

will contain the true value of β0. Similarly, there is approximately a 95% probability that the interval

[β̂1 − 2 SE(β̂1) , β̂1 + 2 SE(β̂1)]

will contain the true value of β1.
The word ‘approximately’ is included mainly because: (i) the errors are assumed to be Gaussian; and (ii) the factor ‘2’ in front of the SE(·) terms varies slightly with ‘n’ in linear regression.
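A sketch contrasting the approximate ±2·SE intervals with R's exact t-based intervals from confint(), on the same made-up data:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

est <- coef(summary(fit))   # estimates and standard errors
cbind(lower = est[, "Estimate"] - 2 * est[, "Std. Error"],
      upper = est[, "Estimate"] + 2 * est[, "Std. Error"])

confint(fit, level = 0.95)  # exact intervals using t quantiles (d.o.f = n - 2)
```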

The RSE provides an absolute measure of lack of fit of the model to the data. A small RSE indicates that the model fits the data well whereas a large RSE indicates that the model doesn’t fit the data well. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
The R² statistic provides an alternative measure of fit. It takes the form of a proportion of variance, expressed as

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where TSS = Σᵢ(yᵢ − ȳ)² is the total sum of squares. Note that the R² statistic is independent of the scale of Y, and it always takes a value between 0 and 1.
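A sketch computing R² by hand and checking it against lm()'s summary (made-up data):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss            # R^2 from its definition
summary(fit)$r.squared   # same value reported by lm()
```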


TSS = Σᵢ(yᵢ − ȳ)² measures the total variance in the response variable Y, and can be interpreted as the amount of variability inherent in the response before the regression is performed.
TSS − RSS = Σᵢ{(yᵢ − ȳ)² − (yᵢ − ŷᵢ)²} measures the amount of variability in the response that is removed by performing the regression, and therefore R² measures the proportion of variability in Y that can be explained using X.
An R² statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number close to 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error σ² is high, or both.
The R² statistic is also a measure of the linear relationship between X and Y, and it is closely related to the correlation between X and Y.
Question 1.5
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, parameterized by (a, b).

(a) Find the best linear fit.
(b) Evaluate the standard errors associated with â and b̂.
(c) Determine the 95% confidence intervals for a and b.
(d) Compute the R² statistic.




Bias-Variance Tradeoff

Bias is the error resulting from simplifying assumptions made by the model to make the target function easier to approximate.
Variance is the amount that the estimate of the target function will change given different training data.
Underfitted models have high bias and low variance.
Overfitted models have low bias and high variance.
With an increase in model complexity, bias decreases and variance increases.



Correlation
When comparing two random variables, say x1 and x2, covariance Cov(x1, x2) is used to determine how much the two vary together, whereas correlation Corr(x1, x2) is used to determine whether a change in one variable is associated with a change in the other.
For multiple data points, the covariance matrix is given by

$$C = \frac{(X - m)(X - m)^T}{n}$$

where X = [x1 x2 ...] is the data matrix with n columns (each column is one data point) and m is the mean vector of the data points.
Correlation, a normalized version of the covariance, is expressed as

$$\mathrm{Corr}(x_1, x_2) = \frac{\mathrm{Cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}.$$
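A sketch in R on made-up data; note that R's cov() and cor() use the sample (n − 1) denominator rather than the n shown above:

```r
x1 <- c(2, 4, 6, 8, 10)
x2 <- c(1, 3, 2, 5, 4)

cov(x1, x2)                       # how the two vary together
cov(x1, x2) / (sd(x1) * sd(x2))   # normalized: the correlation
cor(x1, x2)                       # same value directly
```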




Both covariance and correlation measure linear relationships between variables. Examples: the relationship between height and weight of children, the relationship between speed and weight of cars, etc.
Since covariance is affected by a change in scale, it can take values between −∞ and ∞. However, the correlation coefficient always lies between -1 and 1, so it can be compared meaningfully across different pairs of variables.
When the correlation coefficient is positive, an increase in one variable is accompanied by an increase in the other. When the correlation coefficient is negative, an increase in one variable is accompanied by a decrease in the other (i.e. the change happens in the opposite direction). A zero correlation coefficient indicates there is no linear relationship between the two variables. Figure 3 shows these three types of relationship.



In some scenarios, the correlation measure may be misleading due to the existence of a spurious relationship (two variables have no real relationship, but one is wrongly inferred due to either coincidence or the presence of an unseen factor known as a confounding factor/lurking variable).

Figure 3: Four-quadrant scatterplots showing 3 types of relationship between 2 random variables. Source: https://acadgild.com/blog/covariance-and-correlation



Time Series Forecasting

Time series modeling deals with time-based data, where time can be measured in years, days, hours, minutes, etc.
Time series forecasting involves fitting a model on time-based data and using it to predict future observations.
Time series forecasting serves two purposes: understanding the pattern/trend in the time series data and forecasting/extrapolating its future values. The forecast package in R contains functions which serve these purposes.
In time series forecasting, the AutoRegressive Integrated Moving Average (ARIMA) model is fitted to the time series data either to better understand the data or to predict future points in the series.
Components of a time series are level, trend, seasonal, cyclical and noise/irregular (random) variations.



Figure 4 shows the forecast of 4 future values of the ’AirPassengers’ data using an ARIMA model (available in the forecast package).

Figure 4: Forecast from ARIMA(3,1,3) - ’AirPassengers’ data
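A sketch of how a forecast like Figure 4 could be produced with the forecast package; the exact call used for the figure is not shown in the slides, so the ARIMA(3,1,3) order below is taken from the caption:

```r
library(forecast)
fit <- Arima(AirPassengers, order = c(3, 1, 3))  # ARIMA(3,1,3), as in Figure 4
fc  <- forecast(fit, h = 4)                      # forecast 4 future values
plot(fc)
```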



Autocorrelation
As correlation measures the linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series variable. The term ’lag’ refers to ’time delay’.
Figure 5 shows the autocorrelation plot of the ’AirPassengers’ data obtained using the Acf() function (available in the forecast package).

Figure 5: ACF plot - ’AirPassengers’ data
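A sketch of how a plot like Figure 5 could be generated:

```r
library(forecast)
Acf(AirPassengers)   # autocorrelation of the series at increasing lags
```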


ANOVA - Analysis of Variance

Analysis of Variance (ANOVA) is a statistical technique for comparing the means of more than 2 sample groups and deciding whether they are drawn from the same population or not.
The hypothesis is stated as follows:

H0 : µ1 = µ2 = µ3 = ...
Ha : not all µj are equal (at least one group mean differs)

ANOVA also allows comparison of more than 2 populations.
Assumptions made:
(i) Samples are independent and randomly drawn from the respective populations,
(ii) Populations are normally distributed, and
(iii) Variances of the populations are equal.
Let X denote the data matrix consisting of samples from r groups such that each column corresponds to one group, X̄ denote the mean of all the entries in X, x̄j denote the mean of all entries in column-j, and nj denote the number of samples in column-j.
To establish a comparison between groups, three variances are considered: Sum-of-Squares-Total (SST), Sum-of-Squares-TReatments (SSTR) and Sum-of-Squares-Error (SSE):

$$\mathrm{SST} = \sum_{j}\sum_{i} (X_{i,j} - \bar{X})^2$$

$$\mathrm{SSTR} = \sum_{j} n_j (\bar{x}_j - \bar{X})^2$$

$$\mathrm{SSE} = \sum_{j}\sum_{i} (X_{i,j} - \bar{x}_j)^2.$$
SST gives the overall variance in the data, SSTR gives the part of the variation within the data due to differences among the groups, and SSE gives the part of the variation within the data due to error. Note that SST = SSTR + SSE.
The ANOVA F-statistic is defined as

$$F = \frac{\mathrm{MSTR}}{\mathrm{MSE}}$$

where MSTR = SSTR/d.o.f = SSTR/(r − 1) and MSE = SSE/d.o.f = SSE/(n − r). Note that n = Σⱼ nⱼ is the total number of samples.
If the F-statistic is greater than the critical value, then the null hypothesis is rejected. The critical value is obtained from the F-distribution table using parameters such as the significance level (α) and the degrees of freedom (d.o.f) of SSTR and SSE.
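A sketch computing SSTR, SSE and the F-statistic by hand on made-up groups, checked against R's aov(); the data are illustrative, not the Question 1.6 table:

```r
g1 <- c(5, 7, 6, 8); g2 <- c(9, 11, 10, 12); g3 <- c(4, 6, 5, 7)
x     <- c(g1, g2, g3)
group <- factor(rep(c("A", "B", "C"), each = 4))
grand <- mean(x)                                  # grand mean (X bar)

sstr <- sum(tapply(x, group, function(g) length(g) * (mean(g) - grand)^2))
sse  <- sum(tapply(x, group, function(g) sum((g - mean(g))^2)))
r <- nlevels(group); n <- length(x)
(sstr / (r - 1)) / (sse / (n - r))   # F = MSTR / MSE

summary(aov(x ~ group))              # same F from R's aov()
```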
Question 1.6
Assume there are 3 canteens in a college and the sales of an item in those canteens during the first week of February-2021 are as follows:

Table 1: Data for Question 1.6

Canteen A   Canteen B   Canteen C
40          30          50
60          30          60
70          10          30
30          70          20
50          60          20

Is there a significant difference between the mean sales of the item, at α = 0.05?
Module-1 Summary

The Advertising dataset example and problem statements
Simple Linear Regression and Multiple Linear Regression
Simple Linear Regression Model - Estimation and Evaluation
Correlation: measures the linear relationship between 2 variables
Time Series Forecasting: analysis and prediction of time-based data
Autocorrelation: measures the linear relationship between lagged values
ANOVA: compares more than 2 populations (uses the F-statistic)
