CSE3506 PPT Ref1
(2 0 2 4 4)
Module-1: Regression Analysis
Linear regression: simple linear regression - Regression Modelling -
Correlation, ANOVA, Forecasting, Autocorrelation
Data Analytics – What?
• Science of analyzing raw data in order to draw conclusions about that information [Investopedia]
• Predictive analytics focuses on what is likely to happen in the near term.
– What happened to sales the last time we had a hot summer?
– How many weather models predict a hot summer this year?
• Prescriptive analytics suggests a course of action.
– If the likelihood of a hot summer, averaged over say five weather models, is above 58%, we should add an evening shift to increase output.
What is Machine Learning?
• Large volumes of data demand automated methods of data analysis, which is what machine learning provides.
• Types of machine learning include Supervised, Unsupervised, and Reinforcement Learning.
Statistical learning refers to
• drawing inferences from a vast data set using a set of tools
• These tools can be classified as
– Supervised
– Unsupervised
Supervised Learning
Unsupervised Learning
Supervised Learning
• The goal is to learn a mapping from inputs x to outputs y, given a labeled set
of input-output pairs.
D = {(x_i, y_i)}, i = 1, …, N
• Here, D is called the training set, and N is the number of training examples.
• (x_i, y_i) is the i-th training sample
Regression Problem
Unsupervised Learning
• Here we are only given inputs: D = {x_i}, i = 1, …, N
• Sometimes called "Knowledge Discovery"
• Clustering algorithms come in this category, where data are grouped based on similarities.
Clustering Problem – Grouping people based on weight and height
• Supervised statistical learning:
Involves building a statistical model for predicting, or estimating, an output based on one or more inputs
Examples: problems arising in business, medicine, astrophysics, and public policy
• Unsupervised statistical learning:
There are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data
Example: an input dataset containing images of different types of cats and dogs
Process
Statistical Learning – Advertising Data
The Advertising data set consists of the
sales of a product in 200 different cities,
along with advertising budgets for three
different media: TV, Radio, and
newspaper
Goal is to develop an accurate model that
can be used to predict sales on the basis of
the three media budgets
• Input Variables: Advertising budgets
• Input Variables are denoted by X
• X1 – TV budget
• X2 – Radio budget
• X3 – Newspaper budget
• Input variables are called by
different names like
• Predictors
• Independent variables
• Features
• Variables
• Output Variable: Sales
• Output Variables are denoted by Y
• Output variables are called by
different names like
• Responses,
• Dependent variables
There is some relationship between Y and X = (X1, X2, …, Xp).
The general form of the relationship is
Y = f(X) + ε
where
f is some fixed but unknown function of X1, …, Xp, and
ε is a random error term, which is independent of X and has mean zero
• The black lines represent the error associated with each observation.
• Here some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve).
• Overall, these errors have approximately mean zero.
(Figure: observed values of income and years of education for 30 individuals)
Statistical Learning refers to a set of approaches for estimating f in the equation
Y = f(X) + ε
Reasons to estimate f:
• Prediction
• Inference
• Linear Regression is a statistical procedure that determines the equation of the straight line that best fits a specific set of data.
• It is a very simple approach for supervised learning.
• In particular, it is a useful tool for predicting a quantitative response.
On the basis of the given advertising data,
• A marketing plan for next year can be made.
• To develop the marketing plan, some information is required:
– Is there a relationship between advertising budget and sales?
– Is the relationship linear?
• Predicting sales with a high level of accuracy requires a strong relationship.
• If the media interact in their effect on sales, in marketing this is known as a synergy effect, while in statistics it is called an interaction effect.
The important questions are
• Which media contribute more to sales?
• Do all three contribute to sales, or do just one or two?
• What is the individual effect of each medium for the money spent?
• For every dollar spent on advertising on TV, Radio, or Newspaper, by what amount will sales increase?
• How accurately can we predict this amount of increase?
Simple Linear Regression
Estimating the coefficients β0 and β1
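For reference, the least-squares estimates that minimize the residual sum of squares (RSS) are the standard ones:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
```

where x̄ and ȳ are the sample means of the inputs and outputs.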
Question-1:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a) Find the best linear fit
(b) Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.
Solution:
(a) To find the best fit, calculate the model coefficients using the formula
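As a check on the hand calculation, a minimal sketch in Python (the function and variable names are my own; the data are from the question):

```python
def simple_linear_fit(x, y):
    """Return (b0, b1) minimizing the residual sum of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = y_mean - b1 * x_mean      # intercept
    return b0, b1

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]

b0, b1 = simple_linear_fit(X, Y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(X, Y)]
rss = sum(e ** 2 for e in residuals)
# b1 ≈ 4.9029, b0 ≈ 3.2732, minimum RSS ≈ 0.856
```

The residuals can then be plotted against X for part (c); here they are small and show no systematic pattern, so a linear model is suitable.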
Question-2:
Consider the following five training examples
Find the best linear fit
Example
9/27/2022 CSE3505-FDA 40
Linear Regression Model (with Normally Distributed Errors)
• In most linear regression analyses, it is common to assume that the error term
is a normally distributed random variable with mean equal to zero and
constant variance.
Mean Absolute Percentage Error (MAPE)
The closer the MAPE value is to zero, the better the predictions. A MAPE of less than 5% is considered an indication that the forecast is acceptably accurate.
Mean Squared Error (MSE)
The lower the value the better; 0 means the model is perfect.
Root Mean Squared Error (RMSE)
The square root of the MSE, expressed in the same units as the response; lower values indicate a better fit. When RMSE is normalized to a 0–1 scale, values closer to 0 represent better-fitting models, and as a rule of thumb normalized values between 0.2 and 0.5 suggest the model can predict the data relatively accurately.
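The three metrics above can be sketched as follows (a minimal illustration; the `actual`/`predicted` values are made up for demonstration):

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100 / len(actual) * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error (same units as the response)."""
    return math.sqrt(mse(actual, predicted))

# Illustrative values only
actual = [100.0, 110.0, 120.0]
predicted = [98.0, 112.0, 121.0]
# mape ≈ 1.55 %, mse = 3.0, rmse = √3 ≈ 1.73
```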
RSS and TSS
RSE and Standard Error
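For reference, these quantities have the standard definitions (the RSE form shown is for simple linear regression, which estimates two parameters):

```latex
\mathrm{RSS} = \sum_{i=1}^{n}\bigl(y_i-\hat{y}_i\bigr)^2,
\qquad
\mathrm{TSS} = \sum_{i=1}^{n}\bigl(y_i-\bar{y}\bigr)^2,
\qquad
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}
```

where ŷᵢ is the fitted value for the i-th observation and ȳ is the sample mean of the response.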
R-Squared
• The RSE provides an absolute measure of the lack of fit of the model to the data. A small RSE indicates that the model fits the data well, whereas a large RSE indicates that it does not. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
• The R² statistic provides an alternative measure of fit. It takes the form of a proportion of variance explained.
Adjusted R-squared can be calculated mathematically in terms of sums of squares. The only difference between the R-squared and Adjusted R-squared equations is the degrees of freedom.
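In terms of sums of squares, with n observations and k predictors, the standard forms are:

```latex
R^2 = \frac{\mathrm{TSS}-\mathrm{RSS}}{\mathrm{TSS}} = 1-\frac{\mathrm{RSS}}{\mathrm{TSS}},
\qquad
R^2_{\text{adj}} = 1-\frac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}
```

Because adjusted R² divides each sum of squares by its degrees of freedom, it penalizes adding predictors that do not improve the fit.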
F-Statistic
• The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
• The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
• A big F with a small p-value means that the null hypothesis is discredited, and we would assert that there is a general relationship between the response and predictors (while a small F with a big p-value indicates that there is no relationship).
Statistical hypotheses
• Null hypothesis (H0): the coefficients are equal to zero (i.e., no
relationship between x and y)
• Alternative Hypothesis (Ha): the coefficients are not equal to zero
(i.e., there is some relationship between x and y)
Solution:
To find the R² value, first find TSS:
TSS = 241.2391
R² = 1 − RSS/TSS = 0.9965
F = MSM / MSE = (explained variance) / (unexplained variance)
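These figures can be verified numerically (a sketch; the coefficients are the Question-1 estimates, and the variable names are my own):

```python
# Checking TSS, R² and F for the Question-1 fit.
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
b0, b1 = 3.2732, 4.9029          # coefficients of the best linear fit

n = len(Y)
y_mean = sum(Y) / n
tss = sum((y - y_mean) ** 2 for y in Y)                    # total sum of squares
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))  # residual sum of squares

r_squared = 1 - rss / tss        # ≈ 0.9965
msm = (tss - rss) / 1            # model mean square (one predictor)
mse = rss / (n - 2)              # error mean square
F = msm / mse                    # very large F: strong evidence of a relationship
```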
Regression Terminology
• The response variable (Y) is assumed to be related to the
k predictors (X1, X2, . . . , Xk ) by a linear equation called
the population regression model:
y = β0 + β1x1 + β2x2 + … + βkxk + ε
• A random error ε represents everything that is not part of
the model. The unknown regression coefficients β0, β1, β2,
. . . , βk are parameters and are denoted by Greek letters
Regression Terminology
• The sample estimates of the regression coefficients are
denoted by Roman letters b0, b1, b2, …, bk. The predicted value of the response variable is denoted ŷ and is calculated by inserting the values of the predictors into the estimated regression equation:
ŷ = b0 + b1x1 + b2x2 + … + bkxk (predicted value of Y)
Data Format
• To obtain a fitted regression, we need n observed values
of the response variable Y and its proposed predictors X1,
X2, . . . , Xk . A multivariate data set is a single column of
Y-values and k columns of X-values.
Example
https://www.statology.org/multiple-linear-regression-
Sample Data
• Illustration: Home Prices
• The table shows sales of 30 new homes in an upscale development. Although the selling price of a home (the response variable) may depend on many factors, we will examine three potential explanatory variables.
• Y = selling price of a home, thousands of dollars (short name: Price)
• X1 = home size, square feet (short name: SqFt)
• X2 = lot size, thousand square feet (short name: LotSize)
• X3 = number of bathrooms (short name: Baths)
Sample Data
Home SqFt(X1) LotSize(X2) Baths(X3) Price(Y)
1 2192 16.4 2.5 505.5
2 3429 24.7 3.5 784.1
3 2842 17.7 3.5 649
4 2987 20.3 3.5 689.8
5 3029 22.2 3 709.8
6 2616 20.8 2.5 590.2
7 2978 17.3 3 643.3
8 3595 22.4 3.5 789.7
9 2838 27.4 3 683
10 2591 19.2 2 544.3
11 3633 26.9 4 822.8
12 2822 23.1 3 637.7
13 2994 20.4 3 618.7
14 2696 22.7 3.5 619.3
15 2134 13.4 2.5 490.5
16 3076 19.8 3 675.1
17 3259 20.8 3.5 710.4
18 3162 19.4 4 674.7
19 2885 23.2 3 663.6
20 2550 20.2 3 606.6
21 3380 19.6 4.5 758.9
22 3131 22.5 3.5 723.3
23 2754 19.2 2.5 621.8
24 2710 21.6 3 622.4
25 2616 20.8 2.5 631.3
26 2608 17.3 3.5 574
27 3572 29 4 863.8
28 2924 21.8 2.5 652.7
29 3614 25.5 3.5 844.2
30 2600 24.1 3.5 629.9
Estimation of Regression
• Intercept = −28.85
• Slope SqFt = 0.171
• Slope LotSize = 6.78
• Slope Baths = 15.53
• Fitted equation: Price = −28.85 + 0.171 SqFt + 6.78 LotSize + 15.53 Baths
Predictions from a Fitted Regression
Example:
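A prediction from the fitted home-price model can be sketched as follows (the function name is illustrative; the coefficients are from the estimation slide):

```python
# Coefficients of the fitted multiple regression (from the slide)
b0, b_sqft, b_lot, b_baths = -28.85, 0.171, 6.78, 15.53

def predict_price(sqft, lotsize, baths):
    """y-hat = b0 + b1*SqFt + b2*LotSize + b3*Baths, in thousands of dollars."""
    return b0 + b_sqft * sqft + b_lot * lotsize + b_baths * baths

# Home 1 from the sample data: 2192 sq ft, 16.4 thousand sq ft lot, 2.5 baths
y_hat = predict_price(2192, 16.4, 2.5)   # ≈ 496.0, vs. the observed 505.5
```

The difference between the observed price (505.5) and the prediction (about 496.0) is the residual for that home.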
What is correlation?
• What is the pattern of the data points that you observe?
– Line, curve or no pattern at all
• If r is close to 1, the variables are positively correlated: there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to −1, the variables are negatively correlated: there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, the variables are not correlated: there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
What is a correlation matrix?
A correlation matrix is a table of correlation coefficients for a set of variables. The rows and columns are labeled with the variable names, and each cell shows the correlation coefficient between the corresponding pair.
Question to Ponder
• Do you recollect any other bivariate analysis that measures
the linear relationship between two variables?
Covariance Vs. Correlation
Covariance:
• Provides the direction of the linear relationship between two variables
• Has no upper or lower bound
• Not standardized
Correlation:
• Provides both the direction and the strength of the linear relationship between two variables
• Ranges between −1 and +1
• Standardized
Covariance Formula
cov(X, Y) = Σ(X − MX)(Y − MY) / (n − 1)

X Y X−MX Y−MY (X−MX)(Y−MY)
2 12 −2 −10.4 20.8
3 17 −1 −5.4 5.4
4 23 0 0.6 0
5 28 1 5.6 5.6
6 32 2 9.6 19.2
Means: MX = 4, MY = 22.4; Σ(X−MX)(Y−MY) = 51
cov(X, Y) = 51 / 4 = 12.75
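The same computation as the table, as a small sketch (the function name is my own):

```python
def sample_covariance(x, y):
    """cov(X, Y) = sum((X - MX)(Y - MY)) / (n - 1)"""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

X = [2, 3, 4, 5, 6]
Y = [12, 17, 23, 28, 32]
cov_xy = sample_covariance(X, Y)   # 51 / 4 = 12.75
```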
• A correlation is a relationship between two variables.
• Is there a relationship between the number of employee training
hours and the number of jobs produced?
• Is there a relationship between the number of hours a student spends
studying for a Mathematics test and the student’s score on that test?
• Let x be the independent variable and y the dependent variable. Data are represented by a collection of ordered pairs (x, y).
• Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient.
Correlation Analysis
• Pearson correlation (r)
– measures a linear dependence between two variables (x and y).
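In symbols, the Pearson correlation coefficient takes the standard form:

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```

Equivalently, r is the covariance of X and Y divided by the product of their standard deviations, which is why it is bounded between −1 and +1.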
Question:
The time x in years that an employee spent at a company and the
employee’s hourly pay, y, for 5 employees are listed in the table below.
Calculate and interpret the correlation coefficient r
• Interpret this result: There is a strong positive correlation between the number of years an employee has worked and the employee's salary, since r is very close to 1.
In R, the correlation coefficient can be computed using the functions cor() or cor.test():
• cor() computes the correlation coefficient
If your data contain missing values, use the following R code to
handle missing values by case-wise deletion.
cor(x, y, method = "pearson", use = "complete.obs")
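For comparison with R's cor(), the same coefficient computed from scratch in Python (illustrative; the data are the Question-1 training examples from earlier):

```python
import math

def pearson_r(x, y):
    """r = Sxy / sqrt(Sxx * Syy), the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
r = pearson_r(X, Y)   # ≈ 0.9982; note r² ≈ 0.9965, the R² found earlier
```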
ANOVA
One-way ANOVA test
• ANOVA – Analysis of Variance
• One-way ANOVA, also known as one-factor ANOVA, is a test for comparing the means of more than two groups.
• ANOVA is a hypothesis-testing procedure that is used to evaluate differences between two or more samples.
• It checks whether the means of two or more sample groups are statistically different or not.
• ANOVA test hypotheses:
– Null hypothesis: the means of the different groups are the same
– Alternative hypothesis: at least one sample mean is not equal to the others
Assumptions of ANOVA test
• The observations are obtained independently and randomly
from the population defined by the factor levels
• The data of each factor level are normally distributed.
• These normal populations have a common variance. (Levene’s
test can be used to check this.)
How does it work?
• Assume that we have 3 groups (A, B, C) to compare:
– Compute the common variance, which is called the variance within samples (S²within) or residual variance.
– Compute the variance between sample means as follows:
• Compute the mean of each group
• Compute the variance between sample means (S²between)
– Produce the F-statistic as the ratio S²between / S²within.
Is there a statistically significant difference in the mean weight loss among the four diets?
18 employees (six each from first year to third year of experience) were selected for an
informal study about their Performance Evaluation score. The evaluation was done for
a score of 100. Using One-way ANOVA technique, find out whether or not a difference
exists somewhere between the three different year levels
Step 1:
Setting the hypotheses (null hypothesis and alternate hypothesis)
• Null Hypothesis (H0: μ1 = μ2 = μ3)
• Alternate Hypothesis (Ha: at least one difference among the means)
And
• Fixing the significance level: α = 0.1 or 0.05 (i.e., a 90% or 95% confidence level)
Step 3: Calculating the Means
• Means for each group and
• Grand mean
Mean Squares_between = SSB / df_between
Mean Squares_within = SSE / df_within
Question:
18 employees (six each from first year to third year of experience) were selected for an informal study
about their Performance Evaluation score. The evaluation was done for a score of 100. Using One-way
ANOVA technique, find out whether or not a difference exists somewhere between the three different
year levels
Scores
First Year Second Year Third Year
82 62 64
93 85 73
61 94 87
74 78 91
69 71 56
53 66 78
Groups / Columns: the three columns (First Year, Second Year, Third Year) are the groups; the six scores in each column are a random sample within that group.
Calculate the mean of each column
Scores
First Year Second Year Third Year
82 62 64
93 85 73
61 94 87
74 78 91
69 71 56
53 66 78
Mean: x̄1 = 72, x̄2 = 76, x̄3 = 74.83
Grand mean: x̄ = 74.28
Scores and squared deviations from the grand mean
First Year Second Year Third Year (XA−x̄)² (XB−x̄)² (XC−x̄)²
82 62 64 59.633 150.744 105.633
93 85 73 350.522 114.966 1.633
61 94 87 176.299 388.966 161.855
74 78 91 0.077 13.855 279.633
69 71 56 27.855 10.744 334.077
53 66 78 452.744 68.522 13.855
Sum 432 456 449 1067.130 747.796 896.685
Mean 72 76 74.83
Grand mean: x̄ = 74.28
Sum of Squares_between:
1. Find the difference between each group mean and the overall mean
2. Square the deviations
3. Multiply by the number of values in each column
4. Add them up
Group means: 72, 76, 74.83; grand mean x̄ = 74.28
SSC = 6(72 − 74.28)² + 6(76 − 74.28)² + 6(74.83 − 74.28)² = 50.778
Sum of Squares_within: squared deviations of each score from its own group mean
First Year Second Year Third Year (XA−x̄A)² (XB−x̄B)² (XC−x̄C)²
82 62 64 100 196 117.361
93 85 73 441 81 3.361
61 94 87 121 324 148.028
74 78 91 4 4 261.361
69 71 56 9 25 354.694
53 66 78 361 100 10.028
Sum 432 456 449 1036 730 894.833
Mean 72 76 74.83
Formulas for One-Way ANOVA
df = Degrees of Freedom
1. df between the columns: df_between = k − 1 (k = number of groups)
2. df within the columns: df_within = N − k (N = total number of observations)
Mean Squares_between = SS_between / df_between
Mean Squares_within = SS_within / df_within
ANOVA F-statistic:
F = Mean Squares_between / Mean Squares_within
MSC = Mean Square Columns / Treatments; MSE = Mean Square Error / Within
Substituting the values
Formula to calculate the critical value in R:
Fcritical = qf(0.05, df1, df2, lower.tail = FALSE)
An F-statistic greater than the critical value is equivalent to a p-value less than alpha; both mean that you reject the null hypothesis. The F value can be used along with the p-value in deciding whether your results are significant enough to reject the null hypothesis.
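The whole procedure for the Performance Evaluation data can be sketched end-to-end (variable names are my own; the critical value quoted in the comment is the standard F(0.05; 2, 15) table value):

```python
# One-way ANOVA for the Performance Evaluation example, following the slides.
groups = {
    "First Year":  [82, 93, 61, 74, 69, 53],
    "Second Year": [62, 85, 94, 78, 71, 66],
    "Third Year":  [64, 73, 87, 91, 56, 78],
}

all_scores = [s for g in groups.values() for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-groups sum of squares: n_g * (group mean - grand mean)^2, summed
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())

# Within-groups sum of squares: squared deviations from each group's own mean
ss_within = sum(
    sum((s - sum(g) / len(g)) ** 2 for s in g) for g in groups.values())

k = len(groups)                      # number of groups
N = len(all_scores)                  # total observations
ms_between = ss_between / (k - 1)    # df_between = 2
ms_within = ss_within / (N - k)      # df_within = 15
F = ms_between / ms_within           # ≈ 0.143, well below F_crit(2, 15) ≈ 3.68
```

Since F is far below the critical value, we fail to reject the null hypothesis: the data show no significant difference among the three year levels.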
Thank You