
CSE3506 Essentials of Data Analytics

(2 0 2 4 4)

Dr. Asnath Phamila Y

Module-1: Regression Analysis
Linear regression: simple linear regression - Regression Modelling -
Correlation, ANOVA, Forecasting, Autocorrelation

Data Analytics – What?
• Science of analyzing raw data in order to make conclusions about
that information [Investopedia]

• Analytics is the systematic computational analysis of data or statistics [Wikipedia]

• Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, deriving conclusions, and supporting decision-making [Wikipedia]
Data Analytics – Why?
• Helps businesses optimize their performance
– Reduce costs
– Improve business processes
– Make better decisions
Data Analytics – Types
• Descriptive analytics - describes what has happened over a given
period of time.
– Have the number of views gone up?
– Are sales stronger this month than last?

• Diagnostic analytics - focuses more on why something happened. This involves more diverse data inputs and a bit of hypothesizing.
– Did the weather affect sales of a cool drink?
– Did that latest marketing campaign impact sales?
Data Analytics – Types

• Predictive analytics - focuses on what is likely going to happen in the near term.
– What happened to sales the last time we had a hot summer?
– How many weather models predict a hot summer this year?
• Prescriptive analytics - suggests a course of action.
– If the likelihood of a hot summer, measured as the average of, say, five weather models, is above 58%, we should add an evening shift to increase output.
What is Machine Learning?

• Large volumes of data demand automated methods of data analysis, which is what machine learning provides.

• Machine learning is defined as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.
Machine Learning Paradigms
• Three Learning Paradigms

– Predictive or Supervised Learning

– Descriptive or Unsupervised Learning

– Reinforcement Learning
Statistical learning refers to
• drawing inferences from a vast data set using a set of tools
• These tools can be classified as supervised or unsupervised

Supervised Learning

• A training set of examples with the correct responses (targets) is provided and, based on this training set, the algorithm generalizes to respond correctly to all possible inputs. This is also called learning from exemplars.

Unsupervised Learning

• Correct responses are not provided, but instead the algorithm tries to identify similarities between the inputs so that inputs that have something in common are categorized together.
• The statistical approach to unsupervised learning is known as density estimation.

Supervised Learning
• The goal is to learn a mapping from inputs x to outputs y, given a labeled set
of input-output pairs.
D = {(x_i, y_i)}, i = 1, …, N
• Here, D is called the training set, and N is the number of training examples.
• (x_i, y_i) is the i-th training sample.
• Each x_i is a d-dimensional vector of numbers called features, attributes or covariates.
• These may be simple measurements (height, weight, age) or complex objects (image, text, time series, graph, etc.)
Supervised Learning
Classification Problem
Classification Problem
• Each y_i is a response variable: a categorical or nominal variable from a finite set, y_i ∈ {1, …, C}
• E.g., Male or Female

Regression Problem
• The response y_i is a real-valued variable
• E.g., Temperature, Age, Height, Weight
Classification Problem – Credit Scoring
Regression Problem – Car Price Prediction
Unsupervised Learning
• The goal is to find “interesting patterns” in the data.

D = {x_i}, i = 1, …, N
• Sometimes called “Knowledge Discovery”

• Clustering algorithms come in this category where data are grouped based on
similarities.
Clustering Problem – Grouping people based on weight and height
• Supervised statistical learning:
 Involves building a statistical model for predicting, or estimating,
an output based on one or more inputs
 Examples of such problems occur in business, medicine, astrophysics, and public policy
• Unsupervised statistical learning:
 There are inputs but no supervising output; nevertheless we can
learn relationships and structure from such data
 Example: Input dataset containing images of different types of cats
and dogs

Process

• Data Collection and Preparation


• Feature Selection
• Algorithm Choice
• Parameter and Model Selection
• Training
• Evaluation

Statistical Learning – Advertising Data

 The Advertising data set consists of the sales of a product in 200 different cities, along with advertising budgets for three different media: TV, radio, and newspaper
 Goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets
• Input Variables: Advertising budgets
• Input Variables are denoted by X
• X1 – TV budget
• X2 – Radio budget
• X3 – Newspaper budget
• Input variables are called by
different names like
• Predictors
• Independent variables
• Features
• Variables
• Output Variable: Sales
• Output Variables are denoted by Y
• Output variables are called by
different names like
• Responses,
• Dependent variables

 There is some relationship between Y and X = (X1, X2, …, Xp)
 General form of relationship is
 Y = f(X) + ε
 where
 f is some fixed but unknown function of X1, …, Xp
 ε is a random error term, which is independent of X and has mean zero
• The black lines represent the error associated with each observation.
• Here some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve).
• Overall, these errors have approximately mean zero.
(Figure: observed values of income and years of education for 30 individuals)
 Statistical Learning refers to a set of approaches for estimating f in the equation Y = f(X) + ε
 Reasons to estimate f:
 Prediction
 Inference
• Linear Regression is a statistical procedure that determines the
equation for the straight line that best fits a specific set of data.
• This is a very simple approach for supervised learning
• In particular, it is a useful tool for predicting a quantitative response.

On the basis of given advertising data,
• A marketing plan for next year can be made
• To develop the marketing plan, some information is required:
• Is there a relationship between advertising budget and sales?
• Is the relationship linear?
• Predicting sales with a high level of accuracy requires a strong relationship.
• If the media interact in their effect on sales, in marketing this is known as a synergy effect, while in statistics it is called an interaction effect
The important questions are
 Which media contribute more to sales?
 Do all three contribute to sales, or do just one or two?
 What is the individual effect of each medium for the money spent on it?
 For every dollar spent on advertising in TV or Radio or Newspaper, by what amount will sales increase?
 How accurately can we predict this amount of increase?

Linear regression can be used to answer each of these questions
Linear Regression - Types
• Types:
 Based on the number of independent variables, there are two types of linear regression
 Simple Linear Regression
 Multiple Linear Regression
• Mathematically, the linear relationship is approximately modeled as
• y = β0 + β1x
Simple Linear Regression
Estimating the coefficients β0 and β1 (least-squares formulas):
β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β0 = ȳ − β1 x̄
Question-1:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a) Find the best linear fit
(b) Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.

Solution:
(a) To find the best fit, calculate the model coefficients using the formula

Solution:

X     Y         (X-Xmean)  (Y-Ymean)  (X-Xmean)(Y-Ymean)  (X-Xmean)^2
2     12.8978   -2         -9.9869    19.9738             4
3     17.7586   -1         -5.1261     5.1261             1
4     23.3192    0          0.4345     0.0000             0
5     28.3129    1          5.4282     5.4282             1
6     32.1351    2          9.2504    18.5008             4
Sum   20    114.4236    0    0.0000    49.0289    10
Mean  4     22.8847

Substituting in the formulas: β1 = 49.0289/10 = 4.9029 and β0 = 22.8847 − 4.9029 × 4 = 3.2732.
The best linear fit is Y = 3.2732 + 4.9029X
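The hand computation can be verified in R; a minimal sketch (not from the slides), using the closed-form formulas and then lm():

# Question-1 data
x <- c(2, 3, 4, 5, 6)
y <- c(12.8978, 17.7586, 23.3192, 28.3129, 32.1351)

# Coefficients from the least-squares formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 4.9029
b0 <- mean(y) - b1 * mean(x)                                     # 3.2732

# Same fit via lm(); resid() gives the residuals for the plot in part (c)
fit <- lm(y ~ x)
coef(fit)
plot(x, resid(fit)); abline(h = 0)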
Solution:
Best Linear Fit

Question-2:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12 17 23 28 32]
Find the best linear fit
Linear Regression Model (with Normally Distributed Errors)
• In most linear regression analyses, it is common to assume that the error term is a normally distributed random variable with mean equal to zero and constant variance.

• Thus, the linear regression model is expressed as
y = β0 + β1x + ε, where ε ~ N(0, σ²)

Solution (Question-2):

X     Y    (X-Xmean)  (Y-Ymean)  (X-Xmean)(Y-Ymean)  (X-Xmean)^2
2     12   -2         -10.4      20.8                4
3     17   -1          -5.4       5.4                1
4     23    0           0.6       0.0                0
5     28    1           5.6       5.6                1
6     32    2           9.6      19.2                4
Sum   20    112     0     0.0     51.0    10
Mean  4     22.4

Substituting in the formulas: β1 = 51/10 = 5.1 and β0 = 22.4 − 5.1 × 4 = 2.
The best linear fit is Y = 5.1X + 2
Sample data and Model
• Data: Marketing from Datarium package
• We want to predict future sales on the basis of advertising budget spent on
youtube.
• sales = b0 + b1 * youtube
• The R function lm() can be used to determine the beta coefficients of the linear
model:

#building a linear model (the marketing data set comes from the datarium package)
library(datarium)
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
model

Coefficients:
(Intercept)      youtube
    8.43911      0.04754
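Once fitted, the model predicts sales as 8.439 + 0.0475 × youtube. A small illustrative sketch (the budget value 200 is an assumption, not from the slides):

# Predicted sales for a youtube advertising budget of 200
predict(model, newdata = data.frame(youtube = 200))
# manually: 8.43911 + 0.04754 * 200 = 17.95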
Performance Metrics
– Mean Absolute Error MAE
– Mean Square Error MSE
– Root MSE RMSE
– Residual Sum of Squares RSS
– Total Sum of Squares TSS
– Residual Standard Error RSE
– Sum of Squares Error SSE
– Standard Error SE
– R-Squared
– Adjusted R Squared
– F Statistic
Mean Absolute Error (MAE)
MAE = (1/n) Σ |yᵢ − ŷᵢ|
We aim to minimize MAE because it is a loss measure.

Mean Absolute Percentage Error (MAPE)
MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|

What is a good value of MAPE?
The closer the MAPE value is to zero, the better the predictions.
A MAPE below 5% is considered an indication that the forecast is acceptably accurate.
Mean Squared Error (MSE)
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
The lower the value the better; 0 means the model is perfect.

Root Mean Squared Error (RMSE)
RMSE = √MSE
On normalized data, this produces a value between 0 and 1, where values closer to 0 represent better-fitting models. As a rule of thumb, normalized RMSE values between 0.2 and 0.5 suggest that the model can predict the data relatively accurately.
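A minimal R sketch (not in the original slides) computing these four metrics for the Question-1 data and its best linear fit:

x <- c(2, 3, 4, 5, 6)
y <- c(12.8978, 17.7586, 23.3192, 28.3129, 32.1351)
y_hat <- 3.2732 + 4.9029 * x                  # predictions from the fitted line

mae  <- mean(abs(y - y_hat))                  # Mean Absolute Error
mape <- 100 * mean(abs(y - y_hat) / abs(y))   # Mean Absolute Percentage Error
mse  <- mean((y - y_hat)^2)                   # Mean Squared Error
rmse <- sqrt(mse)                             # Root Mean Squared Error
c(MAE = mae, MAPE = mape, MSE = mse, RMSE = rmse)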
RSS and TSS

To calculate RSS and TSS, use the following formulas:
RSS = Σ (yᵢ − ŷᵢ)²   (residual sum of squares)
TSS = Σ (yᵢ − ȳ)²    (total sum of squares)
where ŷᵢ is the predicted value and ȳ is the mean of the observed responses.
RSE and Standard Error

For simple linear regression, RSE = √(RSS / (n − 2)); it estimates the standard deviation σ of the error term.
R-Squared
• The RSE provides an absolute measure of lack of fit of the model to the data. A small RSE indicates that the model fits the data well, whereas a large RSE indicates that it doesn't. But since RSE is measured in the units of Y, it is not always clear what constitutes a good RSE.
• The R² statistic provides an alternative measure of fit. It takes the form of a proportion of variance, expressed as
R² = (TSS − RSS) / TSS = 1 − RSS / TSS
• Note that the R² statistic is independent of the scale of Y, and it always takes a value between 0 and 1
• For example, if 𝑅2 =0.8, then 80% of variance in the data is
explained by the model.
R-squared and Adjusted R-squared
• A high value of R2 is a good indication.
• However, as the value of R² tends to increase when more predictors are added to the model, as in multiple linear regression, you should mainly consider the adjusted R-squared, which is a penalized R² for a higher number of predictors.
– An (adjusted) R2 that is close to 1 indicates that a large proportion of the
variability in the outcome has been explained by the regression model.
– A number near 0 indicates that the regression model did not explain
much of the variability in the outcome.
Adjusted R-Squared
It measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable. It penalizes you for adding independent variables that do not help in predicting the dependent variable.

Adjusted R-squared can be calculated in terms of sums of squares; the only difference between the R² and adjusted R² equations is the degrees of freedom:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
where n is the number of observations and k is the number of predictors.
F-Statistic
• The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
• The p-value is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true.
• A big F, with a small p-value, means that the null hypothesis is discredited, and we would assert that there is a general relationship between the response and predictors (while a small F, with a big p-value, indicates that there is no relationship).
Statistical hypotheses
• Null hypothesis (H0): the coefficients are equal to zero (i.e., no
relationship between x and y)
• Alternative Hypothesis (Ha): the coefficients are not equal to zero
(i.e., there is some relationship between x and y)

• State the null and alternative hypotheses:
H0: β1 = β2 = … = βp-1 = 0
H1: βj ≠ 0 for some j
F = MSM / MSE = (explained variance) / (unexplained variance)

• Mean of Squares for Model: MSM = SSM / DFM

• Mean of Squares for Error: MSE = SSE / DFE


• Degrees of Freedom for Model: DFM = p
• Degrees of Freedom for Error: DFE = n – p -1
Question-4:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b which is parameterized by (a, b).
(a) Find the best linear fit
(b) Find the residuals.
(c) Evaluate the standard errors associated with the coefficient estimates
(d) Compute R² and Adjusted R²
(e) Calculate the F-statistic
Solution:

X     Y         (X-Xmean)  (Y-Ymean)  (X-Xmean)(Y-Ymean)  (X-Xmean)^2
2     12.8978   -2         -9.9869    19.9738             4
3     17.7586   -1         -5.1261     5.1261             1
4     23.3192    0          0.4345     0.0000             0
5     28.3129    1          5.4282     5.4282             1
6     32.1351    2          9.2504    18.5008             4
Sum   20    114.4236    0    0.0000    49.0289    10
Mean  4     22.8847

Substituting in the formulas: a = 49.0289/10 = 4.9029 and b = 22.8847 − 4.9029 × 4 = 3.2732.
The best linear fit is Y = 3.2732 + 4.9029X
Solution:

X     Y         Ypredicted  (Y-Ypredicted)^2
2     12.8978   13.0789     0.0328
3     17.7586   17.9818     0.0498
4     23.3192   22.8847     0.1888
5     28.3129   27.7876     0.2759
6     32.1351   32.6905     0.3085
                RSS         0.8558

Ypredicted is calculated using the best linear fit Y = 3.2732 + 4.9029X.
RSS_min = 0.8558
Substituting RSS = 0.8558 and n = 5 into RSE = √(RSS / (n − 2)) gives RSE = 0.5341; the estimate of the error standard deviation is σ̂ = RSE.

Standard error for the slope a:
SE(a) = RSE / √Σ(xᵢ − x̄)² = 0.5341 / √10 = 0.1689

Standard error for the intercept b:
SE(b) = RSE · √(1/n + x̄² / Σ(xᵢ − x̄)²) = 0.5341 · √(1/5 + 4²/10) ≈ 0.7186
To find the R² value, first find TSS = Σ(yᵢ − ȳ)² = 241.2391.

R² = 1 − RSS/TSS = 1 − 0.8558/241.2391 = 0.9965

Adjusted R² = 1 − (1 − 0.9965) × 4/3 = 0.9953
• Mean of Squares for Model: MSM = SSM / DFM = 240.383/1 = 240.383
• Mean of Squares for Error: MSE = SSE / DFE = 0.8558/3 = 0.2853
• F-statistic = MSM / MSE = 240.383/0.2853 = 842.56 (F-statistic: 842.6 on 1 and 3 DF)
Since F_statistic > F_critical, the null hypothesis is rejected.
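The whole of Question-4 can be cross-checked with summary(lm()) in R; a hedged sketch whose rounded output should match the hand computation above:

x <- c(2, 3, 4, 5, 6)
y <- c(12.8978, 17.7586, 23.3192, 28.3129, 32.1351)
fit <- lm(y ~ x)
summary(fit)  # slope 4.9029 (SE 0.1689), intercept 3.2732 (SE ~0.72),
              # R-squared 0.9965, adjusted R-squared 0.9953,
              # F-statistic 842.6 on 1 and 3 DF
anova(fit)    # model and residual sums of squares (SSM, SSE)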


Question-5:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12 17 23 28 32]
We want to learn a function f(x) of the form f(x) = ax + b which is parameterized by (a, b).
(a) Find the best linear fit
(b) Find the residuals.
(c) Evaluate the standard errors associated with the coefficient estimates
(d) Compute R² and Adjusted R²
(e) Calculate the F-statistic
Multiple Regression
• Multiple regression extends simple regression to include several independent variables
(called predictors).
• Multiple regression is required when a single-predictor model is inadequate to describe
the true relationship between the dependent variable Y (the response variable) and its
potential predictors (X1, X2, X3, . . .).
• The interpretation of multiple regression is similar to simple regression because simple
regression is a special case of multiple regression.

Regression Terminology
• The response variable (Y) is assumed to be related to the
k predictors (X1, X2, . . . , Xk ) by a linear equation called
the population regression model:
y = β 0 + β 1 x1 + β 2 x2 + . . . + β k xk + ε
• A random error ε represents everything that is not part of
the model. The unknown regression coefficients β0, β1, β2,
. . . , βk are parameters and are denoted by Greek letters

Regression Terminology
• The sample estimates of the regression coefficients are denoted by Roman letters b0, b1, b2, …, bk. The predicted value of the response variable is denoted ŷ and is calculated by inserting the values of the predictors into the estimated regression equation:
ŷ = b0 + b1x1 + b2x2 + … + bkxk   (predicted value of Y)
Data Format
• To obtain a fitted regression, we need n observed values
of the response variable Y and its proposed predictors X1,
X2, . . . , Xk . A multivariate data set is a single column of
Y-values and k columns of X-values.

Example
https://www.statology.org/multiple-linear-regression-
Sample Data
• Illustration: Home Prices
• The table shows sales of 30 new homes in an upscale development. Although the selling price of a home (the response variable) may depend on many factors, we will examine three potential explanatory variables.
• Definition of Variable (Short Name):
• Y = selling price of a home (thousands of dollars) – Price
• X1 = home size (square feet) – SqFt
• X2 = lot size (thousands of square feet) – LotSize
• X3 = number of bathrooms – Baths
Sample Data

Home  SqFt(X1)  LotSize(X2)  Baths(X3)  Price(Y)
1     2192      16.4         2.5        505.5
2     3429      24.7         3.5        784.1
3     2842      17.7         3.5        649
4     2987      20.3         3.5        689.8
5     3029      22.2         3          709.8
6     2616      20.8         2.5        590.2
7     2978      17.3         3          643.3
8     3595      22.4         3.5        789.7
9     2838      27.4         3          683
10    2591      19.2         2          544.3
11    3633      26.9         4          822.8
12    2822      23.1         3          637.7
13    2994      20.4         3          618.7
14    2696      22.7         3.5        619.3
15    2134      13.4         2.5        490.5
16    3076      19.8         3          675.1
17    3259      20.8         3.5        710.4
18    3162      19.4         4          674.7
19    2885      23.2         3          663.6
20    2550      20.2         3          606.6
21    3380      19.6         4.5        758.9
22    3131      22.5         3.5        723.3
23    2754      19.2         2.5        621.8
24    2710      21.6         3          622.4
25    2616      20.8         2.5        631.3
26    2608      17.3         3.5        574
27    3572      29           4          863.8
28    2924      21.8         2.5        652.7
29    3614      25.5         3.5        844.2
30    2600      24.1         3.5        629.9
Estimation of Regression
• Intercept = −28.85
• Slope SqFt = 0.171
• Slope LotSize = 6.78
• Slope Baths = 15.53
• Fitted model: Price = −28.85 + 0.171·SqFt + 6.78·LotSize + 15.53·Baths
Predictions from a Fitted Regression

• We can use the fitted regression model to make predictions for various assumed predictor values. For example, what would be the expected selling price of a 2,800-square-foot home with 2½ baths on a lot of 18,500 square feet?
• Substituting SqFt = 2800, LotSize = 18.5 (lot size is in thousands of square feet) and Baths = 2.5:
Price = −28.85 + 0.171(2800) + 6.78(18.5) + 15.53(2.5) ≈ 614.2, i.e., about $614,200
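A hedged R sketch of the same model; the data frame name homes is an assumption for the table above (the slides give no code here):

# Assumes the 30-home table above is loaded as a data frame called `homes`
# with columns Price, SqFt, LotSize and Baths
fit <- lm(Price ~ SqFt + LotSize + Baths, data = homes)
coef(fit)  # approx.: (Intercept) -28.85, SqFt 0.171, LotSize 6.78, Baths 15.53

# Expected price of a 2,800 sq ft home, 18.5 (thousand sq ft) lot, 2.5 baths
predict(fit, newdata = data.frame(SqFt = 2800, LotSize = 18.5, Baths = 2.5))
# approx. 614.2 (thousand dollars)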
Example:
We'd like to fit a simple linear regression model using hours studied as a predictor variable and exam score as a response variable for 15 students in a particular class.
We can use the lm() function to fit this simple linear regression model in R, as sketched below:
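A minimal sketch of that call; the hours and score vectors are illustrative stand-ins, since the slide's data table is not reproduced in the text:

# Hypothetical data for 15 students
hours <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8)
score <- c(48, 55, 52, 60, 61, 65, 68, 70, 73, 74, 78, 80, 83, 85, 88)

model <- lm(score ~ hours)  # fits score = b0 + b1 * hours
summary(model)              # coefficients, standard errors, R-squared, F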
Correlation

What is correlation?
• What is the pattern of the data points that you observe?
– Line, curve or no pattern at all
• Here a line pattern is observed, i.e., the data follow a linear pattern: as temperature increases, the number of O-ring failures decreases.
Correlation
• Bivariate analysis that measures the strength of association between two variables and the direction of the relationship.
• A correlation is a relationship between two variables.
• Is there a relationship between the number of employee training hours and the number of jobs produced?
• Is there a relationship between the number of hours a student spends studying for a Mathematics test and the student's score on that test?
• Let x be the independent variable and y be the dependent variable. Data are represented by a collection of ordered pairs (x, y).
• Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient.
• If r is close to 1, the variables are positively correlated → there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, the variables are negatively correlated → there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, the variables are not correlated → there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
What is a correlation matrix

A correlation matrix is essentially a table depicting the correlation coefficients for various variables. The
rows and columns contain the value of the variables, and each cell shows the correlation coefficient.
Correlation
Question to Ponder
• Do you recollect any other bivariate analysis that measures
the linear relationship between two variables?
Covariance vs. Correlation

Covariance:
• Provides the direction of the linear relationship between two variables
• Has no upper or lower bound
• Not standardized

Correlation:
• Provides both the direction and the strength of the linear relationship between two variables
• Ranges between −1 and +1
• Standardized
Covariance Formula
cov(X, Y) = Σ (Xᵢ − MX)(Yᵢ − MY) / (n − 1)   (sample covariance)

Covariance
X   Y    X-MX   Y-MY    (X-MX)(Y-MY)
2   12   -2     -10.4    20.8
3   17   -1      -5.4     5.4
4   23    0       0.6     0
5   28    1       5.6     5.6
6   32    2       9.6    19.2
Mean: MX = 4, MY = 22.4; Sum of products = 51

cov(X, Y) = 51 / 4 = 12.75
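R's cov() and cor(), which use the n − 1 denominator by default, reproduce this table; a quick sketch:

x <- c(2, 3, 4, 5, 6)
y <- c(12, 17, 23, 28, 32)
cov(x, y)  # 12.75, as computed above
cor(x, y)  # ~0.998: the covariance standardized by both standard deviations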
Correlation Analysis
• Pearson correlation (r)
– measures a linear dependence between two variables (x and y).
– It's also known as a parametric correlation test because it depends on the distribution of the data.
– It can be used only when x and y are from a normal distribution.
 The correlation coefficient r is given by
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
 This will always be a number between -1 and 1 (inclusive).
Question:
 The time x in years that an employee spent at a company and the employee's hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r.

• Interpreting the result: there is a strong positive correlation between the number of years an employee has worked and the employee's salary, since r is very close to 1.
The correlation coefficient can be computed using the functions cor() or cor.test():
• cor() computes the correlation coefficient
• cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.

cor(x, y, method = "pearson")
method = "pearson", "kendall", or "spearman"
If your data contain missing values, use the following R code to
handle missing values by case-wise deletion.
cor(x, y, method = "pearson", use = "complete.obs")

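A short sketch of both functions; the years and pay values are illustrative stand-ins for the question's table:

# Hypothetical years-of-service vs. hourly-pay data for 5 employees
years <- c(2, 4, 6, 8, 10)
pay   <- c(15.2, 17.1, 19.4, 21.3, 23.8)

cor(years, pay, method = "pearson")       # correlation coefficient r
cor.test(years, pay, method = "pearson")  # r plus p-value for H0: rho = 0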
ANOVA

One-way ANOVA test
• ANOVA – Analysis of Variance
• One-way ANOVA, also known as one-factor ANOVA is a test for
comparing means of more than two groups
• ANOVA is a hypothesis testing procedure that is used to evaluate
differences between 2 or more samples
• It checks whether the means of two or more sample groups are
statistically different or not.
• ANOVA test hypotheses:
– Null hypothesis: the means of the different groups are the same
– Alternative hypothesis: At least one sample mean is not equal to the
others.
Assumptions of ANOVA test
• The observations are obtained independently and randomly
from the population defined by the factor levels
• The data of each factor level are normally distributed.
• These normal populations have a common variance. (Levene’s
test can be used to check this.)
How it works?
• Assume that we have 3 groups (A, B, C) to compare:
– Compute the common variance, which is called variance within samples (S²within) or residual variance.
– Compute the variance between sample means as follows:
• Compute the mean of each group
• Compute the variance between sample means (S²between)
– Produce the F-statistic as the ratio S²between / S²within.
Is there a statistically significant difference in the mean weight loss among the four diets?

18 employees (six each from first year to third year of experience) were selected for an
informal study about their Performance Evaluation score. The evaluation was done for
a score of 100. Using One-way ANOVA technique, find out whether or not a difference
exists somewhere between the three different year levels

Step 1: Set up the hypotheses (null hypothesis and alternate hypothesis)
• Null Hypothesis (H0: μ1 = μ2 = μ3)
• Alternate Hypothesis (Ha: at least one difference among the means)
and fix the confidence level (90%, 95%), i.e., α = 0.1 or 0.05

Step 2: Find the degrees of freedom
• df between the groups/columns
• df within the groups/columns
• df_total
Step 3: Calculate the means
• Means for each group and
• Grand mean

Step 4: Find all variability across the columns/groups
• SST
• SSB (Sum of Squares between/Columns)
• SSE (Sum of Squares within/Errors)
Step 5: Calculate the variance between and within groups
• Mean Squares_between = SSB / df_between
• Mean Squares_within = SSE / df_within

Step 6: Perform the F test (calculate the F ratio)
• F_statistic = Mean Squares_between / Mean Squares_within
• Compare with F_critical from the F distribution table
• The F-statistic value is less than F-critical
• The null hypothesis is accepted.
• It means there is no significant difference in the mean values
ANOVA

• At least one mean is an outlier and each distribution is narrow; distinct from each other
• Means are fairly close to the overall mean and/or distributions overlap a bit; hard to distinguish
• The means are very close to the overall mean and/or the distributions "melt" together
Question:
18 employees (six each from first year to third year of experience) were selected for an informal study about their Performance Evaluation score. The evaluation was done for a score of 100. Using the one-way ANOVA technique, find out whether or not a difference exists somewhere between the three year levels.

Scores
First Year   Second Year   Third Year
82           62            64
93           85            73
61           94            87
74           78            91
69           71            56
53           66            78
Each column is a group; the six scores in each column are a random sample within that group.
Calculate the mean of each column and the grand (overall) mean:
• Column means: First Year = 72, Second Year = 76, Third Year = 74.83
• Grand mean of all 18 scores: x̿ = 74.28
Partitioning Sum of Squares
SST = SSB (between groups) + SSE (within groups)
Squared deviations of each score from the grand mean (x̿ = 74.28):

First Year  Second Year  Third Year   (XA-x̿)^2   (XB-x̿)^2   (XC-x̿)^2
82          62           64           59.633      150.744     105.633
93          85           73           350.522     114.966     1.633
61          94           87           176.299     388.966     161.855
74          78           91           0.077       13.855      279.633
69          71           56           27.855      10.744      334.077
53          66           78           452.744     68.522      13.855
Sum: 432    456          449          1067.130    747.796     896.685
Mean: 72    76           74.83

SST = 1067.130 + 747.796 + 896.685 = 2711.611
Sum of Squares_between (SSB):
1. Find the difference between each group mean and the overall mean
2. Square the deviations
3. Multiply by the number of values in each column
4. Add them up

SSB (= SSC) = 6(72 − 74.28)² + 6(76 − 74.28)² + 6(74.83 − 74.28)² = 50.778
Sum of Squares_within (SSE): squared deviations of each score from its own column mean

First Year  Second Year  Third Year   (XA-xa_mean)^2  (XB-xb_mean)^2  (XC-xc_mean)^2
82          62           64           100             196             117.361
93          85           73           441             81              3.361
61          94           87           121             324             148.028
74          78           91           4               4               261.361
69          71           56           9               25              354.694
53          66           78           361             100             10.028
Sum: 432    456          449          1036            730             894.833

SSE = 1036 + 730 + 894.833 = 2660.833
Formulas for One-Way ANOVA
df = Degrees of Freedom
1. df between the columns: df_between = k − 1 (k = number of groups)
2. df within the columns: df_within = N − k (N = total number of observations)

Mean Squares_between = SS_between / df_between
Mean Squares_within = SS_within / df_within

ANOVA F-statistic:
F = Mean Squares_between / Mean Squares_within

MSC = Mean Square Columns/Treatments; MSE = Mean Square Error/Within
Substituting the values:
Mean Squares_between = 50.778 / (3 − 1) = 25.389
Mean Squares_within = 2660.833 / (18 − 3) = 177.389
F = 25.389 / 177.389 = 0.1431

Critical value of F: F(α, dfc, dfe) = F(0.05, 2, 15) = 3.68
Formula to calculate the critical value in Excel: F.INV.RT(alpha, numerator dof, denominator dof)

• The F-statistic value is less than F-critical
• The null hypothesis is accepted
• It means there is no significant difference in the mean values
R Code – ANOVA

Fcritical <- qf(0.05, df1, df2, lower.tail = FALSE)
Pvalue <- pf(fstatistic, df1, df2, lower.tail = FALSE)

An F-statistic greater than the critical value is equivalent to a p-value less than alpha; both mean that you reject the null hypothesis. The F value can be used along with the p-value in deciding whether your results are significant enough to reject the null hypothesis.
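An end-to-end sketch of the worked employee example with R's built-in aov(); the output should reproduce the hand calculation (F ≈ 0.143 on 2 and 15 df, p ≫ 0.05):

# Performance evaluation scores from the worked example
score <- c(82, 93, 61, 74, 69, 53,   # first year
           62, 85, 94, 78, 71, 66,   # second year
           64, 73, 87, 91, 56, 78)   # third year
year <- factor(rep(c("First", "Second", "Third"), each = 6))

summary(aov(score ~ year))            # F ~ 0.143 on 2 and 15 df
qf(0.05, 2, 15, lower.tail = FALSE)   # F critical ~ 3.68
pf(0.1431, 2, 15, lower.tail = FALSE) # p-value of the computed F-statistic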
Thank You
