CSE3506 PPT Ref1
(2 0 2 4 4)
Module-1: Regression Analysis
Linear regression: simple linear regression - Regression Modelling -
Correlation, ANOVA, Forecasting, Autocorrelation
Data Analytics – What?
• Science of analyzing raw data in order to draw conclusions about that information [Investopedia]
• Predictive analytics focuses on what is likely to happen in the near term.
– What happened to sales the last time we had a hot summer?
– How many weather models predict a hot summer this year?
• Prescriptive analytics suggests a course of action.
– If the likelihood of a hot summer, averaged over say five weather models, is above 58%, we should add an evening shift to increase output.
What is Machine Learning?
• Large volumes of data demand automated methods of data analysis, which is what machine learning provides.
• Types of machine learning include Supervised, Unsupervised, and Reinforcement Learning.
Statistical learning refers to
• drawing inferences from a vast data set using a set of tools
• These tools can be classified as
– Supervised
– Unsupervised
Supervised Learning
Unsupervised Learning
Supervised Learning
• The goal is to learn a mapping from inputs x to outputs y, given a labeled set
of input-output pairs.
D = {(x_i, y_i)}, i = 1, …, N
• Here, D is called the training set, and N is the number of training examples.
• (x_i, y_i) is the i-th training sample
Regression Problem
Unsupervised Learning
• Here we are only given inputs: D = {x_i}, i = 1, …, N
• Sometimes called "Knowledge Discovery"
• Clustering algorithms come in this category, where data are grouped based on similarities.
Clustering Problem – Grouping people based on weight and height
• Supervised statistical learning:
Involves building a statistical model for predicting, or estimating, an output based on one or more inputs
Examples: problems arising in business, medicine, astrophysics, and public policy
• Unsupervised statistical learning:
There are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data
Example: an input dataset containing images of different types of cats and dogs
Process
Statistical Learning – Advertising Data
The Advertising data set consists of the
sales of a product in 200 different cities,
along with advertising budgets for three
different media: TV, Radio, and
newspaper
Goal is to develop an accurate model that
can be used to predict sales on the basis of
the three media budgets
• Input Variables: Advertising budgets
• Input Variables are denoted by X
• X1 – TV budget
• X2 – Radio budget
• X3 – Newspaper budget
• Input variables are called by
different names like
• Predictors
• Independent variables
• Features
• Variables
• Output Variable: Sales
• Output Variables are denoted by Y
• Output variables are called by
different names like
• Responses,
• Dependent variables
There is some relationship between Y and X = (X1, X2, …, Xp).
The general form of the relationship is
Y = f(X) + ε
where
f is some fixed but unknown function of X1, …, Xp, and
ε is a random error term, which is independent of X and has mean zero
• The black lines represent the error associated with each observation.
• Here some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve).
• Overall, these errors have approximately mean zero.
(Figure: observed values of income and years of education for 30 individuals)
Statistical Learning refers to a set of approaches for estimating f in the equation
Y = f(X) + ε
Reasons to estimate f:
• Prediction
• Inference
• Linear Regression is a statistical procedure that determines the equation of the straight line that best fits a specific set of data.
• It is a very simple approach for supervised learning.
• In particular, it is a useful tool for predicting a quantitative response.
On the basis of the given advertising data,
• A marketing plan for next year can be made.
• To develop the marketing plan, some information is required:
– Is there a relationship between advertising budget and sales?
– Is the relationship linear?
• Predicting sales with a high level of accuracy requires a strong relationship.
• If the media interact in their effect on sales, in marketing this is known as a synergy effect, while in statistics it is called an interaction effect.
The important questions are
• Which media contribute more to sales?
• Do all three contribute to sales, or do just one or two?
• What is the individual effect of each medium for the money spent?
• For every dollar spent on advertising on TV, Radio, or Newspaper, by what amount will sales increase?
• How accurately can we predict this amount of increase?
Simple Linear Regression
Estimating the coefficients β0 and β1
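For reference, the least-squares estimates that minimize the residual sum of squares (RSS) are the standard ones:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
```

where x̄ and ȳ are the sample means of the inputs and outputs.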
Question-1:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a) Find the best linear fit
(b) Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.
Solution:
(a) To find the best fit, calculate the model coefficients using the formula
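As a check on the hand calculation, a minimal sketch in Python (the function and variable names are my own; the data are from the question):

```python
def simple_linear_fit(x, y):
    """Return (b0, b1) minimizing the residual sum of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = y_mean - b1 * x_mean      # intercept
    return b0, b1

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]

b0, b1 = simple_linear_fit(X, Y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(X, Y)]
rss = sum(e ** 2 for e in residuals)
# b1 ≈ 4.9029, b0 ≈ 3.2732, minimum RSS ≈ 0.856
```

The residuals can then be plotted against X for part (c); here they are small and show no systematic pattern, so a linear model is suitable.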
Question-2:
Consider the following five training examples
Find the best linear fit
Example
9/27/2022 CSE3505-FDA 40
Linear Regression Model (with Normally Distributed Errors)
• In most linear regression analyses, it is common to assume that the error term
is a normally distributed random variable with mean equal to zero and
constant variance.
Mean Absolute Percentage Error (MAPE)
The closer the MAPE value is to zero, the better the predictions. A MAPE of less than 5% is considered an indication that the forecast is acceptably accurate.
Mean Squared Error (MSE)
The lower the value the better; 0 means the model is perfect.
Root Mean Squared Error (RMSE)
The square root of the MSE, expressed in the same units as the response; lower values indicate a better fit. When RMSE is normalized to a 0–1 scale, values closer to 0 represent better-fitting models, and as a rule of thumb normalized values between 0.2 and 0.5 suggest the model can predict the data relatively accurately.
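The three metrics above can be sketched as follows (a minimal illustration; the `actual`/`predicted` values are made up for demonstration):

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100 / len(actual) * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error (same units as the response)."""
    return math.sqrt(mse(actual, predicted))

# Illustrative values only
actual = [100.0, 110.0, 120.0]
predicted = [98.0, 112.0, 121.0]
# mape ≈ 1.55 %, mse = 3.0, rmse = √3 ≈ 1.73
```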
RSS and TSS
RSE and Standard Error
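For reference, these quantities have the standard definitions (the RSE form shown is for simple linear regression, which estimates two parameters):

```latex
\mathrm{RSS} = \sum_{i=1}^{n}\bigl(y_i-\hat{y}_i\bigr)^2,
\qquad
\mathrm{TSS} = \sum_{i=1}^{n}\bigl(y_i-\bar{y}\bigr)^2,
\qquad
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}
```

where ŷᵢ is the fitted value for the i-th observation and ȳ is the sample mean of the response.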
R-Squared
• The RSE provides an absolute measure of the lack of fit of the model to the data. A small RSE indicates that the model fits the data well, whereas a large RSE indicates that it does not. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
• The R² statistic provides an alternative measure of fit. It takes the form of a proportion of variance explained.
Adjusted R-squared can be calculated mathematically in terms of sums of squares. The only difference between the R-squared and Adjusted R-squared equations is the degrees of freedom.
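In terms of sums of squares, with n observations and k predictors, the standard forms are:

```latex
R^2 = \frac{\mathrm{TSS}-\mathrm{RSS}}{\mathrm{TSS}} = 1-\frac{\mathrm{RSS}}{\mathrm{TSS}},
\qquad
R^2_{\text{adj}} = 1-\frac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}
```

Because adjusted R² divides each sum of squares by its degrees of freedom, it penalizes adding predictors that do not improve the fit.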
F-Statistic
• The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
• The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
• A big F with a small p-value means that the null hypothesis is discredited, and we would assert that there is a general relationship between the response and predictors (while a small F with a big p-value indicates that there is no relationship).
Statistical hypotheses
• Null hypothesis (H0): the coefficients are equal to zero (i.e., no
relationship between x and y)
• Alternative Hypothesis (Ha): the coefficients are not equal to zero
(i.e., there is some relationship between x and y)
Solution:
To find the R² value, first find TSS:
TSS = 241.2391
R² = 1 − RSS/TSS = 0.9965
F = MSM / MSE = (explained variance) / (unexplained variance)
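These figures can be verified numerically (a sketch; the coefficients are the Question-1 estimates, and the variable names are my own):

```python
# Checking TSS, R² and F for the Question-1 fit.
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
b0, b1 = 3.2732, 4.9029          # coefficients of the best linear fit

n = len(Y)
y_mean = sum(Y) / n
tss = sum((y - y_mean) ** 2 for y in Y)                    # total sum of squares
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))  # residual sum of squares

r_squared = 1 - rss / tss        # ≈ 0.9965
msm = (tss - rss) / 1            # model mean square (one predictor)
mse = rss / (n - 2)              # error mean square
F = msm / mse                    # very large F: strong evidence of a relationship
```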
Regression Terminology
• The response variable (Y) is assumed to be related to the
k predictors (X1, X2, . . . , Xk ) by a linear equation called
the population regression model:
y = β0 + β1x1 + β2x2 + … + βkxk + ε
• A random error ε represents everything that is not part of
the model. The unknown regression coefficients β0, β1, β2,
. . . , βk are parameters and are denoted by Greek letters
Regression Terminology
• The sample estimates of the regression coefficients are
denoted by Roman letters b0, b1, b2, …, bk. The predicted value of the response variable is denoted ŷ and is calculated by inserting the values of the predictors into the estimated regression equation:
ŷ = b0 + b1x1 + b2x2 + … + bkxk (predicted value of Y)
Data Format
• To obtain a fitted regression, we need n observed values
of the response variable Y and its proposed predictors X1,
X2, . . . , Xk . A multivariate data set is a single column of
Y-values and k columns of X-values.
Example
https://www.statology.org/multiple-linear-regression-
Sample Data
• Illustration: Home Prices
• The table shows sales of 30 new homes in an upscale development. Although the selling price of a home (the response variable) may depend on many factors, we will examine three potential explanatory variables.
• Y = selling price of a home, thousands of dollars (short name: Price)
• X1 = home size, square feet (short name: SqFt)
• X2 = lot size, thousand square feet (short name: LotSize)
• X3 = number of bathrooms (short name: Baths)
Sample Data
Home SqFt(X1) LotSize(X2) Baths(X3) Price(Y)
1 2192 16.4 2.5 505.5
2 3429 24.7 3.5 784.1
3 2842 17.7 3.5 649
4 2987 20.3 3.5 689.8
5 3029 22.2 3 709.8
6 2616 20.8 2.5 590.2
7 2978 17.3 3 643.3
8 3595 22.4 3.5 789.7
9 2838 27.4 3 683
10 2591 19.2 2 544.3
11 3633 26.9 4 822.8
12 2822 23.1 3 637.7
13 2994 20.4 3 618.7
14 2696 22.7 3.5 619.3
15 2134 13.4 2.5 490.5
16 3076 19.8 3 675.1
17 3259 20.8 3.5 710.4
18 3162 19.4 4 674.7
19 2885 23.2 3 663.6
20 2550 20.2 3 606.6
21 3380 19.6 4.5 758.9
22 3131 22.5 3.5 723.3
23 2754 19.2 2.5 621.8
24 2710 21.6 3 622.4
25 2616 20.8 2.5 631.3
26 2608 17.3 3.5 574
27 3572 29 4 863.8
28 2924 21.8 2.5 652.7
29 3614 25.5 3.5 844.2
30 2600 24.1 3.5 629.9
Estimation of Regression
• Intercept = −28.85
• Slope SqFt = 0.171
• Slope LotSize = 6.78
• Slope Baths = 15.53
• Fitted equation: Price = −28.85 + 0.171 SqFt + 6.78 LotSize + 15.53 Baths
Predictions from a Fitted Regression
Example:
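A prediction from the fitted home-price model can be sketched as follows (the function name is illustrative; the coefficients are from the estimation slide):

```python
# Coefficients of the fitted multiple regression (from the slide)
b0, b_sqft, b_lot, b_baths = -28.85, 0.171, 6.78, 15.53

def predict_price(sqft, lotsize, baths):
    """y-hat = b0 + b1*SqFt + b2*LotSize + b3*Baths, in thousands of dollars."""
    return b0 + b_sqft * sqft + b_lot * lotsize + b_baths * baths

# Home 1 from the sample data: 2192 sq ft, 16.4 thousand sq ft lot, 2.5 baths
y_hat = predict_price(2192, 16.4, 2.5)   # ≈ 496.0, vs. the observed 505.5
```

The difference between the observed price (505.5) and the prediction (about 496.0) is the residual for that home.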
What is correlation?
• What is the pattern of the data points that you observe?
– Line, curve or no pattern at all
• If r is close to 1, the variables are positively correlated: there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to −1, the variables are negatively correlated: there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, the variables are not correlated: there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
What is a correlation matrix?
A correlation matrix is a table of correlation coefficients for a set of variables. The rows and columns are labeled with the variable names, and each cell shows the correlation coefficient between the corresponding pair.
Question to Ponder
• Do you recollect any other bivariate analysis that measures
the linear relationship between two variables?
Covariance Vs. Correlation
Covariance:
• Provides the direction of the linear relationship between two variables
• Has no upper or lower bound
• Not standardized
Correlation:
• Provides both the direction and the strength of the linear relationship between two variables
• Ranges between −1 and +1
• Standardized
Covariance Formula
cov(X, Y) = Σ(X − MX)(Y − MY) / (n − 1)

X Y X−MX Y−MY (X−MX)(Y−MY)
2 12 −2 −10.4 20.8
3 17 −1 −5.4 5.4
4 23 0 0.6 0
5 28 1 5.6 5.6
6 32 2 9.6 19.2
Means: MX = 4, MY = 22.4; Σ(X−MX)(Y−MY) = 51
cov(X, Y) = 51 / 4 = 12.75
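The same computation as the table, as a small sketch (the function name is my own):

```python
def sample_covariance(x, y):
    """cov(X, Y) = sum((X - MX)(Y - MY)) / (n - 1)"""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

X = [2, 3, 4, 5, 6]
Y = [12, 17, 23, 28, 32]
cov_xy = sample_covariance(X, Y)   # 51 / 4 = 12.75
```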
• A correlation is a relationship between two variables.
• Is there a relationship between the number of employee training
hours and the number of jobs produced?
• Is there a relationship between the number of hours a student spends
studying for a Mathematics test and the student’s score on that test?
• Let x be the independent variable and y the dependent variable. Data are represented by a collection of ordered pairs (x, y).
• Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient.
Correlation Analysis
• Pearson correlation (r)
– measures a linear dependence between two variables (x and y).
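In symbols, the Pearson correlation coefficient takes the standard form:

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```

Equivalently, r is the covariance of X and Y divided by the product of their standard deviations, which is why it is bounded between −1 and +1.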
Question:
The time x in years that an employee spent at a company and the
employee’s hourly pay, y, for 5 employees are listed in the table below.
Calculate and interpret the correlation coefficient r
• Interpret this result: There is a strong positive correlation between the number of years an employee has worked and the employee's salary, since r is very close to 1.
In R, the correlation coefficient can be computed using the functions cor() or cor.test():
• cor() computes the correlation coefficient
If your data contain missing values, use the following R code to
handle missing values by case-wise deletion.
cor(x, y, method = "pearson", use = "complete.obs")
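For comparison with R's cor(), the same coefficient computed from scratch in Python (illustrative; the data are the Question-1 training examples from earlier):

```python
import math

def pearson_r(x, y):
    """r = Sxy / sqrt(Sxx * Syy), the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
r = pearson_r(X, Y)   # ≈ 0.9982; note r² ≈ 0.9965, the R² found earlier
```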
ANOVA
One-way ANOVA test
• ANOVA – Analysis of Variance
• One-way ANOVA, also known as one-factor ANOVA, is a test for comparing the means of more than two groups.
• ANOVA is a hypothesis-testing procedure that is used to evaluate differences between two or more samples.
• It checks whether the means of two or more sample groups are statistically different or not.
• ANOVA test hypotheses:
– Null hypothesis: the means of the different groups are the same
– Alternative hypothesis: at least one sample mean is not equal to the others
Assumptions of ANOVA test
• The observations are obtained independently and randomly
from the population defined by the factor levels
• The data of each factor level are normally distributed.
• These normal populations have a common variance. (Levene’s
test can be used to check this.)
How does it work?
• Assume that we have 3 groups (A, B, C) to compare:
– Compute the common variance, which is called the variance within samples (S²within) or residual variance.
– Compute the variance between sample means as follows:
• Compute the mean of each group
• Compute the variance between sample means (S²between)
– Produce the F-statistic as the ratio S²between / S²within.
Is there a statistically significant difference in the mean weight loss among the four diets?
18 employees (six each from first year to third year of experience) were selected for an
informal study about their Performance Evaluation score. The evaluation was done for
a score of 100. Using One-way ANOVA technique, find out whether or not a difference
exists somewhere between the three different year levels
Step 1:
Setting the hypotheses (null hypothesis and alternate hypothesis)
• Null Hypothesis (H0: μ1 = μ2 = μ3)
• Alternate Hypothesis (Ha: at least one difference among the means)
And
• Fixing the significance level: α = 0.1 or 0.05 (i.e., a 90% or 95% confidence level)
Step 3: Calculating the Means
• Means for each group and
• Grand mean
Mean Squares_between = SSB / df_between
Mean Squares_within = SSE / df_within
Question:
18 employees (six each from first year to third year of experience) were selected for an informal study
about their Performance Evaluation score. The evaluation was done for a score of 100. Using One-way
ANOVA technique, find out whether or not a difference exists somewhere between the three different
year levels
Scores
First Year Second Year Third Year
82 62 64
93 85 73
61 94 87
74 78 91
69 71 56
53 66 78
Groups / Columns: the three columns (First Year, Second Year, Third Year) are the groups; the six scores in each column are a random sample within that group.
Calculate the mean of each column
Scores
First Year Second Year Third Year
82 62 64
93 85 73
61 94 87
74 78 91
69 71 56
53 66 78
Mean: x̄1 = 72, x̄2 = 76, x̄3 = 74.83
Grand mean: x̄ = 74.28
Scores and squared deviations from the grand mean
First Year Second Year Third Year (XA−x̄)² (XB−x̄)² (XC−x̄)²
82 62 64 59.633 150.744 105.633
93 85 73 350.522 114.966 1.633
61 94 87 176.299 388.966 161.855
74 78 91 0.077 13.855 279.633
69 71 56 27.855 10.744 334.077
53 66 78 452.744 68.522 13.855
Sum 432 456 449 1067.130 747.796 896.685
Mean 72 76 74.83
Grand mean: x̄ = 74.28
Sum of Squares_between:
1. Find the difference between each group mean and the overall mean
2. Square the deviations
3. Multiply by the number of values in each column
4. Add them up
Group means: 72, 76, 74.83; grand mean x̄ = 74.28
SSC = 6(72 − 74.28)² + 6(76 − 74.28)² + 6(74.83 − 74.28)² = 50.778
Sum of Squares_within: squared deviations of each score from its own group mean
First Year Second Year Third Year (XA−x̄A)² (XB−x̄B)² (XC−x̄C)²
82 62 64 100 196 117.361
93 85 73 441 81 3.361
61 94 87 121 324 148.028
74 78 91 4 4 261.361
69 71 56 9 25 354.694
53 66 78 361 100 10.028
Sum 432 456 449 1036 730 894.833
Mean 72 76 74.83
Formulas for One-Way ANOVA
df = Degrees of Freedom
1. df between the columns: df_between = k − 1 (k = number of groups)
2. df within the columns: df_within = N − k (N = total number of observations)
Mean Squares_between = SS_between / df_between
Mean Squares_within = SS_within / df_within
ANOVA F-statistic:
F = Mean Squares_between / Mean Squares_within
MSC = Mean Square Columns / Treatments; MSE = Mean Square Error / Within
Substituting the values
Formula to calculate the critical value in R:
Fcritical = qf(0.05, df1, df2, lower.tail = FALSE)
An F-statistic greater than the critical value is equivalent to a p-value less than alpha; both mean that you reject the null hypothesis. The F value can be used along with the p-value in deciding whether your results are significant enough to reject the null hypothesis.
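The whole procedure for the Performance Evaluation data can be sketched end-to-end (variable names are my own; the critical value quoted in the comment is the standard F(0.05; 2, 15) table value):

```python
# One-way ANOVA for the Performance Evaluation example, following the slides.
groups = {
    "First Year":  [82, 93, 61, 74, 69, 53],
    "Second Year": [62, 85, 94, 78, 71, 66],
    "Third Year":  [64, 73, 87, 91, 56, 78],
}

all_scores = [s for g in groups.values() for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-groups sum of squares: n_g * (group mean - grand mean)^2, summed
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())

# Within-groups sum of squares: squared deviations from each group's own mean
ss_within = sum(
    sum((s - sum(g) / len(g)) ** 2 for s in g) for g in groups.values())

k = len(groups)                      # number of groups
N = len(all_scores)                  # total observations
ms_between = ss_between / (k - 1)    # df_between = 2
ms_within = ss_within / (N - k)      # df_within = 15
F = ms_between / ms_within           # ≈ 0.143, well below F_crit(2, 15) ≈ 3.68
```

Since F is far below the critical value, we fail to reject the null hypothesis: the data show no significant difference among the three year levels.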
Thank You