Module 1

This document outlines the course objectives, outcomes, syllabus, and assessment process for CSE3506-Essentials of Data Analytics taught by Dr. Vergin Raja Sarobin at VIT Chennai. The course aims to teach students concepts of data analytics using machine learning models, supervised and unsupervised learning techniques, and aspects of computational learning theory. The syllabus covers topics like regression analysis, classification, clustering, optimization, data management, and self-development. Students will be assessed through assignments, quizzes, mid-term exams, and a final exam.


CSE3506-Essentials of Data Analytics

Dr. Vergin Raja Sarobin

School of Computer Science and Engineering


VIT Chennai
[email protected]
Course Objectives
✔ To understand the concepts of analytics using various machine learning
models.
✔ To appreciate supervised and unsupervised learning for predictive
analysis
✔ To understand data analytics as the next wave for businesses looking for
competitive advantage
✔ Validate the results of their analysis according to statistical guidelines
✔ Validate and review data accurately and identify anomalies
✔ To learn aspects of computational learning theory
✔ Apply statistical models to perform Regression Analysis, Clustering and
Classification
Course Outcomes

✔ Identify and apply the appropriate supervised learning techniques to


solve real world problems.
✔ Choose and implement typical unsupervised algorithms for different
types of applications.
✔ Implement statistical analysis techniques for solving practical
problems.
✔ Understand different techniques to optimize the learning algorithms.
✔ Aware of health and safety policies followed in organization, data and
information management and knowledge & skill development.

Syllabus

Module-1: Regression Analysis


✔Linear regression: simple linear regression - Regression Modelling -
Correlation, ANOVA, Forecasting, Autocorrelation (6 Hours)

Syllabus

Module-2: Classification
✔Logistic Regression, Decision Trees, Naïve Bayes-conditional
probability - Random Forest - SVM Classifier (6 Hours)

Syllabus

Module-3: Clustering
✔K-means, K-medoids, Hierarchical clustering (4 Hours)

Module-4: Optimization
✔Gradient descent - Variants of gradient descent - Momentum - Adagrad
- RMSprop - Adam – AMSGrad (3 Hours)

Syllabus

Module-5: Managing Health and Safety


✔Comply with organization’s current health, safety and security policies
and procedures - Report any identified breaches in health, safety, and
security policies and procedures to the designated person - Identify and
correct any hazards that they can deal with safely, competently and
within the limits of their authority - Report any hazards that they are not
competent to deal with to the relevant person in line with organizational
procedures and warn other people who may be affected. (4 Hours)

Syllabus

Module-6: Data and Information Management


✔Establish and agree with appropriate people the data/information they
need to provide, the formats in which they need to provide it, and when
they need to provide it - Obtain the data/information from reliable
sources - Check that the data/information is accurate, complete and up-
to-date (4 Hours)

Syllabus

Module-7: Learning and Self Development


✔Obtain advice and guidance from appropriate people to develop their
knowledge, skills and competence - Identify accurately the knowledge
and skills they need for their job role - Identify accurately their current
level of knowledge, skills and competence and any learning and
development needs - Agree with appropriate people a plan of learning
and development activities to address their learning needs (3 Hours)

Syllabus

Text Book
✔Cathy O’Neil and Rachel Schutt. “Doing Data Science, Straight talk from
the Frontline”, O’Reilly. 2014.
✔Dan Toomey, “R for Data Science”, Packt Publishing, 2014.
✔Trevor Hastie, Robert Tibshirani and Jerome Friedman. “Elements of
Statistical Learning”, Springer , Second Edition. 2009.
✔Kevin P. Murphy. “Machine Learning: A Probabilistic Perspective”, MIT
Press; 1st Edition, 2012.

Syllabus
Reference Books
✔Glenn J. Myatt, “Making Sense of Data : A Practical Guide to Exploratory Data
Analysis and Data Mining”, John Wiley & Sons, Second Edition, 2014.
✔G. K. Gupta, ―Introduction to Data Mining with Case Studies”, Easter Economy
Edition, Prentice Hall of India, 2006.
✔Michael Berthold, David J. Hand, “Intelligent Data Analysis”, Springer, 2007.
✔Colleen Mccue, “Data Mining and Predictive Analysis: Intelligence Gathering
and Crime Analysis”, Elsevier, 2007.
✔R N Prasad, Seema Acharya, “Fundamentals of Business Analytics”, Wiley;
Second edition, 2016.
✔https://www.sscnasscom.com/qualification-pack/SSC/Q2101/

Assessment Process (Theory)

CAT-1 15
CAT-2 15
Assignments/Quizzes 30

FAT 40
Total 100

Module-1: Regression Analysis
Linear regression: simple linear regression - Regression Modelling -
Correlation, ANOVA, Forecasting, Autocorrelation (6 Hours)

Data Analytics – What?
 Science of analyzing raw data in order to make
conclusions about that information [Investopedia]

 Analytics is the systematic computational analysis of


data or statistics [Wikipedia]

 Data analysis is a process of inspecting, cleaning,


transforming, and modeling data with the goal of
discovering useful information, deriving conclusions,
and supporting decision-making [Wikipedia]
Data Analytics – Why?
 Helps business to
optimize their
performances
 Reduce cost
 Improved business
 Make better decision
Data Analytics – Types
 Descriptive analytics - describes what has happened
over a given period of time.
 Have the number of views gone up?
 Are sales stronger this month than last?

 Diagnostic analytics  - focuses more on why


something happened. This involves more diverse data
inputs and a bit of hypothesizing.
 Did the weather affect sales of a cool drink?
 Did that latest marketing campaign impact sales?
Data Analytics – Types
 Predictive analytics - focuses on what is likely going to
happen in the near term.
 What happened to sales the last time we had a hot summer?
 How many weather models predict a hot summer this year?
 Prescriptive analytics suggests a course of action.
 If the likelihood of a hot summer is measured as an average of say
five weather models is above 58%, we should add an evening shift
to the workers to increase output.
What is Machine Learning?

 Large volume of data demands automated methods of


data analysis which is what machine learning provides.

 Machine learning is defined as a set of methods that can


automatically detect patterns in data, and then use the
uncovered patterns to predict future data, or to perform
other kinds of decision making under uncertainty.
Machine Learning Paradigms
 Three Learning Paradigms

 Predictive or Supervised Learning

 Descriptive or Unsupervised Learning

 Reinforcement Learning
Statistical Learning

Statistical Learning
• Supervised statistical learning:
 It is defined by its use of labelled datasets to train algorithms that
classify data or predict outcomes accurately.
 Examples: problems arising in business, medicine, astrophysics, and
public policy
• Unsupervised statistical learning:
 Unsupervised learning uses unlabelled data. From that data, it
discovers patterns that can be used for clustering, or to learn
relationships and structure in the data.
 Example: an input dataset containing images of different types of cats and
dogs
Statistical Learning – Wage Data

• The Wage data involves


predicting a continuous or
quantitative output value.
• This is often referred to as
a regression problem

Statistical Learning – Advertising Data

 The Advertising data set consists of the


sales of a product in 200 different cities,
along with advertising budgets for three
different media: TV, radio, and
newspaper

Statistical Learning – Advertising Data

 Goal is to develop an accurate model that can
be used to predict sales on the basis of the
three media budgets

Statistical Learning – Advertising Data
• Input Variables: Advertising budgets
• Input Variables are denoted by X
• X1 – TV budget
• X2 – Radio budget
• X3 – Newspaper budget
• Input variables are called by
different names like
• Predictors
• Independent variables
• Features
• Variables
Statistical Learning – Advertising Data

• Output Variable: Sales


• Output Variables are denoted by Y
• Output variables are called by
different names like
• Responses,
• Dependent variables

Statistical Learning – Advertising Data

 There is some relationship between
Y and X = (X1, X2, ..., Xp)
 General form of relationship is
 Y = f(X) + ε
 where
 f is some fixed but unknown
function of X1, ..., Xp
 ε is a random error term, which is
independent of X and has mean zero

Statistical Learning – Income Data

• The black lines represent the


error associated with each
observation.
• Here some errors are positive (if
an observation lies above the blue
curve) and some are negative (if
an observation lies below the
curve)
• Overall, these errors have
approximately mean zero

[Figure: observed values of income and years of education for 30 individuals]
Statistical Learning – Income Data

 Statistical Learning refers to a set of


approaches for estimating f in the
equation
• Y = f(X) + ε
 Reasons to estimate f:
 Prediction
 Inference

Linear Regression - Introduction
• Linear Regression is a statistical procedure that determines the
equation for the straight line that best fits a specific set of data.
• This is a very simple approach for supervised learning
• In particular, it is a useful tool for predicting a quantitative response.

[Figure: JEE Score example scatter plot with best-fit line]
Advertising Data

On the basis of given advertising data,


• Marketing plan for next year can be made
•To develop the marketing plan, some
information is required.
• Is there a relationship between
advertising budget and sales?
• Is the relationship linear?
• Predicting sales with a high level of accuracy
requires a strong relationship.
• If the media interact in their effect on sales,
in marketing this is known as a synergy
effect, while in statistics it is called an
interaction effect
Advertising Data
The important questions are:
• Which media contribute more to sales?
• Do all three contribute to sales, or do just
one or two?
• What is the individual effect of each medium on
sales? For every dollar spent on advertising in TV
or Radio or Newspaper, by what amount will
sales increase?
• How accurately can we predict this amount
of increase?

Linear regression can be used to answer each of these questions


Linear Regression - Types
• Types:
 Based on the number of independent variables, there are two types of linear
regression
 Simple Linear Regression
 Multiple Linear Regression
• Mathematically, the linear relationship is approximately modeled as
• y = β0 + β1x
β0 – Intercept
β1 – Slope
β0 and β1 – Model coefficients

Simple Linear Regression
Estimating the coefficients β0 and β1:
β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β̂0 = ȳ − β̂1 x̄

Simple Linear Regression
Question-1:
Consider the following five training examples:
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, which is
parameterized by (a, b). Using squared error as the loss function, which of
the following parameters would you use to model this function?
(a) (4, 3)
(b) (5, 3)
(c) (5, 1)
(d) (1, 5)
Simple Linear Regression
Solution:
1. Calculate Ypredicted for the given X using the given (a, b) values
2. For each (a, b) value, calculate the RSS (Residual Sum of Squares)
3. The best set of parameters is the one that gives the minimum RSS

To calculate RSS, use the following formula:
RSS = Σ (Yi − Ypredicted,i)², where Ypredicted,i = a·Xi + b
Simple Linear Regression
Solution:
For a = 4 and b = 3:

X   Y        Ypredicted  (Y − Ypredicted)²
2   12.8978  11          3.6016
3   17.7586  15          7.6099
4   23.3192  19          18.6555
5   28.3129  23          28.2269
6   32.1351  27          26.3693
                 RSS     84.4632

Simple Linear Regression
Solution:
For a = 5 and b = 3:

X   Y        Ypredicted  (Y − Ypredicted)²
2   12.8978  13          0.0104
3   17.7586  18          0.0583
4   23.3192  23          0.1019
5   28.3129  28          0.0979
6   32.1351  33          0.7481
                 RSS     1.0166

Simple Linear Regression
Solution:
For a = 5 and b = 1:

X   Y        Ypredicted  (Y − Ypredicted)²
2   12.8978  11          3.6016
3   17.7586  16          3.0927
4   23.3192  21          5.3787
5   28.3129  26          5.3495
6   32.1351  31          1.2885
                 RSS     18.7110

Simple Linear Regression
Solution:
For a = 1 and b = 5:

X   Y        Ypredicted  (Y − Ypredicted)²
2   12.8978  7           34.7840
3   17.7586  8           95.2303
4   23.3192  9           205.0395
5   28.3129  10          335.3623
6   32.1351  11          446.6925
                 RSS     1117.1086

Answer: The parameters (5, 3) give the least RSS (1.0166); hence (5, 3)
is used to model this function
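The comparison above can be sketched in Python; a minimal illustration (the function and variable names are our own):

```python
# Training data from the question
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]

def rss(a, b):
    """Residual sum of squares for the line f(x) = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(X, Y))

candidates = [(4, 3), (5, 3), (5, 1), (1, 5)]
scores = {p: rss(*p) for p in candidates}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))  # (5, 3) with RSS = 1.0166
```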
Simple Linear Regression
Question-2:
Consider the following five training examples:
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a) Find the best linear fit
(b) Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.
Simple Linear Regression
Solution:
(a) To find the best fit, calculate the model coefficients using the formula:
β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β̂0 = ȳ − β̂1 x̄
Simple Linear Regression
Solution:

      X   Y         (X−Xmean)  (Y−Ymean)  (X−Xmean)(Y−Ymean)  (X−Xmean)²
      2   12.8978   −2         −9.9869    19.9738             4
      3   17.7586   −1         −5.1261    5.1261              1
      4   23.3192   0          0.4345     0.0000              0
      5   28.3129   1          5.4282     5.4282              1
      6   32.1351   2          9.2504     18.5008             4
Sum   20  114.4236  0          0.0000     49.0289             10
Mean  4   22.88472

Substituting in the formula:
β0 = 3.2732
β1 = 4.9029

The best linear fit is Y = 4.9029X + 3.2732
Simple Linear Regression
Solution:

[Figure: best linear fit Y = 4.9029X + 3.2732 plotted over the training data]
Simple Linear Regression
Solution:

(b) To determine RSS:

      X   Y         (X−Xmean)  (Y−Ymean)  (X−Xmean)(Y−Ymean)  (X−Xmean)²  Ypredicted  (Y−Ypredicted)²
      2   12.8978   −2         −9.9869    19.9738             4           13.0789     0.0328
      3   17.7586   −1         −5.1261    5.1261              1           17.9818     0.0498
      4   23.3192   0          0.4345     0.0000              0           22.8847     0.1888
      5   28.3129   1          5.4282     5.4282              1           27.7876     0.2759
      6   32.1351   2          9.2504     18.5008             4           32.6905     0.3085
Sum   20  114.4236  0          0.0000     49.0289             10          RSS         0.8558
Mean  4   22.88472

Ypredicted is calculated using the best linear fit
Y = 4.9029X + 3.2732
Simple Linear Regression
Solution:
(c) Residual plot for the best linear fit:

X   Y        Ypredicted  Residual (Y−Ypredicted)
2   12.8978  13.0789     −0.1811
3   17.7586  17.9818     −0.2232
4   23.3192  22.8847     0.4345
5   28.3129  27.7876     0.5253
6   32.1351  32.6905     −0.5554

The random pattern in the residual plot is an indication that a linear model is suitable for this data
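The whole worked solution can be reproduced with a short Python sketch using the least-squares formulas above (the variable names are our own):

```python
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
n = len(X)
x_mean = sum(X) / n
y_mean = sum(Y) / n

# Least-squares estimates: b1 = S_xy / S_xx, b0 = y_mean - b1 * x_mean
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))
s_xx = sum((x - x_mean) ** 2 for x in X)
b1 = s_xy / s_xx           # slope
b0 = y_mean - b1 * x_mean  # intercept

residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
rss = sum(r ** 2 for r in residuals)
print(round(b1, 4), round(b0, 4), round(rss, 4))  # 4.9029 3.2732 0.8558
```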
Linear regression model validation
1. R-squared (R²):
R² typically has a value in the range of 0 to 1; the closer the R²
value is to 1, the better the fit.

R² measures how much of the variation in the response the model explains,
computed from the differences between the original and predicted values.

R² is also known as the Coefficient of Determination, or sometimes as
Goodness of Fit
2. Residual Plot:
For regression, there are numerous methods to evaluate the goodness of
your fit i.e. how well the model fits the data. One such method is residual
plot.

A typical residual plot has the residual values on the Y-axis


and the independent variable on the x-axis.

If the points are randomly dispersed around the horizontal axis, a


linear regression model is appropriate for the data; otherwise, a non-
linear model is more appropriate.
3. Mean Absolute Percentage Error (MAPE)

What is a good value of MAPE?

The closer the MAPE value is to zero, the better the predictions.
A MAPE of less than 5% is considered an indication that the forecast is
acceptably accurate.
4. Mean Absolute Error (MAE)

We aim to minimize MAE, since it is a loss measure.

5. Mean Squared Error (MSE)

The lower the value the better; 0 means the model is
perfect.
6. Root Mean Squared Error (RMSE)

RMSE is the square root of MSE and is in the same units as the response;
lower values represent better fitting models. When RMSE is normalized by
the range of the observed values, it lies between 0 and 1, and a rule of
thumb says that normalized RMSE values between 0.2 and 0.5 indicate the
model can predict the data relatively accurately.
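For the worked regression example earlier, these validation metrics can be computed with a small sketch (standard textbook definitions; MAPE is reported in percent):

```python
Y_true = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
Y_pred = [13.0789, 17.9818, 22.8847, 27.7876, 32.6905]  # from Y = 4.9029X + 3.2732
n = len(Y_true)
y_mean = sum(Y_true) / n

errors = [yt - yp for yt, yp in zip(Y_true, Y_pred)]
mae = sum(abs(e) for e in errors) / n                          # mean absolute error
mse = sum(e ** 2 for e in errors) / n                          # mean squared error
rmse = mse ** 0.5                                              # root mean squared error
mape = sum(abs(e) / yt for e, yt in zip(errors, Y_true)) / n * 100

# R^2 = 1 - RSS / TSS (coefficient of determination)
rss = sum(e ** 2 for e in errors)
tss = sum((yt - y_mean) ** 2 for yt in Y_true)
r2 = 1 - rss / tss
```

The fit is very good here, so R² is close to 1 and MAPE is well under 5%.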
7. Standard Error (SE) and Residual Standard Error (RSE)

8. Confidence Interval

• For linear regression, the 95% confidence interval for β0
approximately takes the form
β̂0 ± 2·SE(β̂0)

• That is, there is approximately a 95% chance that the interval
will contain the true value of β0

A confidence interval, in statistics, refers to the probability that a
population parameter will fall between a set of values for a certain
proportion of times. Analysts often use confidence intervals that contain
either 95% or 99% of expected observations.
95% confidence interval
• Similarly, a 95% confidence interval for β1 approximately takes the form
β̂1 ± 2·SE(β̂1)
and will contain the true value of β1 approximately 95% of the time

• The word 'approximately' is included mainly because
 The errors are assumed to be Gaussian, and
 The factor '2' in front of the SE term will vary slightly depending on
the number of observations 'n' in the linear regression

Example: Confidence Interval for Regression
Coefficient in R

Suppose we’d like to fit a simple linear


regression model using hours studied as a
predictor variable and exam score as a
response variable for 15 students in a particular
class:
We can use the lm() function to fit this simple linear regression
model in R:
We can use the confint() function to calculate a 95% confidence interval
for the regression coefficient:

Since this confidence interval doesn’t contain the value 0, we can


conclude that there is a statistically significant association between hours
studied and exam score.
We can also confirm this is correct by calculating the 95% confidence
interval for the regression coefficient by hand:

Alpha=0.05

The 95% confidence interval for the regression coefficient is [1.446,


2.518].
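Since the 15-student dataset itself is not reproduced above, here is a sketch of the same kind of calculation on the five-point dataset from the earlier regression example, using the approximate ±2·SE rule from the slides rather than R's confint():

```python
# Approximate 95% CI for the slope, via the +/- 2*SE rule from the slides.
# Data: the five-point training set from the earlier regression example.
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
n = len(X)
x_mean = sum(X) / n
y_mean = sum(Y) / n
s_xx = sum((x - x_mean) ** 2 for x in X)
b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y)) / s_xx
b0 = y_mean - b1 * x_mean

rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
sigma2 = rss / (n - 2)          # residual variance estimate
se_b1 = (sigma2 / s_xx) ** 0.5  # standard error of the slope
ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)
contains_zero = ci[0] <= 0 <= ci[1]
print(ci, contains_zero)  # interval excludes 0 -> significant association
```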
Correlation
Covariance and correlation

Covariance and correlation are two mathematical concepts used in statistics.

Both terms are used to describe how two variables relate to each other. 
Covariance
Covariance signifies the direction of the linear relationship between the two
variables.

By direction we mean if the variables are directly proportional or inversely


proportional to each other. (Increasing the value of one variable might have a
positive or a negative impact on the value of the other variable).

The value of covariance can be any real number, from −∞ to +∞.
Covariance Formula
Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / n
(population form; the sample form divides by n − 1)
Types of covariance

Positive covariance
Positive covariance means both the variables (X, Y) move in the same direction
(i.e. show similar behavior). So, if greater values of one variable (X) seem to
correspond with greater values of another variable (Y), then the variables are
considered to have positive covariance. This tells you something about the linear
relationship between the two variables. So, for example, if an increase in a
person’s height corresponds with an increase in a person’s weight, there is
positive covariance between the two.
Types of covariance…

Negative covariance
Negative covariance means both the variables (X, Y) move in the
opposite direction. As opposed to positive covariance, if the greater
values of one variable (X) correspond to lesser values of another
variable (Y) and vice-versa, then the variables are considered to have
negative covariance.

Variables whose covariance is zero are called uncorrelated variables.


Step 3: Substitute the above values in the covariance formula

Cov(x, y) = 2.67, which is a positive covariance
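Since the data for this worked example is not reproduced above, the computation can be sketched on made-up numbers (sample forms, dividing by n − 1, are assumed here):

```python
# Hypothetical data (the slide's original example values are not reproduced here)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sample covariance: sum of cross-deviations divided by (n - 1)
cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

# Correlation: covariance normalized by the product of standard deviations
sx = (sum((xi - x_mean) ** 2 for xi in x) / (n - 1)) ** 0.5
sy = (sum((yi - y_mean) ** 2 for yi in y) / (n - 1)) ** 0.5
corr = cov / (sx * sy)
print(cov, corr)  # positive covariance; correlation is 1 (perfectly linear)
```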


Correlation Computation

Correlation between the two variables is obtained by normalizing the
covariance, dividing it by the product of the standard deviations of the
two variables:
r = Cov(X, Y) / (σX · σY)
Correlation…
Correlation value ranges from -1 to +1.

The closer it is to +1 or −1, the more closely the two variables are related.

If there is no relationship at all between two variables, then the correlation


coefficient will certainly be 0.

When the correlation coefficient is positive, an increase in one variable also


increases the other.

When the correlation coefficient is negative, the changes in the two variables
are in opposite directions.
Types of correlation
Positive correlation
What is a correlation matrix

A correlation matrix is essentially a table depicting the correlation coefficients for


various variables. The rows and columns contain the value of the variables, and each
cell shows the correlation coefficient.
ANOVA- Analysis of Variance
Population

Sampling
•IoE inspection to get feedback from students/faculty/parent/Alumni/Industry
•Quality control (Statistical Quality Control)
• 100% inspection
• Sample inspection
•Conducting Experiments

Note:
There should not be significant variation between the sample mean and the
population mean.
This is to be proved statistically.

Why ANOVA?
Helps us to understand how different sample groups respond.

ANOVA
• ANOVA – ANalysis of Variance
• Variance:
• The variance measures the average degree to which each data point is different
from the mean.
• The variance is greater when there is a wider range of numbers in the group.
• The calculation of variance uses squares because it weighs outliers more heavily
than data points closer to the mean.
• This prevents differences above the mean from canceling out those below, which
would otherwise result in a variance of zero.
• Thus variance is the average of the squared differences from the mean.
• ANOVA is a hypothesis testing procedure that is used to evaluate differences
between two or more samples.
Standard Deviation:
•Standard Deviation tells how far the data points are from the mean.
•It is the square root of variance
•These two statistical concepts are closely related
•For Data analysts, these two mathematical concepts are of paramount importance as
they are used to measure volatility of data distribution.
•In stock trading, if the standard deviation is less, it indicates the investment is less
risky.
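A tiny sketch of both quantities, on illustrative numbers of our own:

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Variance: average of squared differences from the mean (population form)
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance, in the data's own units
std_dev = variance ** 0.5
print(mean, variance, std_dev)  # 5.0 4.0 2.0
```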

ANOVA

Put all the data points from all THREE samples into a common,
larger distribution. Where is each sample mean relative to the
overall data set in the background?

This shows how far each sample mean is from the mean of the
larger, combined population
ANOVA

Oddball distribution, sort of


the one that doesn’t belong
in the same population as
the other two
ANOVA

Means are in very different


locations relative to the
overall mean
Step 1:
Setting the hypotheses (null hypothesis and alternate hypothesis)
•Null Hypothesis (H0: μ1 = μ2 = μ3)
•Alternate Hypothesis (Ha: at least one difference among the means)
And
•Fixing the confidence level (90%, 95%)
α = 0.1 or 0.05

Step 2: Find the degrees of freedom (df)

•df between the groups/columns
•df within the groups/columns
•df_total

Step 3: Calculating the means
• Mean for each group and
• Grand mean

Step 4: All variability across the columns/groups
•SST (Total Sum of Squares)
•SSC (Sum of Squares between/Columns)
•SSE (Sum of Squares within/Errors)

Step 5: Calculate the mean squares
•MSC = SSC / df_between
•MSE = SSE / df_within

Step 6: Perform the F test (calculate the F ratio)
•F_statistic = Mean Square_between / Mean Square_within
•F_critical from the F distribution table (corresponding to df_numerator and df_denominator)
If F_statistic < F_critical, the null hypothesis is not rejected.
ANOVA
ANOVA

This is One way ANOVA/ Single Factor ANOVA


ANOVA

At least one mean is an outlier and each


distribution is narrow; distinct from each
other

 Means are fairly close to the overall mean
and/or the distributions overlap a bit; hard to
distinguish

 The means are very close to the overall
mean and/or the distributions "melt" together
ANOVA
ANOVA
Question-4:
18 students (six each from first year to third year) were selected for an
informal study about their understanding skill level. The evaluation was
done for a score of 100. Using the One-way ANOVA technique, find out
whether or not a difference exists somewhere between the three different
year levels.

Scores:
First Year  Second Year  Third Year
82          62           64
93          85           73
61          94           87
74          78           91
69          71           56
53          66           78
ANOVA

Each column of the table above is a group; the six scores within each
column are a random sample from that group.
ANOVA
Calculate the mean of each column
ANOVA
Partitioning Sum of Squares

Scores (Xmean is the grand mean, 74.28):

      First Year  Second Year  Third Year  (XA−Xmean)²  (XB−Xmean)²  (XC−Xmean)²
      82          62           64          59.633       150.744      105.633
      93          85           73          350.522      114.966      1.633
      61          94           87          176.299      388.966      161.855
      74          78           91          0.077        13.855       279.633
      69          71           56          27.855       10.744       334.077
      53          66           78          452.744      68.522       13.855
Sum   432         456          449         1067.130     747.796      896.685
Mean  72          76           74.83

SST = 1067.130 + 747.796 + 896.685 = 2711.611


ANOVA

Sum of Squares_between (SSC):
1. Find the difference between each group mean and the overall mean
2. Square the deviations
3. Multiply by the number of values in each column
4. Add them up

SSC = 6(72 − 74.28)² + 6(76 − 74.28)² + 6(74.83 − 74.28)² = 50.778
ANOVA

Sum of Squares_within (SSE):

      First Year  Second Year  Third Year  (XA−XA_mean)²  (XB−XB_mean)²  (XC−XC_mean)²
      82          62           64          100            196            117.361
      93          85           73          441            81             3.361
      61          94           87          121            324            148.028
      74          78           91          4              4              261.361
      69          71           56          9              25             354.694
      53          66           78          361            100            10.028
Sum   432         456          449         1036           730            894.833
Mean  72          76           74.83

SSE = 1036 + 730 + 894.833 = 2660.833


ANOVA
Formulas for One-Way ANOVA

df = Degrees of Freedom
1. DoF b/w the columns: dfC = k − 1, where k is the number of groups
2. DoF within the columns: dfE = N − k, where N is the total number of observations

ANOVA F-statistic:
F = MSC / MSE
MSC = Mean Square Columns/ Treatments = SSC / dfC
MSE = Mean Square Error/ Within = SSE / dfE
ANOVA

Formula to calculate the critical value in Excel:
F.INV.RT(alpha, numerator dof, denominator dof)

• The F-statistic value (F = MSC / MSE = 25.389 / 177.389 = 0.143)
is less than F_critical
• The null hypothesis is accepted.
• This means there is no
significant difference in the
mean values

Critical value of F: F(α, dfC, dfE) = F(0.05, 2, 15) = 3.68
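The full Question-4 calculation can be reproduced with a short sketch (the F-critical value 3.68 is taken from the table, as in the slides):

```python
groups = {
    "first":  [82, 93, 61, 74, 69, 53],
    "second": [62, 85, 94, 78, 71, 66],
    "third":  [64, 73, 87, 91, 56, 78],
}
all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)
k = len(groups)            # number of groups
n_total = len(all_scores)  # total observations

# Sum of squares between (SSC) and within (SSE)
ssc = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
sse = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())

msc = ssc / (k - 1)        # mean square between, df = 2
mse = sse / (n_total - k)  # mean square within, df = 15
f_stat = msc / mse
f_critical = 3.68          # F(0.05, 2, 15) from the F table
print(round(f_stat, 3), f_stat < f_critical)  # 0.143 True -> H0 not rejected
```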


Forecasting
Time Series Forecasting

A time series is a sequence of observations recorded over a certain period


of time. A simple example of a time series is how we come across different
temperature changes day by day or in a month.

 Time series forecasting, in simple words, means to forecast or predict a
future value (e.g. a stock price) over a period of time.
Terminologies
 Time series data
 experimental data that have been observed at different points in
time
 Examples:
 daily stock market quotations
 monthly unemployment figures
 No. of COVID-19 cases observed over a period of time
 BP measured over time
Components of Time Series
Components of Time Series
Trend Seasonal

Direction in which something is Repetition of peak or dip at


increasing or decreasing regular intervals

Random

Irregular fluctuations – uncontrolled


situations contributing to changes
in values
Factors associated with time-series forecasting

•Amount of data
•Data quality
•Seasonality
•Trends
•Unexpected events
The amount of data is probably the most important factor (assuming that the data is
accurate). A good rule of thumb would be the more data we have, the better our
model will generate forecasts.

Data quality entails some basic requirements, such as having no duplicates, a


standardized data format, and for the data to be collected consistently or at regular
intervals.

Seasonality means that there are distinct periods of time when the data contains
consistent irregularities. For example, if an online web shop analyzed its sales
history, it would be evident that the holiday season results in an increased amount
of sales.
Trends are probably the most important information you are looking for.
They indicate whether a variable in the time series will increase or decrease in
a given period.

Unexpected events (sometimes also referred to as noise or irregularities) can


always occur, and we need to consider that when creating a prediction model.
They present noise in historical data, and they are also not predictable.
Time Series Forecast Methods

1. The Average as a Forecast

The simplest form of an average as a forecast can be represented by the
following formula:

Forecast(t+1) = (y1 + y2 + ... + yt) / t

In other words, our forecast for next month (or any month in the future,
for that matter) is the average of all sales that have occurred in the past
2. Autoregression

In a multiple regression model, we forecast the variable of interest using a linear


combination of predictors.

y= b0 + b1*X1

This technique can be used on time series where we can predict the value for the
next time step (t+1) given the observations at the last two time steps (t and t-1).

As a regression model, this would look as follows,

X(t+1) = b0 + b1*X(t) + b2*X(t-1)

Because the regression model uses data from the same input variable at previous
time steps, it is referred to as an autoregression (regression of self).
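A minimal sketch of the idea, simplified to a single lag (an AR(1)-style regression of x(t) on x(t−1)) fitted with the same least-squares formulas used earlier; the series values are made up:

```python
# Illustrative series (made-up values); regress x(t) on x(t-1), i.e. AR(1)
series = [10, 12, 13, 12, 15, 16, 18, 17, 20, 21]
x_prev = series[:-1]  # predictor: value at time t-1
x_next = series[1:]   # response:  value at time t

n = len(x_prev)
xm = sum(x_prev) / n
ym = sum(x_next) / n
num = sum((a - xm) * (b - ym) for a, b in zip(x_prev, x_next))
den = sum((a - xm) ** 2 for a in x_prev)
b1 = num / den       # autoregression coefficient
b0 = ym - b1 * xm    # intercept

# One-step-ahead forecast from the last observed value
forecast = b0 + b1 * series[-1]
print(round(forecast, 2))
```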
3. Simple Moving Average

Rather than use all the previous data in the calculation of an average
as the forecast, why not just use some of the more recent data? This is
precisely what a moving average does, with the following formula (for an
n-period moving average):

Forecast(t+1) = (y(t) + y(t−1) + ... + y(t−n+1)) / n
Simple Moving Average

Example 1: 3 year Simple Moving Average forecast

Year   1    2    3    4    5    6    7    8    9    10   11   12
Sales  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Calculate 3 year Simple Moving Average forecast


Solution:
Calculation of 3 year moving averages of the data

Year Sales 3 year MA


1 5.2  
2 4.9  
3 5.5  
4 4.9 (5.2+4.9+5.5)/3=5.2
5 5.2 (4.9+5.5+4.9)/3=5.1
6 5.7 (5.5+4.9+5.2)/3=5.2

7 5.4 (4.9+5.2+5.7)/3=5.267

8 5.8 (5.2+5.7+5.4)/3=5.433

9 5.9 (5.7+5.4+5.8)/3=5.633
10 6 (5.4+5.8+5.9)/3=5.7
11 5.2 (5.8+5.9+6)/3=5.9
12 4.8 (5.9+6+5.2)/3=5.7
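The 3-year moving average table above can be reproduced with a short sketch:

```python
sales = [5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6, 5.2, 4.8]

def sma_forecast(data, window=3):
    """Forecast for each year as the average of the previous `window` years.
    Returns a dict keyed by year number (1-based, as in the table)."""
    return {t + 1: sum(data[t - window:t]) / window
            for t in range(window, len(data))}

forecasts = sma_forecast(sales)
print(round(forecasts[4], 1))   # (5.2 + 4.9 + 5.5) / 3 = 5.2
print(round(forecasts[12], 1))  # (5.9 + 6 + 5.2) / 3 = 5.7
```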
Forecast error for year 4: |(Actual − Forecast) / Actual| × 100% = |(4.9 − 5.2) / 4.9| × 100% = 6.12%
Forecasting errors
Example 2:
Calculate a four-year moving average from the following data set:
4. Weighted Moving Average forecast example

Example 1: 3 year Weighted Moving Average forecast

Year   1    2    3    4    5    6    7    8    9    10   11   12
Sales  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Calculate 3 year Weighted Moving Average forecast with weight=1,2,1


Solution:
The weights of the 3 years are respectively 1,2,1 and their sum is 4.

Calculation of 3 year moving averages of the data


Year Sales 3 year MA
1 5.2  
2 4.9  
3 5.5  

4 4.9 (1*5.2+2*4.9+1*5.5)/4=5.125
5 5.2 (1*4.9+2*5.5+1*4.9)/4=5.2

6 5.7 (1*5.5+2*4.9+1*5.2)/4=5.125
7 5.4 (1*4.9+2*5.2+1*5.7)/4=5.25
8 5.8 (1*5.2+2*5.7+1*5.4)/4=5.5

9 5.9 (1*5.7+2*5.4+1*5.8)/4=5.575

10 6 (1*5.4+2*5.8+1*5.9)/4=5.725
11 5.2 (1*5.8+2*5.9+1*6)/4=5.9
12 4.8 (1*5.9+2*6+1*5.2)/4=5.775
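The weighted moving average table can be reproduced similarly (weights applied oldest-first, as in the table):

```python
sales = [5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6, 5.2, 4.8]
weights = [1, 2, 1]  # applied to the three most recent years, oldest first

def wma_forecast(data, weights):
    """Weighted moving average: weighted sum of the last len(weights) values,
    divided by the sum of the weights. Keys are 1-based year numbers."""
    w_sum = sum(weights)
    k = len(weights)
    return {t + 1: sum(w * x for w, x in zip(weights, data[t - k:t])) / w_sum
            for t in range(k, len(data))}

forecasts = wma_forecast(sales, weights)
print(round(forecasts[4], 3))  # (1*5.2 + 2*4.9 + 1*5.5) / 4 = 5.125
```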
5. Exponential Smoothing forecast

Example: 3 year Single Exponential Smoothing forecast

year 1 2 3 4 5 6 7 8 9 10
Sales 30 25 35 25 20 30 35 40 30 45

Calculate 3 year Single Exponential Smoothing forecast


Solution:
Use F(t+1) = α·A(t) + (1 − α)·F(t), where A(t) is the actual value at
time t, F(t) is the forecast for time t, and α (0 < α < 1) is the
smoothing constant.
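A sketch of the single-exponential-smoothing recursion; the smoothing constant α = 0.3 and seeding the first forecast with the first actual value are assumptions of ours, since the slide does not specify them:

```python
sales = [30, 25, 35, 25, 20, 30, 35, 40, 30, 45]
alpha = 0.3  # assumed smoothing constant (not specified in the slide)

# F(t+1) = alpha * A(t) + (1 - alpha) * F(t); first forecast seeded with A(1)
forecasts = [sales[0]]
for t in range(1, len(sales)):
    forecasts.append(alpha * sales[t - 1] + (1 - alpha) * forecasts[-1])
print(forecasts[:3])  # [30, 30.0, 28.5]
```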
6. ARIMA Models

AutoRegressive Integrated Moving Average, or ARIMA, is a forecasting method


that combines both an autoregressive model and a moving average model.

 Autoregression uses observations from previous time steps to predict future


values using a regression equation.
An autoregressive model utilizes a linear combination of past variable values to
make forecasts:
Thus, an autoregressive model of order p can be written as:
Y(t) = c + φ1·Y(t−1) + φ2·Y(t−2) + ... + φp·Y(t−p) + ε(t)

Likewise, a pure Moving Average (MA only) model is one where Yt depends only
on the lagged forecast errors:
Y(t) = c + ε(t) + θ1·ε(t−1) + ... + θq·ε(t−q)

If we combine differencing with autoregression and a moving average model, we
obtain a non-seasonal ARIMA model. The full model can be represented with the
following equation, where y′ is the differenced series:
y′(t) = c + φ1·y′(t−1) + ... + φp·y′(t−p) + θ1·ε(t−1) + ... + θq·ε(t−q) + ε(t)
Autocorrelation
Autocorrelation analysis is an important step in the Exploratory Data
Analysis of time series forecasting.
The autocorrelation analysis helps detect patterns and check for
randomness.
Autocorrelation is the correlation between a time series and a lagged
version of itself.

Any autocorrelation that may be present in time series data is determined using a
correlogram, also known as an ACF plot. This is used to help you determine whether
your series of numbers is exhibiting autocorrelation at all, at which point you can then
begin to better understand the pattern that the values in the series may be predicting.
An autoregressive model is when a value from a time series is regressed on
previous values from that same time series.

Autocorrelation

 The coefficient of correlation between two values in a time series is called
the autocorrelation function (ACF).

 The ACF is a way to measure the linear relationship between an observation at
time t and the observations at previous times.
The key statistic in time series analysis is the autocorrelation coefficient (the
correlation of the time series with itself, lagged by 1, 2, or more periods),
which is given by the following formula:

r(k) = Σ from t=1 to n−k of (y(t) − ȳ)(y(t+k) − ȳ)  /  Σ from t=1 to n of (y(t) − ȳ)²
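The lag-k autocorrelation formula can be sketched as:

```python
def acf(y, k):
    """Lag-k autocorrelation: sum of cross-products of the series with its
    lag-k copy (in deviations from the mean), normalized by the total
    sum of squared deviations."""
    n = len(y)
    mean = sum(y) / n
    num = sum((y[t] - mean) * (y[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

series = [1, 2, 3, 4, 5]  # illustrative data
print(acf(series, 1))  # 0.4
```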
ACF Example:
