Module 1

Syllabus
Module-2: Classification
✔Logistic Regression, Decision Trees, Naïve Bayes-conditional
probability - Random Forest - SVM Classifier (6 Hours)
Module-3: Clustering
✔K-means, K-medoids, Hierarchical clustering (4 Hours)
Module-4: Optimization
✔Gradient descent - Variants of gradient descent - Momentum - Adagrad
- RMSprop - Adam – AMSGrad (3 Hours)
Text Books
✔ Cathy O'Neil and Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", O'Reilly, 2014.
✔ Dan Toomey, "R for Data Science", Packt Publishing, 2014.
✔ Trevor Hastie, Robert Tibshirani and Jerome Friedman, "The Elements of Statistical Learning", Springer, Second Edition, 2009.
✔ Kevin P. Murphy, "Machine Learning: A Probabilistic Perspective", MIT Press, First Edition, 2012.
Reference Books
✔ Glenn J. Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", John Wiley & Sons, Second Edition, 2014.
✔ G. K. Gupta, "Introduction to Data Mining with Case Studies", Eastern Economy Edition, Prentice Hall of India, 2006.
✔ Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, 2007.
✔ Colleen McCue, "Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis", Elsevier, 2007.
✔ R N Prasad, Seema Acharya, "Fundamentals of Business Analytics", Wiley, Second Edition, 2016.
✔ https://www.sscnasscom.com/qualification-pack/SSC/Q2101/
Assessment Process (Theory)

Component              Marks
CAT-1                    15
CAT-2                    15
Assignments/Quizzes      30
FAT                      40
Total                   100
Module-1: Regression Analysis
✔ Linear regression: Simple linear regression - Regression Modelling - Correlation, ANOVA, Forecasting, Autocorrelation (6 Hours)
Data Analytics – What?
The science of analyzing raw data in order to draw conclusions about that information [Investopedia].
[Figure: types of learning, including Reinforcement Learning and Statistical Learning]
Statistical Learning
• Supervised statistical learning:
It is defined by the use of labelled datasets to train algorithms to classify data or predict outcomes accurately.
Examples: problems in business, medicine, astrophysics, and public policy.
• Unsupervised statistical learning:
Unsupervised learning uses unlabelled data. From that data, it discovers patterns that help with clustering, or it learns relationships and structure from the data.
Example: an input dataset containing images of different types of cats and dogs, with no labels attached.
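To make the contrast concrete, here is a minimal sketch in Python using scikit-learn; the toy data and the choice of LogisticRegression and KMeans are illustrative assumptions, not from the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: two features per sample
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])

# Supervised: labels are known (0 = cat, 1 = dog), so we train a classifier
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # predicts the label of a new sample

# Unsupervised: no labels; the algorithm discovers group structure itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # cluster assignment for each sample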
Statistical Learning – Wage Data
[Figure slides: plots of the Wage data]

Statistical Learning – Advertising Data
[Figure slides: plots of the Advertising data]
• Input variables: advertising budgets
• Input variables are denoted by X
  • X1 – TV budget
  • X2 – Radio budget
  • X3 – Newspaper budget
• Input variables are called by different names, such as:
  • Predictors
  • Independent variables
  • Features
  • Variables
Statistical Learning – Income Data
[Figure slides: plots of the Income data]
Linear Regression - Introduction
• Linear Regression is a statistical procedure that determines the equation of the straight line that best fits a specific set of data.
• It is a very simple approach to supervised learning.
• In particular, it is a useful tool for predicting a quantitative response.
[Figure: example scatter plot with best-fit line (JEE Score data)]
Simple Linear Regression
Estimating the coefficients β0 and β1: the least squares estimates are
  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
  β̂0 = ȳ − β̂1 · x̄
where x̄ and ȳ are the sample means of X and Y.
Simple Linear Regression
Question-1:
Consider the following five training examples:
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, which is parameterized by (a, b). Using squared error as the loss function, which of the following parameters would you use to model this function?
(a) (4, 3)
(b) (5, 3)
(c) (5, 1)
(d) (1, 5)
Simple Linear Regression
Solution:
1. Calculate Ypredicted for the given X using each candidate (a, b).
2. For each (a, b), calculate the RSS (Residual Sum of Squares), where RSS = Σi (Yi − Ypredicted,i)².
3. The best set of parameters is the one that gives the minimum RSS.
Simple Linear Regression
Solution (using Ypredicted = aX + b):
For a = 4 and b = 3:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    11            3.6016
3    17.7586    15            7.6099
4    23.3192    19           18.6555
5    28.3129    23           28.2269
6    32.1351    27           26.3693
                RSS =        84.4632
For a = 5 and b = 3:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    13            0.0104
3    17.7586    18            0.0583
4    23.3192    23            0.1019
5    28.3129    28            0.0979
6    32.1351    33            0.7481
                RSS =         1.0166
For a = 5 and b = 1:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    11            3.6016
3    17.7586    16            3.0927
4    23.3192    21            5.3787
5    28.3129    26            5.3495
6    32.1351    31            1.2885
                RSS =        18.7110
For a = 1 and b = 5:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978     7           34.7840
3    17.7586     8           95.2303
4    23.3192     9          205.0395
5    28.3129    10          335.3623
6    32.1351    11          446.6925
                RSS =      1117.1086

Answer: (a, b) = (5, 3) gives the minimum RSS (1.0166), so option (b) is correct.
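A minimal sketch of the same computation in Python (numpy assumed available):

import numpy as np

X = np.array([2, 3, 4, 5, 6])
Y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

# Candidate (a, b) pairs from the question
for a, b in [(4, 3), (5, 3), (5, 1), (1, 5)]:
    y_pred = a * X + b
    rss = np.sum((Y - y_pred) ** 2)   # residual sum of squares
    print(f"a={a}, b={b}: RSS = {rss:.4f}")

# (5, 3) gives the smallest RSS, confirming option (b).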
Simple Linear Regression
Solution:
(a) To find the best fit, calculate the model coefficients using the least squares formulas above:
  a = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = 49.029 / 10 = 4.9029
  b = ȳ − a·x̄ = 22.8847 − 4.9029 × 4 ≈ 3.2732
Simple Linear Regression
Solution:
(c) Residual plot for the best linear fit (a = 4.9029, b = 3.2732):

X    Y          Ypredicted   Residual (Y − Ypredicted)
2    12.8978    13.0789      −0.1811
3    17.7586    17.9818      −0.2232
4    23.3192    22.8847       0.4345
5    28.3129    27.7876       0.5253
6    32.1351    32.6905      −0.5554

The random pattern in the residuals is an indication that a linear model is suitable for this data.
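The best-fit coefficients and residuals can be reproduced with numpy's polyfit; a sketch, not from the slides:

import numpy as np

X = np.array([2, 3, 4, 5, 6])
Y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

# Least squares fit of a degree-1 polynomial: returns [slope, intercept]
a, b = np.polyfit(X, Y, 1)
print(a, b)                  # approx. 4.9029 and 3.2732

residuals = Y - (a * X + b)
print(residuals)             # no obvious pattern -> linear model is suitable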
Linear Regression Model Validation
Several measures can be used to validate a linear regression model; R² is just one such measure.
1. R² (R-squared)
R-squared typically takes a value in the range 0 to 1. The closer the R-squared value is to 1, the better the fit.
Mean Absolute Percentage Error (MAPE)
The closer the MAPE value is to zero, the better the predictions. A MAPE of less than 5% is considered an indication that the forecast is acceptably accurate.
4. Mean Absolute Error (MAE)
The lower the value, the better; 0 means the model is perfect.
6. Root Mean Squared Error (RMSE)
RMSE = sqrt( (1/n) · Σi (yi − ŷi)² )
8. Confidence Interval
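A compact sketch of these metrics in Python (numpy assumed; the function name and sample values are illustrative):

import numpy as np

def validation_metrics(y_true, y_pred):
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot                          # R-squared
    mape = np.mean(np.abs(residuals / y_true)) * 100  # MAPE, in percent
    mae = np.mean(np.abs(residuals))                  # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error
    return r2, mape, mae, rmse

y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])
y_hat = np.array([13.0789, 17.9818, 22.8847, 27.7876, 32.6905])
print(validation_metrics(y, y_hat))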
Example: Confidence Interval for Regression Coefficient in R
alpha = 0.05 (i.e., a 95% confidence interval)
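In R this is typically done with confint() on a fitted lm model; the slide's R code is not reproduced here. A comparable sketch in Python with scipy, reusing the earlier example data:

import numpy as np
from scipy import stats

x = np.array([2, 3, 4, 5, 6])
y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

res = stats.linregress(x, y)          # least squares fit
alpha = 0.05
# two-sided t critical value with n - 2 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 2)
lo = res.slope - t_crit * res.stderr  # lower bound for the slope
hi = res.slope + t_crit * res.stderr  # upper bound for the slope
print(f"95% CI for slope: ({lo:.4f}, {hi:.4f})")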
Covariance and Correlation
Both terms are used to describe how two variables relate to each other.
Covariance
Covariance signifies the direction of the linear relationship between the two variables. The value of covariance can be any number between −∞ and +∞.
Covariance Formula (sample covariance):
  cov(X, Y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
Types of covariance
Positive covariance
Positive covariance means both the variables (X, Y) move in the same direction
(i.e. show similar behavior). So, if greater values of one variable (X) seem to
correspond with greater values of another variable (Y), then the variables are
considered to have positive covariance. This tells you something about the linear
relationship between the two variables. So, for example, if an increase in a
person’s height corresponds with an increase in a person’s weight, there is
positive covariance between the two.
Types of covariance…
Negative covariance
Negative covariance means both the variables (X, Y) move in the
opposite direction. As opposed to positive covariance, if the greater
values of one variable (X) correspond to lesser values of another
variable (Y) and vice-versa, then the variables are considered to have
negative covariance.
Correlation
The correlation coefficient ranges from −1 to +1. The closer it is to +1 or −1, the more closely the two variables are related. When the correlation coefficient is negative, the changes in the two variables are in opposite directions.
Types of correlation
Positive correlation: the two variables move in the same direction; the correlation coefficient lies between 0 and +1.
What is a correlation matrix?
A correlation matrix is a table showing the correlation coefficient between each pair of variables in a dataset.
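A small sketch computing covariance and the correlation matrix with numpy (the height/weight values are made up for illustration):

import numpy as np

height = np.array([150, 160, 165, 172, 180])  # cm
weight = np.array([52, 58, 63, 70, 78])       # kg

# Sample covariance matrix; the off-diagonal entry is cov(height, weight)
print(np.cov(height, weight))

# Correlation matrix; entries lie in [-1, +1]
print(np.corrcoef(height, weight))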
Sampling
• IoE inspection to get feedback from students/faculty/parents/Alumni/Industry
• Quality control (Statistical Quality Control)
  • 100% inspection
  • Sample inspection
• Conducting experiments
Note: there should not be significant variation between the sample mean and the population mean; this must be demonstrated statistically.
Why ANOVA?
Helps us to understand how different sample groups respond.
ANOVA
• ANOVA – ANalysis Of VAriance
• Variance:
  • The variance measures the average degree to which each data point differs from the mean.
  • The variance is greater when there is a wider range of numbers in the group.
  • The calculation of variance uses squares because it weighs outliers more heavily than data points closer to the mean.
  • This prevents differences above the mean from canceling out those below, which would result in a variance of zero.
  • Thus variance is the average of the squared differences from the mean.
• ANOVA is a hypothesis-testing procedure that is used to evaluate differences between two or more samples.
Standard Deviation:
• Standard deviation tells how far the data points are from the mean.
• It is the square root of variance.
• These two statistical concepts are closely related.
• For data analysts, these two concepts are of paramount importance, as they are used to measure the volatility of a data distribution.
• In stock trading, a lower standard deviation indicates a less risky investment.
ANOVA
Shows how far each group mean is from the mean of the larger, combined population (the grand mean).
ANOVA
Step 3: Calculating the means
• The mean of each group
• The grand mean (the mean of all observations combined)
Decision rule: if F_statistic < F_critical, we fail to reject the null hypothesis of equal group means.
ANOVA
Example: random samples of scores from three groups (columns)

First Year   Second Year   Third Year
82           62            64
93           85            73
61           94            87
74           78            91
69           71            56
53           66            78
ANOVA
Calculate the mean of each column: First Year = 72, Second Year = 76, Third Year = 74.83.
ANOVA
Partitioning the Sum of Squares:
  SS_total = SS_between + SS_within
ANOVA
Squared deviations of each score from the grand mean (X̄grand = 74.28):

First Year   Second Year   Third Year   (XA − X̄)²   (XB − X̄)²   (XC − X̄)²
82           62            64            59.633      150.744     105.633
93           85            73           350.522      114.966       1.633
61           94            87           176.299      388.966     161.855
74           78            91             0.077       13.855     279.633
69           71            56            27.855       10.744     334.077
53           66            78           452.744       68.522      13.855
Sum: 432     456           449          1067.130     747.796     896.685
Mean: 72     76            74.83
Sum of Squares_between:
1. Find the difference between each group mean and the overall (grand) mean
2. Square the deviations
3. Multiply each squared deviation by the number of values in its column
4. Add them up

SS_between = 6 × [(72 − 74.2778)² + (76 − 74.2778)² + (74.8333 − 74.2778)²] ≈ 50.78
ANOVA
df = Degrees of Freedom
1. df between the columns: k − 1 = 3 − 1 = 2
2. df within the groups: N − k = 18 − 3 = 15
ANOVA F-statistic
F = MSC / MSE
MSC = Mean Square Columns/Treatments = SS_between / df_between
MSE = Mean Square Error/Within = SS_within / df_within
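This example can be checked quickly in Python with scipy (assumed available); the arrays reproduce the score table above:

import numpy as np
from scipy import stats

first = np.array([82, 93, 61, 74, 69, 53])
second = np.array([62, 85, 94, 78, 71, 66])
third = np.array([64, 73, 87, 91, 56, 78])

# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(first, second, third)
print(f_stat, p_value)        # F is roughly 0.14 for this data

# Critical F at alpha = 0.05 with df_between = 2 and df_within = 15
f_crit = stats.f.ppf(0.95, dfn=2, dfd=15)
print(f_crit)                 # roughly 3.68

# f_stat < f_crit, so we fail to reject the null hypothesis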
Forecasting
When building a forecasting model, the following factors should be considered:
• Amount of data
• Data quality
• Seasonality
• Trends
• Unexpected events
The amount of data is probably the most important factor (assuming that the data is accurate). A good rule of thumb: the more data we have, the better the forecasts our model will generate.

Seasonality means that there are distinct periods of time when the data contains consistent irregularities. For example, if an online web shop analyzed its sales history, it would be evident that the holiday season results in an increased amount of sales.

Trends are probably the most important information you are looking for. They indicate whether a variable in the time series will increase or decrease in a given period.

1. Simple Average
The forecast for next month (or any month in the future, for that matter) is the average of all sales that have occurred in the past.
2. Autoregression
y = b0 + b1·x1
This technique can be used on a time series where we predict the value for the next time step (t+1) given the observations at the last two time steps (t and t−1), i.e.
  X(t+1) = b0 + b1·X(t) + b2·X(t−1)
Because the regression model uses data from the same input variable at previous time steps, it is referred to as autoregression (regression of self).
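A minimal two-lag autoregression sketch using numpy least squares; the series reuses the sales data from the moving-average example below:

import numpy as np

series = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Build the lagged design matrix: predict X(t+1) from X(t) and X(t-1)
X = np.column_stack([np.ones(len(series) - 2), series[1:-1], series[:-2]])
y = series[2:]

# Least squares estimate of [b0, b1, b2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# One-step-ahead forecast from the last two observations
forecast = b0 + b1 * series[-1] + b2 * series[-2]
print(forecast)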
3. Simple Moving Average
Rather than use all the previous data in the calculation of an average as the forecast, why not use just some of the more recent data? This is precisely what a moving average does, with the following formula:
  F(t+1) = ( Y(t) + Y(t−1) + … + Y(t−k+1) ) / k
where k is the number of recent periods included.
Simple Moving Average (k = 3)

Year:   1    2    3    4    5    6    7    8    9    10   11   12
Sales:  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Year   Actual   Forecast (3-period moving average)
7      5.4      (4.9 + 5.2 + 5.7)/3 = 5.267
8      5.8      (5.2 + 5.7 + 5.4)/3 = 5.433
9      5.9      (5.7 + 5.4 + 5.8)/3 = 5.633
10     6        (5.4 + 5.8 + 5.9)/3 = 5.7
11     5.2      (5.8 + 5.9 + 6)/3 = 5.9
12     4.8      (5.9 + 6 + 5.2)/3 = 5.7
Percentage error calculation for one period: |(Actual − Forecast)/Actual| × 100% = |(4.9 − 5.2)/4.9| × 100% = 6.12%
Forecasting errors
Example 2:
Calculate a four-year moving average from the following data set:
4. Weighted Moving Average forecast example (weights 1, 2, 1)

Year:   1    2    3    4    5    6    7    8    9    10   11   12
Sales:  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Year   Actual   Forecast (weighted moving average)
4      4.9      (1×5.2 + 2×4.9 + 1×5.5)/4 = 5.125
5      5.2      (1×4.9 + 2×5.5 + 1×4.9)/4 = 5.2
6      5.7      (1×5.5 + 2×4.9 + 1×5.2)/4 = 5.125
7      5.4      (1×4.9 + 2×5.2 + 1×5.7)/4 = 5.25
8      5.8      (1×5.2 + 2×5.7 + 1×5.4)/4 = 5.5
9      5.9      (1×5.7 + 2×5.4 + 1×5.8)/4 = 5.575
10     6        (1×5.4 + 2×5.8 + 1×5.9)/4 = 5.725
11     5.2      (1×5.8 + 2×5.9 + 1×6)/4 = 5.9
12     4.8      (1×5.9 + 2×6 + 1×5.2)/4 = 5.775
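Both moving averages can be sketched in Python with numpy's convolve (an illustrative sketch matching the tables above):

import numpy as np

sales = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Simple 3-period moving average: the window over years t-2..t forecasts year t+1
sma = np.convolve(sales, np.ones(3) / 3, mode='valid')
print(sma)          # sma[i] is the forecast for sales[i + 3]

# Weighted moving average with weights (1, 2, 1) / 4
w = np.array([1, 2, 1]) / 4
wma = np.convolve(sales, w[::-1], mode='valid')
print(wma)          # wma[i] is the forecast for sales[i + 3]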
5. Exponential Smoothing forecast

Year:   1   2   3   4   5   6   7   8   9   10
Sales:  30  25  35  25  20  30  35  40  30  45

The smoothed forecast is updated as F(t+1) = α·Y(t) + (1 − α)·F(t), where α (0 < α < 1) is the smoothing constant.
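A minimal exponential smoothing sketch in plain Python; α = 0.3 is an illustrative choice, since the slide's smoothing constant is not shown:

sales = [30, 25, 35, 25, 20, 30, 35, 40, 30, 45]
alpha = 0.3

# Initialize the first forecast with the first actual value
forecast = [sales[0]]
for t in range(1, len(sales)):
    # F(t+1) = alpha * Y(t) + (1 - alpha) * F(t)
    forecast.append(alpha * sales[t - 1] + (1 - alpha) * forecast[-1])

for year, (y, f) in enumerate(zip(sales, forecast), start=1):
    print(year, y, round(f, 2))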
A pure autoregressive (AR only) model is one where Yt depends only on its own lagged values; likewise, a pure Moving Average (MA only) model is one where Yt depends only on the lagged forecast errors.
If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. The full model can be represented with the following equation:
  y′t = c + φ1 y′(t−1) + … + φp y′(t−p) + θ1 ε(t−1) + … + θq ε(t−q) + εt
where y′t is the differenced series, φi are the autoregressive coefficients, θj are the moving average coefficients, and εt is white noise.
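A fitting sketch using statsmodels (assuming it is installed); the order (1, 1, 1) and the series are illustrative choices, not from the slides:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.array([30, 25, 35, 25, 20, 30, 35, 40, 30, 45], dtype=float)

# Non-seasonal ARIMA(p=1, d=1, q=1): AR order, differencing, MA order
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=3))   # forecast the next three values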
Autocorrelation
Autocorrelation analysis is an important step in the exploratory data analysis of time series forecasting. It helps detect patterns and check for randomness.
Autocorrelation is the correlation between a time series and a lagged version of itself.
Any autocorrelation present in time series data is identified using a correlogram, also known as an ACF plot. This helps you determine whether your series exhibits autocorrelation at all, at which point you can begin to better understand the pattern the values in the series may be following.
An autoregressive model is one where a value from a time series is regressed on previous values from that same time series.
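A short sketch that draws a correlogram with statsmodels' plot_acf (assuming statsmodels and matplotlib are installed; the series is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

series = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Correlogram: correlation of the series with lagged copies of itself
plot_acf(series, lags=5)
plt.show()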