Unit 2 - DA - Data Analysis
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
One of the fundamental tasks in data analysis is to find how different variables
are related to each other, and one of the central tools for learning about such
relationships is regression.
Let's take a simple example: suppose your manager asks you to predict annual
sales. There can be factors (drivers) that affect sales, such as competitive
pricing, product quality, shipping time and cost, online reviews, an easy return policy,
loyalty rewards, word-of-mouth recommendations, ease of checkout, etc. In this
case, sales is your dependent variable, and the factors affecting sales are the
independent variables.
Regression analysis would help to solve this problem. In simple words,
regression analysis is used to model the relationship between a dependent
variable and one or more independent (predictors) variables and then use the
relationships to make predictions about the future.
Regression analysis helps to answer the following questions:
Which of the drivers have a significant impact on sales?
Which is the most important driver of sales?
How do the drivers interact with each other?
Regression Modelling Techniques cont…
Regression analysis allows us to model the dependent variable as a function of its predictors,
i.e. Y = f(Xi, β) + ei, where Y is the dependent variable, f is the function, Xi is the i-th independent
variable, β is the set of unknown parameters, ei is the error term, and i varies from 1 to n.
Terminologies
Outliers: Suppose there is an observation in the dataset that has a very high or
very low value compared to the other observations, i.e. it does not appear to belong to
the population; such an observation is called an outlier. In simple words, it is an extreme
value. An outlier is a problem because it often distorts the results we get.
Multicollinearity: When the predictors are highly correlated with each other, the
variables are said to be multicollinear. Many regression techniques assume that
multicollinearity is not present in the dataset, because it causes problems in
ranking variables by their importance and makes it difficult to select the most
important independent variable (factor).
Heteroscedasticity: When the dependent variable's variability is not equal across values of
an independent variable, it is called heteroscedasticity. Example: as one's income
increases, the variability of food consumption will increase. A poorer person will spend a
rather constant amount by always eating inexpensive food; a wealthier person may
occasionally buy inexpensive food and at other times eat expensive meals. Those with
higher incomes display a greater variability of food consumption.
Sample and population: A population is the entire group of elements about which
conclusions are to be drawn. A sample is a smaller part of the whole, i.e., a subset of the
entire population. The size of the sample is always less than the total size of the population.
To fit the regression line, a statistical approach known as the least squares method is used.
The linear regression model provides a sloped straight line, y = a + bx, representing the
relationship between the variables. Consider the image below:
If b > 0, then x (predictor) and y (target) have a positive relationship; that is, an increase
in x will increase y.
If b < 0, then x (predictor) and y (target) have a negative relationship; that is, an increase
in x will decrease y.
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is
to obtain the line that minimizes this error (see the sketch below).
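A minimal sketch of this least squares fit, assuming NumPy is available; the x and y values are invented purely for illustration:

# Fit a simple linear regression y = a + b*x by least squares with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])      # hypothetical target values

# Closed-form least squares estimates of slope (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x                              # fitted values
sse = np.sum((y - y_hat) ** 2)                 # sum of squared errors
print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSE = {sse:.3f}")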
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes
known simply as multiple regression, and it is an extension of linear regression.
Example:
Do age and intelligence quotient (IQ) scores predict grade point average (GPA)?
Do weight, height, and age explain the variance in cholesterol levels?
Do height, weight, age, and hours of exercise per week predict blood pressure?
The formula for a multiple linear regression is:
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + … + βnxn + e
where:
y = the predicted value of the dependent variable
β0 = the y-intercept (the value of y when all predictors are set to 0)
β1 = the regression coefficient of the first independent variable (x1)
βn = the regression coefficient of the last independent variable (xn)
e = the model error
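As a minimal sketch (invented data; NumPy's least squares solver assumed), the coefficients β0, β1, β2 of a two-predictor model can be estimated as follows:

# Estimate multiple linear regression coefficients with np.linalg.lstsq.
# The predictor matrix and target values are invented for illustration only.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])  # two predictors
y = np.array([5.0, 4.5, 10.0, 9.5, 13.0])

X_design = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for the intercept
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("beta0 (intercept):", beta[0])
print("beta1, beta2:", beta[1], beta[2])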
The scatter diagram of sales increase for various discount percentages looks as follows:
The value of r is 0.97, which indicates a very strong, almost perfect, positive correlation,
and the data values appear to form a slight curve.
Non-Linear Regression cont…
Polynomials are equations that involve powers of the independent variable. Second
degree (quadratic), third degree (cubic), and n-th degree polynomial functions:
Second degree: y = β0 + β1x + β2x² + e
Third degree: y = β0 + β1x + β2x² + β3x³ + e
n-th degree: y = β0 + β1x + β2x² + β3x³ + … + βnxⁿ + e
Where:
β0 is the intercept of the regression model
β1, β2, β3, … are the coefficients of the predictors.
How to find the right degree of the equation?
As we increase the degree of the model, its performance on the training data tends to
increase. However, changing the degree also changes the risk of over-fitting or under-fitting
the data. So, one of the following approaches can be adopted (see the sketch after this list):
Forward Selection: This method increases the degree until the model is significant enough
to define the best possible model.
Backward Elimination: This method decreases the degree until the model is significant
enough to define the best possible model.
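A minimal sketch of the underlying idea (not the exact procedure above): fit polynomials of increasing degree with numpy.polyfit on invented data and compare the error on held-out points to see where higher degrees stop helping:

# Compare polynomial fits of increasing degree on hypothetical data.
# A degree whose held-out error stops improving suggests a reasonable stopping point.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2 + 1.5 * x - 0.4 * x**2 + rng.normal(0, 0.3, size=x.size)   # invented curve plus noise

idx = rng.permutation(x.size)
train, test = idx[:20], idx[20:]
for degree in range(1, 6):
    coeffs = np.polyfit(x[train], y[train], degree)        # least squares polynomial fit
    test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: held-out MSE = {test_mse:.4f}")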
Class work
Define the second-order polynomial model with two independent variables.
Define the second-order polynomial model with three independent
variables.
Define the third-order polynomial model with two independent variables.
A polynomial regression is a regression that involves multiple powers of the predictor(s).
Since the model is still linear in its coefficients, the usual regression tools and diagnostics
can be applied to polynomial regression.
Tools in software such as SAS and Excel, or in languages such as Python and R, can estimate
the values of the coefficients β0, β1, etc. and fit a curve to the given data in a non-linear
fashion.
The following figure depicts the graph of increase in sales vs. discount, with a fitted curve.
An R² of 1 indicates that the regression model perfectly fits the data, while an
R² of 0 indicates that the model does not fit the data at all.
R² is calculated as follows:
R² = 1 - (SSres / SStot)
where SSres is the residual sum of squares (squared differences between actual and
predicted values) and SStot is the total sum of squares (squared differences between
the actual values and their mean).
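A minimal sketch of this R² computation, using NumPy and made-up actual/predicted values:

# Compute R^2 = 1 - SS_res / SS_tot for hypothetical actual and predicted values.
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])      # e.g. from a fitted regression model

ss_res = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")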
Logistic regression measures the relationship between a categorical dependent variable and one
or more independent variables by estimating probabilities using a logistic
function, which is the cumulative logistic distribution.
Since the predicted values are probabilities and are therefore restricted to (0, 1),
a logistic regression model only predicts the probability of a particular
outcome given the values of the existing data.
Example: A group of 20 students spends between 0 and 6 hours studying for an
exam. How does the number of hours spent studying affect the probability of the
student passing the exam? The reason for using logistic regression for this
problem is that the values of the dependent variable, pass and fail, while
represented by "1" and "0", are not cardinal numbers. If the problem were
changed so that pass/fail was replaced with a grade of 0–100 (cardinal numbers),
then simple regression analysis could be used. The table shows the number of
hours each student spent studying, and whether they passed (1) or failed (0).
Consider a model with one predictor X1 and one binary response variable Y, and denote
p = P(Y = 1 | X1 = x), where p is the probability of success. p should meet two
criteria: (i) it must always be positive, (ii) it must always be less than or equal to 1.
We assume a linear relationship between the independent variable and the logit of
the event Y = 1. In statistics, the logit is the logarithm of the odds, i.e. p / (1-p).
This linear relationship can be written in the following mathematical form (where ℓ
is the logit, b is the base of the logarithm, and β0, β1 are the parameters of the model):
ℓ = logb(p / (1 - p)) = β0 + β1x1
Solving for p recovers the probability, p = Sb(β0 + β1x1), where Sb is the sigmoid function
with base b, i.e. Sb(t) = 1 / (1 + b^(-t)). In some cases it can be easier to communicate
results by working in base 2, base 10, or the exponential constant e.
In reference to the students example, solving the equation with a software tool and
taking the base as e, the estimated coefficients are β0 = -4.0777 and β1 = 1.5046.
Logistic Regression cont…
For example, for a student who studies 2 hours, entering the value Hours = 2 in
the equation gives an estimated probability of passing the exam of 0.26:
p = 1 / (1 + e^-(-4.0777 + 1.5046 × 2)) ≈ 0.26
Similarly, for a student who studies 4 hours, the estimated probability of passing
the exam is 0.87:
p = 1 / (1 + e^-(-4.0777 + 1.5046 × 4)) ≈ 0.87
Following table shows the probability of passing the exam for several values of
hours studying.
Hours of study Probability of passing the exam
1 0.07
2 0.26
3 0.61
5 0.97
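A minimal sketch of this computation, reusing the coefficients β0 = -4.0777 and β1 = 1.5046 reported above:

# Estimated probability of passing as a function of study hours,
# using the logistic model coefficients given in the example.
import math

beta0, beta1 = -4.0777, 1.5046

def prob_pass(hours: float) -> float:
    # Logistic (sigmoid) function applied to beta0 + beta1 * hours
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * hours)))

for hours in (1, 2, 3, 5):
    print(f"{hours} hour(s) of study -> P(pass) = {prob_pass(hours):.2f}")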
Conditional probability is the probability of one thing being true given that
another thing is true. This is distinct from joint probability, which is the
probability that both things are true without knowing that one of them must be
true.
For example, one joint probability is "the probability that your left and right
socks are both black," whereas a conditional probability is "the probability that
your left sock is black if you know that your right sock is black."
Event A is that it is raining outside, and it has a 0.3 (30%) chance of raining
today. Event B is that you will need to go outside, and that has a probability of 0.5
(50%). A conditional probability relates these two events to one another, such as
the probability that you will need to go outside given that it is raining. The
formula for conditional probability is: P(B|A) = P(A∩B) / P(A)
Example: In a group of 100 sports car buyers, 40 bought alarm systems, 30
purchased bucket seats, and 20 purchased an alarm system and bucket seats. If a
car buyer chosen at random bought an alarm system, what is the probability they
also bought bucket seats?
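A worked solution: here P(alarm) = 40/100 = 0.4 and P(alarm and bucket seats) = 20/100 = 0.2, so P(bucket seats | alarm) = 0.2 / 0.4 = 0.5, i.e. a 50% chance.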
There are three events: A, B, and C. Events B and C are distinct from each other,
while event A intersects with both. We do not know the probability of event A;
however, we know the probability of A under condition B and the probability of A
under condition C. The total probability rule states that, by using the two conditional
probabilities, we can find the probability of event A.
Mathematically, the total probability rule can be written as the following equation,
where n is the number of events and the Bi are the distinct events:
P(A) = Σ P(A | Bi) × P(Bi), for i = 1 to n
As per the diagram, the total probability of event A in this situation can be found
using the equation: P(A) = P(A ∩ B) + P(A ∩ C).
Example: You are a stock analyst following ABC Corp. You discovered that the
company is planning to launch a new project that is likely to affect the company’s
stock price. You have identified the following probabilities:
There is a 60% probability of launching a new project.
If a company launches the project, there is a 75% probability that its stock price
will increase.
If a company does not launch the project, there is a 30% probability that its stock
price will increase.
You want to find the probability that the company’s stock price will increase.
Solution:
P(Stock price increases and project is launched) = 0.6 × 0.75 = 0.45
P(Stock price increases and project is not launched) = 0.4 × 0.3 = 0.12
P(Stock price increases) = 0.45 + 0.12 = 0.57. Thus, there is a 57% probability
that the company's share price will increase.
Law of total probability cont…
Bayes' theorem states:
P(H | E) = P(E | H) × P(H) / P(E)
where:
H is the hypothesis whose probability is affected by the data.
E is the evidence, i.e. the unseen data which was not used in computing the
prior probability.
P(H) is the prior probability, i.e. the probability of H before E is observed.
P(H | E) is the posterior probability, i.e. the probability of H given E, after
E is observed.
P(E | H) is the probability of observing E given H. It indicates the
compatibility of the evidence with the given hypothesis.
Bayesian Inference cont…
Now, the values for each term can be obtained from the dataset and substituted into the
equation. For all entries in the dataset, the denominator does not change; it remains constant.
Therefore, the denominator can be removed and a proportionality introduced.
In the example, the class variable (y) has only two outcomes, yes or no. There could be cases
where the classification is multiclass. Therefore, we need to find the class y with
maximum probability:
y = argmax over y of P(y) × Π P(xi | y), for i = 1 to n
Using the above function, we can obtain the class, given the predictors.
P(Y) = 9/ 14 and P(N) = 5/14 where Y stands for Yes and N stands for No.
The outlook probability is: P(sunny | Y) = 2/9, P(overcast | Y) = 4/9, P(rain | Y) = 3/9, P(sunny |
N) = 3/5, P(overcast | N) = 0, P(rain | N) = 2/5
The temperature probability is: P(hot | Y) = 2/9, P(mild | Y) = 4/9, P(cool | Y) = 3/9, P(hot | N) =
2/5, P(mild | N) = 2/5, P(cool | N) = 1/5
The humidity probability is: P(high | Y) = 3/9, P(normal | Y) = 6/9, P(high | N) = 4/5, P(normal |
N) = 2/5.
The windy probability is: P(true | Y) = 3/9, P(false | Y) = 6/9, P(true | N) = 3/5, P(false | N) = 2/5
Now we want to predict “Enjoy Sport” on a day with the conditions: <outlook = sunny;
temperature = cool; humidity = high; windy = strong>
P(Y) × P(sunny | Y) × P(cool | Y) × P(high | Y) × P(strong | Y) ≈ 0.005 and P(N) × P(sunny | N) ×
P(cool | N) × P(high | N) × P(strong | N) ≈ 0.021
Since the probability of No is larger, we predict "Enjoy Sport" to be No on that day.
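A minimal sketch of this calculation in Python, hard-coding the conditional probabilities listed above (windy = strong is treated as windy = true):

# Naive Bayes score for each class: P(class) * product of P(feature value | class).
# Probabilities are taken directly from the worked example above.
p_class = {"Y": 9 / 14, "N": 5 / 14}
cond = {
    "Y": {"outlook=sunny": 2 / 9, "temperature=cool": 3 / 9,
          "humidity=high": 3 / 9, "windy=true": 3 / 9},
    "N": {"outlook=sunny": 3 / 5, "temperature=cool": 1 / 5,
          "humidity=high": 4 / 5, "windy=true": 3 / 5},
}

query = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]
for label in ("Y", "N"):
    score = p_class[label]
    for feature in query:
        score *= cond[label][feature]
    print(f"score({label}) = {score:.4f}")
# Expected: score(Y) around 0.005 and score(N) around 0.021, so the prediction is No.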
Pros
It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical
variable(s). For numerical variables, a normal distribution is assumed (bell curve, which
is a strong assumption).
Cons
It assumes independent predictors. In real life, it is almost impossible to get a
set of predictors that are completely independent.
If a categorical variable has a category in the test data set that was not observed in the
training data set, the model will assign it a 0 (zero) probability and will be unable to
make a prediction. This is often known as the "Zero Frequency" problem. To solve it, we can
use a smoothing technique; one of the simplest smoothing techniques is called Laplace
estimation (see the sketch below).
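A minimal sketch of Laplace (add-one) estimation; the counts below are assumptions chosen to mirror the P(overcast | N) = 0 case above:

# Laplace (add-one) smoothing: add 1 to every count so that unseen categories
# never receive a zero probability estimate.
def laplace_estimate(count: int, class_total: int, n_categories: int) -> float:
    return (count + 1) / (class_total + n_categories)

# Hypothetical example: "overcast" was never seen with class No (count 0)
# out of 5 "No" rows, and outlook has 3 categories (sunny, overcast, rain).
print(laplace_estimate(0, 5, 3))   # 1/8 = 0.125 instead of 0
print(laplace_estimate(3, 5, 3))   # (3+1)/(5+3) = 0.5 instead of 3/5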
For instance, we can see that p(SAT=s¹ | Intelligence = i¹) is 0.8; that is, if the
intelligence of the student is high, then the probability of the SAT score being high as
well is 0.8. On the other hand, p(SAT=s⁰ | Intelligence = i¹) is 0.2, which encodes the
fact that if the intelligence of the student is high, then the probability of the SAT score
being low is 0.2.
Note that the sum of values in each row is 1. That makes sense because given that
Intelligence=i¹, the SAT score can be either s⁰ or s¹, so the two probabilities must add
up to 1. Similarly, the CPD for “Letter” encodes the conditional probabilities
p(Letter=l | Grade=g). Because “Grade” can take three values, we have three rows in
this table.
The CPD for “Grade” is easy to understand with the above knowledge. Because it has
two parents, the conditional probabilities will be of the form p(Grade=g |
Difficulty=d, Intelligence=i), that is, what is the probability of “Grade” being g, given
that the value of “Difficulty” is d and that of “Intelligence” is i. Each row now corresponds
to a pair of values of “Difficulty” and “Intelligence.” Again, the row values add up to 1.
An essential requirement for Bayesian networks is that the graph must be a directed
acyclic graph (DAG).
While we don’t have a CPD that gives us that information directly, we can see that
a high SAT score from the student would suggest that the student is likely
intelligent, and consequently, the probability of a good grade is high if the
difficulty of the course is low, as shown using the red arrows in the previous
image. We may also want to estimate the probability of multiple variables
simultaneously, like what is the probability of the student getting a good grade
and a good letter?
The variables with known values are called “observed variables,” while those
whose values are unobserved are called “hidden variables” or “latent variables.”
Conventionally, observed variables are denoted using grey nodes, while latent
variables are denoted using white nodes, as in the previous image. We may be
interested in finding the values of some or all of the latent variables.
The graph structures that we’ve been talking about so far actually capture
important information about the variables. Specifically, they define a set of
conditional independences between the variables, that is, statements of the
form — “If A is observed, then B is independent of C.” Let’s look at some
examples.
Probabilistic Graphical Models cont…
In the student network, let’s say you know that a student had a high SAT
score. What can you say about her grade? As we saw earlier, a high SAT score
suggests that the student is intelligent, and therefore, you would expect a
good grade. What if the student has a low SAT score? In this case, you would
not expect a good grade.
Now, let’s say that you also know that the student is intelligent, in addition
to her SAT score. If the SAT score was high, then you would expect a good
grade. What if the SAT score was low? You would still expect a good grade
because you know that the student is intelligent, and you would assume that
she just didn’t perform well enough on the SAT. Therefore, knowing the SAT
score doesn’t tell us anything if we see the intelligence of the student. To put
this as a conditional independence statement, we would say — “If
Intelligence is observed, then SAT and Grade are independent.”
We got this conditional independence information from the way these nodes
were connected in the graph. If they were connected differently, we would
get different conditional independence information.
Probabilistic Graphical Models cont…
Let’s see this with another example. Let’s say you know that the student is
intelligent. What can you say about the difficulty of the course? Nothing,
right? Now, what if I tell you that the student got a bad grade on the course?
This would suggest that the course was hard because we know that an
intelligent student got a bad grade. Therefore we can write our conditional
independence statement as follows — “If Grade is unobserved, then
Intelligence and Difficulty are independent.”
Because these statements capture an independence between two nodes
subject to a condition, they are called conditional independences. Note that
the two examples have opposite semantics — in the first one, the
independence holds if the connecting node is observed; in the second one,
the independence holds if the connecting node is unobserved. This
difference is because of the way the nodes are connected, that is, the
directions of arrows.
The host of the game shows you three closed doors, with a car behind one of the doors
and something of no value behind the others. You get to pick a door. Then, the host
opens one of the remaining doors and shows that it does not contain the car. Now, you
have the option to switch from the door you picked initially to the one that the
host left unopened. Do you switch?
Application: Monty Hall Problem cont…
Now, the host picks a door H and opens it. So, H is now observed. Therefore, we
get the following Bayesian network for our problem:
Observing H does not tell us anything about I, that is,
whether we have picked the right door. That is what our
intuition tells us. However, it does tell us something
about D (Again, drawing analogy with the student
network, if you know that the student is intelligent, and
the grade is low, it tells you something about the
difficulty of the course.)
Let’s see this using numbers. The CPD tables for the variables are as follows
(This is when no variables have been observed.):
p(D): doors 1, 2, 3 each with probability 1/3.
p(F): doors 1, 2, 3 each with probability 1/3.

p(I | D, F)      I=0    I=1
D=1, F=1         0      1
D=1, F=2         1      0
D=1, F=3         1      0
D=2, F=1         1      0
D=2, F=2         0      1
D=2, F=3         1      0
D=3, F=1         1      0
D=3, F=2         1      0
D=3, F=3         0      1
I=1 when D and F are identical, and I=0 when D and F are different.

p(H | D, F)      H=1    H=2    H=3
D=1, F=1         0      1/2    1/2
D=1, F=2         0      0      1
D=1, F=3         0      1      0
D=2, F=1         0      0      1
D=2, F=2         1/2    0      1/2
D=2, F=3         1      0      0
D=3, F=1         0      1      0
D=3, F=2         1      0      0
D=3, F=3         1/2    1/2    0
If D and F are equal, the host picks one of the other two doors with equal probability; if D and F are different, the host picks the third door.
Now, let’s assume that we have picked a door, that is, F is now observed, say
F=1. What are the conditional probabilities of I and D, given F?
So far, the probability that we have picked the correct door is 1/3 and the car
could still be behind any door with equal probability.
Now, the host opens one of the doors other than F, so we observe H. Assume
H=2. Let’s compute the new conditional probabilities of I and D given both F
and H.
Our first choice is correct with probability 1/3. So, if we don't switch, we get the car
with probability 1/3, whereas if we do switch, we get the car with probability 2/3.
Therefore, switching is the better strategy (see the simulation sketch below).
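A minimal simulation sketch (not part of the original slides) that checks this result empirically:

# Monte Carlo check of the Monty Hall result: switching wins about 2/3 of the time.
import random

def play(switch: bool) -> bool:
    doors = [1, 2, 3]
    car = random.choice(doors)              # D: door hiding the car
    first = random.choice(doors)            # F: our first choice
    # H: host opens a door that is neither our pick nor the car
    host = random.choice([d for d in doors if d != first and d != car])
    if switch:
        first = next(d for d in doors if d != first and d != host)
    return first == car

trials = 100_000
print("P(win | stay)   ~", sum(play(False) for _ in range(trials)) / trials)
print("P(win | switch) ~", sum(play(True) for _ in range(trials)) / trials)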
Support Vector Machine (SVM) is an algorithm that can be used for both classification
and regression challenges. However, it is mostly used in two-group classification problems.
In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where
n is the number of features), with the value of each feature being the value of a particular
coordinate. Then, classification is performed by finding the hyperplane that best differentiates
the two classes.
What is a hyperplane?
A hyperplane is a generalization of a plane.
In one dimension, it is a point.
In two dimensions, it is a line.
In three dimensions, it is a plane.
In more dimensions, it is called a hyperplane.
The following figure represents data points in one dimension, and the point L is a
separating hyperplane.
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the
margins from both tags. In other words: the hyperplane (in 2D, it's a line) whose
distance to the nearest element of each tag is the largest.
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A,
B and C). Now, the job is to identify the right hyperplane to classify the stars and circles.
Remember a rule of thumb for identifying the right hyperplane: "Select the hyperplane which
segregates the two classes better." In this scenario, hyperplane "B" has performed this job
excellently.
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and C)
and all segregate the classes well. Now, how can we identify the right hyperplane?
Here, maximizing the distance between the nearest data point (of either class) and the
hyperplane helps us decide. This distance is called the margin.
Below, you can see that the margin for hyperplane C is higher compared to both A and B.
Hence, we name C as the right hyperplane. Another compelling reason for selecting the
hyperplane with the higher margin is robustness: if we select a hyperplane with a low
margin, there is a high chance of misclassification.
(Scenario-3) Identify the right hyperplane: Use the rules discussed in the previous
scenarios to identify the right hyperplane.
Some of you may have selected hyperplane B as it has a higher margin compared to A.
But here is the catch: SVM selects the hyperplane which classifies the classes accurately
before maximizing the margin. Here, hyperplane B has a classification error and A has
classified everything correctly. Therefore, the right hyperplane is A.
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as
one of the stars lies in the territory of the other (circle) class as an outlier.
One star at the other end is like an outlier for the star class. The SVM algorithm has a
feature to ignore outliers and find the hyperplane that has the maximum margin. Hence,
we can say SVM classification is robust to outliers.
Linear SVM: Linear SVM is used for data that are linearly separable, i.e. for a
dataset that can be categorized into two classes by a single straight line. Such
data points are termed linearly separable data, and the classifier used is
described as a linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for data that are not linearly
separable, i.e. a straight line cannot be used to classify the dataset. For
this, we use something known as the kernel trick, which maps data points into a
higher dimension where they can be separated using planes or other
mathematical functions. Such data points are termed non-linear data, and
the classifier used is termed a non-linear SVM classifier (see the sketch below).
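A minimal sketch using scikit-learn's SVC (assumed available), contrasting a linear kernel with an RBF (kernel trick) classifier on an invented, non-linearly separable dataset:

# Linear vs. kernel SVM on a toy dataset that a straight line cannot separate.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Class 0: points near the origin; class 1: a ring around them (not linearly separable).
inner = rng.normal(0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(0, 0.2, (50, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel} kernel training accuracy: {clf.score(X, y):.2f}")
# The RBF kernel should separate the ring from the centre; the linear kernel cannot.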
Seasonal component
These are the rhythmic forces which operate in a regular and periodic
manner over a span of less than a year. They have the same or almost the
same pattern during a period of 12 months. This variation will be present in
a time series if the data are recorded hourly, daily, weekly, quarterly, or
monthly.
These variations come into play either because of natural forces or
person-made conventions. The various seasons and climatic conditions play
an important role in seasonal variations: for example, the production of crops
depends on the season, the sale of umbrellas and raincoats rises in the rainy season,
and the sale of electric fans and air conditioners shoots up in the summer season.
The effect of person-made conventions such as festivals, customs, habits,
fashions, and occasions like marriage is easily noticeable. They recur year
after year. An upswing in a season should not be taken as an indicator of better
business conditions.
Cyclical component
The variations in a time series which operate over a span of more than one year
are the cyclic variations. This oscillatory movement has a period of oscillation of
more than a year, and one complete period is a cycle. This cyclic movement is
sometimes called the ‘Business Cycle’.
It is a four-phase cycle comprising the phases of prosperity, recession,
depression, and recovery. The cyclic variation may be regular but is not
periodic. The upswings and downswings in business depend upon the
joint nature of the economic forces and the interaction between them.
Irregular component
These are not regular variations; they are purely random or irregular. Such
fluctuations are unforeseen, uncontrollable, unpredictable, and erratic.
Examples of such forces are earthquakes, wars, floods, famines, and other disasters.
Mixed model
Different assumptions lead to different combinations of additive and multiplicative
models, such as Yt = Tt + St + Ct * It
The time series analysis can also be done using models such as:
Yt = Tt + St * Ct * It
Yt = Tt * St + Ct * It
Home Work
How to determine if a time series has a trend component?
How to determine if a time series has a seasonal component?
How to determine if a time series has both a trend and seasonal component?
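One way to start exploring these questions is a classical decomposition; a minimal sketch, assuming the statsmodels library is available and using an invented monthly series:

# Decompose a synthetic monthly series into trend, seasonal and residual parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

months = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)                        # upward trend
seasonal = 10 * np.sin(2 * np.pi * np.arange(48) / 12)   # 12-month seasonality
noise = np.random.default_rng(0).normal(0, 2, 48)
series = pd.Series(trend + seasonal + noise, index=months)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # smoothed trend component
print(result.seasonal.head(12))         # repeating seasonal component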
MA(3), MA(5), and MA(12) are commonly used for monthly data, and MA(4) is
normally used for quarterly data.
MA(4) and MA(12) average out the seasonality factors in quarterly and
monthly data respectively.
The advantage of the MA method is that its data requirement is very small.
The major disadvantage is that it assumes the data to be stationary.
MA is also called the simple moving average.
Moving Averages (MAs) cont…
Period   Value
8        286
9        212
10       275
11       188
12       312
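A minimal sketch of an MA(3) computation with pandas, treating the partial series above as the data (the earlier periods and the meaning of the values are not shown in the slides):

# Simple moving average (MA(3)) over the example values using pandas.
import pandas as pd

values = pd.Series([286, 212, 275, 188, 312], index=[8, 9, 10, 11, 12])
ma3 = values.rolling(window=3).mean()   # average of each value with its two predecessors
print(ma3)
# The first two entries are NaN because a 3-period window is not yet available.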
Error calculation
The error is calculated as Et = yt – St (i.e. the difference between the actual and smoothed values at time t).
Then the squared error is calculated, i.e. ESt = Et * Et.
Then the sum of the squared errors (SSE) is calculated, i.e. SSE = ΣESi for i = 1 to n, where
n is the number of observations.
Then the mean of the squared errors is calculated, i.e. MSE = SSE/(n-1).
The best value for α is the one which results in the smallest MSE.
Simple Exponential Smoothing cont…
Let us illustrate this principle with an example. Consider the following data set consisting
of 12 observations taken over time, with α = 0.1. The smoothed value is initialized with the
first observation (S = 71), and each subsequent smoothed value is α times the previous
observation plus (1 - α) times the previous smoothed value:
Time  yt   St                                   Et                  ESt
1     71
2     70   0.1 * 71 + (1-0.1) * 71 = 71.0       70 - 71 = -1.00     (-1.00)² = 1.00
3     69   0.1 * 70 + (1-0.1) * 71 = 70.9       69 - 70.9 = -1.90   (-1.90)² = 3.61
4     68   70.71                                -2.71               7.34
5     64   70.44                                -6.44               41.47
6     65   69.80                                -4.80               23.04
7     72   69.32                                2.68                7.18
8     78   69.58                                8.42                70.90
9     75   70.43                                4.57                20.88
10    75   70.88                                4.12                16.97
11    75   71.29                                3.71                13.76
12    70   71.67                                -1.67               2.79
Simple Exponential Smoothing cont…
The sum of the squared errors (SSE) = 208.94. The mean of the squared
errors (MSE) is the SSE /11 = 19.0.
In a similar fashion, the MSE was again calculated for α = 0.5 and turned out
to be 16.29, so in this case we would prefer an α of 0.5.
Can we do better?
We could apply the proven trial-and-error method. This is an iterative
procedure beginning with a range of α between 0.1 and 0.9.
We determine the best initial choice for α and then search between α−Δ
and α+Δ. We could repeat this perhaps one more time to find the best α
to 3 decimal places.
In general, most well designed statistical software programs should be able to find the
value of α that minimizes the MSE.
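A minimal sketch of that search in Python, reusing the 12 observations from the table above and the SSE/(n-1) convention used in these slides:

# Grid search for the smoothing constant alpha that minimizes the MSE of
# simple exponential smoothing, S[t] = alpha*y[t-1] + (1-alpha)*S[t-1].
y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]

def mse_for_alpha(alpha: float) -> float:
    s = y[0]                     # initialize the smoothed value with the first observation
    sse = 0.0
    for t in range(1, len(y)):
        s = alpha * y[t - 1] + (1 - alpha) * s
        sse += (y[t] - s) ** 2
    return sse / (len(y) - 1)    # MSE = SSE / (n - 1), as in the slides

for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"alpha = {alpha}: MSE = {mse_for_alpha(alpha):.2f}")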
Month              1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand  44  46  48  50  55  60  64  60  53  48  42  38
Error              -2  -1  1   5   2   0   -2  -2  1   2   2   2
Squared Error      4   1   1   25  4   0   4   4   1   4   4   4
The mean absolute percentage error (MAPE) is calculated as:
MAPE = (100 / n) * Σ | X(t) - X'(t) | / X(t), summed over t = 1 to n
Here, X'(t) represents the forecasted data value of point t and X(t) represents the actual
data value of point t. Calculate MAPE for the below dataset.
Month              1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand  44  46  48  50  55  60  64  60  53  48  42  38
MAPE is commonly used because it’s easy to interpret and easy to explain. For
example, a MAPE value of 11.5% means that the average difference between the
forecasted value and the actual value is 11.5%.
The lower the value for MAPE, the better a model is able to forecast values e.g. a
model with a MAPE of 2% is more accurate than a model with a MAPE of 10%.
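A minimal sketch of the MAPE calculation for the dataset above:

# MAPE = (100 / n) * sum(|actual - forecast| / actual) for the exercise data.
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

n = len(actual)
mape = 100 / n * sum(abs(a - f) / a for a, f in zip(actual, forecast))
print(f"MAPE = {mape:.2f}%")   # roughly 3.6% for this data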