
Lecture 5 - Spring 2024


Quick Review

• SUMPRODUCT
 Get the summation of several products (e.g., total grade):
 Get the summation of a series with the same structure (expressed by the summation symbol $\sum$), e.g., the sample variance.
 Group analysis: combine SUMPRODUCT function with logical test:
SUMPRODUCT(--(A2:A9 = "north")) is equivalent to COUNTIF(A2:A9, "north")
• PivotTable
 Use PivotTable when we want to summarize the information (i.e., a statistic such as mean, median, summation, etc.) by one or more columns. These columns are generally
non-numerical such as region, location, type, year, category, etc. For example, we want to know the average sales in each region.
 Insert Values to PivotTable
 Rows Field and Columns Field: Rows mean Row Names and Columns mean Column Names, put non-numerical values into these two fields as row/column names
 Values Field: Put numerical values into this field. A statistic will be calculated for the added values. We can put multiple numerical values into Values Field and we
can also repeatedly put one numerical value with different statistics here.
 Edit Values
 How to change the statistic: under “Values” area > Value Field Settings or under the Analyze Tab > Active Field group > Field Settings. We can also change the
number format here.
 How to combine multiple groups: Select multiple column/row names at the same time, then click Group Selection under Analyze Tab > “Group” group (or right click
> Group).
 Change the group name for this newly generated group
 Change the label for this new categorization
 How to calculate percentage of parent total
 Show Values As > % of Parent Total (need to decide which parent to use by choosing the Base Field)

• Figure vs. Table
 The table contains the complete information and you can check any data point you want. Figures are easier to follow and
people need much less time to understand a figure than a table. A figure can also show the trend in the data.
 In an official report or a research paper, complete information is necessary. In general, people use figures to describe some
preliminary results and use tables to report the main results.
• Chart Elements: Chart area, Plot area, X-axis, Y-axis, and Legend
• How to insert a chart, after selecting the data
 Method 1: Quick Analysis (Ctrl+Q)
 Method 2: Insert tab > Charts group > directly choose an appropriate chart
 Method 3: Insert tab > Charts group > Recommended Charts > choose an appropriate chart

• Edit the chart elements


 Chart title, X-axis and Y-axis, font, font color, vertical axis major gridline, border of the whole chart area
 Edit element: double-click a chart element to open the Format pane for that element

• Add chart elements

DSME 2051 Business Information Systems
Lecture 5: Business Analytics

Forecasting
Introduction to Forecasting
• Deduce the next number:
 1, 2, 3…
 1, 3, 5…
 1, 1, 2, 3, 5, 8…
• Why can we deduce the next number?
 We depend on the information from existing numbers. Similarly, if we have a time frame, we can use previous information to predict the value at the next time point (i.e., the future).
 Those numbers share a strict mathematical pattern, but real data usually do not follow such a clean pattern.

Stephen Curry’s Average Points

Age   PTS
21    17.5
22    18.6
23    14.7
24    22.9
25    24.0
26    23.8
27    30.1
28    25.3
29    27.6

Can we predict his PTS next year?

Hints
• The main advantage of prediction over a time frame is that the environment will not change much, especially in a short time. Next year, Stephen Curry is very likely to be a player at the same level. Usually, the most recent information is the most useful (when there is no seasonal effect).
• Assumption: things will not change a lot in a short time.



Simple Moving Average

• One possible solution: directly use the data from the last period. However, this leaves a lot of the available information unused. Maybe we can use information from the last two or three years.

• In financial applications a simple moving average (SMA) is the unweighted mean of the previous data points.

 One-year moving average: use the data point in the most recent period to predict next period

 Two-year moving average: use the average of the data points in the most recent two periods to predict next period

 …

• Data: Stephen Curry’s Average Points


 Calculate the two-year and three-year moving average results in Excel.
 Use an appropriate figure to visualize the actual values and the predicted values from the two-year and three-year moving averages.

 As we increase the number of periods used to predict the next period, the prediction becomes smoother, because we average over more past periods and more of the variation in the data cancels out. If the data is sensitive to changes over time, a long-period moving average may not be a good idea, because the information in the most recent periods is “attenuated” when we average over a long period.
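As a cross-check for the Excel exercise, the moving-average forecast can be sketched in Python. This is a minimal illustration, not the lecture's own method: the function name `moving_average_forecasts` is made up here, and the data are the Stephen Curry points listed earlier in this lecture.

```python
# Stephen Curry's average points at ages 21..29 (from the table in this lecture).
pts = [17.5, 18.6, 14.7, 22.9, 24.0, 23.8, 30.1, 25.3, 27.6]

def moving_average_forecasts(values, k):
    """Return one forecast per period 1..len(values)+1.

    The forecast for a period is the unweighted mean of the k actual values
    just before it; periods with fewer than k earlier values get None.
    """
    forecasts = []
    for t in range(len(values) + 1):
        if t < k:
            forecasts.append(None)
        else:
            forecasts.append(sum(values[t - k:t]) / k)
    return forecasts

two_year = moving_average_forecasts(pts, 2)
three_year = moving_average_forecasts(pts, 3)

# The last entry is the forecast for age 30, the first period without data.
print(round(two_year[-1], 2), round(three_year[-1], 2))
```

Comparing `two_year` and `three_year` side by side shows the smoothing effect described above: the longer window reacts more slowly to the jump at age 27.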



Weighted Moving Average

• One issue with the simple moving average is that if we use too many previous periods, the recent information, which may have a bigger impact on the next period, will be diluted. However, if we use very few previous periods, the prediction can be unstable.

• To overcome these shortcomings, we can use a weighted moving average instead of a simple moving average. Specifically, we assign a weight to each period, and more recent periods get larger weights. In this way, the information in recent periods will not be attenuated too much even if we use a long history.

• In Excel, use a three-month weighted moving average to predict the data in November. Data: Weighted Moving Average

Month       Weight   Data
August      17%      130
September   33%      110
October     50%      90
November             ?
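The November forecast can be checked with a short Python sketch. The helper name `weighted_moving_average` is hypothetical; the weights and data are the ones in the table. In Excel the same calculation would typically be SUMPRODUCT of the weights and the data divided by the SUM of the weights.

```python
# Weights and data from the table above (oldest month first).
weights = [0.17, 0.33, 0.50]   # August, September, October
data = [130, 110, 90]

def weighted_moving_average(values, weights):
    """Weighted mean of values; dividing by the weight total keeps it an
    average even if the weights do not sum exactly to 1."""
    total_weight = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total_weight

november_forecast = weighted_moving_average(data, weights)
print(round(november_forecast, 1))
```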



Exponential Smoothing
• Two things could be arbitrary when we use a moving average. First, how many periods of information should we use? Second, how should we determine the weights when we use a weighted moving average?

• Exponential smoothing

$F_{t+1} = \alpha D_t + (1 - \alpha) F_t$

 Here $F_{t+1}$ represents the forecast value for the next period, $D_t$ is the actual value for the present period, and $F_t$ is the previously determined forecast for the present period. $\alpha$ is called the smoothing factor, and $0 \le \alpha \le 1$.

• Suppose we have two periods of data, $D_1$ and $D_2$, and a chosen $\alpha$, and we want to predict the data point in the third period. Then $F_3 = \alpha D_2 + (1 - \alpha) F_2$.

• As we see, to get the prediction result using exponential smoothing, we have to assign an initial value for $F_1$, which is the prediction result for the first period. The easiest way to do this is to set $F_1$ equal to $D_1$.

 What is $F_2$ if we let $F_1 = D_1$?


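• Let’s expand the formula for exponential smoothing by repeatedly substituting the previous forecast:

$F_{t+1} = \alpha D_t + \alpha(1-\alpha) D_{t-1} + \alpha(1-\alpha)^2 D_{t-2} + \dots + \alpha(1-\alpha)^{t-1} D_1 + (1-\alpha)^t F_1$

• Can you calculate the summation of the weights using high-school mathematics?

Tips: Summation of geometric progression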

Exponential Smoothing is also a type of weighted moving average.
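One way to work through the question about the weights is sketched below. It assumes the recursion $F_{t+1} = \alpha D_t + (1-\alpha) F_t$ with $0 < \alpha \le 1$, where $D_t$ is the actual value and $F_t$ the forecast (the notation used in the MSE formula later in this lecture), and applies the geometric-series formula.

```latex
\begin{align*}
F_{t+1} &= \alpha \sum_{k=0}^{t-1} (1-\alpha)^k D_{t-k} + (1-\alpha)^t F_1 \\[4pt]
\text{sum of weights}
  &= \alpha \sum_{k=0}^{t-1} (1-\alpha)^k + (1-\alpha)^t \\
  &= \alpha \cdot \frac{1-(1-\alpha)^t}{1-(1-\alpha)} + (1-\alpha)^t \\
  &= 1-(1-\alpha)^t + (1-\alpha)^t \;=\; 1 .
\end{align*}
```

Because the weights are non-negative and sum to 1, the forecast is a weighted average of the past actual values and the initial forecast.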


• The name “exponential smoothing” is attributed to the use of the exponential window function during convolution. In other words, the predicted value $F_{t+1}$, which is called a “smoothed statistic” (can you recall the definition of “statistic”?), is the weighted average of all past observations, and the weights assigned to previous observations are proportional to the terms of the geometric progression $1, (1-\alpha), (1-\alpha)^2, \dots$
• Exponential smoothing is a rule-of-thumb technique for smoothing time series data using the exponential window function, whereas in the simple moving average the past observations are weighted equally. Exponential functions are used to assign exponentially decreasing weights over time.



Exponential Smoothing in Excel

• Now let’s predict Stephen Curry’s average points with exponential smoothing in Excel.

• Practice 5.1

 Please write your formula for exponential smoothing using two different smoothing factors 0.8 and 0.2.

 Draw a line chart when the smoothing factor is 0, 0.1, 0.9, and 1, respectively. Observe how the prediction changes with different
smoothing factors.

Age   Actual   alpha = 0.8   alpha = 0.2
21    17.5     17.5          17.5
22    18.6     17.5          17.5
23    14.7     18.4          17.7
24    22.9     15.4          17.1
25    24.0     21.4          18.3
26    23.8     23.5          19.4
27    30.1     23.7          20.3
28    25.3     28.8          22.3
29    27.6     26.0          22.9
30             27.3          23.8
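The table can be reproduced with a small Python sketch, which may help when checking the Excel formulas in Practice 5.1. It assumes the first forecast is set equal to the first actual value; the function name `exponential_smoothing` is made up for this illustration.

```python
# Stephen Curry's average points at ages 21..29 (the "Actual" column above).
actual = [17.5, 18.6, 14.7, 22.9, 24.0, 23.8, 30.1, 25.3, 27.6]

def exponential_smoothing(values, alpha):
    """Return forecasts for periods 1..n+1, with the first forecast set to
    the first actual value.

    Each later forecast is alpha * (last actual) + (1 - alpha) * (last forecast).
    """
    forecasts = [values[0]]
    for d in values:
        forecasts.append(alpha * d + (1 - alpha) * forecasts[-1])
    return forecasts

for alpha in (0.8, 0.2):
    rounded = [round(f, 1) for f in exponential_smoothing(actual, alpha)]
    print(alpha, rounded)
```

Calling the function with `alpha` set to 0 or 1 shows the two extremes: a flat line at the initial value, or a copy of the previous period's actual value.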



Simple Linear Regression

• Remember the trend line we added to the scatterplot? A linear trend line is defined as follows:

$y = a + bx$

• Here $a$ is the intercept (the value of $y$ when $x = 0$) and $b$ is the slope of the line. If we want to use the linear trend line for prediction, then $x$ can be the time period and $y$ will be the forecast of demand for period $x$. (data: demand)

Suppose the table describes the relationship between the period (x) and the demand for a certain product. How can we use a linear trend line to predict the demand in the 13th period? We can use the following formulas to calculate the slope and the intercept:

$b = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}$

x (period)   y (demand)
1            37
2            40
3            41
4            37
5            45
6            50
7            43
8            47
9            56
10           52
11           55
12           54

Once we get the formula of the linear trend line, we can let $x = 13$ and insert this value into the linear trend line to get the predicted value of $y$ (called $\hat{y}$). This model is called simple linear regression, and the way we estimate $a$ and $b$ is called least squares. The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean).


• Can you calculate the coefficients $a$ and $b$ by hand?

x (period)   y (demand)   xy     x^2
1            37           37     1
2            40           80     4
3            41           123    9
4            37           148    16
5            45           225    25
6            50           300    36
7            43           301    49
8            47           376    64
9            56           504    81
10           52           520    100
11           55           605    121
12           54           648    144
sum          78 ($\sum x$)     557 ($\sum y$)     3867 ($\sum xy$)     650 ($\sum x^2$)
average      6.5 ($\bar{x}$)   46.42 ($\bar{y}$)

Linear trend line: $\hat{y} = a + bx$
Forecast for period 13: $\hat{y}_{13} = a + 13b$
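The hand calculation can be checked with the Python sketch below. It assumes the usual least-squares formulas, slope = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²) and intercept = ȳ − slope·x̄, which match the sum and average rows of the table; the variable names are chosen here for illustration.

```python
# Demand data from the table above: periods 1..12.
x = list(range(1, 13))
y = [37, 40, 41, 37, 45, 50, 43, 47, 56, 52, 55, 54]

n = len(x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 3867, as in the table
sum_x2 = sum(xi * xi for xi in x)               # 650, as in the table
x_bar = sum(x) / n                              # 6.5
y_bar = sum(y) / n                              # about 46.42

# Least-squares slope and intercept.
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar

# Forecast for period 13 from the fitted trend line.
forecast_13 = a + b * 13
print(round(a, 2), round(b, 2), round(forecast_13, 2))
```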


• $x$ does not have to be the period; it can be anything that can predict $y$. For example, on an e-commerce platform such as Amazon or Taobao, we can use the number of consumer reviews to predict the sales of a product.

• In a regression model, $x$ is called the independent variable (predictor variable, explanatory variable) and $y$ is called the dependent variable (response variable, explained variable). Practice 5.2: estimate the regression function.

• Instead of calculating the regression coefficients (the slope and the intercept) by hand, we can use Excel’s add-in functions to do the calculation.

• Data tab > Analysis group > Data Analysis > Regression

Number of reviews   Sales
322                 142
880                 286
527                 223
829                 255
564                 210
697                 251
531                 202
356                 178
462                 172
331                 171

Tips
If you cannot find the Data Analysis option, click the File tab, click Options, and then click the Add-Ins category. In the Manage box, select Excel Add-ins and then click Go. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.

MAC Users: Tools > Excel add-ins


• How to use the Regression module in Data Analysis

 Select Regression, click OK

 Select the data for $x$ and $y$ in “Input X Range” and “Input Y Range”. If your selection includes the column names (the header), you need to tick “Labels” (here I selected A1 and B1, so I should tick “Labels”)

 You can select an Output location, either in the same worksheet or a new worksheet or a new workbook

Interpret Regression Outputs
How to interpret the results?

                    Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept           89.473         13.615           6.572    0.000     58.077      120.870     58.077        120.870
Number of reviews   0.217          0.023            9.280    0.000     0.163       0.271       0.163         0.271

We can interpret the estimate of the slope (which is 0.217) as follows: when the number of reviews increases by one unit, the sales will increase by 0.217 units. If the p-value of the slope is significant (<0.1), the change in y when x changes is significant. Otherwise, this change can be ignored in a linear sense.

• Coefficients. These are the estimates of the slope and the intercept. We can substitute these values into the regression function $y = a + bx$. The estimated regression function would be Sales = 89.473 + 0.217 × Number of reviews (you should use the concrete names of the variables in your assignment or exam; you cannot just use x and y if there is a concrete context).
• P-value. Intuitively but not strictly, you can understand the P-value as the probability that the corresponding coefficient equals 0 (strictly, it is the probability of obtaining an estimate at least this extreme if the true coefficient were 0). In general, if the P-value is larger than 0.1, we would say the corresponding estimated coefficient is not likely to be significantly different from 0 (i.e., not significant or insignificant). In particular, if the estimate of the slope is insignificant, we can say there is no significant linear relationship between $x$ and $y$.
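The Coefficients column can be cross-checked outside Excel. The sketch below is an illustration only: it applies the least-squares formulas to the reviews and sales data from Practice 5.2, the function name `least_squares` is made up, and it does not compute standard errors or p-values.

```python
# Number of reviews and sales, from the data table used in Practice 5.2.
reviews = [322, 880, 527, 829, 564, 697, 531, 356, 462, 331]
sales = [142, 286, 223, 255, 210, 251, 202, 178, 172, 171]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

a, b = least_squares(reviews, sales)
print(round(a, 3), round(b, 3))
```

The printed intercept and slope should agree with the Excel output above to three decimals.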

Comparison between Regression and Other Forecasting Methods


• Advantages. Using regression for forecasting is very flexible. We can substitute any value of the independent variable “x” within a reasonable domain (called the support in statistics; e.g., the support of the number of reviews is the non-negative integers) to predict the value of “y”. Regression also uses more of the information in the data than moving average and exponential smoothing. In regression, we do not have to manually assign an initial value for the first predicted value, as we do in exponential smoothing.
• Disadvantages. Regression is more vulnerable to extreme points (outliers) than moving average and exponential smoothing. Regression also requires strong assumptions: 1) the relationship is linear (check with a scatterplot); 2) the observations are independent of each other; etc.

Comments
• The relationship between the slope and forecasting ability
 For a simple linear regression model ($y = a + bx$), we want to know whether $x$ can predict $y$ or not. When the slope ($b$, the coefficient of $x$) is not 0 from a statistical point of view (we call this significantly different from 0, or significant; loosely, the probability of the slope being 0 is smaller than 10%, 5%, or 1%), we can say we are able to use $x$ to “linearly” predict $y$. Think about it: when the slope is 0, no matter what value $x$ takes, $y$ will always be the same constant (a flat line, $y = a$). In this situation, $x$ cannot provide any information for the prediction of $y$.
• The relationship between the p-value of the slope and the null hypothesis ($H_0$)
 Statisticians usually state a hypothesis that is the opposite of what they want to show, so that they can reject it and obtain the result they want. In our context, the null hypothesis ($H_0$) is that $x$ cannot linearly predict $y$, i.e., $b = 0$. Intuitively, the p-value of $b$ represents the probability that $b$ equals 0. Therefore, if the p-value is very small, the null hypothesis is not likely to be true. In particular, when the p-value is smaller than 10%/5%/1%, we say we can reject $H_0$ at the 0.1/0.05/0.01 significance level.
• About the linear assumption
 When the independent variable enters non-linearly, e.g., $y = a + bx^2$, we can calculate a new variable $z$ such that $z = x^2$ and run the following regression
model: $y = a + bz$. So this linear assumption means “linear” in the coefficient. A counter-example would be . In this example, we cannot say
is linear in since is the coefficient of .

Extension: Multiple Linear Regression
• We can extend the simple linear regression model to multiple regression, which allows predictions for systems with multiple independent variables, e.g., $y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$.
• We can directly use the Data Analysis module in Excel to estimate $a$ and the $b_i$.
• The only difference between simple linear regression and multiple linear regression is that the Input X Range extends from one column to multiple columns. Note that the Input X Range must be a contiguous reference.

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   39.34403       6.270459         6.274506   2.55E-08   26.83799    51.85007    26.83799      51.85007
weight      -0.00586       0.000853         -6.8692    2.16E-09   -0.00756    -0.00416    -0.00756      -0.00416
headroom    -0.20967       0.550716         -0.38072   0.704561   -1.30804    0.888698    -1.30804      0.888698
gearRatio   0.089589       1.373434         0.06523    0.948177   -2.64964    2.828817    -2.64964      2.828817

From the above regression results, we can draw the following conclusions.
• The estimated regression function should be:

mpg = 39.344 - 0.00586 × weight - 0.210 × headroom + 0.0896 × gearRatio
• The P-values of the coefficients of headroom and gearRatio are larger than 0.1, indicating that these coefficients are not significantly different from 0. Therefore, we can say headroom and gearRatio are not likely to linearly predict mpg.
• How to interpret the coefficient?
 When the weight increases by one unit, mpg will decrease by 0.0059 units, holding other factors constant. The coefficient we estimate using multiple linear regression reflects a “partial” relationship (i.e., a partial derivative, $\partial\,\text{mpg} / \partial\,\text{weight}$).
 When there are multiple factors influencing the dependent variable (DV) we care about (e.g., suppose grade is mainly influenced by effort level and intelligence), and we want to discuss how one variable influences the DV, we must keep the other factor(s) unchanged. Otherwise, such a discussion would be meaningless. Multiple regression helps us solve this issue.

• Practice 5.3. Transform “foreign” to numerical values and include it in the above regression.


• Suppose we estimate $y = a + b_1 x_1 + b_2 x_2$, which is a multiple linear regression (linear in $x_1$ and $x_2$). Let’s substitute $x_1 = x$ and $x_2 = x^2$; then the estimated function becomes $y = a + b_1 x + b_2 x^2$.

• If we want to test whether $x$ and $y$ have a quadratic relationship, we can use the p-value of the estimate of $b_2$. If the p-value of $b_2$ is larger than 0.1 (suppose the significance level is 0.1), $b_2$ is not significantly different from zero, which means $x$ and $y$ do not have a significant quadratic relationship.

Hands-on Exercise
Use the data Automobile. Estimate the following regression model in Excel: $\text{mpg} = a + b_1\,\text{weight} + b_2\,\text{weight}^2$

Question:
• When the weight increases by 1 unit, how will mpg change?
• What is the p-value of $b_2$? Do weight and mpg have a significant quadratic relationship?
• Should we include an insignificant variable (e.g., the quadratic term) in the regression model?

Data Analysis – Forecasting Assessment

Forecasting
Forecasting Assessment

• Recall the effect of the smoothing factor ($\alpha$)

 If $\alpha = 0$, then $F_{t+1} = F_t$, which means the forecast does not reflect the information in the recent actual data

 If $\alpha = 1$, then $F_{t+1} = D_t$, which means the forecast is based only on the most recent actual data

• How to choose the best α? We can also generalize this question to other prediction methods, i.e., how do we evaluate a forecasting result?

 Rule of thumb: if there is a trend in the time series data, i.e., a gradual, long-term upward or downward movement, then the most recent data is more useful for forecasting. Therefore, we should choose a short-period moving average or a large $\alpha$ for exponential smoothing.

 Mathematical methods: MAD, MSE, and $R^2$



Mean Absolute Deviation (MAD)
$MAD = \dfrac{\sum_{t=1}^{n} |D_t - F_t|}{n}$

• $|D_t - F_t|$ is the distance from the forecast value to the true value for a specific period/observation. If this distance is small, our prediction is very close to the true value, indicating the prediction is good.
• Here $t$ is the period number, $D_t$ is the actual value in period $t$, $F_t$ is the forecast for period $t$, and $n$ is the total number of periods. $|\cdot|$ means the absolute value.

Hands-on Experiment
• Consider the previous example, and suppose we use exponential smoothing for prediction. Calculate the MAD using Excel. According to MAD, which smoothing factor is better?
• Draw a line chart for the actual values and the predicted values with the two different smoothing factors. Can you tell which line is a better prediction? Can we reach the same conclusion using a line chart?

Period   Actual   Forecast (α = 0.3)   Error    |Error|   Forecast (α = 0.5)   |Error|
1        37       37.00                0.00     0.00      37.00                0.00
2        40       37.00                3.00     3.00      37.00                3.00
3        41       37.90                3.10     3.10      38.50                2.50
4        37       38.83                -1.83    1.83      39.75                2.75
5        45       38.28                6.72     6.72      38.38                6.63
6        50       40.29                9.69     9.69      41.69                8.31
7        43       43.20                -0.20    0.20      45.84                2.84
8        47       43.14                3.86     3.86      44.42                2.58
9        56       44.30                11.70    11.70     45.71                10.29
10       52       47.81                4.19     4.19      50.86                1.14
11       55       49.06                5.94     5.94      51.43                3.57
12       54       50.84                3.15     3.15      53.21                0.79
13                51.79
Sum                                    49.31    53.39                          44.40
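The MAD comparison can be sketched in Python as a check on the Excel work. The function names are hypothetical, the first forecast is assumed to equal the first actual value, and because the code keeps full precision, individual values may differ from the table in the second decimal.

```python
# Demand data for periods 1..12 (the actual values in the table above).
demand = [37, 40, 41, 37, 45, 50, 43, 47, 56, 52, 55, 54]

def smoothing_forecasts(values, alpha):
    """Forecasts for periods 1..n, with the first forecast set to the first actual value."""
    forecasts = [values[0]]
    for d in values[:-1]:
        forecasts.append(alpha * d + (1 - alpha) * forecasts[-1])
    return forecasts

def mad(values, forecasts):
    """Mean absolute deviation between actual values and forecasts."""
    return sum(abs(d - f) for d, f in zip(values, forecasts)) / len(values)

mad_03 = mad(demand, smoothing_forecasts(demand, 0.3))
mad_05 = mad(demand, smoothing_forecasts(demand, 0.5))
print(round(mad_03, 2), round(mad_05, 2))
```

The smoothing factor with the smaller MAD is the better one by this criterion.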



Mean-Squared Error (MSE)

$MSE = \dfrac{\sum_{t=1}^{n} (D_t - F_t)^2}{n}$

Practice 5.4
• Consider the previous example, and suppose we use exponential smoothing for prediction. Calculate the MSE using Excel. According to MSE, which smoothing factor is better?
• Draw a line chart for the actual values and the predicted values with the two different smoothing factors. Can you tell which line is a better prediction? Can we reach the same conclusion using a line chart?

Period   Actual   Forecast (α = 0.3)   Squared error   Forecast (α = 0.5)   Squared error
1        37       37.00                0.00            37.00                0.00
2        40       37.00                9.00            37.00                9.00
3        41       37.90                9.61            38.50                6.25
4        37       38.83                3.35            39.75                7.56
5        45       38.28                45.16           38.38                43.89
6        50       40.29                93.90           41.69                69.10
7        43       43.20                0.04            45.84                8.09
8        47       43.14                14.90           44.42                6.65
9        56       44.30                136.89          45.71                105.86
10       52       47.81                17.56           50.86                1.31
11       55       49.06                35.28           51.43                12.76
12       54       50.84                9.92            53.21                0.62
13                51.79                                53.61
Sum                                    375.61                               271.09

Should we include the first period in exponential smoothing when we calculate MSE/MAD?
 It could be unreasonable to use the first period, where the predicted value is set manually, to calculate the MAD or MSE. However, if we use MSE or MAD to compare different exponential smoothing models (i.e., to choose a better smoothing factor $\alpha$), then including or excluding the first period will not influence our decision, because the first-period error is zero for every $\alpha$ when $F_1 = D_1$. My suggestion is that since we have a predicted value for the first period (even if it is by our construction), we may include it in our calculation of MSE and MAD. (Heizer, J., Render, B., & Munson, C. (2008). Operations management. Prentice-Hall.)
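The point about the first period can be illustrated with a short Python sketch. The function names and the `skip_first` option are made up for this example, and the first forecast is assumed to equal the first actual value. Since the code keeps full precision, the sums may differ slightly from a table built from rounded errors.

```python
# Demand data for periods 1..12 from the forecasting example.
demand = [37, 40, 41, 37, 45, 50, 43, 47, 56, 52, 55, 54]

def smoothing_forecasts(values, alpha):
    """Forecasts for periods 1..n, with the first forecast set to the first actual value."""
    forecasts = [values[0]]
    for d in values[:-1]:
        forecasts.append(alpha * d + (1 - alpha) * forecasts[-1])
    return forecasts

def mse(values, forecasts, skip_first=False):
    """Mean squared error; skip_first drops period 1, whose forecast was set by hand."""
    pairs = list(zip(values, forecasts))
    if skip_first:
        pairs = pairs[1:]
    return sum((d - f) ** 2 for d, f in pairs) / len(pairs)

results = {}
for alpha in (0.3, 0.5):
    f = smoothing_forecasts(demand, alpha)
    results[alpha] = (mse(demand, f), mse(demand, f, skip_first=True))
    print(alpha, round(results[alpha][0], 2), round(results[alpha][1], 2))
```

Dropping the first period only changes the divisor from 12 to 11, so the ranking of the two smoothing factors stays the same.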



Coefficient of determination ($R^2$)

$R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

• In the above equation, $y_i$ represent the actual values, $\hat{y}_i$ represent the predicted values, and $\bar{y}$ is the mean of the actual values.
• $0 \le R^2 \le 1$; we can interpret $R^2$ as the proportion of the variation in the dependent variable that is explained by the prediction model. In general, the larger the $R^2$ is, the better the prediction model is (not always). We may use $R^2$ to evaluate a simple/multiple linear regression model. Do not use $R^2$ to evaluate exponential smoothing or moving average.
• If our focus is the dependent variable (y), e.g., we want to predict y given specific x’s, we generally want a big R-squared (perhaps larger than 0.7). However, if our focus is the independent variable (x), say we want to study whether some x has a significant relationship with y or whether some x can significantly predict y, then the p-value of the coefficient of this x is what we are concerned about. Even if we have a small R-squared, we can still conclude that x can significantly predict y.
• Excel reports $R^2$ for a regression model.
• Practice 5.5: calculate the $R^2$ of the regression model of consumer reviews. Compare your result with the $R^2$ reported by Excel.
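For Practice 5.5, a Python sketch along the following lines can serve as a cross-check. It assumes the common definition of the coefficient of determination, one minus the ratio of unexplained variation to total variation, and refits the least-squares line on the reviews and sales data; the variable names are chosen here for illustration.

```python
# Number of reviews and sales, from the data table used in Practice 5.2.
reviews = [322, 880, 527, 829, 564, 697, 531, 356, 462, 331]
sales = [142, 286, 223, 255, 210, 251, 202, 178, 172, 171]

n = len(reviews)
x_bar, y_bar = sum(reviews) / n, sum(sales) / n

# Least-squares slope and intercept for sales = a + b * reviews.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(reviews, sales))
     / sum((x - x_bar) ** 2 for x in reviews))
a = y_bar - b * x_bar

predicted = [a + b * x for x in reviews]
ss_res = sum((y - p) ** 2 for y, p in zip(sales, predicted))   # unexplained variation
ss_tot = sum((y - y_bar) ** 2 for y in sales)                  # total variation
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```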

Outlier Detection

Forecasting
Quartile

• Look at a series of numbers ordered ascendingly

 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

• The median is 6. Now let’s use the median to separate this series into two sections:

 1, 2, 3, 4, 5 and 7, 8, 9, 10, 11

 The median of the first half is 3, which is called the first quartile/lower quartile/Q1/25th percentile. Q1 splits off the lowest 25% of
data from the highest 75%

 The median of the second half is 9, which is called the third quartile/upper quartile/Q3/75th percentile. Q3 splits off the highest 25%
of data from the lowest 75%

 The median of the whole series is called second quartile/median/50th percentile, which cuts data set in half.

• Note that for discrete distributions there is no universal method for calculating the quartiles. The above method for Q1 and Q3 is just one of many. For example, if we include 6 in both halves, what are Q1 and Q3, respectively?


• The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. It is
also known as the lower quartile and the 25th percentile. It marks where 25% of the data is below or to the left of it (if data
is ordered on a timeline from smallest to largest).

• The second quartile (Q2) is the median or 50th percentile of a data set and 50% of the data lies below this point.

• The third quartile (Q3) is the middle value between the median and the highest value of the data set. It is also known as the
upper quartile or the 75th percentile and 75% of the data lies below this point.

• The minimum value is the 0th percentile, which means 0% of the data points are smaller than the minimum value.

• The maximum value is the 100th percentile, which means 100% of the data points are less than or equal to the maximum value.

• The minimum (smallest observation), the lower quartile or first quartile, the median (the middle value), the upper quartile
or third quartile, and the maximum (largest observation) are the most important percentiles. They are called the Five-
number summary.



Quartile Formula in Excel
• Excel functions to calculate quartiles

 QUARTILE.INC: this function can calculate quartiles including the percentiles at the boundary of the data series, i.e., 0th percentile
(minimum) and 100th percentile (maximum)

 QUARTILE.EXC: this function can calculate quartiles excluding the percentiles at the boundary of the data series, i.e., 0th percentile
(minimum) and 100th percentile (maximum)

 The two functions also use different algorithms to calculate quartiles. Another function, “QUARTILE”, from older versions of Excel works similarly to QUARTILE.INC.

Extension
To find the quartile, Excel first finds the “position” of that quartile. For example, if there are 11 numbers in ascending order (e.g., the first eleven prime numbers 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31), QUARTILE.EXC(series, 1) works as follows: it first calculates the position of Q1 as 3 and then returns the value in the third position, which is 5 (just like the INDEX function). QUARTILE.EXC calculates the position as (N+1)×P%, where P is the percentile, so it can be 25%, 50%, or 75%.

What if the position is not an integer? Suppose we only have ten numbers (e.g., 2, 3, 5, 7, 11, 13, 17, 19, 23, 29). The position for Q1 calculated by QUARTILE.EXC is (10+1)×25%=2.75. What is the number in position 2.75? There is no such number in the original data series, but we know it lies between the values in the second and third positions, which are 3 and 5, respectively. In this situation, we take 0.75 of the distance between 3 and 5 (position 2.75 minus position 2 equals 0.75), i.e., (5-3)×0.75=1.5. Therefore, Q1 should be 3+1.5=4.5, which is the value in position 2.75.

QUARTILE.INC calculates the position using a different algorithm, which is (N-1)×P%+1. When will these two algorithms return the same value? When N is very large, there will be no significant difference between these two algorithms. You may use either in practice.
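The two position rules described above can be sketched in Python. This is an illustration of the rules as stated in this extension, not Excel's actual implementation; the function names are made up, and positions outside 1..N raise an error.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position, using linear interpolation."""
    if not 1 <= position <= len(sorted_values):
        raise ValueError("position falls outside the data")
    lower = int(position)              # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    left, right = sorted_values[lower - 1], sorted_values[lower]
    return left + (right - left) * fraction

def quartile_exc(values, p):
    """Position rule (N + 1) * p, as described for QUARTILE.EXC."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) + 1) * p)

def quartile_inc(values, p):
    """Position rule (N - 1) * p + 1, as described for QUARTILE.INC."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) - 1) * p + 1)

primes_11 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
primes_10 = primes_11[:10]
print(quartile_exc(primes_11, 0.25), quartile_exc(primes_10, 0.25))
print(quartile_inc(primes_11, 0.25), quartile_inc(primes_10, 0.25))
```

Trying `p = 0.5` with both functions is a quick way to explore the question of when the two rules agree.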



Outlier Detection
• Interquartile range (IQR): $IQR = Q3 - Q1$. IQR can be interpreted as the “range” of the middle portion of the data. This statistic is quite resistant to extreme observations in the data set.

• Use IQR to define outliers

 Mild Outliers: any data points larger than $Q3 + 1.5\,IQR$ or smaller than $Q1 - 1.5\,IQR$; meanwhile, these data points should be smaller than or equal to $Q3 + 3\,IQR$ or larger than or equal to $Q1 - 3\,IQR$. In other words, any observations that lie in the interval $[Q1 - 3\,IQR,\ Q1 - 1.5\,IQR) \cup (Q3 + 1.5\,IQR,\ Q3 + 3\,IQR]$

 Serious (or Extreme) Outliers: any data points larger than $Q3 + 3\,IQR$ or smaller than $Q1 - 3\,IQR$. In other words, any observations that lie in $(-\infty,\ Q1 - 3\,IQR) \cup (Q3 + 3\,IQR,\ +\infty)$
• Practice 5.6: Find the mild outliers and extreme outliers using QUARTILE.INC

Comments
Using IQR to find outliers is a rule of thumb. This method is widely used in practice. In general, it works better when data is normally distributed. Even if the data does not follow a normal distribution, this method can still give us some hints about the outliers.

[Figure: number line with fences at Q1-3IQR, Q1-1.5IQR, Q1, Q2, Q3, Q3+1.5IQR, and Q3+3IQR. Mild outliers lie between the 1.5IQR and 3IQR fences on each side; serious outliers lie beyond the 3IQR fences.]
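The IQR rule can be sketched in Python as follows. The sample data are hypothetical and made up for illustration, and `classify_outliers` is a name chosen here. The quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates at position (N-1)×P+1, the rule this lecture describes for QUARTILE.INC.

```python
import statistics

def classify_outliers(values):
    """Split values into mild and extreme outliers using the IQR fences."""
    # Inclusive quartiles: interpolate at position (N - 1) * p + 1.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)          # beyond the 3 * IQR fences
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)             # between the 1.5 and 3 * IQR fences
    return mild, extreme

# Hypothetical sample, made up for illustration only.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
mild, extreme = classify_outliers(sample)
print(mild, extreme)
```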



Extension – log-transformation
• Removing outliers will always lead to information loss. As mentioned earlier, using IQR to find outliers works better when the data is normally distributed. In certain situations, we can transform the variable before we run the regression. The motivation is that we usually require the dependent variable in a regression model to have a normal distribution when the number of observations is not large enough.

• For simplicity, let’s consider just one situation: Y is non-negative (e.g., income, sales, etc.).

 When the distribution of a variable is highly right-skewed, we can take the natural logarithm of this variable. After taking the log, the distribution tends to be closer to normal. If a non-negative variable contains zeros, we can take log(Y+1).

 If we take the log transformation of both X and Y, the interpretation of the coefficient of X is as follows: when X changes by 1%, Y will change by b% (b is the estimated coefficient).

Normal distribution is very common; intuitively, extreme things happen rarely and ordinary things happen a lot, e.g., animals’ height/weight; launch time of your computers.

Rule of Thumb
1 Something is affected by multiple factors
2 These factors will not influence each other
3 No factor dominates

A right-skewed distribution has a long right tail. Such a distribution describes a situation in which a large number of observations have small or ordinary values while very few observations have extremely large values. Typical examples include household income, sales of products on platforms, military power of countries, etc.
• Such a distribution reflects an unbalanced situation, the “Matthew Effect”, which is summarized by the adage “the rich get richer and the poor get poorer”.
• The median is more representative than the mean when estimating the center because the mean is distorted by extreme values. “Jack Ma and I can earn 1 billion per year on average”.
• How to judge skewness? 1) Calculate skewness; 2) Draw a histogram; 3) Use prior knowledge
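The effect of the log transformation on skewness can be illustrated with a small Python sketch. The income figures are hypothetical and made up for this example, and `skewness` here is the simple moment-based version (third central moment divided by the cubed standard deviation), which can differ slightly from the sample-adjusted value a spreadsheet function reports.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

# Hypothetical right-skewed incomes, made up for illustration only.
incomes = [20, 25, 30, 30, 35, 40, 45, 60, 120, 1000]
logged = [math.log(y + 1) for y in incomes]   # log(Y+1), as suggested above

# The median stays near the typical values; the mean is pulled up by the largest one.
print(statistics.median(incomes), statistics.fmean(incomes))
# Skewness before and after the log transformation.
print(round(skewness(incomes), 2), round(skewness(logged), 2))
```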
