Lecture 5 - Spring 2024
• SUMPRODUCT
Get the summation of several products (e.g., total grade):
Get the summation of a series with the same structure (expressed by the summation symbol $\sum$), e.g., the sample variance.
Group analysis: combine SUMPRODUCT function with logical test:
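SUMPRODUCT(--(A2:A9 = "north")) is equivalent to COUNTIF(A2:A9, "north"). The double negative (--) converts each TRUE/FALSE test result into 1/0 so that SUMPRODUCT can add them up.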
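To check the SUMPRODUCT logic outside Excel, here is a minimal Python sketch. The scores, weights, and region labels are made-up illustration data, not values from the lecture workbook.

```python
# Hypothetical data for illustration only.
scores = [80, 90, 70]        # e.g., homework, midterm, final
weights = [0.2, 0.3, 0.5]    # grade weights that sum to 1

# SUMPRODUCT(scores, weights): multiply pairwise, then add up.
total_grade = sum(s * w for s, w in zip(scores, weights))

regions = ["north", "south", "north", "east", "north", "west", "south", "east"]

# SUMPRODUCT(--(range = "north")): each TRUE becomes 1, each FALSE becomes 0.
north_count = sum(int(r == "north") for r in regions)

# COUNTIF(range, "north") gives the same count.
countif_result = regions.count("north")

print(total_grade, north_count, countif_result)
```

Both counting approaches agree because adding up 1/0 flags is the same as counting the matches.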
• PivotTable
Use PivotTable when we want to summarize the information (i.e., a statistic such as mean, median, summation, etc.) by one or more columns. These columns are generally
non-numerical such as region, location, type, year, category, etc. For example, we want to know the average sales in each region.
Insert Values to PivotTable
Rows Field and Columns Field: Rows mean Row Names and Columns mean Column Names, put non-numerical values into these two fields as row/column names
Values Field: Put numerical values into this field. A statistic will be calculated for the added values. We can put multiple numerical values into Values Field and we
can also repeatedly put one numerical value with different statistics here.
Edit Values
How to change the statistic: under “Values” area > Value Field Settings or under the Analyze Tab > Active Field group > Field Settings. We can also change the
number format here.
How to combine multiple groups: Select multiple column/row names at the same time, then click Group Selection under Analyze Tab > “Group” group (or right click
> Group).
Change the group name for this newly generated group
Change the label for this new categorization
How to calculate percentage of parent total
Show Values As > % of Parent Total (need to decide which parent to use by choosing the Base Field)
Quick Review
• Figure vs. Table
The table contains the complete information and you can check any data point you want. Figures are easier to follow and
people need much less time to understand a figure than a table. A figure can also show the trend in the data.
In an official report or a research paper, complete information is necessary. In general, people use figures to describe some
preliminary results and use tables to report the main results.
• Chart Elements: Chart area, Plot area, X-axis, Y-axis, and Legend
• How to insert a chart, after selecting the data
Method 1: Quick Analysis (Ctrl+Q)
Method 2: Insert tab > Charts group > directly choose an appropriate chart
Method 3: Insert tab > Charts group > Recommended Charts > choose an appropriate chart
DSME 2051 Business Information Systems
Lecture 5: Business Analytics
Forecasting
Introduction to Forecasting
Stephen Curry’s Average Points
• Deduce the next number:
1, 2, 3…
1, 3, 5…
1, 1, 2, 3, 5, 8…
• Why can we deduce the next number?
We depend on the information from existing numbers.
Similarly, if we have a time frame, we can use previous information to predict the information of the next time point (i.e., the future).
Those numbers share a strict math pattern, but usually we do not have such a beautiful pattern.

Age    PTS
21     17.5
22     18.6
23     14.7
24     22.9
25     24.0
26     23.8
27     30.1
28     25.3
29     27.6
Can we predict his PTS next year?

Hints
• The main advantage of prediction by a time frame is: the environment will not change largely, especially in a short time. Next year, we know Stephen Curry is very likely to be a player at the same level. Usually, the most recent information will be the most useful (when there is no seasonal effect).
• Assumption: things will not change a lot in a short time.
• One possible solution: directly use data from last period. However, we still have a lot of information left. Maybe we can
use information from last two years, or three years.
• In financial applications a simple moving average (SMA) is the unweighted mean of the previous data points.
One-year moving average: use the data point in the most recent period to predict next period
Two-year moving average: use the average of the data points in the most recent two periods to predict next period
…
As we increase the number of periods used to predict the next period, the prediction will become smoother, because we are using more information from previous periods. Therefore, more variation in the data will be cancelled out when we take the average. If the data is sensitive to time change, using a long-period moving average may not be a good idea because information in the most recent periods will be “attenuated” when we take the average over a long period.
• One issue for the simple moving average is that if we use too many previous periods, the recent information, which may have a bigger impact on the next period, will be diluted. However, if we use very few previous periods, the prediction can also be unstable.
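The moving-average idea can be sketched in a few lines of Python. This is only an illustration using the Stephen Curry PTS values listed earlier; the function name is my own.

```python
def moving_average_forecast(history, periods):
    """Forecast the next value as the unweighted mean of the last `periods` values."""
    if periods < 1 or periods > len(history):
        raise ValueError("periods must be between 1 and len(history)")
    recent = history[-periods:]
    return sum(recent) / periods

# Stephen Curry's average points at ages 21 to 29.
pts = [17.5, 18.6, 14.7, 22.9, 24.0, 23.8, 30.1, 25.3, 27.6]

# One-, two-, and three-year moving averages for the next season.
for k in (1, 2, 3):
    print(k, round(moving_average_forecast(pts, k), 2))
```

A larger `periods` averages over more seasons, so a single unusual season moves the forecast less.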
• To overcome the above shortcomings, we can use a weighted moving average instead of simple moving average.
Specifically, we can add a weight for each period and more recent period will have a larger weight. In this situation, the
information in recent periods will not be attenuated too much even if we use much historical information.
• In Excel, use three-month weighted moving average to predict the data in November. Data: Weighted Moving Average
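A weighted moving average can be sketched the same way. The monthly data for the Excel exercise is not reproduced here, so this sketch reuses the Stephen Curry PTS values, and the weights 1, 2, 3 are a hypothetical choice that simply gives the most recent period the largest weight.

```python
def weighted_moving_average(history, weights):
    """Forecast the next value from the last len(weights) observations.

    `weights` is ordered from oldest to most recent and is normalized here,
    so it does not have to sum to 1.
    """
    if len(weights) > len(history):
        raise ValueError("not enough history for the given weights")
    recent = history[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

pts = [17.5, 18.6, 14.7, 22.9, 24.0, 23.8, 30.1, 25.3, 27.6]
weights = [1, 2, 3]  # hypothetical weights: the most recent season counts most

forecast = weighted_moving_average(pts, weights)
print(round(forecast, 2))
```

With equal weights the function reduces to the simple moving average.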
• Exponential smoothing
$F_{t+1} = \alpha D_t + (1 - \alpha) F_t$
where $F_{t+1}$ represents the forecast value for the next period, $D_t$ is the actual value for the present period, and $F_t$ is the previously determined forecast for the present period. $\alpha$ is called the smoothing factor and $0 \le \alpha \le 1$.
• Suppose we have two-period data, $D_1$ and $D_2$, and we want to predict the data point in the third period. Then $F_3 = \alpha D_2 + (1 - \alpha) F_2 = \alpha D_2 + (1 - \alpha)[\alpha D_1 + (1 - \alpha) F_1]$.
• As we see, to get the prediction result using exponential smoothing, we have to assign an initial value for $F_1$, which is the prediction result for the first period. The easiest way to do this is to set $F_1$ equal to $D_1$.
What is $F_3$ if we let $F_1 = D_1$?
• Can you calculate the summation of the weights using high-school mathematics?
• The name “exponential smoothing” is attributed to the use of the exponential window function during convolution. In other words, the predicted value $F_{t+1}$, which is called a “Smoothed Statistic” (can you recall the definition of “statistic”?), is the weighted average of all past observations, and the weights assigned to previous observations are proportional to the terms of the geometric progression $1, (1-\alpha), (1-\alpha)^2, \ldots, (1-\alpha)^n, \ldots$
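One way to see where the geometric weights come from, and why they sum to one, is to unroll the recursion. The sketch below uses $D_t$ for actual values and $F_t$ for forecasts (the notation of the MSE formula in this lecture) and assumes $0 < \alpha \le 1$.

```latex
\begin{aligned}
F_{t+1} &= \alpha D_t + (1-\alpha) F_t \\
        &= \alpha D_t + \alpha(1-\alpha) D_{t-1} + (1-\alpha)^2 F_{t-1} \\
        &= \alpha \sum_{k=0}^{t-1} (1-\alpha)^k D_{t-k} + (1-\alpha)^t F_1 , \\[4pt]
\text{sum of weights}
        &= \alpha \sum_{k=0}^{t-1} (1-\alpha)^k + (1-\alpha)^t
         = \alpha \cdot \frac{1-(1-\alpha)^t}{1-(1-\alpha)} + (1-\alpha)^t
         = 1 .
\end{aligned}
```

The second line of the sum uses the high-school formula for a finite geometric series.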
• Exponential smoothing is a rule of thumb technique for smoothing time series data using the exponential window
function, whereas in the simple moving average, the past observations are weighted equally. Exponential functions
are used to assign exponentially decreasing weights over time.
• Now let’s predict Stephen Curry’s average points with exponential smoothing in Excel.
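Before building the Excel sheet, the recursion can be sketched in Python. This is a minimal illustration that assumes the first forecast is set equal to the first actual value, as suggested above; the function name is my own.

```python
def exponential_smoothing(actuals, alpha):
    """Return forecasts F_1..F_{n+1} with F_{t+1} = alpha*D_t + (1-alpha)*F_t.

    The first forecast is initialized to the first actual value (F_1 = D_1).
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    forecasts = [actuals[0]]
    for actual in actuals:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

# Stephen Curry's average points at ages 21 to 29.
pts = [17.5, 18.6, 14.7, 22.9, 24.0, 23.8, 30.1, 25.3, 27.6]

for alpha in (0.2, 0.8):
    next_season = exponential_smoothing(pts, alpha)[-1]
    print(alpha, round(next_season, 2))
```

The last element of the returned list is the forecast for the period after the data ends.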
• Practice 5.1
Please write your formula for exponential smoothing using two different smoothing factors 0.8 and 0.2.
Draw a line chart when the smoothing factor is 0, 0.1, 0.9, and 1, respectively. Observe how the prediction changes with different
smoothing factors.
• Remember the trend line we added to the scatterplot? A Linear Trend Line is defined as follows: $y = a + bx$
• Where $a$ is the intercept (the value of $y$ when $x = 0$) and $b$ is the slope of the line. If we want to use the linear trend line for prediction, then $x$ can be the time period and $y$ will be the forecast for demand for period $x$. (data: demand)
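Regression Forecasting
Simple Linear Regression
x (period)   y (demand)   xy     x^2
1            37           37     1
2            40           80     4
3            41           123    9
4            37           148    16
5            45           225    25
6            50           300    36
7            43           301    49
8            47           376    64
9            56           504    81
10           52           520    100
11           55           605    121
12           54           648    144
sum          78           557    3867   650      (the four sums are $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$)
average      6.5 ($\bar{x}$)   46.42 ($\bar{y}$)
Linear trend line: $b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \approx 1.72$ and $a = \bar{y} - b\bar{x} \approx 35.21$, so $y = 35.21 + 1.72x$
Forecast for period 13: $y = 35.21 + 1.72 \times 13 \approx 57.6$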
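The hand calculation from the table can be checked with a short Python sketch. It uses the standard least-squares formulas for the slope and intercept; the variable names are my own.

```python
periods = list(range(1, 13))
demand = [37, 40, 41, 37, 45, 50, 43, 47, 56, 52, 55, 54]

n = len(periods)
x_bar = sum(periods) / n
y_bar = sum(demand) / n
sum_xy = sum(x * y for x, y in zip(periods, demand))
sum_x2 = sum(x * x for x in periods)

# Least-squares slope (b) and intercept (a) of the trend line y = a + b*x.
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar

forecast_13 = a + b * 13
print(round(a, 4), round(b, 4), round(forecast_13, 2))
```

The intermediate sums match the sum row of the table (3867 and 650).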
• The $x$ does not have to be the period; it can be anything that can predict $y$. For example, on an e-commerce platform such as Amazon and Taobao, we can use the number of consumers’ reviews to predict the sales of a product.
• In a regression model, $x$ is called the independent variable (predictor variable, explanatory variable) and $y$ is called the dependent variable (response variable, explained variable).
• Practice 5.2: estimate the regression function.
• Instead of calculating the regression coefficients (the slope and the intercept) by hand, we can use the add-in functions of Excel to conduct the calculation.
• Data tab > Analysis group > Data Analysis > Regression

Tips
If you cannot find the Data Analysis option, click the File tab, click Options, and then click the Add-Ins category. In the Manage box, select Excel Add-ins and then click Go. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.

Number of reviews   Sales
322                 142
880                 286
527                 223
829                 255
564                 210
697                 251
531                 202
356                 178
462                 172
331                 171
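As a cross-check on the Excel add-in, the slope and intercept for the review data can be computed directly. The sketch below is a plain least-squares fit; the function name `fit_line` is my own.

```python
reviews = [322, 880, 527, 829, 564, 697, 531, 356, 462, 331]
sales = [142, 286, 223, 255, 210, 251, 202, 178, 172, 171]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(reviews, sales)
print(round(intercept, 3), round(slope, 3))
```

The two numbers should agree with the Coefficients column of the Excel regression output.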
Select the data for $x$ and $y$ in “Input X Range” and “Input Y Range”. If you select the column names (the header), you need to check “Labels” (here I selected A1 and B1, so I should check “Labels”).
You can select an Output location, either in the same worksheet, a new worksheet, or a new workbook.
Interpret Regression Outputs
How to interpret the results?

                    Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept           89.473        13.615          6.572   0.000    58.077     120.870    58.077       120.870
Number of reviews   0.217         0.023           9.280   0.000    0.163      0.271      0.163        0.271

We can interpret the estimate of the slope (which is 0.217) as follows: when the number of reviews increases by one unit, the sales will increase by 0.217 units.
If the p-value of the slope is significant (<0.1), the change in y when x changes is significant. Otherwise, this change can be ignored in a linear sense.
• Coefficients. These are the estimates of the slope and the intercept. We can substitute these values into the regression function $y = a + bx$. The estimated regression function would be $\text{Sales} = 89.473 + 0.217 \times \text{Number of reviews}$ (you should use the concrete names of the variables in your assignment or exam; you cannot just use x and y if there is a concrete context).
• P-value. Intuitively but not strictly, you can understand the P-value as the probability that the corresponding estimated coefficient equals 0. In general, if the P-value is larger than 0.1, we would say the corresponding estimated coefficient is not likely to be significantly different from 0 (i.e., not significant or insignificant). In particular, if the estimate of the slope is insignificant, we can say there is no significant linear relationship between $x$ and $y$.
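The Standard Error and t Stat columns can also be reproduced by hand. The sketch below uses the usual simple-regression formulas with $n - 2$ degrees of freedom; it stops at the t statistic because turning it into a p-value needs the t distribution, which Python's standard library does not provide.

```python
import math

reviews = [322, 880, 527, 829, 564, 697, 531, 356, 462, 331]
sales = [142, 286, 223, 255, 210, 251, 202, 178, 172, 171]

n = len(reviews)
x_bar = sum(reviews) / n
y_bar = sum(sales) / n
s_xx = sum((x - x_bar) ** 2 for x in reviews)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(reviews, sales))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual variance uses n - 2 degrees of freedom (two estimated coefficients).
residuals = [y - (intercept + slope * x) for x, y in zip(reviews, sales)]
residual_variance = sum(e ** 2 for e in residuals) / (n - 2)

slope_se = math.sqrt(residual_variance / s_xx)
intercept_se = math.sqrt(residual_variance * (1 / n + x_bar ** 2 / s_xx))

# t Stat = coefficient / standard error.
slope_t = slope / slope_se
intercept_t = intercept / intercept_se
print(round(slope_se, 3), round(slope_t, 3), round(intercept_se, 3), round(intercept_t, 3))
```

A large absolute t statistic corresponds to a small p-value, which is why the slope here is reported as significant.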
Comments
• The relationship between the slope and forecasting ability
For a simple linear regression model ($y = a + bx$), we want to know whether $x$ can predict $y$ or not. When the slope ($b$, the coefficient of $x$) is not 0 from a statistical point of view (we call this significantly different from 0, or significant, or the probability of the slope being 0 is smaller than 10% or 5% or 1%), we can say we are able to use $x$ to “linearly” predict $y$. Think about this: when the slope is 0, no matter what value $x$ takes, $y$ will always be the same number or a constant (a flat line, $y = a$). In this situation, $x$ cannot provide any information for the prediction of $y$.
• The relationship between the p-value of the slope and the Null hypothesis ($H_0$)
Statisticians usually make a hypothesis which is not what they want so that they may reject it and get the result they want. Under our context, “$x$ cannot linearly predict $y$” is our null hypothesis ($H_0: b = 0$). Intuitively but not strictly, the p-value of $b$ represents the probability that $b$ equals 0; strictly, it is the probability of obtaining an estimate at least as far from 0 as ours if $H_0$ were true. Therefore, if the p-value is very small, the null hypothesis is not likely to be true. In particular, when the p-value is smaller than 10%/5%/1%, we say we can reject $H_0$ at the 0.1/0.05/0.01 significance level.
• About the linear assumption
When the independent variable is non-linear, e.g., $y = a + bx^2$, we can calculate a new variable $z$ such that $z = x^2$ and run the following regression model: $y = a + bz$. So this linear assumption means “linear” in the coefficient. A counter-example would be $y = a + b^2 x$. In this example, we cannot say $y$ is linear in $b$ since $b^2$ is the coefficient of $x$.
Extension: Multiple Linear Regression
• We can extend the simple linear regression model to multiple regression, which allows predictions of systems with multiple independent variables, e.g., $y = a + b_1 x_1 + b_2 x_2$.
• We can directly use the Data Analysis Module to estimate $a$ and $b_i$ in Excel.
• The only difference between simple linear regression and multiple linear regression is to extend the Input X Range from
one column to multiple columns. Notice, the Input range of X must be a contiguous reference.
            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept   39.34403      6.270459        6.274506  2.55E-08  26.83799   51.85007   26.83799     51.85007
weight      -0.00586      0.000853        -6.8692   2.16E-09  -0.00756   -0.00416   -0.00756     -0.00416
headroom    -0.20967      0.550716        -0.38072  0.704561  -1.30804   0.888698   -1.30804     0.888698
gearRatio   0.089589      1.373434        0.06523   0.948177  -2.64964   2.828817   -2.64964     2.828817
From the above regression results, we can have the following conclusions.
• The estimated regression function should be: $\text{mpg} = 39.344 - 0.00586 \times \text{weight} - 0.210 \times \text{headroom} + 0.0896 \times \text{gearRatio}$
• The P-values of the coefficients of headroom and gearRatio are larger than 0.1, indicating that these coefficients are not significantly different from 0. Therefore, we can say headroom and gearRatio are not likely to linearly predict the mpg.
• How to interpret the coefficient?
When the weight increases by one unit, the mpg will decrease by 0.0059 units, holding other factors constant. The coefficient we estimate using multiple linear regression reflects a “partial” relationship (i.e., $\partial\,\text{mpg} / \partial\,\text{weight}$).
When there are multiple factors influencing the DV we care about, (e.g., suppose grade is mainly influenced by effort level and
intelligence), and we want to discuss how one variable influences the DV, we must keep the other factor(s) unchanged. Otherwise, such
discussion would be meaningless. The power of multiple regression can help us solve this issue.
• Practice 5.3. Transform “foreign” to numerical values and include it in the above regression.
• Suppose we estimate $y = a + b_1 x + b_2 x^2$, which is a multiple linear regression (linear in $b_1$ and $b_2$). Let’s substitute $x_1 = x$ and $x_2 = x^2$, then the estimation function becomes $y = a + b_1 x_1 + b_2 x_2$.
• If we want to test if $x$ and $y$ have a quadratic relationship, we can use the p-value of the estimate of $b_2$. If the p-value of $b_2$ is larger than 0.1 (suppose the significance level is 0.1), which means $b_2$ is not significantly different from zero, then $x$ and $y$ do not have a significant quadratic relationship.
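The substitution trick can be illustrated with a small Python sketch: compute the squared column first, then run an ordinary multiple regression on both columns. The data below is hypothetical (generated from $y = 2 + 3x + 0.5x^2$ with no noise), and the two helper functions are my own minimal least-squares implementation, not what Excel runs internally.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - known) / aug[r][r]
    return coeffs

def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from y = 2 + 3x + 0.5x^2 (no noise).
x = [1, 2, 3, 4, 5, 6]
y = [2 + 3 * xi + 0.5 * xi ** 2 for xi in x]

x1 = x                        # x1 = x
x2 = [xi ** 2 for xi in x]    # x2 = x^2, a new column computed before fitting

a, b1, b2 = fit_ols([x1, x2], y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

The model is non-linear in $x$ but still linear in the coefficients, which is why an ordinary linear fit recovers them.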
Hands-on Exercise
Use the data Automobile. Estimate the following regression model by Excel: $\text{mpg} = a + b_1 \times \text{weight} + b_2 \times \text{weight}^2$
Question:
• When the weight increases by 1 unit, how will mpg change?
• What is the p-value of $b_2$? Do weight and mpg have a significant quadratic relationship?
• Should we include the insignificant variable (e.g., the quadratic term) into the regression model?
Data Analysis – Forecasting Assessment
Forecasting
Forecasting Assessment
If $\alpha = 0$, then $F_{t+1} = F_t$, which means forecasting does not reflect the information in the recent actual data
If $\alpha = 1$, then $F_{t+1} = D_t$, which means forecasting is only based on the most recent actual data
• How to choose the best α? We can also generalize this question to other prediction methods, i.e. how to evaluate the
forecasting result?
Rule of thumb: if there is a trend in the time series data, which means there is a gradual, long-term up or down movement, then the most recent data is more useful for forecasting. Therefore, we need to choose a short-period moving average or a large $\alpha$ for exponential smoothing.
$MSE = \frac{\sum (D_t - F_t)^2}{n}$
Practice 5.4
• Consider the previous example, suppose we use exponential smoothing for prediction. Calculate the MSE using Excel. According to MSE, which is better?
• Draw a line chart for the actual values and the predictive values with two different smoothing factors. Can you tell which line is a better prediction? Can we have the same conclusion using a line chart?

Period  Actual  Forecast (α = 0.3)  Squared error  Forecast (α = 0.5)  Squared error
1       37      37.00               0.00           37.00               0.00
2       40      37.00               9.00           37.00               9.00
3       41      37.90               9.61           38.50               6.25
4       37      38.83               3.35           39.75               7.56
5       45      38.28               45.16          38.38               43.89
6       50      40.29               93.90          41.69               69.10
7       43      43.20               0.04           45.84               8.09
8       47      43.14               14.90          44.42               6.65
9       56      44.30               136.89         45.71               105.86
10      52      47.81               17.56          50.86               1.31
11      55      49.06               35.28          51.43               12.76
12      54      50.84               9.92           53.21               0.62
13              51.79                              53.61
Sum                                 375.61                             271.09
Should we include the first period in exponential smoothing when we calculate MSE/MAD (MAD is the mean absolute deviation, i.e., the average of $|D_t - F_t|$)?
It could be unreasonable to use the first period, where the predictive value is achieved by manual construction, to calculate the MAD or MSE. However, if we use MSE or MAD to compare different exponential smoothing models (i.e., choosing a better smoothing factor $\alpha$), then either including or excluding the first period will not influence our decision. My suggestion is that since we have the predictive value for the first period (even if it is by our construction), we may include it in our calculation for MSE and MAD. (Heizer, J., Render, B., & Munson, C. (2008). Operations management. Prentice-Hall.)
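The comparison in Practice 5.4 can be sketched in Python as follows. The sketch initializes the first forecast to the first actual value and includes the first period in the MSE, as suggested above. Its sums of squared errors come out close to the table (about 375.7 and 271.09); small differences can arise from rounding in the slide.

```python
def exponential_smoothing(actuals, alpha):
    """Return forecasts F_1..F_n aligned with the actual values (F_1 = D_1)."""
    forecasts = [actuals[0]]
    for actual in actuals[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

def mse(actuals, forecasts):
    """Mean squared error: average of (D_t - F_t)^2 over all periods."""
    return sum((d - f) ** 2 for d, f in zip(actuals, forecasts)) / len(actuals)

demand = [37, 40, 41, 37, 45, 50, 43, 47, 56, 52, 55, 54]

mse_03 = mse(demand, exponential_smoothing(demand, 0.3))
mse_05 = mse(demand, exponential_smoothing(demand, 0.5))
print(round(mse_03, 2), round(mse_05, 2))
```

The smoothing factor with the smaller MSE tracks this upward-trending series more closely.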
• $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$. In the above equation, $y_i$ represent the actual values, $\hat{y}_i$ represent the predicted values, and $\bar{y}$ is the mean of the actual values.
• $0 \le R^2 \le 1$; we can interpret $R^2$ such that $R^2 \times 100\%$ of the variation in the dependent variable is explained by the prediction model. In general, the larger the $R^2$ is, the better the prediction model is (not always). We may use $R^2$ to evaluate simple/multiple linear regression models. Do not use $R^2$ to evaluate exponential smoothing or moving average.
• If our focus is the dependent variable (y), e.g., we want to predict y given specific x’s, we generally want a big R-squared
(perhaps larger than 0.7). However, if our focus is the independent variable (x), say we want to study whether some x has
a significant relationship or whether some x can significantly predict y, in this situation, the p-value of the coefficient of
this x is what we are concerned about. Even if we have a small R-squared, we can still build our conclusion such that x
can significantly predict y.
• Excel reports $R^2$ for a regression model.
• Practice 5.5: calculate the $R^2$ of the regression model of consumer review. Compare your results with the $R^2$ reported by Excel.
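For Practice 5.5, the calculation can be sketched in Python using the standard definition of R-squared (one minus the ratio of the residual sum of squares to the total sum of squares). The variable names are my own.

```python
reviews = [322, 880, 527, 829, 564, 697, 531, 356, 462, 331]
sales = [142, 286, 223, 255, 210, 251, 202, 178, 172, 171]

n = len(reviews)
x_bar = sum(reviews) / n
y_bar = sum(sales) / n

# Least-squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(reviews, sales))
         / sum((x - x_bar) ** 2 for x in reviews))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * x for x in reviews]

# Variation left unexplained by the model vs. total variation in sales.
ss_residual = sum((y - p) ** 2 for y, p in zip(sales, predicted))
ss_total = sum((y - y_bar) ** 2 for y in sales)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))
```

The printed value can be compared with the "R Square" line of the Excel regression output.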
Quartile
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
• The median is 6. Now let’s use the median to separate this series into two sections:
1, 2, 3, 4, 5 and 7, 8, 9, 10, 11
The median of the first half is 3, which is called the first quartile/lower quartile/Q1/25th percentile. Q1 splits off the lowest 25% of
data from the highest 75%
The median of the second half is 9, which is called the third quartile/upper quartile/Q3/75th percentile. Q3 splits off the highest 25%
of data from the lowest 75%
The median of the whole series is called second quartile/median/50th percentile, which cuts data set in half.
• Notice, for discrete distributions, there is no universal method on calculating the quartiles. The above method to calculate
Q1 and Q3 is just one of many ways to calculate the quartiles. For example, if we include 6 into both series, what is Q1
and Q3, respectively?
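The two variants of the "median of each half" method can be compared with a short Python sketch. The function name and the `include_median` flag are my own labels for the two choices described above.

```python
from statistics import median

def quartiles_by_halves(values, include_median):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of values, `include_median` decides whether the overall
    median is counted in both halves or in neither.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)

series = list(range(1, 12))  # 1, 2, ..., 11

print(quartiles_by_halves(series, include_median=False))
print(quartiles_by_halves(series, include_median=True))
```

The two calls give different Q1 and Q3 for the same data, which illustrates why there is no universal method.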
• The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. It is
also known as the lower quartile and the 25th percentile. It marks where 25% of the data is below or to the left of it (if data
is ordered on a timeline from smallest to largest).
• The second quartile (Q2) is the median or 50th percentile of a data set and 50% of the data lies below this point.
• The third quartile (Q3) is the middle value between the median and the highest value of the data set. It is also known as the
upper quartile or the 75th percentile and 75% of the data lies below this point.
• The minimum value is the 0th percentile, which means 0% of the data points are smaller than the minimum value.
• The maximum value is the 100th percentile, which means 100% of the data points are smaller than or equal to the maximum value.
• The minimum (smallest observation), the lower quartile or first quartile, the median (the middle value), the upper quartile
or third quartile, and the maximum (largest observation) are the most important percentiles. They are called the Five-
number summary.
QUARTILE.INC: this function can calculate quartiles including the percentiles at the boundary of the data series, i.e., 0th percentile
(minimum) and 100th percentile (maximum)
QUARTILE.EXC: this function can calculate quartiles excluding the percentiles at the boundary of the data series, i.e., 0th percentile
(minimum) and 100th percentile (maximum)
The algorithms these two functions use to calculate quartiles are also different. Another function called “QUARTILE” in old-version Excel works similarly to QUARTILE.INC.
Extension
To find the quartile, Excel will first find the “position” of that quartile. For example, if there are 11 numbers in ascending order (e.g., the first eleven prime numbers 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31), QUARTILE.EXC (series, 1) works as follows: it first calculates the position of Q1 as 3 and then it finds the value in the third position, which is 5 (just like the INDEX function). The position used by QUARTILE.EXC is (N+1)×P%, where P is the percentile, so it can be 25%, 50%, or 75%.
What if the position is not an integer? Suppose we only have ten numbers (e.g., 2, 3, 5, 7, 11, 13, 17, 19, 23, 29); the position for Q1 calculated by QUARTILE.EXC is (10+1)×25%=2.75. What is the number in the 2.75 position? There is no such number in the original data series, but we know it lies between the values in the second and third positions, which are 3 and 5, respectively. In this situation, we can take 0.75 of the distance between 3 and 5 (2.75 position - 2 position = 0.75), i.e., (5-3)×0.75=1.5. Therefore, Q1 should be 3+1.5=4.5, which is in the 2.75 position.
QUARTILE.INC calculates the position using a different algorithm, which is (N-1)×P%+1. When will these two algorithms return the same value?
When N is very large, there will be no significant difference between these two algorithms. You may use either in practice.
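The two position rules can be sketched in Python as below. This follows the description above, i.e., (N+1)×P% for QUARTILE.EXC and (N-1)×P%+1 for QUARTILE.INC with linear interpolation between neighbours; it is an illustration, not Excel's actual implementation, and the function names are my own.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based, possibly fractional, position."""
    lower_index = int(position)          # whole part of the position
    fraction = position - lower_index    # how far to move toward the next value
    lower_value = sorted_data[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_data[lower_index]
    return lower_value + fraction * (upper_value - lower_value)

def quartile_exc(data, quart):
    """Quartile 1, 2, or 3 using the position rule (N + 1) * quart / 4."""
    ordered = sorted(data)
    position = (len(ordered) + 1) * quart / 4
    if position < 1 or position > len(ordered):
        raise ValueError("position falls outside the data")
    return value_at_position(ordered, position)

def quartile_inc(data, quart):
    """Quartile 0 to 4 using the position rule (N - 1) * quart / 4 + 1."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * quart / 4 + 1
    return value_at_position(ordered, position)

primes_11 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
primes_10 = primes_11[:10]

print(quartile_exc(primes_11, 1), quartile_exc(primes_10, 1), quartile_inc(primes_10, 1))
```

Setting the two position formulas equal gives P = 50%, so both rules always agree on the median.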
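Interquartile range: $IQR = Q3 - Q1$
Mild Outliers: any data points larger than $Q3 + 1.5\,IQR$ or smaller than $Q1 - 1.5\,IQR$; meanwhile, these data points should be smaller than or equal to $Q3 + 3\,IQR$ or larger than or equal to $Q1 - 3\,IQR$. In other words, any observations that lie in the interval $[Q1 - 3\,IQR,\ Q1 - 1.5\,IQR)$ or $(Q3 + 1.5\,IQR,\ Q3 + 3\,IQR]$
Serious (or Extreme) Outliers: any data points larger than $Q3 + 3\,IQR$ or smaller than $Q1 - 3\,IQR$. In other words, any observations that lie in $(-\infty,\ Q1 - 3\,IQR)$ or $(Q3 + 3\,IQR,\ +\infty)$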
• Practice 5.6: Find the mild outliers and extreme outliers using QUARTILE.INC
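The IQR rule can be sketched in Python. The data below is hypothetical, and the sketch uses `statistics.quantiles` with `method="inclusive"`, which applies the same (N-1)×P%+1 position rule that the lecture describes for QUARTILE.INC.

```python
from statistics import quantiles

def classify_outliers(data):
    """Split data into mild and extreme outliers using the 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for value in data:
        if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
            extreme.append(value)
        elif value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
            mild.append(value)
    return mild, extreme

# Hypothetical data for illustration only.
data = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]

mild, extreme = classify_outliers(data)
print(mild, extreme)
```

For this data, Q1 = 13.25 and Q3 = 17.75, so the inner fences are 6.5 and 24.5 and the outer fences are -0.25 and 31.25.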
Comments
Using IQR to find outliers is a rule of thumb. This method is widely used in practice. In general, it works better when data is normally distributed. Even if the data does not follow a normal distribution, this method can still give us some hints about the outliers.
[Figure: number line marking Q1-3IQR, Q1-1.5IQR, Q1, Q2, Q3, Q3+1.5IQR, Q3+3IQR. Mild outliers lie between the 1.5IQR and 3IQR fences on either side; serious outliers lie beyond the 3IQR fences.]
• For simplicity, let’s just consider one situation: Y is non-negative (e.g., income, sales, etc.).
When the distribution of a variable is highly right-skewed, we can take the natural logarithm of this variable. After taking the log, the distribution tends to be normal. If a non-negative variable contains zero, we can take log(Y+1).
If we take the log transformation on both X and Y, the interpretation of the coefficient of X will be as follows: when X changes by 1%, Y will change by b% (b is the estimated coefficient, i.e., $\ln Y = a + b \ln X$).
A right-skewed distribution has a long right tail. Such a distribution describes that a large number of observations have small or ordinary values while very few observations have extremely large values. Typical examples include household income, sales of products on platforms, military power of countries, etc.
Normal distribution is very common; intuitively, extreme things happen rarely, general things happen a lot, e.g., animals’ height/weight; launch time of your computers.
Rule of Thumb
1 Something is affected by multiple factors
2 These factors will not influence each other
3 No factor dominates
• Such a distribution reflects an unbalanced situation, the “Matthew Effect”, which is summarized by the adage “the rich get richer and the poor get poorer”.
• The median is more representative than the mean when estimating the center because the mean is distorted by extreme values. “Jack Ma and I can earn 1 billion per year on average”.
• How to judge skewness? 1) Calculate skewness; 2) Draw a histogram; 3) Use prior knowledge
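The points about skewness, mean versus median, and the log transformation can be illustrated with a small Python sketch. The values are hypothetical (each one doubles the previous one), and the skewness function uses the adjusted sample skewness formula, which is one common definition among several.

```python
import math
from statistics import mean, median

def sample_skewness(values):
    """Adjusted sample skewness: positive means a longer right tail."""
    n = len(values)
    m = mean(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

# Hypothetical right-skewed data: a few very large values pull the mean up.
incomes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

raw_skew = sample_skewness(incomes)
log_skew = sample_skewness([math.log(v) for v in incomes])

print(round(mean(incomes), 2), median(incomes))
print(round(raw_skew, 2), round(log_skew, 2))
```

The mean sits far above the median in the raw data, and the skewness drops to about zero after taking logs because the log values are evenly spaced.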