
Finals Cheat Sheet - Summary: Introduction to Business Analytics
Introduction to Business Analytics (National University of Singapore)


DATA, INFORMATION AND DATABASES
Data: numerical or textual facts and figures that are collected through some type of measurement process.
Information: the result of analyzing data; that is, extracting meaning from data to support evaluation and decision making.
A data set is simply a collection of data. Marketing survey responses, a table of historical stock prices, and a collection of measurements of dimensions of a manufactured item are examples of data sets.
A database is a collection of related files containing records on people, places, or things. The people, places, or things for which we store and maintain information are called entities. A database for an online retailer that sells instructional fitness books and DVDs, for instance, might consist of a file for three entities: publishers from which goods are purchased, customer sales transactions, and product inventory. A database file is usually organized in a two-dimensional table, where the columns correspond to each individual element of data (called fields, or attributes) and the rows represent records of related data elements.

A metric is a unit of measurement that provides a way to objectively quantify performance. For example, senior managers might assess overall business performance using such metrics as net profit, return on investment, market share, and customer satisfaction. A plant manager might monitor such metrics as the proportion of defective parts produced or the number of inventory turns each month. For a Web-based retailer, some useful metrics are the percentage of orders filled accurately and the time taken to fill a customer's order. Measurement is the act of obtaining data associated with a metric. Measures are numerical values associated with a metric.

PROBABILITY
Classical definition: probabilities can be deduced from theoretical arguments.
Relative frequency definition: probabilities are based on empirical data.
Subjective definition: probabilities are based on judgment and experience.

Statistical measures of goodness of fit:
• Chi-square (needs at least 50 data points)
• Kolmogorov-Smirnov (works well for small samples) → compares cumulative distributions; H0: same distribution, H1: different distribution
• Anderson-Darling (puts more weight on the differences between the tails of the distributions)
• Shapiro's normality test (tests data against the normal distribution); H0: normally distributed, H1: not normally distributed

CENTRAL LIMIT THEOREM
The central limit theorem states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population. It also states that if the population is normally distributed, then the sampling distribution of the mean will be normal for any sample size.

Standard Error of the Mean $= \sigma / \sqrt{n}$
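As a quick sanity check of the theorem (not part of the original sheet), a minimal R simulation with invented data: sample means drawn from a skewed Exponential population come out roughly normal, with spread close to $\sigma/\sqrt{n}$.

```r
# CLT illustration: means of samples from a skewed population look normal.
set.seed(42)
n <- 50                                                     # sample size
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))   # Exp(1): mu = 1, sigma = 1

sd(sample_means)    # empirical standard error of the mean
1 / sqrt(n)         # theoretical sigma / sqrt(n) = 0.1414...
hist(sample_means)  # roughly bell-shaped despite the skewed population
```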
Reliability means that data are accurate and consistent.
Validity means that data correctly measure what they are supposed to measure.

A dashboard is a visual representation of a set of key business measures. It is derived from the analogy of an automobile's control panel, which displays speed, gasoline level, temperature, and so on. Dashboards provide important summaries of key business information to help manage a business process or function.

Pareto Analysis: roughly 80% of output comes from 20% of input.

A cross-tabulation is a tabular method that displays the number of observations in a data set for different subcategories of two categorical variables. A cross-tabulation table is often called a contingency table. The subcategories of the variables must be mutually exclusive and exhaustive, meaning that each observation can be classified into only one subcategory and, taken together over all subcategories, they must constitute the complete data set.

CONFIDENCE INTERVALS FOR THE MEAN
With known population standard deviation:
$\bar{x} \pm z_{\alpha/2}\,(\sigma / \sqrt{n})$
Standard $z_{\alpha/2}$ values: $z_{0.975} = 1.96$ (for a 95% interval)

With unknown population standard deviation:
$\bar{x} \pm t_{\alpha/2,\,n-1}\,(s / \sqrt{n})$
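A minimal R sketch of both intervals; the data vector x and the "known" sigma are invented for illustration:

```r
x <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9)  # illustrative sample

# Known population sd (z-interval), assuming sigma = 0.3:
sigma <- 0.3
mean(x) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(length(x))

# Unknown population sd (t-interval) -- t.test computes it directly:
t.test(x, conf.level = 0.95)$conf.int
```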
DATA TYPE     SINGLE VARIABLE                           MULTIPLE VARIABLES
CATEGORICAL   Pie, Frequency Barplot, Frequency Table   Barplot, Contingency Table
NUMERICAL     Barplot, Histogram, Frequency Table       Group Barplot, Scatterplot, Contingency Table
TREND         Line Chart                                Line Chart, Surface Chart
MEASURES
• LOCATION: Mean, Median, Mode
• DISPERSION: Range, Variance, Standard Deviation, Chebyshev's Theorem, Coefficient of Variation
• SHAPE: Skewness, Kurtosis
• ASSOCIATION: Covariance, Correlation

CONFIDENCE INTERVALS FOR A PROPORTION
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \quad \hat{p} = \frac{x}{n}$
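In R, the interval can be computed by hand from the formula above, or via prop.test (which uses a slightly different Wilson-type interval); the counts below are invented:

```r
x <- 56; n <- 200                 # illustrative: 56 successes out of 200
p_hat <- x / n
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)  # formula above

prop.test(x, n)$conf.int          # built-in alternative (not identical)
```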
MEAN AND STANDARD DEVIATION
Population: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, $\quad \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, $\quad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

Chebyshev's Theorem: $P(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$
Empirical Rule (normal data): k = 1 ≈ 68%, k = 2 ≈ 95%, k = 3 ≈ 99.7%

PREDICTION INTERVALS
$\bar{x} \pm t_{\alpha/2,\,n-1}\left(s\sqrt{1 + \frac{1}{n}}\right)$
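A hand computation of the 95% prediction interval in R, reusing an invented sample:

```r
x <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9)  # illustrative sample
n <- length(x)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1/n)  # 95% PI
```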

Coefficient of Variation (CV) = Standard Deviation / Mean
Return to Risk = 1 / CV (i.e., Mean / Standard Deviation)

Covariance is a measure of the linear association between two variables, X and Y. Like the variance, different formulas are used for populations and samples.
Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement.
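In R (illustrative vectors; cov() and cor() both use the sample formulas):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 2.9, 4.2, 4.8, 6.1)
cov(x, y)   # sample covariance: depends on the units of x and y
cor(x, y)   # Pearson correlation: unit-free, always in [-1, 1]
```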

$z = \frac{x - \mu}{\sigma} \qquad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
(where $\mu_0$ is the hypothesized population mean)

Multicollinearity: predictors are highly correlated with each other (|cor| > 0.7).
Interaction: modelled as the product of two predictors (A × B); see the sketch below.
The principle of parsimony: keep models as simple as possible.
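A sketch of both ideas in R; the data frame df and its variables are invented for illustration:

```r
set.seed(1)
df <- data.frame(A = rnorm(100), B = rnorm(100))
df$y <- 1 + 2 * df$A - df$B + 0.5 * df$A * df$B + rnorm(100)  # invented truth

# Interaction: y ~ A * B expands to A + B + A:B (the product term)
m <- lm(y ~ A * B, data = df)
summary(m)

# Multicollinearity screen: flag predictor pairs with |cor| > 0.7
cor(df[, c("A", "B")])
```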

FORECASTING   NO SEASONALITY                 SEASONALITY
NO TREND      Simple Moving Average,         Holt-Winters no-trend smoothing,
              Simple Exponential Smoothing   Multiple Regression
TREND         Double Exponential Smoothing   Holt-Winters additive/multiplicative

Single Exponential Smoothing
$\hat{y}_t = \alpha \cdot y_t + (1 - \alpha) \cdot \hat{y}_{t-1}$
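In R, stats::HoltWinters fits exactly this recursion when trend and seasonality are switched off; the series below is invented, and alpha is chosen by least squares:

```r
y <- ts(c(50, 52, 51, 55, 54, 56, 58, 57, 60, 59))   # illustrative series
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)   # no trend, no seasonality
fit$alpha                  # fitted smoothing constant
predict(fit, n.ahead = 3)  # flat forecasts, as expected for SES
```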

HYPOTHESIS TESTING
The null hypothesis is denoted by H0, and the alternative hypothesis is denoted by H1. Using sample data, we either
1. reject the null hypothesis and conclude that the sample data provide sufficient statistical evidence to support the alternative hypothesis, or
2. fail to reject the null hypothesis and conclude that the sample data do not support the alternative hypothesis.
If we fail to reject the null hypothesis, then we can only accept as valid the existing theory or belief, but we can never prove it.

             H0 IS TRUE              H0 IS FALSE
REJECT H0    Type I Error (p = α)    Correct
ACCEPT H0    Correct                 Type II Error (p = β)

The probability of making a Type I error, that is, P(rejecting H0 | H0 is true), is denoted by α and is called the level of significance. This defines the risk you are willing to take of incorrectly concluding that the alternative hypothesis is true when, in fact, the null hypothesis is true. The value of α can be controlled by the decision maker and is selected before the test is conducted. Commonly used levels for α are 0.10, 0.05, and 0.01.

The probability of correctly failing to reject the null hypothesis, or P(not rejecting H0 | H0 is true), is called the confidence coefficient and is calculated as 1 − α. For a confidence coefficient of 0.95, we mean that we expect 95 out of 100 samples to support the null hypothesis rather than the alternative hypothesis when H0 is actually true.

Unfortunately, we cannot control the probability of a Type II error, P(not rejecting H0 | H0 is false), which is denoted by β. Unlike α, β cannot be specified in advance but depends on the true value of the (unknown) population parameter.

The value 1 − β is called the power of the test and represents the probability of correctly rejecting the null hypothesis when it is indeed false, or P(rejecting H0 | H0 is false). We would like the power of the test to be high (equivalently, we would like the probability of a Type II error to be low) to allow us to make a valid conclusion. The power of the test is sensitive to the sample size; small sample sizes generally result in a low value of 1 − β. The power can be increased by taking larger samples, which enable us to detect small differences between the sample statistics and population parameters with more accuracy. However, a larger sample size incurs higher costs, giving new meaning to the adage that there is no such thing as a free lunch. This suggests that if you choose a small level of significance, you should try to compensate with a large sample size when you conduct the test.

DATA MINING
Approaches:
- Data Exploration and Reduction: identifying groups in which elements are in some way similar
  o Sampling
  o Data Visualisations: boxplots, parallel coordinates chart, scatterplot/variable plot matrix
  o Cluster Analysis (hierarchical clustering) → dendrogram (see the sketch after this list)
    ▪ Agglomerative clustering methods:
      • Single Linkage (nearest neighbour)
      • Complete Linkage (furthest distance)
      • Average Linkage (averaging groups)
      • Ward's hierarchical clustering (sum of squares)
    ▪ Divisive clustering methods
- Classification: analyzing data to predict how to classify a new data element
  o k-Nearest Neighbors (categorical); rule of thumb: $k = \sqrt{n} < 20$
  o Discriminant Analysis
  o Logistic Regression (result is a probability between 0 and 1)
- Association: analyzing databases to identify natural associations among variables and create rules for target marketing or buying recommendations
- Cause-and-Effect Modelling: developing analytic models to describe relationships between metrics that drive business performance
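A minimal hierarchical-clustering sketch in R on the built-in USArrests data (the dataset choice is ours, not the course's):

```r
d  <- dist(scale(USArrests))           # Euclidean distances on standardised data
hc <- hclust(d, method = "ward.D2")    # Ward's criterion; try "single",
                                       # "complete" or "average" for other linkages
plot(hc)                               # the dendrogram
cutree(hc, k = 4)                      # cut the tree into 4 clusters
```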
Net Present Value: $NPV = \sum_{i=0}^{n} \frac{C_i}{(1+r)^i}$

STATISTICAL TESTS
- Compare two sample means with normal errors: t.test
- Compare means of two or more population groups: aov (ANOVA)
  o H0: $\mu_1 = \mu_2 = \dots = \mu_n$
  o H1: at least one mean is different from the others
  Requirements: the observations
  a. are randomly and independently obtained,
  b. are normally distributed, and
  c. have equal variances.
- Compare equality of two variances: var.test (F-test)
  o H0: $\sigma_1^2 = \sigma_2^2$;  H1: $\sigma_1^2 \ne \sigma_2^2$
  $F = s_1^2 / s_2^2$
- Compare more than two variances: bartlett.test
  o H0: $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$
  o Ha: $\sigma_i^2 \ne \sigma_j^2$ for at least one pair (i, j)
- Compare proportions: prop.test
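The same tests as R calls; the vectors, grouping factor, and counts below are invented for illustration:

```r
set.seed(2)
x <- rnorm(30, mean = 10)
y <- rnorm(30, mean = 11)
g <- gl(3, 10, labels = c("A", "B", "C"))   # grouping factor for ANOVA

t.test(x, y)                        # two sample means
summary(aov(x ~ g))                 # means of 3+ groups
var.test(x, y)                      # two variances (F-test)
bartlett.test(x ~ g)                # 3+ variances
prop.test(c(45, 60), c(100, 120))   # two proportions
```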

REGRESSION STATISTICS
R-squared → proportion of variation in Y explained by the model ($R^2 = 1 - SSE/SST$); ranges from 0 (worst) to 1 (best)
Multiple R → $\sqrt{R^2}$; in simple regression this is |r|, the sample correlation coefficient ($-1 \le r \le 1$)
Adjusted R-squared → adjusted for sample size and number of X variables
Standard Error → variability between observed and predicted values
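Where these figures appear in R's regression output, using the built-in cars data as a stand-in example:

```r
m <- lm(dist ~ speed, data = cars)   # simple regression on built-in data
s <- summary(m)
s$r.squared         # R-squared
sqrt(s$r.squared)   # Multiple R (= |r| in simple regression)
s$adj.r.squared     # Adjusted R-squared
s$sigma             # Residual standard error
```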

DIAGNOSTIC PLOTS
RESIDUALS vs FITTED → residuals should scatter randomly around zero with no pattern
NORMAL Q-Q → the less the points deviate from the line, the more normal the residuals
SCALE-LOCATION → checks homoscedasticity; the more random and horizontal, the better
RESIDUALS vs LEVERAGE → checks for influential outliers
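All four plots come from calling plot() on a fitted lm object; the model below reuses the built-in cars example so the snippet stands alone:

```r
m <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))   # 2x2 grid
plot(m)                # Residuals vs Fitted, Normal Q-Q,
                       # Scale-Location, Residuals vs Leverage
```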
