Statistical Measures
Learning objectives
• Explain the difference between Populations and Samples
• Understand and be able to distinguish and apply the different measures:
• Measures of Location (Mean, Median, Mode),
• Measures of Dispersion (Range, Variance, Standard Deviation,
Chebyshev’s Theorem, Coefficient of Variation)
• Measures of Shape (Skewness, Kurtosis)
• Measures of Association (Covariance and Correlation)
• Be able to identify outliers
Populations and Samples
• Population: all items of interest for a particular decision or investigation
  • all married drivers over 25 years old
  • all subscribers to Netflix
• Sample: a subset of the population
  • a list of married drivers over 25 years old who bought a new car in the past year
  • a list of individuals who rented a comedy from Netflix in the past year
• The purpose of sampling is to obtain sufficient information to draw a valid inference about a population
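Measures of Location – Mean
• µ denotes the population mean: the sum of all observations divided by the population size N, $\mu = \sum_{i=1}^{N} x_i / N$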
Example: Finding the Median Cost per Order (Purchase Orders data)
• With an even number of observations, the median is the average of the two middle values.
Using R to compute mean and median
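A minimal sketch of computing both measures in base R. The vector `cost` holds made-up values for illustration; it is not the Purchase Orders data.

```r
# Hypothetical order costs (illustrative values only)
cost <- c(120, 250, 90, 400, 310, 150)

mean_cost <- mean(cost)      # arithmetic mean
median_cost <- median(cost)  # even count: average of the two middle values

print(c(mean = mean_cost, median = median_cost))
```

Because the sample has an even number of observations, `median()` averages the two middle values after sorting.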
Measures of Location – Mode
• The mode is the observation that occurs most often or, for grouped data, the group with the greatest frequency.
Eg for observation data: Finding the Mode of A/P terms
(Purchase Orders data)
• Mode of A/P terms = 30 months
Eg for grouped data: Finding the Mode of Cost per order (Purchase Orders data)
# Compute Mode
install.packages("psych")  # run once; the package name must be quoted
library(psych)
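Base R has no built-in function for the statistical mode, so one possible sketch uses `table()` to count frequencies. This is an illustrative approach with made-up data, not necessarily the method used in the lecture; the vectors `terms` and `cost` and the bin boundaries are assumptions.

```r
# Hypothetical A/P terms (illustrative values only)
terms <- c(30, 30, 25, 45, 30, 15, 25)

# Mode for observation data: the value(s) with the highest count
freq <- table(terms)
mode_value <- as.numeric(names(freq)[freq == max(freq)])

# Mode for grouped data: bin the values, then pick the fullest bin
cost <- c(120, 250, 90, 400, 310, 150, 180)          # hypothetical costs
groups <- cut(cost, breaks = c(0, 100, 200, 300, 400))  # assumed bin edges
group_freq <- table(groups)
modal_group <- names(group_freq)[which.max(group_freq)]

print(mode_value)
print(modal_group)
```

If several values tie for the highest count, `mode_value` returns all of them.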
Measures of Dispersion – Variance
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• To convert a sample variance to a population variance when the sample size n equals the population size N: $\sigma^2 = \frac{N - 1}{N} s^2$
Measures of Dispersion – Standard Deviation
• The standard deviation is the square root of the variance (a popular measure of risk)
• For a population: sqrt(((N-1)/N)*var(X)), where N = length(X)
• For a sample: sd(X)
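The relationship between the sample and population versions can be sketched in R as follows. R's `var()` and `sd()` use the n - 1 denominator, so the population versions need the (N-1)/N adjustment. The vector `X` is made up for illustration.

```r
# Hypothetical data treated first as a sample, then as a whole population
X <- c(4, 8, 6, 5, 7)
N <- length(X)

sample_var <- var(X)                # divides by n - 1
pop_var <- ((N - 1) / N) * var(X)   # rescaled to divide by N

sample_sd <- sd(X)                  # square root of the sample variance
pop_sd <- sqrt(pop_var)             # square root of the population variance

print(c(sample_var = sample_var, pop_var = pop_var,
        sample_sd = sample_sd, pop_sd = pop_sd))
```

The population values are always slightly smaller than the sample values for the same data, and the gap shrinks as N grows.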
Measures of Dispersion – Standard Deviation
• Which has a higher standard deviation?
Chebyshev’s Theorem
• For any data set, the proportion of values that lie within k (k > 1) standard deviations of the mean is at least $1 - 1/k^2$
Substituting values of k, we get:
• For k = 2: at least 3/4 or 75% of the data lie within two standard deviations of the mean
• For k = 3: at least 8/9 or 89% of the data lie within three standard deviations of the mean
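The bound can be checked empirically. The sketch below uses simulated, deliberately skewed data (an assumption, chosen to show that the theorem does not require normality) and a hypothetical helper `prop_within()`.

```r
set.seed(1)
x <- rexp(1000)  # simulated right-skewed data (illustrative only)

# Proportion of observations within k standard deviations of the mean
prop_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

for (k in c(2, 3)) {
  cat("k =", k,
      " observed:", prop_within(x, k),
      " Chebyshev lower bound:", 1 - 1 / k^2, "\n")
}
```

The observed proportions are typically well above the bound, because Chebyshev's theorem is a worst-case guarantee that holds for any distribution.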
Eg: Using Empirical Rules to Measure the Capability of a Manufacturing
Process
Cp = 0.4/0.7 ≈ 0.57
In practice: aim for Cp ≥ 1.5
Standardized Values (used in probability distributions)
• A standardized value, commonly called a z-score, provides a relative measure of the
distance an observation is from the mean (independent of units of measurement)
• The z-score for the ith observation in a data set is: $z_i = (x_i - \bar{x}) / s$
• z = –1: the observation is 1 SD to the left of the mean
• z = 1: the observation is 1 SD to the right of the mean
Eg: Computing z-Scores
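A minimal sketch of computing z-scores in R, using made-up values in `x` and the sample standard deviation from `sd()`:

```r
# Hypothetical observations (illustrative values only)
x <- c(4, 8, 6, 5, 7)

# Subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

print(round(z, 3))
```

By construction the z-scores have mean 0 and standard deviation 1, which is what makes them independent of the original units of measurement.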
Measures of Shape: Skewness
• The coefficient of skewness (CS) indicates whether the data are skewed left or right
• CS is negative for left-skewed data.
• CS is positive for right-skewed data.
• |CS| > 1 suggests a high degree of skewness.
• 0.5 ≤ |CS| ≤ 1 suggests moderate skewness.
• |CS| < 0.5 suggests relative symmetry.
Eg: Measuring Skewness
• Using Purchase Orders database
• Cost per order data: CS = 1.61
• A/P terms data: CS = 0.58
• Which has higher skewness? Is it positive or negative?
  • CS = 1.61: high positive skewness (right-skewed)
  • CS = 0.58: moderate positive skewness
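As a rough sketch, a coefficient of skewness can be computed from the third moment about the mean divided by the cubed standard deviation. The population (divide-by-N) form used here is an assumption; the slides do not show which formula was used. Both function names are hypothetical, and `classify_skew()` simply encodes the thresholds listed above.

```r
# Coefficient of skewness, population-moment form (assumed definition)
coef_skewness <- function(x) {
  m <- mean(x)
  sigma <- sqrt(mean((x - m)^2))   # population standard deviation
  mean((x - m)^3) / sigma^3
}

# Interpret |CS| using the rule-of-thumb thresholds
classify_skew <- function(cs) {
  a <- abs(cs)
  if (a > 1) "high" else if (a >= 0.5) "moderate" else "relatively symmetric"
}

right_skewed <- c(1, 1, 2, 2, 3, 10)   # hypothetical data with a long right tail
cs <- coef_skewness(right_skewed)
print(cs)
print(classify_skew(cs))
print(classify_skew(1.61))  # Cost per order value from the notes
print(classify_skew(0.58))  # A/P terms value from the notes
```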
Shape and Measures of Location
Comparing measures of location can sometimes reveal information about the shape of the distribution of
observations.
• Negatively skewed: Mean < Median < Mode
• Positively skewed: Mode < Median < Mean
For example:
• If the distribution were perfectly symmetrical and unimodal, the mean, median, and mode would all be the same.
• If it were negatively skewed, mean < median < mode.
• Positive skewness would suggest that mode < median < mean.
Measures of Shape: Kurtosis
• Kurtosis refers to the tailedness of the distribution.
• Coefficient of kurtosis (CK) measures the degree of kurtosis
• CK > 3 indicates a leptokurtic distribution, which has a thinner peak and longer (or heavier) tails (more extreme points or outliers)
Source: https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
• Excess kurtosis = CK – 3, so CK > 3 corresponds to excess kurtosis > 0
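A sketch of the coefficient of kurtosis as the fourth moment about the mean divided by the fourth power of the standard deviation. The population (divide-by-N) form and the function name `coef_kurtosis()` are assumptions; the simulated data are for illustration only.

```r
# Coefficient of kurtosis, population-moment form (assumed definition)
coef_kurtosis <- function(x) {
  m <- mean(x)
  sigma <- sqrt(mean((x - m)^2))   # population standard deviation
  mean((x - m)^4) / sigma^4
}

set.seed(2)
normal_like <- rnorm(5000)       # simulated; CK should be near 3
heavy_tailed <- rt(5000, df = 5) # simulated t data; heavier tails than normal

ck_normal <- coef_kurtosis(normal_like)
ck_heavy <- coef_kurtosis(heavy_tailed)

# Excess kurtosis subtracts the normal-distribution benchmark of 3
print(c(CK = ck_normal, excess = ck_normal - 3))
print(c(CK = ck_heavy, excess = ck_heavy - 3))
```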
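• Population mean (from a frequency distribution, where $f_i$ is the frequency of value $x_i$): $\mu = \frac{\sum_i f_i x_i}{N}$
• Population variance: $\sigma^2 = \frac{\sum_i f_i (x_i - \mu)^2}{N}$
• Sample variance: $s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1}$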
Eg: Computing Statistical Measures from Frequency Distributions
• Computer Repair Times
• Variance: $5.96^2 \approx 35.50$ (the standard deviation squared)
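Computing the mean and variance directly from a frequency table can be sketched as follows. The values in `days` and `freq` are made up; they are not the Computer Repair Times data.

```r
# Hypothetical frequency distribution: repair time (days) and its frequency
days <- c(10, 12, 14, 16)
freq <- c(2, 5, 8, 3)

N <- sum(freq)                               # total number of observations
mu <- sum(freq * days) / N                   # frequency-weighted mean
sigma2 <- sum(freq * (days - mu)^2) / N      # population variance
s2 <- sum(freq * (days - mu)^2) / (N - 1)    # sample variance

print(c(mean = mu, pop_var = sigma2, sample_var = s2))

# Cross-check against the expanded raw data
raw <- rep(days, times = freq)
print(c(mean = mean(raw), sample_var = var(raw)))
```

Expanding the table with `rep()` and calling `mean()` and `var()` gives the same results, which is a quick way to verify the weighted formulas.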
Eg: Computing Home Value by Type and Region
• Uses a function in the `psych` package
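The notes do not name the `psych` function used, so the sketch below swaps in base R's `aggregate()` to show the same idea of summarising a value by two grouping variables. The data frame `homes` and its column names are hypothetical.

```r
# Hypothetical home data (illustrative values only)
homes <- data.frame(
  type   = c("A", "A", "B", "B", "A", "B"),
  region = c("East", "West", "East", "West", "East", "West"),
  value  = c(100, 120, 90, 110, 140, 130)
)

# Mean value for every combination of type and region
by_group <- aggregate(value ~ type + region, data = homes, FUN = mean)
print(by_group)
```

Replacing `FUN = mean` with `median` or `sd` gives other grouped statistics in the same layout.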
Descriptive Statistics for Categorical Data: The Proportion
• For a population:
• For a sample:
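A proportion is the fraction of observations that have a given characteristic. In R this can be sketched by taking the mean of a logical vector; the vector `region` is made up for illustration.

```r
# Hypothetical categorical data (illustrative values only)
region <- c("East", "West", "East", "East", "West")

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion
p_east <- mean(region == "East")

print(p_east)
print(prop.table(table(region)))  # proportions for every category at once
```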
Outliers
• Observations with z-scores beyond ±3 are potential outliers (< 0.3% of observations for normal data)
• Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
• Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3
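The IQR rules can be sketched in R as below. The data in `x` are made up, and the quartiles come from R's default `quantile()` method, which may differ slightly from quartiles computed in other software.

```r
# Hypothetical data with one unusually large value
x <- c(10, 12, 11, 13, 12, 14, 11, 40)

q <- unname(quantile(x, c(0.25, 0.75)))  # Q1 and Q3 (R's default method)
iqr <- q[2] - q[1]

# Distance outside the quartile box (0 for values between Q1 and Q3)
dist_outside <- pmax(q[1] - x, x - q[2], 0)

label <- ifelse(dist_outside > 3 * iqr, "extreme",
         ifelse(dist_outside > 1.5 * iqr, "mild", "none"))

print(data.frame(x = x, label = label))
```

With very small samples the z-score rule may never flag anything, since the largest attainable z-score is limited by the sample size, so the IQR rules are a useful complement.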
Eg: Investigating Outliers
• Home Market Value data
• None of the z-scores exceed 3. However, while individual variables might not
exhibit outliers, combinations of them might.
• The last observation has a high market value ($120,700) but a relatively small house size
(1,581 square feet) and may be an outlier.
What do you do with outliers?
- Leave them in the data if they are important
- Remove them if they are different from the rest
- Correct them if they result from data-entry errors
Statistical Thinking in Business DM
• Statistical Thinking is a philosophy of learning and action for
improvement, based on principles that:
• all work occurs in a system of interconnected processes
• variation exists in all processes
• better performance results from understanding and reducing
variation
• Business Analytics provides managers with insights into facts and relationships that enable them to make better decisions.
Applying Statistical Thinking
• Excel file Surgery Infections
• Is month 12 simply random variation or some explainable phenomenon?
• Applying the 3 standard deviation empirical rule: upper limit = mean + 3 standard deviations; lower limit = mean – 3 standard deviations
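One way to apply the 3 standard deviation rule is sketched below. The monthly rates are made up (the Surgery Infections file is not reproduced here), and computing the limits from the first 11 months only is an assumption of this sketch, made so that the month under question does not inflate its own limits.

```r
# Hypothetical monthly infection rates (illustrative values only)
baseline <- c(1.2, 1.5, 1.1, 1.4, 1.3, 1.6, 1.2, 1.5, 1.3, 1.4, 1.2)  # months 1-11
month12 <- 4.0

center <- mean(baseline)
s <- sd(baseline)

upper <- center + 3 * s   # upper limit
lower <- center - 3 * s   # lower limit

# A value outside the limits is unlikely to be ordinary random variation
unusual <- month12 > upper | month12 < lower

print(c(lower = lower, upper = upper))
print(unusual)
```

A value outside the limits does not explain itself; it only signals that looking for an explainable cause is worthwhile.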
Variability in Samples
• Different samples from any population will vary
• different means, standard deviations, and other statistical measures
• differences in shapes of histograms
• Samples are extremely sensitive to the sample size – the
number of observations included in the samples.
Eg: Variation in Sample Data
• Samples from Computer Repair Times data
• Population statistics: $\mu = 14.91$ days, $\sigma^2 = 35.5$ days$^2$
• Two samples of size 50:
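Sampling variation can be illustrated with a simulation. The population below is a simulated stand-in built from the stated mean and variance under an assumed normal shape; it is not the actual Computer Repair Times data.

```r
set.seed(42)

# Simulated stand-in population (assumption: normal with the stated mean and variance)
population <- rnorm(10000, mean = 14.91, sd = sqrt(35.5))

# Two independent samples of size 50
sample1 <- sample(population, 50)
sample2 <- sample(population, 50)

# The statistics differ between samples and from the population values
print(c(mean1 = mean(sample1), mean2 = mean(sample2)))
print(c(var1 = var(sample1), var2 = var(sample2)))
```

Rerunning with a larger sample size, such as 500, generally brings the sample statistics closer to the population values.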