Statistical Measures

The document outlines descriptive statistical measures, including differences between populations and samples, and various measures such as location (mean, median, mode), dispersion (range, variance, standard deviation), shape (skewness, kurtosis), and association (covariance, correlation). It provides examples and applications for calculating these measures, as well as discussing the implications of skewness and kurtosis on data interpretation. Additionally, it emphasizes the importance of understanding these measures for making valid inferences in statistical analysis.


Descriptive Statistical Measures

Learning objectives
• Explain the difference between Populations and Samples
• Understand and be able to distinguish and apply the different measures:
  • Measures of Location (Mean, Median, Mode)
  • Measures of Dispersion (Range, Variance, Standard Deviation, Chebyshev’s Theorem, Coefficient of Variation)
  • Measures of Shape (Skewness, Kurtosis)
  • Measures of Association (Covariance and Correlation)
• Be able to identify outliers
Populations and Samples
• Population – all items of interest for a particular decision or investigation (the entire group of items)
  • all married drivers over 25 years old
  • all subscribers to Netflix
• Sample – a subset of the population
  • a list of married drivers over 25 years old who bought a new car in the past year
  • a list of individuals who rented a comedy from Netflix in the past year
• The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.
Measures of Location – Mean

The average value of a variable.

For a population of size N (population mean):
\mu = \frac{\sum_{i=1}^{N} x_i}{N}

For a sample of n observations (sample mean):
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

The mean is also commonly known as the average.

Measures of Location – Mean
Example: Computing Mean Cost per Order (Using Purchase Orders data)
- Using formula:
Mean = $2,471,760/94
= $26,295.32
Measures of Location – Median
~ middle value of the data when arranged from least to greatest

Example: Finding the Median Cost per Order (Purchase Orders data)

Sort the data in column B.


Since n = 94,
Median = $15,656.25
(average of 47th &
48th observations)

NOTE: If n is odd, the median is the value of the middle observation; if n is even, the median is the average of the two middle values.
Using R to compute mean and median
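A minimal R sketch of these two functions; the file name and the column name Cost.per.order are assumptions (adjust them to your copy of the Purchase Orders data):

# read the Purchase Orders data (assumed file and column names)
po <- read.csv("PurchaseOrders.csv")

mean(po$Cost.per.order)    # arithmetic mean of Cost per order
median(po$Cost.per.order)  # middle value of the sorted costs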
Measures of Location – Mode
~ observation that occurs most often or, for grouped data, the group with the greatest frequency.

Eg for observation data: Finding the Mode of A/P terms (Purchase Orders data)
Mode of A/P terms = 30 months

NOTE: R has no built-in mode function. Use the table() function to build a frequency table and take the value with the greatest frequency.
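A short sketch of that approach, reusing the assumed po data frame; the column name A.P.Terms is an assumption:

freq <- table(po$A.P.Terms)   # frequency of each A/P terms value
freq                          # inspect the full frequency table
names(which.max(freq))        # value with the greatest frequency = the mode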
Measures of Location – Mode
~ observation that occurs most often or, for grouped data, the group
with the greatest frequency.

Eg for grouped data: Finding the Mode of Cost per order (Purchase Orders data)

• Mode is the group between $0 and $20,000.
Measures of Location: Application
Problem: Quoting Computer Repair Times
Data set (Computer Repair Times) includes 250 repair times for customers.
– What repair time would be reasonable to quote to a new customer?

# Use R functions for Mean & Median
# Compute Mode: use the table() function to obtain frequencies for each value of X

Measures of Location: Application
Problem: Quoting Computer Repair Times
Data set (Computer Repair Times) includes 250 repair times for customers.
– What repair time would be reasonable to quote to a new customer?
– Mean repair time is about 15 days
– Median repair time: 2 weeks
– Mode is 12 and 15 days
Measures of Dispersion
• Dispersion refers to the degree of variation (numerical spread or compactness) in the data
• Range is the difference between the maximum and minimum data values
• Interquartile range (IQR) is the difference between the third and first quartiles (Q3 – Q1); it uses the middle 50% of the data
• Variance is an average of squared deviations from the mean (uses all data values); in R: var()
• Standard Deviation is the square root of the variance; in R: sd()

For X = the computer repair times variable: Range = ? IQR = ? (see the R sketch below)
Install and load the psych package for additional summary functions: install.packages("psych"); library(psych)
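A sketch answering those questions for the repair_time vector defined in the earlier sketch (an assumed name), using base R plus psych:

range(repair_time)            # minimum and maximum values
diff(range(repair_time))      # Range = max - min
IQR(repair_time)              # interquartile range, Q3 - Q1
var(repair_time)              # sample variance
sd(repair_time)               # sample standard deviation

# install.packages("psych")   # once, if not yet installed
library(psych)
describe(repair_time)         # many descriptive measures in one call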
Measures of Dispersion – Variance
~ average of squared deviations from the mean

For a population of size N:
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}

For a sample of n observations:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

Note the difference in denominators: N for the population variance, n – 1 for the sample variance.

Computing the Variance
• For a sample: in R, the sample variance is var(X) — note that var() calculates the sample variance only.
• For a population (i.e. the sample data is also the population data, so n = N): [(N-1)/N]*var(X)

Recall that when the sample size n equals the population size N,
\sigma^2 = \frac{n - 1}{n} \, s^2
which is why multiplying the sample variance by (N-1)/N gives the population variance.
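A minimal sketch of that population adjustment; the vector below is toy data treated as an entire population:

x <- c(4, 8, 15, 16, 23, 42)   # toy data treated as a whole population
N <- length(x)

var(x)                          # sample variance (denominator N - 1)
((N - 1) / N) * var(x)          # population variance (denominator N)
sum((x - mean(x))^2) / N        # same value, straight from the definition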
Measures of Dispersion – Standard Deviation
~ square root of the variance (popular measure of risk)

Computing the Standard Deviation
• For a sample: sd(X)
• For a population: sqrt(((N-1)/N)*var(X))
Measures of Dispersion – Standard Deviation
• Which has a higher standard deviation?

Source: Wikipedia (Standard Deviation)


Measures of Dispersion - Application
Mean & Standard Deviation of Closing Stock Prices

Intel (INTC):
Mean = $18.81
Stdev. = $0.50

General Electric (GE):
Mean = $16.19
Stdev. = $0.35

Whose risk may be higher? INTC has the higher standard deviation, so its price is more variable.
Measures of Dispersion
Chebyshev’s Theorem
• For any data set, the proportion of values that lie within k (k > 1) standard deviations of the mean is at least 1 – 1/k².
Substituting values of k, we get:
• For k = 2: at least 3/4 (75%) of the data lie within two standard deviations of the mean
• For k = 3: at least 8/9 (about 89%) of the data lie within three standard deviations of the mean

Why is this useful?
• We can use the mean and standard deviation alone to bound the percentage of total observations that fall within a given interval about the mean.

Example: For Cost per order data in the Purchase Orders database
• Applying the two standard deviation interval (k = 2): 94.68% of the data fall within 2 sd of the mean, versus the 75% minimum guaranteed by Chebyshev’s theorem.
• Applying the three standard deviation interval (k = 3): 97.87% fall within 3 sd of the mean, versus the 89% minimum guaranteed by Chebyshev’s theorem.
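A sketch of how those percentages can be checked in R, reusing the assumed po$Cost.per.order column from the earlier sketch:

cost <- po$Cost.per.order
m <- mean(cost)
s <- sd(cost)

# proportion of observations within k standard deviations of the mean
within_k <- function(k) mean(abs(cost - m) <= k * s)

within_k(2)   # compare with Chebyshev's lower bound of 3/4
within_k(3)   # compare with Chebyshev's lower bound of 8/9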
Measures of Dispersion
• Empirical Rule – for a normally distributed data set, the proportion of values that lie within k standard deviations of the mean follows the empirical rule:
• For k = 1: about 68% lie within one standard deviation of the mean
• For k = 2: about 95% lie within two standard deviations of the mean
• For k = 3: about 99.7% lie within three standard deviations of the mean

Source: Statistics Libretexts


Application of Empirical Rule - Process Capability Index
• The process capability index (Cp) is a measure of how well a manufacturing process can achieve specifications.
• Using a sample of output, measure the dimension of interest and compute the total variation using the third empirical rule (mean ± 3 standard deviations, i.e. a total spread of about 6 standard deviations).
• Compare the results to the specifications using:
Cp = (upper specification – lower specification) / total variation

Eg: Using Empirical Rules to Measure the Capability of a Manufacturing Process
• Part dimension (cm): Cp = 0.4/0.7 ≈ 0.57, where 0.4 is the specification width and 0.7 is the total variation from the third empirical rule.

In practice: aim for Cp ≥ 1.5.
Standardized Values (used in probability distributions)
• A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean (independent of the units of measurement).
• The z-score for the ith observation in a data set is:
z_i = \frac{x_i - \bar{x}}{s}

Z = -1: the observation is 1 SD to the left of the mean. Z = 1: the observation is 1 SD to the right of the mean.

Eg: Computing z-Scores
• Purchase Orders Cost per order data

Computing the z-score in R:

df$zscore <- (df$cost - mean(df$cost)) / sd(df$cost)
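Equivalently, base R's scale() function centers and scales in one call; a small sketch reusing the same df$cost column from above:

# scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
df$zscore2 <- as.numeric(scale(df$cost))

all.equal(df$zscore, df$zscore2)   # both approaches give the same z-scores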
Coefficient of Variation
• The coefficient of variation (CV) provides a relative measure of dispersion in data relative to the mean:
CV = \frac{s}{\bar{x}}  (standard deviation divided by the mean)
• Sometimes expressed as a percentage (× 100)
• Provides a relative measure of risk to return
• Useful when comparing the variability of two or more data sets with different scales
• Smaller CV → smaller risk
• Reciprocal of CV → return to risk

Eg: Applying the Coefficient of Variation
• Closing Stock Prices database
• Which investment is most risky? Which investment would have the least risk?

Intel (INTC) is slightly riskier than the other stocks.
The index fund has the least risk (lowest CV).
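A sketch of computing CV for each stock, assuming a data frame named prices with one numeric column of closing prices per stock (an assumed layout, not given by the slides):

cv <- function(x) sd(x) / mean(x)   # coefficient of variation

sapply(prices, cv)       # CV for every stock column; smallest CV = least risk
1 / sapply(prices, cv)   # reciprocal of CV = return-to-risk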


Measures of Shape: Skewness
• Skewness describes the lack of symmetry of data.
• Distributions that tail off to the right are called positively skewed; those
that tail off to the left are said to be negatively skewed.

[Figures: positively skewed and symmetrical distributions]


Coefficient of Skewness
• The Coefficient of Skewness (CS) measures the direction and degree of skewness as the average cubed deviation from the mean, scaled by the cube of the standard deviation:
CS = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}

• CS is negative for left-skewed data.
• CS is positive for right-skewed data.
• |CS| > 1 suggests a high degree of skewness.
• 0.5 ≤ |CS| ≤ 1 suggests moderate skewness.
• |CS| < 0.5 suggests relative symmetry.

In R, CS can be obtained using the skew() function in the psych package.
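A sketch using psych's skew() on the assumed po columns from earlier (exact values depend on the estimator psych uses):

library(psych)

skew(po$Cost.per.order)   # expected to be large and positive (right-skewed)
skew(po$A.P.Terms)        # smaller positive value, i.e. moderate skewness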
Eg: Measuring Skewness
• Using Purchase Orders database
• Cost per order data: CS = 1.61 (right-skewed)
• A/P terms data: CS = 0.58
• Which has higher skewness? Positive or negative?

CS = 1.61: high positive skewness. CS = 0.58: moderate positive skewness.
Shape and Measures of Location
Comparing measures of location can sometimes reveal information about the shape of the distribution of
observations.
Negatively skewed: Mean < Median < Mode
Positively skewed: Mode < Median < Mean

For example:
• If distribution was perfectly symmetrical and unimodal, the mean, median, and
mode would all be the same.
• If it were negatively skewed, mean < median < mode
• Positive skewness would suggest that mode < median < mean
Measures of Shape: Kurtosis
• Kurtosis refers to the tailedness of the distribution.
• The coefficient of kurtosis (CK) measures the degree of kurtosis.

• CK = 3 for a normal (or mesokurtic) distribution.
• CK < 3 indicates a platykurtic ("flat") distribution, with a flatter peak and shorter, lighter tails (fewer extreme points or outliers).
• CK > 3 indicates a leptokurtic ("slender") distribution, with a sharper peak and longer, heavier tails (more extreme points or outliers).

Source: https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
Measures of Shape: Kurtosis

Note: several functions in R compute excess kurtosis, which subtracts 3 from CK so that the cut-off is zero:
excess kurtosis = CK – 3
Eg: psych::kurtosi, e1071::kurtosis

• Excess kurtosis > 0: leptokurtic
• Excess kurtosis = 0: mesokurtic (normal)
• Excess kurtosis < 0: platykurtic

For the Purchase Orders data, excess kurtosis > 0 for both Cost per order and A/P terms, hence both distributions are leptokurtic, with longer and heavier tails than the normal distribution.
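A sketch of the two excess-kurtosis functions named above (note psych spells its function kurtosi), applied to the assumed po columns:

library(psych)
kurtosi(po$Cost.per.order)    # excess kurtosis; > 0 suggests a leptokurtic shape
kurtosi(po$A.P.Terms)

# install.packages("e1071")   # alternative implementation
library(e1071)
kurtosis(po$Cost.per.order)   # also reports excess kurtosis by default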
Descriptive Statistics for Grouped Data
When data are summarized in a frequency distribution with values (or group midpoints) x_i and frequencies f_i (a small R sketch follows the formulas):

• Population mean:  \mu = \frac{\sum f_i x_i}{N}
• Sample mean:  \bar{x} = \frac{\sum f_i x_i}{n}
• Population variance:  \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N}
• Sample variance:  s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1}
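A minimal sketch of the grouped-data formulas; the values and frequencies below are illustrative only, not from any of the course data sets:

x <- c(5, 10, 15, 20, 25)    # group values (or group midpoints)
f <- c(10, 45, 120, 55, 20)  # frequencies

n    <- sum(f)
xbar <- sum(f * x) / n                    # sample mean from grouped data
s2   <- sum(f * (x - xbar)^2) / (n - 1)   # sample variance from grouped data

c(mean = xbar, variance = s2)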
Eg: Computing Statistical Measures from Frequency Distributions
• Computer Repair Times
• Standard deviation ≈ 5.96 days, so the variance is 5.96² ≈ 35.5 days².

Eg: Computing Home Value by Type and Region (using a grouping function in the `psych` package)
Descriptive Statistics for Categorical Data: The Proportion

• The proportion (p) is the fraction of data that have a certain characteristic.
• Proportions are key descriptive statistics for categorical data,
such as defects or errors in quality control applications or
consumer preferences in market research.
Eg: Computing a Proportion
• Proportion of orders placed by Spacetime Technologies:

Proportion = (Number of orders by Spacetime Technologies) / (Total number of orders)

In R, this can be computed with a logical comparison, or by using the filter() function in dplyr.
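A sketch of both approaches, assuming the po data frame from earlier has a column named Supplier (an assumed name):

# base R: averaging a logical comparison gives the proportion directly
mean(po$Supplier == "Spacetime Technologies")

# dplyr: count the matching rows, then divide by the total number of orders
library(dplyr)
nrow(filter(po, Supplier == "Spacetime Technologies")) / nrow(po)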
Measures of Association
• Data from 49 top liberal arts and research universities can be used to
answer questions:
• Is Top 10% HS related to Graduation %?
• Is Accept. Rate related to Expenditures/Student?
• Is Median SAT related to Acceptance Rate?
Measures of Association - Covariance
• Covariance is a measure of the linear association between two variables, X and Y.

• For a population:  \sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}
• For a sample:  s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

• Positive covariance → direct relationship
• Negative covariance → inverse relationship
• Magnitude → degree of association (but it depends on the units of measurement)

Eg: Computing the Covariance
• Scatterplot of the Colleges and Universities data
• Note that R's cov() function computes the sample covariance; multiply by (N-1)/N for a population.
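A sketch of the sample covariance in R; the file name and the two column names are assumptions, not the data set's actual headers:

colleges <- read.csv("CollegesUniversities.csv")   # assumed file name

# sample covariance between two columns (names are assumptions)
cov(colleges$Top10HS, colleges$GradPct)

# scatterplot of the same two variables
plot(colleges$Top10HS, colleges$GradPct,
     xlab = "Top 10% HS", ylab = "Graduation %")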

Measures of Association - Correlation
• Correlation is a measure of the linear association between two variables, X and Y, that does not depend on the units of measurement.
• Correlation coefficient formulas:
• For a population:  \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
• For a sample:  r_{XY} = \frac{s_{XY}}{s_X s_Y}

• Range: from -1 (strong negative linear relationship) to +1 (strong positive linear relationship)
• 0 indicates no linear relationship

• Also known as the Pearson product moment correlation, or Pearson's correlation coefficient
Measures of Association
• Correlation as a measure of LINEAR association

Source: Wikipedia (Correlation and dependence)


Computing Correlation of Multiple Variables
• Is Top 10% HS related to Graduation %?
• Is Accept. Rate related to Expenditures/Student?
• Is Median SAT related to Acceptance Rate?
Plotting a correlation matrix
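A sketch of a correlation matrix and one way to plot it, reusing the assumed colleges data frame; the corrplot package is an assumption, not prescribed by the slides:

num_cols <- colleges[sapply(colleges, is.numeric)]   # keep only the numeric columns
r <- cor(num_cols)                                   # matrix of pairwise Pearson correlations
round(r, 2)

# install.packages("corrplot")   # one common way to visualize the matrix
library(corrplot)
corrplot(r, method = "circle")

pairs(num_cols)   # base R alternative: matrix of pairwise scatterplots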
Introduction to Outliers

Outliers
• The mean and range are sensitive to outliers.
• There is no standard definition of what constitutes an outlier.
• How do we identify potential outliers? Some rules of thumb (see the R sketch below):
  • z-scores > +3 or < -3 (less than about 0.3% of observations for normally distributed data)
  • Extreme outliers are more than 3*IQR to the left of Q1 or to the right of Q3
  • Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or to the right of Q3
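A sketch of both rules of thumb applied to a numeric vector; here it reuses the assumed repair_time vector from the earlier sketch:

x <- repair_time                     # any numeric variable of interest

# z-score rule of thumb
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]                        # potential outliers beyond +/- 3 sd

# IQR fences
q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1

extreme <- x[x < q1 - 3 * iqr | x > q3 + 3 * iqr]
mild    <- x[(x >= q1 - 3 * iqr & x < q1 - 1.5 * iqr) |
             (x >  q3 + 1.5 * iqr & x <= q3 + 3 * iqr)]
extreme
mild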
Eg: Investigating Outliers
• Home Market Value data

• None of the z-scores exceed 3. However, while individual variables might not
exhibit outliers, combinations of them might.
• The last observation has a high market value ($120,700) but a relatively small house size
(1,581 square feet) and may be an outlier.
What do you do with outliers?

- Leave them in the data if they are legitimate values that matter to the analysis
- Remove them if they clearly do not belong with the rest of the data
- Correct them if they result from data-entry errors
Statistical Thinking in Business DM
• Statistical Thinking is a philosophy of learning and action for
improvement, based on principles that:
• all work occurs in a system of interconnected processes
• variation exists in all processes
• better performance results from understanding and reducing
variation
• Business Analytics provides managers with insights into facts and relationships that enable them to make better decisions.
Applying Statistical Thinking
• Excel file Surgery Infections
• Is month 12 simply random variation or some explainable phenomenon?
Applying the 3 std dev empirical rule: compute upper and lower limits at the mean ± 3 standard deviations and check whether month 12 falls outside them.
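A sketch of those limits in R, assuming the monthly infection rates are in a vector named infection_rate (an assumed name):

m <- mean(infection_rate)
s <- sd(infection_rate)

upper <- m + 3 * s   # upper limit from the 3 std dev empirical rule
lower <- m - 3 * s   # lower limit

plot(infection_rate, type = "b", xlab = "Month", ylab = "Infection rate")
abline(h = c(lower, upper), lty = 2)   # month 12 stands out if it falls outside the limits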
Variability in Samples
• Different samples from any population will vary
• different means, standard deviations, and other statistical measures
• differences in shapes of histograms
• Samples are extremely sensitive to the sample size – the
number of observations included in the samples.
Eg: Variation in Sample Data
• Samples from Computer Repair Times data
• Population statistics: µ = 14.91 days, σ² = 35.5 days²
• Two samples of size 50 give different sample means and standard deviations.
• Two samples of size 25 differ even more from each other and from the population values.
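A sketch of drawing repeated samples to see this variability, reusing the assumed repair_time vector:

set.seed(123)   # for reproducibility

s1 <- sample(repair_time, 50)   # two samples of size 50
s2 <- sample(repair_time, 50)
c(mean(s1), mean(s2), sd(s1), sd(s2))

s3 <- sample(repair_time, 25)   # two samples of size 25 typically vary even more
s4 <- sample(repair_time, 25)
c(mean(s3), mean(s4), sd(s3), sd(s4))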
