Notation
Measures of Location
Measures of Dispersion
Standardization
Proportions for Categorical Variables
Measures of Association
Outliers
Population - all items of interest for a particular
decision or investigation
- all married drivers over 25 years old
- all subscribers to Netflix
Sample - a subset of the population
- a list of individuals who rented a comedy from
Netflix in the past year
The purpose of sampling is to obtain sufficient
information to draw a valid conclusion about a
population.
Is the Netflix sample above a good sample? Why?
Other ways to select a sample?
We typically label the elements of a data set using subscripted
variables, x_1, x_2, …, and so on, where x_i represents the ith
observation. Upper-case letters like X often represent random
variables.
It is common practice in statistics to use
◦ Greek letters, such as μ (mu; mean), σ (sigma; std. deviation), and π (pi;
proportion), to represent population measures and
◦ italic letters, such as x̄ (called x-bar), s, and p, to represent sample statistics.
N represents the number of items in a population and n represents
the number of observations in a sample.
Measures of Location
Mean
Median
Population mean:
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Sample mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Excel function: =AVERAGE(data range)
Property of the mean:
Outliers can affect the value of the mean.
The mean is valid for interval/ratio variables and is often
questionable for ordinal variables.
Purchase Orders database
Using formula:
=SUM(B2:B95)/COUNT(B2:B95)
Mean = $2,471,760/94
= $26,295.32
Using Excel AVERAGE Function
=AVERAGE(B2:B95)
Person   Age        Person   Age
1        17         1        17
2        21         2        21
3        15         3        15
4        18         4        18
5        999        5
6        22         6        22
7        11         7        11
8        25         8        25
Mean     141.00     Mean     18.43
Wikipedia: In statistics, an outlier is an observation point that is distant from
other observations. An outlier may be due to variability in the
measurement or it may indicate experimental error; the latter are
sometimes excluded from the data set.
The median specifies the middle value when the data are arranged
from least to greatest.
◦ Half the data are below the median, and half the data are above it.
◦ For an odd number of observations, the median is the middle of the
sorted numbers.
◦ For an even number of observations, the median is the mean of the two
middle numbers.
We could use the Sort option in Excel to rank-order the data and
then determine the median. The Excel function =MEDIAN(data
range) could also be used.
The median is meaningful for ratio, interval, and ordinal data.
Not affected by outliers.
Sort the data from smallest to largest. Since we
have 94 observations, the median is the average
of the 47th and 48th observations.
Median =
($15,562.50 + $15,750.00)/2
= $15,656.25
=MEDIAN(B2:B95)
Person   Age
1        17.00
2        21.00
3        15.00
4        18.00
5        999.00
6        22.00
7        11.00
8        25.00
Mean     141.00
Median   19.50
Median is insensitive to outliers!
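The age table can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative, and dropping the value 999 is an assumption about how the second column of the table was produced.

```python
from statistics import mean, median

# Ages from the table; person 5 carries the suspicious value 999.
ages = [17, 21, 15, 18, 999, 22, 11, 25]

# Assumption: the outlier is simply dropped, as in the second column of the table.
ages_clean = [a for a in ages if a != 999]

mean_all = mean(ages)            # pulled far upward by the single outlier
mean_clean = mean(ages_clean)    # mean of the remaining seven ages
median_all = median(ages)        # average of the two middle values, 18 and 21
median_clean = median(ages_clean)

print(f"mean with outlier:      {mean_all:.2f}")
print(f"mean without outlier:   {mean_clean:.2f}")
print(f"median with outlier:    {median_all:.2f}")
print(f"median without outlier: {median_clean:.2f}")
```

The mean moves from about 18 to 141 because of one value, while the median only moves from 18 to 19.5.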
The Excel file Computer Repair Times includes 250
repair times for customers.
What repair time would be
reasonable to quote to a
new customer?
Median repair time is 2
weeks; mean and mode are
about 15 days.
Examine the histogram.
90% are completed within 3 weeks
Distribution is important!
Measures of Dispersion
Range
Interquartile Range
Variance
Standard Deviation
Empirical Rules
Dispersion refers to the degree of variation in
the data; that is, the numerical spread (or
compactness) of the data.
Key measures:
◦ Range
◦ Interquartile range
◦ Variance
◦ Standard deviation
The range is the simplest and is the difference
between the maximum value and the minimum
value in the data set.
In Excel, compute as =MAX(data range) -
MIN(data range).
The range is affected by outliers, and is often
used only for very small data sets.
Purchase Orders data
For the cost per order data:
◦ Maximum = $127,500
◦ Minimum = $68.78
Range = $127,500 - $68.78 = $127,431.22
The interquartile range (IQR), or midspread,
is the difference between the third and first
quartiles, Q3 – Q1.
This includes only the middle 50% of the data and,
therefore, is not influenced by extreme values.
Purchase Orders data
For the Cost per order data:
Third Quartile = Q3 = $27,593.75
First Quartile = Q1 = $6,757.81
Interquartile Range = $27,593.75 – $6,757.81
=$20,835.94
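The Purchase Orders file is not reproduced here, so the following sketch reuses the eight ages from the outlier example to contrast the range with the IQR. The "inclusive" quartile method is an assumption; other quartile conventions give slightly different values.

```python
from statistics import quantiles

# Ages from the earlier outlier example (includes the value 999).
ages = [17, 21, 15, 18, 999, 22, 11, 25]

# Range: maximum minus minimum, so it is driven entirely by the extremes.
data_range = max(ages) - min(ages)

# Quartiles via linear interpolation between sorted values.
# Assumption: the "inclusive" method is the convention wanted here.
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The range is 988 because of the single extreme value, whereas the IQR of 6.25 reflects only the middle half of the data.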
The variance is the “average” of the squared
deviations from the mean.
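For a population:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
◦ In Excel: =VAR.P(data range)
For a sample:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
◦ In Excel: =VAR.S(data range)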
Note the difference in denominators!
The standard deviation is the square root of the
variance.
◦ Note that the dimension of the variance is the square of the
dimension of the observations, whereas the dimension of the
standard deviation is the same as the data. This makes the
standard deviation more practical to use in applications.
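For a population:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
◦ In Excel: =STDEV.P(data range)
For a sample:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
◦ In Excel: =STDEV.S(data range)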
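As a rough illustration of the two denominators, the sketch below computes population and sample versions of the variance and standard deviation with Python's statistics module. The data values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from statistics import pvariance, variance, pstdev, stdev

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = pvariance(data)   # divides by N = 8      (like =VAR.P)
samp_var = variance(data)   # divides by n - 1 = 7  (like =VAR.S)
pop_sd = pstdev(data)       # square root of pop_var  (like =STDEV.P)
samp_sd = stdev(data)       # square root of samp_var (like =STDEV.S)

print(f"population variance = {pop_var}, population std. dev. = {pop_sd}")
print(f"sample variance = {samp_var:.4f}, sample std. dev. = {samp_sd:.4f}")
```

The sample versions are always slightly larger than the population versions for the same data, because they divide by n - 1 instead of N.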
Excel file: Closing Stock
Prices
Intel (INTC):
Mean = $18.81
Standard deviation = $0.50
General Electric (GE):
Mean = $16.19
Standard deviation = $0.35
INTC has the larger standard deviation, so by this
measure it is a higher risk investment than GE.
For many data sets encountered in practice:
Approximately 68% of the observations fall within one
standard deviation of the mean
Approximately 95% fall within two standard deviations of
the mean
Approximately 99.7% fall within three standard deviations
of the mean
These rules are commonly used to characterize
the natural variation in manufacturing processes
and other business phenomena.
The empirical rule comes from the normal distribution.
Most data do not follow a normal distribution!
For any data set (any distribution), the
proportion of values that lie within k (k > 1)
standard deviations of the mean is at least
\( 1 - 1/k^2 \)
Examples:
◦ For k = 2: at least ¾ or 75% of the data lie within two
standard deviations of the mean
◦ For k = 3: at least 8/9 or 89% of the data lie within
three standard deviations of the mean
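The two sets of percentages can be compared side by side. The sketch below computes the normal-distribution shares behind the empirical rule and the distribution-free lower bound 1 - 1/k^2 (this bound is known as Chebyshev's theorem, a name the slides do not use). The function names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def normal_share(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def chebyshev_bound(k):
    """Minimum share within k standard deviations for ANY distribution (k > 1)."""
    return 1 - 1 / k**2

print(f"k=1: normal {normal_share(1):.1%}")
for k in (2, 3):
    print(f"k={k}: normal {normal_share(k):.1%}, any distribution at least {chebyshev_bound(k):.1%}")
```

The bound for arbitrary distributions is much weaker than the normal figure, which is why the distribution of the data matters when applying these rules.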
Standardization
A standardized value, commonly called a z-score,
provides a relative measure of the distance an
observation is from the mean, which is independent of
the units of measurement.
The z-score for the ith observation in a data set is
calculated as follows:
\[ z_i = \frac{x_i - \bar{x}}{s} \]
◦ Excel function: =STANDARDIZE(x, mean, standard_dev).
Standardized data is needed by many predictive
methods since it makes variables comparable.
Purchase Orders Cost per order data
=(B2 - $B$97)/$B$98, or
=STANDARDIZE(B2,$B$97,$B$98).
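A minimal sketch of the same standardization in Python follows. The values are hypothetical stand-ins for the cost-per-order column, and the sample standard deviation is assumed as the divisor.

```python
from statistics import mean, stdev

# Hypothetical values standing in for the cost-per-order column.
costs = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(costs)   # plays the role of cell $B$97
s = stdev(costs)      # plays the role of cell $B$98 (sample standard deviation assumed)

# z-score: distance from the mean measured in standard deviations.
z_scores = [(x - x_bar) / s for x in costs]

for x, z in zip(costs, z_scores):
    print(f"x = {x}, z = {z:+.3f}")

# After standardization the values have mean 0 and standard deviation 1.
print(f"mean of z = {mean(z_scores):.3f}, std. dev. of z = {stdev(z_scores):.3f}")
```

Because every standardized variable ends up with mean 0 and standard deviation 1, variables measured in different units become directly comparable.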
Proportions for Categorical Variables
The proportion, denoted by p, is the fraction of
data that have a certain characteristic.
Proportions are key descriptive statistics for
categorical data, such as defects or errors in
quality control applications or consumer
preferences in market research.
Example: Proportion of female students is 60%.
Proportion of orders placed by Spacetime Technologies
=COUNTIF(A4:A97, "Spacetime Technologies")/94
= 12/94 = 0.128
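The COUNTIF calculation amounts to counting matches and dividing by the number of records. The sketch below shows this on a short hypothetical list of supplier names (only "Spacetime Technologies" appears in the slides; the other names are made up), then repeats the 12/94 arithmetic from the slide.

```python
# Hypothetical supplier column; only "Spacetime Technologies" comes from the slides.
suppliers = [
    "Spacetime Technologies", "Supplier A", "Supplier B", "Spacetime Technologies",
    "Supplier A", "Supplier C", "Supplier B", "Supplier A",
]

# Proportion = number of records with the characteristic / total number of records.
p = suppliers.count("Spacetime Technologies") / len(suppliers)
print(f"proportion in the hypothetical list: {p:.3f}")

# The figure reported on the slide: 12 of 94 orders.
p_slide = 12 / 94
print(f"proportion from the slide: {p_slide:.3f}")
```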
Measures of Association
Correlation
Two variables have a strong statistical relationship
with one another if they appear to “move” together.
When two variables appear to be related, you
might suspect a cause-and-effect relationship.
Caution: Correlation does not prove causation!
Statistical relationships may exist even though a
change in one variable is not caused by a change
in the other.
Covariance is a measure of the linear association between two
variables, X and Y. Like the variance, different formulas are used for
populations and samples.
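Population covariance:
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]
◦ Excel function: =COVARIANCE.P(array1,array2)
Sample covariance:
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
◦ Excel function: =COVARIANCE.S(array1,array2)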
The covariance between X and Y is the average of the product of
the deviations of each pair of observations from their respective
means.
Colleges and
Universities data
Correlation is a measure of the linear relationship between two
variables, X and Y, which does not depend on the units of
measurement.
Correlation is measured by the correlation coefficient, also known as
the Pearson product moment correlation coefficient.
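Correlation coefficient for a population:
\[ \rho_{xy} = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} \]
Correlation coefficient for a sample:
\[ r_{xy} = \frac{\operatorname{cov}(X, Y)}{s_x s_y} \]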
The correlation coefficient is scaled between -1 and 1.
Excel function: =CORREL(array1,array2)
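To make the definitions concrete, the sketch below computes the sample covariance and the sample correlation coefficient by hand on a small hypothetical data set. The helper names are illustrative, not part of any library.

```python
from statistics import mean, stdev

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def sample_correlation(xs, ys):
    """Covariance rescaled by both sample standard deviations; lies in [-1, 1]."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(f"sample covariance = {cov_xy}")
print(f"sample correlation = {r_xy:.4f}")
```

The covariance changes if the units of x or y change, while the correlation coefficient does not, which is why the correlation is the easier of the two to interpret.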
Why is correlation important?
Colleges and Universities data
Is a school's graduation rate related to the SAT score of
incoming students?
Is there a causal relationship?
Data > Data Analysis > Correlation
Excel computes the correlation coefficient
between all pairs of variables in the Input Range.
Input Range data must be in contiguous columns.
Colleges and Universities data
◦ Moderate negative correlation between acceptance rate and
graduation rate, indicating that schools with lower acceptance
rates have higher graduation rates.
◦ Acceptance rate is also negatively correlated with the median
SAT and Top 10% HS, suggesting that schools with lower
acceptance rates have higher student profiles.
◦ The correlations with Expenditures/Student suggest that schools
with higher student profiles spend more money per student.
Value Field Settings include several statistical
measures:
Average
Max and Min
Product
Standard deviation
Variance
Credit Risk Data
First, create a PivotTable.
In the PivotTable Field List, move Job to the Row Labels
field and Checking and Savings to the Values field. Then
change the field settings from “Sum of Checking” and
“Sum of Savings” to the averages.
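Outside Excel, the same "average by group" summary can be sketched with plain Python. The records below are hypothetical (the Credit Risk file is not reproduced here, and the job labels and amounts are invented); the grouping logic mirrors placing Job in the rows and averaging Checking and Savings.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (job, checking, savings) records standing in for the Credit Risk data.
records = [
    ("Skilled", 1000, 500),
    ("Skilled", 3000, 1500),
    ("Unskilled", 200, 0),
    ("Management", 5000, 8000),
    ("Unskilled", 400, 100),
]

# Collect the balances for each job category (the "Row Labels" of the PivotTable).
by_job = defaultdict(lambda: {"checking": [], "savings": []})
for job, checking, savings in records:
    by_job[job]["checking"].append(checking)
    by_job[job]["savings"].append(savings)

# Average of each value field per job category.
averages = {
    job: (mean(values["checking"]), mean(values["savings"]))
    for job, values in by_job.items()
}

for job, (avg_checking, avg_savings) in sorted(averages.items()):
    print(f"{job}: average checking = {avg_checking}, average savings = {avg_savings}")
```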
Outliers
There is no standard definition of what constitutes an
outlier!
Wikipedia: “In statistics, an outlier is an observation point that is
distant from other observations. […] Outliers can occur by
chance in any distribution, but they often indicate either
measurement error or that the population has a heavy-tailed
distribution.”
If the outlier is due to a measurement error then we often want to
exclude it from the analysis.
Some typical rules of thumb:
Normal distribution: z-scores greater than +3 or less than -3
Boxplot:
Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3
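Both rules of thumb can be tried on the eight ages from the earlier outlier example. The sketch below is one possible reading of the rules: the function name, the "inclusive" quartile method, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev, quantiles

ages = [17, 21, 15, 18, 999, 22, 11, 25]

# z-score rule: flag values more than 3 standard deviations from the mean.
x_bar, s = mean(ages), stdev(ages)
z_999 = (999 - x_bar) / s

# Boxplot rule: measure how far a value lies outside the box [Q1, Q3] in IQR units.
q1, _, q3 = quantiles(ages, n=4, method="inclusive")  # quartile method is an assumption
iqr = q3 - q1

def classify(x):
    """Return 'extreme', 'mild', or 'none' according to the boxplot rule."""
    distance = max(q1 - x, x - q3)  # zero or negative when x is inside the box
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "none"

print(f"z-score of 999: {z_999:.2f}")
print(f"boxplot rule for 999: {classify(999)}")
print(f"boxplot rule for 11:  {classify(11)}")
```

In this tiny sample the value 999 inflates the mean and the standard deviation so much that its own z-score stays below 3, while the boxplot rule, which is built on quartiles, flags it as an extreme outlier.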
Home Market Value data
None of the z-scores exceed 3. However, while
individual variables might not exhibit outliers,
combinations of them might.
◦ The last observation has a high market value ($120,700) but
a relatively small house size (1,581 square feet) and may be
an outlier.
Excel file Surgery Infections
◦ Is month 12 simply random variation or some explainable
phenomenon?
Three-standard deviation empirical rule:
There is only a 0.3% (for normally distributed data) or at most an 11% (for any
distribution) chance to see an observation outside +/- 3 standard deviations.
This suggests that month 12 is statistically different from the rest of
the data.