Chapter 3: Statistics
DESCRIPTIVE STATISTICS: NUMERICAL MEASURES
Numerical Measures
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.
Measures of Location
Mean
o The mean provides a measure of central location.
o The mean of a data set is the average of all the data values.
o The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.
o Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$
o Population Mean: $\mu = \frac{\sum x_i}{N}$
   where: $\sum x_i$ = sum of the values of the N observations
          N = number of observations in the population
Median
o The median of a data set is the value in the middle when the data items are arranged in ascending order.
o Whenever a data set has extreme values, the median is the preferred measure of central location.
o The median is the measure of location most often reported for annual income and property value data.
o A few extremely large incomes or property values can inflate the mean.
o For an odd number of observations, arrange the data in ascending order; the middle value is the median.
o For an even number of observations, arrange the data in ascending order; the average of the middle two values is the median.
Trimmed Mean
o Another measure sometimes used when extreme values are present.
o It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
o For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
Mode
o The mode of a data set is the value that occurs with greatest frequency.
o The greatest frequency can occur at two or more different values.
o If the data have more than two modes, the data are multimodal.
Weighted Mean
o In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
o The choice of weights depends on the application.
o The weights might be the number of credit hours earned for each grade, as in GPA.
o In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
o $\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
   where: $x_i$ = value of observation i
          $w_i$ = weight for observation i
   Numerator: sum of the weighted data values
   Denominator: sum of the weights
Geometric Mean
o The geometric mean is calculated by finding the nth root of the product of n values.
o It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).
o It should be applied anytime you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks, . . .).
o Other common applications include: changes in populations of species, crop yields, pollution levels, and birth and death rates.
o $\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
Percentiles
o A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
o Admission test scores for colleges and universities are frequently reported in terms of percentiles.
o The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
o Arrange the data in ascending order.
o Compute $L_p$, the location of the pth percentile.
   $L_p = \frac{p}{100}(n + 1)$
Quartiles
o Quartiles are specific percentiles.
o First Quartile = 25th Percentile
o Second Quartile = 50th Percentile = Median
o Third Quartile = 75th Percentile
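The measures of location in these notes can be tried out with a short Python sketch. It is only an illustration: the function names, the sample data, and the GPA-style weights are hypothetical, and the percentile function assumes linear interpolation between neighbouring ordered values when the location $L_p = \frac{p}{100}(n + 1)$ is not a whole number (the notes give only the location formula).

```python
import math
import statistics


def weighted_mean(values, weights):
    """Sum of the weighted data values divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


def trimmed_mean(values, percent):
    """Remove `percent`% of the smallest and of the largest values, then average."""
    data = sorted(values)
    k = int(len(data) * percent / 100)  # number of values cut from each end
    return statistics.mean(data[k:len(data) - k])


def percentile(values, p):
    """pth percentile located at L_p = (p/100)(n + 1).

    Assumption: interpolate linearly when L_p is not an integer, and
    clamp to the smallest or largest value when L_p falls outside 1..n.
    """
    data = sorted(values)
    n = len(data)
    loc = (p / 100) * (n + 1)
    lower = math.floor(loc)
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    frac = loc - lower
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


# Hypothetical data set with one extreme value (32).
scores = [2, 4, 4, 5, 7, 9, 10, 12, 15, 32]

print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
print("modes:", statistics.multimode(scores))
print("10% trimmed mean:", trimmed_mean(scores, 10))
print("quartiles:", [percentile(scores, p) for p in (25, 50, 75)])

# Hypothetical GPA: grade points weighted by credit hours.
print("GPA:", weighted_mean([4, 3, 2], [3, 4, 3]))
```

With this data the extreme value pulls the mean above the median, which is the situation in which the notes recommend the median or a trimmed mean.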
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of location.
o Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Standard Deviation
o The standard deviation of a data set is the positive square root of the variance.
o It is measured in the same units as the data, making it more easily interpreted than the variance.
o Sample SD: $s = \sqrt{s^2}$
o Population SD: $\sigma = \sqrt{\sigma^2}$
Coefficient of Variation
o The coefficient of variation indicates how large the standard deviation is in relation to the mean.
o Sample CoV: $\left[\frac{s}{\bar{x}} \times 100\right]\%$
o Population CoV: $\left[\frac{\sigma}{\mu} \times 100\right]\%$
Chebyshev's Theorem
o At least 75% of the data values must be within z = 2 standard deviations of the mean.
o At least 89% of the data values must be within z = 3 standard deviations of the mean.
o At least 94% of the data values must be within z = 4 standard deviations of the mean.
Empirical Rule
o When the data are believed to approximate a bell-shaped distribution:
   - The empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
   - The empirical rule is based on the normal distribution, which is covered in Chapter 6.
o For data having a bell-shaped distribution:
   - Approximately 68% of the data values will be within +/- 1 standard deviation of the mean.
   - Approximately 95% of the data values will be within +/- 2 standard deviations of the mean.
   - Almost all of the data values will be within +/- 3 standard deviations of the mean.
Detecting Outliers
o An outlier is an unusually small or unusually large value in a data set.
o A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
o It might be:
   - an incorrectly recorded data value
   - a data value that was incorrectly included in the data set
   - a correctly recorded unusual data value that belongs in the data set
Correlation Coefficient
o Sample CC: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
o Population CC: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
o The coefficient can take on values between -1 and +1.
o Values near -1 indicate a strong negative linear relationship.
o Values near +1 indicate a strong positive linear relationship.
o The closer the correlation is to zero, the weaker the relationship.
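The variability measures, the Chebyshev bounds, the z-score outlier check, and the sample correlation coefficient can be sketched in Python as follows. Several details are assumptions, since these notes do not spell them out: the z-score is taken as (x - mean) / s, the sample variance and sample covariance divide by n - 1, and the Chebyshev percentages are assumed to come from the bound 1 - 1/z^2. All function names and data are hypothetical.

```python
import statistics


def coefficient_of_variation(values, population=False):
    """Standard deviation expressed as a percentage of the mean."""
    sd = statistics.pstdev(values) if population else statistics.stdev(values)
    return sd / statistics.mean(values) * 100


def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations (z > 1).

    Assumption: the standard bound 1 - 1/z^2 from Chebyshev's theorem.
    """
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2


def z_scores(values):
    """Assumed definition: z_i = (x_i - sample mean) / sample std. deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def outliers_by_z(values, threshold=3):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), with s_xy computed using n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))


for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.0%} of the values")

# Hypothetical data: twenty ordinary values and one extreme value.
readings = [10] * 20 + [100]
print("possible outliers:", outliers_by_z(readings))
print("r_xy:", sample_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

A z-score beyond +/- 3 only flags a value for review; as the notes say, it may still be a correctly recorded value that belongs in the data set.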
Data Dashboards: Adding Numerical Measures to Improve Effectiveness
Data dashboards are not limited to graphical displays.
The addition of numerical measures, such as the mean and standard deviation of KPIs, to a data dashboard is often critical.
Dashboards are often interactive.
Drilling down refers to functionality in interactive dashboards that allows the user to access information and analyses at an increasingly detailed level.
Five-Number Summaries and Box Plots
Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
Five-Number Summary
o Smallest Value
o First Quartile
o Median
o Third Quartile
o Largest Value
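A five-number summary can be computed with Python's standard library, as in the minimal sketch below. The function name and data are hypothetical. The sketch relies on `statistics.quantiles` with its default "exclusive" method, which places quartiles using an (n + 1)-based position and interpolates between neighbouring values; other software may use a different quartile convention and give slightly different results.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a data set."""
    data = sorted(values)
    # n=4 splits the data into quartiles; the default method is "exclusive".
    q1, median, q3 = statistics.quantiles(data, n=4)
    return data[0], q1, median, q3, data[-1]


# Hypothetical data set.
scores = [2, 4, 4, 5, 7, 9, 10, 12, 15, 32]
print(five_number_summary(scores))
```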
Box Plot
o A box plot is a graphical summary of data that is based on a five-number summary.
o A key to the development of a box plot is the computation of the median and the
quartiles Q1 and Q3.
o Box plots provide another way to identify outliers.
o Limits are located (not drawn) using the interquartile range (IQR).
o Data outside these limits are considered outliers.
o The location of each outlier is marked with a symbol.
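The outlier limits of a box plot can be sketched in Python as shown here. The notes say only that the limits are located using the IQR; the multiplier of 1.5 used below is an assumption (a common convention), as are the function names and the data.

```python
import statistics


def box_plot_limits(values, k=1.5):
    """Lower and upper limits located k * IQR beyond Q1 and Q3.

    Assumption: k = 1.5, which these notes do not state.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def box_plot_outliers(values, k=1.5):
    """Data values that fall outside the box plot limits."""
    lower, upper = box_plot_limits(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical data set with one extreme value.
scores = [2, 4, 4, 5, 7, 9, 10, 12, 15, 32]
print("limits:", box_plot_limits(scores))
print("outliers:", box_plot_outliers(scores))
```

The limits themselves are not drawn on the plot; they only decide which points are marked individually as outliers.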