Chapter 3
Exploring Data
Learning outcomes
➢ describe the various central and non-central location measures
➢ calculate and interpret each of these location measures
➢ describe the appropriate central location measure for different data types
➢ describe the various measures of spread (or dispersion)
➢ calculate and interpret each measure of dispersion
➢ describe the concept of skewness
➢ calculate and interpret the coefficient of skewness
➢ explain how to identify and treat outliers
➢ calculate the five-number summary table and construct its box plot
➢ explain how outliers influence the choice of valid descriptive statistical measures
Describing the data profile of a random variable
Measures of location (both central and non-central)
➢ the arithmetic mean (also called the average) – valid for numeric data
➢ the median (also called the second quartile, the middle quartile or the 50th
percentile) – valid for numeric data
➢ the mode (or modal value) – valid for numeric and categorical data
Measures of spread (or dispersion)
➢ Range
➢ Variance
➢ Standard deviation
➢ Coefficient of variation
Measure of shape (skewness)
➢ Symmetrical Distribution
➢ Positively Skewed Distribution
➢ Negatively Skewed Distribution
Measures of Central Tendency
Where data are centred
Advantages:
Disadvantages
➢ 50% of the data values lie below the median and 50% lie above it
➢ Find the median by first identifying the middle position in the data set as
follows:
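The position formula is not reproduced in this text. As a minimal sketch for ungrouped (raw) data, assuming the common convention that the median is the middle value of the sorted data when n is odd and the average of the two middle values when n is even (the function name `median_ungrouped` is illustrative only):

```python
def median_ungrouped(values):
    """Median of raw data: the value at the middle position of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle value exists.
        return ordered[mid]
    # Even n (assumed convention): average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, `median_ungrouped([7, 3, 9])` sorts the data to 3, 7, 9 and returns the middle value.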
Using the cumulative frequency counts of the ‘less than’ ogive summary
table, find the median interval (i.e. the interval that contains the median
position [the (n/2)th data value]).
The median value can be approximated using the midpoint of the median
interval, or calculated using the following formula to give a more representative
median value:
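The formula itself is not reproduced in this text. A commonly used interpolation formula for grouped data is: median = lower limit of the median interval + width × (n/2 − cumulative frequency before the median interval) / frequency of the median interval. The sketch below assumes that formula and equal-width intervals; the function name, parameter names and the frequencies used in the usage note are hypothetical and are not the values of Table 3.3.

```python
def median_grouped(lower_limits, width, freqs):
    """Approximate the median of grouped data by linear interpolation.

    lower_limits: lower limit of each interval, in ascending order
    width:        common interval width (assumed equal for all intervals)
    freqs:        frequency count of each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    position = n / 2          # median position, the (n/2)th data value
    cumulative = 0            # 'less than' cumulative count before the interval
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= position:
            # This is the median interval: interpolate within it.
            return lower + width * (position - cumulative) / f
        cumulative += f
    raise ValueError("median interval not found")
```

With hypothetical intervals starting at 10, 20, 30 and 40 minutes (width 10) and frequencies 3, 9, 12 and 6, the median position is 15, the median interval is 30 to 40, and the sketch gives 30 + 10 × (15 − 12) / 12 = 32.5.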
Median for grouped data - example
Courier Delivery Times Study: A courier company recorded 30 delivery times (in minutes) for
parcels delivered to its clients from its depot. The data are summarised in the numeric
frequency and cumulative frequency distributions shown in Table 3.3.
Median (advantages and disadvantages)
Disadvantages
Advantages
➢ Valid measure of central location for all data types (i.e. categorical and numeric)
➢ For categorical data → the mode defines the most frequently occurring category
➢ For numeric data → the mode is the most frequently occurring data value
(ungrouped) / the midpoint value of a modal interval (grouped)
➢ Not influenced by outliers → represents the most frequently occurring data value
(or response category).
Disadvantages
➢ Representative measure of central location only if the histogram of the numeric
random variable is unimodal (i.e. has one peak only)
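As an illustration of the points above, the following sketch finds the most frequently occurring value(s) for either categorical or ungrouped numeric data. It returns every value tied for the highest count, so more than one result signals that the data are not unimodal. The function name `modes` is illustrative only.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency count."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Keep every value whose count equals the highest count.
    return [value for value, count in counts.items() if count == top]
```

A single returned value can be reported as the mode; several returned values suggest the mode is not a representative measure for that data set.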
Which Central Location Measure is Best?
Depends on:
Data Type
➢ For categorical (nominal or ordinal scaled) data → the mode is the only valid
and representative measure
➢ For numeric (interval or ratio-scaled) data → all three measures (mean, median and
mode) are valid and representative
Outliers
➢ Outliers distort the mean but do not affect the median or the mode.
➢ If outliers are detected in a set of data, choose the median (or mode); the median is
preferred to the mode as it can be used in further analysis.
However, if there are good reasons to remove the outlier(s) from the data set, then
the mean can again be used as the best central location measure.
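A small sketch of this effect, using hypothetical data in which the value 100 is treated as the outlier, and Python's standard `statistics` module:

```python
from statistics import mean, median

# Hypothetical data set; 100 is the assumed outlier.
data = [10, 12, 13, 15, 100]
trimmed = [x for x in data if x != 100]   # outlier removed

mean_with_outlier = mean(data)        # pulled upwards by the outlier
median_with_outlier = median(data)    # unaffected by the outlier's size
mean_without_outlier = mean(trimmed)  # representative again
```

Here the mean drops from 30 to 12.5 once the outlier is removed, while the median of the full data set stays at 13.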
Other Measures of Central Location
Geometric mean
➢ used to find the average of percentage change data, such as
indexes, growth rates or rates of change.
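A minimal sketch of this use, assuming the percentage changes are first converted to growth factors (1 + rate), the geometric mean is taken as the nth root of their product, and the result is converted back to a percentage. The function name `average_growth_rate` is illustrative only.

```python
import math

def average_growth_rate(pct_changes):
    """Average percentage change per period via the geometric mean."""
    if not pct_changes:
        raise ValueError("no percentage changes supplied")
    # Convert each percentage change to a growth factor, e.g. 10% -> 1.10.
    factors = [1 + p / 100 for p in pct_changes]
    if any(f <= 0 for f in factors):
        raise ValueError("growth factors must be positive")
    geometric_mean = math.prod(factors) ** (1 / len(factors))
    # Convert the average growth factor back to a percentage change.
    return (geometric_mean - 1) * 100
```

For instance, a 100% rise followed by a 50% fall leaves the original value unchanged, so the average growth rate is 0%, whereas the arithmetic mean of the two rates would misleadingly give 25%.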
Weighted Mean
➢ Different weights are given to each data value to arrive at an average value
➢ Use when the importance (weight) of each data value is different
Formula
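The formula is not reproduced in this text. The weighted mean is commonly computed as the sum of (weight × value) divided by the sum of the weights; the sketch below assumes that definition, and the names and numbers in it are illustrative only.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of weight * value divided by the sum of weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

With hypothetical values 60 and 80 carrying weights 1 and 3, the result is (60 × 1 + 80 × 3) / 4 = 75, closer to the more heavily weighted value than the unweighted average of 70.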
Weighted Mean (example)
Non-central Location Measures
Quartiles are non-central measures that divide an ordered data set into quarters
(i.e. four equal parts).
The lower quartile, Q1, is that data value that separates the lower (bottom) 25% of
(ordered) data values from the top 75% of ordered data values.
The middle quartile, Q2, is the median. It divides an ordered data set into two
equal halves.
The upper quartile, Q3, is that data value that separates the top (upper) 25% of
(ordered) data values from the bottom 75% of ordered data values.
Quartiles
➢ Calculated in a similar way to the median
➢ The difference lies in the identification of the quartile position and the choice of the
quartile interval.
Steps to calculate quartiles (lower, middle and upper) for ungrouped (raw) data:
➢ Sort the data in ascending order
➢ Count to the quartile position (rounded down to the nearest integer) to find the
(approximate) quartile value.
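The steps above can be sketched as follows. The quartile position formula is not given in this text, so the sketch assumes the position k(n + 1)/4 for the kth quartile (k = 1, 2, 3), rounded down as described; other textbooks and software use different conventions. The function name `approx_quartile` is illustrative only.

```python
import math

def approx_quartile(values, k):
    """Approximate the kth quartile (k = 1, 2 or 3) of raw data."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    ordered = sorted(values)              # step 1: sort in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations")
    # Assumed position formula k(n + 1)/4, rounded down (1-based position).
    position = math.floor(k * (n + 1) / 4)
    position = min(max(position, 1), n)   # keep the position inside the data
    return ordered[position - 1]          # step 2: count to that position
```

For the eleven values 1 to 11, the assumed positions are 3, 6 and 9, so the sketch returns 3, 6 and 9 for the lower, middle and upper quartiles.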
➢ Use a formula similar to the median formula to find both the lower and upper
quartiles
➢ Modify the formula to identify either the lower or the upper quartile position
Once the percentile position is found, apply the same rules as for quartiles to
find the appropriate percentile value.