
Chapter 3

The document discusses various measures for describing numeric data, including central tendency (mean, median, mode), dispersion (range, variance, standard deviation), and shape (skewness). It explains how to calculate and interpret these statistics, and how to identify and treat outliers. The appropriate measure depends on factors like the data type and presence of outliers. For example, the median and mode are less impacted by outliers than the mean.

Uploaded by

Evelyn Maile

Learning unit 2

Exploring Data

Week 2: Summarising Data: Summary Tables and Graphs

Week 3: Describing Data: Numeric Descriptive Statistics


Describing Data: Numeric Descriptive Statistics

Learning outcomes
➢ describe the various central and non-central location measures
➢ calculate and interpret each of these location measures
➢ describe the appropriate central location measure for different data types
➢ describe the various measures of spread (or dispersion)
➢ calculate and interpret each measure of dispersion
➢ describe the concept of skewness
➢ calculate and interpret the coefficient of skewness
➢ explain how to identify and treat outliers
➢ calculate the five-number summary table and construct its box plot
➢ explain how outliers influence the choice of valid descriptive statistical measures
Describing the data profile of a random variable
Measures of location (both central and non-central)
➢ the arithmetic mean (also called the average) – valid for numeric data
➢ the median (also called the second quartile, the middle quartile or the 50th
percentile) – valid for numeric data
➢ the mode (or modal value) – valid for numeric and categorical data
Measures of spread (or dispersion)
➢ Range
➢ Variance
➢ Standard deviation
➢ Coefficient of variation
Measure of shape (skewness)
➢ Symmetrical Distribution
➢ Positively Skewed Distribution
➢ Negatively Skewed Distribution
Measures of Central Tendency
Where the data are centred

Real Equity Returns of 16 Major Equity Markets

Market       Return    Market            Return
Australia    9.0%      Japan             9.3%
Belgium      4.8%      Netherlands       7.7%
Canada       7.7%      South Africa      9.1%
Denmark      6.2%      Spain             5.8%
France       6.3%      Sweden            9.9%
Germany      8.8%      Switzerland       6.9%
Ireland      7.0%      United Kingdom    7.6%
Italy        6.8%      United States     8.7%

Arithmetic mean = 7.6%: the "centre of gravity" of the data, but sensitive to extremely large or small values (outliers)
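As a quick check of the 7.6% figure, the following minimal Python sketch averages the 16 returns listed in the table. The variable names are illustrative only.

```python
from statistics import mean

# Real equity returns (%) for the 16 markets in the table
returns = {
    "Australia": 9.0, "Belgium": 4.8, "Canada": 7.7, "Denmark": 6.2,
    "France": 6.3, "Germany": 8.8, "Ireland": 7.0, "Italy": 6.8,
    "Japan": 9.3, "Netherlands": 7.7, "South Africa": 9.1, "Spain": 5.8,
    "Sweden": 9.9, "Switzerland": 6.9, "United Kingdom": 7.6,
    "United States": 8.7,
}

# Arithmetic mean = sum of all values / number of values
arithmetic_mean = mean(list(returns.values()))
print(f"Arithmetic mean = {arithmetic_mean:.1f}%")  # Arithmetic mean = 7.6%
```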


Arithmetic Mean for Grouped Numeric Data

When numeric data are grouped into intervals and shown in a numeric frequency distribution, the arithmetic mean can be approximated by:

➢ finding the midpoint of each interval, which represents all the x values in that interval

➢ multiplying each interval's midpoint by its frequency count

➢ summing these products over all intervals

➢ dividing the total by the sample size, n.
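In symbols, the four steps can be summarised as follows, where xi is the midpoint and fi the frequency count of interval i. The number of intervals, k, is a symbol introduced here for illustration.

```latex
\bar{x} \approx \frac{\sum_{i=1}^{k} f_i x_i}{n}, \qquad n = \sum_{i=1}^{k} f_i
```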


Arithmetic mean for grouped numeric data

Raw data: fuel consumption of 20 trucks

Truck   Fuel (km/l)     Truck   Fuel (km/l)
1       13              11      11
2       11              12      8
3       10              13      16
4       13              14      16
5       10              15      11
6       13              16      9
7       8               17      11
8       10              18      13
9       10              19      7
10      13              20      12

Grouped data

Interval     Midpoint xi   Frequency fi   xi fi
6 - < 9      7.5           4              30
9 - < 12     10.5          9              94.5
12 - < 15    13.5          5              67.5
15 - < 18    16.5          2              33
Total                      n = 20         225

x̄ = 225/20 = 11.25 km/litre
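The grouped calculation can be reproduced with a short Python sketch. It uses only the midpoints and frequency counts from the grouped table; the variable names are illustrative.

```python
# Interval midpoints (km/l) and frequency counts from the grouped table
midpoints = [7.5, 10.5, 13.5, 16.5]
frequencies = [4, 9, 5, 2]

# Sample size is the sum of the frequency counts
n = sum(frequencies)

# Multiply each midpoint by its frequency and sum the products
total = sum(x * f for x, f in zip(midpoints, frequencies))

# Approximate mean = total / n
grouped_mean = total / n
print(n, total, grouped_mean)  # 20 225.0 11.25
```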
Arithmetic mean (advantages and disadvantages)

Advantages:

➢ It uses all the data values in its calculation

➢ It is an unbiased statistic (meaning that, on average, it represents the true mean)

Disadvantages:

➢ It is not appropriate for categorical (i.e. nominal or ordinal-scaled) data; it can only be applied to numeric (i.e. interval and ratio-scaled) data.

➢ It is distorted by outliers. An outlier is an extreme value in a data set.


Median for ungrouped data

➢ The middle number of an ordered set of data

➢ Divides an ordered set of data values into two equal halves

➢ 50% of the data values lie below the median and 50% lie above it

To calculate the median for ungrouped (raw) numeric data:

➢ Arrange the n data values in ascending order.

➢ Find the median by first identifying the middle position in the data set as follows:

If n is odd: median position = (n + 1) / 2

If n is even: median positions = n/2 and (n + 2)/2 (the middle two items; the median is their average)


Median for ungrouped data - example
Outliers - example
Problem 4: P/Es for a Client Portfolio (sorted by P/E)

Stock   Price   EPS    P/E
A       16.83   1.23   13.68
D       16.54   1.06   15.60
C       86.92   4.95   17.56
B       60.83   3.19   19.07
G       38.66   1.84   21.01
F       28.43   1.11   25.61
E       13.30   0.03   443.33

Mean P/E = 79.41 (inflated by the outlier, stock E)

Median position: n = 7 is odd, so the position is (n + 1) / 2 = 4 (for an even n, the positions would be n/2 and (n + 2)/2, the middle two items)

Median P/E = 19.07: a better indication of central location (not affected by the outlier)
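The effect of the outlier on the two measures can be checked with a minimal Python sketch that uses the seven P/E ratios from the portfolio table. The variable names are illustrative.

```python
from statistics import mean, median

# P/E ratios of the seven stocks in the client portfolio
pe_ratios = {
    "A": 13.68, "D": 15.60, "C": 17.56, "B": 19.07,
    "G": 21.01, "F": 25.61, "E": 443.33,
}

values = sorted(pe_ratios.values())

# The mean uses every value, so stock E pulls it far above the other six P/Es
mean_pe = mean(values)

# With n = 7 (odd), the median is the value in position (7 + 1) / 2 = 4
median_pe = median(values)

print(round(mean_pe, 2))  # 79.41
print(median_pe)          # 19.07
```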
Median for grouped data
Graphical approach
Using the ‘less than’ ogive graph, the median value is found by reading off the
data value on the x-axis that is associated with the 50% cumulative frequency
located on the y-axis.
Arithmetic approach
Based on the sample size, n, calculate n/2 to find the median position.

Using the cumulative frequency counts of the ‘less than’ ogive summary
table, find the median interval (i.e. the interval that contains the median
position [the (n/2)th data value]).

The median value can be approximated using the midpoint of the median
interval, or calculated using the following formula to give a more representative
median value:
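The formula is not shown in these notes. A common form of the interpolation formula for a grouped median, given here as an assumption about what the slide contained, is:

```latex
M_e = O_{me} + \frac{c\left[\frac{n}{2} - f(<)\right]}{f_{me}}
```

Here O_me is the lower limit of the median interval, c is the interval width, n is the sample size, f(<) is the cumulative frequency of all intervals before the median interval, and f_me is the frequency count of the median interval. These symbol names are assumed for illustration.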
Median for grouped data - example
Courier Delivery Times Study: A courier company recorded 30 delivery times (in minutes) to
deliver parcels from its depot to its clients. The data are summarised in the numeric
frequency and cumulative frequency distributions shown in Table 3.3.
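The values of Table 3.3 are not reproduced in these notes, so the sketch below applies the arithmetic approach to the truck fuel frequency table from the grouped-mean example instead. It finds the median interval from the cumulative counts and then interpolates linearly inside it. The helper name `grouped_median` and the assumption of equal-width intervals are illustrative choices, not part of the original material.

```python
def grouped_median(lower_limits, width, frequencies):
    """Interpolated median for equal-width intervals (hypothetical helper)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequencies must contain at least one positive count")
    position = n / 2  # median position
    cumulative = 0    # cumulative frequency of all intervals before the current one
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= position:
            # This is the median interval: interpolate within it
            return lower + width * (position - cumulative) / freq
        cumulative += freq
    raise ValueError("median interval not found")


# Truck fuel consumption (km/l): intervals 6-<9, 9-<12, 12-<15, 15-<18
lower_limits = [6, 9, 12, 15]
frequencies = [4, 9, 5, 2]

median_fuel = grouped_median(lower_limits, 3, frequencies)
print(median_fuel)  # 11.0
```

With n = 20 the median position is 10, which falls in the interval 9 - < 12 (cumulative count 13), giving 9 + 3 × (10 − 4) / 9 = 11 km/l.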
Median (advantages and disadvantages)

Advantage over the mean

➢ it is not affected by outliers → a more representative measure of central location than the mean when significant outliers occur in a set of data.

Disadvantages

➢ it cannot be calculated for categorical data; it can only be applied to numeric data

➢ it is more affected by sampling fluctuations than the mean, as it uses only the middle data values (and not all the data values), and is therefore less stable than the mean.
Mode
➢ the most frequently occurring value in a set of data
➢ can be calculated both for categorical data and numeric data

To calculate the mode:


Ungrouped data
➢ rank the data from lowest to highest
➢ identify the data value that occurs most frequently.
Large samples of discrete or categorical (nominal and ordinal-scaled) data:
➢ construct a categorical frequency table
➢ identify the modal value or modal category that occurs most frequently.
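The frequency-table approach can be sketched in Python. The example below counts the 20 raw truck fuel readings from the grouped-mean example and reports every value that shares the highest count, so a multimodal data set would return more than one value. The variable names are illustrative.

```python
from collections import Counter

# Fuel consumption (km/l) of the 20 trucks, in truck order
fuel = [13, 11, 10, 13, 10, 13, 8, 10, 10, 13,
        11, 8, 16, 16, 11, 9, 11, 13, 7, 12]

# Frequency table: data value -> number of occurrences
counts = Counter(fuel)

# The mode is the value (or values) with the highest frequency count
highest = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)  # [13] 5
```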
Mode – example
Refer to previous example – Courier Delivery Times
Mode (advantages and disadvantages)

Advantages
➢ Valid measure of central location for all data types (i.e. categorical and numeric)
➢ For categorical data → the mode defines the most frequently occurring category
➢ For numeric data → the mode is the most frequently occurring data value
(ungrouped) / the midpoint value of a modal interval (grouped)
➢ Not influenced by outliers → represents the most frequently occurring data value
(or response category).

Disadvantages
➢ Representative measure of central location only if the histogram of the numeric
random variable is unimodal (i.e. has one peak only)
Which Central Location Measure is Best?
Depends on:
Data Type
➢ For categorical (nominal or ordinal-scaled) data → the mode is the only valid
and representative measure
➢ For numeric (interval or ratio-scaled) data → all three measures (mean, median and
mode) are valid and representative

Outliers
➢ Outliers distort the mean but do not affect the median or the mode.
➢ If outliers are detected in a set of data, choose the median (or mode); the median is
preferred to the mode as it can be used in further analysis.

However, if there are good reasons to remove the outlier(s) from the data set, then
the mean can again be used as the best central location measure.
Other Measures of Central Location

Geometric mean
➢ used to find the average of percentage change data, such as
indexes, growth rates or rates of change.

When each data value is calculated from a different base, the appropriate measure of central location is the geometric mean.
Example: Share Price at end of
Week 1 = R25
Week 2 = R30
Week 3 = R33
% change week 1 to 2 = 20% [(MVend/MVbegin – 1) × 100]
% change week 2 to 3 = 10%
Geometric Mean (example)

The percentage changes must be expressed as decimal values. For example, a 7% increase must be written as 1.07 (1 + 0.07) and a 4% decrease must be written as 0.96 (1 + (−0.04)).
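The share price example can be worked through in Python. The sketch assumes the standard definition of the geometric mean, the n-th root of the product of the n growth factors, since the formula itself is not shown in these notes. The variable names are illustrative.

```python
import math

# Share price (R) at the end of weeks 1, 2 and 3
prices = [25, 30, 33]

# Growth factor for each week: later price / earlier price (about 1.20 and 1.10)
factors = [later / earlier for earlier, later in zip(prices, prices[1:])]

# Geometric mean = n-th root of the product of the n growth factors
geometric_mean = math.prod(factors) ** (1 / len(factors))

# Convert the average growth factor back to a percentage change
average_pct_change = (geometric_mean - 1) * 100
print(f"{average_pct_change:.2f}%")  # 14.89%
```

Applying 14.89% growth twice to R25 recovers the final price of R33, whereas the arithmetic average of 20% and 10% (15%) would overshoot it slightly.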
Other Measures of Central Location (continued)

Weighted Mean
➢ Different weights are given to each data value to arrive at an average value
➢ Use when the importance (weight) of each data value is different

To calculate the weighted arithmetic mean:


➢ Each observation (xi) is first multiplied by its weight, fi (e.g. its frequency count)
➢ The weighted observations are then summed
➢ The sum is then divided by the sum of the weights

Formula
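The formula is not shown in these notes. Based on the three steps listed, it presumably takes the standard form below, where the label for the weighted mean is an assumed notation:

```latex
\bar{x}_w = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i}
```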
Weighted Mean (example)
Non-central Location Measures

Quartiles are non-central measures that divide an ordered data set into quarters
(i.e. four equal parts).

The lower quartile, Q1, is that data value that separates the lower (bottom) 25% of
(ordered) data values from the top 75% of ordered data values.

The middle quartile, Q2, is the median. It divides an ordered data set into two
equal halves.

The upper quartile, Q3, is that data value that separates the top (upper) 25% of
(ordered) data values from the bottom 75% of ordered data values.
Non-central Location Measures

Quartiles
➢ Calculated in a similar way to the median
➢ Difference lies in the identification of the quartile position & the choice of the quartile
interval.

Steps to calculate quartiles (lower, middle and upper) for ungrouped (raw) data:
➢ Sort the data in ascending order

➢ Each quartile position is determined as follows (regardless of whether n is even or odd):

➢ Count to the quartile position (rounded down to the nearest integer) to find the
(approximate) quartile value.

Quartile value = approximate quartile value + fraction part of quartile position × (consecutive value after quartile position − approximate quartile value)
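The quartile rule can be sketched in Python. The quartile position formulas are not shown in these notes, so the sketch assumes positions of 0.25(n + 1), 0.50(n + 1) and 0.75(n + 1), by analogy with the 0.40(n + 1) percentile position given later in these notes. The helper name `value_at_position` is hypothetical, and the data are the 16 equity returns from the arithmetic mean example.

```python
import math


def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position (hypothetical helper).

    Counts to the position rounded down, then adds the fraction part times
    the gap to the next value, as in the quartile value rule.
    """
    if not 1 <= position <= len(sorted_values):
        raise ValueError("position must lie between 1 and n")
    whole = math.floor(position)
    fraction = position - whole
    approximate = sorted_values[whole - 1]
    if fraction == 0:
        return approximate
    next_value = sorted_values[whole]
    return approximate + fraction * (next_value - approximate)


# Step 1: sort the data in ascending order (real equity returns, %)
returns = sorted([9.0, 4.8, 7.7, 6.2, 6.3, 8.8, 7.0, 6.8,
                  9.3, 7.7, 9.1, 5.8, 9.9, 6.9, 7.6, 8.7])
n = len(returns)

# Step 2: assumed quartile positions, 4.25, 8.5 and 12.75 for n = 16
q1 = value_at_position(returns, 0.25 * (n + 1))
q2 = value_at_position(returns, 0.50 * (n + 1))
q3 = value_at_position(returns, 0.75 * (n + 1))

# Approximately 6.425, 7.65 and 8.95
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}")
```

Other textbooks and software packages use slightly different position rules, so results may differ in the second decimal place.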
Quartiles (Example)
Non-central Location Measures
Quartiles for grouped data

➢ Use a formula similar to the median formula to find both the lower and upper quartiles

➢ Modify formula to identify either the lower or the upper quartile position

➢ Then find the lower or upper quartile interval
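The modified formulas are not shown in these notes. Assuming the same linear interpolation as for a grouped median, with the position n/2 replaced by n/4 for the lower quartile and 3n/4 for the upper quartile, they would take the following form:

```latex
Q_1 = O_{q1} + \frac{c\left[\frac{n}{4} - f(<)\right]}{f_{q1}}, \qquad
Q_3 = O_{q3} + \frac{c\left[\frac{3n}{4} - f(<)\right]}{f_{q3}}
```

Here O is the lower limit of the relevant quartile interval, c is the interval width, n is the sample size, f(<) is the cumulative frequency of all intervals before the quartile interval, and the denominator is the frequency count of the quartile interval. All symbol names are assumed for illustration.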


Quartiles (Example - grouped data)
Percentiles
Similar to quartiles
lower quartile = 25th percentile
upper quartile = 75th percentile

Percentiles are calculated in the same way as quartiles


➢ First find the percentile position
➢ Then identify the percentile value in that position.

Example: to find the 40th percentile

➢ the percentile position is 0.40(n + 1)

Once the percentile position is found, apply the same rules as for quartiles to
find the appropriate percentile value.
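As a worked illustration, the 40th percentile of the 20 truck fuel readings from the grouped-mean example can be found as follows. In ascending order the readings are 7, 8, 8, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 16, 16, so the 8th value is 10 and the 9th value is 11. The notation x_(k) for the k-th ordered value is introduced here for illustration.

```latex
\begin{aligned}
\text{position} &= 0.40\,(n + 1) = 0.40 \times 21 = 8.4 \\
P_{40} &= x_{(8)} + 0.4\,\bigl(x_{(9)} - x_{(8)}\bigr) = 10 + 0.4\,(11 - 10) = 10.4 \text{ km/l}
\end{aligned}
```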
