Idl 3


MATH 353: STATISTICS

Dr. Eric Nimako Aidoo


0202901980

Numerical Summary
Raw data
The data represent the highest temperature recorded by a
remote sensor in 50 countries.

112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
110, 118, 117, 116, 118, 112, 114, 114, 105, 109,
107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
120, 113, 120, 117, 105, 110, 118, 112, 114, 114.

What can you say about this data?


Measures of Central Tendency
Measures of Central Tendency: These measures, often referred to as averages, describe the centre of a given data set. They are useful for summarizing a data set.

The three measures of central tendency are:

1. The Arithmetic Mean
2. The Median
3. The Mode
The Arithmetic Mean

The Arithmetic Mean is the most widely used measure of location and shows the central value of the data. It is calculated by summing the values and dividing by the number of values.
The Arithmetic Mean
Population Mean

For ungrouped data, the Population Mean is the sum of all the population values divided by the total number of population values:

$$\mu = \frac{\sum X}{N}$$

where
µ is the population mean,
N is the total number of observations,
X is a particular value,
Σ indicates the operation of adding.


The Arithmetic Mean: Example
Example 1

Aidoo’s family owns four cars. The following is the current mileage on each of the four cars:
56,000, 42,000, 23,000, 73,000.
Find the mean mileage for the cars.

$$\mu = \frac{\sum X}{N} = \frac{56{,}000 + \dots + 73{,}000}{4} = 48{,}500$$
The Arithmetic Mean
Sample Mean

For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values:

$$\bar{X} = \frac{\sum X}{n}$$

where n is the total number of values in the sample.
The Arithmetic Mean: Example

The number of days travelled by five drones to record an event in space last year:
14, 15, 17, 16, 15.

$$\bar{X} = \frac{\sum X}{n} = \frac{14 + 15 + 17 + 16 + 15}{5} = \frac{77}{5} = 15.4$$
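
The two worked examples above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` and the variable names are illustrative and do not come from the lecture.

```python
def arithmetic_mean(values):
    """Sum the values and divide by the number of values."""
    values = list(values)
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


car_mileage = [56_000, 42_000, 23_000, 73_000]  # Example 1 (treated as a population)
drone_days = [14, 15, 17, 16, 15]               # sample of five drones

print(arithmetic_mean(car_mileage))  # 48500.0
print(arithmetic_mean(drone_days))   # 15.4
```

The same computation serves for both the population mean and the sample mean; only the interpretation of the data (whole population or sample) differs.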
The Arithmetic Mean

The Mean of Grouped Data

The mean of a sample of data organized in a frequency distribution is computed by the following formula:

$$\bar{X} = \frac{\sum fX}{n}$$

where X and f represent the midpoint and frequency of each class respectively, and n is the total frequency.
The Arithmetic Mean for grouped data

The minimum temperatures of twenty districts in Ghana, recorded by drone in December last year, are presented in the frequency distribution table. Compute the mean minimum temperature.

Class limit   Frequency (f)   Midpoint (x)   fx
6 – 10        1               8              8
11 – 15       2               13             26
16 – 20       3               18             54
21 – 25       5               23             115
26 – 30       4               28             112
31 – 35       3               33             99
36 – 40       2               38             76
              n = 20                         Σfx = 490

$$\bar{X} = \frac{\sum fx}{n} = \frac{490}{20} = 24.5$$
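
The grouped-mean calculation can be sketched in Python as follows. The representation of each class as a `(lower limit, upper limit, frequency)` tuple and the name `grouped_mean` are assumptions made for illustration.

```python
def grouped_mean(classes):
    """Mean of grouped data.

    classes: iterable of (lower_limit, upper_limit, frequency) tuples.
    Each class is represented by its midpoint, as in the formula sum(f*x)/n.
    """
    total_fx = 0.0
    n = 0
    for lower, upper, freq in classes:
        midpoint = (lower + upper) / 2
        total_fx += freq * midpoint
        n += freq
    if n == 0:
        raise ValueError("the total frequency must be positive")
    return total_fx / n


temperature_classes = [
    (6, 10, 1), (11, 15, 2), (16, 20, 3), (21, 25, 5),
    (26, 30, 4), (31, 35, 3), (36, 40, 2),
]

print(grouped_mean(temperature_classes))  # 24.5
```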
The Arithmetic Mean

Characteristics of the Arithmetic Mean

• The mean is unique for any set of numerical data.
• All the values are included in computing the mean.
• It is applicable to quantitative data only.
• The mean is affected by extremely large or small data values.
The Median

The Median is the midpoint of the values after they have been ordered from the smallest to the largest. There are as many values above the median as below it in the data array.

For an even set of values, the median will be the arithmetic average of the two middle numbers.
The Median: Example

The number of traffic crashes that occurred on a road segment over a five-year period:
21, 25, 19, 20, 22.
Calculate the median number of crashes.

Arranging the data in ascending order gives:

19, 20, 21, 22, 25.

Thus the median number of crashes is 21.


The Median: Example

The number of minutes travelled by four missiles to hit a target at the same location are:
76, 73, 80, 75.

Arranging the data in ascending order gives:

73, 75, 76, 80.

The median is given by (75 + 76)/2 = 75.5.
Thus, the median is 75.5.
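
Both median examples (odd and even number of values) follow the same procedure, sketched below in Python. The function name `median` is illustrative.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([21, 25, 19, 20, 22]))  # 21
print(median([76, 73, 80, 75]))      # 75.5
```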
The Median: Grouped Data

The Median of a sample of data organized in a frequency distribution is computed by:

$$\text{Median} = L + \frac{\frac{n}{2} - CF}{f_m}\,(w)$$

where L is the lower class boundary of the median class,
CF is the cumulative frequency before the median class,
fm is the frequency of the median class,
w is the class width,
n is the total number of values in the sample.
The Median

Finding the Median Class

To determine the median class for grouped data:

• Construct a cumulative frequency distribution.
• Divide the total number of data values by 2.
• Determine which class will contain this value. For example, if n = 50, then 50/2 = 25, so determine which class contains the 25th value.
The Median: Example

Class limit Class boundary Frequency (f) Cum. Freq.

6 – 10 5.5 – 10.5 1 1
11 – 15 10.5 – 15.5 2 3
16 – 20 15.5 – 20.5 3 6
21 – 25 20.5 – 25.5 5 11
26 – 30 25.5 – 30.5 4 15
31 – 35 30.5 – 35.5 3 18
36 - 40 35.5 – 40.5 2 20
n=20
From the table,

L = 20.5, n = 20, fm = 5, w = 5, CF = 6

$$\text{Median} = L + \frac{\frac{n}{2} - CF}{f_m}\,(w) = 20.5 + \frac{\frac{20}{2} - 6}{5}\,(5) = 24.5$$
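
The steps for locating the median class and applying the formula can be sketched in Python. The tuple layout `(lower boundary, upper boundary, frequency)` and the name `grouped_median` are assumptions for illustration; the classes are assumed to be listed in ascending order.

```python
def grouped_median(classes):
    """Median of grouped data by interpolation within the median class.

    classes: list of (lower_boundary, upper_boundary, frequency) tuples,
    in ascending order of the class boundaries.
    """
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("the total frequency must be positive")
    half = n / 2
    cumulative = 0  # CF: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower  # w
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median class not found")


temperature_classes = [
    (5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
    (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2),
]

print(grouped_median(temperature_classes))  # 24.5
```

Here the cumulative frequency first reaches n/2 = 10 in the class 20.5 – 25.5, which is why that class supplies L, fm and w.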
The Median

Characteristics of the Median

• There is a unique median for each data set.
• It is not affected by extremely large or small values and is therefore a valuable measure of location when such values occur.
• It can be computed for ratio-level, interval-level, and ordinal-level data.
• 50% of the observations lie above the median and 50% fall below it.
The Mode

The Mode is another measure of location and represents the value of the observation that appears most frequently.

Example 6: The number of minutes travelled by ten different satellites to record an event in a location are:
81, 93, 84, 75, 68, 87, 81, 75, 81, 87.
Find the mode.

Because the value 81 occurs most often, it is the mode.
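
Counting how often each value occurs gives the mode directly. The sketch below returns a list so that data with more than one mode are also handled; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("the mode of an empty data set is undefined")
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


minutes = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]

print(modes(minutes))  # [81]
```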
The Mode: Grouped Data

The Mode for grouped data is approximated by the midpoint of the class with the largest class frequency.
Class limit Class boundary Frequency (f) Cum. Freq.
6 – 10 5.5 – 10.5 1 1
11 – 15 10.5 – 15.5 2 3
16 – 20 15.5 – 20.5 3 6
21 – 25 20.5 – 25.5 5 11
26 – 30 25.5 – 30.5 4 15
31 – 35 30.5 – 35.5 3 18
36 - 40 35.5 – 40.5 2 20
n=20
The modal class is 21–25, with a modal frequency of 5, so the mode is approximately its midpoint, 23.
The Mode

Characteristics of the Mode

• Data can have more than one mode. If it has two modes, it is referred to as bimodal.
• The mode is not influenced by extreme values.
• The mode can be found for both quantitative and qualitative data.
The Measures of Dispersion
Measures of Dispersion: These measures describe the spread or variability in a data set.

Measures of dispersion include the following: range, mean deviation, variance, standard deviation and coefficient of variation.
The Range

Range = Largest value – Smallest value


The following represents the minimum temperature
recorded by a remote sensor for 25 countries in Europe.
-8.1 3.2 5.9 8.1 12.3
-5.1 4.1 6.3 9.2 13.3
-3.1 4.6 7.9 9.5 14.0
-1.4 4.8 7.9 9.7 15.0
1.2 5.7 8.0 10.3 22.1
Highest value: 22.1 Lowest value: -8.1
Range = Highest value – lowest value
= 22.1-(-8.1)
= 30.2
The Mean Deviation

Mean Deviation: The arithmetic mean of the absolute values of the deviations from the arithmetic mean.

The main features of the mean deviation are:
• All values are used in the calculation.
• It is not unduly influenced by large or small values.
• The absolute values are difficult to manipulate.

$$MD = \frac{\sum |X - \bar{X}|}{n}$$
The Mean Deviation: Example

The number of minutes travelled by five missiles to hit an object at the same location are:
103, 97, 101, 106, 103.
Find the mean deviation.

The mean is $\bar{X} = 102$. The mean deviation is:

$$MD = \frac{\sum |X - \bar{X}|}{n} = \frac{|103 - 102| + \dots + |103 - 102|}{5} = \frac{1 + 5 + 1 + 4 + 1}{5} = 2.4$$
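
The mean deviation of the missile data can be reproduced with a short Python sketch. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean deviation of an empty data set is undefined")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


minutes = [103, 97, 101, 106, 103]

print(mean_deviation(minutes))  # 2.4
```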
The Variance and Standard Deviation

Variance: the arithmetic mean of the squared deviations from the mean.

Standard deviation: the square root of the variance.
The Variance and Standard Deviation

Population Variance formula:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

X is an observed value in the population
µ is the arithmetic mean of the population
N is the number of observations in the population

Population Standard Deviation formula:

$$\sigma = \sqrt{\sigma^2}$$
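The Variance and Standard Deviation: Example

The variance and standard deviation of the 25 minimum temperatures from the range example (mean µ = 6.62) are:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \dots + (22.1 - 6.62)^2}{25} = 42.227$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{42.227} = 6.498$$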
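
The range, population variance and population standard deviation of the 25 minimum temperatures can be checked with the Python sketch below. The function names are illustrative, and the results are rounded for display because the code uses the unrounded mean rather than 6.62.

```python
import math

temperatures = [
    -8.1, -5.1, -3.1, -1.4, 1.2,
    3.2, 4.1, 4.6, 4.8, 5.7,
    5.9, 6.3, 7.9, 7.9, 8.0,
    8.1, 9.2, 9.5, 9.7, 10.3,
    12.3, 13.3, 14.0, 15.0, 22.1,
]


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def population_variance(values):
    """Mean of the squared deviations from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


variance = population_variance(temperatures)
std_dev = math.sqrt(variance)

print(round(data_range(temperatures), 1))  # 30.2
print(round(variance, 3))                  # 42.227
print(round(std_dev, 3))                   # 6.498
```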
The Variance and Standard Deviation

Sample variance (s²):

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Sample standard deviation (s):

$$s = \sqrt{s^2}$$
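
The sample formulas differ from the population formulas only in the divisor n − 1. A minimal Python sketch, applied here to the five missile travel times from the mean deviation example (treated as a sample for illustration); the function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed for the sample variance")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


minutes = [103, 97, 101, 106, 103]

print(sample_variance(minutes))           # 11.0
print(round(sample_std_dev(minutes), 3))  # 3.317
```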
The Variance and Standard Deviation: Example

The hourly wages earned by a sample of five students are:
$7, $5, $11, $8, $6.
Find the sample variance and standard deviation.

$$\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$$

$$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$$
The coefficient of variation
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the arithmetic mean. It is usually expressed as a percentage.

$$CV = \left(\frac{s}{\bar{X}}\right) \times 100\%$$

• Measures relative variation
• Shows variation relative to the mean
• Can be used to compare two or more sets of data measured in different units
The coefficient of variation

• Stock A:
– Average price last year = $50
– Standard deviation = $5

$$CV_A = \left(\frac{s}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$

• Stock B:
– Average price last year = $100
– Standard deviation = $5

$$CV_B = \left(\frac{s}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
The Measures of Position
Measures of Position: These measures describe the location (position) of a particular value in a given distribution of data. The position is described by quartiles, deciles and percentiles.

• Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each), separated by Q1, Q2 and Q3.
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
• Q2 is the same as the median (50% are smaller, 50% are larger).
• Only 25% of the observations are greater than the third quartile, Q3.
The Quartiles

The quartiles are found by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = (n+1)/4

Second quartile position: Q2 = (n+1)/2 (the median position)

Third quartile position: Q3 = 3(n+1)/4

Interquartile range: IQR = Q3 − Q1

Semi-interquartile range: Semi-IQR = (Q3 − Q1)/2

where n is the number of observed values.
The Deciles and Percentiles
Deciles: The deciles are the values that divide the
set of data into ten equal parts.

The fifth decile is the median

Percentiles divide the data into 100 equal parts.


• 25th percentile= 𝑄1
• 50th percentile=𝑄2
• 75th percentile=𝑄3
The Quartiles: Example
Find the first quartile

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,

so Q1 = 12.5

Q1 and Q3 are measures of noncentral location


Q2 = median, a measure of central tendency
The Quartiles: Example
Find the other quartiles
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5

Q2 is in the (9+1)/2 = 5th position of the ranked data,


so Q2 = median = 16

Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,


so Q3 = 19.5

IQR= 19.5-12.5=7 semi-IQR=7/2=3.5
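
The position rules above can be sketched in Python. The example only says to take the value half way between neighbours when the position ends in .5; the sketch assumes linear interpolation for any fractional position, which agrees with that rule. The names `value_at` and `quartiles` are illustrative.

```python
def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring ranked values.
    """
    whole = int(position)
    fraction = position - whole
    if whole < 1 or whole > len(ordered) or (fraction > 0 and whole == len(ordered)):
        raise ValueError("position falls outside the data")
    value = ordered[whole - 1]
    if fraction > 0:
        value += fraction * (ordered[whole] - ordered[whole - 1])
    return value


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at(ordered, (n + 1) / 4)
    q2 = value_at(ordered, (n + 1) / 2)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3


q1, q2, q3 = quartiles([11, 12, 13, 16, 16, 17, 18, 21, 22])
iqr = q3 - q1

print(q1, q2, q3)    # 12.5 16 19.5
print(iqr, iqr / 2)  # 7.0 3.5
```

Statistical software often uses other position conventions, so quartiles from a library may differ slightly from these hand-calculated values.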


Measures of position for grouped data
Measures of position for grouped data can be determined by:
• Interpolation method

• The formula for the kth percentile for grouped data is given by

$$P_k = l_k + \frac{C_k}{f_k}\,(kn - F_k)$$

where
• k = the percentile expressed as a proportion (0.1, 0.2, …)
• l_k = the lower class boundary of the class in which the kth percentile lies
• C_k = the class width of the kth percentile class
• F_k = the cumulative frequency just before the kth percentile class
• f_k = the frequency of the kth percentile class
• n = the total number of observations
Example

• Find the 10th, 45th and 90th percentiles of the data.

Solution

The kth percentile can be calculated by

$$P_{0.1} = l_{0.1} + \frac{C_{0.1}}{f_{0.1}}\,(0.1 \times 40 - F_{0.1}) = 126.5 + \frac{9}{5}\,(0.1 \times 40 - 3) = 128.3$$

$$P_{0.45} = l_{0.45} + \frac{C_{0.45}}{f_{0.45}}\,(0.45 \times 40 - F_{0.45}) = 144.5 + \frac{9}{12}\,(0.45 \times 40 - 17) = 145.25$$

$$P_{0.9} = l_{0.9} + \frac{C_{0.9}}{f_{0.9}}\,(0.9 \times 40 - F_{0.9}) = 162.5 + \frac{9}{4}\,(0.9 \times 40 - 34) = 167$$
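
The interpolation formula can be evaluated with the values quoted in the solution (n = 40, class width 9). The sketch below takes those values as inputs rather than reading them from a frequency table; the function name and parameter names are illustrative.

```python
def grouped_percentile(k, n, lower_boundary, class_width, cum_freq_before, class_freq):
    """kth percentile of grouped data: l_k + (C_k / f_k) * (k*n - F_k)."""
    if class_freq <= 0:
        raise ValueError("the percentile class must have a positive frequency")
    return lower_boundary + class_width / class_freq * (k * n - cum_freq_before)


p10 = grouped_percentile(0.10, 40, 126.5, 9, 3, 5)
p45 = grouped_percentile(0.45, 40, 144.5, 9, 17, 12)
p90 = grouped_percentile(0.90, 40, 162.5, 9, 34, 4)

print(round(p10, 2))  # 128.3
print(round(p45, 2))  # 145.25
print(round(p90, 2))  # 167.0
```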
The Measure of Shape
Measures of shape determine whether the distribution of data exhibits a symmetric pattern or stretches out in a particular direction. Two such measures of shape are:
• Skewness
• Kurtosis/Peakedness
Skewness

Skewness describes the degree of symmetry or asymmetry of a distribution. A distribution may be symmetrical, positively skewed or negatively skewed.

$$\text{Coefficient of skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
Skewness

Symmetric distribution: A distribution having the same shape on either side of the center.

Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution. It can be negatively or positively skewed.


Skewness

Zero skewness: Mean = Median = Mode

[Figure: The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution]
Skewness

Negatively skewed: Mean and median are to the left of the mode.

Mean < Median < Mode

[Figure: The Relative Positions of the Mean, Median, and Mode: Left Skewed Distribution]
Skewness

• Positively skewed: Mean and median are to the right of the mode.

Mean > Median > Mode

[Figure: The Relative Positions of the Mean, Median, and Mode: Right Skewed Distribution]
Skewness

Example

The lengths of stay on the cancer floor of a hospital were organised into a
frequency distribution. The mean length of stay was 28 days, the median,
25 days and mode, 23 days. The standard deviation was computed to be
4.2 days. Compute the skewness.
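
One way to work this example is to apply the coefficient of skewness, 3(mean − median)/standard deviation, directly. The Python sketch below does this; the function name `skewness_coefficient` is illustrative.

```python
def skewness_coefficient(mean, median, std_dev):
    """Coefficient of skewness: 3 * (mean - median) / standard deviation."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return 3 * (mean - median) / std_dev


sk = skewness_coefficient(mean=28, median=25, std_dev=4.2)

print(round(sk, 2))  # 2.14
```

The value is positive, which is consistent with mean > median > mode (28 > 25 > 23): the lengths of stay are positively (right) skewed.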
Kurtosis/Peakedness

The degree of peakedness or kurtosis of a distribution is described by the coefficient of kurtosis, k.

• If k = 3, the distribution has the same peakedness as the normal distribution.
• If k < 3, the distribution is flatter at the centre than the normal distribution.
• If k > 3, the distribution is more peaked than the normal distribution.
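
The slides do not state a formula for the coefficient of kurtosis. The sketch below assumes the moment coefficient k = m4 / m2², where m2 and m4 are the second and fourth moments about the mean; this choice equals 3 for a normal distribution and so matches the thresholds above. The function name is illustrative, and the data are the five missile travel times from the mean deviation example.

```python
def kurtosis_coefficient(values):
    """Moment coefficient of kurtosis, m4 / m2**2 (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second moment about the mean
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth moment about the mean
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


minutes = [103, 97, 101, 106, 103]
k = kurtosis_coefficient(minutes)

print(round(k, 3))  # 2.283
```

Since k < 3 here, this small data set is flatter than the normal distribution under the assumed definition.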


Kurtosis/Peakedness

[Figure: Graphs of Distributions indicating their Peakedness]


Assignment

The data below show the age distribution of cases of malaria reported during a year at a hospital.
34 17 25 37 19 19 27 19 44
24 24 22 32 12 13 16 18 14
12 16 14 17 10 16 22 20 15
15 10 10 14 17 20 18 13 32
13 13 18 30 24 34 44 31 43
40 28 31 15 22 15 31 18 27
35 35 20 32 38 32
(i) Organize the data into a frequency distribution table.
(ii) Calculate the coefficient of skewness and kurtosis and
interpret your results.
Interpretation and uses of Standard Deviation

Empirical Rule: For any symmetrical, bell-shaped distribution:

• About 68% of the observations will lie within 1s of the mean.
• About 95% of the observations will lie within 2s of the mean.
• Virtually all the observations will be within 3s of the mean.
Interpretation and uses of Standard Deviation

[Figure: Bell-shaped curve showing the relationship between σ and µ: 68% of the observations lie within µ ± 1σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ.]
Assignment
The data represent the recorded high temperature from a
sensor for 50 districts in Ghana.

112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
110, 118, 117, 116, 118, 112, 114, 114, 105, 109,
107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
120, 113, 120, 117, 105, 110, 118, 112, 114, 114.

Calculate the coefficient of skewness for the data and sketch the shape of its distribution.
Exploratory Data Analysis (EDA)

Box-and-Whisker Plot:
A graphical display of data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum

[Figure: Example box-and-whisker plot marking the Minimum, 1st Quartile, Median, 3rd Quartile and Maximum, with 25% of the data in each of the four segments]
Box-and-whisker plot

• The box and central line are centered between the endpoints if the data are symmetric around the median.

[Figure: Symmetric box-and-whisker plot marking Min, Q1, Median, Q3 and Max]

• A box-and-whisker plot can be shown in either vertical or horizontal format.
Box-and-whisker plot: Example
• Below is a box-and-whisker plot for the following data:

0 2 2 2 3 3 4 5 5 10 27

[Figure: Box-and-whisker plot with Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27]

• The data are right skewed, as the plot depicts.
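
The five-number summary behind this plot can be computed with the (n+1)-based quartile positions used earlier. The sketch below is illustrative: the function name is not from the lecture, and the final comparison of the two halves of the box and the two whiskers is a rough heuristic assumed here for reading skewness off the summary, not a formal test.

```python
def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using (n+1)-based positions."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("at least three values are needed")

    def at(position):
        # 1-based, possibly fractional position; interpolate between neighbours.
        whole = int(position)
        fraction = position - whole
        value = ordered[whole - 1]
        if fraction > 0:
            value += fraction * (ordered[whole] - ordered[whole - 1])
        return value

    return (ordered[0], at((n + 1) / 4), at((n + 1) / 2),
            at(3 * (n + 1) / 4), ordered[-1])


data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]
minimum, q1, q2, q3, maximum = five_number_summary(data)

print(minimum, q1, q2, q3, maximum)  # 0 2 3 5 27

# Heuristic (assumption): a longer upper half of the box and a longer upper
# whisker both point to a right-skewed distribution.
right_skewed = (q3 - q2) > (q2 - q1) and (maximum - q3) > (q1 - minimum)
print(right_skewed)  # True
```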
Distribution Shape and Box-and-whisker plot

[Figure: Box-and-whisker plots for Left-Skewed, Symmetric and Right-Skewed distributions, each marking Q1, Q2 and Q3]
