Lecture 2 & 3 - Numerical Presentation

The document discusses numerical presentation in applied statistics, focusing on summarizing datasets using central measures like mean, median, and mode, as well as absolute dispersion measures such as range, variance, and standard deviation. It highlights the importance of understanding outliers and their impact on statistical measures, along with methods to test for skewness and outliers. The document also provides examples and calculations to illustrate these concepts.

Uploaded by

Michael Yousry

Applied Statistics

Dr. Aya Ahmed


Assistant Professor of Econometrics Applied Statistics

Lecture Three
Numerical Presentation

The main goal is to summarize all the values in the given dataset in one or more values, so that looking at these values tells us what happened in the dataset.

What? How? When?


Numerical Presentation

            Population (Parameter)    Sample (Statistic)
Mean        \( \mu \)                 \( \bar{x} \)
Variance    \( \sigma^2 \)            \( s^2 \)
Size        \( N \)                   \( n \)
Numerical Presentation

Example: If you need to know your mark and I told you that your marks are
normally distributed, is that a clear answer to your question?
If I told you that the minimum mark is 80, is that a clear answer to your
question?
Central Measures

The main goal is to summarize all the values in one value around which the majority of the values lie.


Central Measures

Can one value represent all the values?

** When the dataset contains the same value throughout **
Mean Median Mode
Mean

What does it indicate?

It is the value at the center of the dataset, around which the majority of the values lie.
Mean

\( X_i \)    Value
\( X_1 \)    1 million
\( X_2 \)    2 million
\( X_3 \)    3 million

Mean = (Sum of the values) / (Count of the values)

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Mean

Mean = (Sum of the values) / (Count of the values)

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Profits in million $
92, 85, 88, 95

\( \bar{X} = \frac{92 + 85 + 88 + 95}{4} = 90 \) million $

Comment: the mean of the profits is 90 million $, which represents the value at the center of the dataset around which the majority of the values lie.
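
The calculation above can be checked with a few lines of Python. This is only a sketch: the names `profits` and `mean` are illustrative choices, not part of the lecture.

```python
# Profits in million $, taken from the slide.
profits = [92, 85, 88, 95]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


mean_profit = mean(profits)
print(f"Mean profit: {mean_profit} million $")
```

The result matches the 90 million $ computed by hand.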


Mean

If we add a company which has a profit of zero, the mean will be

\( \bar{X} = \frac{92 + 85 + 88 + 95 + 0}{5} = 72 \) million $

There is a big difference between 72 and zero.

If we add another company which has a profit of zero, the mean will be

\( \bar{X} = \frac{92 + 85 + 88 + 95 + 0 + 0}{6} = 60 \) million $

Can we depend on 60 to represent the data?
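
To see how strongly the zero profits pull the mean down, the three cases can be compared side by side. The sketch below uses Python's standard `statistics.mean`; the labels and variable names are assumptions made for illustration.

```python
from statistics import mean

profits = [92, 85, 88, 95]  # profits in million $

cases = {
    "original": profits,
    "one zero": profits + [0],
    "two zeros": profits + [0, 0],
}

# Mean of each case, keyed by its label.
means = {label: mean(data) for label, data in cases.items()}

for label, value in means.items():
    print(f"{label:>10}: mean = {value} million $")
```

Each added zero lowers the mean (90, then 72, then 60), although four of the companies still earn between 85 and 95 million $.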
Outlier

It is a value whose nature differs from that of the other values in the given dataset.

By removing the outlier:
1- The sample size becomes smaller
2- The estimates become less reliable
3- We do not only remove a value; we also remove a feature from the sample that is found in the population
Outlier
When can we remove the outlier?

Technical problem

Data entry mistake


Mean

Advantages
• Easy to calculate
• Easy to explain
• Takes all the values into the calculation

Disadvantages
• It is affected by outliers
Median

What does it indicate?

It is the value at 50% distance of the ordered dataset


Median

92 85 88 95 0
Step 1: put the values in order (smallest → largest)

0 85 88 92 95
Step 2: location of the median (odd sample size) – Case 1

\( \frac{n + 1}{2} = \frac{5 + 1}{2} = 3 \) (third value)

Step 3: value of the median = 88 million $

Comment: the median of the profits is 88 million $, which represents the value at 50% distance of the ordered dataset.
Median
92 85 88 95 0 400
Step 1: put the values in order (smallest → largest)

0 85 88 92 95 400
Step 2: location of the median (even sample size) – Case 2

\( \frac{n}{2} = \frac{6}{2} = 3 \) and \( \frac{n}{2} + 1 = \frac{6}{2} + 1 = 4 \)

Step 3: value of the median = (88+92)/2 = 90
Comment: the median of the profits is 90 million $, which represents the value at 50% distance of the ordered dataset.
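
The two cases (odd and even sample size) can be sketched in Python as follows. The function name `median` and its structure are illustrative assumptions; the steps mirror the ones on the slides.

```python
def median(values):
    """Median using the slide's steps: order the values, locate, then read off."""
    ordered = sorted(values)                 # Step 1: smallest -> largest
    n = len(ordered)
    if n % 2 == 1:
        # Case 1 (odd n): location (n + 1) / 2, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Case 2 (even n): average of the values at locations n/2 and n/2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_case = median([92, 85, 88, 95, 0])        # 5 values
even_case = median([92, 85, 88, 95, 0, 400])  # 6 values
print(odd_case, even_case)
```

The extreme values 0 and 400 barely move the result, which is why the median is described as less sensitive to outliers.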
Median

Advantages
• Easy to calculate
• Easy to explain
• It is less sensitive to outliers

Disadvantages
• It concentrates on the location more than the value
• It does not take all the values in the dataset into the calculation
• It is not applicable with qualitative data, especially if it is nominal
Mode

What does it indicate?

It is the most frequent / repeated value(s)


Mode

Grades
A D A B B A C A A C A
Mode: A
A D B A A F B A D B B
Mode: A & B
A F B C D
Mode: no mode
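
The three grade examples can be reproduced with a short Python sketch. The helper name `modes` is an assumption, and so is the convention it encodes: when no value repeats, it returns an empty list to stand for "no mode", as in the third example.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so nothing is "most frequent"
    return [value for value, count in counts.items() if count == highest]


one_mode = modes("A D A B B A C A A C A".split())
two_modes = modes("A D B A A F B A D B B".split())
no_mode = modes("A F B C D".split())
print(one_mode, two_modes, no_mode)
```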
Mode

Profits of Shark company in the last 6 weeks

0 0 500 120 125 36

Misleading value
Mode
Profits of Shark company in the last 6 weeks

10 20 500 120 125 36

Failed to provide you with a value


Mode

Advantages
• Easy to calculate
• Easy to explain
• It is applicable with qualitative data

Disadvantages
• It is not preferred with continuous variables due to:
    • Failure to estimate a value
    • Misleading values
Absolute Dispersion Measures

The main goal is to evaluate how far the values are from each other and how far they are from the center of the dataset. As a result, we can evaluate whether the values are homogeneous or heterogeneous.


Absolute Dispersion Measures

[Figure: the values 85 and 95 on either side of the center, 90 million]
Absolute Dispersion Measures

98 100 95 92 96 94

Case of homogeneity

85 74 93 20 100 0 94 52

Case of heterogeneity
Absolute Dispersion Measures

Range    Inter-quartile range    Variance and Standard Deviation
Range

What does it indicate?

It is the distance between the min. value and the max. value


Range

Profits in million $

92, 85, 88, 95


Range = Max. Value – Min. Value
= 95 – 85 = 10 million $

Comment: the range of the profits is 10 million $, which represents the distance between the min. profit (85 million $) and the max. profit (95 million $).
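
As a quick check, the range can be computed in Python. The variable names are illustrative.

```python
profits = [92, 85, 88, 95]  # profits in million $

lowest, highest = min(profits), max(profits)
value_range = highest - lowest  # Range = Max. value - Min. value

print(f"Range = {highest} - {lowest} = {value_range} million $")
```

Only two of the four values enter the calculation, so changing 88 or 92 would leave the range unchanged.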


Range

                     Company (1)    Company (2)
Range of salaries    30,000         30,000
Min. salary          10,000         20,000
Max. salary          40,000         50,000

• Meaningless until it is linked with the min. and max. values.
• Can't be used to compare two or more datasets.
• Affected by outliers.
Range

Advantages
• Easy to calculate
• Easy to explain
• It combines the tails of the dataset

Disadvantages
• It takes only two values into the calculation
• It does not provide us with the average distance around the mean
• It is affected by outliers
Variance and Standard Deviation

Average distance around the mean
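Variance and Standard Deviation

\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

\( x_i - \bar{x} \): deviations around the mean (not from the mean)

The deviations around the mean sum to zero: \( \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \)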
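Variance and Standard Deviation

\( x_i \)    \( \bar{x} \)    \( x_i - \bar{x} \)    \( (x_i - \bar{x})^2 \)
92    90     2     4
88    90    -2     4
95    90     5    25
85    90    -5    25
Sum               58

\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{58}{4 - 1} = 19.33 \]

Standard deviation (s) = \( \sqrt{\text{var}} = \sqrt{19.33} = 4.4 \) million $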
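
The table and the two formulas can be reproduced step by step in Python. This is a sketch with illustrative variable names; it uses the sample formula with n - 1 in the denominator, as on the slide.

```python
import math

profits = [92, 88, 95, 85]  # profits in million $

n = len(profits)
x_bar = sum(profits) / n

# One squared deviation per row of the table: (x_i - x_bar)^2
squared_devs = [(x - x_bar) ** 2 for x in profits]

variance = sum(squared_devs) / (n - 1)  # sample variance S^2
sd = math.sqrt(variance)                # standard deviation s

print(f"Sum of squared deviations: {sum(squared_devs)}")
print(f"Variance: {variance:.2f}")
print(f"SD: {sd:.1f} million $")
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 denominator and can serve as a cross-check.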
Variance and Standard Deviation

Mean = 90 million $
SD = 4.4 million $

90 – 4.4 = 85.6 million $        90 + 4.4 = 94.4 million $
Variance and Standard Deviation

Comment:

- The SD of the profits is 4.4 million $, which represents the average distance around the mean profit (90 million $).
- As a result, the majority of the values range from 85.6 million $ to 94.4 million $ on average.
Variance and Standard Deviation

Advantages
• Easy to calculate
• Easy to explain
• It takes all values into the calculation

Disadvantages
• It is affected by outliers because the main component in its calculation is the mean, whose main drawback is being impacted by outliers
Inter Quartile Range (IQR)

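[Figure: the ordered dataset from the smallest value to the largest value, with the first quartile (Q1) at 25% and the third quartile (Q3) at 75%. The lowest 25% and the highest 25% lie outside the quartiles; the inter-quartile range is the range of the middle 50% of the ordered dataset.]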
Inter Quartile Range (IQR)

67, 72, 65, 77, 75, 70, 80, 82, 50, 112
Step 1: put the values in order from the smallest to the largest

50 65 67 70 72 75 77 80 82 112

Step 2: location of Q1 = ¼ (n + 1) = ¼ (10+1) = 2.75

Value of Q1 = Start + ratio * distance = 65 + .75 (67 – 65) = 66.5 million $

Comment: Q1 of profits is 66.5 million $ which represents the value at 25% distance of the ordered dataset.
Inter Quartile Range (IQR)

50 65 67 70 72 75 77 80 82 112

Step 3: location of Q3 = ¾ (n + 1) = ¾ (10+1) = 8.25


Value of Q3 = Start + ratio * distance = 80 + 0.25 (82 – 80) = 80.5 million $
Comment: Q3 of profits is 80.5 million $ which represents the value at 75% distance of the ordered dataset.
Step 4: IQR = Q3 – Q1 = 80.5 – 66.5 = 14 million $
Comment: the IQR of the profits is 14 million $, which represents the range of 50% distance of the ordered dataset after excluding the lowest and the highest 25% of the ordered dataset.
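
The four steps above can be sketched in Python. The helper names `value_at_position` and `quartiles` are assumptions; the logic follows the slides' rule of locating Q1 at ¼ (n + 1) and Q3 at ¾ (n + 1), then applying "start + ratio * distance". The sketch assumes at least three values so that both locations fall inside the dataset.

```python
import math


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position: start + ratio * distance."""
    whole = math.floor(position)
    ratio = position - whole
    start = ordered[whole - 1]
    if ratio == 0:
        return start
    following = ordered[whole]
    return start + ratio * (following - start)


def quartiles(values):
    """Q1 and Q3 located at (n + 1)/4 and 3(n + 1)/4 of the ordered data."""
    ordered = sorted(values)  # Step 1
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch needs at least three values")
    q1 = value_at_position(ordered, (n + 1) / 4)      # Step 2
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)  # Step 3
    return q1, q3


profits = [67, 72, 65, 77, 75, 70, 80, 82, 50, 112]
q1, q3 = quartiles(profits)
iqr = q3 - q1  # Step 4
print(q1, q3, iqr)
```

Other software may place the quartiles at slightly different positions, so results from a calculator or spreadsheet can differ a little from the hand calculation.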
Test of Outliers - Box Plot

[Figure: box plot showing Q1 and Q3 between the bounds LB and UB, with outliers marked * outside the bounds]

• Lower Bound (LB) = Q1 – 1.5 IQR

• Upper Bound (UB) = Q3 + 1.5 IQR
Test of Skewness

\[ \text{Skewness Coefficient (SC)} = \frac{3\,(\text{Mean} - \text{Median})}{\text{SD}} \]

• Symmetric: SC = 0 ± 0.5 (from -0.5 to +0.5)

• Positively skewed (skewed to the right): SC is greater than +0.5

• Negatively skewed (skewed to the left): SC is less than -0.5


Test for Outliers

• Yes (outliers found): Skewed → Median, IQR
• No (no outliers): Test of skewness
    • Symmetric → Mean, SD
    • Skewed → Median, IQR
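
The decision flow above can be written as a small function. This is a sketch of one reading of the flowchart: the name `choose_measures` and its parameters are hypothetical, and the ±0.5 cut-off is the one given for the skewness coefficient.

```python
def choose_measures(has_outliers, skewness_coefficient=None):
    """Pick the central and dispersion measures suggested by the flowchart."""
    if has_outliers:
        # Outliers found: treat the data as skewed.
        return ("Median", "IQR")
    if skewness_coefficient is None:
        raise ValueError("without outliers, the skewness coefficient is required")
    if -0.5 <= skewness_coefficient <= 0.5:
        return ("Mean", "SD")      # symmetric
    return ("Median", "IQR")       # skewed


print(choose_measures(has_outliers=True))
print(choose_measures(has_outliers=False, skewness_coefficient=0.2))
print(choose_measures(has_outliers=False, skewness_coefficient=-0.9))
```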
Example
a) Does this sample contain any extreme values? Justify your answer with a
suitable test.
Answer
Test for the outliers - Box Plot
Step 1: put the values in order from the smallest to the largest

50 65 67 70 72 75 77 80 82 112

Step 2: location of Q1 = ¼ (n + 1) = ¼ (10+1) = 2.75

Value of Q1 = Start + ratio * distance = 65 + 0.75 (67 – 65) = 66.5 million $


Example

[Figure: box plot with LB = 45.5 and UB = 101.5; the value 112 is marked with * beyond the upper bound]

Comment: ???
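
The box-plot test for this sample can be sketched in Python. The quartile values Q1 = 66.5 and Q3 = 80.5 are the ones obtained by hand for this dataset; the variable names are illustrative.

```python
data = [50, 65, 67, 70, 72, 75, 77, 80, 82, 112]  # profits in million $

q1, q3 = 66.5, 80.5  # quartiles taken from the hand calculation
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr  # LB = Q1 - 1.5 IQR
upper_bound = q3 + 1.5 * iqr  # UB = Q3 + 1.5 IQR

# A value is flagged when it falls outside [LB, UB].
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(f"LB = {lower_bound}, UB = {upper_bound}, outliers = {outliers}")
```

Only 112 lies outside the two bounds, which agrees with the value marked on the box plot.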
Example
b) According to your conclusion in part (a), calculate the best central and
the best absolute dispersion measure.
Answer
IQR = 14 million $

Comment: the IQR of the profits is 14 million $, which represents the range of 50% distance of the ordered dataset after excluding the lowest and the highest 25% of the ordered dataset.
Example
b) According to your conclusion in part (a), calculate the best central and the
best absolute dispersion measure.
Answer
Median
Step 1: put the values in order from the smallest to the largest

50 65 67 70 72 75 77 80 82 112
Step 2: location of the median (even sample size) – Case 2

\( \frac{n}{2} = \frac{10}{2} = 5 \) and \( \frac{n}{2} + 1 = \frac{10}{2} + 1 = 6 \)

Step 3: value of the median = (72+75)/2 = 73.5 million $
Example
c) Assuming that the outlier(s) are not found, what would be the best central measure?
Answer
After removing 112

Median
Step 1: put the values in order from the smallest to the largest

50 65 67 70 72 75 77 80 82
Step 2: location of the median (odd sample size)

\( \frac{n + 1}{2} = \frac{9 + 1}{2} = 5 \)

Step 3: value of the median = 72 million $

\( x_i \)    \( \bar{x} \)    \( x_i - \bar{x} \)    \( (x_i - \bar{x})^2 \)
50    70.89    -20.89    436.35
65    70.89     -5.89     34.68
67    70.89     -3.89     15.12
70    70.89     -0.89      0.79
72    70.89      1.11      1.23
75    70.89      4.11     16.90
77    70.89      6.11     37.35
80    70.89      9.11     83.01
82    70.89     11.11    123.46


Example

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = 70.89 \]

\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{748.89}{9 - 1} = 93.61 \]

Standard deviation (s) = \( \sqrt{\text{var}} = \sqrt{93.61} = 9.68 \) million $
Example

\[ \text{Skewness Coefficient} = \frac{3\,(\text{Mean} - \text{Median})}{\text{SD}} = \frac{3\,(70.89 - 72)}{9.68} = -0.34 \]
Comment: ???
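
The skewness test for the nine remaining values can be sketched with Python's standard `statistics` module. The variable names are illustrative, and the ±0.5 cut-offs are the ones stated for the skewness coefficient.

```python
import statistics

data = [50, 65, 67, 70, 72, 75, 77, 80, 82]  # sample after removing 112

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)  # sample SD (n - 1 in the denominator)

sc = 3 * (mean - median) / sd

if sc > 0.5:
    shape = "positively skewed"
elif sc < -0.5:
    shape = "negatively skewed"
else:
    shape = "symmetric"

print(f"SC = {sc:.2f} -> {shape}")
```

The coefficient comes out at about -0.34, which falls inside the range from -0.5 to +0.5.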
Coefficient of Variation
Can be used to compare the variability of two or more sets of
data measured in different units.

\[ CV = \left( \frac{S}{\bar{X}} \right) \times 100\% \]

Rule: the lower the CV, the higher the level of homogeneity.
Coefficient of Variation

Question Two: The prices of stock A and stock B were recorded over several months as
follows.
Stock A: 10 10 12 10 11 11 10 11 10 9
Stock B: 9 10 12 7 10 16 10 15 10
The standard deviation of stock A is 0.843. The variance of stock B is
8.933 and its mean is 10.6.
Which stock would you prefer to buy, and why? Comment on the results.
Stock A

\[ CV = \left( \frac{S}{\bar{X}} \right) \times 100\% \]

\[ \bar{X} = \frac{10 + 10 + 12 + 10 + 11 + 11 + 10 + 11 + 10 + 9}{10} = 10.4 \]

\[ C.V_A = \frac{0.843}{10.4} \times 100 \approx 8.11\% \]
Stock B

\[ CV = \left( \frac{S}{\bar{X}} \right) \times 100\% \]

\[ S = \sqrt{8.933} \approx 2.99 \]

\[ C.V_B = \frac{2.99}{10.6} \times 100 \approx 28.2\% \]
Comment

Since \( C.V_A < C.V_B \), the prices of stock A are more homogeneous than the prices of stock B. As a result, we would prefer to buy stock A.
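
The comparison can be checked in Python using only the summary figures given in the question (SD of stock A, variance and mean of stock B) together with the mean of stock A computed above. The function name `coefficient_of_variation` is an illustrative choice.

```python
import math


def coefficient_of_variation(sd, mean):
    """CV = (S / X-bar) * 100, expressed as a percentage."""
    return sd / mean * 100


cv_a = coefficient_of_variation(sd=0.843, mean=10.4)
cv_b = coefficient_of_variation(sd=math.sqrt(8.933), mean=10.6)

# Rule from the lecture: the lower CV marks the more homogeneous dataset.
preferred = "A" if cv_a < cv_b else "B"

print(f"CV(A) = {cv_a:.2f}%, CV(B) = {cv_b:.2f}%, preferred stock: {preferred}")
```

Because the CV is a ratio, it has no unit, which is what allows datasets measured in different units to be compared.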
Thank you

See you next lecture
