1 Advanced Statistics
BASIC STATISTICS
Advanced Statistics
Mr. Bornia (C) Copyright (2020-2021) All Rights Reserved
We now briefly define some key terms. These definitions will be further elaborated throughout the rest of the website.
Data and data sets: observations from the
environment.
Independent random variable: a variable
that is chosen, and then measured or
manipulated, by the researcher in order to study
some observed behavior.
REVIEW OF STATISTICAL CONCEPTS
Variable: An attribute that can take on different values. Variables are the characteristics of a sample to which statistical analysis is applied. Variables can be categorical or numerical. For example, a person's religion would be a categorical variable, whereas that person's disposable income would be a numerical variable.
Parameter: A numerical measure describing a particular characteristic of a population. Since we typically do not study entire populations, parameters are often unobservable and must be estimated. An unbiased parameter estimate is one whose expected value equals the true population parameter.
Statistic: A numerical measure that describes some property of a sample. A statistic is obtained from a sample, and we hope that it is close to the value we would obtain if we could compute the same measure from the whole population.
Survey data: Data collected from the responses of a group of participants.
Frame data: Data collected using a pre-specified list establishing the guidelines that will be used in assembling the sample from the population. Frames should be selected so that the resulting sample will represent the population.
Bar chart: A chart made from categorical data in which the heights of bars represent the frequency (or relative frequency, i.e. percent) of membership in each value of the variable. Unlike a histogram, the width of the bars carries no meaning.
Histogram: A graph made from quantitative data in which the range of the data is divided into intervals called bins, and then bars are constructed above each bin such that the heights of the bars represent the frequency or relative frequency of data in the particular bin. Unlike a bar chart, the width of the bars is an important characteristic of the graph.
Box and whisker plot: A plot that incorporates the median and upper and lower quartiles to graphically display the data range. It is also particularly useful for displaying outliers when they are present in the data.
Time series plot: The plot of a specified variable over time.
Cross-sectional data: Data compared at one point in time. Comparisons can be intra-data or with a benchmark data point.
Probability: The mathematical likelihood that a particular outcome will occur.
Probability distribution: A scaling of possible event outcomes based upon their likelihood (probability) of occurring and described by a probability function.
Discrete probability distribution: A probability distribution in which the variable can take only certain values in any particular interval (such as only whole-number values, for example).
Continuous probability distribution: A probability distribution in which the variable can take any value within the range of possible values.
Symmetrical probability distribution: A probability function that has a vertical line of symmetry creating left/right mirror images. The most well-known example is the bell-shaped Normal distribution, which is fully described by its mean and standard deviation.
Left-skewed probability distribution: A set of data values in which the mean is generally less than the median. The left tail of the distribution is longer than the right tail of the distribution.
Right-skewed probability distribution: A set of data values in which the mean is generally greater than the median. The right tail of the distribution is longer than the left tail of the distribution.
Central Limit Theorem: The statistical law that states that, regardless of the shape of the distribution of the individual values in the population, as the sample size gets larger, the sampling distribution of the mean can be approximated by a normal distribution.
Degree of freedom: The number of independent data values available to estimate the population's standard deviation. The degrees of freedom equal the number of observations in the sample (N) minus the number of parameters to be estimated (K).
Student's t-distribution: A family of curves, each of which is a symmetrical bell-shaped distribution that has greater area in the tails than the normal probability distribution. Each distribution is defined by its degrees of freedom. As the degrees of freedom increase, the t-distribution approaches the normal distribution.
Z-value: A statistic generated from a normal probability distribution. It is a standardized value in that it divides the difference between an observation and the mean value by the standard deviation of the observations.
Pie chart: A circular graph where wedge-shaped slices represent proportions of the total.
Pareto chart: A bar chart that displays the count of each item as a number or percentage, with the bars in descending order from left to right. The accompanying cumulative-percentage line sums to 100%.
Frequency: The number or percent occurrence of a particular outcome out of N trials.
Frequency table: A grouping of data into mutually exclusive classes showing the number of observations in each class. Relative frequency classes are derived from a frequency table by computing the percentage of the total observations made up by each class.
Joint frequency distribution: A table consisting of paired responses for two variables.
Scatter diagram: A graph that plots the coordinates from two series of data points. In a typical scatter diagram the X axis (the horizontal axis) represents the units of one variable while the Y axis (the vertical axis) represents the units of the second variable. Scatter diagrams can reveal patterns among data.
Mean: A measure of central tendency. It is computed by summing all data values and dividing by the number of data values summed. In this context the mean (average) is an ex post number: it is computed after the fact. If the observations include all the values in a population, the average is referred to as a population mean. If the values used in the computation only include those from a sample, the result is referred to as a sample mean.
Expected mean: A measure of central tendency. All data values are weighted by their probability of
occurring and then summed. The expected mean is an ex ante calculation (sometimes referred to as a
weighted mean where the probabilities are the weights). The expected mean can be from a population or
from a sample. Typically, it is computed from a sample. The expected mean is also referred to as
an expected value.
Median: The center value that divides the ordered data array into two halves. The median is not affected by extreme observation values in the data set.
Mode: The value in the data set that occurs most frequently. Some data sets may have more than one
mode if two different values tie for the most frequently occurring value. For example, a distribution of values
may be bi-modal in nature.
Population variance: The sum of the squared differences of the data values from the population mean, divided by the number of observations (N). It is the average squared deviation from the mean.
Sample variance: The sum of the squared differences of the data values from the sample mean, divided by the number of observations (N) minus 1. We divide by N – 1 because dividing by N would systematically underestimate the population variance; the bias is most noticeable when the number of observations is small.
Sample standard deviation: The square root of the sample variance. The sample standard deviation
represents the typical distance from the mean to an observation in the data. The sample standard deviation
is a measure of risk.
Sample coefficient of variation: The ratio obtained by dividing the sample standard deviation by the
sample mean. This calculation is useful when two different data sets have different means and standard
deviations. For two independent data sets we typically choose the data set with the lower coefficient of
variation—less variation per unit of expected value.
Point estimate: A single statistic (number) that is determined from a sample. It is used to estimate the
corresponding population parameter.
Sampling error: The difference between a sample statistic and the corresponding population parameter that occurs due to random chance, because only a sample is observed.
Confidence interval: An interval computed from a sample that is expected to contain the population parameter with a given level of confidence.
Alternative hypothesis: The hypothesis the researcher accepts when the test result leads to rejecting the null hypothesis at a pre-specified level of confidence. The null and alternative hypotheses are mutually exclusive states.
DESCRIPTIVE STATISTICS
Measures of Location
Averages
Averages can be tricky.
Consider:
Rate of Return
Year 1    Year 2    Year 3    Year 4    Year 5
7%        10%       12%       30%       15%
The correct average is the value x which, when compounded for 5 years, gives the same result as the observed compounding rates; in other words, the solution to the equation:
\[ (1+x)^5 = (1.07)(1.10)(1.12)(1.30)(1.15) = 1.9707688 \]
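The equation can be solved directly by taking the fifth root. The following is a minimal Python sketch, assuming the five annual rates are the ones implied by the compounding factors in the equation; the variable names are illustrative only.

```python
import math

# Annual rates of return, read off the compounding factors in the equation.
rates = [0.07, 0.10, 0.12, 0.30, 0.15]

# Total growth factor over the five years: the product of (1 + rate).
growth = math.prod(1 + r for r in rates)

# The compound (geometric) average rate x solves (1 + x)^5 = growth.
geometric = growth ** (1 / len(rates)) - 1

# The arithmetic average of the rates, for comparison.
arithmetic = sum(rates) / len(rates)

print(f"total growth factor: {growth:.7f}")
print(f"geometric average  : {geometric:.4%}")
print(f"arithmetic average : {arithmetic:.4%}")
```

The compound average comes out slightly below the arithmetic average of the five rates, which is why the plain average is the wrong summary for compounding returns.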
Consider:
Dallas and Fort Worth are approximately 30 miles apart. On a round trip from Dallas to Fort Worth and back, you average 30 mph on the first leg from Dallas to Fort Worth. How fast do you have to travel on the return leg from Fort Worth to Dallas so that you average 60 mph for the round trip?
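One way to explore this puzzle is to work with travel times rather than speeds. The sketch below (names are illustrative) computes how much time the target average leaves for the return leg, and shows the round-trip average for a few return speeds.

```python
distance_one_way = 30.0   # miles, Dallas to Fort Worth
speed_out = 30.0          # mph on the first leg
target_average = 60.0     # mph wanted for the whole round trip

# Time the whole 60-mile round trip may take at the target average.
total_time_allowed = 2 * distance_one_way / target_average   # hours
# Time already used on the first leg.
time_out = distance_one_way / speed_out                      # hours
# Time left for the return leg.
time_left = total_time_allowed - time_out


def round_trip_average(out_mph, back_mph):
    """Average speed over two legs of equal length (harmonic mean of the speeds)."""
    return 2 / (1 / out_mph + 1 / back_mph)


print(f"time left for the return leg: {time_left} hours")
for back in (60, 120, 1_000, 1_000_000):
    print(f"return at {back:>9} mph -> average {round_trip_average(speed_out, back):.3f} mph")
```

The first leg already uses the full hour that a 60 mph round trip allows, so no finite return speed reaches the target; the average only approaches 60 mph from below.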
The Arithmetic Average
The arithmetic average of a set of values is the sum of the values divided by the number of values. If x_1, x_2, ..., x_n represent the n numerical values from a random sample, then the formula for the sample mean is:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
To find the average (when I use this term subsequently, I will mean the arithmetic average) using EXCEL, one uses the function “average”. It is used just like the “median” function.
Specifically, one types “=average(range of data)”. For the data on steel thickness, you would have something that looks like the below:
By closing the parentheses, you get the average for the data as 354.55.
Computation of the Arithmetic Mean
From Grouped Data
If we do not have the raw data but only the frequency distribution of the data, the formula for the sample mean becomes:
\[ \bar{x} = \frac{1}{n} \sum_{i} f_i m_i \]
where m_i is the midpoint of interval i and f_i is its frequency.
EXCEL does not compute this formula directly. To compute this in EXCEL for the
steel thickness data, one can use the following procedure:
Interval    Midpoint m(i)    Freq f(i)    f(i)*m(i)
Total                        60           21276
Average                                   354.6
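The same procedure can be sketched in Python. The class intervals below are hypothetical and purely illustrative (they are not the steel thickness data); only the totals 60 and 21276 come from the table.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
# Illustrative classes only, NOT the steel thickness intervals.
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

# n is the total frequency.
n = sum(freq for _, _, freq in classes)

# Sum of f(i) * m(i), where m(i) is the midpoint of each interval.
weighted_sum = sum(freq * (low + high) / 2 for low, high, freq in classes)

grouped_mean = weighted_sum / n
print(f"n = {n}, sum f*m = {weighted_sum}, grouped mean = {grouped_mean}")

# The totals quoted for the steel data give the tabulated average.
steel_mean = 21276 / 60
print(f"steel thickness grouped mean = {steel_mean:.1f}")
```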
Computation with the Average
Consider the problem of having two groups of people: 50 people in Group 1 with an average hourly wage of $15.00 and 100 people in Group 2 with an average hourly wage of $17.00. Can I find the mean of the pooled group of 150 people?
The average of the pooled group is just the total hourly wages of all 150 people divided by the 150 people. Using the formula for the arithmetic average, one can show that:
\[ n\bar{x} = \sum_{i} x_i \]
Therefore, the sum of the hourly wages in the first group is 50 x 15 = 750. The sum of the hourly wages in the second group is 100 x 17 = 1700. Finally, the mean of the pooled group is:
\[ \bar{x}_{pooled} = \frac{\sum_{i} n_i \bar{x}_i}{\sum_{i} n_i} = \frac{750 + 1700}{150} \approx 16.33 \]
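The pooled-mean calculation can be checked with a few lines of Python. This is a small sketch using the two wage groups from the text; the variable names are illustrative.

```python
group_sizes = [50, 100]        # people in Group 1 and Group 2
group_means = [15.00, 17.00]   # average hourly wage in each group

# n_i * mean_i recovers the total hourly wages of each group.
group_totals = [n * m for n, m in zip(group_sizes, group_means)]

# Pooled mean: total wages divided by total head count.
pooled_mean = sum(group_totals) / sum(group_sizes)

# For contrast: the unweighted average of the two group means.
simple_mean = sum(group_means) / len(group_means)

print(f"group totals : {group_totals}")
print(f"pooled mean  : {pooled_mean:.2f}")
print(f"simple mean  : {simple_mean:.2f}")
```

The pooled mean sits closer to the Group 2 average because Group 2 has twice as many people; simply averaging the two group means would ignore the group sizes.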
x      y      y - x
5      4      -1
10     12     2
15     18     3
20     19     -1
25     23     -2
Notice that the change in the means (15.2 - 15 = 0.2) is the same as the mean of the changes (0.2).
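This property follows from the linearity of the mean and is easy to confirm numerically. A short sketch using the five pairs of values from the table (the list names are illustrative):

```python
first = [5, 10, 15, 20, 25]
second = [4, 12, 18, 19, 23]

# Pairwise changes (second minus first), matching the third column.
changes = [b - a for a, b in zip(first, second)]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


change_in_means = mean(second) - mean(first)
mean_of_changes = mean(changes)

print(f"changes         : {changes}")
print(f"change in means : {change_in_means:.2f}")
print(f"mean of changes : {mean_of_changes:.2f}")
```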
Simpson’s Paradox
Consider the following data found in the file “meandemo.xls”:
             Males    Male Average    Females    Female Average    Difference
             30                       31
Group 1      35       35              32         32                -3
             48                       75
             14                       60
Group 2      85       85              83         83                -2
             98                       85
             60                       61
Group 3      63       63              62         62                -1
             65                       98
All Groups            60                         62                2
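In the table, the female average is below the male average within every group, yet the female average over all groups is higher. The reversal is driven by how the individuals are distributed across the groups. The sketch below reuses the group averages from the table but assumes hypothetical group sizes (they are not given in the source), so the overall averages it prints are illustrative and are not the 60 and 62 shown above.

```python
# Group averages from the table: (male average, female average).
group_means = {"Group 1": (35, 32), "Group 2": (85, 83), "Group 3": (63, 62)}

# HYPOTHETICAL group sizes, chosen only to illustrate the effect:
# (number of males, number of females).
group_sizes = {"Group 1": (3, 1), "Group 2": (2, 4), "Group 3": (3, 3)}


def overall_average(index):
    """Size-weighted average over all groups; index 0 = males, 1 = females."""
    total = sum(group_means[g][index] * group_sizes[g][index] for g in group_means)
    count = sum(group_sizes[g][index] for g in group_means)
    return total / count


male_overall = overall_average(0)
female_overall = overall_average(1)

for group, (male, female) in group_means.items():
    print(f"{group}: female - male = {female - male}")
print(f"All groups: female - male = {female_overall - male_overall:.2f}")
```

With these assumed sizes most females fall in the high-scoring Group 2 while most males fall in the other groups, so the pooled female average exceeds the pooled male average even though females trail in every group.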
Measures of Scale
The simplest way to measure scale would be to find the average distance of each datapoint from the measure of location (in our case the arithmetic mean). However, the signed deviations from the mean always cancel out. Symbolically this can be written:
\[ \sum_{i} (x_i - \bar{x}) \equiv 0 \]
The fact that some deviations are positive and some negative can be corrected in
one of two ways:
1) Use the absolute value to compute the mean absolute deviation (MAD), which in formula terms is:
\[ MAD = \frac{1}{n} \sum_{i} |x_i - \bar{x}| \]
or 2) Use the square of the deviations, which in formula terms gives:
\[ S^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n-1} \]
and,
\[ s = \sqrt{S^2} \]
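Both corrections can be sketched in Python. The data values below are hypothetical and chosen only to keep the arithmetic simple; `statistics.stdev` from the standard library uses the same n - 1 divisor as the formula for s.

```python
import math
import statistics

# Hypothetical sample, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Signed deviations from the mean cancel out.
sum_of_deviations = sum(x - mean for x in data)

# 1) Mean absolute deviation.
mad = sum(abs(x - mean) for x in data) / n

# 2) Sample variance and standard deviation with the n - 1 divisor.
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
s = math.sqrt(sample_variance)

print(f"sum of deviations : {sum_of_deviations}")
print(f"MAD               : {mad}")
print(f"s (by formula)    : {s:.4f}")
print(f"s (statistics)    : {statistics.stdev(data):.4f}")
```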
In EXCEL, the function “stdev” uses the above formula for computing the sample
standard deviation:
For the steel thickness data, you would type “=stdev(range)” as shown below:
EXCEL does not automatically compute the standard deviation if the data is grouped. The computing formula to use in this case is given by:
\[ S^2 = \frac{\sum_{i} f_i m_i^2 - n\bar{x}^2}{n-1} \]
and then taking the square root.
The necessary terms can be computed in EXCEL as shown in
the following table for the steel data:
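Interval    Midpoint m(i)    Freq f(i)    f(i)*m(i)    f(i)*m(i)*m(i)
For large n, writing p_i = f_i / n for the relative frequency of interval i, the grouped formula is approximately:
\[ S^2 \simeq \sum_{i} p_i m_i^2 - \bar{x}^2 \]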
The standard deviation for data following a theoretical distribution with density function f(x) can also be defined through the variance:
\[ \sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 \]
and,
\[ \sigma = \sqrt{\sigma^2} \]
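As a concrete illustration of the integral definition, the sketch below evaluates it numerically for a uniform density on [0, 1]. The choice of distribution and the simple midpoint-rule integrator are assumptions made for the example; the known exact variance of this distribution is 1/12.

```python
def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (k + 0.5) * width) for k in range(steps)) * width


def density(x):
    """Uniform density on [0, 1]; it is zero elsewhere, so we integrate over [0, 1] only."""
    return 1.0


# mu is the integral of x * f(x).
mu = integrate(lambda x: x * density(x), 0.0, 1.0)

# Second moment: the integral of x^2 * f(x).
second_moment = integrate(lambda x: x * x * density(x), 0.0, 1.0)

variance = second_moment - mu ** 2
sigma = variance ** 0.5

print(f"mu       = {mu:.6f}")
print(f"variance = {variance:.6f}  (exact value 1/12 = {1 / 12:.6f})")
print(f"sigma    = {sigma:.6f}")
```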
MEASURES OF DISPERSION
Measures of Dispersion
Dispersion:
Dispersion means the scattering of the observations among themselves or from a central value (mean, median or mode) of the data. We study dispersion to get an idea of the variation in a set of observations.
There are two types of measures of dispersion. Absolute measures give the answer in the same units as the original observations; relative measures are ratios and are free of units.
1. Absolute measures of dispersion
a) Range
b) Mean deviation
c) Standard deviation
d) Quartile deviation
2. Relative measures of dispersion
a) Coefficient of range
b) Coefficient of mean deviation
c) Coefficient of variation
d) Coefficient of quartile deviation
Range:
The range is the difference between the largest and the smallest values in a set of data.
Range = largest value − smallest value.
Standard deviation:
The standard deviation is the most-used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. In
general, a lower value of the standard deviation for a data set indicates that the values of that
data set are spread over a relatively smaller range around the mean. In contrast, a larger
value of the standard deviation for a data set indicates that the values of that data set are
spread over a relatively larger range around the mean.
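It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given values from the arithmetic mean (for a sample, the sum of squares is divided by n − 1 rather than n). The square of the standard deviation is called the variance.
If x_1, x_2, x_3, ..., x_n are a set of n observations, then the standard deviation is given by
\[ S.D. = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
or, equivalently,
\[ S.D. = s = \sqrt{\frac{1}{n-1}\left\{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right\}} \]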
Example:
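Calculate the S.D. and variance for the following data.
5 6 7 7 9 4 5
Solution:
\[ S.D. = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
\[ \bar{x} = \frac{5 + 6 + \dots + 4 + 5}{7} = 6.143 \]
\[ \therefore S.D. = \sqrt{\frac{(5-6.143)^2 + (6-6.143)^2 + \dots + (5-6.143)^2}{7-1}} = \sqrt{\frac{16.857}{6}} = 1.68 \]
The variance is s^2 = 16.857 / 6 ≈ 2.81.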
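The hand calculation can be cross-checked in Python. This sketch uses the standard-library `statistics` module, whose `stdev` and `variance` functions use the n - 1 divisor.

```python
import statistics

data = [5, 6, 7, 7, 9, 4, 5]

mean = statistics.mean(data)
variance = statistics.variance(data)   # sample variance, divisor n - 1
sd = statistics.stdev(data)            # sample standard deviation

print(f"mean     = {mean:.3f}")
print(f"variance = {variance:.2f}")
print(f"S.D.     = {sd:.2f}")
```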
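For a frequency distribution with values x_i and frequencies f_i, the standard deviation is
\[ S.D. = s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}} \]
or
\[ S.D. = s = \sqrt{\frac{1}{n-1}\left\{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right\}} \]
Example: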
The following data give the frequency distribution of the daily commuting times (in minutes) from home to IUBAT for 25 faculty members.
Daily Commuting Time (minutes)    Number of faculty members
0-10                              4
10-20                             9
20-30                             6
30-40                             4
40-50                             2
Taking x_i as the midpoint of each class:
Daily Commuting Time (minutes)    f_i      x_i    f_i x_i    f_i x_i^2
0-10                              4        5      20         100
10-20                             9        15     135        2025
20-30                             6        25     150        3750
30-40                             4        35     140        4900
40-50                             2        45     90         4050
Total                             n=25            535        14,825
\[ \therefore S.D. = s = \sqrt{\frac{1}{n-1}\left\{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right\}} = \sqrt{\frac{1}{25-1}\left\{14825 - \frac{(535)^2}{25}\right\}} = \sqrt{140.67} \approx 11.86 \text{ mins} \]
Variance = s^2 ≈ 140.67
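The grouped calculation can be reproduced in Python. The sketch below assumes class frequencies of 4, 9, 6, 4 and 2, which are the frequencies that reproduce the totals 535 and 14,825 used in the worked solution, and applies the n - 1 formula.

```python
import math

# Class midpoints x_i and frequencies f_i. These frequencies are the ones
# that reproduce the totals 535 and 14,825 quoted in the example.
midpoints = [5, 15, 25, 35, 45]
freqs = [4, 9, 6, 4, 2]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

# Shortcut formula for grouped data with the n - 1 divisor.
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(f"n = {n}, sum f*x = {sum_fx}, sum f*x^2 = {sum_fx2}")
print(f"variance = {variance:.2f}")
print(f"S.D.     = {sd:.2f} mins")
```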
Coefficient of variation:
The most important and commonly used relative measure of dispersion is the coefficient of variation (CV). The coefficient of variation is the ratio of the standard deviation to the arithmetic mean, usually expressed as a percentage. The formula for the C.V. is
\[ C.V. = \frac{\text{standard deviation}}{\text{arithmetic mean}} \times 100 = \frac{\sigma}{\bar{x}} \times 100 \]
This measure is free of units. Hence it can be used to compare the dispersion of two or more distributions.
Example:
Below are the scores of two cricketers in 10 innings. Who is the more consistent scorer?
A 204 68 150 30 70 95 60 76 24 19
B 99 190 130 94 80 89 69 85 65 40
Solution:
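Let x denote the scores of player A and y the scores of player B. Then
\[ cv_x = \frac{\sigma_x}{\bar{x}} \times 100, \qquad cv_y = \frac{\sigma_y}{\bar{y}} \times 100 \]
Now
\[ \bar{x} = \frac{204 + 68 + \dots + 19}{10} = 79.6 \]
and, using the n − 1 divisor as in the definition of the standard deviation,
\[ \sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(204-79.6)^2 + (68-79.6)^2 + \dots + (19-79.6)^2}{10-1}} = \sqrt{\frac{30516.4}{9}} = 58.23 \]
\[ \therefore cv_x = \frac{\sigma_x}{\bar{x}} \times 100 = \frac{58.23}{79.6} \times 100 = 73.15\% \]
Similarly,
\[ \bar{y} = \frac{99 + 190 + \dots + 40}{10} = 94.1 \]
and
\[ cv_y = \frac{\sigma_y}{\bar{y}} \times 100 = \frac{41.12}{94.1} \times 100 = 43.70\% \]
The coefficient of variation of A is greater than the coefficient of variation of B, and hence we conclude that player B is more consistent.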
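The comparison can be reproduced with a short Python sketch. The helper name `coefficient_of_variation` is illustrative; it uses `statistics.stdev`, which applies the n - 1 divisor and therefore matches the values 58.23 and 41.12 in the solution.

```python
import statistics

scores_a = [204, 68, 150, 30, 70, 95, 60, 76, 24, 19]
scores_b = [99, 190, 130, 94, 80, 89, 69, 85, 65, 40]


def coefficient_of_variation(values):
    """Sample standard deviation (n - 1 divisor) as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


cv_a = coefficient_of_variation(scores_a)
cv_b = coefficient_of_variation(scores_b)

# The lower coefficient of variation indicates the more consistent scorer.
more_consistent = "A" if cv_a < cv_b else "B"

print(f"CV of A = {cv_a:.2f}%")
print(f"CV of B = {cv_b:.2f}%")
print(f"More consistent player: {more_consistent}")
```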
Chebyshev’s theorem:
For any number k greater than 1, at least \( 1 - \frac{1}{k^2} \) of the data values lie within k standard deviations of the mean.
Thus, for example, if k = 2 then
\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75 \]
Therefore, according to Chebyshev’s theorem, at least 0.75 or 75% of the values of a data set lie within two standard deviations of the mean.
Figure: Percentage of values within two standard deviations of the mean for Chebyshev’s
theorem.
Example:
The average systolic blood pressure for 4000 women who were screened for high blood
pressure was found to be 187 mm Hg with a standard deviation of 22. Using Chebyshev’s
theorem, find at least what percentage of women in this group has a systolic blood pressure
between 143 and 231 mm Hg.
Solution:
Let μ and σ be the mean and the standard deviation, respectively, of the systolic blood pressures of these women. Then, from the given information,
μ = 187
σ = 22
To find the percentage of women whose systolic blood pressures are between 143 and 231 mm Hg, the first step is to determine k. The interval is symmetric about the mean (187 − 143 = 231 − 187 = 44), so it can be written as
μ + kσ = 231
Or, kσ = 231 − 187 = 44
Or, k = 44 / 22 = 2
So for k = 2
\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75 \]
Hence, according to Chebyshev’s theorem, at least 75% of the women have systolic blood pressure between 143 and 231 mm Hg. This percentage is shown in the above figure.
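The steps of this solution can be sketched in Python. The function name `chebyshev_lower_bound` is illustrative; the numbers are those of the blood pressure example.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is stated here for k > 1")
    return 1 - 1 / k ** 2


mu, sigma = 187, 22        # mean and standard deviation (mm Hg)
lower, upper = 143, 231    # interval of interest (mm Hg)
women_screened = 4000

# The interval is symmetric about the mean, so both ends give the same k.
k = (upper - mu) / sigma
symmetric = (mu - lower) / sigma == k

bound = chebyshev_lower_bound(k)
min_women = bound * women_screened

print(f"k = {k}, interval symmetric about the mean: {symmetric}")
print(f"at least {bound:.0%} of the women, i.e. at least {min_women:.0f} of {women_screened}")
```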
ASSESSMENT TASK NO. 1
Q1. The following table gives the frequency distribution of the amounts of telephone bills for
April 2013 for a sample of 50 students.
Q2. The production of jute goods in different days of first and second of the year are shown
below.
Q3. Terrier and SFP are two stocks traded on the New York Stock Exchange. For the past
seven weeks Friday closing price (dollars per share) was recorded:
Terrier 32 35 34 36 31 39 41
SFP 51 55 56 52 55 52 57