Introduction To Descriptive Statistics
Lecture 1
What is Statistics?
Mean:
i. Arithmetic Mean (A.M)
ii. Geometric Mean(G.M)
iii. Harmonic Mean(H.M)
Median
Mode
Mean
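A.M = $\frac{1}{n}\sum_{i=1}^{n} x_i$
G.M = $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$
H.M = $\frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
A.M ≥ G.M ≥ H.M for positive observations (equality holds when all the observations are equal)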
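The three means and their ordering can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the function name `three_means` and the sample values are illustrative assumptions, and the observations are assumed to be strictly positive.

```python
import statistics


def three_means(values):
    """Return (A.M, G.M, H.M) for a list of strictly positive numbers."""
    am = statistics.mean(values)            # sum of values divided by n
    gm = statistics.geometric_mean(values)  # n-th root of the product
    hm = statistics.harmonic_mean(values)   # n divided by the sum of reciprocals
    return am, gm, hm


# Illustrative data (not from the lecture)
am, gm, hm = three_means([2, 4, 8])
print(am, gm, hm)
print(am >= gm >= hm)  # the inequality A.M >= G.M >= H.M
```

For these values the geometric mean is the cube root of 64 and the harmonic mean is 24/7, so the ordering A.M ≥ G.M ≥ H.M is visible directly.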
Median
Given a set of numeric values {$x_1, x_2, \ldots, x_n$}, the middle value in the ordered sequence of observations
is the median.
That is, to find the median we need to order the data set and then find the middle value. In case
of an even number of observations the average of the two middle most values is the median.
Example: Find the median of {9, 3, 6, 7, 5}. We first sort the data giving {3, 5, 6, 7, 9} & choose
the middle value 6.
If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the
two middle values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.
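The sort-then-pick-the-middle rule can be written as a short function. This is a sketch only; the function name `median` is an illustrative choice, and the two calls reuse the data sets from the examples above.

```python
def median(values):
    """Median by sorting and taking the middle value (or the mean of the two middle values)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists
        return ordered[mid]
    # Even number of observations: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 3, 6, 7, 5]))     # 6
print(median([9, 3, 6, 7, 5, 2]))  # 5.5
```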
Mode
The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.
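A small sketch of the mode, following the convention stated above that the mode is undefined when no observation repeats. The function name `modes` and the choice to return a list (so that ties are reported) are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means the mode is undefined."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # No observation is repeated, so there is no mode
        return []
    return [value for value, count in counts.items() if count == top]


print(modes([1, 3, 4, 5, 5, 6, 7, 11]))  # [5]
print(modes([9, 3, 6, 7, 5]))            # []
```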
The median is less sensitive to outliers (extreme scores) than the mean and thus
a better measure than the mean for highly skewed distributions, e.g. family
income.
For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median
of these four observations is (30+40)/2 = 35.
Here 3 observations out of 4 lie between 20 and 40, so the mean of 270 fails to
give a realistic picture of the major part of the data. It is pulled up by the extreme
value 990.
Variability / Dispersion
Common measures:
i. Range
ii. Variance / standard deviation
iii. interquartile range
iv. coefficient of variation etc.
Range
Range is the difference between the largest and the smallest observations.
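Variance (V) = $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the A.M
Standard deviation = $\sqrt{V}$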
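The common measures listed above can be computed together. This sketch assumes the population form of the variance (dividing by n rather than n − 1) and takes the coefficient of variation as standard deviation divided by mean; the function name and the data are illustrative.

```python
def dispersion_summary(values):
    """Range, population variance, standard deviation and coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population variance, i.e. squared deviations averaged over n
    variance = sum((x - mean) ** 2 for x in values) / n
    std_dev = variance ** 0.5
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": std_dev,
        "cv": std_dev / mean,  # undefined when the mean is zero
    }


# Illustrative data (not from the lecture)
print(dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9]))
```

For this data set the mean is 5, so the squared deviations sum to 32 and the variance is 4.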
The median divides the ordered data into two clusters: Upper & Lower
Compute the 1st Quartile ($Q_1$): the median of the lower cluster
Compute the 3rd Quartile ($Q_3$): the median of the upper cluster
IQR = $Q_3 - Q_1$
IQR - Example
Consider the following set of numbers {1, 3, 4, 5, 5, 6, 7, 11}.
Q1 is the middle value in the first half of the data set i.e {1, 3, 4, 5}
Since there are an even number of data points in the first half of the data set, the middle value is the
average of the two middle values; that is, Q1 = (3 + 4)/2 = 3.5.
Q3 is the middle value in the second half of the data set i.e {5, 6, 7, 11}
Again, since the second half of the data set has an even number of observations, the middle value is the
average of the two middle values; that is, Q3 = (6 + 7)/2 = 6.5.
Hence IQR = Q3 - Q1 = 6.5 - 3.5 = 3.
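The median-of-halves procedure in this example can be sketched as follows. For an odd number of observations the code leaves the overall median out of both halves; that convention is an assumption, since the example only covers the even case. The function names are illustrative.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with odd n, the middle observation belongs to neither half
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


q1, q2, q3 = quartiles([1, 3, 4, 5, 5, 6, 7, 11])
print(q1, q3, q3 - q1)  # 3.5 6.5 3.0
```

Statistical packages often interpolate quartiles differently, so their results can differ slightly from this hand method.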
i. Variable: Variable refers to the characteristic that varies in magnitude or quantity. E.g. weight
of the students.
ii. Frequency: Frequency refers to the number of times each value of the variable occurs. For
example, if there are 50 students weighing 60 kg, then 50 is the frequency of the value 60 kg.
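Let’s say we have a variable X = {$x_1, x_2, \ldots, x_k$} and the frequency with which $x_i$ occurs is given
by $f_i$
Then A.M ($\bar{x}$) = $\frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Mode = $x_j$ if $f_j \ge f_i$ for all $i$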
Dispersion (of a freq. dist.)
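Let’s say we have a variable X = {$x_1, x_2, \ldots, x_k$} and the frequency with which $x_i$ occurs is given by $f_i$
Variance = $\frac{1}{N}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2$, where $\bar{x}$ is the A.M and $N = \sum_{i=1}^{k} f_i$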
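For a frequency distribution, the mean, variance and mode are all frequency-weighted. The sketch below assumes the variance divides by the total frequency; the weight table is hypothetical apart from the "50 students at 60 kg" figure mentioned above, and the function name is illustrative.

```python
def frequency_summary(values, freqs):
    """Mean, variance and mode of a frequency distribution (value x_i occurs f_i times)."""
    total = sum(freqs)  # total number of observations
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    # Assumption: squared deviations are weighted by frequency and divided by the total frequency
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / total
    # The mode is the value with the largest frequency (first one wins on ties)
    mode = max(zip(values, freqs), key=lambda pair: pair[1])[0]
    return mean, variance, mode


# Hypothetical weights (kg) and number of students at each weight
weights = [50, 55, 60, 65]
students = [10, 30, 50, 10]
print(frequency_summary(weights, students))  # (58.0, 16.0, 60)
```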
Skewed Distribution
• Positively Skewed: Mode < Median < Mean
• Negatively Skewed: Mean < Median < Mode
Tests of Skewness
In order to ascertain whether a distribution is skewed or not the following tests may
be applied. Skewness is present if:
• The values of mean, median and mode do not coincide.
• When the data are plotted on a graph they do not give the normal bell-shaped form, i.e.
when cut along a vertical line through the center the two halves are not equal.
• The sum of the positive deviations from the median is not equal to the sum of the
negative deviations.
• Quartiles are not equidistant from the median.
• Frequencies are not equally distributed at points of equal deviation from the mode.
Graphical Measures of Skewness
• Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution has a departure from symmetry.
• Positive or negative skewness can be detected graphically (as below) depending on whether the
right tail or the left tail is longer, but the graph gives no idea of the magnitude.
• Hence some statistical measures are required to find the magnitude of the lack of symmetry.
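Karl Pearson's Coefficient of Skewness
• This method is most frequently used for measuring skewness. The formula for
measuring the coefficient of skewness is given by
$SK_P = \frac{\text{Mean} - \text{Mode}}{\sigma}$
Where,
$SK_P$ = Karl Pearson's Coefficient of skewness,
σ = standard deviation.
When the mode is replaced by the empirical relation Mode = 3 Median - 2 Mean, this becomes
$SK_P = \frac{\text{Mean} - (3\,\text{Median} - 2\,\text{Mean})}{\sigma}$
Now this formula is equal to
$SK_P = \frac{3(\text{Mean} - \text{Median})}{\sigma}$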
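The median-based form of Karl Pearson's coefficient, 3(Mean − Median)/σ, is straightforward to compute. This sketch assumes σ is the population standard deviation (`statistics.pstdev`); the function name is illustrative and the data reuse the 20, 30, 40, 990 example from the discussion of outliers.

```python
import statistics


def pearson_skewness(values):
    """Karl Pearson's coefficient of skewness in the form 3 * (mean - median) / sigma."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return 3 * (mean - median) / sigma


print(pearson_skewness([20, 30, 40, 990]))  # positive: the long tail is on the right
print(pearson_skewness([1, 2, 3, 4, 5]))    # zero for a symmetric data set
```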
Bowley’s Coefficient of Skewness
$SK_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
Where,
$SK_B$ = Bowley’s Coefficient of skewness,
$Q_1$ = first quartile, $Q_2$ = second quartile (the median),
$Q_3$ = third quartile
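Bowley's coefficient needs only the three quartiles. The sketch below takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, as in the IQR example; that quartile convention, the function name and the second data set are assumptions made for illustration.

```python
import statistics


def bowley_skewness(values):
    """Bowley's coefficient: (Q3 + Q1 - 2 * median) / (Q3 - Q1). Needs at least 2 values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Assumption: quartiles are medians of the halves; with odd n the middle value is skipped
    q1 = statistics.median(ordered[:half])
    q2 = statistics.median(ordered)
    q3 = statistics.median(ordered[-half:])
    return (q3 + q1 - 2 * q2) / (q3 - q1)  # fails if Q3 equals Q1


print(bowley_skewness([1, 3, 4, 5, 5, 6, 7, 11]))  # quartiles 3.5, 5, 6.5 are equidistant
print(bowley_skewness([1, 2, 2, 3, 4, 6, 9, 15]))  # Q3 lies further from the median than Q1
```

The first call uses the data from the IQR example, where the quartiles are equidistant from the median and the coefficient is therefore zero.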
Kelly’s Coefficient of Skewness
$SK_K = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}}$
Where,
$SK_K$ = Kelly’s Coefficient of skewness,
$P_{90}$ = Percentile Ninety.
$P_{50}$ = Percentile Fifty.
$P_{10}$ = Percentile Ten.
Equivalently, in terms of deciles:
$SK_K = \frac{D_9 - 2D_5 + D_1}{D_9 - D_1}$
Where,
$D_9$ = Deciles Nine.
$D_5$ = Deciles Five.
$D_1$ = Deciles One.
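Kelly's coefficient can be sketched with the decile cut points returned by `statistics.quantiles`. The interpolation rule for deciles is an assumption here (`method="inclusive"`); other conventions give slightly different values. The function name and data are illustrative.

```python
import statistics


def kelly_skewness(values):
    """Kelly's coefficient: (D9 - 2 * D5 + D1) / (D9 - D1)."""
    # n=10 returns the nine decile cut points D1..D9
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    d1, d5, d9 = deciles[0], deciles[4], deciles[8]
    return (d9 - 2 * d5 + d1) / (d9 - d1)


print(kelly_skewness(list(range(11))))                      # symmetric data
print(kelly_skewness([1, 1, 1, 2, 2, 3, 4, 6, 9, 15, 30]))  # long right tail
```

Because only D1, D5 and D9 enter the formula, observations beyond the 10th and 90th percentiles do not affect the result.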
Example:
Homework:
• Ques: The following are the marks of 150 students in an examination. Calculate Karl Pearson’s coefficient of
skewness.
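Moments:
The utility of moments lies in the sense that they indicate different aspects of a
given distribution.
The first four moments about the mean (central moments) are the following:
1st moment: $\mu_1 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
2nd moment: $\mu_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (Variance)
3rd moment: $\mu_3 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3$
4th moment: $\mu_4 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4$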
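A central moment of order r is the average of the r-th powers of the deviations from the mean. The sketch below assumes division by n; the function name and the data are illustrative.

```python
def central_moment(values, r):
    """r-th moment about the mean: the average of (x - mean) ** r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


# Illustrative data (not from the lecture)
data = [2, 4, 4, 4, 5, 5, 7, 9]
for r in (1, 2, 3, 4):
    print(r, central_moment(data, r))
```

The first central moment is always zero and the second is the variance, which for this data set is 4.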
Kurtosis
Kurtosis is another measure of the shape of a frequency curve. The word comes from Greek and means
bulginess.
While skewness signifies the extent of asymmetry, kurtosis measures the degree of peakedness
of a frequency distribution.
Karl Pearson classified curves into three types on the basis of the shape of their peaks. These are:
• Leptokurtic
• Mesokurtic
• Platykurtic
Formula: $\beta_2 = \frac{\mu_4}{\mu_2^2}$
Result:
• $\beta_2 > 3$: leptokurtic
• $\beta_2 = 3$: mesokurtic
• $\beta_2 < 3$: platykurtic
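The three-way classification can be sketched with the moment-based coefficient. The code assumes the usual definition β2 = μ4 / μ2², with 3 (the value for the normal curve) as the mesokurtic benchmark; the function names, the tolerance and the data are illustrative.

```python
def central_moment(values, r):
    """r-th moment about the mean, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def beta2(values):
    """Moment coefficient of kurtosis: mu4 / mu2 ** 2 (assumed definition)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def classify_kurtosis(b2, tol=1e-9):
    """Compare beta2 with 3, the benchmark value for the normal curve."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


b2 = beta2([2, 4, 4, 4, 5, 5, 7, 9])
print(b2, classify_kurtosis(b2))  # 2.78125 platykurtic
```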
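Kelly’s Measure of Kurtosis
Formula: commonly given as the percentile coefficient of kurtosis, $K = \frac{Q_3 - Q_1}{2(P_{90} - P_{10})}$
Result:
• $K \approx 0.263$ for the normal (mesokurtic) curve
• $K < 0.263$: leptokurtic; $K > 0.263$: platykurtic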
Example:
Homework:
• Ques: The first four raw moments of a distribution are 2, 136, 320, and 40,000.
Find out coefficients of skewness and kurtosis.
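Questions of this type first convert raw moments into central moments. The relations below are the standard conversions, with $\mu'_r$ denoting the $r$-th raw moment and $\mu_r$ the $r$-th central moment. The symbols $\beta_1$ and $\beta_2$ for the moment coefficients of skewness and kurtosis are notation assumed here.

```latex
\begin{aligned}
\mu_2 &= \mu'_2 - (\mu'_1)^2 \\
\mu_3 &= \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3 \\
\mu_4 &= \mu'_4 - 4\mu'_1\mu'_3 + 6(\mu'_1)^2\mu'_2 - 3(\mu'_1)^4 \\
\beta_1 &= \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}
\end{aligned}
```

As a check on the formulas, the data set {1, 2, 3, 4} has raw moments 2.5, 7.5, 25 and 88.5, which give $\mu_2 = 1.25$, $\mu_3 = 0$ and $\mu_4 = 2.5625$.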
Bivariate Descriptive Statistics
Covariance & Correlation
Cov(X,Y) = $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
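Covariance and the correlation coefficient derived from it can be sketched in a few lines. The code assumes the population form (dividing by n) and Pearson's correlation r = Cov(X,Y) / (σ_X σ_Y); the function names and data are illustrative.

```python
def covariance(xs, ys):
    """Average product of deviations from the respective means (dividing by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    # Cov(X, X) is the variance of X, so its square root is the standard deviation
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # exactly twice xs, so the relation is perfectly linear
print(covariance(xs, ys), correlation(xs, ys))
```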
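Spearman's rank correlation: for a sample of size n, the n raw observations $X_i$, $Y_i$ are converted to rank
variables R($X_i$) and R($Y_i$)
$r_s = \rho_{R(X),R(Y)}$, the correlation between the rank variables R(X) and R(Y)
Observe the $X_i$ and store the rank of each in another column R($X_i$); do the same for the $Y_i$
When all ranks are distinct, $r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$, where $d_i = R(X_i) - R(Y_i)$
$\sum d_i^2$ = 194
n = 10
$r_s = 1 - \frac{6 \times 194}{10(10^2 - 1)} = 1 - \frac{1164}{990} \approx -0.176$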
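The ranking steps and the shortcut formula 1 − 6Σd²/(n(n² − 1)) can be sketched as below. The two data columns are illustrative values, not taken from the lecture, chosen so that they give Σd² = 194 with n = 10 as in the worked numbers; the code assumes there are no tied observations, and the function names are hypothetical.

```python
def ranks(values):
    """Rank each observation, 1 for the smallest. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Return (sum of squared rank differences, Spearman's r_s) for untied data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return d_squared, 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Illustrative data only (assumed, not from the lecture)
x = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
y = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]
d_squared, r_s = spearman(x, y)
print(d_squared, round(r_s, 3))  # 194 -0.176
```

With tied observations the shortcut formula is no longer exact, and the correlation of the rank columns should be computed directly.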
Measures of Inequality
Setup
The population principle states that it doesn’t matter how large the population is:
we can convert everything to percentiles (bottom 1%, lowest 20%, top 25%).
Relative Income Principle
Only the relative incomes should matter and the absolute levels of these incomes
should not.
Income levels have no meaning for inequality measurement. Absolute levels matter
for assessing economic development, and we will see that the level also matters for the
measurement of poverty.
Let (y1, y2, . . . , yn) be an income distribution and consider two incomes yi and yj with
yi ≤ yj. A transfer of income from individual i to individual j is called a regressive transfer.
If the inequality is strict (yi < yj), the regressive transfer is from the poorer individual to the
richer individual.
With weak inequality (≤), use the language “not richer” to “not poorer”.
Write our inequality index as a function of the form I = I (y1, y2, . . . , yn),
with I defined over all conceivable distributions of income (y1, y2, . . . , yn).
The index satisfies the Pigou-Dalton transfer principle if, for every income distribution
(y1, y2, . . . , yn) and every transfer δ > 0,
I (y1, . . . , yi , . . . , yj , . . . , yn) < I (y1, . . . , yi − δ, . . . , yj + δ, . . . , yn) whenever yi ≤ yj.
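The transfer condition can be checked numerically for a particular index. The sketch below uses the coefficient of variation as the inequality index I; that choice, the function names and the incomes are assumptions made purely for illustration.

```python
def coefficient_of_variation(incomes):
    """Standard deviation divided by mean, used here as the inequality index I."""
    n = len(incomes)
    mean = sum(incomes) / n
    variance = sum((y - mean) ** 2 for y in incomes) / n
    return variance ** 0.5 / mean


def regressive_transfer(incomes, i, j, delta):
    """Move delta from person i to person j, where person i is not richer than person j."""
    if delta <= 0 or incomes[i] > incomes[j]:
        raise ValueError("need delta > 0 and incomes[i] <= incomes[j]")
    new_incomes = list(incomes)
    new_incomes[i] -= delta
    new_incomes[j] += delta
    return new_incomes


before = [10, 20, 30, 40]  # hypothetical incomes
after = regressive_transfer(before, 0, 3, 5)
print(after)  # [5, 20, 30, 45]
print(coefficient_of_variation(before) < coefficient_of_variation(after))  # True
```

Total income is unchanged by the transfer, so the rise in the index reflects only the change in how income is spread.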
Lorenz Curve
Lorenz curve is a simple diagrammatic way to depict the distribution of income.
On the horizontal axis we list the cumulative percentage of the population arranged
in increasing order of income.
Thus points on the axis refer to the poorest 20% of the population, the poorest
half, etc.
On the vertical axis we measure the percentage of the national income accruing to
any particular fraction of the population thus arranged.
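The construction of the curve can be sketched by sorting incomes and accumulating shares. Each point pairs a cumulative population share with the cumulative income share of that group; the function name and the incomes are illustrative assumptions.

```python
def lorenz_points(incomes):
    """Return (cumulative population share, cumulative income share) pairs, starting at (0, 0)."""
    ordered = sorted(incomes)  # poorest first
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for k, income in enumerate(ordered, start=1):
        running += income
        points.append((k / n, running / total))
    return points


# Hypothetical incomes of four people
for population_share, income_share in lorenz_points([40, 10, 30, 20]):
    print(population_share, income_share)
```

Here the poorest half of the population holds 30% of total income, so the curve lies below the 45° line.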
Source: U.S. Census Bureau, Historical Income Tables, Households, Table H-2.
Lorenz Curve Properties
The slope of the Lorenz curve is the contribution of the person at that
point to the cumulative share of national income.
Ordered from poorest to richest the “marginal contribution” can never fall.
Equivalently, the Lorenz curve can never get flatter as we move from left
to right.
The overall distance between the 45° line and the Lorenz curve represents the
amount of inequality present in the society.
Lorenz Curves for Sweden, the United States,
and Bolivia
Sources: Statistics Sweden, online database, Disposable Income in Deciles 2011–2014; U.S. Census Bureau, Historical Income Tables, Households, Table H-2;
World Bank, World Development Indicators database.
Intersecting Lorenz Curve – Confusion !
The Gini Coefficient: A/(A+B)
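A minimal sketch of the ratio A/(A+B), assuming A is the area between the 45° line and the Lorenz curve and B is the area under the Lorenz curve. Since A + B is the triangle under the 45° line (area 1/2), the ratio equals 1 − 2B. The function name and incomes are illustrative, and B is approximated with trapezoids between the Lorenz points.

```python
def gini(incomes):
    """Gini coefficient as A / (A + B) = 1 - 2B, with B the area under the Lorenz curve."""
    ordered = sorted(incomes)  # poorest first
    n = len(ordered)
    total = sum(ordered)
    area_b = 0.0
    previous_share = 0.0
    running = 0
    for income in ordered:
        running += income
        share = running / total
        # Trapezoid between consecutive Lorenz points; each step is 1/n wide
        area_b += (previous_share + share) / 2 * (1 / n)
        previous_share = share
    return 1 - 2 * area_b


print(gini([10, 20, 30, 40]))  # about 0.25
print(gini([25, 25, 25, 25]))  # about 0: the Lorenz curve is the 45° line
```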
Data and Trends
Figure 10.4: Gini Coefficient in the United States,
1967-2010
Source: U.S. Census Bureau, Historical Income Tables, Households, Table H-2.
Figure 10.5: Income Share of the Top 10 Percent
and Top 1 percent in the United States, 1917-2012
Source: Emmanuel Saez, income inequality database updated to 2012, University of California, Berkeley, http://elsa.berkeley.edu/~saez/.
Figure 10.6: The Distribution of Wealth in the
United States, 2009
Source: Sylvia A. Allegretto, “The State of Working America’s Wealth, 2011,” Economic Policy Institute, EPI Briefing Paper #292, March 23, 2011.