
Introduction To Descriptive Statistics

Uploaded by

Vinay Dutta
Copyright
© © All Rights Reserved

Statistics For Economics (HS20204) –

Lecture 1

INTRODUCTION TO DESCRIPTIVE
STATISTICS
What is Statistics?

• Statistics is the science of the collection, presentation, analysis, and
reasonable interpretation of data.
Types of Data

• Based on collection methodology:
i. Primary
ii. Secondary

• Based on time & space:
i. Cross-sectional
ii. Time series
iii. Panel
Statistical Description of Data

• Statistics describes a numeric set of data by its:
i. Central Tendency
ii. Variability / Dispersion
iii. Shape (Skewness & Kurtosis)

• Statistics describes a categorical set of data by the frequency, percentage
or proportion of each category.
Measures of Central Tendency

• Mean:
i. Arithmetic Mean (A.M)
ii. Geometric Mean (G.M)
iii. Harmonic Mean (H.M)

• Median

• Mode
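Mean

• Given a set of numeric observations $\{x_1, x_2, \ldots, x_n\}$:

• A.M $= \frac{1}{n}\sum_{i=1}^{n} x_i$

• G.M $= \left(\prod_{i=1}^{n} x_i\right)^{1/n}$

• H.M $= \frac{n}{\sum_{i=1}^{n} 1/x_i} = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}$

• A.M ≥ G.M ≥ H.M for positive observations (equality holds when all the observations are equal)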
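The three means and the inequality between them can be checked numerically. The following is a minimal sketch that assumes strictly positive observations; the function names and the sample values are illustrative and not part of the lecture.

```python
import math

def arithmetic_mean(xs):
    """Sum of the observations divided by their count."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """n-th root of the product, computed through logarithms (xs must be > 0)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals (xs must be > 0)."""
    return len(xs) / sum(1 / x for x in xs)

data = [2.0, 4.0, 8.0]  # illustrative values
am = arithmetic_mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print(am, gm, hm)
```

For these values the arithmetic mean is 14/3, the geometric mean is 4 and the harmonic mean is 24/7, so the ordering A.M ≥ G.M ≥ H.M is visible directly.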
Median

• Given a set of numeric values $\{x_1, x_2, \ldots, x_n\}$, the middle value in the ordered sequence of observations
is the median.

• That is, to find the median we need to order the data set and then find the middle value. In the case
of an even number of observations, the average of the two middle values is the median.

• Example: Find the median of {9, 3, 6, 7, 5}. We first sort the data, giving {3, 5, 6, 7, 9}, & choose
the middle value 6.

• If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the
two middle values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.
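The sort-then-pick-the-middle rule can be sketched in a few lines. The function name `median` is chosen here for illustration; the two calls reuse the data sets from the slide.

```python
def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

odd_case = median([9, 3, 6, 7, 5])        # odd number of observations
even_case = median([9, 3, 6, 7, 5, 2])    # even number of observations
print(odd_case, even_case)
```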
Mode

• The mode is the value that is observed most frequently. It is undefined for
sequences in which no observation is repeated.

• Consider the set of numbers {1,2,6,6,7,8,7,7,9,9,0,1}. The most frequently
occurring number is 7 (it appears three times), so the mode of this set of
observations is 7.
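A small sketch of the counting behind the mode, using the numbers from the slide. Returning a list (so that ties are visible) and returning an empty list when nothing repeats are design choices made here, not something the lecture prescribes.

```python
from collections import Counter

def modes(xs):
    """All values with the highest count; empty list if no value is repeated."""
    counts = Counter(xs)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no observation is repeated, so the mode is undefined
    return sorted(value for value, count in counts.items() if count == top)

slide_data = [1, 2, 6, 6, 7, 8, 7, 7, 9, 9, 0, 1]
print(modes(slide_data))
```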
Mean vs Median

• The median is less sensitive to outliers (extreme scores) than the mean and is thus
a better measure than the mean for highly skewed distributions, e.g. family
income.

• For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median
of these four observations is (30+40)/2 = 35.

• Here 3 observations out of 4 lie between 20 and 40, so the mean of 270 fails to
give a realistic picture of the major part of the data. It is pulled up by the extreme
value 990.
Variability / Dispersion

• Variability (or dispersion) measures the amount of scatter in a dataset.

• Common measures:
i. Range
ii. Variance / standard deviation
iii. Interquartile range
iv. Coefficient of variation, etc.
Range

• Range is the difference between the largest and the smallest observations.

• Given a set of observations $\{x_1, x_2, \ldots, x_n\}$, the range (R) is given by

$R = \max\{x_1, x_2, \ldots, x_n\} - \min\{x_1, x_2, \ldots, x_n\}$

• The range of 10, 5, 2, 100 is (100 - 2) = 98.

• It is a crude measure of variability, since it depends only on the two extreme observations.

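Standard Deviation

• Given a set of numeric observations $\{x_1, x_2, \ldots, x_n\}$ with arithmetic mean $\bar{x}$:

• Variance (V) $= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

• Standard deviation (s.d) $= \sqrt{V}$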
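The range, variance and standard deviation can be sketched as follows. This assumes the variance divides by n (the convention that matches the second central moment used later in the lecture) rather than by n - 1; the function names are illustrative, and the data are the four numbers from the Range slide.

```python
import math

def value_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

def variance(xs):
    """Mean squared deviation from the arithmetic mean (divides by n, an assumption)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def std_dev(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))

data = [10, 5, 2, 100]
print(value_range(data), variance(data), std_dev(data))
```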


Inter-Quartile Range (IQR)
• Given a set of numeric observations $\{x_1, x_2, \ldots, x_n\}$:

• Sort the numbers in increasing order & compute the median.

• The median divides the data into two clusters: Lower & Upper.

• Compute the 1st Quartile ($Q_1$): the median of the lower cluster.

• Compute the 3rd Quartile ($Q_3$): the median of the upper cluster.

IQR = $Q_3 - Q_1$
IQR - Example
Consider the following set of numbers {1, 3, 4, 5, 5, 6, 7, 11}.

Q1 is the middle value in the first half of the data set, i.e. {1, 3, 4, 5}.

Since there is an even number of data points in the first half of the data set, the middle value is the
average of the two middle values; that is, Q1 = (3 + 4)/2 = 3.5.

Q3 is the middle value in the second half of the data set, i.e. {5, 6, 7, 11}.

Again, since the second half of the data set has an even number of observations, the middle value is the
average of the two middle values; that is, Q3 = (6 + 7)/2 = 6.5.

The interquartile range is Q3 minus Q1, so IQR = 6.5 - 3.5 = 3.
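The median-of-halves procedure used in this example can be sketched as below. One detail is an assumption: when the number of observations is odd, the overall median is left out of both halves (the slides do not say how to treat it). The helper names are illustrative, and the input needs at least two observations.

```python
def median_sorted(s):
    """Median of an already sorted, non-empty list."""
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quartiles(xs):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]               # first half
    upper = s[len(s) - half:]      # second half (middle value skipped if n is odd)
    return median_sorted(lower), median_sorted(s), median_sorted(upper)

def iqr(xs):
    """Interquartile range Q3 - Q1."""
    q1, _, q3 = quartiles(xs)
    return q3 - q1

example = [1, 3, 4, 5, 5, 6, 7, 11]
print(quartiles(example), iqr(example))
```

Statistical packages often use interpolation-based quartile rules, so their results can differ slightly from this hand method.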


Coefficient of Variation

Given a set of observations $\{x_1, x_2, \ldots, x_n\}$

The coefficient of variation is given by CV $= \frac{\text{s.d}}{\text{Mean}}$

(Usually by “Mean” we refer to A.M)
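A short sketch of the coefficient of variation, taken here as standard deviation divided by arithmetic mean with the variance dividing by n (both assumptions about the lost slide formula). It also illustrates why CV is useful: it does not change when every observation is rescaled by the same positive factor.

```python
import math

def coefficient_of_variation(xs):
    """Standard deviation (dividing by n) over the arithmetic mean; mean must be non-zero."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sd / mean

incomes = [20, 30, 40, 990]                 # numbers from the Mean vs Median slide
rescaled = [10 * x for x in incomes]        # same data in different units
cv_original = coefficient_of_variation(incomes)
cv_rescaled = coefficient_of_variation(rescaled)
print(cv_original, cv_rescaled)
```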
Frequency Distribution
• In this type of classification there are two elements:

i. Variable: the characteristic that varies in magnitude or quantity, e.g. the weight
of the students.

ii. Frequency: the number of times each value of the variable occurs. For
example, if 50 students have a weight of 60 kg, then 50 is the frequency of the value 60 kg.

• There are two types of quantitative classification of data:

i. Discrete frequency distribution (the variable is discrete)
ii. Continuous frequency distribution (the variable is continuous)
Frequency Distribution - Example
Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of variable ‘age’ can be tabulated as follows:
Cumulative Frequency Distribution
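Central Tendency (of a freq. dist.)

• Let’s say we have a variable X = $\{x_1, x_2, \ldots, x_k\}$ and the frequency with which $x_i$ occurs is given
by $f_i$.

• Then A.M ($\bar{x}$) $= \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$

• The median is still the middle value of the ordered sequence.

• Mode = $x_j$ if $f_j \ge f_i$ for all $i$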
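Dispersion (of a freq. dist.)

• Let’s say we have a variable X = $\{x_1, x_2, \ldots, x_k\}$ and the frequency with which $x_i$ occurs is given by $f_i$.

• Variance $= \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2$, where $N = \sum_{i=1}^{k} f_i$

• Range = max $x_i$ - min $x_i$

• IQR = $Q_3 - Q_1$; where cumul freq($Q_1$) = $N/4$ & cumul freq($Q_3$) = $3N/4$.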
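The frequency-weighted formulas can be sketched on a small table. The frequencies below are hypothetical: they were made up to total 26 children aged 1 to 6, because the table from the example slide did not survive, and the function names are illustrative. The variance divides by the total frequency N, which is an assumption about the slide's formula.

```python
def fd_mean(freq):
    """Frequency-weighted arithmetic mean: sum(f * x) / sum(f)."""
    total = sum(freq.values())
    return sum(x * f for x, f in freq.items()) / total

def fd_variance(freq):
    """Frequency-weighted mean squared deviation (divides by N, an assumption)."""
    total = sum(freq.values())
    mean = fd_mean(freq)
    return sum(f * (x - mean) ** 2 for x, f in freq.items()) / total

def fd_mode(freq):
    """Value with the largest frequency (the first one found if there is a tie)."""
    return max(freq, key=freq.get)

def fd_kth(freq, k):
    """k-th smallest observation (1-indexed), located through cumulative frequencies."""
    running = 0
    for x in sorted(freq):
        running += freq[x]
        if running >= k:
            return x
    raise ValueError("k exceeds the number of observations")

def fd_median(freq):
    """Middle observation, or the average of the two middle ones if N is even."""
    total = sum(freq.values())
    if total % 2 == 1:
        return fd_kth(freq, (total + 1) // 2)
    return (fd_kth(freq, total // 2) + fd_kth(freq, total // 2 + 1)) / 2

# Hypothetical age -> number of children table (26 children in total)
ages = {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}
print(fd_mean(ages), fd_median(ages), fd_mode(ages), fd_variance(ages))
```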
Continuous Frequency distribution
The following technical terms are important when a continuous
frequency distribution is formed:
• Class limits: Class limits are the lowest and highest values
that can be included in a class. For example, take the class 51-55.
The lowest value of the class is 51 and the highest value is 55. In
this class there can be no value less than 51 or more than 55.
51 is the lower class limit and 55 is the upper class limit.

• Class interval: The difference between the upper and lower
limits of a class is known as the class interval of that class.

• Class frequency: The number of observations corresponding to a
particular class is known as the frequency of that class.
Shape (Skewness & Kurtosis)
Concept of Skewness

• A frequency distribution is said to be skewed when the mean, median and
mode fall at different positions in the distribution and the balance (or center
of gravity) is shifted to one side or the other, i.e. to the left or to the right.

• Therefore, the concept of skewness helps us to understand the relationship
between three measures:
• Mean
• Median
• Mode
Symmetrical Distribution
• A frequency distribution is said to be symmetrical if the frequencies are
equally distributed on both sides of the central value.

• A symmetrical distribution may be either bell-shaped or U-shaped.

• In a symmetrical bell-shaped distribution, the values of mean, median and mode are
equal, i.e. Mean = Median = Mode. (In a U-shaped distribution the mean and median still coincide, but the modes lie at the two ends.)
Skewed Distribution

• A frequency distribution is said to be skewed if the frequencies are not
equally distributed on both sides of the central value.

• A skewed distribution may be:

• Positively Skewed
• Negatively Skewed
Skewed Distribution

• Negatively Skewed
• In this, the distribution is skewed to the left (negative).
• Here, Mode exceeds Mean and Median.
Mean < Median < Mode

• Positively Skewed
• In this, the distribution is skewed to the right (positive).
• Here, Mean exceeds Mode and Median.
Mode < Median < Mean
Tests of Skewness

In order to ascertain whether a distribution is skewed or not, the following tests may
be applied. Skewness is present if:
• The values of mean, median and mode do not coincide.
• When the data are plotted on a graph, they do not give the normal bell-shaped form, i.e.
when cut along a vertical line through the center the two halves are not equal.
• The sum of the positive deviations from the median is not equal to the sum of the
negative deviations.
• Quartiles are not equidistant from the median.
• Frequencies are not equally distributed at points of equal deviation from the mode.
Graphical Measures of Skewness

• Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution departs from symmetry.
• Positive or negative skewness can be detected graphically (as below) depending on whether the
right tail or the left tail is longer, but this gives no idea of the magnitude.
• Hence some statistical measures are required to find the magnitude of the lack of symmetry.

[Figure: three frequency curves. Skewed to the Right: Mean > Median > Mode. Symmetrical: Mean = Median = Mode. Skewed to the Left: Mean < Median < Mode.]


Statistical Measures of Skewness

Absolute Measures of Skewness

Following are the absolute measures of skewness:
• Skewness (Sk) = Mean – Median
• Skewness (Sk) = Mean – Mode
• Skewness (Sk) = (Q3 - Q2) - (Q2 - Q1)

Relative Measures of Skewness

There are four measures of skewness:
• β and γ Coefficient of skewness
• Karl Pearson's Coefficient of skewness
• Bowley’s Coefficient of skewness
• Kelly’s Coefficient of skewness
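β and γ Coefficient of Skewness

• $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$ and $\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$, where $\mu_2$ and $\mu_3$ are the second and third moments about the mean. $\gamma_1$ carries the sign of $\mu_3$, so it also shows the direction of skewness.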
Karl Pearson's Coefficient of Skewness (1)

• This method is most frequently used for measuring skewness. The formula for
measuring the coefficient of skewness is given by

$SK_P = \frac{\text{Mean} - \text{Mode}}{\sigma}$

Where,
$SK_P$ = Karl Pearson's Coefficient of skewness,
σ = standard deviation.

Normally, this coefficient of skewness lies between -3 and +3.

Karl Pearson's Coefficient of Skewness (2)
In case the mode is indeterminate, the coefficient of skewness is:

$SK_P = \frac{\text{Mean} - (3\,\text{Median} - 2\,\text{Mean})}{\sigma}$

Now this formula is equal to

$SK_P = \frac{3(\text{Mean} - \text{Median})}{\sigma}$

The value of the coefficient of skewness is zero when the distribution is symmetrical.
The value of the coefficient of skewness is positive when the distribution is positively skewed.
The value of the coefficient of skewness is negative when the distribution is negatively skewed.
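Both forms of Karl Pearson's coefficient can be sketched with Python's standard `statistics` module. The data are the twelve numbers from the Mode slide; using the population standard deviation (`pstdev`, dividing by n) for σ is an assumption, and the function names are illustrative.

```python
import statistics

def pearson_skew_mode(xs):
    """(Mean - Mode) / sigma, with sigma the population standard deviation."""
    return (statistics.mean(xs) - statistics.mode(xs)) / statistics.pstdev(xs)

def pearson_skew_median(xs):
    """3 * (Mean - Median) / sigma, used when the mode is indeterminate."""
    return 3 * (statistics.mean(xs) - statistics.median(xs)) / statistics.pstdev(xs)

data = [1, 2, 6, 6, 7, 8, 7, 7, 9, 9, 0, 1]   # mean 5.25, median 6.5, mode 7
sk_mode = pearson_skew_mode(data)
sk_median = pearson_skew_median(data)
print(sk_mode, sk_median)
```

Both values come out negative here because the mean lies below the median and the mode, which is the pattern the slides describe for a negatively skewed distribution. The two forms need not agree numerically, since Mode = 3 Median - 2 Mean is only an empirical approximation.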
Bowley’s Coefficient of Skewness (1)

Bowley developed a measure of skewness which is based on quartile values.
The formula for measuring skewness is:

$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$

Where,
$SK_B$ = Bowley’s Coefficient of skewness,
$Q_1$ = first quartile,
$Q_2$ = second quartile (the median),
$Q_3$ = third quartile.
Bowley’s Coefficient of Skewness (2)

The above formula can be converted to:

$SK_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

The value of the coefficient of skewness is zero if the distribution is symmetrical.
If the value is greater than zero, it is a positively skewed distribution.
And if the value is less than zero, it is a negatively skewed distribution.
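Bowley's coefficient can be sketched with the median-of-halves quartiles from the IQR example. The quartile rule (dropping the middle value when n is odd) and the second, right-skewed data set are assumptions made for illustration; the first data set is the one from the IQR example.

```python
import statistics

def quartiles(xs):
    """(Q1, Q2, Q3) from the medians of the lower half, whole set and upper half."""
    s = sorted(xs)
    half = len(s) // 2            # needs at least two observations
    lower = s[:half]
    upper = s[len(s) - half:]     # middle value is skipped when n is odd
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

def bowley(xs):
    """((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    q1, q2, q3 = quartiles(xs)
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

def bowley_converted(xs):
    """Equivalent form (Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    q1, q2, q3 = quartiles(xs)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

iqr_example = [1, 3, 4, 5, 5, 6, 7, 11]       # Q1 = 3.5, Q2 = 5, Q3 = 6.5
right_skewed = [1, 2, 3, 4, 10, 20, 40, 80]   # illustrative data with a long right tail
print(bowley(iqr_example), bowley(right_skewed))
```

For the IQR example the quartiles are equidistant from the median, so the coefficient is zero even though the raw data contain the large value 11; a quartile-based measure ignores what happens in the tails.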
Kelly’s Coefficient of Skewness (1)

Kelly developed another measure of skewness, which is based on percentiles and
deciles.
The formula for measuring skewness based on percentiles is as follows:

$SK_K = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}}$

Where,
$SK_K$ = Kelly’s Coefficient of skewness,
$P_{90}$ = ninetieth percentile,
$P_{50}$ = fiftieth percentile,
$P_{10}$ = tenth percentile.
Kelly’s Coefficient of Skewness (2)

The formula for measuring skewness based on deciles is as follows:

$SK_K = \frac{D_9 - 2D_5 + D_1}{D_9 - D_1}$

Where,
$SK_K$ = Kelly’s Coefficient of skewness,
$D_9$ = ninth decile,
$D_5$ = fifth decile,
$D_1$ = first decile.
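The decile form can be sketched with `statistics.quantiles`, which returns the nine decile cut points when called with `n=10`. The interpolation rule is the module's default ("exclusive") method, which is an assumption here since the slides do not say how the deciles are to be computed; both data sets are illustrative.

```python
import statistics

def kelly_skewness(xs):
    """(D9 - 2*D5 + D1) / (D9 - D1) using the default interpolation of statistics.quantiles."""
    deciles = statistics.quantiles(xs, n=10)   # nine cut points D1 .. D9
    d1, d5, d9 = deciles[0], deciles[4], deciles[8]
    return (d9 - 2 * d5 + d1) / (d9 - d1)

symmetric = list(range(1, 20))                 # 1, 2, ..., 19
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]
print(kelly_skewness(symmetric), kelly_skewness(right_skewed))
```

Since D5 is the median and D1, D9 equal P10, P90, this is the same quantity as the percentile form.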
Example:

 
Homework:

• Ques: The following are the marks of 150 students in an examination. Calculate Karl Pearson’s coefficient of
skewness.

Marks No. of Students


0-10 20
10-20 10
20-30 40
30-40 0
40-50 15
50-60 20
60-70 15
70-80 10
80-90 30
Moments:

• In Statistics, moments are used to indicate peculiarities of a frequency
distribution.

• The utility of moments lies in the fact that they indicate different aspects of a
given distribution.

• Thus, by using moments, we can measure the central tendency of a series, its
dispersion or variability, its skewness and the peakedness of the curve.

• The moments about the actual arithmetic mean are denoted by μ.

• The first four moments about the mean (central moments) are the following:
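Moments:

Moments around Mean (with $\bar{x}$ the arithmetic mean): $\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$

Moments around any Arbitrary No (A): $\mu'_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$

Conversion formula for Moments

1st moment: $\mu_1 = \mu'_1 - \mu'_1 = 0$ (Mean: $\bar{x} = A + \mu'_1$)

2nd moment: $\mu_2 = \mu'_2 - (\mu'_1)^2$ (Variance)

3rd moment: $\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3$ (Skewness)

4th moment: $\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4$ (Kurtosis)

Two important constants calculated from μ2, μ3 and μ4 are:

β1 (read as beta one): $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$

β2 (read as beta two): $\beta_2 = \frac{\mu_4}{\mu_2^2}$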
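The conversion from moments about an arbitrary number to central moments can be checked numerically. The sketch below uses the standard textbook conversion formulas and the usual definitions β1 = μ3²/μ2³ and β2 = μ4/μ2², stated here as assumptions because the slide formulas were lost in extraction; the data set and the arbitrary origin A = 4 are illustrative.

```python
def moment_about(xs, r, a):
    """r-th moment about the point a: mean of (x - a) ** r."""
    return sum((x - a) ** r for x in xs) / len(xs)

def central_from_raw(m1, m2, m3, m4):
    """Convert moments about an arbitrary point into central moments mu2, mu3, mu4."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

def beta_coefficients(mu2, mu3, mu4):
    """beta1 = mu3^2 / mu2^3 (skewness), beta2 = mu4 / mu2^2 (kurtosis)."""
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2

data = [1, 3, 4, 5, 5, 6, 7, 11]
A = 4                                             # arbitrary origin
raw = [moment_about(data, r, A) for r in (1, 2, 3, 4)]
mu2, mu3, mu4 = central_from_raw(*raw)

mean = sum(data) / len(data)
direct = [moment_about(data, r, mean) for r in (2, 3, 4)]   # central moments computed directly
beta1, beta2 = beta_coefficients(mu2, mu3, mu4)
print(mu2, mu3, mu4, beta1, beta2)
```

The converted values agree with the directly computed central moments, which is a quick way to confirm that the signs in the conversion formulas have been copied correctly.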
Kurtosis

• Kurtosis is another measure of the shape of a frequency curve. It is a Greek word, which means
bulginess.

• While skewness signifies the extent of asymmetry, kurtosis measures the degree of peakedness
of a frequency distribution.

• Karl Pearson classified curves into three types on the basis of the shape of their peaks. These are:

• Leptokurtic
• Mesokurtic
• Platykurtic
Kurtosis

• When the peak of a curve becomes relatively high, that curve is
called Leptokurtic.

• When the curve is flat-topped, it is called Platykurtic.

• Since the normal curve is neither very peaked nor very flat-topped, it
is taken as the basis for comparison.

• This normal curve is called Mesokurtic.
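Measure of Kurtosis

• There are two measures of Kurtosis:

i. Karl Pearson’s Measure of Kurtosis
ii. Kelly’s Measure of Kurtosis
Karl Pearson’s Measures of Kurtosis

Formula: $\beta_2 = \frac{\mu_4}{\mu_2^2}$

Result:
• $\beta_2 = 3$: Mesokurtic
• $\beta_2 > 3$: Leptokurtic
• $\beta_2 < 3$: Platykurtic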
Kelly’s Measure of Kurtosis

Formula Result:

•  • 
Example:

• 
Homework:

• Ques: The first four raw moments of a distribution are 2, 136, 320, and 40,000.
Find out coefficients of skewness and kurtosis.
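Bivariate Descriptive Statistics
Covariance & Correlation

Given two variables X = $\{x_1, \ldots, x_n\}$ & Y = $\{y_1, \ldots, y_n\}$ with means $\bar{x}$ and $\bar{y}$:

Cov(X,Y) $= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Pearson’s Correlation coefficient: $\rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y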
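Covariance and Pearson's correlation coefficient can be sketched as follows, assuming the covariance divides by n (an assumption, consistent with the variance convention used for single variables). The function names and the small data sets are illustrative.

```python
import math

def covariance(xs, ys):
    """Mean product of deviations from the two means (divides by n, an assumption)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]     # exact increasing linear relation
y_down = [-v for v in x]          # exact decreasing linear relation
print(covariance(x, y_up), pearson(x, y_up), pearson(x, y_down))
```

An exact linear relation gives a coefficient of +1 or -1 depending on the sign of the slope, whatever the units of the two variables.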


Spearman’s Rank Correlation Coefficient
• The Spearman correlation coefficient is defined as the
Pearson correlation coefficient between the rank variables.

• For a sample of size n, the n raw observations $x_i$, $y_i$ are converted to rank
variables R($x_i$) and R($y_i$).

• $r_s = \rho(R(X), R(Y)) = \frac{Cov(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$

• If all the n ranks are distinct integers, $r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

where $d_i$ = R($x_i$) - R($y_i$)
Example

In this example, the raw data in the table
below is used to calculate the
correlation between the IQ of a person
and the number of hours spent in front
of TV per week.
The Steps
• Sort the data by the 1st column ($x_i$).

• Create a new column R($x_i$) and assign the ranks 1, 2, …, n in it.

• Observe the $y_i$ and store the rank of each $y_i$ in another column R($y_i$).

• Create a fifth column $d_i$ = R($x_i$) – R($y_i$) and compute $d_i^2$.

• $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$
• $\sum d_i^2$ = 194

• n = 10

• $r_s = 1 - \frac{6 \times 194}{10(10^2 - 1)} = 1 - \frac{1164}{990} \approx -0.176$
Measures of Inequality
Setup

• Society has n individuals.

• Index i stands for a generic individual, i = 1, 2, . . . , n.

• An income distribution is a description of how much income yi is
received by each individual i: (y1, y2, . . . , yn).

• The aim is to compare the relative “inequality” of two income distributions.

Anonymity Principle
• It does not matter who earns the income. The principle is called anonymity because we care
about the ordering of incomes but not the identity of each earner.

• All that matters is the ranking from lowest to highest:

y1, y2, . . . , yn → y(1), y(2), · · · , y(n)
order statistics: y(1) ≤ y(2) ≤ y(3) ≤ · · · ≤ y(n)
Population Principle

• If we compare an income distribution over n people with another population
of 2n people in which the same income pattern is repeated twice, there should be
no difference in inequality between the two income distributions.

• Anonymity states that no information is lost by retaining only the sequence
of individual incomes (and not the identities of each).

• The population principle states that it doesn’t matter how large the population is:
we can convert everything to percentiles (bottom 1%, lowest 20%, top 25%).
Relative Income Principle
• Only the relative incomes should matter and the absolute levels of these incomes
should not.

• Thus if we transform one distribution by multiplying it by a positive constant (e.g., Y1 = λY0
with λ > 0), then inequality should be the same for the two distributions.

• Income levels have no meaning for inequality measurement. Absolute levels matter
for assessing economic development. We will see that the level matters for the
measurement of poverty.

• Roughly, think of poverty as a measure of location (level) and inequality as a measure
of dispersion.
Dalton’s Principle

• This is fundamental to the construction of inequality measures.

• Let (y1, y2, . . . , yn) be an income distribution and consider two incomes yi and yj with
yi ≤ yj.

• A transfer of income from individual i to individual j is called a regressive transfer.

• If the inequality is strict (yi < yj), the regressive transfer is from the poorer individual to the
richer individual.

• With the weak inequality (≤), the transfer is from the “not richer” to the “not poorer” individual.
• Write the inequality index as a function of the form I = I(y1, y2, . . . , yn),
with I defined over all conceivable distributions of income (y1,
y2, . . . , yn).

• Dalton principle: if one income distribution can be achieved from
another by constructing a sequence of regressive transfers, then the
former distribution must be deemed more unequal than the latter.

• Formally, I satisfies the Dalton principle if, for every income distribution (y1, y2, . . . , yn) with yi ≤ yj and every transfer
δ > 0, I(y1, . . . , yi, . . . , yj, . . . , yn) < I(y1, . . . , yi −
δ, . . . , yj + δ, . . . , yn).
Lorenz Curve
• The Lorenz curve is a simple diagrammatic way to depict the distribution of income.

• On the horizontal axis we list the cumulative percentage of the population arranged
in increasing order of income.

• Thus points on this axis refer to the poorest 20% of the population, the poorest
half, etc.

• On the vertical axis we measure the percentage of the national income accruing to
any particular fraction of the population thus arranged.

• The diagonal (45°) line represents an equal distribution of income.


Lorenz Curve for the United States

Source: U.S. Census Bureau, Historical Income Tables, Households, Table H-2.
Lorenz Curve Properties
• The slope of the Lorenz curve is the contribution of the person at that
point to the cumulative share of national income.

• Ordered from poorest to richest, the “marginal contribution” can never fall.

• Equivalently, the Lorenz curve can never get flatter as we move from left
to right.

• The overall distance between the 45° line and the Lorenz curve represents the
amount of inequality present in the society.
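The construction of the Lorenz curve and the "never gets flatter" property can be sketched for a small population. The five incomes are hypothetical values chosen for illustration, and the function name is not from the lecture.

```python
def lorenz_points(incomes):
    """Points (cumulative population share, cumulative income share), poorest first."""
    ordered = sorted(incomes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for k, income in enumerate(ordered, start=1):
        running += income
        points.append((k / n, running / total))
    return points

incomes = [40, 10, 100, 20, 30]          # hypothetical incomes, in any order
points = lorenz_points(incomes)

# Slope of each segment: the income share contributed per unit of population share
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
print(points)
print(slopes)
```

Because individuals are ordered from poorest to richest, each segment is at least as steep as the one before it, and every point lies on or below the 45° line.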
Lorenz Curves for Sweden, the United States,
and Bolivia

Sources: Statistics Sweden, online database, Disposable Income in Deciles 2011–2014; U.S. Census Bureau, Historical Income Tables, Households, Table H-2;
World Bank, World Development Indicators database.
Intersecting Lorenz Curve – Confusion !
The Gini Coefficient: A/(A+B)
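A sketch of the Gini coefficient as A/(A+B), assuming A is the area between the 45° line and the Lorenz curve and B is the area under the Lorenz curve (so A + B = 1/2); the figure that labels the areas did not survive, so this reading is an assumption. The incomes are hypothetical, and the checks at the end illustrate the anonymity, population, relative income and Dalton principles.

```python
def gini(incomes):
    """Gini coefficient A / (A + B) from the piecewise-linear Lorenz curve."""
    ordered = sorted(incomes)
    n = len(ordered)
    total = sum(ordered)
    area_b = 0.0          # area under the Lorenz curve, summed trapezoid by trapezoid
    previous_share = 0.0
    running = 0.0
    for income in ordered:
        running += income
        share = running / total
        area_b += (previous_share + share) / (2 * n)   # trapezoid of width 1/n
        previous_share = share
    area_a = 0.5 - area_b  # area between the diagonal and the Lorenz curve
    return area_a / (area_a + area_b)

base = [10, 20, 30, 40, 100]              # hypothetical incomes
g_base = gini(base)
g_shuffled = gini([100, 30, 10, 40, 20])  # anonymity: order of earners is irrelevant
g_doubled = gini(base + base)             # population principle: pattern repeated twice
g_scaled = gini([3 * y for y in base])    # relative income principle: all incomes tripled
g_regressive = gini([10, 15, 30, 45, 100])  # Dalton: 5 moved from the income of 20 to the income of 40
print(g_base, g_shuffled, g_doubled, g_scaled, g_regressive)
```

The first four values coincide, while the regressive transfer raises the coefficient, so on this example the Gini coefficient behaves as the four principles require.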
Data and Trends
Figure 10.4: Gini Coefficient in the United States,
1967-2010

Source: U.S. Census Bureau, Historical Income Tables, Households, Table H-2.
Figure 10.5: Income Share of the Top 10 Percent
and Top 1 percent in the United States, 1917-2012

Source: Emmanuel Saez, income inequality database updated to 2012, University of California, Berkeley, http://elsa.berkeley.edu/~saez/.
Figure 10.6: The Distribution of Wealth in the
United States, 2009

Source: Sylvia A. Allegretto, “The State of Working America’s Wealth, 2011,” Economic Policy Institute, EPI Briefing Paper #292, March 23, 2011.
