Introduction To Descriptive Statistics
17.871
Key measures
Describing data
Skew: skewness
Peakedness: kurtosis
Key distinction
Population vs. Sample Notation
Mean
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Variance, Standard Deviation
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
Variance, S.D. of a Sample
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$$
Degrees of freedom: the divisor n - 1
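A minimal Python sketch of the two variance formulas, to make the n versus n - 1 divisor concrete. The function names and the `scores` data are made up for illustration.

```python
import math

def population_variance(xs):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def sample_variance(xs):
    """Same sum of squares, divided by n - 1 (the degrees of freedom)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

pop_var = population_variance(scores)   # 32 / 8
samp_var = sample_variance(scores)      # 32 / 7
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd, samp_var, samp_sd)
```

The sample version is always slightly larger than the population version for the same data, and the gap shrinks as n grows.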
Binary data
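For a variable coded 0/1, the mean is the proportion of ones, and the population variance reduces to p(1 - p). The sketch below checks this on a made-up list; `votes` is a hypothetical variable, not one from the course data.

```python
votes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical 0/1 data

n = len(votes)
p = sum(votes) / n  # mean of a 0/1 variable = proportion of ones

# Population variance computed the long way (divide by n)
long_way = sum((v - p) ** 2 for v in votes) / n

# Shortcut that holds for 0/1 data
shortcut = p * (1 - p)

print(p, long_way, shortcut)
```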
“No skew”
“Zero skew”
Symmetrical
Mean = median = mode
[Figure: symmetric distribution; x-axis: Value]
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
Skewness
Asymmetrical distribution
Income
Contributions to candidates
Populations of countries
“Residual vote” rates
[Figure: asymmetric distribution; y-axis: Frequency]
Skewness
“Negative skew”
“Left skew”
[Figure: left-skewed distribution; y-axis: Frequency, x-axis: Value]
Kurtosis
k > 3: leptokurtic
k = 3: mesokurtic
k < 3: platykurtic
[Figure: distributions of differing peakedness; y-axis: Frequency, x-axis: Value]
Normal distribution
Skewness = 0
Kurtosis = 3
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
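Skewness and kurtosis can be computed from central moments. The sketch below assumes the moment-ratio definitions (third moment over variance to the 3/2 power, fourth moment over variance squared), which put the normal distribution at skewness 0 and kurtosis 3; other software may apply small-sample corrections. The data lists are made up.

```python
def central_moment(xs, k):
    """Average of (x - mean)^k, dividing by n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs):
    """Third central moment scaled by variance^(3/2); 0 if symmetric."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Fourth central moment scaled by variance^2; 3 for a normal curve."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

symmetric = [1, 2, 3, 4, 5]      # hypothetical, mirror image around 3
right_tail = [1, 1, 1, 2, 10]    # hypothetical, one large value

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tail))
```

The flat, symmetric list has zero skewness and kurtosis below 3 (platykurtic); the list with one large value has positive skewness.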
More words about the normal curve
The z-score, or the “standardized score”
$$z = \frac{x - \bar{x}}{\sigma_x}$$
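A short sketch of standardizing a variable: subtract the mean and divide by the standard deviation. It assumes the population form of the standard deviation (divide by n); using the sample form would change the scores slightly. The data are made up.

```python
import math

def z_scores(xs):
    """Return (x - mean) / sd for each x, with sd computed dividing by n."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5, sd 2
z = z_scores(scores)
print(z)
```

By construction the standardized scores have mean 0 and standard deviation 1, so a z of 2 reads as "two standard deviations above the mean" regardless of the original units.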
Commands in Stata for univariate statistics
summarize varname
summarize varname, detail
histogram varname, bin() start() width() density/fraction/frequency normal
graph box varnames
tabulate [NB: compare to table]
Example of Sophomore Test Scores
High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896)
--------------------------
recodedty |
pe | mean(totals~e)
----------+---------------
1 | .3729735
2 | .4475548
3 | .589883
--------------------------
Graph totalscore
. hist totalscore
[Figure: histogram of totalscore; y-axis: Density (0 to 2), x-axis: totalscore (-.5 to 1)]
Divide into “bins” so that each bar represents 1% correct
. hist totalscore, width(.01)
(bin=124, start=-.24209334, width=.01)
[Figure: histogram of totalscore with bins of width .01; y-axis: Density (0 to 2), x-axis: totalscore (-.5 to 1)]
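What a fixed bin width does can be sketched in a few lines of Python: each value is assigned to a bin by how many widths it lies above the starting point, and a bar's density is its share of the observations divided by the width. This is an illustrative sketch, not Stata's implementation; the function names and the four scores are made up.

```python
import math

def bin_counts(xs, start, width):
    """Count observations per bin; bin k covers [start + k*width, start + (k+1)*width)."""
    counts = {}
    for x in xs:
        k = math.floor((x - start) / width)
        counts[k] = counts.get(k, 0) + 1
    return counts

def bin_densities(xs, start, width):
    """Density height of each bar: (count / n) / width, so bar areas sum to 1."""
    n = len(xs)
    return {k: c / n / width for k, c in bin_counts(xs, start, width).items()}

scores = [0.305, 0.307, 0.312, 0.555]  # hypothetical proportions correct
counts = bin_counts(scores, start=0.0, width=0.01)
densities = bin_densities(scores, start=0.0, width=0.01)
print(counts)
print(densities)
```

With a width of .01, two scores that differ by less than one percentage point can still land in different bars if they straddle a bin edge.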
Add ticks at each 10% mark
histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
(bin=124, start=-.24209334, width=.01)
[Figure: histogram of totalscore; y-axis: Density (0 to 2), x-axis: totalscore (-.2 to 1, ticks every .1)]
Superimpose the normal curve
(with the same mean and s.d. as the empirical distribution)
[Figure: histogram of totalscore with a normal curve superimposed; x-axis: totalscore (-.2 to 1, ticks every .1)]
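The superimposed curve is the normal density evaluated with the data's own mean and standard deviation. A minimal sketch, assuming made-up scores and a grid matching the 10% tick marks; `normal_pdf` is a hypothetical helper name.

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

scores = [0.2, 0.4, 0.4, 0.5, 0.6, 0.9]  # hypothetical proportions correct
mu = statistics.mean(scores)
sigma = statistics.stdev(scores)          # sample s.d. (divides by n - 1)

# Evaluate the curve at each 10% mark from -.2 to 1
grid = [i / 10 for i in range(-2, 11)]
curve = [(x, normal_pdf(x, mu, sigma)) for x in grid]
peak_x = max(curve, key=lambda point: point[1])[0]
print(peak_x)
```

Plotting `curve` over a density-scaled histogram shows where the empirical distribution departs from normality, for example a pile-up at one end or a heavier tail.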
Histograms by category
. histogram totalscore, width(.01) xlabel(-.2 (.1) 1) by(recodedtype)
(bin=124, start=-.24209334, width=.01)
[Figure: histograms of totalscore in panels 1, 2, and 3, one per value of recodedtype; an annotation reads “Nonsectarian private”; x-axis: totalscore (-.2 to 1, ticks every .1). Graphs by recodedtype]
Main issues with histograms
Proper level of aggregation
Non-regular data categories
A note about histograms with unnatural categories
From the Current Population Survey (2000), Voter and Registration Survey
-9 No Response
-3 Refused
-2 Don't know
-1 Not in universe
1 Less than 1 month
2 1-6 months
3 7-11 months
4 1-2 years
5 3-4 years
6 5 years or longer
Solution, Step 1
Map artificial category onto “natural” midpoint
Code  Category            Recoded value (years)
-9    No Response         missing
-3    Refused             missing
-2    Don't know          missing
-1    Not in universe     missing
1     Less than 1 month   1/24 = 0.042
2     1-6 months          3.5/12 = 0.29
3     7-11 months         9/12 = 0.75
4     1-2 years           1.5
5     3-4 years           3.5
6     5 years or longer   10 (arbitrary)
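The recoding step can be sketched as a lookup table from survey code to midpoint in years. The names `MIDPOINT_YEARS` and `recode` are made up for illustration; negative codes are treated as missing and returned as `None`.

```python
# Survey code -> midpoint of the category, in years (values from the table above)
MIDPOINT_YEARS = {
    1: 1 / 24,    # less than 1 month: half a month
    2: 3.5 / 12,  # 1-6 months
    3: 9 / 12,    # 7-11 months
    4: 1.5,       # 1-2 years
    5: 3.5,       # 3-4 years
    6: 10.0,      # 5 years or longer (arbitrary choice)
}

def recode(code):
    """Return the midpoint in years, or None for the missing-value codes."""
    return MIDPOINT_YEARS.get(code)

raw = [-9, 1, 2, 6, -1, 4]  # hypothetical responses
longevity = [recode(c) for c in raw]
print(longevity)
```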
Graph of recoded data
histogram longevity, fraction
[Figure: histogram of longevity; y-axis: Fraction (0 to .557134), x-axis: longevity (0 to 10)]
Density plot of data
Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for the height h: area = width x height (a = w h), so .557 = 11h, giving h = .051
[Figure: density plot of longevity; x-axis: longevity (0 to 15)]
Density plot template
Category   Fraction   X-min   X-max   X-length   Height (density)
< 1 mo.    .0156      0       1/12    .083       .19*
* = .0156/.083
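The template amounts to one division per category: the bar's height is its fraction of the data divided by its width on the x-axis, so that area rather than height represents the share of cases. A minimal sketch using only the two bars whose numbers appear in the slides; `density_height` is a hypothetical helper name.

```python
def density_height(fraction, x_min, x_max):
    """Height of a bar whose area (width x height) equals the category's fraction."""
    return fraction / (x_max - x_min)

# "< 1 mo." row: fraction .0156 spread over 0 to 1/12 of a year
first_bar = density_height(0.0156, 0, 1 / 12)

# Last bar: fraction .557 spread over an (arbitrary) width of 11 years;
# the endpoints 0 and 11 are placeholders that only fix the width
last_bar = density_height(0.557, 0, 11)

print(round(first_bar, 2), round(last_bar, 3))
```

Because the width of the open-ended top category is arbitrary, so is the height of its bar; only its area is pinned down by the data.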
Draw the previous graph with a box plot
. graph box totalscore
[Figure: box plot of totalscore; y-axis from -.5 to 1. Labels mark the upper quartile, median, and lower quartile; brackets mark the inter-quartile range (IQR) and 1.5 x IQR]
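The quantities a box plot draws can be computed directly. This sketch assumes one common quartile rule (Python's `statistics.quantiles` with `method="inclusive"`); Stata and other packages may interpolate quartiles differently, so values can differ slightly. The data are made up.

```python
import statistics

def box_plot_summary(xs):
    """Quartiles, IQR, 1.5 x IQR fences, whisker ends, and outside values."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme data points inside the fences
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data with one extreme value
summary = box_plot_summary(data)
print(summary)
```

Points beyond 1.5 x IQR from the box are the ones a box plot draws individually, which is why the plot is useful for spotting extreme values at a glance.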
Draw the box plots for the different types of schools
. graph box totalscore, by(recodedtype)
[Figure: box plots of totalscore in panels 1, 2, and 3, one per value of recodedtype; y-axis from -.5 to 1. Graphs by recodedtype]
Draw the box plots for the different types of schools using the “over” option
. graph box totalscore, over(recodedtype)
[Figure: box plots of totalscore side by side in one panel for recodedtype 1, 2, and 3; y-axis from -.5 to 1]
Three words about pie charts: don’t use them
So, what’s wrong with them?
For non-time-series data, it is hard to compare groups; the eye is very bad at judging the relative size of circle slices
For time-series data, it is hard to grasp cross-time comparisons
Some words about graphical presentation
Aspects of graphical integrity (following Edward Tufte, The Visual Display of Quantitative Information)
Main point should be readily apparent
Show as much data as possible
Write clear labels on the graph
Show data variation, not design variation