Introduction To Descriptive Statistics

The document discusses key measures and concepts in descriptive statistics such as measures of center, spread, skew, and kurtosis. It also covers the distinction between population and sample notation, how to calculate the mean, variance, standard deviation, and how to visualize univariate data through histograms, density plots, box plots, and other graphs.

Uploaded by

Sudhir Aggarwal

Introduction to Descriptive Statistics
17.871
Key measures
Describing data

          Moment                          Non-mean-based measure
Center    Mean                            Mode, median
Spread    Variance (standard deviation)   Range, interquartile range
Skew      Skewness                        --
Peaked    Kurtosis                        --
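To make the contrast between the two kinds of center concrete, here is a minimal Python sketch using only the standard library. The data values are invented for illustration and include one extreme value, so the mean is pulled upward while the median and mode stay put.

```python
import statistics

# Hypothetical sample, invented for illustration; 40 is an extreme value.
data = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(data)        # moment-based measure of center
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value
data_range = max(data) - min(data)  # non-moment measure of spread

print(mean, median, mode, data_range)
```

Dropping the value 40 moves the mean a long way but leaves the median and mode where they are, which is one reason to report both kinds of measure.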
Key distinction
Population vs. Sample Notation

Population    vs.    Sample
Greeks               Romans
μ, σ, β              s, b
Mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Variance, Standard Deviation
\[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]
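Variance, S.D. of a Sample
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Degrees of freedom: n - 1 (deviations are taken from the sample mean \bar{x})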
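The only difference between the population and sample formulas is the divisor. A minimal Python sketch, with a hypothetical helper name and made-up data, shows both side by side.

```python
import math

def variance(values, sample=False):
    """Mean squared deviation from the mean.

    Divides by n for a population and by n - 1 (the degrees of
    freedom) when sample=True.
    """
    n = len(values)
    center = sum(values) / n
    squared_deviations = sum((x - center) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

pop_var = variance(data)                 # divisor n
samp_var = variance(data, sample=True)   # divisor n - 1
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd, samp_var, samp_sd)
```

For these eight values the squared deviations sum to 32, so the two variances are 32/8 and 32/7; the gap shrinks as n grows.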
Binary data
\[ \bar{x} = \Pr(X = 1) = \text{proportion of time } x = 1 \]
\[ s_x^2 = \bar{x}(1 - \bar{x}) \;\Rightarrow\; s_x = \sqrt{\bar{x}(1 - \bar{x})} \]
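A quick numerical check of the binary shortcut, as a Python sketch on an invented 0/1 variable. One assumption worth flagging: the shortcut matches the variance exactly when the divisor is n; with the n - 1 divisor the result is larger by a factor of n/(n - 1).

```python
# Hypothetical 0/1 variable, invented for illustration.
x = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

n = len(x)
x_bar = sum(x) / n  # proportion of ones

# Variance computed the long way, with divisor n.
var_n = sum((xi - x_bar) ** 2 for xi in x) / n

# Shortcut for binary data.
shortcut = x_bar * (1 - x_bar)

# The same sum of squares with the n - 1 divisor.
var_n_minus_1 = var_n * n / (n - 1)

print(x_bar, var_n, shortcut, var_n_minus_1)
```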
Normal distribution example
- IQ
- SAT
- Height
- “No skew”
- “Zero skew”
- Symmetrical
- Mean = median = mode
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - \mu)^2 / 2\sigma^2} \]
[Figure: normal curve; x-axis: Value, y-axis: Frequency]
Skewness
Asymmetrical distribution
- Income
- Contribution to candidates
- Populations of countries
- “Residual vote” rates
- “Positive skew”
- “Right skew”
[Figure: right-skewed distribution; x-axis: Value, y-axis: Frequency]
Skewness
Asymmetrical distribution
- GPA of MIT students
- “Negative skew”
- “Left skew”
[Figure: left-skewed distribution; x-axis: Value, y-axis: Frequency]
Skewness
[Figure: distribution plot; x-axis: Value, y-axis: Frequency]
Kurtosis
- k > 3: leptokurtic
- k = 3: mesokurtic
- k < 3: platykurtic
[Figure: three curves of differing peakedness; x-axis: Value, y-axis: Frequency]
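Skewness and kurtosis can be computed from the central moments of the data. The following Python sketch assumes the plain moment ratios (third moment over the 1.5 power of the second, fourth moment over the square of the second), with every moment divided by n; statistical packages may apply small-sample adjustments, so their output can differ slightly. The function names and data are invented for illustration.

```python
def moment(values, k):
    """k-th central moment, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Third central moment scaled by the 1.5 power of the second."""
    return moment(values, 3) / moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by the square of the second."""
    return moment(values, 4) / moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric about 3
right_skewed = [1, 1, 2, 2, 3, 10] # hypothetical, long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric list has zero skewness and a kurtosis below 3 (platykurtic by the scale above), while the list with the long right tail has positive skewness and a kurtosis above 3.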
Normal distribution
- Skewness = 0
- Kurtosis = 3
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - \mu)^2 / 2\sigma^2} \]
More words about the normal curve
The z-score, or the “standardized score”
\[ z = \frac{x - \bar{x}}{\sigma_x} \]
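Standardizing subtracts the mean and divides by the standard deviation, so the resulting scores have mean 0 and standard deviation 1. A minimal Python sketch on invented scores follows; it assumes the divide-by-n standard deviation (`statistics.pstdev`), and `statistics.stdev` would be the n - 1 counterpart.

```python
import statistics

scores = [50, 60, 70, 80, 90]  # hypothetical raw scores

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # assumption: divide-by-n standard deviation

# z-score: distance from the mean in standard-deviation units.
z = [(x - mean) / sd for x in scores]

print(z)
print(statistics.mean(z), statistics.pstdev(z))
```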
Commands in Stata for univariate statistics
- summarize varname
- summarize varname, detail
- histogram varname, bin() start() width() density/fraction/frequency normal
- graph box varnames
- tabulate [NB: compare to table]
Example of Sophomore Test Scores
- High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896)
- totalscore = % of questions answered correctly minus penalty for guessing
- recodedtype = (1 = public school, 2 = religious private, 3 = non-sectarian private)
Explore totalscore some more

. table recodedtype, c(mean totalscore)

--------------------------
recodedty |
pe        | mean(totals~e)
----------+---------------
        1 |       .3729735
        2 |       .4475548
        3 |        .589883
--------------------------
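The table above is a mean computed separately within each category. As a rough Python analogue of that idea (not a reproduction of the Stata command), the sketch below groups invented (recodedtype, totalscore) pairs and averages each group; the records are hypothetical and are not the High School and Beyond data.

```python
from collections import defaultdict

# Hypothetical (recodedtype, totalscore) pairs, invented for illustration.
records = [(1, 0.30), (1, 0.50), (2, 0.40), (2, 0.50), (3, 0.60)]

# Collect the scores belonging to each school type.
scores_by_type = defaultdict(list)
for school_type, score in records:
    scores_by_type[school_type].append(score)

# Mean of totalscore within each type, in ascending order of type.
group_means = {
    school_type: sum(scores) / len(scores)
    for school_type, scores in sorted(scores_by_type.items())
}

for school_type, mean_score in group_means.items():
    print(school_type, mean_score)
```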
Graph totalscore
. hist totalscore
[Figure: histogram of totalscore; x-axis: totalscore (-.5 to 1), y-axis: Density (0 to 2)]
Divide into “bins” so that each bar represents 1% correct
. hist totalscore, width(.01)
(bin=124, start=-.24209334, width=.01)
[Figure: histogram of totalscore with bars of width .01; x-axis: totalscore (-.5 to 1), y-axis: Density (0 to 2)]
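On the density scale, each bar's height is its count divided by (n × width), so the bar areas sum to 1. The Python sketch below illustrates that scaling for fixed-width bins; it is not Stata's implementation, and the function name, the scores, and the start value are invented for illustration.

```python
import math

def density_histogram(values, start, width):
    """Return {bin_index: density} for fixed-width bins beginning at start.

    Bin i covers [start + i * width, start + (i + 1) * width).
    """
    n = len(values)
    counts = {}
    for x in values:
        index = math.floor((x - start) / width)
        counts[index] = counts.get(index, 0) + 1
    # Density = share of observations in the bin, per unit of width.
    return {i: c / (n * width) for i, c in sorted(counts.items())}

scores = [-0.055, 0.123, 0.127, 0.314, 0.318, 0.556]  # hypothetical scores
bin_width = 0.01
densities = density_histogram(scores, start=-0.1, width=bin_width)

# Bar area = height * width; the areas should add up to 1.
total_area = sum(height * bin_width for height in densities.values())
print(densities, total_area)
```

Narrower bins give taller bars for the same counts, which is why the density axis changes when the width does even though the data are unchanged.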
Add ticks at each 10% mark
. histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
(bin=124, start=-.24209334, width=.01)
[Figure: histogram of totalscore; x-axis: totalscore, labeled from -.2 to 1 in steps of .1; y-axis: Density (0 to 2)]
Superimpose the normal curve
(with the same mean and s.d. as the empirical distribution)

. histogram totalscore, width(.01) xlabel(-.2 (.1) 1) normal
(bin=124, start=-.24209334, width=.01)
[Figure: histogram of totalscore with a normal curve overlaid; x-axis: totalscore, labeled from -.2 to 1 in steps of .1; y-axis: Density (0 to 2)]
Histograms by category
. histogram totalscore, width(.01) xlabel(-.2 (.1) 1) by(recodedtype)
(bin=124, start=-.24209334, width=.01)
[Figure: three histograms of totalscore, one panel per school type: 1 (Public), 2 (Religious private), 3 (Nonsectarian private); x-axis: totalscore, labeled from -.2 to 1 in steps of .1; y-axis: Density (0 to 3). Graphs by recodedtype]
Main issues with histograms
- Proper level of aggregation
- Non-regular data categories
A note about histograms with unnatural categories
From the Current Population Survey (2000), Voter and Registration Survey

How long (have you/has name) lived at this address?

-9 No Response
-3 Refused
-2 Don't know
-1 Not in universe
1 Less than 1 month
2 1-6 months
3 7-11 months
4 1-2 years
5 3-4 years
6 5 years or longer
Solution, Step 1
Map artificial category onto “natural” midpoint (in years)
-9  No Response        -> missing
-3  Refused            -> missing
-2  Don't know         -> missing
-1  Not in universe    -> missing
 1  Less than 1 month  -> 1/24 = 0.042
 2  1-6 months         -> 3.5/12 = 0.29
 3  7-11 months        -> 9/12 = 0.75
 4  1-2 years          -> 1.5
 5  3-4 years          -> 3.5
 6  5 years or longer  -> 10 (arbitrary)
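The mapping above can be written as a lookup table. Here is a minimal Python sketch, in which the names `MIDPOINT_YEARS` and `recode` and the sample responses are invented for illustration; negative codes have no entry and therefore come back as missing (`None`).

```python
# Midpoint, in years, assigned to each substantive response code.
MIDPOINT_YEARS = {
    1: 1 / 24,    # less than 1 month
    2: 3.5 / 12,  # 1-6 months
    3: 9 / 12,    # 7-11 months
    4: 1.5,       # 1-2 years
    5: 3.5,       # 3-4 years
    6: 10,        # 5 years or longer (arbitrary value)
}

def recode(code):
    """Return the midpoint in years, or None for the missing codes."""
    return MIDPOINT_YEARS.get(code)

responses = [-9, 1, 2, 6, -1, 4]  # hypothetical raw survey codes
longevity = [recode(code) for code in responses]
print(longevity)
```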
Graph of recoded data
. histogram longevity, fraction
[Figure: histogram of longevity; x-axis: longevity (0 to 10), y-axis: Fraction (0 to .557134)]
Density plot of data
Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for the height h, where area = width × height: .557 = 11h => h = .051
[Figure: density plot of longevity; x-axis: longevity (0 to 15)]
Density plot template

Category   Fraction   X-min   X-max   X-length   Height (density)
< 1 mo.    .0156      0       1/12    .083       .19*
1-6 mo.    .0909      1/12    1/2     .417       .22
7-11 mo.   .0430      1/2     1       .500       .09
1-2 yr.    .1529      1       2       1          .15
3-4 yr.    .1404      2       4       2          .07
5+ yr.     .5571      4       15      11         .05

* = .0156/.083
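Each height in the template is the category's fraction divided by the length of its interval. The Python sketch below recomputes the heights from the fractions and interval endpoints listed in the table; the variable names are invented for illustration.

```python
# (category, fraction, x_min, x_max), with x measured in years.
categories = [
    ("< 1 mo.", 0.0156, 0, 1 / 12),
    ("1-6 mo.", 0.0909, 1 / 12, 1 / 2),
    ("7-11 mo.", 0.0430, 1 / 2, 1),
    ("1-2 yr.", 0.1529, 1, 2),
    ("3-4 yr.", 0.1404, 2, 4),
    ("5+ yr.", 0.5571, 4, 15),
]

# Height (density) = fraction / interval length, so area = fraction.
heights = {
    label: fraction / (x_max - x_min)
    for label, fraction, x_min, x_max in categories
}

for label, height in heights.items():
    print(f"{label:<9}{height:.2f}")
```

Because every bar's area equals its fraction, the unequal interval widths no longer distort the picture: the tall raw bar for 5+ years becomes a low, wide one.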
Draw the previous graph with a box plot
. graph box totalscore
[Figure: box plot of totalscore; y-axis from -.5 to 1; annotations: Upper quartile, Median, Lower quartile, Inter-quartile range, 1.5 x IQR]
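The quantities annotated on the box plot can be computed directly. The Python sketch below is an illustration on invented data: it uses `statistics.quantiles` with its default method, and quartile conventions differ across packages, so other software may report slightly different quartiles for the same values. The function name is hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, 1.5 x IQR fences, and points outside the fences."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical, one extreme value
summary = box_plot_summary(data)
print(summary)
```

Points beyond the fences are the ones a box plot typically draws as individual markers rather than as part of the whiskers.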
Draw the box plots for the different types of schools
. graph box totalscore, by(recodedtype)
[Figure: box plots of totalscore in three panels (1, 2, 3); y-axis from -.5 to 1. Graphs by recodedtype]
Draw the box plots for the different types of schools using “over” option
. graph box totalscore, over(recodedtype)
[Figure: box plots of totalscore for categories 1, 2, and 3 side by side in one panel; y-axis from -.5 to 1]
Three words about pie charts:
don’t use them
So, what’s wrong with them?
- For non-time-series data, it is hard to compare groups; the eye is very bad at judging the relative size of circle slices
- For time-series data, it is hard to grasp cross-time comparisons
Some words about graphical presentation
- Aspects of graphical integrity (following Edward Tufte, The Visual Display of Quantitative Information)
  - Main point should be readily apparent
  - Show as much data as possible
  - Write clear labels on the graph
  - Show data variation, not design variation
