Lecture 1 Descriptive Statistics
ADVANCED STATISTICAL METHODS
DESCRIPTIVE STATISTICS
POPULATION VS. SAMPLE
Population
Consists of all possible values of a variable.
Sample
A part of a population.
What is a sample?
• It is randomly drawn from a large population.
DESCRIPTIVE STATISTICS - EXAMPLE
An Illustration:
Below are IQ test scores for 13 students in each of Class A and Class B. Which
group is smarter?
Class A        Class B
102   115      127   162
128   109      131   103
131    89       96   111
 98   106       80   109
140   119       93    87
 93    97      120   105
110            109
Each individual may be different. If you try to understand a group by remembering the qualities of
each member, you become overwhelmed and fail to understand the group.
DESCRIPTIVE STATISTICS
Mean IQ: Class A = 110.54, Class B = 110.23
• Tables
    • Frequency Distributions
    • Relative Frequency Distributions
• Graphs
    • Bar Chart or Histogram
    • Stem and Leaf Plot
    • Frequency Polygon
[Figures: grouped relative frequency distribution; stem-and-leaf plot (columns: Stem, Leaf)]
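As an illustration of a grouped relative frequency distribution, here is a minimal sketch using the Class A scores from the example. The bin width of 20 and the helper name `grouped_relative_frequency` are assumptions chosen for this sketch; they do not come from the slides.

```python
from collections import Counter

# Class A IQ scores from the example above.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

BIN_WIDTH = 20  # assumed bin width, chosen only for illustration


def grouped_relative_frequency(values, width=BIN_WIDTH):
    """Map each bin's lower bound to (count, relative frequency)."""
    counts = Counter((v // width) * width for v in values)
    n = len(values)
    return {low: (counts[low], counts[low] / n) for low in sorted(counts)}


table = grouped_relative_frequency(class_a)
for low, (count, rel) in table.items():
    print(f"{low}-{low + BIN_WIDTH - 1}: count={count}, relative frequency={rel:.3f}")
```

The relative frequencies always sum to 1, which is what makes them comparable across groups of different sizes.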
2. Numeric Summarizing Data:
Mean
[Figure: the mean as the balance point of a scale (marked 110 cm). Two values lie 17 and 4 units below the mean, and one value lies 21 units above it.]
The scale is balanced because 17 + 4 on the left = 21 on the right.
Mean
• Means can be badly affected by outliers (data points with extreme values
unlike the rest)
• Outliers can make the mean a bad measure of central tendency or common
experience
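The sketch below computes the two class means from the example data, checks that the deviations from the mean balance out (sum to zero), and shows how one extreme value pulls the mean. The added score of 250 is a hypothetical outlier invented for this illustration.

```python
from statistics import mean

# IQ scores from the example above.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

mean_a = mean(class_a)
mean_b = mean(class_b)
print(f"{mean_a:.2f} {mean_b:.2f}")  # 110.54 110.23

# Deviations below the mean balance deviations above it.
deviations = [y - mean_a for y in class_a]
print(abs(sum(deviations)) < 1e-9)  # True

# Hypothetical outlier (not in the original data): one score of 250.
with_outlier = class_a + [250]
mean_with_outlier = mean(with_outlier)
print(f"{mean_with_outlier:.2f}")  # 120.50
```

A single added value moves the mean by about 10 points, even though the other 13 scores are unchanged.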
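Median
The median is the middle value of the sorted data. If n is odd, it is the single
middle value; if n is even, median = the average of the two middle values.
Class A, sorted: 89 93 97 98 102 106 109 110 115 119 128 131 140
Median = 109
(six cases on the left, six on the right)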
Median
If the first student were to drop out of Class A, there would be a new median
(even n):
Median = 109.5
(six cases on the left, six cases on the right)
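The odd and even cases can be sketched as follows. The helper `median_by_hand` is a hypothetical name used here to spell out the rule; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]


def median_by_hand(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


median_a = median_by_hand(class_a)
print(median_a)  # 109

# The first student (score 102) drops out, leaving an even n of 12.
without_first = class_a[1:]
median_without_first = median_by_hand(without_first)
print(median_without_first)  # 109.5

# Cross-check against the standard library.
print(median(class_a) == median_a, median(without_first) == median_without_first)  # True True
```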
Median
[Figure: distribution for all Americans, with Bill Gates marked as an outlier]
Median
• If the recorded values for a variable form a symmetric distribution, the median
and mean are identical.
• In skewed data, the mean lies further toward the skew than the median.
[Figure: a symmetric distribution, where the mean and median coincide, and a skewed distribution, where the mean and median are separated]
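A small numeric sketch of the two bullets above. Both data sets are hypothetical values made up for this illustration: the first is symmetric, and the second replaces the largest value with an extreme one to create a right skew.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration only.
symmetric = [20, 25, 30, 35, 40, 45, 50]
right_skewed = [20, 25, 30, 35, 40, 45, 500]

# Symmetric data: mean and median coincide.
print(f"{mean(symmetric):.2f} {median(symmetric):.2f}")  # 35.00 35.00

# Right-skewed data: the mean is pulled toward the long tail, the median is not.
print(f"{mean(right_skewed):.2f} {median(right_skewed):.2f}")  # 99.29 35.00
```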
Mode
Range
The spread, or distance, between the lowest and highest values of a variable.
To get the range for a variable, subtract its lowest value from its highest
value.
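Applied to the two classes in the example, the range can be computed as in this short sketch. The function name `value_range` is a hypothetical helper chosen for illustration.

```python
# IQ scores from the example above.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]


def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


range_a = value_range(class_a)
range_b = value_range(class_b)
print(range_a)  # 51  (140 - 89)
print(range_b)  # 82  (162 - 80)
```

The two classes have nearly the same mean, but Class B is spread over a much wider range.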
Variance
Mean of Class A = 110.54
Example: Deviations of IQ for Class A (deviation = Yi - mean)
 i    Yi    Deviation
 1   102      -8.54
 2   128      17.46
 3   131      20.46
 4    98     -12.54
 5   140      29.46
 6    93     -17.54
 7   110      -0.54
 8   115       4.46
 9   109      -1.54
10    89     -21.54
11   106      -4.54
12   119       8.46
13    97     -13.54
Variance
If you add all the squared deviations together, you get what we call the
“Sum of Squares” (SS).
The formula for the sum of squares (SS):
$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
Example: Sum of squares of IQ for Class A
 i    Yi    Deviation    Deviation squared
 1   102      -8.54           72.9316
 2   128      17.46          304.8516
 3   131      20.46          418.6116
 4    98     -12.54          157.2516
 5   140      29.46          867.8916
 6    93     -17.54          307.6516
 7   110      -0.54            0.2916
 8   115       4.46           19.8916
 9   109      -1.54            2.3716
10    89     -21.54          463.9716
11   106      -4.54           20.6116
12   119       8.46           71.5716
13    97     -13.54          183.3316
Sum of squares (SS)         2891.2308
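The table can be reproduced with a few lines of code. This is a minimal sketch; it uses the unrounded mean, so the individual squared deviations differ from the table (which rounds the mean to 110.54) in the later decimals, while the total agrees to two decimals.

```python
# Class A IQ scores, in the same order as the table.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

mean_a = sum(class_a) / len(class_a)
deviations = [y - mean_a for y in class_a]     # step 1: deviation
squared = [d ** 2 for d in deviations]         # step 2: deviation squared
ss = sum(squared)                              # step 3: sum of squares

for i, (y, d, d2) in enumerate(zip(class_a, deviations, squared), start=1):
    print(f"{i:>2} {y:>5} {d:>9.2f} {d2:>12.4f}")
print(f"SS = {ss:.2f}")  # SS = 2891.23
```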
Variance
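In general, the variance is the sum of squares divided by the degrees of freedom:
$$s^2 = \frac{SS}{n - 1} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$
Degrees of freedom = sample size - 1 (that is, n - 1)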
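A short sketch of the variance calculation for the Class A scores, dividing the sum of squares by n - 1. The standard library's `statistics.variance` (which also uses n - 1) is included only as a cross-check.

```python
from math import isclose
from statistics import variance

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

n = len(class_a)
mean_a = sum(class_a) / n
ss = sum((y - mean_a) ** 2 for y in class_a)   # sum of squares
degrees_of_freedom = n - 1                     # 13 - 1 = 12

sample_variance = ss / degrees_of_freedom
print(f"{sample_variance:.2f}")  # 240.94

# Cross-check against the standard library's sample variance.
print(isclose(sample_variance, variance(class_a)))  # True
```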
Standard Deviation
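The standard deviation is the square root of the variance:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
In other words, the standard deviation is roughly the typical distance of an
observation from its mean.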
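The relationship between variance and standard deviation can be sketched as follows for the 13 Class A scores listed in the example. This assumes the sample (n - 1) definition of the variance.

```python
from math import isclose, sqrt
from statistics import stdev, variance

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

# Standard deviation = square root of the sample variance.
sd_a = sqrt(variance(class_a))
print(f"{sd_a:.2f}")

# statistics.stdev performs the same calculation in one call.
print(isclose(sd_a, stdev(class_a)))  # True
```

Because the standard deviation is in the same units as the data (IQ points here), it is easier to interpret than the variance, which is in squared units.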
Standard Deviation
15.34
Standard Deviation
• s.d. = 0 only when all values are the same (only when you have a constant and
not a “variable”)
• If you were to “rescale” a variable, the s.d. would change by the same
factor: if we changed units so the mean equaled 30, the s.d. on the left
would be 10, and on the right, 5.
• Like the mean, the s.d. will be inflated by an outlier case value.
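The three properties above can be checked numerically. This is a sketch under stated assumptions: the constant list, the rescaling factor of 0.5, and the added score of 250 are hypothetical values chosen for illustration.

```python
from math import isclose
from statistics import stdev

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

# 1. A constant has no spread, so its s.d. is 0.
constant = [5, 5, 5, 5]  # hypothetical constant "variable"
sd_constant = stdev(constant)
print(sd_constant == 0)  # True

# 2. Rescaling every value by a factor rescales the s.d. by the same factor.
factor = 0.5  # hypothetical change of units
rescaled = [y * factor for y in class_a]
print(isclose(stdev(rescaled), factor * stdev(class_a)))  # True

# 3. An outlier inflates the s.d.
with_outlier = class_a + [250]  # hypothetical extreme score
print(stdev(with_outlier) > stdev(class_a))  # True
```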
Mean and Variance
Review
1. Deviation
2. Deviation squared
3. Sum of squares
4. Variance
5. Standard deviation
2. Numeric Summarizing Data:
A way to graphically portray almost all descriptive statistics at once is the box-plot
(shows the location and spread).
A box-plot shows:
• Upper and lower quartiles
• Mean
• Median
• Range
• Outliers (values more than 1.5 × IQR below the lower quartile or above the upper quartile)
Box-plots
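[Figure: box-plot of IQ (vertical axis from 80 to 160). Maximum = 162; upper quartile (75th percentile) = 123.5; mean = 110.5; median = 106.5; lower quartile (25th percentile) = 96.5; minimum = 82; IQR = 123.5 - 96.5 = 27. Whiskers extend above and below the box.]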
The max and min values of the data are 162 and 82, respectively.
• The lower limit for the bottom ‘whisker’:
= Value at 25th percentile - (1.5 * IQR)
= 96.5 - (1.5 * 27)
= 96.5 - 40.5
= 56 (the lowest value the bottom whisker can reach)
• The upper limit for the top ‘whisker’:
= Value at 75th percentile + (1.5 * IQR)
= 123.5 + (1.5 * 27)
= 123.5 + 40.5
= 164 (the highest value the top whisker can reach)
• The whisker limits are 56 and 164, so any value smaller than 56 or greater
than 164 is considered an outlier.
• The max value of our data is 162 (which is < 164), so there is no outlier at
the top of the dataset. The same goes for the minimum value (82 > 56).
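The whisker calculation above can be sketched in code. The function name `whisker_limits` is a hypothetical helper; the quartiles, minimum, and maximum are the summary values quoted above, since the raw data behind the box-plot are not listed.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k * IQR, Q3 + k * IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Summary values taken from the box-plot example above.
q1, q3 = 96.5, 123.5
data_min, data_max = 82, 162

lower, upper = whisker_limits(q1, q3)
print(lower, upper)  # 56.0 164.0

# A value is an outlier only if it falls outside the limits.
has_outlier = data_min < lower or data_max > upper
print(has_outlier)  # False
```

Checking only the minimum and maximum is enough here: if neither extreme falls outside the limits, no other value can.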
Distribution Shape and Box-plot