Tian Statistics Lesson 3 Descriptive Statistics
Topic 1: Various Cases of Data Processing (continued)
Example: Use SAS to roll up data to a higher level (the data set, named data1, is shown in
the following image). FAC is the unique ID; the goal is
to roll up the data to the BOR level, with value x equal to the sum of all unique
Gender Income
M 50000
F 30000
M 54000
M 37000
F 48000
F 55000
M 90000
F 67000
M 110000
M 40000
F 20000
F 80000
/* Roll up DATA1 from the FAC level to the BOR level */
PROC SQL;
  CREATE TABLE TEMP AS
  SELECT BOR,
         SUM(X) AS X /* total of X within each BOR */
  FROM DATA1
  GROUP BY BOR;
QUIT;
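The same roll-up can be sketched outside SAS. The Python sketch below groups rows by BOR and sums X, mirroring the GROUP BY query above. The rows of data1 are hypothetical, since the real data set appears only as an image, and the helper name roll_up is introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical rows of data1 as (FAC, BOR, X); the real values are not in the text.
data1 = [
    ("F1", "B1", 10),
    ("F2", "B1", 5),
    ("F3", "B2", 7),
    ("F4", "B2", 3),
    ("F5", "B3", 4),
]


def roll_up(rows):
    """Sum X within each BOR, like SELECT BOR, SUM(X) ... GROUP BY BOR."""
    totals = defaultdict(int)
    for _fac, bor, x in rows:
        totals[bor] += x
    return dict(totals)


temp = roll_up(data1)
print(temp)
```

Each BOR value ends up with exactly one row, which is what rolling up to a higher level means here.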
Overview of Statistics
Statistical Data Analysis:
• Descriptive statistics
  • Numerical
  • Graphical
• Inferential statistics
Populations and Samples:
A population is the set of all items of interest in a
statistical problem.
Tossing a coin (H, T) and rolling a die (1 to 6) gives 12 combined outcomes:
H1 H2 H3 H4 H5 H6 T1 T2 T3 T4 T5 T6
Let us throw the die 10 times. The outcomes are:
2 3 4 5 1 3 6 4 3 4
Tossing the coin 10 times gives:
H H T H T T H T H H
Variable: price
Random variables
Topic 2: Concepts of descriptive statistics, and calculation
method
Variables
Qualitative Quantitative
SAS output data
True or false:
Descriptive statistics involves arranging, summarizing and presenting a set of data so that
the meaningful essentials of the data can be extracted and interpreted easily.
correlation coefficient
Mean
Standard Error
Median
Mode
Standard Deviation
Variance
Kurtosis
Skewness
Range
Minimum
Maximum
Sum
Count
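As a rough illustration, most of the statistics listed above can be computed with Python's standard library. The sketch below uses the Income column of the Gender/Income table shown earlier in the lesson; treating those twelve values as a sample (dividing by n - 1) is an assumption, and skewness, kurtosis, mode and the correlation coefficient are left out to keep it short.

```python
import math
import statistics

# Income column from the Gender/Income table earlier in the lesson
income = [50000, 30000, 54000, 37000, 48000, 55000,
          90000, 67000, 110000, 40000, 20000, 80000]

n = len(income)
s = statistics.stdev(income)  # sample standard deviation (divides by n - 1)

summary = {
    "Count": n,
    "Sum": sum(income),
    "Mean": statistics.mean(income),
    "Median": statistics.median(income),
    "Minimum": min(income),
    "Maximum": max(income),
    "Range": max(income) - min(income),
    "Variance": statistics.variance(income),  # sample variance
    "Standard Deviation": s,
    "Standard Error": s / math.sqrt(n),       # standard error of the mean
}

for name, value in summary.items():
    print(f"{name}: {value}")
```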
The following box plot shows several summary statistics. A box plot is a
way of graphically depicting groups of numerical data through their
quartiles. Box plots may also have lines extending vertically from the boxes
indicating variability outside the upper and lower quartiles. Outliers may be plotted
as individual points.
How do we describe data? There are two ways: measures of central tendency and
measures of dispersion.
Population Mean (or average) – is calculated by finding the sum of all the
data values in the population and dividing it by the total number of values.
Sample Mean
$\bar{x} = (x_1 + \dots + x_n)/n$
Die: 1 2 3 4 5 6
Throw 5 times: 2 3 1 5 2
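To make the distinction concrete, the sketch below computes both means for the die example. Treating the six equally likely faces as the population is an assumption made for illustration; the five throws are the sample from the slide.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


die_faces = [1, 2, 3, 4, 5, 6]   # population: every face of the die
throws = [2, 3, 1, 5, 2]         # sample: five observed throws

population_mean = mean(die_faces)
sample_mean = mean(throws)

print("population mean:", population_mean)
print("sample mean:", sample_mean)
```

The sample mean (13/5) differs from the population mean (21/6) because it depends on which throws happened to be observed.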
Median – is the middle value in a set of data when the n observations are
arranged in order of magnitude.
When n is odd:
median = middle value
When n is even:
median = mean of the two middle values
7 3 9 4 6 median=?
Mode – is the value that appears most frequently in the
set of data
Data set: 28 60 26 32 30 26 29
Mode=? 26
Data set: 28 60 29 30 33
Mode=? No mode (DNE, does not exist)
• The mean and median can be used with numeric data. The mode can
be used with both numerical and nominal data.
• The median is preferred in cases where there are outliers, since the
median only considers the middle values.
Example: calculate the mean, median and mode for the following data:
8, 4, 9, 3, 5, 8, 6, 6, 7, 8 and 10.
Mean: (8+4+9+3+5+8+6+6+7+8+10)/11 = 74/11 ≈ 6.73.
Median: In a data set of 11, the median is the number in the sixth place of the
ordered data (3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10), which is 7.
Mode: The number 8 appears more often than any other number. The mode is 8.
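The three measures of central tendency can be checked with a short Python sketch. It uses the data sets from the slides; the helper modes_or_none is a name introduced here, and returning None when no value repeats is one convention for "no mode", not the only one.

```python
import statistics
from collections import Counter


def modes_or_none(values):
    """Return the most frequent value(s), or None if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return None
    return [v for v, c in counts.items() if c == top]


data = [8, 4, 9, 3, 5, 8, 6, 6, 7, 8, 10]
mean = statistics.mean(data)       # 74 / 11
median = statistics.median(data)   # sixth value of the sorted data
modes = modes_or_none(data)

print(round(mean, 2), median, modes)

# Smaller examples from the earlier slides
print(statistics.median([7, 3, 9, 4, 6]))
print(modes_or_none([28, 60, 26, 32, 30, 26, 29]))
print(modes_or_none([28, 60, 29, 30, 33]))
```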
Measures of Dispersion
Range Variance Standard deviation
Range – is the difference between the largest number and the smallest
number.
2 5 7 9 10 11 15 range=?
0 25 7 9 11 15 138 range=?
Variance – is a measure of the average squared distance that a set of data lies from its
mean. The higher the variance, the more spread out your data are.
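Example: Find the sample variance for the following set of numbers:
12, 15, 18, 20, 30
Step 1: mean=(12+15+18+20+30)/5=19
Step 2: deviations from the mean: -7 -4 -1 1 11
Step 3: squared deviations: 49 16 1 1 121
Step 4: sum=49+16+1+1+121=188
Step 5: n-1 = 5 – 1 = 4
Step 6: variance=188/4=47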
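The steps can be sketched in Python as follows. The data are assumed to be 12, 15, 18, 20, 30, the values consistent with a mean of 19 and the squared deviations 49, 16, 1, 1, 121 listed in Step 3.

```python
import math

# Assumed data: the values that reproduce mean 19 and squared deviations 49, 16, 1, 1, 121
data = [12, 15, 18, 20, 30]

n = len(data)
mean = sum(data) / n                          # Step 1: mean
deviations = [x - mean for x in data]         # Step 2: deviation of each value from the mean
squared = [d ** 2 for d in deviations]        # Step 3: squared deviations
sum_of_squares = sum(squared)                 # Step 4: sum of squared deviations
variance = sum_of_squares / (n - 1)           # Steps 5-6: divide by n - 1 (sample variance)
std = math.sqrt(variance)                     # standard deviation

print(mean, squared, sum_of_squares, variance, round(std, 2))
```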
Practice:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
Solution:
Va=2
Vb=18
Shortcut formula:
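A commonly used shortcut (computational) form is the sum of squares $\sum x^2 - (\sum x)^2/n$ divided by n - 1 for a sample, or by n for a population. Whether this is the exact form shown on the slide is an assumption, since the formula itself did not survive. The sketch below applies it to sets A and B; dividing by n reproduces Va=2 and Vb=18, while dividing by n - 1 gives the sample variances.

```python
import statistics


def shortcut_variance(values, sample=True):
    """Variance via sum(x^2) - (sum(x))^2 / n, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    sum_of_squares = sum_x2 - sum_x ** 2 / n
    return sum_of_squares / (n - 1 if sample else n)


a = [8, 9, 10, 11, 12]
b = [4, 7, 10, 13, 16]

for name, values in (("A", a), ("B", b)):
    print(name,
          "population:", shortcut_variance(values, sample=False),
          "sample:", shortcut_variance(values, sample=True))
```

Both sets have the same mean (10), but B is far more spread out, which the variance captures.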
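The unit attached to the variance is the square of the unit attached to the original observations. Statisticians
often want a measure of variability that is expressed in the same units as the original observations.
That measure is the standard deviation, the square root of the variance: s=√(188/4)=√47≈6.86.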
Coefficient of variation:
Defined as $CV = s/\bar{x}$ (standard deviation divided by the mean)
Mean=19 S=6.86
Then CV=.36
If mean=12 S=6.5
CV=S/mean=6.5/12=.54
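The two calculations above can be reproduced with a one-line function; the name coefficient_of_variation is introduced here for illustration.

```python
def coefficient_of_variation(std, mean):
    """CV = standard deviation divided by the mean (a unit-free measure of spread)."""
    return std / mean


cv_first = coefficient_of_variation(6.86, 19)
cv_second = coefficient_of_variation(6.5, 12)

print(round(cv_first, 2), round(cv_second, 2))
```

Although the second data set has the smaller standard deviation, its CV is larger, so it is more variable relative to its mean.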
The standard deviation is a measure of dispersion. According to
For the example:
Mean=19 s=6.86
2s=13.72
Mean-2s=19-13.72=5.28
Mean+2s=19+13.72=32.72
(5.28, 32.72)
In general, for any number k > 1, at least a proportion $1 - 1/k^2$ of the data lie
within k standard deviations to either side of the mean, that is, between
$\bar{x} - ks$ and $\bar{x} + ks$.
Example:
Solution:
To find k:
36.42-15.23=21.19
57.61-36.42=21.19
kS=21.19
8.15k=21.19
k=2.6
Percentage=1-(1/k^2)=1-.148=.852=85.2%
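The calculation can be sketched in Python. The mean 36.42, standard deviation 8.15 and interval (15.23, 57.61) are read off the arithmetic in the solution; the function name chebyshev_lower_bound is introduced here. The last lines check the earlier interval (5.28, 32.72) against the data 12, 15, 18, 20, 30, which are assumed from the variance example.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


mean, s = 36.42, 8.15
upper = 57.61

k = (upper - mean) / s                  # number of standard deviations to the endpoint
bound = chebyshev_lower_bound(k)
print(round(k, 1), round(bound, 3))

# Check with k = 2 on the variance example (mean 19, s = 6.86)
data = [12, 15, 18, 20, 30]
inside = [x for x in data if 19 - 2 * 6.86 < x < 19 + 2 * 6.86]
print(len(inside) / len(data), ">=", chebyshev_lower_bound(2))
```

The bound is a guaranteed minimum for any data set, so the observed share inside the interval is usually higher.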
Empirical Rule:
The z-score for a data value is the number of standard deviations that the data value is
away from the mean of the data set. The sample z-score is computed by using the
formula
$z = (x - \bar{x})/s$
where $\bar{x}$ and s are, respectively, the mean and standard deviation of the sample
data. A negative z-score indicates that a data value is smaller than the mean,
whereas a positive z-score indicates that a data value is larger than the mean.
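A minimal sketch of the z-score calculation, using the sample standard deviation. The data 12, 15, 18, 20, 30 are assumed from the variance example (mean 19, s ≈ 6.86), and the function name z_scores is introduced here.

```python
import statistics


def z_scores(values):
    """z = (x - mean) / s for every value, with s the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


data = [12, 15, 18, 20, 30]
zs = z_scores(data)

for x, z in zip(data, zs):
    print(x, round(z, 2))
```

The value 12 lies about one standard deviation below the mean, and 30 lies about 1.6 standard deviations above it.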
Descriptive measures that indicate the relative position of a data value are
measures of relative standing. If a data value has a large positive z-score, then
it is larger than most of the other data values in the data set; if a data value
has a large negative z-score, then it is smaller than most of the other data
values in the data set; and if a data value has a z-score near 0, then it is
located near the mean of the data set. We can use z-scores to compare the
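Percentile formula: R=(P/100)*(n+1), where P is the percentile, n is the number of
observations, and R is the position in the ordered data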
e.g.
R=(25/100)*(10+1)=.25*11=2.75
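The position R is usually not a whole number. The sketch below computes R and then interpolates linearly between the two neighbouring ordered values; the interpolation step is an assumption, since the slide shows only the position formula. The data 1 to 10 are used purely for illustration.

```python
def percentile_position(p, n):
    """Position R = (P/100) * (n + 1) of the P-th percentile in ordered data."""
    return p / 100 * (n + 1)


def percentile(values, p):
    """Value at position R, interpolating linearly between neighbouring values."""
    ordered = sorted(values)
    n = len(ordered)
    r = min(max(percentile_position(p, n), 1), n)  # keep R inside 1..n
    lower = int(r)                                  # whole part of the position
    fraction = r - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = list(range(1, 11))             # 1, 2, ..., 10
print(percentile_position(25, 10))    # position of the 25th percentile
print(percentile(data, 25))
print(percentile(data, 50))
```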
Quartiles definition:
The first quartile (Q1) is the median of the data lying at or below the
median of the entire data set.
The second quartile (Q2) is the median of the entire data set.
The third quartile (Q3) is the median of the data lying at or above the
median of the entire data set.
Example:
1, 2, 3, 4, 5, 6, 7, 8, 9
Q1=?
Q2=5
Q3=?
1,2,3,4,5,6,7,8,9,10
Q1=?
Q2=5.5
Q3=?
Practice:
25 41 27 32 43 66 35 31 15 5 34 26 32 38 16 30 38 30 20 21
Solution:
1. Arrange the data in increasing order, as shown below:
5 15 16 20 21 25 26 27 30 30 31 32 32 34 35 38 38 41 43 66
2. The median of the entire data set is at position
(20+1)/2 = 10.5, halfway between the 10th and 11th data values (30 and 31)
in the ordered list. Thus the median of the entire data
set is (30+31)/2 = 30.5.
3. The first quartile (Q1) is the median of the data lying at or below the median of
the entire dataset. We see that the data lying at or below the median of the entire
data set is
5 15 16 20 21 25 26 27 30 30
This data set has 10 pieces of data. Its median is at position (10+1)/2=5.5. Hence
Q1 = (21+25)/2 = 23.
4. The third quartile (Q3) is the median of the data lying at or above the median of
the entire dataset. We see that the data lying at or above the median of the entire
data set is
31 32 32 34 35 38 38 41 43 66
This data set has 10 pieces of data. Its median is at position (10+1)/2=5.5. Hence
Q3 = (35+38)/2 = 36.5.
5. We conclude that 25% of the viewing times are less than 23 hours,
25% are between 23 and 30.5 hours, 25% are between 30.5 and 36.5
hours, and 25% are greater than 36.5 hours.
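The quartile definition used in this solution (median of the data at or below, and at or above, the overall median) can be sketched in Python. The function name quartiles is introduced here; other software may use different quartile rules and return slightly different values.

```python
import statistics


def quartiles(values):
    """Q1, Q2, Q3 using medians of the data at or below / at or above the median."""
    ordered = sorted(values)
    q2 = statistics.median(ordered)
    lower = [x for x in ordered if x <= q2]   # data at or below the median
    upper = [x for x in ordered if x >= q2]   # data at or above the median
    return statistics.median(lower), q2, statistics.median(upper)


viewing_times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
                 34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

print(quartiles(viewing_times))
print(quartiles(range(1, 10)))    # 1, 2, ..., 9
print(quartiles(range(1, 11)))    # 1, 2, ..., 10
```

With an odd number of values the overall median belongs to both halves, which is what "at or below" and "at or above" imply.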
Inter-Quartile Range:
The interquartile range (IQR) is defined as the difference between the first and
third quartiles; that is,
IQR = Q3 – Q1.
Thus, roughly speaking the IQR gives the range of the middle 50% of the data.
Data values that lie between the inner and outer fences are considered possible
outliers; those that lie outside the outer fences are considered probable
outliers.
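The fence definitions did not survive on this slide, so the sketch below assumes the usual convention: inner fences at 1.5 × IQR and outer fences at 3 × IQR beyond the quartiles. It uses Q1 = 23 and Q3 = 36.5 from the viewing-times practice; the function names are introduced here for illustration.

```python
def fences(q1, q3):
    """Return the IQR plus (lower, upper) inner and outer fences (assumed 1.5 and 3 x IQR)."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return iqr, inner, outer


def classify(x, inner, outer):
    """Label a value by where it falls relative to the fences."""
    if x < outer[0] or x > outer[1]:
        return "probable outlier"
    if x < inner[0] or x > inner[1]:
        return "possible outlier"
    return "not an outlier"


iqr, inner, outer = fences(23, 36.5)
print(iqr, inner, outer)
print(66, classify(66, inner, outer))   # largest viewing time
print(5, classify(5, inner, outer))     # smallest viewing time
```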
[Figure: probability density function (pdf) of a normal N(0, σ²) population]
The five-number summary of a data set consists of the minimum, maximum,
and quartiles written in increasing order: Min, Q1, Q2, Q3, Max.
The five-number summary can be used to provide a graphical display (a box plot) of the center and variation of a
dataset.
Largest
Upper quartile
Median
Lower quartile
Smallest
Cases that have values more than 3 times the length of the box
below or above the box are extreme outliers.
/*Creating Box Plots */
Data bikerace;
Input Division $ NumberLaps @@;
Datalines;
Adult 44 Adult 33 Youth 33 Masters 38 Adult 40
Masters 32 Youth 32 Youth 38 Youth 33 Adult 24
Masters 33 Adult 44 Youth 35 Adult 49 Adult 38
Adult 39 Adult 42 Adult 32 Youth 42 Youth 70
Masters 33 Adult 33 Masters 32 Youth 37 Masters 40
;
Run;
Exercise 1: Sorting - Ordering
A data file called auto (shown in the following image) has a duplicate record for
the BMW. How would you sort the data file by Foreign and rep78 and also remove
the duplicate record?
Exercise 2:
Given data like the following:
Exercise 4:
The table below contains data on the ages of the two teams involved in game 1 of the
2010 National League Division Series. Is there a relationship between the ages of
(1) Determine the lower and upper quartiles of the ages for the Phillies. Then find
(2) Determine the lower and upper quartiles of the ages for the Reds. Then find
(3) Which team has the greater age range? Which has the greater IQR?
Homework: