Lecture 1 Descriptive Statistics

Uploaded by

haikal shariff

AGR 5201

ADVANCED STATISTICAL METHODS

Lecture 1
Descriptive Statistics
DESCRIPTIVE STATISTICS
POPULATION VS. SAMPLE

Population
Consists of all possible values of a variable

Sample
A part of a population
Sample

What is a sample?

• A sample is a part of the population
• It is drawn at random from a larger population
DESCRIPTIVE STATISTICS - EXAMPLE

An Illustration:
Below are IQ test scores for 13 students in each of Class A and Class B. Which group
is smarter?

Class A        Class B
102   115      127   162
128   109      131   103
131    89       96   111
 98   106       80   109
140   119       93    87
 93    97      120   105
110            109

Each individual may be different. If you try to understand a group by remembering the qualities of
each member, you become overwhelmed and fail to understand the group.
DESCRIPTIVE STATISTICS

Which group is smarter now?


Class A average IQ    Class B average IQ
110.54                110.23

They’re roughly the same!


With a summary descriptive statistic, it is much easier to
answer our question.
TYPES OF DESCRIPTIVE STATISTICS

1. Graphics: Organize Data

• Tables
• Graphs

2. Numeric: Summarize Data

• Central Tendency (measure of location)
• Variation (measure of spread)
1. Graphics: Organize Data

• Tables
  • Frequency Distributions
  • Relative Frequency Distributions

• Graphs
  • Bar Chart or Histogram
  • Stem and Leaf Plot
  • Frequency Polygon
Grouped Relative Frequency Distribution

Relative Frequency Distribution of IQ for Two Classes

IQ              Frequency   Percent   Cumulative percent
80 – 89             3         12.5          12.5
90 – 99             5         20.8          33.3
100 – 109           6         25.0          58.3
110 – 119           3         12.5          70.8
120 – 129           3         12.5          83.3
130 – 139           2          8.3          91.7
140 – 149           1          4.2          95.8
150 and over        1          4.2         100.0
Total              24        100.0
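
The percent and cumulative percent columns can be derived from the frequency column alone. The sketch below is one way to do that in Python; the variable names are illustrative, and the frequencies are copied from the table above. Cumulative percent is computed from the running count rather than by adding rounded percents, which avoids small rounding drift.

```python
# Class labels and frequencies copied from the table above.
classes = ["80-89", "90-99", "100-109", "110-119",
           "120-129", "130-139", "140-149", "150 and over"]
frequencies = [3, 5, 6, 3, 3, 2, 1, 1]

total = sum(frequencies)

# Relative frequency, expressed as a percent of all cases.
percents = [100 * f / total for f in frequencies]

# Cumulative percent: share of cases at or below each class.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(100 * running / total)

for label, f, p, c in zip(classes, frequencies, percents, cumulative):
    print(f"{label:<13}{f:>3}{p:>7.1f}{c:>7.1f}")
```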
Histogram
Bar Graph
Stem And Leaf Plot

Stem Leaf
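
A stem-and-leaf plot can be built by hand or with a few lines of code. The sketch below assumes the usual convention for scores of this size: the stem is the tens part of the score and the leaf is the units digit. The function name `stem_and_leaf` is hypothetical, and the data are the combined IQ scores of Classes A and B from the illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem; stem = value // 10, leaf = value % 10."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        leaves_by_stem[value // 10].append(value % 10)
    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}".rstrip())
    return rows

# Combined IQ scores for Classes A and B.
combined = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
            109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

rows = stem_and_leaf(combined)
for row in rows:
    print(row)
```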
2. Numeric Summarizing Data:

• Central Tendency (measure of location)


• Mean
• Median
• Mode

• Variation (measure of spread)


• Range
• Interquartile Range
• Variance
• Standard Deviation
Mean

Most commonly called the “average.”

Symbol: $\bar{Y}$, known as “Y bar”
To get the mean, add up the values for each case and divide by the total number of cases.
The formula for mean:
$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{Y_1 + Y_2 + \dots + Y_n}{n}$
Mean

Some symbolic conventions in this class:

• Y = your variable (could be X or Q or any other symbol)
• “bar” or line over symbol of your variable = mean of that variable
• $Y_1$ = first case’s value on variable Y
• “. . .” = ellipsis = continue sequentially
• $Y_n$ = last case’s value on variable Y
• n = number of cases in your sample
• Σ = Greek letter “sigma” = sum or add up what follows
• i = a typical case or each case in the sample (1 through n)
Mean

Class A        Class B
102   115      127   162
128   109      131   103
131    89       96   111
 98   106       80   109
140   119       93    87
 93    97      120   105
110            109
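
As a quick check, the two class means can be computed directly from the scores above. This is a minimal sketch in Python; the helper name `mean` and the list names are illustrative only.

```python
# IQ scores from the table above, read column by column within each class.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

def mean(values):
    """Add up the values for each case and divide by the number of cases."""
    return sum(values) / len(values)

mean_a = mean(class_a)
mean_b = mean(class_b)

print(round(mean_a, 2))  # 110.54
print(round(mean_b, 2))  # 110.23
```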
Mean

The mean is the “balance point.”


Each person’s score is like 1 kg placed at the score’s position on a see-saw. Below, on a
200 cm see-saw, the mean equals 110, the place on the see-saw where a fulcrum finds
balance:

1 kg at 93 cm     17 units below the fulcrum
1 kg at 106 cm     4 units below the fulcrum
1 kg at 131 cm    21 units above the fulcrum
Fulcrum at 110 cm (0 units)
The scale is balanced because…
17 + 4 on the left = 21 on the right
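
The see-saw picture can be checked numerically: place the fulcrum at the mean and the signed distances cancel. A small sketch, with illustrative variable names:

```python
# Positions (in cm) of the three 1 kg weights on the see-saw.
positions = [93, 106, 131]

# The fulcrum sits at the mean position.
fulcrum = sum(positions) / len(positions)

# Signed distance of each weight from the fulcrum (negative = left side).
distances = [p - fulcrum for p in positions]

print(fulcrum)         # 110.0
print(distances)       # [-17.0, -4.0, 21.0]
print(sum(distances))  # 0.0, so the see-saw balances
```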
Mean

• Means can be badly affected by outliers (data points with extreme values
unlike the rest)
• Outliers can make the mean a bad measure of central tendency or common
experience

[Figure: Income in the U.S., with “All Americans”, the mean, and Bill Gates (outlier) marked]
Median
The middle value when a variable’s values are ranked in order; the point that
divides a distribution into two equal halves.
When data are listed in order, the median is the point at which 50% of the cases
are above and 50% below it.
The 50th percentile.

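The formula for median (with the values ranked in order, $Y_{(1)} \le Y_{(2)} \le \dots \le Y_{(n)}$):

If n is odd, median = $Y_{((n+1)/2)}$, the middle value

If n is even, median = $\frac{Y_{(n/2)} + Y_{(n/2+1)}}{2}$, the average of the two middle values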
Median

IQ score in class A (13 students = odd n)


Use formula median for odd n = the (13 + 1)/2 = 7th value in ranked order

89 93 97 98 102 106 109 110 115 119 128 131 140

Median = 109
(six cases on the left, six on the right)
Median

If the first student (IQ 102) were to drop out of Class A, there would be a new median
(even n = 12):

89 93 97 98 106 109 110 115 119 128 131 140

Median = (109 + 110) / 2 = 109.5
(six cases on the left, six cases on the right)
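
Both cases, odd n and even n, can be handled by one short function. The sketch below is an illustrative implementation (the name `median` is a choice made here, not something from the lecture); it ranks the values, then takes the middle value or the average of the two middle values.

```python
def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Class A in its original order; the first student scored 102.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

median_all = median(class_a)             # odd n = 13
median_without_first = median(class_a[1:])  # even n = 12, first student dropped

print(median_all)            # 109
print(median_without_first)  # 109.5
```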
Median

• The median is largely unaffected by outliers, making it a better measure of central
tendency than the mean when data are skewed; it better describes the “typical person.”

[Figure: income distribution, with “All Americans” and Bill Gates (outlier) marked]
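
To see the effect in numbers, the sketch below uses made-up incomes (in thousands); the values are hypothetical and chosen only to illustrate the point. Adding one extreme value moves the mean a great deal but barely moves the median.

```python
import statistics

# Hypothetical incomes in thousands; not real data.
incomes = [30, 35, 40, 45, 50]
with_outlier = incomes + [10000]  # one extreme case added

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 40 40
print(mean_after, median_after)    # 1700 42.5
```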
Median

• If the recorded values for a variable form a symmetric distribution, the median
and mean are identical.
• In skewed data, the mean lies further toward the skew than the median.

[Figure: two distributions. Symmetric: the median and mean coincide. Skewed: the mean lies further toward the skew than the median.]
Mode

The most common data point is called the mode.


The combined IQ scores for Classes A & B:
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120
127 128 131 131 140 162
The mode is 109, which occurs three times!!

BTW, It is possible to have more than one mode!
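
A short sketch of finding the mode in Python, using `statistics.multimode` from the standard library (Python 3.8 or later), which returns every value tied for most frequent. The second list is hypothetical and only shows what a two-mode result looks like.

```python
from statistics import multimode

# Combined IQ scores for Classes A and B.
combined = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
            109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

modes = multimode(combined)
print(modes)  # [109]

# Hypothetical data with two modes.
two_modes = multimode([1, 1, 2, 2, 3])
print(two_modes)  # [1, 2]
```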


Mode

It may not be at the center of a distribution.

Data distribution on the right is “bimodal”
Mode

It may give you the most likely experience rather than the “typical” or
“central” experience.
Range

The spread, or the distance, between the lowest and highest values of a
variable.
To get the range for a variable, you subtract its lowest value from its highest
value.
Class A        Class B
102   115      127   162
128   109      131   103
131    89       96   111
 98   106       80   109
140   119       93    87
 93    97      120   105
110            109

Class A Range = 140 - 89 = 51
Class B Range = 162 - 80 = 82
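
The same subtraction, written as a small Python sketch (the function name `value_range` is illustrative):

```python
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

range_a = value_range(class_a)
range_b = value_range(class_b)

print(range_a)  # 51
print(range_b)  # 82
```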


Interquartile Range (IQR)

• A quartile is the value that marks one of the divisions that breaks a series of
values into four equal parts.
• The median is a quartile and divides the cases in half.
• 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
• 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.
• The interquartile range is the distance or range between the 25th percentile and
the 75th percentile.
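
As an illustration, the quartiles and IQR of Class A can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Note that statistical packages use slightly different interpolation rules for percentiles, so the quartile values from other software may differ a little; the sketch below uses Python's default ("exclusive") method.

```python
import statistics

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(class_a, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 97.5 109.0 123.5
print(iqr)         # 26.0
```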
Variance

A measure of the spread of the recorded values on a variable.


A measure of dispersion.
The larger the variance, the further the individual cases are from the mean.

The smaller the variance, the closer the individual scores are to the mean.
Variance

Class A (mean of A = 110.54)
102   115
128   109
131    89
 98   106
140   119
 93    97
110

The deviation of 102 from 110.54 is?
102 - 110.54 = -8.54

Deviation of 115?
115 - 110.54 = 4.46
Example: Deviations of IQ for Class A

 i    Yi    Deviation
 1   102      -8.54
 2   128      17.46
 3   131      20.46
 4    98     -12.54
 5   140      29.46
 6    93     -17.54
 7   110      -0.54
 8   115       4.46
 9   109      -1.54
10    89     -21.54
11   106      -4.54
12   119       8.46
13    97     -13.54
Variance

• We want to add these to get total deviations, but if we were to do that, we would
get zero every time. Why?
• We need a way to eliminate negative signs.
• Squaring the deviations will eliminate the negative signs…
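
One way to answer the “Why?” is a short derivation using only the definition of the mean, $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$:

```latex
\sum_{i=1}^{n} (Y_i - \bar{Y})
  = \sum_{i=1}^{n} Y_i - n\bar{Y}
  = \sum_{i=1}^{n} Y_i - n \cdot \frac{1}{n}\sum_{i=1}^{n} Y_i
  = 0
```

The positive and negative deviations always cancel exactly. With a rounded mean such as 110.54, the tabulated deviations sum to a value very close to zero rather than exactly zero.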
Back to the IQ example,

A deviation squared for 102 is:

$(102 - 110.54)^2 = (-8.54)^2 = 72.93$

and a deviation squared of 115:

$(115 - 110.54)^2 = (4.46)^2 = 19.89$
Variance

If you were to add all the squared deviations together, you’d get
what we call the
“Sum of Squares” (SS)
The formula for sum of squares (SS):
$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
Example: Sum of squares of IQ for Class A
 i    Yi    Deviation   Deviation squared
 1   102      -8.54          72.9316
 2   128      17.46         304.8516
 3   131      20.46         418.6116
 4    98     -12.54         157.2516
 5   140      29.46         867.8916
 6    93     -17.54         307.6516
 7   110      -0.54           0.2916
 8   115       4.46          19.8916
 9   109      -1.54           2.3716
10    89     -21.54         463.9716
11   106      -4.54          20.6116
12   119       8.46          71.5716
13    97     -13.54         183.3316
Variance

The last step…

The approximate average sum of squares is the
“VARIANCE”

In general,
$\text{Variance} = \frac{\text{Sum of squares (SS)}}{\text{Degrees of freedom}}$
where degrees of freedom = total number of cases in the sample - 1 = n - 1
Variance

For Class A, SS = 2891.23 (the sum of the squared deviations), so
Variance = SS / (n - 1)
= 2891.23 / 12 = 240.94

How helpful is that???

Variance is in squared units (squared IQ points), so it is not on the same scale as
the data and is hard to interpret directly.
Standard Deviation

To convert variance into something meaningful, let’s create standard deviation.

The square root of the variance gives, roughly, the typical deviation of the observations
from the mean.

In other words, standard deviation is approximately the average distance of an
observation from the mean.
Standard Deviation

For Class A, s.d. = √240.94 = 15.52
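
The variance and standard deviation of Class A can be checked in Python. The sketch below computes them by hand from the sum of squares and then cross-checks against `statistics.variance` and `statistics.stdev`, which both use the n - 1 (sample) divisor.

```python
import math
import statistics

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

n = len(class_a)
mean_a = sum(class_a) / n
ss = sum((y - mean_a) ** 2 for y in class_a)

# Variance = SS / degrees of freedom; s.d. = square root of the variance.
variance = ss / (n - 1)
sd = math.sqrt(variance)

print(round(variance, 2))  # 240.94
print(round(sd, 2))        # 15.52

# Cross-check with the standard library (sample versions, divisor n - 1).
print(round(statistics.variance(class_a), 2))
print(round(statistics.stdev(class_a), 2))
```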
Standard Deviation

• Larger s.d. = greater amounts of variation around the mean.

For example (two distributions, left and right):
Left: Mean = 3, s.d. = 1
Right: Mean = 3, s.d. = 0.5

• s.d. = 0 only when all values are the same (only when you have a constant and
not a “variable”)
• If you were to “rescale” a variable, the s.d. would change by the same
factor: if we changed units so the mean equaled 30, the s.d. on the left
would be 10, and on the right, 5.
• Like the mean, the s.d. will be inflated by an outlier case value.
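
The rescaling rule can be illustrated with two tiny made-up datasets. The values below are hypothetical, chosen only so that both have mean 3 and standard deviations of 1 and 0.5; multiplying every value by 10 multiplies both the mean and the s.d. by 10.

```python
import math
import statistics

# Hypothetical data: both have mean 3; s.d. is 1 (left) and 0.5 (right).
left = [2, 3, 4]
right = [2.5, 3, 3.5]

# Change units so every value is 10 times larger.
left_scaled = [10 * y for y in left]
right_scaled = [10 * y for y in right]

sd_left = statistics.stdev(left)
sd_right = statistics.stdev(right)
sd_left_scaled = statistics.stdev(left_scaled)
sd_right_scaled = statistics.stdev(right_scaled)

print(statistics.mean(left_scaled), sd_left_scaled)    # mean 30, s.d. 10
print(statistics.mean(right_scaled), sd_right_scaled)  # mean 30, s.d. 5
```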
Mean and Variance
Review

1. Deviation
2. Deviation squared
3. Sum of squares
4. Variance
5. Standard deviation
2. Numeric Summarizing Data:

• Central Tendency (measure of location)


• Mean
• Median
• Mode

• Variation (measure of spread)


• Range
• Interquartile Range
• Variance
• Standard Deviation

• …Wait! There’s one more!!


Box-plots

A way to graphically portray almost all descriptive statistics at once is the box-plot
(shows the location and spread).
A box-plot shows:
• Upper and lower quartiles
• Mean
• Median
• Range
• Outliers (values more than 1.5 × IQR below the lower quartile or above the upper quartile)
Box-plots

Inter Quartile Range (IQR)
= 123.5 – 96.5
= 27

[Figure: box-plot of IQ, vertical axis from 80 to 180]
Maximum (end of top whisker) = 162
75th percentile = 123.5
Mean = 110.5
Median = 106.5
25th percentile = 96.5
Minimum (end of bottom whisker) = 82

The max and min values of the data are 162 and 82, respectively.
• The calculation of the bottom ‘whisker’ limit:
= Value at 25th percentile - (1.5 * IQR)
= 96.5 - (1.5 * 27)
= 96.5 - 40.5
= 56 (the lowest value the bottom whisker can reach)
• The calculation of the top ‘whisker’ limit:
= Value at 75th percentile + (1.5 * IQR)
= 123.5 + (1.5 * 27)
= 123.5 + 40.5
= 164 (the highest value the top whisker can reach)
• The whisker limits are 56 and 164, so any value smaller than 56 or greater than 164
is considered an outlier.
• The max value of our data is 162 (which is < 164) and the min value is 82 (which is
> 56); thus, there is no outlier in the dataset.
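
The outlier check above can be written as a short Python sketch. The quartiles, minimum, and maximum are the values quoted on the box-plot slide; the function name `whisker_limits` is illustrative.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Values quoted on the box-plot slide.
q1, q3 = 96.5, 123.5
data_min, data_max = 82, 162

lower, upper = whisker_limits(q1, q3)
has_outlier = data_min < lower or data_max > upper

print(lower, upper)  # 56.0 164.0
print(has_outlier)   # False
```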
Distribution Shape and Box-plot
