Lecture 5, Stat 104, Fall 2017 (6-up slides)

1. The document discusses various measures of dispersion, or spread, of a data set beyond just the mean and median. These measures provide information about how variable or spread out the data values are. 2. The range is the simplest measure but ignores data distribution and can be overly influenced by outliers. The interquartile range eliminates some outliers by only considering values between the first and third quartiles. 3. A better measure is the mean absolute deviation (MAD), which calculates the average distance of each value from the mean using absolute values to avoid negative distances. This provides a measure of average variability around the mean.


Stat 104: Quantitative Methods
Class 5: Descriptive Statistics, Part II

Measure of Dispersion
The mean and median give us information about the central tendency of a set of observations, but these numbers shed no light on the dispersion, or spread, of the data.
Example: Which data set is more variable?
  5, 5, 5, 5, 5   Mean = 5
  1, 3, 5, 8, 8   Mean = 5
Measures of variation give information on the spread or variability of the data values.

Range
- Simplest measure of variation.
- Difference between the largest and the smallest observations:
  Range = x_maximum - x_minimum
- Example (dotplot running from 1 to 14): Range = 14 - 1 = 13

Disadvantages of the Range
- Ignores the way in which data are distributed: two very different data sets on the scale 7 8 9 10 11 12 can both have Range = 12 - 7 = 5.
- Sensitive to outliers:
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5      Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120    Range = 120 - 1 = 119
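As a quick illustration (not from the slides), the range calculation for the second list above can be done in R; x below is just that example data typed in.

x <- c(rep(1, 11), rep(2, 8), rep(3, 4), 4, 120)   # the 25-value example from this slide
max(x) - min(x)                                    # 119
diff(range(x))                                     # range() returns c(min, max), so this is also 119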

Interquartile Range (IQR)
- Can eliminate some outlier problems by using the interquartile range.
- Eliminate some high- and low-valued observations and calculate the range from the remaining values.
- Interquartile range = 3rd quartile - 1st quartile

Interquartile Range
- Developed by John Tukey, the founder of EDA (exploratory data analysis).
- Doesn't take into account all your data, so it is not used that much.
[Figure: diagram marking the IQR between the first and third quartiles]
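A small sketch (not from the slides) of how the IQR comes from the quartiles in R; the vector x is made up for illustration, and quantile() supports several quartile conventions, so hand calculations can differ slightly from its default.

x <- c(1, 3, 5, 8, 8, 10, 12, 15, 100)   # hypothetical data with one large value
quantile(x, c(0.25, 0.75))               # 1st and 3rd quartiles
IQR(x)                                   # 3rd quartile minus 1st quartile
diff(range(x))                           # compare: the plain range is dominated by the 100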
Example: Haircut Data Again
> summary(mydata$haircut)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   15.00   21.50   32.21   40.00  250.00
IQR = 40 - 15 = 25
> IQR(mydata$haircut)
[1] 25

How should we measure variability?
The basic idea is to view variability in terms of distance between each measurement and the mean. A natural measure of dispersion is to calculate the average distance all the observations are from the center of the data:
  spread = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})
Is this a good measure of dispersion? No, it's horrible. Any idea why?

Distance from a Fixed Point
- We can think of a measure of spread as an average distance, like the average distance everyone lives from the Science Center.
- Say this average value is 1 mile. Then if you live less than 1 mile from the Science Center you realize you are closer than a lot of your fellow students, and if you live 20 miles away you know you are an outlier.

Distance Has to be Positive
- We know that distance can't be negative; that is, if you live north of the SC you are positive miles away and south of the SC you are negative miles away.
- But this spread formula doesn't know that; it just takes the difference between each value and the mean, which could result in negative or positive numbers.
- In fact, this formula always returns a value of 0:
  spread = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) = 0

Does anyone have a calculator?
- We need 3 numbers: x_1 = ___, x_2 = ___, x_3 = ___
- Calculate the mean = ___
- Now calculate
  spread = \frac{1}{3} \sum_{i=1}^{3} (x_i - \bar{x})

One Solution: Mean Absolute Deviation (MAD)
- One way to get rid of negative distances is by using absolute values.
- The Mean Absolute Deviation (MAD) of a data set is defined to be
  MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
- What are the units of MAD? Do people use it?
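A tiny sketch (hypothetical numbers, not from the slides) showing why the raw average deviation fails and how the absolute value fixes it:

x <- c(1, 3, 5, 8, 8)             # mean is 5
mean(x - mean(x))                 # essentially 0: positive and negative deviations cancel
mean(abs(x - mean(x)))            # the MAD defined on this slide: 2.4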
MAD for Haircut Data
- The function mad in R is not our MAD; it's a different, more complicated robust measure of spread.
- We can compute our MAD in R as follows (don't worry about the code):
  > mean(abs(mydata$haircut - mean(mydata$haircut)))
  [1] 22.38523
- The MAD for the haircut data is then $22.39.
- This is very close to the IQR = $25. Hmmm.

Another Solution: The Variance
The variance of a set of data is defined as
  s_X^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
We use n-1 instead of n for technical reasons that will be discussed later; you could divide by n, but n-1 is just better.
What practical significance can be attached to the variance as a measure of variability? Large variances imply a large amount of variation, but what constitutes large? The answer will appear in a few slides.

The variance of the haircut data is 1471.86. Yikes!! That seems like a pretty big number.
> var(mydata$haircut)
[1] 1471.866
What are the units of this number anyway?
A measure of spread should have the same units as the original data. In the salary example, the variance is measured in dollars squared.
What can we do to get back to our original units?

Standard Deviation
- Most commonly used measure of variation.
- Shows variation about the mean.
- Has the same units as the original data.
  s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
The standard deviation for the haircut data is $38.36, which still seems large, reflecting the wide spread in the data.
> var(mydata$haircut)
[1] 1471.866
> sd(mydata$haircut)
[1] 38.3649
> describe(mydata$haircut)
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 74 32.21 38.36   21.5   25.85 17.05   0 250   250 3.63     16.2 4.46
Actually, how we determine if a std dev is large or small is something we will discuss in the next class.
Why is the std dev a lot larger than the MAD or IQR?

Standard Deviation: a Measure of Risk?
- Standard deviation measures the spread of a data set, so it seems natural for financial instruments to say the higher the standard deviation, the riskier the asset.
- This can work, in that generally the higher the standard deviation the riskier the investment, but it does have some problems and you should keep these issues in mind.

Comparing can be difficult
- Manager 1 makes a 2% return every month.
- Manager 2 makes a -2% return every month.
- If you compare them using standard deviation, who is better?

Finance Example: Comparing Mutual Funds
Let's use means and sds to compare mutual funds. For 10 different assets we compute the mean and sd. Then plot mean vs sd.
The assets are:
[The asset names and the mean vs. sd plot appear as a figure in the slides.]

Some Tools so Far
- New toolbox additions:
  - Dotplot and Histograms
  - Summary Statistics (mean, median, std dev)
Shifting and Rescaling Data
- Original data: x_1, x_2, ..., x_n
- Linear Transformation: Y_i = a + b X_i  (a shifts the data; b changes the scale)
- Linear transformations do not change the shape of the data distribution, but do change the center and spread.

Examples
Examples of changing units:
1. from feet (x) to inches (y): y = 12x
2. from dollars (x) to cents (y): y = 100x
3. from degrees Celsius (x) to degrees Fahrenheit (y): y = 32 + (9/5)x
4. from ACT (x) to SAT (y): y = 150 + 40x
5. from inches (x) to centimeters (y): y = 2.54x
Linear Transformations (a + bX rule)
- The mean and variance of a data set have two interesting properties. These properties occur when one shifts a data set, or multiplies by a value (expands or contracts a data set):
  Average(a + bX) = a + b * Average(X)
  Var(a + bX) = b^2 * Var(X)

Effects of Linear Transformations
- Your transformation: y = a + b*x
- mean_new = a + b*mean
- median_new = a + b*median
- stdev_new = |b|*stdev
- IQR_new = |b|*IQR
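A quick numerical check of the a + bX rule (my own sketch, with made-up a, b, and data):

set.seed(1)
x <- rnorm(50)                      # hypothetical data
a <- 2; b <- -3
y <- a + b * x
c(a + b * mean(x), mean(y))         # the two means agree up to rounding error
c(b^2 * var(x), var(y))             # Var(a + bX) = b^2 Var(X)
c(abs(b) * sd(x), sd(y))            # the sd picks up |b|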

Example
- Winter temperature recorded in Fahrenheit:
  mean = 20, stdev = 10, median = 22, IQR = 11
- Convert into Celsius using C = (5/9)F - 17.78:
  mean   = -17.78 + (5/9)(20) = -6.67 C
  stdev  = (5/9)(10) = 5.56
  median = -17.78 + (5/9)(22) = -5.56 C
  IQR    = (5/9)(11) = 6.11

Example
Descriptive Statistics: X, 2X, -4X, X+2, X-1, -2X+1, 0.5X-2

Variable   N     Mean   Median   TrMean    StDev  SE Mean
X         50   0.4695   0.4428   0.4685   0.2880   0.0407
2X        50   0.9389   0.8856   0.9370   0.5760   0.0815
-4X       50   -1.878   -1.771   -1.874    1.152    0.163
X+2       50   2.4695   2.4428   2.4685   0.2880   0.0407
X-1       50  -0.5305  -0.5572  -0.5315   0.2880   0.0407
-2X+1     50   0.0611   0.1144   0.0630   0.5760   0.0815
0.5X-2    50  -1.7653  -1.7786  -1.7657   0.1440   0.0204
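The table above can be reproduced in spirit with a short R sketch (the data here are simulated, not the original 50 values, so the numbers will differ):

set.seed(104)                                         # hypothetical seed
x <- runif(50)                                        # stand-in for the original X
vars <- list(X = x, `2X` = 2*x, `-4X` = -4*x, `X+2` = x + 2,
             `X-1` = x - 1, `-2X+1` = -2*x + 1, `0.5X-2` = 0.5*x - 2)
t(sapply(vars, function(v) c(Mean = mean(v), Median = median(v), StDev = sd(v))))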

The Most Common Linear Transformation
- The Z-score is a common linear transformation:
  z = \frac{X - \bar{X}}{S}
- By z-scoring a data set, the new data set will have mean 0 and variance 1.
- The z-score is the number of standard deviations a raw score (individual score) deviates from the mean.

Using Zs to Compare Values
- Since z-scores reflect how far a score is from the mean, they are a good way to standardize scores.
- We can take any distribution and express all the values as z-scores (distances from the mean). So, no matter the scale we originally used to measure the variable, it will be expressed in a standard form.
- This standard form puts different scales on the same scale, so that values from two different distributions can be directly compared.
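A minimal sketch of z-scoring in R (hypothetical data); base R's scale() does the same centering and scaling:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # hypothetical data
z <- (x - mean(x)) / sd(x)            # z-scores by hand
round(c(mean(z), sd(z)), 10)          # mean 0 and sd 1
all.equal(z, as.numeric(scale(x)))    # scale() gives the same values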
Used for Comparison Purposes
- Mary's ACT score is 26. Jason's SAT score is 900. Who did better?
- The mean SAT score is 1000 with a standard deviation of 100 SAT points.
- The mean ACT score is 22 with a standard deviation of 2 ACT points.

Calculate the Z-scores
  Jason: (900 - 1000)/100 = -1
  Mary:  (26 - 22)/2 = +2
- From these findings, we gather that Jason's score is 1 standard deviation below the mean SAT score and Mary's score is 2 standard deviations above the mean ACT score.
- Therefore, Mary's score is relatively better.
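The same arithmetic in R, just for the record (numbers taken from the slide):

(900 - 1000) / 100    # Jason's SAT z-score: -1
(26 - 22) / 2         # Mary's ACT z-score: +2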

Interpreting Standard Deviation
We now have the two summaries:
  x̄    where the data is
  s_x  how spread out, or variable, the data is
The mean is pretty easy to understand. What are the units?
We know that the bigger s_x is, the more variable the data is, but how do we interpret the number? What is a big s_x, and what is a small one? What are the units of s_x?
The empirical rule will help us understand s_x and relate the summaries back to the dotplot (or histogram).

Rule of Thumb
The most basic analysis is to simply compare the value of the mean to the value of the standard deviation. Intuitively, what do you think the following data sets look like?

               x̄       s    spread
  Data Set 1   50       0    none / small / medium / large
  Data Set 2   50       3    none / small / medium / large
  Data Set 3   50      14    none / small / medium / large
  Data Set 4   50      42    none / small / medium / large
  Data Set 5   50    1000    none / small / medium / large


Empirical Rule
For mound-shaped data:
- Approximately 68% of the data is in the interval (x̄ - s_x, x̄ + s_x) = x̄ ± s_x
- Approximately 95% of the data is in the interval (x̄ - 2s_x, x̄ + 2s_x) = x̄ ± 2s_x

What good is the empirical rule again?
Empirical Rule Example
- A survey of 1000 U.S. gas stations was conducted and you were told the average price charged for a gallon of regular gas was $3.90 with a std dev of $0.20.
- You were also told the data is mound shaped.
- What can you deduce?

You Find ±2s in Many Places
Bollinger bands are x̄ ± 2s_x (based on a moving window of 20 time periods).
See http://en.wikipedia.org/wiki/Bollinger_Bands or take Stat 107.
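A hedged sketch of what you can deduce in the gas-station example (values from the slide; the intervals are my own arithmetic using the empirical rule):

xbar <- 3.90; s <- 0.20
xbar + c(-1, 1) * s     # about 68% of stations: $3.70 to $4.10
xbar + c(-2, 2) * s     # about 95% of stations: $3.50 to $4.30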

Don't Fall in Love with ±2s
- Standard deviation is a good measure of spread if your data is symmetric; if your data is not symmetric it really isn't interpretable.
- If your data is not symmetric, one needs to use Chebyshev's rule for interpreting the spread of your data.

Chebyshev's Rule
- For any set of data and for any number k greater than one, the proportion of the data that lies within k standard deviations of the mean is at least
  1 - \frac{1}{k^2}
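A one-line check of Chebyshev's bound for a few values of k (my own sketch):

k <- c(2, 3, 4)
1 - 1 / k^2     # at least 75%, 88.9%, and 93.75% of the data within k standard deviations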

So for (x̄ - 2s_x, x̄ + 2s_x) = x̄ ± 2s_x
- According to Chebyshev's Theorem, at least what fraction of the data falls within k (k = 2) standard deviations of the mean?
- At least 1 - 1/2^2 = 3/4 = 75% of the data falls within 2 standard deviations of the mean.
- Hey, that's not 95% of the data. Exactly!

Detecting Outliers
- The detection of outliers is important for a variety of reasons.
- One rather mundane reason is that they can help identify erroneously recorded results.
- We have already seen that even a single outlier can grossly affect the sample mean and variance, and of course we do not want a typing error to substantially alter or color our perceptions of the data.
- So it can be prudent to check for outliers, and if any are found, make sure they are valid.

Outliers are Naughty
- Resistant to outliers: Median, IQR
- Not resistant to outliers: Mean, Standard Deviation, Variance, Range

Effect of Outliers on Summary Stats
- Outliers can lead to too-high, too-low, or nearly correct estimates of the population mean, depending upon the number and location of the outliers (asymmetrical vs. symmetrical patterns).
- Outliers always lead to overestimates of the standard deviation.
[Figure: three panels labeled "Mean estimate is too high & std is overestimated", "Mean estimate is too low & std is overestimated", and "Mean estimate is right & std is overestimated"]
Classic Outlier Detection
- A classic outlier technique is to simply Z-score the data and declare any point an outlier if the magnitude of
  Z = \frac{X - \bar{X}}{s}
  is at least 2.
- The value 2 is motivated by the normal distribution that we will see in a few classes.

Example
- Consider the values 2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,1000
- For this data mean = 65.94 and s = 249.1
- The Z-score for the point 1000 is (1000 - 65.94)/249.1 = 3.75
- So 1000 is declared an outlier.
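The classic rule from these two slides can be written as a short R sketch (data copied from the slide; the cutoff of 2 is the one used above):

x <- c(rep(2, 5), rep(3, 5), rep(4, 5), 1000)   # the 16 values from the slide
z <- (x - mean(x)) / sd(x)
x[abs(z) >= 2]                                  # only 1000 is flagged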

Example
- Consider the data 2,2,3,3,3,4,4,4,100000,100000
- For this data mean = 20002.5 and s = 42162.38
- The Z-score for the point 100000 is (100000 - 20002.5)/42162.38 = 1.897
- So 100000 is NOT declared an outlier.

Yuck
- The classic method would not declare the value 100,000 an outlier even though it is certainly highly unusual relative to the other eight values.
- The problem is that both the sample mean and the sample standard deviation are sensitive to outliers, which can affect our detection ability.
- An outlier detection technique is said to suffer from masking if the very presence of outliers causes them to be missed.
Example
- Pedersen et al. (1998) conducted a study, a portion of which dealt with the sexual attitudes of undergraduate students.
- Among other things, the students were asked how many sexual partners they desired over the next 30 years.
- We look at the responses of 105 males.

A Histogram
> mydata=read.csv("https://goo.gl/e8nYDF")
> hist(mydata$x,main="Number of Partners Desired")

Summary Statistics
> describe(mydata$x)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   64.92    6.00 6000.00
> sum(mydata$x<mean(mydata$x))
[1] 102
- The mean is not very typical since 102 of the 105 people surveyed gave a response less than the mean.

Outliers
- One participant surveyed responded he wanted 6000 sexual partners over the next 30 years, which is clearly unusual compared to the other stat 100 students. Heck, it's unusual in general.
- Also, two gave the response 150, which again is unusual.
- For HW you will see that the 6000 is flagged as an outlier, but not the 150. Though it probably should be.
The Boxplot Rule
- One of the earliest improvements on the classic outlier detection rule is called the boxplot rule.
- It is based on the fundamental strategy of avoiding masking by replacing the mean and standard deviation with measures of location and dispersion that are relatively insensitive to outliers.

The Boxplot Rule
- In particular, the boxplot rule declares the value X an outlier if
  X < Q1 - 1.5(Q3 - Q1)   or   X > Q3 + 1.5(Q3 - Q1)
- So the rule is based on the lower and upper quartiles, as well as the interquartile range, which provide resistance to outliers.
Example
- Remember the sexual attitude data:
> describe(mydata$x)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   64.92    6.00 6000.00
- Outlier if > 6 + 1.5(6 - 1) = 13.5, so 12 points are now flagged as outliers instead of 1.

Outlier Detection in R
- Consider the following:
> mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat104/cars10.csv")
> head(mydata)
           make price mpg headroom trunk weight length turn displacement
1   AMC Concord  4099  22      2.5    11   2930    186   40          121
2     AMC Pacer  4749  17      3.0    11   3350    173   40          258
3    AMC Spirit  3799  22      3.0    12   2640    168   35          121
4 Buick Century  4816  20      4.5    16   3250    196   40          196
5 Buick Electra  7827  15      4.0    20   4080    222   43          350
6 Buick LeSabre  5788  18      4.0    21   3670    218   43          231
> attach(mydata)   ### this makes the variables directly available to us
Finding Outliers in R: Direct Method
- Consider the following (the boxplot rule):
> summary(price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3291    4220    5006    6165    6332   15910
> price[price>6332+1.5*IQR(price)]
 [1] 10372 11385 14500 15906 11497 13594 13466 10371  9690  9735 12990 11995
> price[price<4220-1.5*IQR(price)]
integer(0)

Finding Outliers in R: Direct Method
- Consider the following (the z-score rule):
> summary(price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3291    4220    5006    6165    6332   15906
> price[(price-mean(price))/sd(price) < -1.96]
integer(0)
> price[(price-mean(price))/sd(price) > 1.96]
[1] 14500 15906 13594 13466 12990 11995
Finding Outliers in R: Easy Method
- Consider the following for finding outliers based on the boxplot rule.
> boxplot.stats(price)$out    #### easier way to get the outliers
 [1] 10372 11385 14500 15906 11497 13594 13466 10371  9690  9735 12990 11995

How to Remove Outliers from the Data
- setdiff(a,b) removes b from a:
> IQR(price)
[1] 2112
> outliers=boxplot.stats(price)$out
> cleanprice=setdiff(price,outliers)
> IQR(cleanprice)
[1] 1596.5
> sd(price)
[1] 2949.496
> sd(cleanprice)
[1] 1166.073
> mean(price)
[1] 6165.257
> mean(cleanprice)
[1] 5011.742
Skewness
- A related idea to outliers is skewness (and one we always wonder about: do we really have outliers, or is the data skewed, or both?).
- Skewness measures the degree of asymmetry exhibited by the data:
  skewness = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}
- We will never calculate this by hand.

Values of Skewness
- A symmetric data set should have a skewness value near 0.
- Negative values for the skewness indicate data that are skewed left, and positive values for the skewness indicate data that are skewed right.
- By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
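Base R has no built-in skewness function (the describe() output used in these slides, presumably from the psych package, reports a skew column computed with a similar but not necessarily identical convention), so here is a minimal sketch of the statistic defined above:

skewness <- function(x) {           # my own helper, following the slide's formula
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
skewness(c(2, 2, 3, 3, 3, 4, 4, 4, 100000, 100000))   # strongly positive: right skewed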

Skewness Example: Haircut Data
> describe(mydata$haircut)
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 74 32.21 38.36   21.5   25.85 17.05   0 250   250 3.63     16.2 4.46
> describe(mydata$haircut[mydata$haircut<150])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 72 26.86 20.49     20   24.67 14.83   0  85    85    1     0.52 2.41
> describe(mydata$haircut[mydata$haircut<100])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 72 26.86 20.49     20   24.67 14.83   0  85    85    1     0.52 2.41
> describe(mydata$haircut[mydata$haircut<50])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 62 20.42 12.78     17   20.21 10.38   0  48    48 0.24    -0.77 1.62

Example: Sexual Partners
> describe(sexpart)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> describe(sexpart[sexpart<150])
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 102 5.07 7.85      1    3.27 1.48   0  45    45 2.95     9.84 0.78
> describe(sexpart[sexpart<10])
   vars  n mean  sd median trimmed mad min max range skew kurtosis   se
X1    1 84  2.2 2.1      1    1.84   0   0   9     9 1.49     1.49 0.23

Remember Data is Time Dependent
> library(quantmod)
> getSymbols("AAPL")
[1] "AAPL"
> aaplret=dailyReturn(Ad(AAPL))
> describe(aaplret)
              vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis se
daily.returns    1 2630    0 0.02      0       0 0.01 -0.18 0.14  0.32 -0.19     6.29  0

Remember Data is Time Dependent
> aaplret=monthlyReturn(Ad(AAPL))
> describe(aaplret)
                vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
monthly.returns    1 126 0.03 0.09   0.03    0.03 0.07 -0.33 0.24  0.57 -0.69     2.17 0.01

Transforming Skewed Data
- When a distribution is skewed, it can be hard to summarize the data simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched-out tail.
- How can we say anything useful about such data? The secret is to apply a simple function to each data value.

Nonlinear Transformations
- Sometimes there is a need to transform our data in a nonlinear way: Y = sqrt(X), Y = log(X), Y = 1/X, etc.
- This is usually done to try to symmetrize the data distribution and improve its fit to the assumptions of statistical analysis (this will make more sense in a few weeks).
- Basically, to reduce outliers in the data and/or reduce skewness.

Your Dream Job
- Consider the graph below, which shows 2005 CEO data for the Fortune 500. The data is in thousands of dollars.
[Figure: histogram of 2005 Fortune 500 CEO total compensation]

The Data is Heavily Skewed
- Skewed distributions are difficult to summarize. It's hard to know what we mean by the center of a skewed distribution, so it's not obvious what value to use to summarize the distribution.
- What would you say was a typical CEO total compensation? The mean value is $10,307,000, while the median is only $4,700,000.

Log the Data
- One way to make a skewed distribution more symmetric is to re-express, or transform, the data by applying a simple function to all the data values.
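A hedged sketch of the re-expression idea (simulated right-skewed data, not the actual CEO compensation file):

set.seed(2005)                                   # hypothetical seed
comp <- rlnorm(500, meanlog = 8, sdlog = 1)      # simulated skewed "compensation" values
c(mean(comp), median(comp))                      # mean well above median, as on the slide
hist(comp, main = "Raw data: heavily right skewed")
hist(log(comp), main = "Logged data: much more symmetric")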

The Transform Cheat Sheet
- Calculate the skewness statistic for your data set.
- If |skewness| < 0.8, the data set is cool and unlikely to disrupt our analysis.
- Otherwise, try a transformation in the ladder of powers.

Today's Tools
- New toolbox additions:
  - Transformations, Skewness, Outliers
  - Empirical Rule

Things You Should Know
- Empirical Rule, Chebyshev's Rule
- a + bX rule
- Z-scoring
- Detecting Outliers
- Skewness and Transformations
