Lecture5 Stat104 Fall2017 V1 6up
Stat 104: Quantitative Methods
Class 5: Descriptive Statistics, Part II

Measure of Dispersion
The mean and median give us information about the central tendency of a set of observations, but these numbers shed no light on the dispersion, or spread, of the data.
Example: Which data set is more variable?
5,5,5,5,5 Mean = 5
1,3,5,8,8 Mean = 5
Measures of variation give information on the spread or variability of the data values.
[Figure: data plotted on an axis from 0 to 14]
Range = 14 - 1 = 13
Data: 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
The range doesn't take into account all of your data, so it is not used that much.
- Can eliminate some outlier problems by using the interquartile range.
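As a quick check of the range calculation, here is a minimal R sketch using the 25-value data set from the range example. The variable names `x`, `x_range`, and `x_iqr` are introduced here only for illustration, and `IQR()` uses R's default quantile definition.

```r
# Data set from the range example: mostly small values plus one extreme point.
x <- c(rep(1, 11), rep(2, 8), rep(3, 4), 4, 120)

# Range as a single number: largest value minus smallest value.
x_range <- max(x) - min(x)

# Interquartile range: Q3 - Q1, using R's default quantile method.
x_iqr <- IQR(x)

print(x_range)
print(x_iqr)
```

The single value 120 drives the range, while the IQR depends only on the middle half of the data and so is unaffected by it.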
$$\sum_{i=1}^{n} (x_i - \bar{x})$$
$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$
What are the units of MAD?
Do people use it?
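To make the two sums concrete, the sketch below computes them in R for the small data set 1,3,5,8,8 from the opening example. It assumes MAD here means the mean absolute deviation about the mean; the names `dev`, `sum_dev`, and `mad_mean` are illustrative only.

```r
# Small data set from the opening example (mean = 5).
x <- c(1, 3, 5, 8, 8)

# Deviations from the mean.
dev <- x - mean(x)

# The raw deviations cancel out, so their sum is not a useful measure of spread.
sum_dev <- sum(dev)

# Mean absolute deviation: average distance from the mean, in the units of x.
mad_mean <- mean(abs(dev))

print(sum_dev)
print(mad_mean)
```

Note that R's built-in `mad()` is a different quantity: it is the median absolute deviation about the median, scaled by a constant by default. That is also what the `mad` column of `describe()` reports.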
What can we do to get back to our original units?
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
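The formula for s can be checked step by step in R. This is a minimal sketch on the data set 1,3,5,8,8 from the opening example; `s2_manual` and `s_manual` are illustrative names, and the built-in `var()` and `sd()` are used only for comparison.

```r
# Small data set from the opening example.
x <- c(1, 3, 5, 8, 8)
n <- length(x)

# Sample variance: sum of squared deviations divided by n - 1 (squared units).
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# Sample standard deviation: the square root returns us to the original units.
s_manual <- sqrt(s2_manual)

# Compare with R's built-in functions.
print(c(manual = s2_manual, builtin = var(x)))
print(c(manual = s_manual, builtin = sd(x)))
```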
The standard deviation for the haircut data is $38.36, which still seems large, reflecting the wide spread in the data.
> var(mydata$haircut)
[1] 1471.866
> sd(mydata$haircut)
[1] 38.3649
> describe(mydata$haircut)
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 74 32.21 38.36   21.5   25.85 17.05   0 250   250 3.63     16.2 4.46
Actually, how we determine if a std dev is large or small is something we will discuss in the next class.

Standard Deviation: a Measure of Risk?
- Standard deviation measures the spread of a data set, so for financial instruments it seems natural to say that the higher the standard deviation, the riskier the asset.
- This can work, in that generally the higher the standard deviation the riskier the investment, but it does have some problems and you should keep these issues in mind.
Example
- Winter temperature recorded in Fahrenheit
  - mean = 20
  - stdev = 10
  - median = 22
  - IQR = 11
- Convert into Celsius: C = (5/9)F - 17.78
  - mean = -17.78 + 5/9 * 20 = -6.67 C

Example
Descriptive Statistics: X, 2X, -4X, X+2, X-1, -2X+1, 0.5X-2
Variable   N    Mean  Median  TrMean   StDev  SE Mean
X         50  0.4695  0.4428  0.4685  0.2880   0.0407
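The effect of a linear transformation a + bX on summary statistics can be sketched in R. The temperatures in `temp_f` are hypothetical values made up for illustration (they are not the data behind the slide), and `a`, `b`, and `temp_c` are illustrative names.

```r
# Hypothetical winter temperatures in Fahrenheit (illustration only).
temp_f <- c(5, 12, 18, 22, 24, 27, 32)

# Celsius conversion from the slide: C = (5/9)F - 17.78.
a <- -17.78
b <- 5 / 9
temp_c <- a + b * temp_f

# Measures of center are shifted and scaled: a + b * (statistic).
center_check <- c(
  mean   = isTRUE(all.equal(mean(temp_c), a + b * mean(temp_f))),
  median = isTRUE(all.equal(median(temp_c), a + b * median(temp_f)))
)

# Measures of spread are only scaled: |b| * (statistic); the shift a drops out.
spread_check <- c(
  sd  = isTRUE(all.equal(sd(temp_c), abs(b) * sd(temp_f))),
  IQR = isTRUE(all.equal(IQR(temp_c), abs(b) * IQR(temp_f)))
)

print(center_check)
print(spread_check)

# The slide's converted mean, from the summary value alone.
slide_mean_c <- a + b * 20
print(round(slide_mean_c, 2))
```

Adding a constant moves the center but leaves the spread alone; multiplying by a constant rescales both.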
The Most Common Linear Transformation
- The Z-score is a common linear transformation:
  $z = \frac{X - \bar{X}}{S}$
- By z-scoring a data set, the new data set will have mean 0 and variance 1.
- A z-score is the number of standard deviations a raw score (individual score) deviates from the mean.

Using Zs to compare values
- Since z-scores reflect how far a score is from the mean, they are a good way to standardize scores.
- We can take any distribution and express all the values as z-scores (distances from the mean). So, no matter the scale we originally used to measure the variable, it will be expressed in a standard form.
- This standard form can be used to convert different scales to the same scale, so that values from two different distributions can be compared directly.
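A minimal R sketch of z-scoring, using the small data set 1,3,5,8,8 from the opening example. The names `z` and `z_scale` are illustrative; base R's `scale()` is shown only as a convenience for the same calculation.

```r
# Small data set from the opening example.
x <- c(1, 3, 5, 8, 8)

# z-score: distance from the mean, measured in standard deviations.
z <- (x - mean(x)) / sd(x)

# After z-scoring, the mean is 0 and the variance (and sd) is 1.
print(round(mean(z), 10))
print(var(z))

# Base R's scale() centers and scales in one step; it returns a matrix,
# so convert back to a plain numeric vector to compare.
z_scale <- as.numeric(scale(x))
print(all.equal(z, z_scale))
```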
For mound shaped data:
Approximately 68% of the data is in the interval
$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$
Approximately 95% of the data is in the interval
$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$

What good is the empirical rule again?
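One way to see the empirical rule at work is to count how much of a mound-shaped sample falls inside each interval. The sketch below uses simulated normal data, which is an assumption made purely for illustration; the seed, sample size, mean, and standard deviation are arbitrary choices.

```r
# Simulated mound-shaped data (assumption: normal, arbitrary parameters).
set.seed(104)
x <- rnorm(10000, mean = 50, sd = 10)

xbar <- mean(x)
s <- sd(x)

# Proportion of observations within one and two standard deviations of the mean.
within1 <- mean(abs(x - xbar) <= s)
within2 <- mean(abs(x - xbar) <= 2 * s)

print(within1)  # should land near 0.68
print(within2)  # should land near 0.95
```

For data that are strongly skewed or have heavy tails, these proportions can differ noticeably from 68% and 95%.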
Example
- Consider the data
  2,2,3,3,3,4,4,4,100000,100000
- For this data mean=20002.5 and s=42162.38
- The Z-score for the point 100000 is
  $\frac{100000 - 20002.5}{42162.38} = 1.897$
- So 100000 is NOT declared an outlier.

Yuck
- The classic method would not declare the value 100,000 an outlier, even though it is certainly highly unusual relative to the other eight values.
- The problem is that both the sample mean and the sample standard deviation are sensitive to outliers, which can affect our detection ability.
- An outlier detection technique is said to suffer from masking if the very presence of outliers causes them to be missed.
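The masking example can be reproduced directly in R. The cutoff of 2 used below is an assumption for illustration, since the cutoff is not stated here; `cutoff`, `z_max`, and `flagged` are illustrative names.

```r
# Data from the masking example: eight small values and two extreme ones.
x <- c(2, 2, 3, 3, 3, 4, 4, 4, 100000, 100000)

# The extreme values inflate both the mean and the standard deviation.
print(mean(x))
print(sd(x))

# z-scores computed with the (inflated) mean and standard deviation.
z <- (x - mean(x)) / sd(x)
z_max <- max(z)
print(round(z_max, 3))

# Assumed cutoff: flag points more than 2 standard deviations from the mean.
cutoff <- 2
flagged <- x[abs(z) > cutoff]
print(length(flagged))  # no points are flagged: the outliers mask themselves
```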
Example
- Pedersen et al. (1998) conducted a study, a portion of which dealt with the sexual attitudes of undergraduate students.
- Among other things, the students were asked how many sexual partners they desired over the next 30 years.
- The responses of 105 males

A Histogram
> mydata=read.csv("https://goo.gl/e8nYDF")
> hist(mydata$x,main="Number of Partners Desired")
Finding Outliers in R: Easy Method
- Consider the following for finding outliers based on the boxplot rule.
> boxplot.stats(price)$out #### easier way to get the outliers
[1] 10372 11385 14500 15906 11497 13594 13466 10371 9690 9735 12990 11995

How to remove outliers from the data
setdiff(a,b) removes b from a
> IQR(price)
[1] 2112
> outliers=boxplot.stats(price)$out
> cleanprice=setdiff(price,outliers)
> IQR(cleanprice)
[1] 1596.5
> sd(price)
[1] 2949.496
> sd(cleanprice)
[1] 1166.073
> mean(price)
[1] 6165.257
> mean(cleanprice)
[1] 5011.742
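One caveat worth checking when using `setdiff()` this way: it is a set operation, so it also drops repeated values from the data that remain. The sketch below illustrates this on a small hypothetical `price` vector (made up here, not the data used above) and shows logical indexing with `%in%` as an alternative that keeps duplicates.

```r
# Hypothetical prices with repeated values and one extreme point (illustration only).
price <- c(10, 10, 12, 13, 13, 14, 15, 100)

# Outliers according to the boxplot rule.
outliers <- boxplot.stats(price)$out
print(outliers)

# setdiff() removes the outliers but also collapses duplicate values.
clean_setdiff <- setdiff(price, outliers)
print(clean_setdiff)

# Logical indexing removes only the outliers and keeps repeated values.
clean_index <- price[!(price %in% outliers)]
print(clean_index)

# The two cleaned vectors differ in length, so summaries such as the mean differ too.
print(c(setdiff = mean(clean_setdiff), index = mean(clean_index)))
```

If the data contain no repeated values, the two approaches give the same result.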
> describe(mydata$haircut)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 74 32.21 38.36 21.5 25.85 17.05 0 250 250 3.63 16.2 4.46
> describe(mydata$haircut[mydata$haircut<150])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 72 26.86 20.49 20 24.67 14.83 0 85 85 1 0.52 2.41
> describe(mydata$haircut[mydata$haircut<100])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 72 26.86 20.49 20 24.67 14.83 0 85 85 1 0.52 2.41
> describe(mydata$haircut[mydata$haircut<50])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 62 20.42 12.78 17 20.21 10.38 0 48 48 0.24 -0.77 1.62
> describe(sexpart)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 105 64.92 585.16 1 3.66 1.48 0 6000 6000 9.94 97.79 57.11
> describe(sexpart[sexpart<150])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 102 5.07 7.85 1 3.27 1.48 0 45 45 2.95 9.84 0.78
> describe(sexpart[sexpart<10])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 84 2.2 2.1 1 1.84 0 0 9 9 1.49 1.49 0.23

> aaplret=dailyReturn(Ad(AAPL))
> describe(aaplret)
vars n mean sd median trimmed mad min max range skew kurtosis se
daily.returns 1 2630 0 0.02 0 0 0.01 -0.18 0.14 0.32 -0.19 6.29 0