Lecture 5, Stat 104, Fall 2017 (6-up slides)

1. The document discusses various measures of dispersion, or spread, of a data set beyond just the mean and median. These measures provide information about how variable or spread out the data values are. 2. The range is the simplest measure but ignores data distribution and can be overly influenced by outliers. The interquartile range eliminates some outliers by only considering values between the first and third quartiles. 3. A better measure is the mean absolute deviation (MAD), which calculates the average distance of each value from the mean using absolute values to avoid negative distances. This provides a measure of average variability around the mean.


Stat 104: Quantitative Methods
Class 5: Descriptive Statistics, Part II

Measure of Dispersion
The mean and median give us information about the central tendency of a set of observations, but these numbers shed no light on the dispersion, or spread, of the data.
Example: Which data set is more variable?
  5, 5, 5, 5, 5   Mean = 5
  1, 3, 5, 8, 8   Mean = 5
Measures of variation give information on the spread or variability of the data values.

Range
- Simplest measure of variation.
- Difference between the largest and the smallest observations:
  Range = x_maximum - x_minimum
- Example (dotplot running from 1 to 14): Range = 14 - 1 = 13

Disadvantages of the Range
- Ignores the way in which data are distributed: two very different data sets on the scale 7 8 9 10 11 12 can both have Range = 12 - 7 = 5.
- Sensitive to outliers:
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5      Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120    Range = 120 - 1 = 119
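As a quick illustration (not from the slides), the range calculation for the second list above can be done in R; x below is just that example data typed in.

x <- c(rep(1, 11), rep(2, 8), rep(3, 4), 4, 120)   # the 25-value example from this slide
max(x) - min(x)                                    # 119
diff(range(x))                                     # range() returns c(min, max), so this is also 119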

Interquartile Range (IQR)
- Can eliminate some outlier problems by using the interquartile range.
- Eliminate some high- and low-valued observations and calculate the range from the remaining values.
- Interquartile range = 3rd quartile - 1st quartile

Interquartile Range
- Developed by John Tukey, the founder of EDA (exploratory data analysis).
- Doesn't take into account all your data, so it is not used that much.
[Figure: diagram marking the IQR between the first and third quartiles]
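A small sketch (not from the slides) of how the IQR comes from the quartiles in R; the vector x is made up for illustration, and quantile() supports several quartile conventions, so hand calculations can differ slightly from its default.

x <- c(1, 3, 5, 8, 8, 10, 12, 15, 100)   # hypothetical data with one large value
quantile(x, c(0.25, 0.75))               # 1st and 3rd quartiles
IQR(x)                                   # 3rd quartile minus 1st quartile
diff(range(x))                           # compare: the plain range is dominated by the 100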
Example: Haircut Data Again
> summary(mydata$haircut)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   15.00   21.50   32.21   40.00  250.00
IQR = 40 - 15 = 25
> IQR(mydata$haircut)
[1] 25

How should we measure variability?
The basic idea is to view variability in terms of distance between each measurement and the mean. A natural measure of dispersion is to calculate the average distance all the observations are from the center of the data:
  spread = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})
Is this a good measure of dispersion? No, it's horrible. Any idea why?

Distance from a Fixed Point
- We can think of a measure of spread as an average distance, like the average distance everyone lives from the Science Center.
- Say this average value is 1 mile. Then if you live less than 1 mile from the Science Center you realize you are closer than a lot of your fellow students, and if you live 20 miles away you know you are an outlier.

Distance Has to be Positive
- We know that distance can't be negative; that is, if you live north of the SC you are positive miles away and south of the SC you are negative miles away.
- But this spread formula doesn't know that; it just takes the difference between each value and the mean, which could result in negative or positive numbers.
- In fact, this formula always returns a value of 0:
  spread = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) = 0

Does anyone have a calculator?
- We need 3 numbers: x_1 = ___, x_2 = ___, x_3 = ___
- Calculate the mean = ___
- Now calculate
  spread = \frac{1}{3} \sum_{i=1}^{3} (x_i - \bar{x})

One Solution: Mean Absolute Deviation (MAD)
- One way to get rid of negative distances is by using absolute values.
- The Mean Absolute Deviation (MAD) of a data set is defined to be
  MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
- What are the units of MAD? Do people use it?
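A tiny sketch (hypothetical numbers, not from the slides) showing why the raw average deviation fails and how the absolute value fixes it:

x <- c(1, 3, 5, 8, 8)             # mean is 5
mean(x - mean(x))                 # essentially 0: positive and negative deviations cancel
mean(abs(x - mean(x)))            # the MAD defined on this slide: 2.4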
MAD for Haircut Data
- The function mad in R is not our MAD; it's a different, more complicated robust measure of spread.
- We can compute our MAD in R as follows (don't worry about the code):
  > mean(abs(mydata$haircut - mean(mydata$haircut)))
  [1] 22.38523
- The MAD for the haircut data is then $22.39.
- This is very close to the IQR = $25. Hmmm.

Another Solution: The Variance
The variance of a set of data is defined as
  s_X^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
We use n-1 instead of n for technical reasons that will be discussed later; you could divide by n, but n-1 is just better.
What practical significance can be attached to the variance as a measure of variability? Large variances imply a large amount of variation, but what constitutes large? The answer will appear in a few slides.

The variance of the haircut data is 1471.86. Yikes!! That seems like a pretty big number.
> var(mydata$haircut)
[1] 1471.866
What are the units of this number anyway?
A measure of spread should have the same units as the original data. In the salary example, the variance is measured in dollars squared.
What can we do to get back to our original units?

Standard Deviation
- Most commonly used measure of variation.
- Shows variation about the mean.
- Has the same units as the original data.
  s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
The standard deviation for the haircut data is $38.36, which still seems large, reflecting the wide spread in the data.
> var(mydata$haircut)
[1] 1471.866
> sd(mydata$haircut)
[1] 38.3649
> describe(mydata$haircut)
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 74 32.21 38.36   21.5   25.85 17.05   0 250   250 3.63     16.2 4.46
Actually, how we determine if a std dev is large or small is something we will discuss in the next class.
Why is the std dev a lot larger than the MAD or IQR?

Standard Deviation: a Measure of Risk?
- Standard deviation measures the spread of a data set, so it seems natural for financial instruments to say the higher the standard deviation, the riskier the asset.
- This can work, in that generally the higher the standard deviation the riskier the investment, but it does have some problems and you should keep these issues in mind.

Comparing can be difficult
- Manager 1 makes a 2% return every month.
- Manager 2 makes a -2% return every month.
- If you compare them using standard deviation, who is better?

Finance Example: Comparing Mutual Funds
Let's use means and sds to compare mutual funds. For 10 different assets we compute the mean and sd. Then plot mean vs sd.
The assets are:
[The asset names and the mean vs. sd plot appear as a figure in the slides.]

Some Tools so Far
- New toolbox additions:
  - Dotplot and Histograms
  - Summary Statistics (mean, median, std dev)
Shifting and Rescaling Data
- Original data: x_1, x_2, ..., x_n
- Linear Transformation: Y_i = a + b X_i  (a shifts the data; b changes the scale)
- Linear transformations do not change the shape of the data distribution, but do change the center and spread.

Examples
Examples of changing units:
1. from feet (x) to inches (y): y = 12x
2. from dollars (x) to cents (y): y = 100x
3. from degrees Celsius (x) to degrees Fahrenheit (y): y = 32 + (9/5)x
4. from ACT (x) to SAT (y): y = 150 + 40x
5. from inches (x) to centimeters (y): y = 2.54x
Linear Transformations (a + bX rule)
- The mean and variance of a data set have two interesting properties. These properties occur when one shifts a data set, or multiplies by a value (expands or contracts a data set):
  Average(a + bX) = a + b * Average(X)
  Var(a + bX) = b^2 * Var(X)

Effects of Linear Transformations
- Your transformation: y = a + b*x
- mean_new = a + b*mean
- median_new = a + b*median
- stdev_new = |b|*stdev
- IQR_new = |b|*IQR
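A quick numerical check of the a + bX rule (my own sketch, with made-up a, b, and data):

set.seed(1)
x <- rnorm(50)                      # hypothetical data
a <- 2; b <- -3
y <- a + b * x
c(a + b * mean(x), mean(y))         # the two means agree up to rounding error
c(b^2 * var(x), var(y))             # Var(a + bX) = b^2 Var(X)
c(abs(b) * sd(x), sd(y))            # the sd picks up |b|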

Example
- Winter temperature recorded in Fahrenheit:
  mean = 20, stdev = 10, median = 22, IQR = 11
- Convert into Celsius using C = (5/9)F - 17.78:
  mean   = -17.78 + (5/9)(20) = -6.67 C
  stdev  = (5/9)(10) = 5.56
  median = -17.78 + (5/9)(22) = -5.56 C
  IQR    = (5/9)(11) = 6.11

Example
Descriptive Statistics: X, 2X, -4X, X+2, X-1, -2X+1, 0.5X-2

Variable   N     Mean   Median   TrMean    StDev  SE Mean
X         50   0.4695   0.4428   0.4685   0.2880   0.0407
2X        50   0.9389   0.8856   0.9370   0.5760   0.0815
-4X       50   -1.878   -1.771   -1.874    1.152    0.163
X+2       50   2.4695   2.4428   2.4685   0.2880   0.0407
X-1       50  -0.5305  -0.5572  -0.5315   0.2880   0.0407
-2X+1     50   0.0611   0.1144   0.0630   0.5760   0.0815
0.5X-2    50  -1.7653  -1.7786  -1.7657   0.1440   0.0204
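The table above can be reproduced in spirit with a short R sketch (the data here are simulated, not the original 50 values, so the numbers will differ):

set.seed(104)                                         # hypothetical seed
x <- runif(50)                                        # stand-in for the original X
vars <- list(X = x, `2X` = 2*x, `-4X` = -4*x, `X+2` = x + 2,
             `X-1` = x - 1, `-2X+1` = -2*x + 1, `0.5X-2` = 0.5*x - 2)
t(sapply(vars, function(v) c(Mean = mean(v), Median = median(v), StDev = sd(v))))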

The Most Common Linear Transformation
- The Z-score is a common linear transformation:
  z = \frac{X - \bar{X}}{S}
- By z-scoring a data set, the new data set will have mean 0 and variance 1.
- The z-score is the number of standard deviations a raw score (individual score) deviates from the mean.

Using Zs to Compare Values
- Since z-scores reflect how far a score is from the mean, they are a good way to standardize scores.
- We can take any distribution and express all the values as z-scores (distances from the mean). So, no matter the scale we originally used to measure the variable, it will be expressed in a standard form.
- This standard form puts different scales on the same scale, so that values from two different distributions can be directly compared.
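A minimal sketch of z-scoring in R (hypothetical data); base R's scale() does the same centering and scaling:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # hypothetical data
z <- (x - mean(x)) / sd(x)            # z-scores by hand
round(c(mean(z), sd(z)), 10)          # mean 0 and sd 1
all.equal(z, as.numeric(scale(x)))    # scale() gives the same values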
Used for Comparison Purposes
- Mary's ACT score is 26. Jason's SAT score is 900. Who did better?
- The mean SAT score is 1000 with a standard deviation of 100 SAT points.
- The mean ACT score is 22 with a standard deviation of 2 ACT points.

Calculate the Z-scores
  Jason: (900 - 1000)/100 = -1
  Mary:  (26 - 22)/2 = +2
- From these findings, we gather that Jason's score is 1 standard deviation below the mean SAT score and Mary's score is 2 standard deviations above the mean ACT score.
- Therefore, Mary's score is relatively better.
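The same arithmetic in R, just for the record (numbers taken from the slide):

(900 - 1000) / 100    # Jason's SAT z-score: -1
(26 - 22) / 2         # Mary's ACT z-score: +2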

Interpreting Standard Deviation
We now have the two summaries:
  x̄    where the data is
  s_x  how spread out, or variable, the data is
The mean is pretty easy to understand. What are the units?
We know that the bigger s_x is, the more variable the data is, but how do we interpret the number? What is a big s_x, and what is a small one? What are the units of s_x?
The empirical rule will help us understand s_x and relate the summaries back to the dotplot (or histogram).

Rule of Thumb
The most basic analysis is to simply compare the value of the mean to the value of the standard deviation. Intuitively, what do you think the following data sets look like?

               x̄       s    spread
  Data Set 1   50       0    none / small / medium / large
  Data Set 2   50       3    none / small / medium / large
  Data Set 3   50      14    none / small / medium / large
  Data Set 4   50      42    none / small / medium / large
  Data Set 5   50    1000    none / small / medium / large


Empirical Rule
For mound-shaped data:
- Approximately 68% of the data is in the interval (x̄ - s_x, x̄ + s_x) = x̄ ± s_x
- Approximately 95% of the data is in the interval (x̄ - 2s_x, x̄ + 2s_x) = x̄ ± 2s_x

What good is the empirical rule again?
Empirical Rule Example
- A survey of 1000 U.S. gas stations was conducted and you were told the average price charged for a gallon of regular gas was $3.90 with a std dev of $0.20.
- You were also told the data is mound shaped.
- What can you deduce?

You Find ±2s in Many Places
Bollinger bands are x̄ ± 2s_x (based on a moving window of 20 time periods).
See http://en.wikipedia.org/wiki/Bollinger_Bands or take Stat 107.
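A hedged sketch of what you can deduce in the gas-station example (values from the slide; the intervals are my own arithmetic using the empirical rule):

xbar <- 3.90; s <- 0.20
xbar + c(-1, 1) * s     # about 68% of stations: $3.70 to $4.10
xbar + c(-2, 2) * s     # about 95% of stations: $3.50 to $4.30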

Don't Fall in Love with ±2s
- Standard deviation is a good measure of spread if your data is symmetric; if your data is not symmetric it really isn't interpretable.
- If your data is not symmetric, one needs to use Chebyshev's rule for interpreting the spread of your data.

Chebyshev's Rule
- For any set of data and for any number k greater than one, the proportion of the data that lies within k standard deviations of the mean is at least
  1 - \frac{1}{k^2}
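A one-line check of Chebyshev's bound for a few values of k (my own sketch):

k <- c(2, 3, 4)
1 - 1 / k^2     # at least 75%, 88.9%, and 93.75% of the data within k standard deviations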

So for (x̄ - 2s_x, x̄ + 2s_x) = x̄ ± 2s_x
- According to Chebyshev's Theorem, at least what fraction of the data falls within k (k = 2) standard deviations of the mean?
- At least 1 - 1/2^2 = 3/4 = 75% of the data falls within 2 standard deviations of the mean.
- Hey, that's not 95% of the data. Exactly!

Detecting Outliers
- The detection of outliers is important for a variety of reasons.
- One rather mundane reason is that they can help identify erroneously recorded results.
- We have already seen that even a single outlier can grossly affect the sample mean and variance, and of course we do not want a typing error to substantially alter or color our perceptions of the data.
- So it can be prudent to check for outliers, and if any are found, make sure they are valid.

Outliers are Naughty
- Resistant to outliers: Median, IQR
- Not resistant to outliers: Mean, Standard Deviation, Variance, Range

Effect of Outliers on Summary Stats
- Outliers can lead to too-high, too-low, or nearly correct estimates of the population mean, depending upon the number and location of the outliers (asymmetrical vs. symmetrical patterns).
- Outliers always lead to overestimates of the standard deviation.
[Figure: three panels labeled "Mean estimate is too high & std is overestimated", "Mean estimate is too low & std is overestimated", and "Mean estimate is right & std is overestimated"]
Classic Outlier Detection
- A classic outlier technique is to simply Z-score the data and declare any point an outlier if the magnitude of
  Z = \frac{X - \bar{X}}{s}
  is at least 2.
- The value 2 is motivated by the normal distribution that we will see in a few classes.

Example
- Consider the values 2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,1000
- For this data mean = 65.94 and s = 249.1
- The Z-score for the point 1000 is (1000 - 65.94)/249.1 = 3.75
- So 1000 is declared an outlier.
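The classic rule from these two slides can be written as a short R sketch (data copied from the slide; the cutoff of 2 is the one used above):

x <- c(rep(2, 5), rep(3, 5), rep(4, 5), 1000)   # the 16 values from the slide
z <- (x - mean(x)) / sd(x)
x[abs(z) >= 2]                                  # only 1000 is flagged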

Example
- Consider the data 2,2,3,3,3,4,4,4,100000,100000
- For this data mean = 20002.5 and s = 42162.38
- The Z-score for the point 100000 is (100000 - 20002.5)/42162.38 = 1.897
- So 100000 is NOT declared an outlier.

Yuck
- The classic method would not declare the value 100,000 an outlier even though it is certainly highly unusual relative to the other eight values.
- The problem is that both the sample mean and the sample standard deviation are sensitive to outliers, which can affect our detection ability.
- An outlier detection technique is said to suffer from masking if the very presence of outliers causes them to be missed.
Example
- Pedersen et al. (1998) conducted a study, a portion of which dealt with the sexual attitudes of undergraduate students.
- Among other things, the students were asked how many sexual partners they desired over the next 30 years.
- We look at the responses of 105 males.

A Histogram
> mydata=read.csv("https://goo.gl/e8nYDF")
> hist(mydata$x,main="Number of Partners Desired")

Summary Statistics
> describe(mydata$x)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   64.92    6.00 6000.00
> sum(mydata$x<mean(mydata$x))
[1] 102
- The mean is not very typical since 102 of the 105 people surveyed gave a response less than the mean.

Outliers
- One participant surveyed responded he wanted 6000 sexual partners over the next 30 years, which is clearly unusual compared to the other stat 100 students. Heck, it's unusual in general.
- Also, two gave the response 150, which again is unusual.
- For HW you will see that the 6000 is flagged as an outlier, but not the 150. Though it probably should be.
The Boxplot Rule
- One of the earliest improvements on the classic outlier detection rule is called the boxplot rule.
- It is based on the fundamental strategy of avoiding masking by replacing the mean and standard deviation with measures of location and dispersion that are relatively insensitive to outliers.

The Boxplot Rule
- In particular, the boxplot rule declares the value X an outlier if
  X < Q1 - 1.5(Q3 - Q1)   or   X > Q3 + 1.5(Q3 - Q1)
- So the rule is based on the lower and upper quartiles, as well as the interquartile range, which provide resistance to outliers.
Example
- Remember the sexual attitude data:
> describe(mydata$x)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   64.92    6.00 6000.00
- Outlier if > 6 + 1.5(6 - 1) = 13.5, so 12 points are now flagged as outliers instead of 1.

Outlier Detection in R
- Consider the following:
> mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat104/cars10.csv")
> head(mydata)
           make price mpg headroom trunk weight length turn displacement
1   AMC Concord  4099  22      2.5    11   2930    186   40          121
2     AMC Pacer  4749  17      3.0    11   3350    173   40          258
3    AMC Spirit  3799  22      3.0    12   2640    168   35          121
4 Buick Century  4816  20      4.5    16   3250    196   40          196
5 Buick Electra  7827  15      4.0    20   4080    222   43          350
6 Buick LeSabre  5788  18      4.0    21   3670    218   43          231
> attach(mydata)   ### this makes the variables directly available to us
Finding Outliers in R: Direct Method
- Consider the following (the boxplot rule):
> summary(price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3291    4220    5006    6165    6332   15910
> price[price>6332+1.5*IQR(price)]
 [1] 10372 11385 14500 15906 11497 13594 13466 10371  9690  9735 12990 11995
> price[price<4220-1.5*IQR(price)]
integer(0)

Finding Outliers in R: Direct Method
- Consider the following (the z-score rule):
> summary(price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3291    4220    5006    6165    6332   15906
> price[(price-mean(price))/sd(price) < -1.96]
integer(0)
> price[(price-mean(price))/sd(price) > 1.96]
[1] 14500 15906 13594 13466 12990 11995
Finding Outliers in R: Easy Method
- Consider the following for finding outliers based on the boxplot rule.
> boxplot.stats(price)$out    #### easier way to get the outliers
 [1] 10372 11385 14500 15906 11497 13594 13466 10371  9690  9735 12990 11995

How to Remove Outliers from the Data
- setdiff(a,b) removes b from a:
> IQR(price)
[1] 2112
> outliers=boxplot.stats(price)$out
> cleanprice=setdiff(price,outliers)
> IQR(cleanprice)
[1] 1596.5
> sd(price)
[1] 2949.496
> sd(cleanprice)
[1] 1166.073
> mean(price)
[1] 6165.257
> mean(cleanprice)
[1] 5011.742
Skewness
- A related idea to outliers is skewness (and one we always wonder about: do we really have outliers, or is the data skewed, or both?).
- Skewness measures the degree of asymmetry exhibited by the data:
  skewness = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}
- We will never calculate this by hand.

Values of Skewness
- A symmetric data set should have a skewness value near 0.
- Negative values for the skewness indicate data that are skewed left, and positive values for the skewness indicate data that are skewed right.
- By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
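Base R has no built-in skewness function (the describe() output used in these slides, presumably from the psych package, reports a skew column computed with a similar but not necessarily identical convention), so here is a minimal sketch of the statistic defined above:

skewness <- function(x) {           # my own helper, following the slide's formula
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
skewness(c(2, 2, 3, 3, 3, 4, 4, 4, 100000, 100000))   # strongly positive: right skewed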

Skewness Example: Haircut Data
> describe(mydata$haircut)
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 74 32.21 38.36   21.5   25.85 17.05   0 250   250 3.63     16.2 4.46
> describe(mydata$haircut[mydata$haircut<150])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 72 26.86 20.49     20   24.67 14.83   0  85    85    1     0.52 2.41
> describe(mydata$haircut[mydata$haircut<100])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 72 26.86 20.49     20   24.67 14.83   0  85    85    1     0.52 2.41
> describe(mydata$haircut[mydata$haircut<50])
   vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 62 20.42 12.78     17   20.21 10.38   0  48    48 0.24    -0.77 1.62

Example: Sexual Partners
> describe(sexpart)
   vars   n  mean     sd median trimmed  mad min  max range skew kurtosis    se
X1    1 105 64.92 585.16      1    3.66 1.48   0 6000  6000 9.94    97.79 57.11
> describe(sexpart[sexpart<150])
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 102 5.07 7.85      1    3.27 1.48   0  45    45 2.95     9.84 0.78
> describe(sexpart[sexpart<10])
   vars  n mean  sd median trimmed mad min max range skew kurtosis   se
X1    1 84  2.2 2.1      1    1.84   0   0   9     9 1.49     1.49 0.23

Remember Data is Time Dependent
> library(quantmod)
> getSymbols("AAPL")
[1] "AAPL"
> aaplret=dailyReturn(Ad(AAPL))
> describe(aaplret)
              vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis se
daily.returns    1 2630    0 0.02      0       0 0.01 -0.18 0.14  0.32 -0.19     6.29  0

Remember Data is Time Dependent
> aaplret=monthlyReturn(Ad(AAPL))
> describe(aaplret)
                vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
monthly.returns    1 126 0.03 0.09   0.03    0.03 0.07 -0.33 0.24  0.57 -0.69     2.17 0.01

Transforming Skewed Data
- When a distribution is skewed, it can be hard to summarize the data simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched-out tail.
- How can we say anything useful about such data? The secret is to apply a simple function to each data value.

Nonlinear Transformations
- Sometimes there is a need to transform our data in a nonlinear way: Y = sqrt(X), Y = log(X), Y = 1/X, etc.
- This is usually done to try to symmetrize the data distribution and improve its fit to the assumptions of statistical analysis (this will make more sense in a few weeks).
- Basically, to reduce outliers in the data and/or reduce skewness.

Your Dream Job
- Consider the graph below, which shows 2005 CEO data for the Fortune 500. The data is in thousands of dollars.
[Figure: histogram of 2005 Fortune 500 CEO total compensation]

The Data is Heavily Skewed
- Skewed distributions are difficult to summarize. It's hard to know what we mean by the center of a skewed distribution, so it's not obvious what value to use to summarize the distribution.
- What would you say was a typical CEO total compensation? The mean value is $10,307,000, while the median is only $4,700,000.

Log the Data
- One way to make a skewed distribution more symmetric is to re-express, or transform, the data by applying a simple function to all the data values.
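A hedged sketch of the re-expression idea (simulated right-skewed data, not the actual CEO compensation file):

set.seed(2005)                                   # hypothetical seed
comp <- rlnorm(500, meanlog = 8, sdlog = 1)      # simulated skewed "compensation" values
c(mean(comp), median(comp))                      # mean well above median, as on the slide
hist(comp, main = "Raw data: heavily right skewed")
hist(log(comp), main = "Logged data: much more symmetric")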

The Transform Cheat Sheet
- Calculate the skewness statistic for your data set.
- If |skewness| < 0.8, the data set is cool and unlikely to disrupt our analysis.
- Otherwise, try a transformation in the ladder of powers.

Today's Tools
- New toolbox additions:
  - Transformations, Skewness, Outliers
  - Empirical Rule

Things You Should Know
- Empirical Rule, Chebyshev's Rule
- a + bX rule
- Z-scoring
- Detecting Outliers
- Skewness and Transformations
