Day 3
Contents
1 The concept of distributions
2 How to express
3 How to estimate
3.1 Nonparametric
3.2 Parametric
1 The concept of distributions
2. a rule which, given any subset of values, tells us the probability of the
random variable taking a value from that subset.
Suppose that we make a list of all the people in a country and make a histogram
of their incomes, i.e., plot the number of people at each income level against the
income levels. We should get a histogram like this
[Figure: income histogram of a typical country]
The shape of this histogram gives us an idea about the income distribution:
a few people are very rich, quite a lot are poor, and the bulk lie in the middle
income range.
It is a common practice in statistics to regard the distribution as the ultimate
truth behind the data. Most statistical procedures try to make inferences about the
underlying distribution based on the data.
2 How to express
There are various ways to express a distribution mathematically. A popular
way is by the probability density function (pdf), which is a nonnegative
function $f(x)$ with
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$
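For example, we can check numerically in R that the standard normal density integrates to 1:
integrate(dnorm, -Inf, Inf)   # should report 1 (with a tiny numerical error)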
Often we consider families of pdf’s such that all the members in the same family
have some common shape, and differ only in terms of some quantitative details
controlled by a small number of parameters. Familiarity with the shapes of
each family may be gained by using R as follows.
curve(dnorm(x,mean=0,sd=1),-5,5,ylim=c(0,1),ylab='')
title(main=expression(paste("N(",mu,", ",sigma^2,") densities")))
curve(dnorm(x,mean=0,sd=0.5),-5,5,add=T,col='red')
curve(dnorm(x,mean=0,sd=1.5),-5,5,add=T,col='blue')
curve(dnorm(x,mean=1,sd=1),-5,5,add=T,col='green')
legend("topright",col=c('black','red','blue','green'),lwd=1,
       legend=c(expression(paste(mu,'=0 ',sigma,'=1')),
                expression(paste(mu,'=0 ',sigma,'=0.5')),
                expression(paste(mu,'=0 ',sigma,'=1.5')),
                expression(paste(mu,'=1 ',sigma,'=1'))))
Exercise 1: Make a similar plot for the Exponential(λ) distribution, which has p.d.f.
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0,\\ 0 & \text{otherwise.}\end{cases}$$
The R function to compute this pdf is dexp(x,rate), where rate is what we have
called λ(> 0).
3 How to estimate
A typical statistical analysis does not start with a distribution, but rather with a
data set. We often make the assumption that the data set comes from some
unknown distribution, and try to estimate the distribution from the data. Some
estimation procedures do not assume any knowledge about the family of the
distribution. These are called the nonparametric techniques. The parametric
techniques, on the other hand, assume a known family and try to estimate the
parameters to choose the most suitable member of the family.
3.1 Nonparametric
A simple graphical method to estimate a distribution is to draw the histogram.
Let’s first load the SDSS quasar data set.
# Construct large and small samples of SDSS quasar redshifts and r-i colors
qso = read.table(
"http://astrostatistics.psu.edu/datasets/SDSS_QSO.dat",
head=T,fill=T)
qso = na.omit(qso)
dim(qso)
names(qso)
summary(qso)
z.all = qso[,4]
r.i.all = qso[,9]-qso[,11]
SDSS.all = data.frame(z.all,r.i.all)
oldpar = par(mfrow=c(1,3))
hist(z.all, main='', xlab='Redshift') #Default binning
hist(z.all, breaks=50, main='', xlab='Redshift') #About 50 bins
hist(z.all, breaks='scott', main='', xlab='Redshift') #Scott's rule
par(oldpar)
The breaks='scott' option means the bins are chosen by an algorithm called
Scott's rule (we hardly need to worry about what that is at this stage).
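(For the record, Scott's rule takes the bin width to be roughly $3.5\,s\,n^{-1/3}$, where $s$ is the sample standard deviation and $n$ the number of data points.)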
A histogram is often used to approximate a pdf. Since the area under a pdf is
1, it is sometimes desirable to scale the histogram appropriately to make the
area 1. This may be done using the prob=T option:
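A minimal example, using the redshift vector z.all loaded above:
hist(z.all, breaks='scott', prob=TRUE, main='', xlab='Redshift')  # histogram scaled to have area 1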
This option is useful when we want to overlay a pdf curve on the histogram, as
we shall do later.
Another approach, which results in a plot that is slightly harder to interpret, is to
plot the percentiles (the empirical cumulative distribution function).
plot(ecdf(z.all[1:10]))
# Milky Way (GC1) and M31 (GC2) globular cluster K magnitudes
GC1 = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_MWG.dat",   # presumed file name
header=T)
GC2 = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat",
header=T)
K1 = GC1[,2]
K2 = GC2[,2]
The ecdf function, however, does not provide any confidence band for the
estimated cdf. For that we could use the ecdf.ksCI function from the sfsmisc
package.
install.packages('sfsmisc')
library('sfsmisc')
ecdf.ksCI(K1)
z.5 = z.all[1:5]   # a tiny subsample of five redshifts
plot(0,0,xlim=range(z.5),ylim=c(0,10),xlab='z',ylab='density',ty='n')
# Draw a narrow Gaussian kernel (sd = 0.05) centred at each data point
for(val in z.5) curve(dnorm(x,mean=val,sd=0.05),xlim=range(z.5),add=T)
rug(z.5) #Adds tiny vertical bars on the horizontal axis
         #to mark each point
zgrid = seq(min(z.5),max(z.5),len=100)
# The kernel density estimate: at each grid point, average the five kernel values
kd = sapply(zgrid,
            function(x) mean(dnorm(x,mean=z.5, sd=0.05)))
lines(zgrid,kd,lwd=3,xlim=range(z.5))
A more sophisticated version of the same thing is done by the density function
of R. However, we prefer to use the external package sm since it also finds
confidence bands. We shall work with the following subset of the data.
z.200 = z.all[1:200]
install.packages('sm')
library(sm)
?sm.density #looking up help
dens = sm.density(z.200, h=bw.nrd(z.200)) #Choosing h automatically
lines(dens$eval.points, dens$upper, col=2)
lines(dens$eval.points, dens$lower, col=2)
3.2 Parametric
Here we have to first choose a family of distributions (typically based on the
underlying theory or the histogram). If the chosen family is a standard one then
the fitdistr function from the MASS package can hopefully pick out the best
fitting member of the family.
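A sketch of how such a fit might be set up for the Milky Way globular cluster K magnitudes (assuming the vector K1 loaded earlier; normfit_MWG and kseq are the names used by the overlay command that follows):
library(MASS)                              # provides fitdistr
normfit_MWG = fitdistr(K1, 'normal')       # maximum-likelihood fit of a normal distribution
normfit_MWG                                # estimated mean and sd, with standard errors
hist(K1, prob=T, main='', xlab='K (mag)')  # area-1 histogram to overlay the fitted density on
kseq = seq(min(K1), max(K1), length=200)   # grid of K values for drawing the curve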
lines(kseq,
      dnorm(kseq,
            mean=normfit_MWG$estimate[[1]],
            sd=normfit_MWG$estimate[[2]]))
Exercise 2: Repeat the same process for the M31 globular cluster K magnitudes.
The data set is at
http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat
A small p-value implies that the given number does not go well with
the given distribution.
1-pnorm(4,mean=1,sd=2)   # upper-tail probability P(X > 4) for X ~ N(1, 2^2)
• a d- function: the pdf/pmf function,
• a q- function: the inverse distribution function, also called the quantile
function,
• an r- function: the random number generator function.
You may like to look up the help pages of runif, dlnorm, pexp and qgamma.
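For example, the four functions for the standard normal distribution work as follows:
dnorm(0)       # density of N(0,1) at 0
pnorm(1.96)    # cumulative probability P(X <= 1.96)
qnorm(0.975)   # the 97.5% quantile, i.e., the inverse of pnorm
rnorm(5)       # five random draws from N(0,1)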
If the distribution is not available mathematically, but we have only a sample
from the distribution, then we can proceed as follows.
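Presumably the idea is to resample the observed values with replacement, i.e., to simulate from the empirical distribution. A minimal sketch, assuming the available sample is stored in a vector x:
x = rnorm(25)                            # stand-in for the observed sample
newdata = sample(x, 1000, replace=TRUE)  # 1000 draws from the empirical distribution of x
hist(newdata, prob=TRUE)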
This technique sits at the heart of all randomized and bootstrap tests, as we
shall see soon.
It is standard practice for statistical software to report the results of tests in
terms of p-values.
Graphical methods are easy to interpret, but they introduce an element of
subjectivity that is often undesirable. So a class of objective tests has been devised
for testing a sample for normality.
asteroids = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/asteroid_dens.dat",
header=T)
dens = asteroids[,2]
summary(dens)
Now we shall apply a standard test for normality called the Shapiro-Wilk test. This
test, like any other statistical test, first summarizes the data set into a single
number generically called the test statistic. The test statistic of the Shapiro-Wilk
test is denoted by W. The designers of the test have worked out the distribution
of W under the assumption that the data set actually comes from a normal
distribution. The observed value of W is then compared against this distribution
using the p-value technique given earlier.
shapiro.test(dens)
The Shapiro-Wilk test is by no means the only test for normality. The package
nortest contains a slew of other tests.
install.packages('nortest')
library('nortest')
ad.test(dens)      # Anderson-Darling test
cvm.test(dens)     # Cramer-von Mises test
lillie.test(dens)  # Lilliefors (Kolmogorov-Smirnov) test
pearson.test(dens) # Pearson chi-square test
install.packages('outliers')
library('outliers')
dixon.test(dens)            # Dixon test for a single outlier
chisq.out.test(dens)        # chi-squared test for one outlier
grubbs.test(dens)           # Grubbs test for one outlier
grubbs.test(dens,type=20)   # Grubbs test for two outliers on the same tail
This test has problems with ties (i.e., repeated data values).
Visual comparison of the two histograms is of course one simple (and crude)
way to compare the two samples.
par(mfrow=c(1,2))
hist(K1)
hist(K2)
par(mfrow=c(1,1))
Notice that the shapes are quite similar in both cases. However, the horizontal
scales are shifted.
We can also plot a 2-sample quantile-quantile plot, which plots the quantiles of
one sample against those of the other.
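For example, with the two K-magnitude samples (a sketch using the base-R qqplot function):
qqplot(K1, K2, xlab='K1 (Milky Way)', ylab='K2 (M31)')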
The linear pattern again indicates the similarity of the two distributions.
Indeed, the linear pattern means that one may be obtained from the other by
some shifting and/or scaling. Let's try shifting first. We need to have an idea
as to how much we should shift by. Here are three ways:
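For illustration, one natural choice (a sketch; not necessarily one of the three methods the handout had in mind) is the difference of the sample medians or means:
median(K2) - median(K1)   # shift suggested by the medians
mean(K2) - mean(K1)       # shift suggested by the means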
Does it require scaling also? To find out we can compare the linear pattern with
the y = x line:
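A sketch (using the 24.90 mag offset applied in the two-sample tests below):
qqplot(K1, K2 - 24.90)
abline(a=0, b=1, lty=2)   # the y = x reference line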
The slope is different from 1, so the two distributions still differ in scale.
We can also compare the distributions underlying two independent samples
using the 2-sample Kolmogorov-Smirnov test or the Mood Test.
# Two-sample tests for Milky Way and M31 globular clusters
ks.test(K1, K2-24.90) # with distance modulus offset removed
mood.test(K1, K2-24.90)
# Reload the SDSS quasar redshifts
qso = read.table(
"http://astrostatistics.psu.edu/datasets/SDSS_QSO.dat",
head=T,fill=T)
qso = na.omit(qso)
z.all = qso[,4]
install.packages('car')
library(car)
# Box-Cox power transforms of the redshifts for a range of lambda values
lambda = seq(-1,1,0.5)
tmp = sapply(lambda,function(x) bcPower(z.all,lambda=x))
par(mfrow=c(2,3))
sapply(1:ncol(tmp), function(i) hist(tmp[,i],main=lambda[i]))
par(mfrow=c(1,1))
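Here bcPower computes the Box-Cox power transform of the (positive) data,
$$x^{(\lambda)} = \begin{cases} (x^{\lambda} - 1)/\lambda & \lambda \neq 0,\\ \log x & \lambda = 0,\end{cases}$$
a standard family of transformations used to make skewed data look more normal.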
Exercise 3: Instead of drawing histograms, why not apply the Shapiro-Wilk test
for each value of λ and choose the value yielding the highest p-value? Implement this
idea by applying shapiro.test to the columns of tmp.
6 Simulation: generating data from a distribution
A distribution is an idealized mathematical description of the behaviour of a
sample. Thus, in a typical statistical analysis, the sample comes first, and a
distribution is postulated by the scientist to provide an analytically tractable way to
represent the behaviour of the sample. But there are situations where we start
with a distribution and want to have a sample from it:
6.2 Simulation to incorporate error bars
Jittering data is a simple way to incorporate the effects of error bars. We shall
illustrate this with a regression example in the next lecture.
For now, here is a skeleton of the approach. Don’t type this verbatim!
# Jitter each central value cen uniformly within its error bar [cen-lo, cen+hi]
simerr = function(cen,lo,hi=lo)
    runif(length(cen),min=cen-lo,max=cen+hi)
tmp = sapply(1:100,
function(dummy) {
x=simerr(......)
y=simerr(......)
......
})
gaps = rexp(999,rate=1)   # gaps between successive particle arrivals (999 gaps for 1000 particles)
The question now is to determine which of the particles will fail to be detected.
A particle is missed if it comes within 1 sec of its predecessor. So the number
of missed particles is precisely the number of gaps that are below 1.
missList = c()
for(i in 1:50) {            # repeat the simulation 50 times
    gaps = rexp(999,rate=1)
    miss = sum(gaps < 1)    # gaps shorter than 1 sec, i.e., missed particles
    missList = c(missList,miss)
}
mean(missList)
var(missList)
All these are done for rate = 1. For other rates we need to repeat the entire
process afresh. So we shall use yet another for loop.
rates = seq(0.1,3,0.1)
mnList = c()
vrList = c()
for(lambda in rates) {
missList = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
miss = sum(gaps < 1)
missList = c(missList,miss)
}
mn = mean(missList)
vr = var(missList)
mnList = c(mnList,mn)
vrList = c(vrList, vr)
}
Now we can finally make the plot.
plot(rates,mnList/10,ty="l",ylab="% missed")  # counts out of ~1000 particles, so /10 gives a percentage
up = mnList + 2*sqrt(vrList)                  # mean +/- 2 standard deviations
lo = mnList - 2*sqrt(vrList)
lines(rates,up/10,col="red")
lines(rates,lo/10,col="blue")
Exercise 4: (This exercise is difficult.) We could estimate the actual rate from
the hitting times of the particles if the counter were perfect: the formula is to take
the reciprocal of the average of the gaps. But if we apply the same formula to the
imperfect counter, then we may get a wrong estimate of the true rate. We want to
make a correction graph that may be used to correct the underestimate.
Explain why the following R program achieves this. You will need to look up the
online help of cumsum and diff.
rates = seq(0.1,3,0.1)
avgUnderEst = c()
for(lambda in rates) {
underEst = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
hits = cumsum(gaps)
obsHits = c(hits[1], hits[gaps>1])
obsGaps = diff(obsHits)
underEst = c(underEst,1/mean(obsGaps))
}
avgUnderEst = c(avgUnderEst,mean(underEst))
}
plot(avgUnderEst,rates,ty="l")
6.4 Bootstrap
Use of repeated sampling is very common in every walk of science to get an
idea about sampling error (confidence intervals, error bars, standard errors etc).
Imagine the following situation. You have a data set. You perform some analysis
to get some numbers as the result. You want to have an idea about the sampling
error of your result. The ideal way is by repeating the entire process (starting
with fresh data collection each time). It is easy to imagine situations where you
simply cannot afford to do this. What then?
One simple way out is as follows. Take the data that you have, estimate the
distribution from it using any of the ways discussed earlier, then simulate fresh
data from this estimated distribution. This technique is called bootstrapping.
Owing to the simplicity of the idea and its wide applicability, it is one of the
most powerful tools in the repertoire of a statistician. Depending on how the
distribution function is estimated, there are two types of bootstrap: parametric
and nonparametric. The nonparametric bootstrap, which uses the empirical cdf to
estimate the distribution, is the most common form. We shall discuss a simple
example next. More complex applications will follow later.
Astronomical data sets are often riddled with outliers (values that are far from
the rest). To get rid of such outliers one sometimes ignores a few of the extreme
values. One such method is the trimmed mean.
x = c(1,2,3,4,5,6,100,7,8,9)
mean(x)
mean(x,trim=0.1) #ignore the extreme 10% points from BOTH ends
mean(x[x!=1 & x!=100]) #the same thing by hand: drop the smallest (1) and largest (100) values
We shall compute the 10%-trimmed mean of Vmag from the Hipparcos data set.
hip = read.table(
'http://astrostatistics.psu.edu/MSMA/datasets/HIP.dat',
head=T,fill=T)
hip = na.omit(hip)
attach(hip)
mean(Vmag,trim=0.1)
We want to estimate the standard error of this estimate. For the ordinary mean
we have a simple formula, but unfortunately no such simple formula exists for the
trimmed mean. So we shall use the bootstrap here, as follows: we shall generate
100 resamples, each of the same size as the original sample.
trmean = sapply(1:100,
        function(dummy) {
            resamp = sample(Vmag,length(Vmag),replace=T)  # resample with replacement
            mean(resamp,trim=0.1)                         # trimmed mean of the resample
        })
sd(trmean)
A smarter way to achieve the same thing is by using the boot package.
library(boot)
result = boot(Vmag,function(d,w) mean(d[w],trim=0.1), 100)  # statistic = trimmed mean of the resample indexed by w
result
plot(result)
1.
curve(dexp(x,rate=1),0,5,ylim=c(0,3),ylab='')
title(main=expression(paste("Expo(",lambda,") densities")))
curve(dexp(x,rate=2),0,5,add=T,col='red')
curve(dexp(x,rate=3),0,5,add=T,col='blue')
legend("topright",col=c('black','red','blue'),lwd=1,
       legend=c(expression(paste(lambda,'=1')),
                expression(paste(lambda,'=2')),
                expression(paste(lambda,'=3'))))
2.
GC_M31 = read.table(
'http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat',
header=T)
KGC_M31 = GC_M31[,2]
# Fitting steps, by analogy with the Milky Way case (normfit_M31 and kseq as assumed names):
normfit_M31 = fitdistr(KGC_M31, 'normal')
hist(KGC_M31, prob=T, main='', xlab='K (mag)')
kseq = seq(min(KGC_M31), max(KGC_M31), length=200)
lines(kseq,dnorm(kseq,
                 mean=normfit_M31$estimate[[1]],
                 sd=normfit_M31$estimate[[2]]))
3. The plot is not the graph of a function as it doubles back over itself. This is
because if we find too low a count, then this could mean two things: either the
number of hits is really too low, or the number is too high causing the detector to
remain jammed most of the time. Remember that even an undetected particle
can prolong the period of jamming.