Day 3
Contents
1 The concept of distributions
2 How to express
3 How to estimate
3.1 Nonparametric
3.2 Parametric
1 The concept of distributions
2. a rule which, given any subset of values, tells us the probability of the
random variable taking a value from that subset.
Suppose that we make a list of all the people in a country and make a histogram
of their incomes, i.e., plot the number of people at each income level against the
income levels. We should get a histogram like this
[Figure: income histogram of a typical country]
The shape of this histogram gives us an idea about the income distribution:
a few people are very rich, quite a lot are poor, and the bulk lie in the middle
income range.
It is a common practice in statistics to regard the distribution as the ultimate
truth behind the data. Most statistical procedures try to make inferences about the
underlying distribution based on the data.
2 How to express
There are various ways to express a distribution mathematically. A popular
way is by the probability density function (pdf), which is a nonnegative
function $f(x)$ with
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$
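For example, we can check numerically in R that the standard normal density integrates to 1:
integrate(dnorm, -Inf, Inf)   # should report 1 (with a tiny numerical error)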
Often we consider families of pdf’s such that all the members in the same family
have some common shape, and differ only in terms of some quantitative details
controlled by a small number of parameters. Familiarity with the shapes of
each family may be gained by using R as follows.
curve(dnorm(x,mean=0,sd=1),-5,5,ylim=c(0,1),ylab='')
title(main=expression(paste("N(",mu,", ",sigma^2,") densities")))
curve(dnorm(x,mean=0,sd=0.5),-5,5,add=T,col='red')
curve(dnorm(x,mean=0,sd=1.5),-5,5,add=T,col='blue')
curve(dnorm(x,mean=1,sd=1),-5,5,add=T,col='green')
legend("topright",col=c('black','red','blue','green'),lwd=1,
       legend=c(expression(paste(mu,'=0 ',sigma,'=1')),
                expression(paste(mu,'=0 ',sigma,'=0.5')),
                expression(paste(mu,'=0 ',sigma,'=1.5')),
                expression(paste(mu,'=1 ',sigma,'=1'))))
Exercise 1: Make a similar plot for the Exponential(λ) distribution, which has p.d.f.
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0,\\ 0 & \text{otherwise.}\end{cases}$$
The R function to compute this pdf is dexp(x,rate), where rate is what we have
called λ(> 0).
3 How to estimate
A typical statistical analysis does not start with a distribution, but rather with a
data set. We often make the assumption that the data set comes from some
unknown distribution, and try to estimate the distribution from the data. Some
estimation procedures do not assume any knowledge about the family of the
distribution. These are called the nonparametric techniques. The parametric
techniques, on the other hand, assume a known family and try to estimate the
parameters to choose the most suitable member of the family.
3.1 Nonparametric
A simple graphical method to estimate a distribution is to draw the histogram.
Let’s first load the SDSS quasar data set.
# Construct large and small samples of SDSS quasar redshifts and r-i colors
qso = read.table(
"http://astrostatistics.psu.edu/datasets/SDSS_QSO.dat",
head=T,fill=T)
qso = na.omit(qso)
dim(qso)
names(qso)
summary(qso)
z.all = qso[,4]
r.i.all = qso[,9]-qso[,11]
SDSS.all = data.frame(z.all,r.i.all)
oldpar = par(mfrow=c(1,3))
hist(z.all, main='', xlab='Redshift') #Default binning
hist(z.all, breaks=50, main='', xlab='Redshift') #About 50 bins
hist(z.all, breaks='scott', main='', xlab='Redshift') #Scott's rule
par(oldpar)
The breaks='scott' option means the bins are chosen by an algorithm called
Scott's rule (we hardly need to worry about what that is at this stage).
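(For the record, Scott's rule takes the bin width to be roughly $3.5\,s\,n^{-1/3}$, where $s$ is the sample standard deviation and $n$ the number of data points.)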
A histogram is often used to approximate a pdf. Since the area under a pdf is
1, it is sometimes desirable to scale the histogram appropriately to make the
area 1. This may be done using the prob=T option:
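A minimal example, using the redshift vector z.all loaded above:
hist(z.all, breaks='scott', prob=TRUE, main='', xlab='Redshift')  # histogram scaled to have area 1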
This option is useful when we want to overlay a pdf curve on the histogram, as
we shall do later.
Another approach, which results in a plot that is slightly harder to interpret, is to
plot the percentiles (the empirical cumulative distribution function).
plot(ecdf(z.all[1:10]))
# Milky Way (GC1) and M31 (GC2) globular cluster K magnitudes
GC1 = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_MWG.dat",   # presumed file name
header=T)
GC2 = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat",
header=T)
K1 = GC1[,2]
K2 = GC2[,2]
The ecdf function, however, does not provide any confidence band for the
estimated cdf. For that we could use the ecdf.ksCI function from the sfsmisc
package.
install.packages('sfsmisc')
library('sfsmisc')
ecdf.ksCI(K1)
z.5 = z.all[1:5]   # a tiny subsample of five redshifts
plot(0,0,xlim=range(z.5),ylim=c(0,10),xlab='z',ylab='density',ty='n')
# Draw a narrow Gaussian kernel (sd = 0.05) centred at each data point
for(val in z.5) curve(dnorm(x,mean=val,sd=0.05),xlim=range(z.5),add=T)
rug(z.5) #Adds tiny vertical bars on the horizontal axis
         #to mark each point
zgrid = seq(min(z.5),max(z.5),len=100)
# The kernel density estimate: at each grid point, average the five kernel values
kd = sapply(zgrid,
            function(x) mean(dnorm(x,mean=z.5, sd=0.05)))
lines(zgrid,kd,lwd=3,xlim=range(z.5))
A more sophisticated version of the same thing is done by the density function
of R. However, we prefer to use the external package sm since it also finds
confidence bands. We shall work with the following subset of the data.
z.200 = z.all[1:200]
install.packages('sm')
library(sm)
?sm.density #looking up help
dens = sm.density(z.200, h=bw.nrd(z.200)) #Choosing h automatically
lines(dens$eval.points, dens$upper, col=2)
lines(dens$eval.points, dens$lower, col=2)
3.2 Parametric
Here we have to first choose a family of distributions (typically based on the
underlying theory or the histogram). If the chosen family is a standard one then
the fitdistr function from the MASS package can hopefully pick out the best
fitting member of the family.
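A sketch of how such a fit might be set up for the Milky Way globular cluster K magnitudes (assuming the vector K1 loaded earlier; normfit_MWG and kseq are the names used by the overlay command that follows):
library(MASS)                              # provides fitdistr
normfit_MWG = fitdistr(K1, 'normal')       # maximum-likelihood fit of a normal distribution
normfit_MWG                                # estimated mean and sd, with standard errors
hist(K1, prob=T, main='', xlab='K (mag)')  # area-1 histogram to overlay the fitted density on
kseq = seq(min(K1), max(K1), length=200)   # grid of K values for drawing the curve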
lines(kseq,
      dnorm(kseq,
            mean=normfit_MWG$estimate[[1]],
            sd=normfit_MWG$estimate[[2]]))
Exercise 2: Repeat the same process for the M31 globular cluster K magnitudes.
The data set is at
http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat
A small p-value implies that the given number does not go well with
the given distribution.
1-pnorm(4,mean=1,sd=2)   # upper-tail probability P(X > 4) for X ~ N(1, 2^2)
• a d- function: the pdf/pmf function,
• a q- function: the inverse distribution function, also called the quantile
function,
• an r- function: the random number generator function.
You may like to look up the help pages of runif, dlnorm, pexp and qgamma.
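For example, the four functions for the standard normal distribution work as follows:
dnorm(0)       # density of N(0,1) at 0
pnorm(1.96)    # cumulative probability P(X <= 1.96)
qnorm(0.975)   # the 97.5% quantile, i.e., the inverse of pnorm
rnorm(5)       # five random draws from N(0,1)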
If the distribution is not available mathematically, but we have only a sample
from the distribution, then we can proceed as follows.
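Presumably the idea is to resample the observed values with replacement, i.e., to simulate from the empirical distribution. A minimal sketch, assuming the available sample is stored in a vector x:
x = rnorm(25)                            # stand-in for the observed sample
newdata = sample(x, 1000, replace=TRUE)  # 1000 draws from the empirical distribution of x
hist(newdata, prob=TRUE)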
This technique sits at the heart of all randomized and bootstrap tests, as we
shall see soon.
It is standard practice for statistical software to report the results of tests in
terms of p-values.
Graphical methods are easy to interpret, but they introduce an element of
subjectivity that is often undesirable. So a class of objective tests has been devised
for testing a sample for normality.
asteroids = read.table(
"http://astrostatistics.psu.edu/MSMA/datasets/asteroid_dens.dat",
header=T)
dens = asteroids[,2]
summary(dens)
Now we shall apply a standard test for normality called the Shapiro-Wilk test. This
test, like any other statistical test, first summarizes the data set into a single
number generically called the test statistic. The test statistic of the Shapiro-Wilk
test is denoted by W. The designers of the test have worked out the distribution
of W under the assumption that the data set actually comes from a normal
distribution. The observed value of W is then compared against this distribution
using the p-value technique given earlier.
shapiro.test(dens)
The Shapiro-Wilk test is by no means the only test for normality. The package
nortest contains a slew of other tests.
install.packages('nortest')
library('nortest')
ad.test(dens)      # Anderson-Darling test
cvm.test(dens)     # Cramer-von Mises test
lillie.test(dens)  # Lilliefors (Kolmogorov-Smirnov) test
pearson.test(dens) # Pearson chi-square test
install.packages('outliers')
library('outliers')
dixon.test(dens)            # Dixon test for a single outlier
chisq.out.test(dens)        # chi-squared test for one outlier
grubbs.test(dens)           # Grubbs test for one outlier
grubbs.test(dens,type=20)   # Grubbs test for two outliers on the same tail
This test has problems with ties (i.e., repeated data values).
Visual comparison of the two histograms is of course one simple (and crude)
way to compare the two samples.
par(mfrow=c(1,2))
hist(K1)
hist(K2)
par(mfrow=c(1,1))
Notice that the shapes are quite similar in both cases. However, the horizontal
scales are shifted.
We can also plot a 2-sample quantile-quantile plot, which plots the quantiles of
one sample against those of the other.
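For example, with the two K-magnitude samples (a sketch using the base-R qqplot function):
qqplot(K1, K2, xlab='K1 (Milky Way)', ylab='K2 (M31)')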
The linear pattern again indicates the similarity of the two distributions.
Indeed, the linear pattern means that one may be obtained from the other by
some shifting and/or scaling. Let's try shifting first. We need to have an idea
as to how much we should shift by. Here are three ways:
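For illustration, one natural choice (a sketch; not necessarily one of the three methods the handout had in mind) is the difference of the sample medians or means:
median(K2) - median(K1)   # shift suggested by the medians
mean(K2) - mean(K1)       # shift suggested by the means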
Does it require scaling also? To find out we can compare the linear pattern with
the y = x line:
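A sketch (using the 24.90 mag offset applied in the two-sample tests below):
qqplot(K1, K2 - 24.90)
abline(a=0, b=1, lty=2)   # the y = x reference line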
The slope is different from 1, so the two distributions still differ in scale.
We can also compare the distributions underlying two independent samples
using the 2-sample Kolmogorov-Smirnov test or the Mood Test.
# Two-sample tests for Milky Way and M31 globular clusters
ks.test(K1, K2-24.90) # with distance modulus offset removed
mood.test(K1, K2-24.90)
# Reload the SDSS quasar redshifts
qso = read.table(
"http://astrostatistics.psu.edu/datasets/SDSS_QSO.dat",
head=T,fill=T)
qso = na.omit(qso)
z.all = qso[,4]
install.packages('car')
library(car)
# Box-Cox power transforms of the redshifts for a range of lambda values
lambda = seq(-1,1,0.5)
tmp = sapply(lambda,function(x) bcPower(z.all,lambda=x))
par(mfrow=c(2,3))
sapply(1:ncol(tmp), function(i) hist(tmp[,i],main=lambda[i]))
par(mfrow=c(1,1))
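Here bcPower computes the Box-Cox power transform of the (positive) data,
$$x^{(\lambda)} = \begin{cases} (x^{\lambda} - 1)/\lambda & \lambda \neq 0,\\ \log x & \lambda = 0,\end{cases}$$
a standard family of transformations used to make skewed data look more normal.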
Exercise 3: Instead of drawing histograms, why not apply the Shapiro-Wilk test
for each value of λ and choose the value yielding the highest p-value? Implement this
idea by applying shapiro.test to the columns of tmp.
6 Simulation: generating data from a distribution
A distribution is an idealized mathematical description of the behaviour of a
sample. Thus, in a typical statistical analysis, the sample comes first, and a
distribution is postulated by the scientist to provide an analytically tractable way to
represent the behaviour of the sample. But there are situations where we start
with a distribution and want to have a sample from it:
6.2 Simulation to incorporate error bars
Jittering data is a simple way to incorporate the effects of error bars. We shall
illustrate this with a regression example in the next lecture.
For now, here is a skeleton of the approach. Don’t type this verbatim!
# Jitter each central value cen uniformly within its error bar [cen-lo, cen+hi]
simerr = function(cen,lo,hi=lo)
    runif(length(cen),min=cen-lo,max=cen+hi)
tmp = sapply(1:100,
function(dummy) {
x=simerr(......)
y=simerr(......)
......
})
gaps = rexp(999,rate=1)   # gaps between successive particle arrivals (999 gaps for 1000 particles)
The question now is to determine which of the particles will fail to be detected.
A particle is missed if it comes within 1 sec of its predecessor. So the number
of missed particles is precisely the number of gaps that are below 1.
missList = c()
for(i in 1:50) {            # repeat the simulation 50 times
    gaps = rexp(999,rate=1)
    miss = sum(gaps < 1)    # gaps shorter than 1 sec, i.e., missed particles
    missList = c(missList,miss)
}
mean(missList)
var(missList)
All these are done for rate = 1. For other rates we need to repeat the entire
process afresh. So we shall use yet another for loop.
rates = seq(0.1,3,0.1)
mnList = c()
vrList = c()
for(lambda in rates) {
missList = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
miss = sum(gaps < 1)
missList = c(missList,miss)
}
mn = mean(missList)
vr = var(missList)
mnList = c(mnList,mn)
vrList = c(vrList, vr)
}
Now we can finally make the plot.
plot(rates,mnList/10,ty="l",ylab="% missed")  # counts out of ~1000 particles, so /10 gives a percentage
up = mnList + 2*sqrt(vrList)                  # mean +/- 2 standard deviations
lo = mnList - 2*sqrt(vrList)
lines(rates,up/10,col="red")
lines(rates,lo/10,col="blue")
Exercise 4: (This exercise is difficult.) We could estimate the actual rate from
the hitting times of the particles if the counter were perfect: the formula is to take
the reciprocal of the average of the gaps. But if we apply the same formula to the
imperfect counter, then we may get a wrong estimate of the true rate. We want to
make a correction graph that may be used to correct the underestimate.
Explain why the following R program achieves this. You will need to look up the
online help of cumsum and diff.
rates = seq(0.1,3,0.1)
avgUnderEst = c()
for(lambda in rates) {
underEst = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
hits = cumsum(gaps)
obsHits = c(hits[1], hits[gaps>1])
obsGaps = diff(obsHits)
underEst = c(underEst,1/mean(obsGaps))
}
avgUnderEst = c(avgUnderEst,mean(underEst))
}
plot(avgUnderEst,rates,ty="l")
6.4 Bootstrap
Use of repeated sampling is very common in every walk of science to get an
idea about sampling error (confidence intervals, error bars, standard errors etc).
Imagine the following situation. You have a data set. You perform some analysis
to get some numbers as the result. You want to have an idea about the sampling
error of your result. The ideal way is by repeating the entire process (starting
with fresh data collection each time). It is easy to imagine situations where you
simply cannot afford to do this. What then?
One simple way out is as follows. Take the data that you have, estimate the
distribution from it using any of the ways discussed earlier, then simulate fresh
data from this estimated distribution. This technique is called bootstrapping.
Owing to the simplicity of the idea and its wide applicability, it is one of the
most powerful tools in the repertoire of a statistician. Depending on how the
distribution function is estimated, there are two types of bootstrap: parametric
and nonparametric. The nonparametric bootstrap, which uses the empirical cdf to
estimate the distribution, is the most common form. We shall discuss a simple
example next. More complex applications will follow later.
Astronomical data sets are often riddled with outliers (values that are far from
the rest). To get rid of such outliers one sometimes ignores a few of the extreme
values. One such method is the trimmed mean.
x = c(1,2,3,4,5,6,100,7,8,9)
mean(x)
mean(x,trim=0.1) #ignore the extreme 10% points from BOTH ends
mean(x[x!=1 & x!=100]) #the same thing by hand: drop the smallest (1) and largest (100) values
We shall compute the 10%-trimmed mean of Vmag from the Hipparcos data set.
hip = read.table(
'http://astrostatistics.psu.edu/MSMA/datasets/HIP.dat',
head=T,fill=T)
hip = na.omit(hip)
attach(hip)
mean(Vmag,trim=0.1)
We want to estimate the standard error of this estimate. For the ordinary mean
we have a simple formula, but unfortunately no such simple formula exists for the
trimmed mean. So we shall use the bootstrap here, as follows: we shall generate
100 resamples, each of the same size as the original sample.
trmean = sapply(1:100,
        function(dummy) {
            resamp = sample(Vmag,length(Vmag),replace=T)  # resample with replacement
            mean(resamp,trim=0.1)                         # trimmed mean of the resample
        })
sd(trmean)
A smarter way to achieve the same thing is by using the boot package.
library(boot)
result = boot(Vmag,function(d,w) mean(d[w],trim=0.1), 100)  # statistic = trimmed mean of the resample indexed by w
result
plot(result)
1.
curve(dexp(x,rate=1),0,5,ylim=c(0,3),ylab='')
title(main=expression(paste("Expo(",lambda,") densities")))
curve(dexp(x,rate=2),0,5,add=T,col='red')
curve(dexp(x,rate=3),0,5,add=T,col='blue')
legend("topright",col=c('black','red','blue'),lwd=1,
       legend=c(expression(paste(lambda,'=1')),
                expression(paste(lambda,'=2')),
                expression(paste(lambda,'=3'))))
2.
GC_M31 = read.table(
'http://astrostatistics.psu.edu/MSMA/datasets/GlobClus_M31.dat',
header=T)
KGC_M31 = GC_M31[,2]
# Fitting steps, by analogy with the Milky Way case (normfit_M31 and kseq as assumed names):
normfit_M31 = fitdistr(KGC_M31, 'normal')
hist(KGC_M31, prob=T, main='', xlab='K (mag)')
kseq = seq(min(KGC_M31), max(KGC_M31), length=200)
lines(kseq,dnorm(kseq,
                 mean=normfit_M31$estimate[[1]],
                 sd=normfit_M31$estimate[[2]]))
3. The plot is not the graph of a function as it doubles back over itself. This is
because if we find too low a count, then this could mean two things: either the
number of hits is really too low, or the number is too high causing the detector to
remain jammed most of the time. Remember that even an undetected particle
can prolong the period of jamming.