AN INTRODUCTION TO R

DEEPAYAN SARKAR

Introduction to statistical inference

The goal of statistical inference is to study data to infer knowledge that goes beyond the immediate scope
of the data. One usually focuses on two kinds of inference: estimation and testing. We study various
methods to do inference. These can be “intuitive” or “common-sense” methods, or they can be rigorously
derived based on some criterion. Statisticians always like to study various optimality properties of the
methods they study. To make concrete statements about such optimality properties, we usually need to
make model assumptions about the data.
Let us start with an example we have already seen: three sets of observations on ten patients at an asylum.
The observations recorded are the average increase in the hours of sleep given three sleep-inducing drugs,
compared to a control average where no medication was given.

> extra.hyoscyamine <- c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
> extra.laevorotatory <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4)
> extra.racemic <- c(1.5, 1.4, 0.0, -0.7, 0.5, 5.1, 5.7, 1.5, 4.7, 3.5)

First question. Consider the extra.hyoscyamine observations, which represent the extra hours of sleep
on average when given the hyoscyamine drug. The mean increase is

> mean(extra.hyoscyamine)
[1] 0.75

However, we are not really interested in these 10 particular patients, but in the effect of the drug in the
“general population”. To make any statements about that, we need to first link this particular sample to
the population.
This is done by hypothesizing a statistical model. In this simple case, our model needs just one univariate
distribution; we will pretend that we have planned the experiment but not yet observed the data, and
denote the $n = 10$ observations we will see by the symbols $X_1, X_2, \ldots, X_n$. We will assume that these
quantities are independent random variables coming from a common but unknown distribution $F$. The
only unknown component of our model is $F$, which we refer to as the parameter of our model. The link
between the model and the actual observed data is completed by assuming that the observed data is one
realization of these random variables. If we repeated the experiment with $n$ other patients, the observed
numbers would be different, but they would be realizations from the same distribution $F$.
It is difficult to infer much about the unknown distribution $F$ in this model. However, some simple things
are possible; for example, we may be interested in the expected value of the $X_i$-s, given by
$$\mu = E(X_i) = \int_{-\infty}^{\infty} x f(x) \, dx$$

Date: October 2011.

where $f$ is the density function of the distribution $F$. We call $\mu$ the population mean. A common-sense
estimator of $\mu$ is the sample mean (0.75 for the data above). It also happens to be the least squares estimator, in the
sense that it is the value of $a$ that minimizes the squared-error loss
$$\sum_{i=1}^{n} (X_i - a)^2.$$
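
As a quick numerical check of the least-squares claim, the following sketch minimizes the loss over a with optimize() and compares the result with the sample mean. The helper name sq_loss and the search interval are choices made here for illustration only.

```r
# Extra hours of sleep under hyoscyamine (data from the text)
extra.hyoscyamine <- c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0)

# Squared-error loss as a function of the candidate value a
# (sq_loss is a hypothetical helper, not part of the original lab)
sq_loss <- function(a, x) sum((x - a)^2)

# Minimize numerically over an interval that covers the data
fit <- optimize(sq_loss, interval = range(extra.hyoscyamine),
                x = extra.hyoscyamine)

a_hat <- fit$minimum               # numerical minimizer of the loss
x_bar <- mean(extra.hyoscyamine)   # sample mean
c(a_hat = a_hat, x_bar = x_bar)    # the two should agree closely
```

The agreement is only up to the tolerance of the numerical optimizer; the exact result follows from setting the derivative of the loss with respect to a to zero.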

This estimator also has nice optimality properties: it is unbiased and has minimum variance among all
linear unbiased estimators. What does that mean? Notice that the sample mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is itself a random variable: it would have taken a different value if a different set of 10 patients had been
selected for the experiment. The probability distribution of $\bar{X}$ (which depends on $F$) has its own mean and
variance. $\bar{X}$ is unbiased because $E(\bar{X}) = \mu$, and it also has lower variance than any other (linear) unbiased
estimator of $\mu$.
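
The idea that the sample mean has its own distribution can be illustrated by simulation. The sketch below assumes, purely for illustration, that F is Normal with mean 1 and standard deviation 2; these values and the seed are not from the data. It also looks at the sample median, which estimates the same quantity for a symmetric F but is not a linear function of the observations.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
mu <- 1; sigma <- 2; n <- 10      # illustrative values (assumptions)

# 10000 realizations of the sample mean of n observations
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)    # should be close to mu (unbiasedness)
var(xbar)     # should be close to sigma^2 / n = 0.4

# Same experiment with the sample median as the estimator
xmed <- replicate(10000, median(rnorm(n, mean = mu, sd = sigma)))
var(xmed)     # typically larger than var(xbar) for Normal data
```

Under a different F (for example one with very heavy tails) the comparison between mean and median can come out differently, which is one reason optimality statements depend on the model.
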
There may be other features of F we may be interested in such as the median and variance; these also have
estimators with various optimality properties.
Second question. In our observed sample, X̄ has the value 0.75. Does this mean µ is exactly equal to
0.75? Of course not. Can we at least say that µ is positive (that is, giving hyoscyamine is better than
giving no drug at all)? Even that is not necessarily true.
To see why, let us do some simulation. It is expensive to perform experiments in real life, but it is cheap to
do them in a computer. Consider our model. We do not know the parameter F , but let us suppose for a
moment that F was the standard Normal distribution N (0, 1). Here is the mean of one sample of size 10
from the N (0, 1) distribution.

> mean(rnorm(10))
[1] -0.4310653

We can repeat this experiment several times to get

> replicate(20, mean(rnorm(10)))

[1] 0.353412449 0.066056723 -0.469304540 -0.477413625 0.070744760
[6] -0.194377583 -0.040043189 -0.529158686 0.199731778 0.385597958
[11] 0.696666885 -0.537843399 -0.367312810 0.001679262 -0.206758857
[16] -0.063791765 0.136722404 -0.422392426 -0.155119988 0.245495107

As we can see, X̄ is sometimes positive even when the true F has µ = 0. Thus, the fact that X̄ is positive
in our experiment does not imply that µ > 0. There is not much more we can say about this problem
unless we make further model assumptions.
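
To put a number on "sometimes positive", a larger simulation can estimate how often the sample mean is positive when F is N(0, 1). This is a sketch with an arbitrary seed and replication count; the second proportion relies on the assumed variance of 1, which is exactly the kind of assumption the general model does not justify.

```r
set.seed(2)                                   # arbitrary seed
xbar0 <- replicate(10000, mean(rnorm(10)))    # sample means when mu = 0

prop_positive <- mean(xbar0 > 0)      # how often the sample mean is positive
prop_large    <- mean(xbar0 >= 0.75)  # how often it is at least 0.75
c(prop_positive = prop_positive, prop_large = prop_large)
```

About half of the simulated means are positive, as symmetry suggests. How often a value as large as 0.75 occurs depends strongly on the spread of F, which the general model leaves unspecified.
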
A more specific model. Our previous model made very few assumptions: only that all observations are
independent and come from the same distribution. Not much inference can be done in such a general
setup. We will now make a much more specific model: we will assume that $F$ is a Normal distribution,
with mean $\mu$ and variance $\sigma^2$; that is,
$$X_i \sim N(\mu, \sigma^2), \quad i = 1, 2, \ldots, n$$
The question we are interested in is still whether $\mu$ is positive. The intuitive idea is that we would be more
inclined to believe this conjecture if in addition to $\bar{X}$ being positive, the variance is also small. The familiar
unbiased estimator of variance is
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Let us try our simulation approach again, assuming that $\mu = 0$. But looking at a very specific $F$ as we did
before (the standard Normal) will not allow us to make any general conclusions, so we would like to allow
$\sigma^2$ to be any positive quantity. However, this seems pointless, as we cannot simulate from $N(0, \sigma^2)$ if we do
not know $\sigma^2$. What can we do then?
The trick is to notice that if we scale $\bar{X}$ to obtain a new quantity
$$U = \frac{\bar{X}}{\sqrt{s^2}}$$
then the distribution of $U$ does not depend on $\sigma^2$.

Exercise 1. Let $X_1, \ldots, X_n \sim N(0, \sigma^2)$ independently. Let $Z_i = X_i / \sigma$, $i = 1, \ldots, n$. Then by definition,
$Z_i \sim N(0, 1)$. Let
$$U_X = \frac{\bar{X}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}}, \quad \text{and}$$
$$U_Z = \frac{\bar{Z}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}}$$
Prove that $U_X = U_Z$. As the distribution of $U_Z$ clearly does not depend on $\sigma^2$, the distribution of $U_X$ must
also not depend on $\sigma^2$.
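
A numerical check is not a proof, but it can make the claim in the exercise plausible before attempting it. The sketch below rescales one simulated standard Normal sample by a few arbitrary values of sigma and computes U each time; the helper name U_of and the particular sigma values are illustrative assumptions.

```r
set.seed(3)                        # arbitrary seed
z <- rnorm(10)                     # Z_i ~ N(0, 1)

# U computed from a sample v (hypothetical helper)
U_of <- function(v) mean(v) / sd(v)

sigmas <- c(0.1, 1, 25)            # arbitrary positive scale factors
u_vals <- sapply(sigmas, function(s) U_of(s * z))   # X_i = sigma * Z_i

u_vals                             # identical up to rounding error
U_of(z)
```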

This means that whatever value of $\sigma^2$ we use to simulate, the computed value of $U = \bar{X}/s$ will have the same
distribution; so we may as well use $\sigma^2 = 1$. Let us compute $U$ from 5000 simulations of our model.

> u <- function() {
      x <- rnorm(10, mean = 0, sd = 1)  # one sample of size 10 with mu = 0
      mean(x) / sd(x)                   # U = sample mean / sample sd
  }
> u5000 <- replicate(5000, u())
> summary(u5000)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.3200000 -0.2205000 0.0007736 0.0023330 0.2306000 1.8770000

What is the distribution of this U ? Is it Normal? We can check using a Normal Q-Q plot.

> library(lattice)
> qqmath(~u5000, distribution = qnorm, type = c("p", "g"),
         aspect = "xy", pch = ".", cex = 2)

[Figure: Normal Q-Q plot of u5000 (vertical axis) against qnorm quantiles (horizontal axis).]

Exercise 2. Does this look like a straight line? Compare with the Normal Q-Q plot of values directly
simulated from the Normal distribution.

The systematic curvature is indicative of tails wider than the Normal distribution.
Simulation p-value. We now have 5000 simulated values of U under the assumption that µ = 0 (in the
model that F is a Normal distribution). What is the value of U in our actual experiment?

> print(U <- mean(extra.hyoscyamine) / sd(extra.hyoscyamine))

[1] 0.4192264

Let us compute the proportion of cases in which our simulated U is larger than our observed U .

> sum(u5000 > U) / length(u5000)

[1] 0.1058

Thus under our model, there is a roughly 10% chance that even when µ = 0, we see a value of U at least as
large as what we have actually seen. Of course, our observed proportion is itself an estimate of this chance,
but we can compute it more accurately by simulating more samples.

> sum(replicate(100000, u()) > U) / 100000

[1] 0.10795
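
The accuracy of a simulated proportion can be quantified. Since each simulated U either exceeds the observed value or does not, the proportion is a binomial proportion, and its standard error is approximately sqrt(p(1 - p)/N). The sketch below applies this to the value 0.1058 reported above; the variable names are chosen here for illustration.

```r
p_hat <- 0.1058     # simulated proportion reported in the text
N     <- 5000       # number of simulated values of U

# Approximate standard error of a binomial proportion
se <- sqrt(p_hat * (1 - p_hat) / N)
se

# Rough interval of plus or minus two standard errors
c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)
```

With 100000 replications the standard error shrinks by a factor of sqrt(20), which is why the second estimate is the more trustworthy one.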

We have just performed a hypothesis test, where we tested the null hypothesis
$$H_0 : \mu = 0$$
against the alternative hypothesis
$$H_1 : \mu > 0.$$

The proportion computed is the p-value $P(U > u_{\text{observed}})$, which represents the “surprise factor”: how
unlikely is our observed sample given our null hypothesis?

The true p-value. Our Q-Q plots suggest that the distribution of $U$ when $\mu = 0$ is not Normal, but do
not tell us what it is. It so happens that the distribution of $U$ can be derived analytically, and is, up to a
scale factor, a member of the family of t-distributions. In fact, $\sqrt{n}\, U \sim t_{n-1}$ (the t-distribution with $n - 1$ degrees of freedom), as
can be graphically verified by a Q-Q plot against the $t_{n-1}$ distribution. $\sqrt{n}\, U$ is referred to as the test
statistic, as it is the quantity (statistic) on which the test is based.

> qqmath(~ (sqrt(10) * u5000), distribution = function(p) qt(p, df = 9),
         type = c("p", "g"), aspect = "iso", pch = ".", cex = 2)

[Figure: Q-Q plot of sqrt(10) * u5000 (vertical axis) against quantiles of the t-distribution with 9 degrees of freedom (horizontal axis).]

We do not need to depend on simulation to compute the actual p-value; it can be computed using

> pt(sqrt(10) * U, df = 9, lower.tail = FALSE)

[1] 0.1087989

A p-value of this magnitude is not considered strong enough evidence against our conjecture (null
hypothesis) that µ = 0; in other words, we would not conclude in this case that hyoscyamine is an
effective sleep-inducing drug. The usual practice is to consider a p-value less than 0.05 to be “significant”, or
strong enough evidence (if we stick by such a rule whenever we test, we would erroneously reject a true null
hypothesis 5% of the time).
Of course, it is possible that our lack of evidence is simply due to the fact that we do not have enough data.
Not having enough evidence “against the null” does not mean that we have evidence against the alternative.
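
The "5% of the time" statement can itself be checked by simulation. The sketch below repeatedly draws samples for which the null hypothesis is true, runs the one-sided t-test on each, and records how often the p-value falls below 0.05. The seed, the number of replications, and the helper name one_p are illustrative assumptions.

```r
set.seed(4)    # arbitrary seed

# p-value of the one-sided t-test for one sample of size 10 with mu = 0
# (one_p is a hypothetical helper)
one_p <- function() t.test(rnorm(10), alternative = "greater")$p.value

pvals <- replicate(5000, one_p())

reject_rate <- mean(pvals < 0.05)   # proportion of false rejections
reject_rate                         # should be close to 0.05
```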

Exercise 3. Repeat the procedure above to compute p-values for testing whether the two other drugs
(laevorotatory and racemic) significantly increase the average duration of sleep.

The R function t.test() will also perform the test described above, which is also known as a one-sample
t-test.

> t.test(extra.hyoscyamine, alternative = "greater")

One Sample t-test

data: extra.hyoscyamine
t = 1.3257, df = 9, p-value = 0.1088
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
-0.2870553 Inf
sample estimates:
mean of x
0.75
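
The numbers in this output can be traced back to the quantities computed earlier. The following sketch recomputes the statistic, the p-value, and the lower confidence bound by hand and compares them with the components stored in the object returned by t.test(); variable names such as t_manual are chosen here for illustration.

```r
x <- c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0)  # hyoscyamine data
n <- length(x)

# Test statistic sqrt(n) * U and its one-sided p-value
t_manual <- sqrt(n) * mean(x) / sd(x)
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

# Lower end of the one-sided 95% confidence interval
lower_manual <- mean(x) - qt(0.95, df = n - 1) * sd(x) / sqrt(n)

tt <- t.test(x, alternative = "greater")
all.equal(t_manual, unname(tt$statistic))
all.equal(p_manual, tt$p.value)
all.equal(lower_manual, tt$conf.int[1])
```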

Third question. Let us now ask a slightly different question: is the drug laevorotatory any better than
hyoscyamine? Since the observations are paired (measured on the same patients), this is the same as asking
if the mean of their difference is positive. R can perform a paired t-test using

> t.test(extra.laevorotatory, extra.hyoscyamine,
         alternative = "greater", paired = TRUE)

Exercise 4. Confirm that the paired t-test above gives the same results as

> t.test(extra.laevorotatory - extra.hyoscyamine, alternative = "greater")

A non-parametric test. Let us now go back to a more general model for our data, with $F$ only assumed
to be symmetric around the mean $\mu$. As before, we would like to test
$$H_0 : \mu = 0 \quad \text{vs} \quad H_1 : \mu > 0$$
The Wilcoxon signed-rank test works as follows: rank the observations according to their absolute value
(that is, ignoring their sign), and then compute the sum of the ranks of the positive observations only. The idea is
that under $H_0$, with only the symmetry assumption, each observation is equally likely to be positive or
negative, so the test statistic behaves like selecting each number from 1 to
$n$ independently with probability 0.5 and taking their sum.
It is clear that as before, the distribution of the test statistic will not depend on $F$ (as long as $\mu = 0$ and $F$
is symmetric). We can use simulation from any symmetric distribution with mean 0 to estimate the
p-value, or use a theoretical calculation. The distribution of the test statistic is not a standard distribution,
but is available in R (see ?psignrank). The wilcox.test() function will perform the test directly given the data.

> wilcox.test(extra.hyoscyamine, alternative = "greater")

Wilcoxon signed rank test with continuity correction

data: extra.hyoscyamine
V = 31, p-value = 0.1716
alternative hypothesis: true location is greater than 0
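
The statistic V can be reproduced by following the description of the test step by step. The sketch below drops the single zero observation (as wilcox.test() does), ranks the absolute values, and sums the ranks of the positive observations. It then estimates the p-value by the "select each rank with probability 0.5" simulation and compares it with the exact value from psignrank(). The seed and replication count are arbitrary choices.

```r
set.seed(9)    # arbitrary seed
x <- c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0)  # hyoscyamine data

x <- x[x != 0]                  # zero observations carry no sign information
r <- rank(abs(x))               # ranks of the absolute values
V <- sum(r[x > 0])              # sum of ranks of the positive observations
V

# Null distribution: include each rank 1..n independently with probability 0.5
n <- length(x)
sim <- replicate(20000, sum(seq_len(n)[runif(n) < 0.5]))
p_sim <- mean(sim >= V)

# Exact P(V >= observed) from the signed-rank distribution
p_exact <- psignrank(V - 1, n = n, lower.tail = FALSE)
c(p_sim = p_sim, p_exact = p_exact)
```

These values need not match the p-value printed above exactly, because that output was produced with a Normal approximation and continuity correction rather than the exact distribution.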

Tests such as these are called non-parametric tests because they do not need strict parametric model
assumptions such as Normality. The t-test actually works quite well even under mild departures from
Normality, and has better optimality properties (it is more likely to correctly detect situations where H0 is
not true) when the data is actually close to Normal. For this reason, the t-test is preferred when the
assumption of Normality is justifiable, even though the Wilcoxon signed-rank test works under more
general assumptions.

The two-sample t-test. Consider a more complicated situation where we have independent samples from
two different populations: $X_1, X_2, \ldots, X_n \sim F_1$, and $Y_1, Y_2, \ldots, Y_m \sim F_2$. Let the population means of $F_1$
and $F_2$ be $\mu_1$ and $\mu_2$. We want to test $H_0 : \mu_1 = \mu_2$.
As before, the typical parametric approach is to assume that each $X_i \sim N(\mu_1, \sigma_1^2)$, and each
$Y_j \sim N(\mu_2, \sigma_2^2)$. There are two variations of the next step. If we further assume that $\sigma_1^2 = \sigma_2^2$ and use a
pooled estimate of the variance, we get the so-called two-sample t-test with equal variance, also called the
classical t-test. If we allow $\sigma_1^2 \neq \sigma_2^2$, we can use an alternative procedure due to Welch that produces a
test-statistic whose null distribution does not exactly follow a t-distribution, but is well-approximated by it.
Both tests can be performed using the t.test() function.
Using the energy expenditure data seen before, we can do

> data(energy, package = "ISwR")

> s <- with(energy, split(expend, stature))
> t.test(s$lean, s$obese, var.equal = TRUE)
Two Sample t-test

data: s$lean and s$obese

t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.411451 -1.051796
sample estimates:
mean of x mean of y
8.066154 10.297778
> t.test(expend ~ stature, energy, var.equal = FALSE)
Welch Two Sample t-test

data: expend by stature

t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.459167 -1.004081
sample estimates:
mean in group lean mean in group obese
8.066154 10.297778

Both these tests have alternative hypothesis $H_1 : \mu_1 \neq \mu_2$. The second example uses a formula interface,
similar to the one we have seen in lattice graphics. The formula interface is used extensively for modeling in
R.
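
The difference between the two variants is easiest to see by computing both statistics by hand. The sketch below uses simulated data rather than the energy data (so that it does not depend on the ISwR package being installed); the group sizes, means, and standard deviations are arbitrary assumptions. The Welch degrees of freedom use the standard Welch-Satterthwaite approximation.

```r
set.seed(5)                           # arbitrary seed
x <- rnorm(13, mean = 8,  sd = 1.2)   # hypothetical group 1
y <- rnorm(9,  mean = 10, sd = 1.4)   # hypothetical group 2
n <- length(x); m <- length(y)

# Classical test: pooled variance estimate, n + m - 2 degrees of freedom
sp2      <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
t_pooled <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n + 1 / m))

# Welch test: separate variance estimates, approximate degrees of freedom
vx <- var(x) / n; vy <- var(y) / m
t_welch  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_welch <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))

tt_pooled <- t.test(x, y, var.equal = TRUE)
tt_welch  <- t.test(x, y, var.equal = FALSE)
c(t_pooled, unname(tt_pooled$statistic))
c(t_welch,  unname(tt_welch$statistic))
c(df_welch, unname(tt_welch$parameter))
```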

Two-sample Wilcoxon test. As with the one-sample problem, we can compute a two-sample Wilcoxon
test statistic, which computes the rank of all observations taken together (without regard to grouping) and
then sums the ranks of observations in the first group. The test for H1 : µ1 < µ2 can be performed using
wilcox.test() as follows.

> wilcox.test(expend ~ stature, energy, alternative = "less")

Wilcoxon rank sum test with continuity correction

data: expend by stature

W = 12, p-value = 0.001061
alternative hypothesis: true location shift is less than 0
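
The statistic W reported by wilcox.test() is the rank sum of the first group shifted by a constant, so that its smallest possible value is 0. The sketch below shows this on simulated data (the samples and their sizes are hypothetical, chosen only for illustration).

```r
set.seed(6)                      # arbitrary seed
x <- rnorm(8)                    # hypothetical first group
y <- rnorm(6, mean = 1)          # hypothetical second group

r <- rank(c(x, y))               # ranks in the combined sample
rank_sum_x <- sum(r[seq_along(x)])   # sum of ranks of the first group

# R reports the rank sum minus its minimum possible value n_x (n_x + 1) / 2
n_x <- length(x)
W <- rank_sum_x - n_x * (n_x + 1) / 2

wt <- wilcox.test(x, y, alternative = "less")
c(W = W, reported = unname(wt$statistic))
```

The shift does not affect the test, since the null distribution is shifted by the same constant.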

Exercise 5. Using the ideas described above, compute the p-value for this test using simulation. Use 50000
replications. Is your computed p-value close to the p-value reported by wilcox.test()?

Linear models
A wide range of useful statistical models can be viewed as special cases of a class of models called linear
models. The general model may be written as
$$y_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \cdots + x_{ip} \beta_p + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
where each $i$ represents one observation in the sample, the $y_i$-s are values of some response variable that we want to
model, and the $x_{ij}$-s are various known predictor or “design” terms. The statistical component in the model
comes through the random variables $\varepsilon_i$, which are at a minimum assumed to be independent with mean 0
and constant variance $\sigma^2$. Often the more specific model $\varepsilon_i \sim N(0, \sigma^2)$ is assumed.

Estimation. The parameters in a linear model are the coefficients $\beta_j$, $j = 1, 2, \ldots, p$, and the variance
parameter $\sigma^2$. The best estimators of the $\beta_j$-s turn out to be those that minimise the squared error loss.
These estimators have nice analytical solutions, and they can be computed by the R function lm().
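
The analytical solution referred to here is the solution of the normal equations, $(X^\top X) \hat\beta = X^\top y$, where $X$ is the matrix with entries $x_{ij}$. The sketch below solves them directly on simulated data and compares the result with lm(); the data-generating values are arbitrary assumptions. Note that lm() itself uses a numerically more stable QR decomposition rather than this direct approach.

```r
set.seed(7)                       # arbitrary seed
n  <- 50
x1 <- runif(n); x2 <- runif(n)    # hypothetical predictors
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

X <- cbind(1, x1, x2)             # design matrix with an intercept column

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

fm <- lm(y ~ 1 + x1 + x2)
cbind(normal_eq = as.vector(beta_hat), lm = unname(coef(fm)))
```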

Testing. Hypothesis tests under the linear model are conjectures that put linear constraints on the $\beta_j$
coefficients; typically these are of the form $H_0 : \beta_1 = 0$, $H_0 : \beta_1 = \beta_2 = \beta_3 = \beta_4$, etc. Such hypotheses can
be tested using a generalization of the t-test known as the F-test, provided we assume the more specific
Normal error model.

Specific models. We will not discuss the general linear model theory, but instead just mention some
common special cases.
The most familiar example of a linear model is simple linear regression. We will re-use the following
artificial example to illustrate regression.

> x <- runif(100, min = 1, max = 5)

> mydf <- data.frame(x = x, y = x^2 + 2 * runif(100))

A simple linear regression model for this data may be

$$y_i = \beta_1 + x_i \beta_2 + \varepsilon_i, \quad i = 1, 2, \ldots, n = 100$$
We fit this model in R as

> fm1 <- lm(y ~ 1 + x, data = mydf)

> fm1
Call:
lm(formula = y ~ 1 + x, data = mydf)

Coefficients:
(Intercept) x
-5.573 5.705

We of course know this model to be incorrect, as $E(y_i)$ cannot be expressed as a linear function of $x_i$. We
can fit the correct model by adding a quadratic term,
$$y_i = \beta_1 + x_i \beta_2 + x_i^2 \beta_3 + \varepsilon_i, \quad i = 1, 2, \ldots, n = 100$$
which is fit by

> fm2 <- lm(y ~ 1 + x + I(x^2), data = mydf)

> fm2
Call:
lm(formula = y ~ 1 + x + I(x^2), data = mydf)

Coefficients:
(Intercept) x I(x^2)
1.22003 -0.06778 0.99664

In practice we would not know the correct model, so we may need to perform a hypothesis test to decide
which model is correct. This is done using

> anova(fm1, fm2)

Analysis of Variance Table

Model 1: y ~ 1 + x
Model 2: y ~ 1 + x + I(x^2)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     98 171.978
2     97  34.123  1    137.86 391.88 < 2.2e-16
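
The F statistic in this table is the reduction in residual sum of squares per extra parameter, divided by the residual mean square of the larger model; with the values shown, (171.978 - 34.123) / (34.123 / 97) is approximately 391.9. The sketch below repeats the calculation on freshly simulated data, so its numbers will differ from the table; the seed is arbitrary.

```r
set.seed(8)                                   # arbitrary seed
x    <- runif(100, min = 1, max = 5)
mydf <- data.frame(x = x, y = x^2 + 2 * runif(100))

fm1 <- lm(y ~ 1 + x, data = mydf)
fm2 <- lm(y ~ 1 + x + I(x^2), data = mydf)

rss1 <- deviance(fm1);    rss2 <- deviance(fm2)     # residual sums of squares
df1  <- df.residual(fm1); df2  <- df.residual(fm2)  # residual degrees of freedom

F_manual <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_manual <- pf(F_manual, df1 - df2, df2, lower.tail = FALSE)

a <- anova(fm1, fm2)
c(F_manual = F_manual, F_anova = a$F[2])
```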

The very small p-value (the column with heading Pr(>F)) indicates that fm2 is a much better fit for the
data. Compare this with a test for a model with a (redundant) cubic term:

> fm3 <- lm(y ~ 1 + x + I(x^2) + I(x^3), data = mydf)

> anova(fm2, fm3)
Analysis of Variance Table

Model 1: y ~ 1 + x + I(x^2)
Model 2: y ~ 1 + x + I(x^2) + I(x^3)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     97 34.123
2     96 33.795  1   0.32815 0.9322 0.3367

There are not always obvious additional terms to try out to test the goodness of a model fit this way. As an
alternative, it is always a good idea to look at graphical diagnostics for a fit. For example, R shows the
following plots for fm1, the model with only the linear term. The first and last plots are clear indicators of
systematic lack-of-fit.

> par(mfrow = c(1, 4))

> plot(fm1)

[Figure: four diagnostic plots for fm1, side by side: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

Exercise 6. Look at the corresponding diagnostic plots for fm2. Do they still indicate lack-of-fit?

Exercise 7. Change the call that creates mydf to

> mydf <- data.frame(x = x, y = x^2 + 2 * rnorm(100))

so that the errors have a Normal distribution, rather than the uniform distribution as before. Re-fit fm2 and
look at its diagnostic plots. Do you see any qualitative difference?

Exercise 8. For the airquality dataset, fit a suitable regression model for Ozone as response and Temp as
predictor. Comment on goodness-of-fit.

Another important class of linear models is factorial models. The simplest factorial models are one-way
classification models, where the predictor is a categorical variable, and the response is allowed to have a
different mean (but the same variance) for each group. One such model with a two-group categorical
variable is

> fm4 <- lm(expend ~ 1 + stature, energy)

> anova(fm4)
Analysis of Variance Table

Response: expend
Df Sum Sq Mean Sq F value Pr(>F)
stature 1 26.485 26.4853 15.568 0.000799
Residuals 20 34.026 1.7013

The anova() call with a single argument tests submodels obtained by sequentially removing terms from
the model, starting with the last one. In this case, the only submodel is equivalent to the hypothesis that
both groups have the same mean.

Exercise 9. In a one-way classification model with two categories, testing equality of means across groups
is equivalent to a two-sample t-test with equal variance. Confirm this for the above example by comparing
p-values for the appropriate tests.

An example of a one-way classification model with more than two groups is

> fm5 <- lm(Ozone ~ 1 + factor(Month), data = airquality)

> fm5
Call:
lm(formula = Ozone ~ 1 + factor(Month), data = airquality)

Coefficients:
(Intercept) factor(Month)6 factor(Month)7 factor(Month)8 factor(Month)9
23.615 5.829 35.500 36.346 7.833
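
With R's default (treatment) contrasts, the intercept in this output is the mean Ozone for the first month (May), and each remaining coefficient is the difference between that month's mean and the May mean. The sketch below checks this against the group means computed directly; it assumes the default contrast settings have not been changed.

```r
# Mean Ozone by month, ignoring missing values
means <- with(airquality, tapply(Ozone, Month, mean, na.rm = TRUE))

fm5 <- lm(Ozone ~ 1 + factor(Month), data = airquality)
cf  <- coef(fm5)

# Intercept versus the May mean
c(intercept = unname(cf[1]), may_mean = unname(means["5"]))

# Other coefficients versus differences from the May mean
cbind(coefficient = unname(cf[-1]),
      difference  = unname(means[-1] - means[1]))
```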

Exercise 10. Look at the diagnostic plots produced by plot(fm5). Is there any reason to suspect
lack-of-fit?

Exercise 11. Create a box-and-whisker plot of Ozone by Month. One-way classification models assume
equality of response variance for each group. Is this true for fm5? Fit and assess a new model with
log(Ozone) as the response.

Any number of continuous (regression-type) and categorical terms can be used in a linear model. Here are
a few more complex models for Ozone:

> airquality <- transform(airquality, fmonth = factor(Month), log.Ozone = log(Ozone))

> fm6 <- lm(log.Ozone ~ 1 + Temp, airquality)
> fm7 <- lm(log.Ozone ~ 1 + fmonth, airquality)
> fm8 <- lm(log.Ozone ~ 1 + Temp + fmonth, airquality)
> fm9 <- lm(log.Ozone ~ 1 + Temp * fmonth, airquality)

fm8 allows a different regression line for each month, but with the same slope. fm9 allows a different
regression line for each month, with possibly different slopes.
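
One way to see the difference between these two models is to look at the coefficients each one estimates. The sketch below refits them on a copy of the data (named aq here, an illustrative choice, so that the built-in data frame is not modified) and lists the coefficient names.

```r
# Work on a copy so the built-in airquality data frame is left untouched
aq <- transform(airquality, fmonth = factor(Month), log.Ozone = log(Ozone))

fm8 <- lm(log.Ozone ~ 1 + Temp + fmonth, aq)   # common slope
fm9 <- lm(log.Ozone ~ 1 + Temp * fmonth, aq)   # month-specific slopes

names(coef(fm8))   # intercept, Temp, and one intercept shift per later month
names(coef(fm9))   # as above, plus one Temp:fmonth slope shift per later month

n_coef <- c(fm8 = length(coef(fm8)), fm9 = length(coef(fm9)))
n_coef
```

With five months in the data, fm8 has one slope and five intercepts (six coefficients), while fm9 has five slopes and five intercepts (ten coefficients).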

Exercise 12. Use anova() to compare all pairs of models for which the comparison makes sense. Which
model would you suggest as the most appropriate for the data?

Further reading
In this tutorial, we have only outlined some very basic ideas of modeling in R. Fitting linear models in R is
by itself an extensive topic. ?lm is a good place to start learning more, especially about the formula
language, and many resources on the web have more details. There are also many other kinds of models
one can fit in R, which you should explore once you are more comfortable with the basic models.
