Lecture Slides 3a - Statistical Testing


Lecture 3

Statistical testing

- Statistical testing
- One-sample t-test

Free after dr. ir. P. Heijnen (TU Delft)


Statistical testing
Outline

• Normal distribution

• Standard normal distribution

• Confidence intervals

• The one-sample t-test


The normal distribution
Normal distribution
• X is a continuous random variable and has a normal distribution with average μ and standard deviation σ

μ average
σ standard deviation
σ² variance
Standard percentages of
the normal distribution

68% of values lie within μ ± 1σ
95% of values lie within μ ± 2σ
99.7% of values lie within μ ± 3σ
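A minimal sketch, assuming Python's standard-library statistics.NormalDist, that reproduces the standard percentages of the normal distribution. The helper name prob_within is illustrative only and does not come from the slides.

```python
from statistics import NormalDist

# The percentages do not depend on the mean or standard deviation,
# so the standard normal distribution is enough.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k: float) -> float:
    """Return P(mu - k*sigma <= X <= mu + k*sigma) for a normal variable X."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s) of the mean: {prob_within(k):.1%}")
```

The three printed values should be close to 68%, 95% and 99.7%.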


Normal distribution
Example shopping center

• On average, visitors of a shopping center live at a distance of 4.6 km from the center (σ = 3.06)

μ = 4.6
σ = 3.06
The probability of an interval of X
Example shopping center

• What is the probability of a visitor of the shopping center living within a distance of 2.5 km of the center?

x = 2.5
μ = 4.6
σ = 3.06
Standard normal distribution
Standard normal distribution

Transformation of X => Z: z = (x - μ) / σ

• What is the mean of z?

• What is the standard deviation of z?

All probabilities known in tables (standard normal distribution: mean 0, standard deviation 1)
Standard normal distribution
Example shopping center

What is the z-value of a distance of 2.5 km (μ = 4.6 and σ = 3.06)?

So we have
z = (2.5 - 4.6) / 3.06 = -0.686

Look up in table
P(Z ≤ -0.686) ≈ 0.245

P(X ≤ 2.5) ≈ 0.245
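As a hedged check of the table lookup, the same probability can be computed directly. This sketch assumes Python's statistics.NormalDist; the variable names are illustrative. Computed without rounding z first, the result is roughly 0.246, so a small difference from the table value 0.245 is expected.

```python
from statistics import NormalDist

mu, sigma = 4.6, 3.06   # mean and standard deviation from the example
x = 2.5                 # distance of interest in km

# Standardize x, then look up the probability in the standard normal.
z = (x - mu) / sigma
p_from_z = NormalDist().cdf(z)

# Equivalent shortcut: use the normal distribution of X directly.
p_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z:.3f}")
print(f"P(Z <= z) = {p_from_z:.3f}, P(X <= {x}) = {p_direct:.3f}")
```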


Probability P(Z ≤ z)
Example Shopping Center
• Which % of visitors of the shopping center live within a distance of 5 km from the center?

z = (5 - 4.6) / 3.06 ≈ 0.13

Look up in table: P(Z ≤ 0.13) ≈ 0.55

Approximately 55%
Probability P(Z > z)
Example Shopping Center

• Which % of visitors of the shopping center live more than 5 km from the center?

P(Z > z) = 1 - P(Z ≤ z), so approximately 45%
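A short sketch of the complement rule P(X > x) = 1 - P(X ≤ x) for the 5 km question, again assuming statistics.NormalDist and the example values μ = 4.6 and σ = 3.06. The variable names are illustrative.

```python
from statistics import NormalDist

# Distance to the center, modeled as normal with the example parameters.
distance = NormalDist(mu=4.6, sigma=3.06)

p_within_5 = distance.cdf(5.0)    # P(X <= 5)
p_beyond_5 = 1.0 - p_within_5     # P(X > 5), the complement

print(f"within 5 km:     {p_within_5:.1%}")
print(f"more than 5 km:  {p_beyond_5:.1%}")
```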
Confidence intervals
Confidence intervals

• In a sample, we find an average of x̄

• Can we say that x̄ represents the average of the population?

• Yes, as a best guess, but how confident are we?

• The smaller the sample, the less confident we are


Notation

• Average
• Sample: x̄
• Population: μ
• Estimate of μ: μ̂ = x̄

So, the average of the sample is an estimate of the average of the population

• Standard deviation
• Sample: s
• Population: σ
Confidence interval - the problem
• Find an interval [x1, x2] such that the population average μ will be within [x1, x2] with 95% confidence (95% is often used)

• For the shopping example
• What is the 95% confidence interval for the estimated average distance to the center? x̄ = 4.6

If we drew many samples, we would get a distribution of sample means

[Figure: distribution of sample means centered at x̄ = 4.6 km; the central 95% lies between x1 and x2]
Confidence interval

So, we are looking for the values x1 and x2

For the standard normal distribution we know the critical values z1 and z2

[Figure: standard normal curve centered at μ = 0; the central 95% lies between z1 = -1.96 and z2 = 1.96]
To translate this to x variables we must know the variance of the distribution of sample means
Standard deviation of sample means

Standard deviation of sample averages?

σ_x̄ = σ / √n

σ = standard deviation in population
n = sample size

So, the standard deviation of sample averages is much smaller than σ, depending on n
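A small simulation sketch of the σ / √n rule. It assumes, purely for illustration, that distances are normally distributed with the shopping-center values and a sample size of 80 (a real distance cannot be negative, so this is only an approximation). The seed and the number of repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import stdev

mu, sigma, n = 4.6, 3.06, 80   # assumed population values and sample size
rng = random.Random(0)          # fixed seed so the run is repeatable

# Draw many samples of size n and record the average of each one.
sample_means = [
    sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(2000)
]

observed = stdev(sample_means)   # spread of the simulated sample averages
theoretical = sigma / sqrt(n)    # value predicted by sigma / sqrt(n)

print(f"observed:    {observed:.3f}")
print(f"theoretical: {theoretical:.3f}")
```

The two numbers should agree to within a few percent, and both are much smaller than σ = 3.06.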
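So, we have the formula

z = (x - x̄) / (σ / √n)

... but we don't know the population standard deviation σ, we only know the sample standard deviation s

[Figure: standard normal curve centered at μ = 0; the central 95% lies between z1 = -1.96 and z2 = 1.96]

Let's for the moment assume that the population standard deviation equals the sample standard deviation: we assume σ = s

Then we get

z = (x - x̄) / (s / √n)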
Confidence interval – population variance known
Example shopping center

Using the formula for z, we have

z1 = (x1 - x̄) / (s / √n) = -1.96
z2 = (x2 - x̄) / (s / √n) = 1.96

Solving x1 and x2

x1 = x̄ - 1.96 · s / √n
x2 = x̄ + 1.96 · s / √n

[Figure: distribution of sample means centered at x̄ = 4.6 km; the central 95% lies between x1 = 3.92 and x2 = 5.28]

Conclusion: the 95% confidence interval for average distance to the center is: [3.92, 5.28]
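A minimal sketch of this calculation, assuming n = 80 (the sample size implied by df = 80 - 1 later in the slides) and σ = s = 3.06. It uses statistics.NormalDist for the critical value. Evaluated at full precision the bounds come out at roughly 3.93 and 5.27; the last-digit difference from the slide's [3.92, 5.28] presumably comes from rounding in intermediate steps.

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n = 4.6, 3.06, 80          # sample average, standard deviation, size

z_crit = NormalDist().inv_cdf(0.975)  # 2.5% in each tail, about 1.96
std_error = s / sqrt(n)               # standard deviation of sample averages

x1 = x_bar - z_crit * std_error
x2 = x_bar + z_crit * std_error

print(f"z critical value: {z_crit:.2f}")
print(f"95% confidence interval: [{x1:.2f}, {x2:.2f}]")
```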
BUT ... the population variance is unknown

• We do not know the population variance, but we do have an estimate, namely the variance we find in the sample

• Because we estimate the population variance, there is more uncertainty

• As a consequence, we cannot use the standard normal distribution

• Instead, we must use a slightly different distribution, known as the Student t-distribution
Student t-distribution

• Bell-shaped curve, around t = 0

• Larger variance than standard normal distribution

• Takes into account the larger uncertainty since σ is also estimated by the sample standard deviation s

• Probability density function has parameter (N - 1), the number of degrees of freedom

Confidence interval – population variance unknown
Example shopping center
df = 80 - 1 = 79

The formula for t is the same as for z

[Figure: t-distribution centered at t = 0; the critical values -1.99 and 1.99 correspond to 95% for the t-distribution, df = 80 - 1]

Larger interval. Solving x1 and x2:

• Conclusion: the 95% confidence interval for average distance to the center is: [3.91, 5.29]

[Figure: distribution centered at x̄ = 4.6; the central 95% lies between about 3.9 and 5.3]

Sample size is large enough, so approximately the same interval is found: [3.91, 5.29]
Summary of steps
Calculate a 95% confidence interval

• Use the Student t-distribution since the population variance is unknown
1. Calculate the degrees of freedom as df = N – 1
2. Given df, determine the critical values of t for a 95% confidence interval – this is t0.975
3. Calculate the standard deviation of sample averages using s / √N
4. Given the sample average x̄, calculate the interval [x1, x2] as: x1 = x̄ - t0.975 · s / √N and x2 = x̄ + t0.975 · s / √N
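The four steps can be sketched in plain Python. The standard library has no t-distribution, so this sketch makes its own assumptions: the critical value is found by numerically integrating the t density (Simpson's rule) and bisecting, and the helper names t_pdf, t_cdf and t_critical are illustrative. In practice a table or a statistics package would be used instead. The example values are the shopping-center numbers with N = 80; the bounds come out near 3.92 and 5.28, within rounding of the slide's interval.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t: float, df: int) -> float:
    """Density of the Student t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + t * t / df))


def t_cdf(t: float, df: int, steps: int = 1000) -> float:
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry around 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def t_critical(p: float, df: int) -> float:
    """Find t with P(T <= t) = p (for p > 0.5) by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


x_bar, s, n = 4.6, 3.06, 80            # example values from the slides
df = n - 1                             # step 1: degrees of freedom
t_crit = t_critical(0.975, df)         # step 2: critical value t0.975
std_error = s / sqrt(n)                # step 3: std. dev. of sample averages
x1 = x_bar - t_crit * std_error        # step 4: lower bound
x2 = x_bar + t_crit * std_error        # step 4: upper bound

print(f"df = {df}, t0.975 = {t_crit:.2f}")
print(f"95% confidence interval: [{x1:.2f}, {x2:.2f}]")
```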
Calculating a confidence interval in
SPSS

Descriptives: Travel distance in Netherlands (Verplaatsingsafstand in Nederland)

                                                 Statistic     Std. Error
Mean                                             123.90        .538
95% Confidence Interval for Mean, Lower Bound    122.84
95% Confidence Interval for Mean, Upper Bound    124.95
5% Trimmed Mean                                  81.49
Median                                           35.00
Variance                                         64130.129
Std. Deviation                                   253.239
Minimum                                          1
Maximum                                          6950
Range                                            6949
Interquartile Range                              110
Skewness                                         5.033         .005
Kurtosis                                         40.666        .010

Lower and upper bound of 95% confidence interval for variable Travel distance in Netherlands
One-sample t-test:

Student t-test for averages


The concept of hypothesis testing
Example: body length

• Someone says the average length of an adult person in the Netherlands is 1.70 m

• In a sample (n = 100) we find an average length of 1.75 m and a standard deviation of 0.15 m

• Do we believe the person?

• We use a statistical test to make a decision

• Assume the person is right; then what is the probability that we find an average in our sample that differs as strongly as 1.75 m does from this claimed average?

[Figure: distribution of sample averages centered at μ0 = 1.70; the observed 1.75 is quite extreme, so it seems unlikely]
Assuming the claim is right
• We want to know the probability that we find an average in our sample that differs as strongly as 1.75 m does from this claimed average

• So, on both sides

[Figure: distribution centered at μ0 = 1.70 with both tails marked, below x̄ = 1.65 and above x̄ = 1.75]

• If the probability is smaller than 5% we decide not to believe the person (this percentage is often chosen)
Translation to formal terms
• The average length is 1.70 m
• Null hypothesis (H0)

• The average length is not equal to 1.70 m
• Alternative hypothesis (H1)

• The maximum probability of making a wrong decision that we still accept is 5%
• Alpha (α = 5%)

• The probability that we find an average length of 1.75 m or larger in the sample while the null hypothesis is true
• p-value
The way the test can be performed -
confidence intervals

• Calculate a 95% confidence interval [μ1, μ2] around the test value
• If the sample average falls outside the interval then reject the null hypothesis

[Figure: distribution centered at μ0 = 1.70 with 2.5% in each tail, outside μ1 and μ2]

Because we use the sample standard deviation, we should use a Student t-distribution with df = N – 1 = 99

[Figure: t-distribution centered at 0 with 2.5% in each tail; the critical values -1.98 and 1.98 correspond to 95% for the t-distribution, df = 99]

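• We have found a mean of 1.75 m

What is the t-value of this mean?
• the t-value is the standardized value just as the z-value, but then for the t-distribution

• calculated as:

t = (x̄ - μ0) / (s / √N), where x̄ is the sample average and μ0 is the test value

• For the sample we find

t = (1.75 - 1.70) / (0.15 / √100) = 0.05 / 0.015 = 3.33

The standard deviation of sample averages, 0.015, is very small, which is typical for large N

[Figure: t-distribution, df = N – 1 = 99, with 2.5% in each tail; the critical values -1.98 and 1.98 correspond to 95% for the t-distribution, df = 99]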

• Because 3.33 > 1.98, we reject the null hypothesis and accept the alternative hypothesis that the average is different from 1.70 m
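A minimal sketch of this two-tailed decision for the body-length example. The critical value 1.98 is taken from the slides (t-distribution, df = 99) rather than computed; the variable names are illustrative.

```python
from math import sqrt

x_bar, mu0 = 1.75, 1.70   # sample average and test value (claimed average)
s, n = 0.15, 100          # sample standard deviation and sample size

std_error = s / sqrt(n)               # standard deviation of sample averages
t_value = (x_bar - mu0) / std_error   # standardized distance from the test value

t_crit = 1.98                         # two-tailed critical value from the slides, df = 99
reject_h0 = abs(t_value) > t_crit

print(f"standard error = {std_error:.3f}, t = {t_value:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```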
Choosing the alternative hypothesis
• In the example we tested whether the average is
different from 1.70 m

• We could also test whether the average length is larger than 1.70 m (instead of just different)

• This makes sense if we are interested in that particular question

• Then the alternative hypothesis is:
• The average length is larger than 1.70 m

• Does this make a difference for the test?
• Because we now test whether it is larger we look at one side

[Figure: t-distribution, df = N – 1 = 99, with 5% in the right tail; the critical value 1.66 corresponds to 95% for the t-distribution, df = 99]

• Because 3.33 > 1.66, we reject the null hypothesis and accept the alternative hypothesis that the actual average is larger than 1.70 m
Summary (α = 5%) – we have three possibilities

H1: μ < μ0 (5% in the left tail), called one-tailed test
H1: μ > μ0 (5% in the right tail), called one-tailed test
H1: μ ≠ μ0 (2.5% in each tail), called two-tailed test
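The three possibilities can be sketched as one decision rule. The function name reject_h0 and the labels "less", "greater" and "two-sided" are illustrative choices, and the default critical values 1.66 and 1.98 are the slide values for df = 99 at alpha = 5%; other sample sizes need other critical values.

```python
def reject_h0(t_value: float, alternative: str,
              t_crit_one: float = 1.66, t_crit_two: float = 1.98) -> bool:
    """Decide whether to reject H0 at alpha = 5% for a given alternative.

    The default critical values are the slide values for df = 99.
    """
    if alternative == "less":        # H1: average smaller than test value
        return t_value < -t_crit_one
    if alternative == "greater":     # H1: average larger than test value
        return t_value > t_crit_one
    if alternative == "two-sided":   # H1: average different from test value
        return abs(t_value) > t_crit_two
    raise ValueError(f"unknown alternative: {alternative!r}")


# Body-length example: t = 3.33
for alt in ("less", "greater", "two-sided"):
    print(alt, reject_h0(3.33, alt))
```

A t-value between 1.66 and 1.98 shows why the choice matters: it is rejected in the one-tailed "greater" test but not in the two-tailed test.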


Another way the test can be performed - p-value
• p-value is the probability that we find this t-value (or a more extreme one) while the null hypothesis is true

• What is the probability that we find 1.75 m or anything as strongly deviating from 1.70 m?

[Figure: distribution centered at μ0 = 1.70; which probability lies in the tails below x̄ = 1.65 and above x̄ = 1.75?]


The t-value is t = 3.33
The degrees of freedom is df = 99

The corresponding p-value is p = 0.0012

On the internet p-value calculators are available, for example https://www.graphpad.com/quickcalcs/pvalue1.cfm

In this case, we did a two-tailed test. Does it make a difference when instead we would do a one-tailed test (when the alternative hypothesis says the average is larger instead of just different)?
Yes that makes a difference

Two-tailed

The p-value is p = p1 + p1 = 0.0012 (area p1 in each tail, below x̄ = 1.65 and above x̄ = 1.75, around μ0 = 1.70)

One-tailed

The p-value is p = p1 = 0.0006 (area p1 in the right tail only, above x̄ = 1.75)

So, in a one-tailed test the p-value is half as large!
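A hedged sketch of how both p-values could be computed without a table or online calculator. The standard library has no t-distribution, so the tail area is obtained here by numerically integrating the t density with Simpson's rule; the helper names t_pdf and t_upper_tail are illustrative, and a statistics package would normally be used instead.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t: float, df: int) -> float:
    """Density of the Student t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + t * t / df))


def t_upper_tail(t: float, df: int, steps: int = 1000) -> float:
    """P(T >= t) for t >= 0: one half minus the Simpson area on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


x_bar, mu0, s, n = 1.75, 1.70, 0.15, 100   # body-length example
t_value = (x_bar - mu0) / (s / sqrt(n))

p_one_tailed = t_upper_tail(t_value, n - 1)   # area p1 in the right tail
p_two_tailed = 2 * p_one_tailed               # p1 + p1, both tails

print(f"t = {t_value:.2f}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```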


Student t-test
Another example: satisfaction measurement

• In a survey, we ask a sample of N = 50 visitors to indicate their satisfaction on a 5-point scale (1 = very dissatisfied, 5 = very satisfied)

• Null hypothesis: visitors are neutral (i.e., not satisfied or dissatisfied): μ = 3.0

• Alternative hypothesis: visitors are not neutral: μ ≠ 3.0

• We find a mean score of x̄ = 3.45 with standard deviation s = 1.51, so the estimate of μ is 3.45

• Can we conclude that on average visitors are not neutral?

Student t-test SPSS
Example satisfaction measurement
One-Sample Test (Test Value = 3.0, the value under H0)

                                                                        95% Confidence Interval of the Difference
                              t       df    Sig. (2-tailed)   Mean Difference   Lower    Upper
Average Satisfaction score    2.107   49    .0403             0.45              0.025    0.875

df: degrees of freedom, N - 1
Sig. (2-tailed): p-value of this t-value (or a more extreme one) if H0 is true
Confidence interval: if 0 is in the interval, then H0 is not rejected
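The t-value and the two-tailed p-value in this output can be checked with a short sketch. As an assumption of this example, the p-value is computed by numerically integrating the t density with Simpson's rule, because the Python standard library has no t-distribution; the helper names are illustrative and this is not how SPSS computes it.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t: float, df: int) -> float:
    """Density of the Student t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + t * t / df))


def two_tailed_p(t: float, df: int, steps: int = 1000) -> float:
    """Probability of a t-value at least as extreme as t, in either tail."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3     # area between -|t| and +|t|
    return 1.0 - central_area


x_bar, mu0, s, n = 3.45, 3.0, 1.51, 50   # satisfaction example
df = n - 1
t_value = (x_bar - mu0) / (s / sqrt(n))
p_value = two_tailed_p(t_value, df)

print(f"t = {t_value:.3f}, df = {df}, two-tailed p = {p_value:.4f}")
print("reject H0 at alpha = 5%" if p_value < 0.05 else "do not reject H0")
```

Since the p-value is below 5%, the sketch reaches the same conclusion as the confidence interval in the table, which does not contain 0.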
Critical t-values

[Table: critical t-values for alpha = 5%, including the values for large samples]
Summary of important concepts

• Student t-distribution: population variance is estimated → more uncertainty compared to z-distribution

• Confidence interval for an estimate of population average based on a sample

• One-sample t-test:
• Is the average different from a known average?

• One-tailed or two-tailed testing depends on the formulation of the alternative hypothesis
