ECMT1010 Notes

notes on economics and statistics

Uploaded by

joeymedxna
Copyright
© All Rights Reserved
ECMT1010 - Comprehensive well set out notes from Semester 1 2021.
Introduction to Economic Statistics (University of Sydney)



Downloaded by Josef Medina ([email protected])

ECMT1010 - ECONOMIC STATISTICS


Economic Statistics is about understanding the use of computing technology
for data description and statistical inference.

WEEK 1 INTRO TUTORIAL

What we will be covering:

• Sampling bias (whether online reviews about products are reliable or not)
• Describing data (stories behind the data; we will be using Excel and another
software package)
• Correlation vs causation (whether one thing really led to another)
• Sampling distribution (why the probability of getting one side of a fair coin is
50%)
• Bootstrap distribution (the truth about a subject matter via a small lens)
• Randomization distribution (whether an assumption you made for an event is true
or not)
• Normal and t distribution (making the above three points easier)
• Simple regression (making predictions)
• Probability rules (how to deal with uncertainties)
• Bayes' rule (why mass testing for Covid-19 is not recommended in many
countries)

Excel data and a look at the Australian economy

Australian Bureau of Statistics


https://www.abs.gov.au/

Consumer price index (CPI)


Measures the average change in the prices of a basket of goods and services over a
set period of time; the percentage change in the CPI is used to measure inflation
Inflation is the decline in the purchasing power of a given currency over time
Target inflation rate - 2 to 3%

GDP (Gross Domestic Product) growth to be between 3 and 4%


Unemployment rate 4.5 to 4.8%
Average weekly earnings are not an accurate representation of the typical weekly
income in Australia because of the vast range of people's incomes. The figure averages
all the lows and the highs, giving an average annual income of around $90,000.
This number is hard to believe as a typical income, which leads us to question how
representative it is.
Full employment - everyone who wants a job has a job

Reserve Bank of Australia

AGS - Australian Government Securities


WK1 KEY CONCEPTS AND BIAS


Key things to understand by the end of the semester

• Key concepts of inference
• Estimation with intervals
• Testing for significance
• Analyze data using modern resampling or traditional methods
• Use computer software in statistical procedures
• Understand the importance of data collection, the limitations in collection
methods, and be aware of how it affects inference
• Know which statistical methods to use in which situations
• Interpret statistical results effectively and in context
• Be aware of the power of data analysis

Data
Data are measurements based on individual units or cases/observations

Dataset: a collection of variables measured on individual cases or observations
Variable: contains specific information on each case
Spreadsheet: variables are in columns and observations are in rows

Variables
Categorical
- Defines groups, e.g., Gender, Year etc.
- Used to calculate proportions
Quantitative
- Numerical measure e.g., SAT, Height, Weight etc.
- Used to calculate averages

Relationships between variables


Response
- The variable of focus, i.e., the thing you are interested in ~ the price in the
market, the exchange rate, etc.
Explanatory
- What is used to understand, predict or explain the response variable.
- The explanatory and response variables can also be referred to as the cause and
effect variables, respectively


Key Concepts
Population
- includes all the individuals or objects of interest
Sample
- A subset of the population
- The cases that have been selected from the population into our dataset
Inference
- A conclusion reached on the basis of evidence and reasoning
- Statistical inference is using the information from a sample to inform you about
the properties of the population
The Relationship
- Samples are selected to give some information about the population of
interest.

The Population and Sample is what we will continue to refer back to in statistical
inference (what ECMT1010 is all about.)

Sampling bias
Occurs when the method of selecting a sample causes the sample to differ from the
population in some relevant way. This may cause some groups to be over- or
under-represented in your sample.

If sampling bias exists, we cannot trust generalizations from the sample to the
population. In order to avoid sampling bias, you need to take a random sample.

Random sample
- A random sample is ideal, but it is very often not feasible because we don’t
always have access to the whole population we want. You may have to alter the
target population to get a feasible population to sample. This is just the reality:
not everyone in the population you want to study may be reachable.

Sources of Bias

Bad Methods of Sampling


- Sampling based on something obviously related to the variable(s) of interest
o E.g. sampling students at a library or pub about study habits


- The sample being made up of whoever chooses to participate (volunteer bias)
o E.g. emailing all students and then making inferences based on the
responders
o People who care strongly about the issue are the most likely to respond
Other Sources of Bias
- framing (context or wording) of survey questions
- inaccurate or lazy responses
- sources of research funding


WK2 DESCRIBING DATA & ASSOCIATION VS CAUSATION

Association vs Causation
o Two variables are said to be associated if their values are related to one
another.
o Two variables are causally associated if the value of one variable influences
the value of the other

Whenever we have a graph, the variable on the horizontal axis is the explanatory
variable and the variable on the vertical axis is the response variable.

Confounding Variables (CF)


(A third variable associated with both the explanatory and the response variable)
These are problematic from the point of view of uncovering causation, as you are
trying to look at whether x causes y. If you have a confounding variable, it could be
the reason you observe a relationship between x and y. This means you can no
longer be sure that x is actually causing a change in y unless you do something
about the confounding variable that’s driving the relationship between those two
variables.

Example:
• You collect data on sunburns (RV) and ice cream consumption (EV). You find that
higher ice cream consumption is associated with a higher probability of sunburn.
Does that mean ice cream consumption causes sunburn?
o No, because there is the possibility of a third variable influencing the
association, i.e., temperature (CF). Hot temperatures cause people to
both eat more ice cream and spend more time outdoors under the sun,
resulting in more sunburn.
• You find that babies born to mothers who smoked during their pregnancies weigh
significantly less than those born to non-smoking mothers.


o If you do not account for the fact that smokers are more likely to engage
in other unhealthy behaviours (CF), such as drinking or eating less healthy
foods, then you might overestimate the relationship between smoking
and low birth weight.

Randomization
(How we eliminate Confounding Variables)
• Randomly assigning the values of the explanatory variable.

Experimental study
o We use experiments to eliminate confounding variables using randomization.
o Causality can be established through an experimental study as confounding
variables are eliminated.
This is done through:
o Randomization and the treatment
o To begin the experiment, the researchers use random assignment to create two
groups from the study participants: a control group and a treatment group.
o One group (the treatment group) receives the treatment, i.e. the value of the
explanatory variable being tested, while the other group (the control group) does
not. Because the groups are formed at random, they should (roughly) look
similar in every aspect except the treatment. This allows the researchers
to determine whether x is a causal factor of y, because randomization has
eliminated the confounding variables that would otherwise be present. Therefore,
any observed group differences may be attributed to the treatment.

Other experimental studies


o Comparative experiment
- We compare the response variable across two (or more) randomly assigned groups.
o Matched pair experiment (revisit in week 10)
- Two measurements on one individual, one for each treatment

Observational study
o Observational studies use information gathered from observed behaviour as it
naturally exists (just observe the results confounding variables and all).
o You cannot claim causality in an observational study due to the confounding
variables


Placebo effect
The placebo effect is a phenomenon most commonly found in medical experiments
that impacts the ability to claim causality due to participants feeling the desired effect
regardless of whether the treatment was any good or not.

Double blind study


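o Giving the control group a fake treatment (a placebo) subjects both groups to
the placebo effect, so it cancels out when the two groups are compared.
o In a double blind study, neither the participants nor the researchers who deal
with them know who is receiving the real treatment and who is receiving the placebo.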

Randomness in Data Collection

Are observational studies useless?


o A randomized experiment on a random sample is the gold standard, but rarely
achievable (especially in economics)
o If the focus is to estimate a statistic about a population, you need a random
sample but not a randomized experiment
- E.g. election polling, GDP, unemployment, etc.
o If the focus is establishing causality, you need a randomized experiment
and can live with a non-random sample
- E.g. drug testing

Describing Data
How we summarize and visualize variables, and any relationships between them

Proportions
o The proportion in some category is found by:
- Number in that category / total number
Categorical Variables
o Usually visualized using a bar chart or a pie chart
Quantitative Variables
o Plots show the distribution of a variable and any outliers


o A histogram groups data into bins (quantitative measure on the horizontal
axis)
o Bins = how your data is grouped ~ small ranges within our sample; we then
count the number of observations that fall in each range. E.g. 0-100 = 1 histogram
column etc.
o The number of bins is chosen by the researcher. A popular option is to set the
number of bins to the square root of n. E.g. if n = 225, the number of bins would
be √225 = 15.
Shapes/Skews

o Bell-shaped curves are special, allowing us to apply a number of special
rules.


WK3 MEASURES

Sample Statistic
o We use the information in a sample to calculate a statistic, which estimates an
aspect of the population we are interested in
o Inference asks how well our statistic, calculated from our sample, represents
the underlying truth in the population.

Statistical Measures
o Centre
o Spread
o Location
o Association
Notations
N → The number of cases in the total population
n → The number of cases in the sample, or sample size
x → Represents a variable
i → Index (observation 1, 2, 3 etc.)
xi → The value of x for observation i

Measures of Centre (average)

Sample Mean;

X Bar is the notation for sample mean

Population Mean;

Mu is the notation for population mean


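In shorthand (summation) notation:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
where the $x_i$ are the values of the variable in the sample (for $\bar{x}$) and in
the population (for $\mu$)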

Median (m)
o The median is the middle value when the data is ordered (lowest to highest)
o The median splits the data in half.
o To find it, order x from low to high and pick the middle value. If n is even, the
median is the mean of the two middle values

If the data is skewed to the right, the mean is higher than the median due to outliers
described below. Opposite if the data is skewed left.
Outliers
o Outliers skew the data by “pulling” the sample mean in the direction of
skewness.
o Outlier is a value that is notably distinct from the other values in a data set

A statistic is resistant or robust if it is relatively unaffected by outliers. The median is
resistant whereas the mean is not. This shows up in how m and x bar change when
an outlier is removed: m will be relatively unaffected whereas x bar will change noticeably.
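
The resistance of the median can be checked numerically. The sketch below is only an illustration: the weekly incomes are made-up numbers (not from the lecture), with one deliberately extreme value, and it compares the mean and the median with and without that outlier.

```python
from statistics import mean, median

# Hypothetical weekly incomes in dollars; the last value is a clear outlier.
incomes = [800, 950, 1000, 1100, 1200, 9000]
no_outlier = incomes[:-1]  # same data with the outlier removed

# How far each measure of centre moves when the outlier is dropped.
mean_shift = mean(incomes) - mean(no_outlier)
median_shift = median(incomes) - median(no_outlier)

print(f"mean:   {mean(incomes):.2f} -> {mean(no_outlier):.2f}")
print(f"median: {median(incomes):.2f} -> {median(no_outlier):.2f}")
```

In this made-up data the mean moves by more than $1,000 when the outlier is removed, while the median moves by only $50.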

Measures of spread

Standard Deviation
o Standard deviation measures the spread of a variable
o It measures the distance of a “typical case” from the mean (what’s the typical
variation from the mean)
o The larger the Standard deviation the more spread and the more variability in
the data


Sample Standard Deviation

Lower case s is the notation for sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$


o Subtract the mean from each value and square the result. Sum these squared
deviations, divide by the sample size minus 1 (n - 1) and take the square
root of that answer.

Population Standard Deviation

Sigma is the notation for the population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

Rule of thumb
o if a sample histogram is roughly bell-shaped, around 95% of the values will be
within two standard deviations of the mean
o 95% of the sample values will be between $\bar{x} - 2s$ and $\bar{x} + 2s$
o In a population, 95% of values will be between $\mu - 2\sigma$ and $\mu + 2\sigma$
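
A minimal sketch of the rule of thumb, assuming a bell-shaped sample. The heights below are simulated from a normal model with a made-up mean and spread (they are not course data); the code computes the sample standard deviation by hand and then counts the share of values within two standard deviations of the mean.

```python
import random
from math import sqrt

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical bell-shaped sample: 1,000 heights (cm) drawn from a normal model.
heights = [rng.gauss(170, 10) for _ in range(1000)]

n = len(heights)
x_bar = sum(heights) / n
# Sample standard deviation: divide the sum of squared deviations by n - 1.
s = sqrt(sum((x - x_bar) ** 2 for x in heights) / (n - 1))

# Share of values that fall between x_bar - 2s and x_bar + 2s.
lower, upper = x_bar - 2 * s, x_bar + 2 * s
share_within = sum(lower <= x <= upper for x in heights) / n

print(f"x_bar = {x_bar:.1f}, s = {s:.1f}")
print(f"share within 2 standard deviations: {share_within:.3f}")
```

For roughly bell-shaped data the printed share should land close to 0.95; for skewed data the rule can be noticeably off.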

Measure of location

Z scores
o The purpose of a z-score is to take the units of measurements out of
consideration so we can compare results across samples that have different
scales.
o The z-score for a specific value xi is $z_i = \frac{x_i - \bar{x}}{s}$ (in a
population, $z_i = \frac{x_i - \mu}{\sigma}$)


o This measures the distance between a value and its mean in units of standard
deviations
o the mean of z is always 0 and its standard deviation is always 1
o in a bell-shaped distribution, 95% of z-scores lie between ±2

The z-score tells you how many standard deviations an observation lies above (positive z)
or below (negative z) its mean. The larger the absolute value of z, the further the
observation is from the sample mean, and vice versa.
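
As an illustration of these properties, the sketch below standardizes a small set of made-up exam marks (hypothetical values chosen only for the example) and confirms that the resulting z-scores have mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

# Hypothetical exam marks (out of 100) for a small class.
marks = [52, 61, 67, 70, 74, 78, 85, 93]

x_bar = mean(marks)
s = stdev(marks)  # sample standard deviation (n - 1 in the denominator)

# z-score: distance of each value from the mean, in standard deviations.
z_scores = [(x - x_bar) / s for x in marks]

for x, z in zip(marks, z_scores):
    print(f"mark {x:>3}: z = {z:+.2f}")

print(f"mean of z = {mean(z_scores):.3f}, sd of z = {stdev(z_scores):.3f}")
```

Because z-scores are unit-free, the same code works for any quantitative variable, which is what makes comparisons across different scales possible.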

Other measures of Location


o Max, min and range. (max-min)
o Lower quartile, upper quartile and interquartile range. (Q3-Q1)
Five number summary
o Min,Q1,Median,Q3,Max
o Visualized through box plots

[Figure: z value for the margin of error calculation]

Measures of association (variables x and y)


Most common way to show this is a scatter plot.
o A scatterplot graphs the association between two quantitative variables

Correlation
o Correlation is a measure of the strength and direction of linear association
between two quantitative variables
o Allows us to measure the intensity of the relationship, whether they are strongly
or weakly associated and whether it is positively or negatively associated.
o Not so good at picking up non-linear association


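Sample Correlation Equation

r is the notation for the sample correlation:
$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$


Population Correlation Equation

Rho is the notation for the population correlation:
$\rho = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu_x}{\sigma_x} \right) \left( \frac{y_i - \mu_y}{\sigma_y} \right)$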

Sample correlation
o Correlation values lie between +1 and -1
o r > 0 positive association
o r < 0 negative association
o r = 0 no correlation
o the closer r is to −1 or +1, the stronger the (linear) association
o � is unit-free (does not depend on the units of measurement)
o correlation between � and � = correlation between �and � - doesn’t matter
which axis it is plotted on, you will get the same value for the correlation
o Correlation is not very resistant
- Check for outliers
o � = 0 only means that there is no linear association
- Variables could have a non-linear relationship
- Use a scatterplot to check for non-linear relationships
o Even if there is a strong association, that does not imply causation. You can
only determine causation using an experiment.
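
The properties above can be checked with a short sketch. It assumes the usual sample correlation formula (the sum of products of z-scores divided by n - 1); the function name `correlation` and the tiny datasets are made up for illustration. One dataset is perfectly linear and the other follows a perfect curve, to show that r only picks up linear association.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]  # perfect positive linear relationship
y_curved = [v ** 2 for v in x]     # perfect, but non-linear, relationship

r_linear = correlation(x, y_linear)
r_curved = correlation(x, y_curved)

print(f"linear data: r = {r_linear:.3f}")
print(f"curved data: r = {abs(r_curved):.3f}")
```

The curved data has a perfect relationship between x and y, yet its correlation is zero, which is why a scatterplot should always accompany r.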


WK4 INFERENCE

Inference

Population parameter
o In inference we are interested in an aspect or measure of the population, e.g.
the population mean (mu). Such a measure is called a population parameter.
Sample statistic
o A (random) sample is a mirror image of the population except for the size. It
reflects all the characteristics of the population, but it is much smaller.
o A sample statistic is a number calculated from the sample, e.g. the proportion
(p hat, p^) of the object of interest in a sample
o Using the sample, we are going to calculate the counterpart of the population
parameter, e.g. the counterpart of the population mean (mu) is the sample mean (x bar)
o Types of sample statistics (and the parameter each one estimates)
- Proportion: p^ (estimates p)
- Mean: x bar (estimates mu)
- Difference in proportions: p^1 - p^2 (estimates p1 - p2)
- Difference in means: x bar1 - x bar2 (estimates mu1 - mu2)
Statistical inference
o Using the sample statistic as the best guess of what the population measure is.
o i.e. inference is the process of drawing a conclusion about a population using
information from a sample.
Inference notations
o Sample statistic (for a proportion) noted as p^
o Population parameter (for a proportion) noted as p
o In inference we want to figure out how reliable an indicator the sample
statistic is of the population parameter.

Calculating sample statistic example

o The point estimates (p hats) of p vary from sample to sample
o They almost never match the parameter exactly

Q: How reliable is the sample statistic?


a) It depends on how much the statistic varies from sample to sample


Sampling Distribution

o A sampling distribution is the distribution of the sample statistics from a
multitude of random samples of the same size taken from the same population.
o The plot shows how the sample statistic varies from sample to sample

The more spread out the sampling distribution is, the more the sample statistic varies
from one sample to the next.

Q: What will a sampling distribution look like?

Centre
o if the samples are randomly selected, the sampling distribution will be centered
around the population parameter
Shape
o for most of the statistics we consider, if the sample size is large enough the
sampling distribution will be symmetric and bell-shaped

Key info about sampling distribution


o if sampling bias exists (e.g., due to a non-random sample), the sampling
distribution may not be centered on the true parameter
o while centre and shape are important, our main interest is in the variability of the
sampling distribution
o the reliability of a statistic depends on how much it varies from sample to
sample

Q: How do we measure the variability of the sample statistic?

Standard Error (SE)


o SE = the standard deviation of the sample statistics, i.e. the standard deviation
of the sampling distribution
- the higher the SE, the more spread-out the sampling distribution
- the more spread-out the sampling distribution, the less reliable or precise
the sample statistic is.
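
To make the sampling distribution and the SE concrete, here is a rough simulation sketch. The population proportion (0.6), the sample size (100) and the number of samples (5,000) are arbitrary assumptions chosen for illustration. The simulated SE is compared with the standard textbook formula for the SE of a proportion, sqrt(p(1 - p)/n).

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

P_TRUE = 0.6        # hypothetical population proportion
N = 100             # size of each random sample
NUM_SAMPLES = 5000  # number of random samples drawn from the population

def sample_proportion(p, n):
    """Draw one random sample of size n and return its sample proportion."""
    return sum(rng.random() < p for _ in range(n)) / n

# The sampling distribution: one sample statistic per random sample.
p_hats = [sample_proportion(P_TRUE, N) for _ in range(NUM_SAMPLES)]

se_simulated = stdev(p_hats)                  # SD of the sample statistics
se_formula = sqrt(P_TRUE * (1 - P_TRUE) / N)  # textbook SE of a proportion

print(f"centre of sampling distribution: {mean(p_hats):.3f}")
print(f"simulated SE: {se_simulated:.4f}, formula SE: {se_formula:.4f}")
```

The centre of the simulated distribution sits close to the population parameter, and its standard deviation is close to the formula value. Increasing `N` shrinks both SE figures.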

For a sample proportion, $SE = \sqrt{p(1 - p)/n} = \sqrt{pq/n}$, where q = (1 - p)

Statistical inference

Interval Estimate
o Use the SE to calculate an interval estimate.


o An interval estimate gives a range of plausible values for a population
parameter
o A common form for an interval estimate is “sample statistic ± margin of error”
o The margin of error reflects the precision of the sample statistic as a point
estimate for the population parameter.
- A point estimate is a single value, e.g. the sample mean, that estimates the
population parameter

Q: How can we be sure that our interval estimate contains the population
parameter?

Confidence intervals
o A confidence interval for a parameter is an interval computed from a sample that
contains the parameter for a specified proportion of all samples
o the proportion of sample intervals that contain the parameter is called the
confidence level.
- a 95% confidence interval will contain the true parameter in 95% of all
samples
In the lecture plot of many 95% confidence intervals, the green lines are the intervals
(sample mean +- margin of error) that include the population parameter mu,
whereas the red lines are the intervals that do not include mu.

Q: How can you get a 95% confidence interval?

95% confidence interval (CI)


If your sampling distribution:
o is symmetric and bell-shaped
o is centred on the population parameter
o has a standard deviation = SE
Then we can recall the 95% rule, which is to take
sample statistic ± 2 x SE
Or in other words:
sample statistic ± margin of error


Margin of error
= z* x standard error

How to find n if you are given the margin of error (MOE) and confidence level:
→ $n = \frac{(z^*)^2 \, \hat{p}(1 - \hat{p})}{MOE^2}$
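
The sample size formula can be turned into a small helper. This is a sketch under stated assumptions: the function name `required_sample_size` is made up, z* = 1.96 is the standard normal value for 95% confidence (the notes round it to 2), and p^ = 0.5 is used as a default because it gives the largest, most cautious n when no estimate of p is available.

```python
from math import ceil

def required_sample_size(moe, p_hat=0.5, z_star=1.96):
    """Smallest n whose margin of error for a proportion is at most `moe`."""
    n = (z_star ** 2) * p_hat * (1 - p_hat) / moe ** 2
    return ceil(n)  # round up: we cannot sample a fraction of a case

n_3pts = required_sample_size(0.03)  # margin of error of 3 percentage points
n_5pts = required_sample_size(0.05)  # margin of error of 5 percentage points

print(f"MOE 0.03 needs n = {n_3pts}")
print(f"MOE 0.05 needs n = {n_5pts}")
```

Here a margin of error of 0.03 needs n = 1068 and a margin of 0.05 needs n = 385. Because MOE is squared in the denominator, halving the margin of error roughly quadruples the required sample size.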

o CI is a range of plausible values for a population parameter


o 95% CI means that 19 in 20 samples will yield an interval that contains the
population parameter
o we say we are “95% sure” or “95% confident” that the interval contains the truth
o “We are 95% confident that the true proportion of adults that consider the
economy a ‘top priority’ is between 0.84 and 0.88.”

The issue with our confidence interval/sampling distribution is that in practice we
only have one sample, so we cannot observe the sampling distribution directly.
To estimate the SE from that single sample we use bootstrapping, which resamples the
one sample (with replacement) to create many different combinations.


WK5 BOOTSTRAPPING
Bootstrapping
Key
o We use our sample to create an artificial or simulated population
o One way to do this is to make copies of the original sample
o e.g., if we made 1,000 copies of a sample of 100 Smarties we would get a simulated
population of 100,000 Smarties
o as long as the original sample is randomly chosen, the simulated population
should be very similar to the actual population.
o We can then sample repeatedly from the simulated population.
Bootstrap sample

In practice, we don’t build a simulated population this way, but we do essentially
the same thing by sampling with replacement from our original sample, without
changing the sample size or the observations of the original sample:

Steps for bootstrapping a sample


1. Randomly pick a case (observation) from the sample
2. record the value of interest for that case
3. replace (return) the case in the sample
4. repeat steps 1-3 exactly n times to get a bootstrap sample
5. calculate the bootstrap statistic for the bootstrap sample
6. repeat steps 1-5 many times to get many bootstrap statistics
7. compute the SE (the standard deviation of the bootstrap statistics) and find the
95% CI by sample statistic + or - 2 x SE

This process is tedious, so we use STATKEY
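
For readers who want to see the steps outside StatKey, they can be sketched in a few lines of Python. This is an illustrative sketch only: the 20 car prices are made-up numbers, the statistic is the sample mean, and 3,000 bootstrap samples is an arbitrary choice.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical original sample: prices of 20 used cars, in $1,000s.
sample = [7.0, 9.5, 11.0, 12.2, 13.0, 14.5, 15.0, 15.9, 16.0, 17.5,
          18.0, 19.5, 21.0, 22.0, 23.5, 25.0, 27.0, 30.0, 33.0, 37.0]
n = len(sample)
NUM_BOOTSTRAPS = 3000

boot_means = []
for _ in range(NUM_BOOTSTRAPS):
    # Steps 1-4: draw n cases with replacement from the original sample.
    boot_sample = rng.choices(sample, k=n)
    # Step 5: calculate the bootstrap statistic (here, the mean).
    boot_means.append(mean(boot_sample))

# Step 7: the SE is the standard deviation of the bootstrap statistics.
se = stdev(boot_means)
x_bar = mean(sample)
ci = (x_bar - 2 * se, x_bar + 2 * se)

print(f"sample mean = {x_bar:.2f}, bootstrap SE = {se:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

`rng.choices(sample, k=n)` samples with replacement, so each bootstrap sample has the same size as the original but may repeat some cases and omit others.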

Example of bootstrap sample

Bootstrap Standard Error (SE)


• In principle, to calculate the SE for our sample statistic for the population mean,
we would take many samples, calculate many sample means and put them in a
sampling distribution so we could generate a confidence interval.


The CI, which we know from the 95% rule, is our sample statistic + or - 2 x the SE and
will include the population parameter 95% of the time.

• The problem is that we only have one sample; we don’t always have the
luxury of taking repeated samples from the population to generate a sampling
distribution.
• As we don’t have a sampling distribution, an alternative way of finding the SE is
bootstrapping. Bootstrapping is our clever way of using a single sample taken
from the population to mimic what the sampling process would do if we were
able to take a large number of samples from the population. It gives us a
distribution of estimates of the sample mean; from this distribution
we calculate the standard error to formulate our confidence interval.

Describing CI from Bootstraps

Quantitative example Pg 7 Lec 5


o Interpretation: based on our sample, we are 95% confident that the mean
population price of a used Ford Mustang car is between $11,624 and $20,336
Categorical example Pg 11 Lec 5


o Interpretation: we are 95% certain that the true percentage who believe there is
‘solid evidence’ of global warming lies between 57% and 61%

Example continued:
Does belief in global warming differ by political party?
The sample proportion answering ‘yes’ was 79% among Democrats and 38% among
Republicans.
Calculate a 95% CI for the difference in proportions
o sample sizes are not given; assume n = 1,000 for each party
o We want to find pD - pR (Democrats - Republicans)
o The sample statistic is p^D - p^R = 0.79 - 0.38 = 0.41
o SE from StatKey = 0.02, so the 95% CI is 0.41 +- 2 x 0.02 → (0.37, 0.45)
o Interpretation: we are 95% sure that the difference in the proportion of
Democrats and Republicans who believe in global warming is between 37 and
45 percentage points
- We are a long way from saying there is no difference between these
groups
- The CI is saying that the difference could be as low as 37 percentage points and
as high as 45 percentage points
o the confidence interval does not include zero. We are 95% sure that the
difference is not zero; it is positive.
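
The StatKey figure can be roughly reproduced with a bootstrap sketch. It uses the same assumption as the notes (n = 1,000 per party) and the reported sample proportions; the number of bootstrap samples (1,000) and the 0/1 coding of the answers are choices made only for this illustration.

```python
import random
from statistics import stdev

rng = random.Random(7)  # fixed seed so the sketch is reproducible

N_PER_PARTY = 1000     # assumed sample size for each party (as in the notes)
NUM_BOOTSTRAPS = 1000  # arbitrary number of bootstrap samples

# 1 = "yes", 0 = "no": 79% yes among Democrats, 38% yes among Republicans.
democrats = [1] * 790 + [0] * 210
republicans = [1] * 380 + [0] * 620

def resampled_proportion(group):
    """Proportion of 'yes' in one bootstrap sample drawn with replacement."""
    return sum(rng.choices(group, k=len(group))) / len(group)

observed_diff = sum(democrats) / N_PER_PARTY - sum(republicans) / N_PER_PARTY

# Bootstrap statistic: difference in resampled proportions (D minus R).
boot_diffs = [resampled_proportion(democrats) - resampled_proportion(republicans)
              for _ in range(NUM_BOOTSTRAPS)]

se = stdev(boot_diffs)
ci = (observed_diff - 2 * se, observed_diff + 2 * se)

print(f"difference = {observed_diff:.2f}, bootstrap SE = {se:.3f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Each party is resampled separately, which mirrors the fact that the two groups are independent samples. The resulting SE should be close to the 0.02 quoted from StatKey.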

Sample size v Bootstrap samples


o the larger n, the smaller the SE and the narrower the CI
- larger n implies a more precise sample statistic
• more data means you have better information
o increasing the number of bootstrap samples beyond 2,000-3,000 (or so) will have
little effect on the SE (and hence on the CI)
- a larger number of bootstraps has no impact on precision
• no more “information” about variability once the unevenness in a
small number of bootstraps has been removed

Changing the CI
> or < 95% confident
o As long as the bootstrap distribution is roughly symmetric, we can construct any
confidence interval we want by finding the appropriate percentiles in the
bootstrap distribution.


o For a P% confidence interval, we keep the middle P% of bootstrap statistics.
Meaning the middle P% defines the upper and lower bounds of the P% CI.
o e.g., for a 99% confidence interval, we keep the middle 99%, leaving 0.5% in
each tail
o hence the 99% CI is (0.5th percentile, 99.5th percentile) from the bootstrap
distribution

If you raise the confidence level (95% → 99%) you get a wider confidence interval to
be more certain that you include the true value. This means that to be ‘more
confident’ that the population parameter falls within the CI we need to have a larger
interval.
higher confidence level → wider confidence interval

Methods to calculate the Bootstrap confidence interval


Method 1
o Estimate the SE by finding the standard deviation of the bootstrap distribution,
then generate a 95% confidence interval by:
o sample statistic ± 2 x SE (Margin of error)

Method 2
o Generate a P% confidence interval as the range for the middle P% of bootstrap
statistics

NOTE: both should yield almost identical results for a 95% CI
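
The two methods can be compared side by side with a sketch like the one below. The survey data (500 answers, 295 of them "yes") is hypothetical, and `percentile_ci` is a made-up helper name; it simply trims the same number of bootstrap statistics from each tail of the sorted bootstrap distribution.

```python
import random
from statistics import stdev

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical sample: 500 survey answers, 295 "yes" (1) and 205 "no" (0).
sample = [1] * 295 + [0] * 205
n = len(sample)
NUM_BOOTSTRAPS = 2000

# Bootstrap distribution of the sample proportion, sorted for percentiles.
boot_props = sorted(sum(rng.choices(sample, k=n)) / n
                    for _ in range(NUM_BOOTSTRAPS))

def percentile_ci(sorted_stats, level):
    """Method 2: keep the middle `level` share of the bootstrap statistics."""
    b = len(sorted_stats)
    tail = round(b * (1 - level) / 2)  # statistics trimmed from each tail
    return sorted_stats[tail], sorted_stats[b - tail - 1]

p_hat = sum(sample) / n
se = stdev(boot_props)

ci_se_95 = (p_hat - 2 * se, p_hat + 2 * se)  # Method 1
ci_pct_95 = percentile_ci(boot_props, 0.95)  # Method 2, 95%
ci_pct_99 = percentile_ci(boot_props, 0.99)  # Method 2, 99%

print(f"Method 1 (95%): ({ci_se_95[0]:.3f}, {ci_se_95[1]:.3f})")
print(f"Method 2 (95%): ({ci_pct_95[0]:.3f}, {ci_pct_95[1]:.3f})")
print(f"Method 2 (99%): ({ci_pct_99[0]:.3f}, {ci_pct_99[1]:.3f})")
```

With a roughly symmetric bootstrap distribution the two 95% intervals nearly coincide, and the 99% interval is visibly wider than either.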


o The two methods for creating a confidence interval only work if the bootstrap
distribution is smooth and symmetric.
o If the bootstrap distribution is highly skewed or looks ‘spiky’ with gaps, you need
to go beyond introductory statistics methods to create a confidence interval

To reduce the SE you need to increase the number of cases in the original sample

WK6 HYPOTHESIS TESTING

Hypothesis Tests
A hypothesis test is designed to measure evidence against the null hypothesis, and
determine whether this evidence is sufficient to conclude in favour of the alternative
hypothesis.
• Can we refute the null hypothesis and support the alternative hypothesis?
• NOTE:
• Hypothesis tests are written in terms of population parameters, not sample
statistics
A test is framed in terms of two competing hypotheses:
Null Hypothesis
o The null hypothesis (H0) is a claim that there is no effect or difference between
variables and the result or cause and effect is just due to random chance.
- Usually includes an = sign
- E.g., There is no such thing as mind reading; the person who says they can
read your mind when you pick one of 5 cards is just guessing which card you
picked. If she is guessing you expect her to get 20% right, as she
is choosing from 5 cards
Alternative Hypothesis
o The alternative hypothesis (Ha) is a claim for which we seek evidence
- Ha usually includes one of >, <, or ≠ (depends on context)
- E.g., If mind reading does exist we would expect the proportion (p) of
cards guessed to be greater than 20%


Note: it is important to frame these competing hypotheses in terms of the
population parameter (here p = 0.2, i.e. 20%)

Ha is established by observing data that contradicts H0’s claim of no effect/difference

Writing out the hypotheses


Ho: Population parameter for Variable 1 = Population parameter of Variable 2
Ha: Population parameter for Variable 1 ≠ Population parameter of Variable 2 (V2)
or
Ho: Population parameter of (V1) - Population parameter of (V2) = 0
Ha: Population parameter of (V1) - Population parameter of (V2) ≠ 0

First thing to do when setting up the hypothesis is let the audience know what the
notation you are using is, i.e., for the Question: Does support for gun control differ by
gender?
Categorical Data
Pf: proportion of females supporting gun control
Pm: proportion of males supporting gun control
Ho: Pf=Pm
Ha: Pf≠Pm
Question: Do Uni students sleep less than 7 hours per night?
Quantitative Data
μ: average hours of sleep per night for students
Ho: μ=7
Ha: μ< 7

Question: How do we decide between H0 and Ha?


Statistically significant
o When the observed sample statistic is unlikely to occur by random chance
(assuming H0 is true), we say the result is statistically significant
o If a result is statistically significant, we have convincing evidence against H0 and
in favour of Ha. If not (we can’t say the null hypothesis is
incorrect for the population), the test is inconclusive.
- E.g., is a p^ of 1 statistically significant, i.e. does it give us enough evidence to
reject the null hypothesis? (assuming the null is true, how unlikely is it to
get this sample statistic?)

• If it is highly unlikely, we have statistically significant evidence against the null
hypothesis
• If it is not unlikely, we don’t really have much evidence against the null hypothesis

QUESTION how do we measure how unlikely a sample statistic is, if H0 is true?


To measure how unlikely a sample statistic is we need to construct a distribution of
statistics we would expect to find if the null hypothesis is true.

Use simulation
We simulate how unlikely it would be to observe 8 correct guesses out of 8 (sample
statistic, p^ = 1) in a population where the probability of getting a correct guess is
0.5 (H0 is true)

Randomization distribution (RD)


A simulated distribution used to mimic what the situation would be like if the null
hypothesis was true. Unlike the bootstrap distribution, which is centred on the sample
statistic, the RD is centred on the value claimed by the null hypothesis.
- Must include the null hypothesis

Once you have the randomization distribution, we see where our sample statistic lies in
the distribution, i.e. how unlikely it is to get that statistic in a distribution where we
have imposed the null hypothesis. If the statistic is very unlikely, we claim we have
statistically significant evidence against the null hypothesis.

A randomization distribution shows a collection of statistics that would be
observed, by random chance, if the null hypothesis were true.
• The randomization distribution plays the role of the sampling distribution we would
get from the underlying population if the null hypothesis were true.

P Value
 The chance of obtaining a sample statistic as extreme as (or more extreme than) the
observed sample statistic, if H0 is true
 The p-value is the proportion of statistics in a randomization distribution that are
at least as extreme as the observed sample statistic

You calculate the p-value to determine the likelihood of getting the sample
statistic if the null hypothesis is true (assuming it holds in the population). If it is a low
percentage, it is very unlikely that the sample statistic took its value by
chance/from guessing → evidence against the null hypothesis
 The smaller the p-value, the stronger the evidence against Ho
 p-value is small → reject the null in favour of the alternative
 p-value isn’t small → do not reject the null as the test is inconclusive

Example:
 The p-value is the chance of getting p^ = 1 (the observed sample statistic) when
p = 0.5 (H0 is true).
 To get this, we find the proportion of sample statistics in the randomization
distribution that are at least as extreme as p^ = 1
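
A minimal sketch of this simulation, assuming 8 guesses and a null probability of 0.5 as in the example above. The function name, the number of simulations and the seed are illustrative choices, not part of the notes.

```python
import random

def randomization_p_value(n_guesses=8, observed_correct=8, p_null=0.5,
                          n_sims=100_000, seed=1):
    """Proportion of simulated samples at least as extreme as the observed one."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        # One simulated sample under H0: each guess is correct with probability p_null
        correct = sum(rng.random() < p_null for _ in range(n_guesses))
        if correct >= observed_correct:  # right tail: as many or more correct guesses
            extreme += 1
    return extreme / n_sims

p_value = randomization_p_value()
exact = 0.5 ** 8  # exact chance of 8 correct guesses out of 8 when p = 0.5
print(f"simulated p-value: {p_value:.4f}, exact: {exact:.4f}")
```

The simulated proportion should sit close to the exact value of 0.5 to the power 8, which is well below the usual 5% cutoff.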

Significance level
The significance level (α) is the threshold below which the p-value is deemed small
enough to reject the null.
o if p-value < α → reject H0 in favour of Ha
- the result is statistically significant
o if p-value ≥ α → do not reject H0
- the result is “not statistically significant”; the test is inconclusive
o Default α = 0.05 (5%). Other levels of α that are sometimes used are 1% or
10%
Note:
You cannot say that you have proved the null hypothesis is true. You cannot prove
anything with empirical data in statistics; all we have is a degree of likelihood.

Examples:
A sample statistic gives a p-value of 0.035. Using α = 0.05, is it statistically
significant?
The p-value is lower than α = 0.05, so the result is statistically significant. Therefore,
we reject H0.

What if the p-value had been 0.208?
The p-value is higher than α = 0.05, so the result is not statistically significant.
Therefore, we do not reject H0. The test is inconclusive.
- This does not mean we have proved that the null hypothesis is correct; all we have
found is a sample statistic that is consistent with the null hypothesis at a
significance level of 0.05

Two types of alternative hypothesis (Ha)


- One sided Ha contains either > or <
- Two sided Ha contains ≠

Calculating P Value
- The p-value is the proportion in the tail in the direction specified by Ha
o If Ha has a > sign you are looking at the right tail, < the left tail and ≠ both
tails
- For a two-sided alternative, the p-value is twice the proportion in the smaller
tail.
o Therefore you need to multiply the smaller tail proportion by 2 when calculating
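
The tail rules can be sketched as a small helper. This is an illustration only: the function name, the tail labels and the toy list of statistics are assumptions, not something from the course software.

```python
def tail_p_value(rand_stats, observed, tail):
    """p-value from a list of randomization statistics.

    tail is 'left' (Ha uses <), 'right' (Ha uses >) or 'two' (Ha uses ≠).
    """
    n = len(rand_stats)
    left = sum(s <= observed for s in rand_stats) / n    # proportion at or below
    right = sum(s >= observed for s in rand_stats) / n   # proportion at or above
    if tail == "left":
        return left
    if tail == "right":
        return right
    if tail == "two":
        # twice the smaller tail, capped at 1
        return min(1.0, 2 * min(left, right))
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Toy randomization distribution (made-up numbers, for illustration only)
toy = [-3, -2, -1, -1, 0, 0, 0, 1, 1, 2]
print(tail_p_value(toy, 2, "right"))  # one statistic out of ten is >= 2
print(tail_p_value(toy, 2, "two"))    # double the smaller tail
```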


Summary
1) Define the notation (p, µ or a difference in p or a difference in µ) and formulate the
null (H0) and alternative (Ha) hypotheses
2) Work out how to construct the randomization distribution showing the statistics
that would be observed if H0 were true
3) Plot the randomization distribution
4) Calculate a p-value as the proportion in the randomization distribution as extreme
as the observed statistic
5) Reject H0 if p-value < α; otherwise the test is inconclusive.


WK8 RANDOMIZATION DIST.


Steps for a hypothesis test
1. Set up null and alternative e.g. H0: µ = 0, Ha: µ > 0.
2. Find the sample statistic x bar
3. Derive the randomization distribution, which shows the sample statistics we
would observe by random chance if H0 were true
4. Find the p-value: chance of observing x bar if H0 were true
a. the proportion of statistics in the randomization distribution as extreme
as x bar
5. Compare the p-value to α (significance level) and make the decision
a. The smaller the p-value, the stronger the evidence against H0

Deriving a randomization distribution


We need to simulate samples assuming H0 is true
Find sample statistic (e.g., x bar) for each simulated sample
Plot all the simulated sample statistics to form the randomization distribution

How do we simulate samples assuming H0 is true?
Depends whether original data are from an experiment or an observational study

Experimental study

 Experiment: a treatment and a control group with random allocation to each
group
 If H0 is true (no difference between groups), the response values will be roughly
the same regardless of the allocation.
- Because we are assuming there is no difference between the
two groups in an experiment, we can essentially pool them and
randomly reallocate the cases to a control group and a treatment group.
- By doing this many times, you create the randomization distribution.
o The approach in randomization is to take away the initial difference
between the groups by merging them
 To simulate what would happen by chance if H0 were true:
- We can randomly reallocate cases to groups, keeping the response values the
same.

Example:
In an experiment evaluating a ‘cure’ for cocaine addiction, 48 people were randomly
assigned to take either Desipramine (a new drug) or Lithium (an existing drug). The
subjects were subsequently tested to see who relapsed into cocaine use.

RESEARCH QUESTION:
Is Desipramine more effective than Lithium at treating cocaine addiction? (lower
relapse rate)


 State the null and alternative hypotheses


 State the possible conclusions

Parameters
Pd - proportion who relapse after taking Desipramine
Pl - proportion who relapse after taking Lithium

Null & Alternative hypothesis


Ho: Pd - Pl = 0
Ha: Pd - Pl < 0

Possible conclusions
 Reject Ho: We have statistically significant evidence that Desipramine is more
effective than Lithium.
 Do not reject Ho - We cannot determine, from the data, whether Desipramine is
more effective than Lithium. The test is inconclusive.

Original Sample
1) Randomly assign units into treatment groups - splitting the original experiment
group into two groups, one that will test using Desipramine and the other with
Lithium.
2) Conduct the experiment - Administer the drugs
3) Observe the relapse counts in each group - R= Relapse, N= No relapse.

Generating a randomization distribution


(Statistics we would observe by random chance if H0 were true)

Experiment: 28 relapsed, 20 didn’t → create simulated statistics as follows

1) Take 48 bits of paper, write ‘R’ on 28 and ‘N’ on 20.
2) Pool the slips and randomly divide into two groups of 24 (representing a
‘Desipramine group’ and a ‘Lithium group’)
a. Calculate the difference in proportions Pd hat − Pl hat
3) Repeat step 2 many, many times; on average the difference will be zero


You would repeat this thousands of times (do this on StatKey) to build up your
randomization distribution
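
The slips-of-paper procedure can be sketched in code. The pooled counts (28 relapses, 20 non-relapses, groups of 24) come from the notes. The observed split used for the p-value (10 relapses on Desipramine, 18 on Lithium) is NOT stated in these notes and is an assumption for illustration only.

```python
import random

def reallocation_diffs(n_relapse=28, n_no_relapse=20, group_size=24,
                       n_sims=5000, seed=1):
    """Differences in relapse proportions under random reallocation (H0 true)."""
    rng = random.Random(seed)
    slips = ["R"] * n_relapse + ["N"] * n_no_relapse
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(slips)                      # pool and re-deal the slips
        desipramine = slips[:group_size]
        lithium = slips[group_size:]
        p_d = desipramine.count("R") / len(desipramine)
        p_l = lithium.count("R") / len(lithium)
        diffs.append(p_d - p_l)
    return diffs

diffs = reallocation_diffs()
mean_diff = sum(diffs) / len(diffs)

# ASSUMPTION (not in the notes): 10 of 24 relapsed on Desipramine, 18 of 24 on Lithium
observed = 10 / 24 - 18 / 24
# Left tail, because Ha is Pd - Pl < 0 (small tolerance guards against float rounding)
p_value = sum(d <= observed + 1e-9 for d in diffs) / len(diffs)
print(f"mean simulated difference: {mean_diff:.3f}, p-value: {p_value:.3f}")
```

The simulated differences centre on zero because reallocation imposes H0. The p-value is only as meaningful as the assumed observed split.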

Observational Study
 Observational study: random sample from the population
 To simulate what would happen if H0 were true, we simulate re-sampling from a
population in which H0 is true
 We already know we can use the bootstrap to simulate re-sampling from a
population when we have only one sample


 To generate randomization samples for observational studies we just add a step:
make H0 true, then bootstrap

In an observational study, because we can’t fabricate two groups and we haven’t
eliminated confounding variables, we must find alternative ways to create a
distribution that imposes the null hypothesis.

1) Make H0 true by sliding the bootstrap distribution to centre on the null value (e.g., µ = 0)
2) Make the decision using the p-value for x bar in the randomization distribution.

Our bootstrap distribution is centred on x bar, but we need it to be centred on the null
value (here 0). Therefore, we shift/slide the bootstrap distribution so that it is centred
on 0, which gives us a randomization distribution: a set of
sample statistics for which Ho is true. We can then take our original sample statistic (x
bar) and compute our p-value.

Example A (two tail)


 Body temperature
Parameter:
 µ = population mean human body temperature (F - Fahrenheit)
Null & Alternative Hypothesis
Ho: µ = 98.6
Ha: µ ≠ 98.6
(two tail - the p-value is 2 times the proportion in the smaller tail)

 To achieve a randomization distribution imposing that the null hypothesis is true,
we take the sample (x bar = 98.26) and add 0.34 to each sample
value, so that the shifted sample has a mean of 98.6.
- Note: this does not affect the standard deviation of the sample (only the mean)
 Bootstrapping from this revised sample lets us simulate samples assuming Ho is
true
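
A sketch of the shift-then-bootstrap idea for a single mean. The ten temperatures below are HYPOTHETICAL values chosen so that their mean is 98.26. The real data set is not reproduced in these notes, and the function name and settings are illustrative.

```python
import random
from statistics import mean

def shifted_bootstrap_means(sample, null_mean, n_sims=5000, seed=1):
    """Bootstrap means after shifting the sample so its mean equals null_mean."""
    rng = random.Random(seed)
    shift = null_mean - mean(sample)            # e.g. 98.6 - 98.26 = 0.34
    shifted = [x + shift for x in sample]       # same spread, new centre
    return [mean(rng.choices(shifted, k=len(shifted))) for _ in range(n_sims)]

# HYPOTHETICAL sample (mean 98.26), for illustration only
sample = [97.8, 98.0, 98.2, 98.4, 98.6, 98.1, 97.9, 98.7, 98.3, 98.6]
x_bar = mean(sample)

rand_means = shifted_bootstrap_means(sample, null_mean=98.6)
centre = mean(rand_means)

# Two-tailed test: double the proportion in the smaller (here, left) tail
left = sum(m <= x_bar for m in rand_means) / len(rand_means)
right = sum(m >= x_bar for m in rand_means) / len(rand_means)
p_value = min(1.0, 2 * min(left, right))
print(f"centre: {centre:.2f}, x bar: {x_bar:.2f}, p-value: {p_value:.4f}")
```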


Example B (right tail)


 Do males exercise more than females
Parameters
µm = the mean amount of exercise for males
µf = the mean amount of exercise for females
Null & Alternative Hypothesis
Ho: µm = µf
Ha: µm > µf

There are 3 ways to make Ho true (equal means)

1) Add 3 to the female values (or subtract 3 from male values)
a. “Shift Groups” in StatKey
2) Randomly re-label sample ‘male’ and ‘female’
a. “Reallocate Groups” (default) in StatKey
3) Combine samples and select ‘male’ and ‘female’ groups
a. “Combine Groups” in StatKey
Then bootstrap on the revised samples


Example C (left tail)


 Are blood pressure and heart rate negatively related?
Parameters
ρ (rho) is the population correlation between blood pressure and heart rate
Null and Alternative hypothesis
Ho: ρ = 0
Ha: ρ < 0 (negative correlation)

 To make Ho true, we must remove the association between the two variables
 We can “break” the association by randomly shuffling one of the variables
 Each time we do this, we get a sample we might observe just by random chance,
if there really is no correlation
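
A sketch of the shuffling approach for a correlation. The blood pressure and heart rate values are HYPOTHETICAL (made up for illustration), and the helper names are assumptions.

```python
import random
from statistics import mean

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def shuffled_correlations(xs, ys, n_sims=5000, seed=1):
    """Correlations after randomly shuffling ys, which breaks any association."""
    rng = random.Random(seed)
    ys = list(ys)                      # copy so the caller's data is untouched
    results = []
    for _ in range(n_sims):
        rng.shuffle(ys)
        results.append(correlation(xs, ys))
    return results

# HYPOTHETICAL data, for illustration only
blood_pressure = [110, 115, 120, 125, 130, 135, 140, 145, 150, 155]
heart_rate = [82, 85, 78, 80, 74, 77, 70, 73, 69, 66]

r_observed = correlation(blood_pressure, heart_rate)
rand_rs = shuffled_correlations(blood_pressure, heart_rate)
# Left tail, because Ha is a negative correlation
p_value = sum(r <= r_observed for r in rand_rs) / len(rand_rs)
print(f"r = {r_observed:.3f}, p-value = {p_value:.4f}")
```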

Summary of generating randomization distributions


A) Randomized experiments
- Re-randomize groups keeping response values fixed
B) Observational studies
- Paul the Octopus (single proportion): flip a coin 8 times
- Exercise and gender (difference in means), body temperature (single mean): shift to
make H0 true, then bootstrap.
- Blood pressure and heart rate (correlation): randomly shuffle one variable
 Bootstrap distribution ↔ confidence interval
 Randomization distribution ↔ hypothesis test

Statistical errors
There are 4 possible outcomes in a hypothesis test:

                    Ho is true          Ho is false
Reject Ho           Type I error        Correct decision
Do not reject Ho    Correct decision    Type II error

Type I error - We reject Ho based on our sample, but in the underlying
population Ho is true.
Type II error - We do not reject Ho based on our sample, but in the
underlying population Ho is false.


Analogy to law
A person is innocent (Ho) until proven guilty (Ha)
Evidence (p-value) must be beyond reasonable doubt (significance level)
Possible mistakes in any verdict:
 Convict an innocent person (type 1 error)
 Release a guilty person (type 2 error)

Probability of a Type I error

Recall a randomization distribution is the distribution of sample statistics if H0 is true.

If Ho is true and α = 0.05, then
 5% of statistics will be in the tail (red)
 5% of statistics will lead to rejecting Ho

 The probability of making a Type I error (rejecting a true H0) is the
significance level, α
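
One way to see this claim is to run many tests on data generated with H0 true and count how often H0 is rejected. The sketch below assumes a right-tailed z-test on normal data with a known standard deviation, which is a simplification chosen for illustration and not the randomization method used in these notes.

```python
import random
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, n=30, n_tests=4000, seed=1):
    """Share of tests that reject H0 when H0 (mu = 0) is actually true."""
    rng = random.Random(seed)
    mu0, sigma = 0.0, 1.0                 # ASSUMED population under H0
    se = sigma / n ** 0.5
    standard_normal = NormalDist()
    rejections = 0
    for _ in range(n_tests):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / se
        p_value = 1 - standard_normal.cdf(z)   # right-tailed p-value
        if p_value < alpha:
            rejections += 1                    # a Type I error, since H0 is true
    return rejections / n_tests

rate = type_one_error_rate()
print(f"rejection rate when H0 is true: {rate:.3f}")
```

The rejection rate should land near the chosen α, up to simulation noise.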

Probability of a Type II error

The probability of making a Type II error (not rejecting a false H0) depends on:
 effect size (magnitude of the effect being measured)
 sample size (more information implies fewer mistakes)
 variability in the data (more variability implies more mistakes)
 significance level α (smaller α implies more non-rejections)

Decreasing the probability of a Type I error (reducing α) increases the probability of a
Type II error - there is a trade-off between Type I and II errors.

Choosing the significance level (α)
As statisticians, we can adjust the significance level. However, you
need to decide which way to adjust α, as this determines which error
becomes more likely.

As there is a trade-off between the two errors, you need to identify which is worse to
decide which way you want to adjust the significance level.

Which is worse - Type I or Type II error?

 If a Type I error (rejecting a true null) is much worse than a Type II error, we
may choose a smaller α, like α = 0.01
- e.g., death penalty for conviction (H0: accused is innocent)

 If a Type II error (not rejecting a false null) is much worse than a Type I
error, we may choose a larger α, like α = 0.10
- e.g., adding fluoride to the water supply (H0: fluoride has no effect)


The context is going to tell you which level of significance to choose but the (social
sciences) conventional level is α = 0.05

Density Curves

 A density curve is a formula describing a variable’s distribution

Think of a density curve as an idealized histogram where
1) The total area under the curve is 1
2) The proportion of the population in an interval is the area under the curve for
that interval

An example of a density curve is the normal (Gaussian) distribution

Parameters of a normal distribution

If a variable X has a normal distribution, we write:

X ~ N(µ, σ)

 X means the variable
 ~ means has the distribution of/is distributed as/has the density of.
 N means normal
 µ means the population mean
 σ means the population standard deviation

There is a family of normal distributions
 Each distribution is uniquely identified by
1) Mean µ - centre of symmetry
2) Standard deviation σ - spread

How does this relate to what we have been doing for the past few weeks?
Example:
Atlanta commute times
 A bootstrap distribution of x bar for Atlanta commute times has a mean of 29.11
and St. Dev. of 0.93.

A normal distribution with the same mean and standard deviation, N(29.11, 0.93), looks
to be a very good approximation to the bootstrap distribution

1) If we can approximate a bootstrap distribution with a normal dist.
a. We can construct a CI using the normal dist.
i. If we can find the SE, we can approximate instead of bootstrapping.
2) If we can approximate a randomization distribution with a normal dist.
a. We can compute a p-value using the normal dist.
i. If we can find the SE, we can approximate instead of
randomizing


WK9 NORMAL DIST & CLT


 Normal distributions are used as approximations to our randomization and
bootstrap distributions.
 Use it for CI and Hypothesis testing.

Week 8 example continued.


Instead of looking at the prop. of data in the right tail of a bootstrap dist. or
randomization dist.
 Approximate using N(µ, σ) and find the area under the curve.

Formula for a Norm Dist.


f(x) = 1 ÷ (σ x sqrt(2π)) x e^(-(x - µ)² ÷ (2σ²))

Area between a and b = integral of f(x) from a to b
 We use StatKey for this instead
 Issue = each norm dist. is different
 A shortcut for this is the 95% rule.
The 95 percent rule is true for all normal distributions
 95% of values lie within 2 SD of the mean
 68% of values lie within 1 SD of the mean
 In fact, the proportion of values that lie within any given number of SD from
the mean is the same for all normal distributions
 What matters is the number of SD’s from the mean
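
A quick check of the rule, using Python's standard library `statistics.NormalDist`. The first two distributions are ones mentioned in these notes. The third pair of parameters is arbitrary and only there to show the areas do not depend on µ or σ.

```python
from statistics import NormalDist

def area_within(mu, sigma, k):
    """Area under N(mu, sigma) within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# (0, 1) is the standard normal, (29.11, 0.93) is the Atlanta commute bootstrap
# distribution from the notes, (100, 15) is an arbitrary extra example
for mu, sigma in [(0, 1), (29.11, 0.93), (100, 15)]:
    one_sd = area_within(mu, sigma, 1)
    two_sd = area_within(mu, sigma, 2)
    print(f"N({mu}, {sigma}): within 1 SD = {one_sd:.4f}, within 2 SD = {two_sd:.4f}")
```

Every row prints the same two areas, roughly 68% and 95%.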

Z score (revisited):

By converting all the x values from X ~ N(µ, σ) to z values, z = (x - µ) ÷ σ, we get:

The standard normal
(Mean of 0 and SD of 1)
 The z-score converts any X ~ N(µ, σ) to Z ~ N(0, 1)
- we say the data for X have been standardized
- saves looking up a different normal distribution in every case

Normal Approximations
1) If we can approx. a bootstrap dist. with a norm dist.
 You can construct a CI using a norm dist.
2) If we can approx. a randomization dist. with a norm dist.
 We can compute a p-value using a norm dist.

Therefore, if you can find a way to compute the SE, you may use normal approximations
without generating the distributions.


Central Limit theorem (CLT)


In order to well-approximate a bootstrap or randomization dist. (distributions of a
mean/proportion) using a normal distribution, the CLT requires that the
samples be:
 Random and,
 Sufficiently large in size,

What constitutes a “sufficiently large size” varies:
 For quantitative variables that are not very skewed, n ≥ 30 is usually sufficient
 The more skewed the variable, the larger n has to be for CLT to work.
 For categorical variables, a count of at least 10 in each category is usually
sufficient
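
A small simulation can illustrate the CLT. The sketch below assumes a right-skewed population (exponential with mean 1 and SD 1) and samples of size 50. Both are illustrative choices, not values from the notes.

```python
import random
from statistics import mean, stdev

def simulate_sample_means(n, n_sims=5000, seed=1):
    """Means of n_sims random samples of size n from a skewed population."""
    rng = random.Random(seed)
    # expovariate(1.0): exponential population with mean 1 and SD 1 (right-skewed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(n_sims)]

n = 50
means = simulate_sample_means(n)
centre = mean(means)
spread = stdev(means)
predicted_se = 1.0 / n ** 0.5      # population SD divided by sqrt(n)
# If the sample means are roughly normal, about 95% lie within 2 SE of the centre
within_2_se = sum(abs(m - 1.0) <= 2 * predicted_se for m in means) / len(means)
print(f"centre {centre:.3f}, spread {spread:.3f}, predicted SE {predicted_se:.3f}")
print(f"share within 2 SE: {within_2_se:.3f}")
```

Even though the population is skewed, the sample means cluster around the population mean with a spread close to the predicted SE.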

Norm. Approx. for a CI

We compute a 95% CI in 2 ways
1) Using the bootstrap
2) Applying the CLT and using the standard normal dist.

 We can use the norm dist. if the CLT applies
 1 and 2 should give approximately the same answer if the CLT applies

Example
In a random sample of n = 1771 12-19 year olds, 345 had some hearing loss.


Using the N(0,1) approximation


 The CLT applies because 1426 > 345 > 10
 1426 do not have hearing loss, 345 do have hearing loss. Both are greater
than 10 which is the cut off for being able to use the CLT.
 For the 95% CI, the normal cutoffs are -1.96 and +1.96
 (For the 95% rule we were rounding 1.96 to 2)
 We can find the 95% CI using
 Statistic ± 1.96 x SE

CIs for other confidence levels using N(0, 1)

 z* is the cutoff for the desired confidence level
 z* = 1.96 for 95% confidence
 z* = 1.645 for 90% confidence
 z* = 2.576 for 99% confidence
 Find CI using: Statistic ± z* x SE
The area between -z* and +z* in N(0, 1) reflects the desired level of confidence

This cutoff value is called the Critical Value
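
The critical values and the hearing loss interval can be reproduced with the standard library. This sketch assumes the usual standard error for a sample proportion, SE = sqrt(p^(1 - p^) ÷ n). The function names are illustrative.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value leaving (1 - confidence) / 2 in each tail of N(0, 1)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_ci(count, n, confidence=0.95):
    """Normal-approximation CI for a proportion: p^ ± z* x SE."""
    p_hat = count / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    margin = z_star(confidence) * se
    return p_hat - margin, p_hat + margin

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_star(level):.3f}")

# Hearing loss example from the notes: 345 out of 1771
low, high = proportion_ci(345, 1771)
print(f"95% CI for the proportion with hearing loss: ({low:.3f}, {high:.3f})")
```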

Norm Approx. for Hypothesis Testing

 As always, we reject Ho if p-value < a (significance level)

Example
Test the following hypotheses using data on SAT scores.
 Ho: µ first born - µ not first born = 0
 Ha: µ first born - µ not first born > 0
From the sample
 x bar first born - x bar not first born = 30.26
 SE = 37
 One way to do the test: (from rand dist.) area beyond 30.26 in N(0, 37)
 P-value = 0.207
 Do not reject the null hypothesis (the p-value is greater than any
conventional significance level)
 Another way to do the test: Standardized test statistic (z)

z = (sample statistic - null value) ÷ SE

This allows us to assess extremity on a common scale

If a statistic is normally dist. under Ho, the p-value is the N(0, 1) probability beyond z
 x bar first born - x bar not first born = 30.26
 SE = 37
 z = (30.26 - 0) ÷ 37 = 0.818
 Area beyond z in N(0, 1)
 P-value = 0.207 (same as before)
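
Both routes to the p-value can be checked with `statistics.NormalDist`. The numbers are the ones from the SAT example. The function name is an illustrative choice.

```python
from statistics import NormalDist

def right_tail_z_test(statistic, null_value, se):
    """Standardized test statistic and right-tail p-value under N(0, 1)."""
    z = (statistic - null_value) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Route 1: area beyond 30.26 in N(0, 37)
p_direct = 1 - NormalDist(0, 37).cdf(30.26)

# Route 2: standardize first, then use N(0, 1)
z, p_standardized = right_tail_z_test(30.26, 0, 37)

print(f"z = {z:.3f}")
print(f"p-value (direct) = {p_direct:.3f}, p-value (standardized) = {p_standardized:.3f}")
```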

Standard error for p^


To calculate SE:

SE = sqrt(p(1 - p) ÷ n)

Example
Census shows that 65.1% of homes are owner-occupied. Given random samples of
n = 100 homes, what is the standard error of p^?

SE = sqrt(0.651 x 0.349 ÷ 100) = 0.048

 The formula and the sampling distribution give the same answer
 Note: the sampling distribution looks normal

CLT for p^
 The distribution of p^ can be approximated by a norm dist. if:
 n*p ≥ 10 and n(1-p) ≥ 10
 n*p and n(1-p) are the expected counts in each of
the two categories. (Both categories have to be at least 10)
 If both conditions are satisfied, then

p^ ~ N(p, sqrt(p(1 - p) ÷ n)) approximately

 ~ with a dot means that p^ is approximately normally distributed.
 The equation means: p^ will be normally distributed with a mean of the
population value and a SE given by the square root formula.

CI for p
 From the generic CI formula
 Statistic ± z* x SE
 If n*p^ ≥ 10 and n(1-p^) ≥ 10, then a CI can be computed by:

p^ ± z* x sqrt(p^(1 - p^) ÷ n)

 Note: p (which we don’t know) is replaced with p^ in SE formula

Hypothesis Test for p


Ho: P = Po
Ha: P > Po

Calculate the standardized test statistic:

z = (p^ - po) ÷ SE

SE is calculated (as usual) assuming Ho is true

Steps to do this
1) Check n*po ≥ 10 and n(1-po) ≥ 10
2) If both conditions are satisfied, then the CLT applies and we may use the normal
approximation
3) Calculate z = (p^ - po) ÷ SE where SE = sqrt(po(1 - po) ÷ n)
4) Find the p-value as the area in the tail(s) beyond z in the standard normal
distribution
5) Use the p-value to make the decision about Ho
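
The five steps can be sketched for the right-tailed case (Ha: P > Po). The input of 60 successes out of 100 with Po = 0.5 is a HYPOTHETICAL example, and the function name is an assumption.

```python
from statistics import NormalDist

def one_proportion_z_test(count, n, p0):
    """Right-tailed z-test of Ho: p = p0 against Ha: p > p0."""
    # Steps 1-2: check the CLT conditions using the null proportion
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("CLT conditions fail: need n*p0 >= 10 and n*(1-p0) >= 10")
    # Step 3: the standard error imposes Ho by using p0, not p^
    p_hat = count / n
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    # Step 4: area in the right tail beyond z in N(0, 1)
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# HYPOTHETICAL data: 60 successes in a sample of 100, testing Po = 0.5
z, p_value = one_proportion_z_test(60, 100, 0.5)
alpha = 0.05
# Step 5: make the decision
decision = "reject Ho" if p_value < alpha else "do not reject Ho"
print(f"z = {z:.2f}, p-value = {p_value:.4f}: {decision}")
```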

Using Critical Value for a conclusion


 Compare the z statistic you computed with the critical value (z*) for the given
significance level (α), e.g., z* = 2.326 for α = 1% in a one-tailed test. For a
right-tailed test, if the z statistic is greater than z*, we
reject the null in favor of the alternative.

Examples for all types of tests for p, in week 10 workshop on excel

WK10 CLT, CI & HT FOR MEANS ETC

Hypothesis testing review


 Given sample statistic p^, significance level α, sample size n
 Randomization distribution
 Reject Ho if α > p-value for p^
 If CLT conditions are satisfied → p^ is approximately normal
 Reject Ho if α > p-value for p^
 Compute z = (p^ - po) ÷ SE and reject Ho if α > p-value for z
 Find z* given α (e.g., z* = 1.96 for α = 0.025 in one tail, the 95% confidence cutoff) and reject if z > z*


CLT for x bar
 If n ≥ 30, x bar is approximately N(µ, σ ÷ sqrt(n))
This means we can bypass the randomization process by using this normal dist. as
an approx. of the randomization dist. and we can use it to carry out the hypothesis test.
 Since we don't know σ (the population standard deviation), we approximate it
with the sample standard deviation, s, so SE = s ÷ sqrt(n)

 Problem: replacing σ with s means the standardized statistic no longer has a normal distribution
 Replacing σ with s in the SE formula changes the distribution of the standardized
statistic from a normal to a t-distribution.

T-distribution
 The t-dist. is similar to the N (0,1), but with 'fatter' tails
 A t-dist. is characterized by degrees of freedom (df)
 Degrees of freedom (df) are based on the sample size (n)
 As df increase, the t-dist becomes more similar to N (0,1)

The smaller the number of observations (which determines the df), the
more the t dist. will deviate from the standard normal dist.

We use the t dist. when we carry out hypothesis tests on means of
quantitative data.


CLT for µ
 If n ≥ 30 the standardized test statistic for a mean (using s) follows a t-distribution
with n-1 degrees of freedom
 If n < 30, the t-dist. only applies if x is approximately normal
 Generic formula for CI: statistic ± z* x SE
 Since we replace σ with s when calculating SE, we must use the t dist. instead of
the standard normal
 Therefore, the CI formula becomes: x bar ± t* x s ÷ sqrt(n)

Hypothesis test for µ


 Ho: µ = µo
 Ha: µ ≠ µo
 Standardized test statistic

t = (x bar - µo) ÷ (s ÷ sqrt(n))

 Test statistic has a t-dist. with df = n-1 (if n ≥ 30 or the underlying population is
normal), hence t ~ t(n-1)
 (the squiggle means this has a t dist. corresponding to n-1 df)
Steps to complete a hypothesis test for µ
 If n ≥ 30
 P-value is the area in the tail(s) beyond t in a t dist. with n-1 df
 Use the p-value to make the decision about Ho
 If n < 30
 Use p-value approach (above) if x is approximately normal
 Otherwise use randomization
 Either way this is referred to as a t-test
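
A sketch of a two-sided t-test for a single mean using only the standard library. Python has no built-in t-distribution, so the tail area is obtained by numerically integrating the t density (Simpson's rule). That numerical approach, the helper names and the ten HYPOTHETICAL temperatures are all assumptions for illustration. In practice StatKey or Excel would give the p-value directly.

```python
import math
from statistics import mean, stdev

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Area to the left of t: Simpson's rule on [0, |t|], then symmetry."""
    b = abs(t)
    h = b / steps                        # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                 # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def one_sample_t_test(sample, mu0):
    """Two-sided test of Ho: mu = mu0. Returns (t, df, p_value)."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)    # s ÷ sqrt(n)
    t = (mean(sample) - mu0) / se
    df = n - 1
    p_value = 2 * (1 - t_cdf(abs(t), df))  # both tails beyond |t|
    return t, df, p_value

# HYPOTHETICAL sample (n < 30, so we must ASSUME the data are roughly normal)
sample = [97.8, 98.0, 98.2, 98.4, 98.6, 98.1, 97.9, 98.7, 98.3, 98.6]
t, df, p_value = one_sample_t_test(sample, 98.6)
print(f"t = {t:.2f}, df = {df}, p-value = {p_value:.4f}")
```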

CLT for p^1 - p^2
 If the counts in each category are at least 10 in both samples
(n1*p1 ≥ 10, n1(1-p1) ≥ 10, n2*p2 ≥ 10, n2(1-p2) ≥ 10), then

p^1 - p^2 ~ N(p1 - p2, SE) approximately, where SE = sqrt(p1(1 - p1) ÷ n1 + p2(1 - p2) ÷ n2)

 If the CLT applies this means that the difference in sample proportions is
approximately normally distributed with a mean given by the underlying
population parameters and the SE calculated as shown
CI for p1 - p2
 Generic formula for CI: statistic ± z* x SE

(p^1 - p^2) ± z* x sqrt(p^1(1 - p^1) ÷ n1 + p^2(1 - p^2) ÷ n2)

 Note: We don't use a t distribution here because we are looking at proportions.
Proportions are always represented approximately by a normal dist.; the t dist.
only comes into play when looking at quantitative data and sample means

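Hypothesis tests for p1 - p2

 As usual we calculate

z = (sample statistic - null value) ÷ SE = ((p^1 - p^2) - 0) ÷ SE

 From the CLT and imposing Ho in the SE formula:

SE = sqrt(p(1 - p) ÷ n1 + p(1 - p) ÷ n2)

 Problem: what do we use for p1 and p2 if Ho is true (they will be the same)
 We can't use p^1 and p^2 because they will be different (won't impose Ho)

Pooled proportion
 If Ho is true, then p1 = p2 (which we call p)
 Our best guess of p is the pooled proportion

pooled p^ = (count in sample 1 + count in sample 2) ÷ (n1 + n2)

 Note: the pooled proportion will always be between p^1 and p^2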
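
A sketch of a two-proportion z-test that uses the pooled proportion in the standard error. The counts (40 of 100 versus 55 of 100) are HYPOTHETICAL, and a left-tailed alternative (Ha: p1 - p2 < 0) is assumed for the example.

```python
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Left-tailed test of Ho: p1 - p2 = 0 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)       # best guess of the common p under Ho
    se = (pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2) ** 0.5
    z = (p1_hat - p2_hat - 0) / se
    p_value = NormalDist().cdf(z)        # area in the left tail
    return pooled, z, p_value

# HYPOTHETICAL counts, for illustration only
pooled, z, p_value = two_proportion_z_test(40, 100, 55, 100)
print(f"pooled p^ = {pooled:.3f}, z = {z:.3f}, p-value = {p_value:.4f}")
```

The pooled value falls between the two sample proportions (0.40 and 0.55 here), as a weighted average must.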


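CLT for x bar 1 - x bar 2
 If n1 ≥ 30 and n2 ≥ 30

x bar 1 - x bar 2 ~ N(µ1 - µ2, sqrt(σ1² ÷ n1 + σ2² ÷ n2)) approximately

 Because we don't know σ1 and σ2 we use the sample equivalents s1 and s2

SE = sqrt(s1² ÷ n1 + s2² ÷ n2)

 Use t dist. with df equal to the smaller of n1 - 1 and n2 - 1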

CI for µ1 - µ2
 Generic formula for CI: statistic ± t* x SE
 (x bar 1 - x bar 2) ± t* x SE
 t* is from a t dist. with the desired confidence level
 for a 95% level and n-1 df
 t* leaves 2.5% in each tail of the t distribution
 for a 90% level and n-1 df
 t* leaves 5% in each tail of the t distribution


Hypothesis test for µ1 - µ2


 As usual calculate:

t = ((x bar 1 - x bar 2) - 0) ÷ SE

 Which follows a t dist. (assuming n1 ≥ 30 and n2 ≥ 30) and
SE = sqrt(s1² ÷ n1 + s2² ÷ n2)


Matched Pairs data


 Sometimes data consists of paired values
 Examples:
 Two measurements on an individual (e.g., before & after treatment) or
 Studies on two individuals that are similar i.e., twins
 Each case is matched with a similar case, and one case in each pair is given
each treatment
 Not only does paired data reduce the influence of confounding variables, it also
reduces the probability of Type II errors.
 You are looking at the same individual, so there are no confounding variables because all
characteristics are the same. This makes it easier for us to reject a false null hypothesis.


Matched pairs data vs separate samples


Paired data
 Each swimmer gets both types of swimsuit
 Two measurements for each swimmer
 Wetsuit speed data in one column, regular swimsuit speed data in another
column
Contrast with separate samples
 Some swimmers would get wetsuit others would get regular suit
 Speed data in one column along with a categorical variable in another column
(wetsuit or regular)

Inference with paired data


 For matched pairs we examine differences for each pair, as opposed to the
average difference between groups
 We construct a new variable of the differences, and carry out inference as for a
single mean.
 A test with paired data is just like a single mean test
 We can use the formulas from the single mean case

Finding sample size when given MOE and Confidence level



WK11 LINEAR REGRESSION


Linear regression
Is the most widely used statistical tool in economics and is used for prediction

 Correlation can only vary between -1 and +1, so this correlation indicates a very
strong positive linear association.

[Scatterplot: response variable (y) on the vertical axis, explanatory variable (x) on the horizontal axis]

Linear regression finds the straight line that gives the best “fit”
The best-fitting linear function: y = mx + b

The estimated regression line

y^ = bo + b1 x

 bo = the intercept and
 b1 = the slope
 The hat/circumflex reminds us that y^ is the value predicted by the
straight line itself, not an observed data point (x & y refer to the data points)
 In the restaurant bills example
 Tip hat = -0.292 + 0.182 Bill
 The equation can be used to predict y for any x
 E.g., the predicted tip on a bill of $59.33 is
 Tip hat = -0.292 + 0.182 * (59.33) = 10.506


How do we find the values of the intercept and the slope?
The best ‘fit’ makes the predicted values ‘closest’ to the observed values
- We want our line to be representative of the data, so we want our predictions to
be as close as possible to the observed values for the data we do have.
Define
 yi = observed response value for observation i
 y^i = predicted response value for observation i
Residual (ei)
 Residuals are the observed value minus the predicted value (how much the
data points vary from the predicted line of best fit)
 ei = yi - y^i
 Vertical distance between observed and predicted values
 E.g., the 2nd observation is a tip of $7 on a bill of $36.11
 Predicted value y^2 = -0.292 + 0.182 * (36.11) = 6.28
 Residual for 2nd observation is y2 - y^2 = 7.00 - 6.28 = 0.72
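
One standard way to make ‘closest’ precise is least squares: choose the intercept and slope that minimise the sum of squared residuals. The sketch below assumes that method (the notes' own derivation is not shown) and uses four made-up data points. It then reproduces the tip prediction and residual from the notes using the stated coefficients.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept and slope that minimise the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def predict(b0, b1, x):
    """Predicted response on the fitted line."""
    return b0 + b1 * x

# HYPOTHETICAL data to show the fitting step
fit_b0, fit_b1 = least_squares([1, 2, 3, 4], [2, 4, 5, 8])
print(f"fitted line: y^ = {fit_b0:.2f} + {fit_b1:.2f} x")

# Restaurant example: coefficients as given in the notes
tip_b0, tip_b1 = -0.292, 0.182
predicted_tip = predict(tip_b0, tip_b1, 36.11)
residual = 7.00 - predicted_tip          # observed minus predicted
print(f"predicted tip = {predicted_tip:.2f}, residual = {residual:.2f}")
```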

Calculating the residuals for all n observations

Interpreting the slope and intercept (b1 and bo)

Tip hat = -0.292 + 0.182 * Bill
 Slope = 0.182
 Tip is predicted to increase by $0.182 for every $1 increase in the bill
 Intercept = -0.292
 The predicted tip when the bill is zero
 The intercept may not have a sensible interpretation when zero lies outside
the range of the explanatory variable

Cautions
1) Only use a regression to predict over the range of x
a. If none of the x values are close to 0, then the intercept has no meaningful
interpretation
b. Once you start to predict outside of the range of x, you run the risk of making
a non-sensical prediction
2) Only use a regression when the association between x and y is approximately
linear
3) Beware of the influence of outlier observations

ALWAYS PLOT OUT THE DATA

Example

Which interpretation is correct?


1) A decrease of 0.89 in the birth rate corresponds to 1 year increase in predicted
life expectancy
2) Increasing life expectancy by 1 year will cause birth rate to decrease by 0.89
3) Neither

 The model is predicting birth rate based on life expectancy, not the other
way around which rules out #1
 You can only make conclusions about causality from a randomized
experiment. This rules out #2


Caution
 A regression of y on x may indicate a linear association, but this does NOT mean
that changes in x cause changes in y
 A regression may look more sophisticated than correlation, but that does not
give it special power to determine causality
 Causality can only be determined if the values of the explanatory variable are
determined randomly (randomized experiment)

Population parameters in a regression


 The population (true) regression is:

y = β0 + β1 x + ε

 We use the sample to estimate the population regression

y^ = bo + b1 x

 Hence
 β0 and β1 are the population parameters
 bo and b1 are sample statistics (or estimates)

Inference for the slope and intercept


Confidence intervals

 Sample stat ± critical value x SE, e.g. b1 ± t* x SE
 Uses a t distribution instead of a z distribution
 The SE is an estimate of the SD of the sample slope (how much it varies from sample
to sample)

Hypothesis tests (with df = n - 2)

 The degrees of freedom in a regression are n - 2 because you have estimated 2
things, the intercept and the slope.

How do we estimate the SE’s?


 Bootstrap (CI) or Randomization (HT)
 Formulas (embedded in the regression software)

Examples
Slope 95% CI for inkjet printers
 Price and printing speed (PPM) for n = 20 inkjet printers


 Conclude: we are 95% confident the population slope is between $49.95 and $131.81 per PPM
 Same procedure for the intercept
 However we would not expect the intercept to make any sense here, as it would
correspond to a printer that printed 0 pages per minute.
 In regression software, the default null hypothesis for both the slope and the
intercept is that the coefficient is 0.
Slope Hypothesis test for Inkjet printers

 Confirm the calculation for the slope t-statistic: t = (b1 - 0) ÷ SE
 Ho is decisively rejected: very strong evidence price is linearly related to printing
speed
 Same procedure for the intercept

Correlation
If we want to test linear association between two variables:
 We can use correlation
 Ho: ρ = 0
 Ha: ρ ≠ 0
Test statistic

t = r x sqrt(n - 2) ÷ sqrt(1 - r²)

 Find the critical value using the t-distribution with df = n-2


 Correlation test returns the same t stat as the slope test


 No accident: the two tests can be used interchangeably
 The correlation test asks whether there is a linear association between
PPM and price
 The slope test asks whether price changes linearly with PPM
 Both test the same thing:
 The linear association between x and y. Correlation and linear regression
are two ways of looking at the strength of that association
 Regression additionally gives a coefficient: for each 1 PPM increase in
printing speed, the predicted price of the printer increases by $90.89.
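
As a rough check that the correlation test gives the same t-statistic as the slope test, the sketch below plugs the printer values into the standard correlation test statistic t = r√(n − 2) / √(1 − r²). It assumes r = 0.74 (the rounded correlation used for R² in these notes) and n = 20, so the result only matches the software output up to rounding.

```python
import math

# Values quoted in the notes for the inkjet printers example.
r = 0.74   # sample correlation between PPM and price (rounded)
n = 20     # number of printers

# t-statistic for H0: rho = 0, compared against a t-distribution with df = n - 2
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

The value lands at roughly 4.7, the same size as the slope t-statistic.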
Check the residuals
 We have assumed the errors (ε) are randomly distributed around the regression
line
 Check this by examining a scatter-plot of the data and the regression line
 Look out for:
 Curved (non-linear) patterns in the data
 Consistently changing variation around the line
 Outliers

Scatterplots of data and regression lines


Influence of outliers

 Outliers can pull the regression line towards them (here, upwards), so check for
outliers before interpreting the fit.

Coefficient of determination, R²
 Recall for correlation -1 ≤ r ≤ +1
 Squaring correlation, we get r² (by convention we use R²)
 Since 0 ≤ R² ≤ 1, it can be interpreted as a proportion
 R² is the proportion of variation in the response variable that is ‘explained’ by x

Example
R² for the inkjet printers
 Interpret the value of R² for the inkjet printers regression

R² = r² = (0.74)² = 0.5476
 Interpretation: 54.76% of the variation in price can be explained by printing speed (PPM)

Calculating R²

Response = Regression line + Error


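 Suggests we may split the total variation in y into two parts:
Total variation = Variation explained by the model + Unexplained (residual) variation
SSTotal = SSModel + SSE
 Referred to as analysis of variance (ANOVA)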


1) Variation due to the model: SSModel = Σ(ŷ − ȳ)²
2) The variation due to the residual: SSE = Σ(y − ŷ)²
3) Total variation in the response variable: SSTotal = Σ(y − ȳ)²
R² = SSModel / SSTotal
Standard error of the regression
 sε measures how much individual points tend to deviate above or below the
regression line.

 The errors (ε) have a standard deviation σε

 To estimate σε, use sε = √(SSE / (n − 2))
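
The sum-of-squares split, R² and the standard error of the regression can be illustrated with a small calculation. The data below are made up purely for illustration (they are not the printer data), and the variable names are my own; the formulas are the usual least-squares ones.

```python
import math

# Hypothetical (x, y) data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# ANOVA decomposition: SSTotal = SSModel + SSE
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_total = sum((yi - y_bar) ** 2 for yi in y)

r_squared = ss_model / ss_total        # proportion of variation explained
s_e = math.sqrt(ss_error / (n - 2))    # standard error of the regression

print(f"SSModel = {ss_model:.2f}, SSE = {ss_error:.2f}, SSTotal = {ss_total:.2f}")
print(f"R^2 = {r_squared:.2f}, s_e = {s_e:.3f}")
```

In the Excel output these quantities appear in the SS column (Regression, Residual, Total), with R Square and Standard Error reported alongside.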
Regression output from excel


 r = sample correlation. R² is the squared value of the correlation: the
proportion of the variation in the price that is accounted for by the
regression model.
 SS = sum of squares (for Regression this is SSModel; for Residual it is SSE)
 MS = mean sum of squares (SS/df)

Degrees of Freedom
 Distinguishing between n-2 and n-1
 It depends on the parameter you are estimating, as different parameters have
different rules for df. Week 11 lecture is about linear regression, in which the df
for a simple linear model is n-2. n-1 is for means.

WK12 PROBABILITY
Why probability theory is important
In these next two lectures we are finding the link between probability and statistics
 A process is random if its outcome is uncertain


 Coin could be heads or tails.


 Distributions of random processes may be predictable
 We expect to observe heads approx. 50% of the time
 We often calculate probabilities
 By simulating random processes (bootstrapping or randomization)
 By approximating random processes (normal or t dist.)
 We can also determine probabilities theoretically
 It is important because it underpins all the distributions and processes we
use in statistical inference

Probability is defined in terms of events


An event is something that happens or doesn't - i.e.,
 a randomly selected card is a ♥
 a response variable y > 90
 a randomly selected person is female
 the Wallabies win Bledisloe
 it is going to rain tomorrow

Frequentist definition
Depends on the frequency
The probability of an event A or P(A) is the long run frequency or proportion of times
the event occurs.

 P(A) = 0 means A will definitely NOT happen
 P(A) = 1 means A will definitely happen
 Variables of interest: X, Y and Gender
 X = final grade in ECMT
 From long-run frequencies we know that the proportion of final grades equal to or
greater than 85 is 9% (0.09)
 Probability is the long-run frequency of an event occurring

P(Y = 1):
Take the number of individuals with exactly 1 sibling ÷ by the total population.

P(X ≥ 85):
Take the number of students who have recorded a mark of 85% or above (out of all
students who have studied ECMT in the last 10 years) ÷ by the total number of
students (0.09 is the frequency of times that event has occurred).

P(Gender = Female):
Take the number of females among U.S. college students in 2010 ÷ by the total
number of U.S. college students in 2010.
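
The long-run frequency idea can be seen in a quick simulation. This is a minimal sketch that assumes a fair coin and uses an arbitrary random seed; the function name is my own.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

def proportion_heads(n_flips):
    """Simulate n_flips fair coin flips and return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The proportion of heads settles towards 0.5 as the number of flips grows.
for n_flips in (10, 100, 10_000, 100_000):
    print(n_flips, proportion_heads(n_flips))
```

Short runs can stray well away from 0.5; it is only the long-run proportion that is predictable.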

If all possible outcomes are equally likely, then
P(A) = (number of outcomes in A) / (total number of outcomes)


 E.g., if we randomly select a card from a 52-card deck, what is the probability it is
a King? There are 4 Kings, so P(King) = 4/52 = 1/13 ≈ 0.077

Venn diagrams are a way of visualizing probability

We represent these as proportions, i.e., 0.25 instead of 7/28, as a probability is
measured on a scale between 0 and 1.

What is the probability that an Australian adult is unemployed?

Combinations of events
P (A and B) = probability that both A and B will happen


P (A or B) = probability that either A or B will happen (includes both)

What is the probability that an adult is male and unemployed?

Additive rule
The probability of A or B occurring is not just the sum of the probabilities of the two
events. It is the sum of the two probabilities minus the probability that both A and B occur:
P(A or B) = P(A) + P(B) − P(A and B)


What is the probability that an adult is female or employed?

Complement rule

P(not A) = 1 − P(A)
If the probability of getting A is 20%, the probability of not getting A is 80%

What is the probability that an adult is not employed?

Example - Caffeine
A survey finds that 52% of students drink coffee in the morning, 48% drink coffee in
the afternoon, while 37% drink coffee in the morning and afternoon.
What percent of students do not drink coffee in the morning or afternoon?


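Apply the additive rule first, then the complement rule to find the probability of NOT
doing either option:
P(morning or afternoon) = 0.52 + 0.48 − 0.37 = 0.63
P(neither) = 1 − 0.63 = 0.37, so 37% of students do not drink coffee in the morning or afternoon.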

Conditional Probability

 It can be said many ways:


 Probability of A if B
 Probability of A given B
 Probability of A conditional on B
Also written as P(A if B), P(A given B) or P(A | B)

What is the probability that an adult male is employed?

What is the probability that an employed adult is male?


Conditional Probability Rule
P(A if B) = P(A and B) / P(B)

A survey finds that 52% of students drink coffee in the morning, 48% drink coffee in
the afternoon, while 37% drink coffee in the morning and afternoon.
What percent of students who drink coffee in the morning also drink it in the
afternoon?
P(afternoon if morning) = P(morning and afternoon) / P(morning) = 0.37 / 0.52 ≈ 0.71, so about 71%.

Tip
Sometimes it helps to put the information into a table:
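
One way to build such a table for the coffee survey is sketched below. The three input percentages come from the notes; the dictionary layout and names are just one possible arrangement.

```python
# Joint and marginal probabilities from the coffee survey in the notes.
p_morning, p_afternoon, p_both = 0.52, 0.48, 0.37

# Fill in the four cells of the two-way table.
table = {
    ("morning", "afternoon"): p_both,
    ("morning", "no afternoon"): p_morning - p_both,
    ("no morning", "afternoon"): p_afternoon - p_both,
    ("no morning", "no afternoon"): 1 - (p_morning + p_afternoon - p_both),
}

for (row, col), prob in table.items():
    print(f"{row:>10} / {col:<12} {prob:.2f}")

# Conditional probability rule: P(afternoon if morning) = P(both) / P(morning)
p_afternoon_given_morning = table[("morning", "afternoon")] / p_morning
print(f"P(afternoon if morning) = {p_afternoon_given_morning:.3f}")
```

Once the four cells are filled in, every "and", "or", "not" and "if" question about the survey can be read straight from the table.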

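Multiplicative rule
 Using the definition of conditional probability
P(A if B) = P(A and B) / P(B)
 Multiply both sides by P(B) to get
P(A and B) = P(A if B) × P(B)
 It also follows that
P(A and B) = P(B if A) × P(A)
 (This comes from the same definition with P(A) as the denominator)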


60% of students at a university rank their experience as “excellent”. Of those
students, 59% also put the university as their first choice.
What percentage of students put the university as their first choice and rank their
experience as excellent?
P(excellent and first choice) = P(first choice if excellent) × P(excellent) = 0.59 × 0.60 = 0.354, so about 35%.

Summary of probability rules

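- Additive rule: P(A or B) = P(A) + P(B) − P(A and B)

- Complement rule: P(not A) = 1 − P(A)

- Conditional probability rule: P(A if B) = P(A and B) / P(B)

- Multiplicative rule: P(A and B) = P(A if B) × P(B)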

Mutually exclusive events


Events A and B are disjoint or mutually exclusive if
- Only one of the two events can happen

Since P(A and B) = 0 for mutually exclusive events, the additive rule simplifies:

If A and B are mutually exclusive, then
P(A or B) = P(A) + P(B)
= the sum of the individual probabilities

E.g., if you toss a coin, what is the probability of a head or tail?

A coin cannot land on H and T at the same time, so the two events are disjoint.


P(H or T) = P(H) + P(T) = 0.5 + 0.5 = 1

Independence
Events A and B are independent if
- Probability of A happening is not affected by whether B happens

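Since P(A if B) = P(A) for independent events, the multiplicative rule simplifies:

If A and B are independent, then
P(A and B) = P(A) × P(B)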

E.g., if you toss 2 coins, what is the probability both land tails?

One coin landing on heads or tails has no impact on whether the other coin lands on
heads or tails, so P(both tails) = 0.5 × 0.5 = 0.25.

Disjoint events are not the same as independent events


 If P(A) > 0 and P(B) > 0, then disjoint events A and B are
 Always independent
 Never independent
 Sometimes independent - we need more information to judge

 If A and B are disjoint then P(A and B) = 0, which implies P(A if B) = 0

Because P(A) > 0, it follows that P(A if B) ≠ P(A), so A and B cannot be independent.
The answer is therefore: never independent.


Summary of special cases

Law of total probability


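The event A can be split into 3 areas that are mutually disjoint (A and B1, A and B2, A and B3)

If B1, B2, ..., Bk are disjoint events that together make up all the possibilities, then
P(A) = P(A and B1) + P(A and B2) + ... + P(A and Bk)
= P(A if B1)P(B1) + P(A if B2)P(B2) + ... + P(A if Bk)P(Bk)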

What is the probability that an adult is employed?


P(employed) = P(employed and male) + P(employed and female)


A regional airport is served by two airlines. Airline C operates 70% of the flights and
is late about 20% of the time; Airline D operates the rest and is late on 10 % of its
flights.
What proportion of flights to the airport are late?
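
The airline question is a direct use of the law of total probability. A minimal sketch follows; the only inferred number is Airline D's 30% share, which is "the rest" after Airline C's 70%.

```python
# Law of total probability applied to the airline example in the notes.
# Each entry: airline -> (share of flights, probability a flight is late)
airlines = {
    "C": (0.70, 0.20),
    "D": (0.30, 0.10),  # "the rest" of the flights: 1 - 0.70
}

# P(late) = sum over airlines of P(airline) * P(late if airline)
p_late = sum(share * p_late_given for share, p_late_given in airlines.values())

print(f"P(late) = {p_late:.2f}")
```

Each term in the sum is one branch of the tree diagram: 0.70 × 0.20 for Airline C and 0.30 × 0.10 for Airline D.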

Tree Diagram

Example - Breast screening


Consider the following:
 1% of women aged 40 who participate in routine screening have breast cancer
 80% of women with breast cancer get a positive mammogram
 9.6% of women without breast cancer get a positive mammogram

A 40-year-old woman participates in routine screening and gets a positive
mammogram. What's the probability she has cancer?

What probability is this asking for?


 P(positive mammogram if cancer)
 P(positive mammogram if OK)
 P(positive mammogram)
 P(cancer)
 P(cancer if positive mammogram)

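We need a way to reverse the conditioning: go from P(positive mammogram if cancer)
to P(cancer if positive mammogram)

 We know P(cancer), P(positive mammogram if cancer) and P(positive mammogram if OK)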


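Bayes' Rule
For any two events A and B
P(A if B) = P(A) P(B if A) / [P(A) P(B if A) + P(not A) P(B if not A)]
 The denominator can be generalized for any number of events

 A way of turning a prior probability into a posterior probability given new
information in a particular setting, i.e., going from P(cancer) to
P(cancer if positive mammogram)

Example - Breast screening (Bayes' Rule)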

 P(cancer if positive) = (0.01 × 0.80) / (0.01 × 0.80 + 0.99 × 0.096) = 0.008 / 0.10304 ≈ 0.078
 About a 7.8% probability of breast cancer given a positive test

 About 92.2% of positive test results are "false positives"
 You can use Bayes' rule or a probability tree
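
The breast screening numbers can be checked with a short function. This is a sketch for the two-outcome case only (cancer or no cancer); the function and argument names are my own.

```python
def bayes_posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return P(condition if positive) for a two-outcome Bayes' rule problem."""
    true_positive = prior * p_pos_given_true
    false_positive = (1 - prior) * p_pos_given_false
    return true_positive / (true_positive + false_positive)

# Figures from the breast screening example in the notes.
p_cancer_given_positive = bayes_posterior(
    prior=0.01,               # P(cancer)
    p_pos_given_true=0.80,    # P(positive mammogram if cancer)
    p_pos_given_false=0.096,  # P(positive mammogram if no cancer)
)

print(f"P(cancer if positive)    = {p_cancer_given_positive:.4f}")
print(f"Share of false positives = {1 - p_cancer_given_positive:.4f}")
```

The posterior is small because the prior is small: the 99% of women without cancer generate far more positive results than the 1% with cancer.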

Example - Tough Quiz


A quiz has one easy question (E, 90% chance of getting it correct), three harder
questions (H, 75% chance correct for each), and one tough question (T, only 30%
correct).

The tutor picks a question at random and tells you that you got it correct. What is the
probability she picked the easy question?
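
The tough quiz is Bayes' rule with three possible question types instead of two. The sketch below uses exact fractions to avoid rounding; it assumes "at random" means each of the five questions is equally likely to be picked.

```python
from fractions import Fraction

# question type -> (number of questions of that type, P(correct if that type))
questions = {
    "easy": (1, Fraction(90, 100)),
    "hard": (3, Fraction(75, 100)),
    "tough": (1, Fraction(30, 100)),
}
total_questions = sum(count for count, _ in questions.values())

# P(type and correct) = P(type) * P(correct if type)
joint = {
    kind: Fraction(count, total_questions) * p
    for kind, (count, p) in questions.items()
}

p_correct = sum(joint.values())                    # law of total probability
p_easy_given_correct = joint["easy"] / p_correct   # Bayes' rule

print(f"P(correct) = {p_correct} = {float(p_correct):.2f}")
print(f"P(easy if correct) = {p_easy_given_correct} = {float(p_easy_given_correct):.3f}")
```

Knowing the answer was correct raises the chance it was the easy question from 1/5 to a little over 1/4.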


Random variable

Examples:
X = number of home-team wins in the World Series
Y = Sum of two dice rolls
G = Grade on next statistics exam
T = Time to run 1500m
W = Weight of a rat

A random variable is discrete if it has a finite (or countable) set of possible values


X = home wins in World Series = {0, 1, 2, ..., 7}
Y = sum of two dice rolls = {2, 3, ..., 12}

A random variable is continuous if it can take any value within an interval


T = time to run 1500m
W = weight of a rat

Discrete random variables


A probability (mass) function assigns a probability to each possible value.

 X = number of Heads in two coin flips


 The 4 (equally likely) outcomes are {HH, TH, HT, TT}

What is the probability that there is exactly one H from two flips?
Two of the four outcomes (TH, HT) have exactly one H, so P(X = 1) = 2/4 = 0.5

Example - probability function for the sum of two dice

Find the probability of rolling less than a 5

Find the probability of not rolling a 7
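
The probability function for the sum of two dice can be built by listing all 36 equally likely outcomes. The sketch below assumes two fair six-sided dice and uses exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, prob in pmf.items():
    print(total, prob)

p_less_than_5 = sum(prob for total, prob in pmf.items() if total < 5)
p_not_7 = 1 - pmf[7]   # complement rule

print("P(sum < 5) =", p_less_than_5)
print("P(sum is not 7) =", p_not_7)
```

Sums of 2, 3 and 4 cover 1 + 2 + 3 = 6 of the 36 outcomes, and a 7 also covers 6 of the 36, which is where the two answers come from.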


Mean of a discrete random variable
μ = Σ x · p(x) (sum over all possible values x)

Example:
Quiz marks probability function

Example - a $1 lottery ticket:


Suppose a lottery prize probability function is given by

Find the mean prize for a ticket

Is it worth buying a $1 lottery ticket?


On average, you would lose roughly 41c on every ticket (1 − 0.593 = 0.407)

Standard Deviation of a discrete random variable
σ² = Σ (x − μ)² · p(x), and σ = √σ²

Example cont.
Quiz marks

 Find the standard deviation


 This is the variance of quiz marks (1.8144)
 Take the square root of the variance to get SD (σ)=1.347
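
The mean and standard deviation steps can be wrapped into one small function. Because the quiz marks table did not survive in these notes, the sketch below uses the two-coin-flip variable from earlier (X = number of heads) as its input; the function name is my own.

```python
import math

def mean_and_sd(pmf):
    """Return (mean, standard deviation) of a discrete random variable.

    pmf maps each possible value x to its probability p(x).
    """
    mu = sum(x * p for x, p in pmf.items())
    variance = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, math.sqrt(variance)

# X = number of heads in two fair coin flips (outcomes HH, HT, TH, TT).
heads_pmf = {0: 0.25, 1: 0.50, 2: 0.25}
mu, sigma = mean_and_sd(heads_pmf)

print(f"mean = {mu}, sd = {sigma:.3f}")
```

Passing in any other probability table, such as the quiz marks or lottery prizes, follows the same two steps: variance first, then its square root.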


Finding probabilities using distributions


 It is not always necessary to start from basic principles to compute probability
functions for random variables
 Sometimes the probability function is already well-known
 E.g., we already know the normal distribution can be used to find probabilities
in applications involving continuous random variables
 What about for discrete random variables?
 Can use the binomial distribution for some discrete random variables

A binomial random variable counts the number of 'successes' (for any outcome of
interest) in a sequence of trials where:
 Number of trials (n) is fixed in advance
 Probability of success (p) is the same in each trial
 Successive trials are independent of one another

Examples
X = number of heads in two coin flips –> n = 2, p = 1/2
Y = number of sixes in five dice rolls –> n = 5, p = 1/6

Which of the following is a binomial random variable?

Example - number of sixes in five dice rolls


S = 6 (success) with P(S) = 1/6 (or p = 1/6)
F = 1 to 5 (failure) with P(F) = 5/6
X = number of S in n = 5 trials

The probability of getting five 6s in 5 rolls is (1/6)⁵ = 1/7776 ≈ 0.00013

The probability of getting zero 6s in 5 rolls is (5/6)⁵ = 3125/7776 ≈ 0.402

What is the probability of getting one 6 in 5 rolls?


 n = 5, p = 1/6


What is the probability of getting two 6s in 5 rolls?


 n = 5, p = 1/6

 There are 10 ways in total

 Working out the number of arrangements by hand is tedious, so we need a quicker way

Check the number of arrangements for two 6s in 5 rolls: 5! / (2! × 3!) = 10,

which is the same as before

Binomial probability function
P(X = k) = [n! / (k!(n − k)!)] × p^k × (1 − p)^(n − k)

Therefore, the probability of three 6s in 5 rolls of the dice is:
10 × (1/6)³ × (5/6)² = 250/7776 ≈ 0.032

Example: 6s in 5 dice rolls


Using the binomial probability function, we can complete the table

Find the expected number (mean) of 6s in five dice rolls
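
The full table for the number of 6s in five rolls, and its mean, can be generated with the binomial probability function. A minimal sketch using `math.comb` from the standard library (Python 3.8 or later); the function name is my own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 1 / 6   # five dice rolls, success = rolling a 6
table = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in table.items():
    print(f"P(X = {k}) = {prob:.4f}")

# Mean from the probability function: sum of k * P(X = k)
mean = sum(k * prob for k, prob in table.items())
print(f"mean = {mean:.4f}  (compare n * p = {n * p:.4f})")
```

The mean computed from the table agrees with the shortcut n × p, which is the easier route for binomial random variables.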


There is also an easier way to find the mean of a binomial random variable

Mean and std dev. for binomial random variables
μ = np, σ = √(np(1 − p))

For the number of 6s in five dice rolls
μ = 5 × 1/6 ≈ 0.833, σ = √(5 × 1/6 × 5/6) ≈ 0.833

Example - basketball free throws


A basketball player makes about 90% of his free throws. If we assume successive
attempts are independent, what is the probability the player makes at least 8 free
throws in 10 attempts?

The same player attempts 290 free throws in a season. Find the mean and standard
deviation for the number of free throws in the season.

Would you be surprised to learn that he made only 250 free throws in the 290
attempts?

250 is more than 2 st.dev. below the mean, so it would be surprising
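
The three basketball questions can be worked through in a few lines. This sketch assumes, as the example does, that attempts are independent with p = 0.9; the function name is my own.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.9  # probability of making a free throw

# P(at least 8 made in 10 attempts) = P(8) + P(9) + P(10)
p_at_least_8 = sum(binomial_pmf(k, 10, p) for k in range(8, 11))

# Season of 290 attempts: mean = np, sd = sqrt(np(1 - p))
n = 290
mean = n * p
sd = sqrt(n * p * (1 - p))

# How many standard deviations below the mean is 250?
z_250 = (250 - mean) / sd

print(f"P(at least 8 of 10) = {p_at_least_8:.4f}")
print(f"mean = {mean:.0f}, sd = {sd:.2f}, z for 250 = {z_250:.2f}")
```

A z-score below -2 is what makes 250 made free throws surprising.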


WORKSHOP 1

Collecting Data
https://canvas.sydney.edu.au/courses/29433/files/15295167?wrap=1

Q1
a)- States
b) - Percent of residents with a college degree, Quantitative Variable.
c) - The residents from Connecticut
d) - Whether or not they had a college degree, Categorical Variable

Q2
a) - media usage - Categorical Variable, categorized by the use of television or
internet
- tiredness during the day - Quantitative Variable, can.
b) - media usage explanatory variable
- tiredness response variable

Q3
a) - the 30,000 people that participated
b/c) - No, as sampling bias exists. The study uses a poor sampling method that
leads to volunteer bias (the sample is made up of people who choose to
participate). Their opinions do not represent the broader population, as only people
who feel strongly about the topic or have experience of it (positive or negative) will
participate online. This creates sampling bias and does not give an accurate
representation of the wider population.
d) If you want to generalize the sample result, use randomization while sampling
(take a random sample).
A random sample is ideal but is often not feasible, as we don’t always
have access to the population we want. You may have to alter the target population
to get a feasible population to sample. This is just the reality: not everyone may have
access to the facilities that you want to sample.
e) Categorical Variable (whether or not they have driven with a dog on their lap)

Q4
a) Yes, it is biased due to a bad method of sampling: they
are only surveying students at the gym, when they are trying to get an idea of
how much university students exercise. Exercise isn’t limited to the gym; you can swim,
run, etc.
b) No, unbiased. It is a simple categorical question that few people would feel
uncomfortable answering, as the legal drinking age in Australia is 18. In the USA
it may be subject to bias due to inaccurate responses: first-year
students aged 17/18 may lie, as the legal age there is 21.
c) No, unbiased. They are choosing to participate themselves, but that does not
mean there is volunteer bias: whether or not they volunteer does not
influence the outcome of giving away 5 textbooks, as it doesn’t matter who you give
the textbooks to.


d) No, unbiased. However, a small sample size may affect the reliability of our estimate


e) Yes, biased as the article is about the opinion of the . It is about the residential
colleges; it would be unbiased if the article were about the experience of people who
live in a college on campus

Q5
a) Do you support an increase in overseas skilled workers in order to
stimulate economic growth?
b) Considering the current COVID situation and how readily the
virus spreads, do you support the government increasing immigration levels in order
to stimulate economic growth?
c) The objective of the survey is to get a truthful answer on whether respondents support
increasing immigration levels in order to stimulate economic growth. This
means the wording of the question must be neutral in order to get an accurate,
reliable answer. It is important to avoid framing bias in these circumstances.


WORKSHOP 2

Correlation vs Causation
https://canvas.sydney.edu.au/courses/29433/files/13450417?wrap=1

Q1
a) Although the sampling method was random, the choice of dates used to select
the sample group was not ideal, as people tend to celebrate more leading into their
birthdays, resulting in biased data as these individuals would be more fatigued and
tired than usual.
This is an observational study, as the researchers didn’t control the
explanatory variable; they simply gathered the information as it naturally exists,
so a number of confounding variables are still present in the study.

b) No, it would not be appropriate as this is an observational study. We cannot
establish causality because confounding variables exist.
- Potential confounding variables - exercise, nutrition, more sunlight, work, etc.
None of the third variables above works as a confounder, as they are not associated in
the same direction with both variables in the study (more exercise may cause an
increase in tiredness, but exercise does not cause an increase in media usage as well).
A confounding variable needs to have a positive or negative
association with both the explanatory and response variable (it can’t be positive
for one variable and negative for the other).
A true confounding variable in this circumstance could be mental stress
(although it may not be the best example). People who are mentally stressed
tend to watch more TV and browse the Internet, and people who are stressed tend
to be more tired during the day.

Q2
a) Randomly assign the 42 volunteers into two evenly sized groups, a control group
and a treatment group (it doesn’t matter which group is which; flip a coin). The
control group texts with their dominant hand and the treatment group with their
non-dominant hand.
b) Ask all volunteers to text using one hand first (randomized): flip a coin to decide
which hand is used first (right or left), followed by the other hand. Then compare the
results for the two hands.

Describing data

Q3
a/b/c) Frequency Table
Response        Frequency   Relative Frequency
Great deal      81          0.08
Fair amount     325         0.32
Not very much   397         0.39
Not at all      214         0.21
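
The relative frequencies in the table are each frequency divided by the total number of responses. A minimal sketch of that calculation, using the frequencies above:

```python
# Frequencies from the workshop table.
frequencies = {
    "Great deal": 81,
    "Fair amount": 325,
    "Not very much": 397,
    "Not at all": 214,
}

total = sum(frequencies.values())
relative = {response: count / total for response, count in frequencies.items()}

for response, share in relative.items():
    print(f"{response:<14} {frequencies[response]:>4} {share:.2f}")
print(f"{'Total':<14} {total:>4} {sum(relative.values()):.2f}")
```

The "Fair amount" share is the 32% quoted in part d).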


[Bar chart: frequency and relative frequency of each response (Great deal, Fair amount, Not very much, Not at all)]

d) 32%
e)
f) Democrats tend to be more positive about the media as they have the highest
frequency
Republicans tend to be more negative about the media, relative to the positive
opinions
g) GF

You can only talk about skewness for a quantitative variable, not for categorical
variables

Bins are the intervals of values for a histogram. We cannot plot every single value.
We need to know the range of the data when creating our bins.

Finding an interval:
If a distribution of data is approximately symmetric and bell-shaped, about 95% of
the data should fall within two standard deviations of the mean. This means that
about 95% of the data in a sample from a bell-shaped distribution should fall in the
interval from x̄ − 2s to x̄ + 2s.


WORKSHOP 3

https://canvas.sydney.edu.au/courses/29433/files/13450418?wrap=1

Q1
a) 2, 26, 24
b) Q1= 8.5
c)

Q3
It is important that the question says the distribution is bell shaped, as it means we
can use the 95% rule
a)
b)

Z-scores and proportions are unit free, so they can be used to compare values
measured on different scales

Q7

2 standard deviations is important as it gives you an anchor


Xi is any value you are interested in
When you divide by the standard deviation, everything becomes standardized: z = (xi − x̄) / s

You can only use the 95% rule (± 2 standard deviations) for a 95% interval

Ask about manual calculations with question 7

Q4

Q6
Explanatory variable must be placed on the horizontal axis and the
response variable on the vertical axis

Covariance of x and y


WORKSHOP 4

On Excel


WORKSHOP 5
Q1
a) Randomly select 6 cases from the original sample WITH REPLACEMENT,
and calculate X1 bar.
b) X1 bar 21.167, X2 bar 16, X3 bar 22.167 (first bootstrap statistic/first dot is an
estimate of the sample statistic)
i. X bar = (21.167 + 16 + 22.167)/3 = 19.778
ii. Repeat the above steps
c) The shape is not bell shaped as there are too many spikes and it is skewed to the
right, therefore we cannot use the 95% rule as the data is not symmetrical or
bell shaped.
d) The sample mean is now 20
e) The data is now more symmetrical and bell shaped with a SE of 2.6
f) 95% CI = 20 ± 2 × SE = 20 ± 2 × 2.6
i. LB = 14.8 UB = 25.2
ii. We are 95% confident that the average number of laughs by a
person in a day is between 14.8 and 25.2
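
The bootstrap steps in Q1 can be sketched in code. The six values below are hypothetical (the workshop data are not reproduced in these notes), the seed is arbitrary, and the interval uses the same "statistic ± 2 × SE" rule, which is only appropriate when the bootstrap distribution is roughly symmetric and bell shaped.

```python
import random
import statistics

random.seed(0)  # arbitrary seed for repeatability

# Hypothetical original sample of n = 6 values (not the workshop data).
sample = [16, 22, 9, 31, 6, 42]

def bootstrap_means(data, n_boot=5000):
    """Resample with replacement n_boot times; return the mean of each resample."""
    return [
        statistics.mean(random.choices(data, k=len(data)))
        for _ in range(n_boot)
    ]

boot = bootstrap_means(sample)

centre = statistics.mean(sample)   # original sample statistic
se = statistics.stdev(boot)        # SE = spread of the bootstrap statistics
lower, upper = centre - 2 * se, centre + 2 * se

print(f"sample mean = {centre:.2f}, bootstrap SE = {se:.2f}")
print(f"approx. 95% CI: {lower:.2f} to {upper:.2f}")
```

Each resample has the same size as the original sample, and the interval is centred on the original sample statistic rather than on the mean of the bootstrap statistics.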
Q2
a) Mu (m) (the mean number of hours spent watching television for males at this
university); Mu (f) (the mean number of hours spent watching television for
females at this university).
b) Xm bar - Xf bar = 6 - 3.91 = 2.09
i. We want to use this statistic to estimate the difference between the
two population means
b. How to generate the bootstrap statistic:
i. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar 1
ii. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar 1
iii. Calculate Xm bar 1 - Xf bar 1 = -0.45
iv. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar2
v. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar2
vi. Calculate Xm bar 2 - Xf bar 2 = 1.43
vii. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar 3
viii. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar 3
ix. Calculate Xm bar 3 - Xf bar 3 = 3.2
x. X bar = (-0.45+1.43+3.2)/3 = 1.393
xi. Repeat 1000s of times until you are happy with the distribution
c. Estimate the variability of the bootstrap statistic, SE = 1.5
d. 95% CI = 2.09 ± 2 × SE = 2.09 ± 2 × 1.5
i. LB = -0.91, UB = 5.09
ii. We are 95% confident that the difference in the mean number of
hours spent watching television for males and females at this
university is between -0.91 and 5.09.


Q3

Q4
a) (Mu) = the mean amount of time (in minutes) spent watching election
coverage for all U.S. adults.
b) 80.44
c) Randomly select 25 cases from the original sample WITH REPLACEMENT,
and calculate Xi Bar 1
d) The bootstrap distribution should be centered around the original sample
statistic of 80.44
e) Once we have constructed the bootstrap distribution we can use the
distribution to estimate the SE
f) S
g) For a 92% confidence interval, use the 4th percentile (65.16) and the 96th
percentile (95.78) of the bootstrap distribution, as the middle 92% leaves
4% on either side.
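
The percentile method can be sketched as a small helper that cuts equal tails off a sorted list of bootstrap statistics. The function name and the simple index rounding are my own choices; statistical software may interpolate percentiles slightly differently.

```python
def percentile_interval(boot_stats, confidence):
    """Return (lower, upper) cut-offs that leave equal tails outside the interval."""
    ordered = sorted(boot_stats)
    tail = (1 - confidence) / 2              # e.g. 0.04 for a 92% interval
    last = len(ordered) - 1
    lower = ordered[round(tail * last)]
    upper = ordered[round((1 - tail) * last)]
    return lower, upper

# Toy check: with the values 0..100, the 4th and 96th percentiles are 4 and 96.
print(percentile_interval(list(range(101)), 0.92))
```

The same idea gives a 99% interval by cutting 0.5% from each tail, which for 1000 bootstrap dots means counting 5 dots in from either end.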
Q5
a) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P1 hat = 0.193 (first dot on the bootstrap
distribution)
b) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P2 hat = 0.206
c) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P3 hat = 0.224
d) P hat (average of bootstrap samples) = (0.193+0.206+0.224)/3 = 0.208
e) Repeat the above many times and construct a bootstrap distribution
f) Check centre and shape
g) Estimate the variability of the bootstrap statistic, SE=0.013
h) Use the SE to construct confidence intervals for the statistical inference.
95% CI = 0.21 ± 2 × 0.013
a. 0.184 to 0.236
b. We are 95% confident that the proportion of U.S. adults who have no
confidence in the media is between 0.18 and 0.24
i) We want the interval estimate to be as narrow as possible
Q6
a) Mu = 0.492, LB = 0.461, UB = 0.528.
i. There are 1000 dots so we count 5 dots from either end of the
tails
Q7
a) It is not appropriate to use the 95% rule on either distribution, as neither
is very symmetrical or bell shaped; both are skewed to the left.


WORKSHOP 6

(NO WORKSHOP DUE TO MID SEM EXAMS)
