ECMT1010 Notes
Sampling Bias (whether online reviews about products are reliable or not.)
Describing data (stories behind the data - will be using Excel and other
software)
Correlation vs Causation (the true interpretation of one thing led to another)
Sampling distribution (why the probability of getting one side of a coin is always
50%)
Bootstrap distribution (the truth about a subject matter via a small lens)
Randomization distribution (whether an assumption you made for an event is true
or not)
Normal and t distribution (making the above three points easier)
Simple regression (making predictions)
Probability rules (how to deal with uncertainties)
Bayes rule (why mass testing of Covid 19 is not recommended in many
countries)
Data:
Data are measurements taken on individual units, called cases or observations
Variables
Categorical
- Defines groups, e.g., Gender, Year etc.
- Used to calculate proportions
Quantitative
- Numerical measure e.g., SAT, Height, Weight etc.
- Used to calculate averages
Key Concepts
Population
- includes all the individuals or objects of interest
Sample
- A subset of the population
- The cases that have been selected from the population into our dataset
Inference
- A conclusion reached on the basis of evidence and reasoning
- Statistical inference is using the information from a sample to inform you about
the properties of the population
The Relationship
- Samples are selected to give some information about the population of
interest.
The population and the sample are what we will continue to refer back to in
statistical inference (which is what ECMT1010 is all about).
Sampling bias
Occurs when the method of selecting a sample causes the sample to differ from the
population in some relevant way. This may cause parts of the population to be over-
or under-represented in your sample.
If sampling bias exists, we cannot trust generalizations from the sample to the
population. In order to avoid sampling bias, you need to take a random sample.
Random sample
- A random sample is ideal but very often not feasible, as we don't always have
access to the population we want. You may have to alter the target population to
get a feasible population to sample. This is just the reality: not everyone has
access to the facilities that you want to sample from.
Sources of Bias
Association vs Causation
o Two variables are said to be associated if their values are related to one
another.
o Two variables are causally associated if the value of one variable influences
the value of the other
Whenever we have a graph, the variable on the horizontal axis is the explanatory
variable and the variable on the vertical axis is the response variable.
Example:
You collect data on sunburns (RV) and ice cream consumption (EV). You find that
higher ice cream consumption is associated with a higher probability of sunburn.
Does that mean ice cream consumption causes sunburn?
o No because there is the possibility of a third variable influencing the
association. i.e., temperature (CF). Hot temperatures cause people to
both eat more ice cream and spend more time outdoors under the sun
resulting in more sunburn
You find that babies born to mothers who smoked during their pregnancies weigh
significantly less than those born to non-smoking mothers
o If you do not account for the fact that smokers are more likely to engage
in other unhealthy behaviours, such as drinking or eating less healthy
foods, you might overestimate the relationship between smoking and low
birth weight.
Randomization
(How we eliminate Confounding Variables)
Randomly assigning the values of the explanatory variable.
Experimental study
o We use experiments to eliminate confounding variables using randomization.
o Causality can be established through an experimental study as confounding
variables are eliminated.
This is done through:
o Randomization and the treatment
o To begin the experiment, random assignment is used to create two groups
from the study: a control group and a treatment group.
o One group (the treatment) is given the explanatory variable while the other
group is not. Both groups are randomized, so they should (roughly) look
similar in every aspect except the treatment. This allows the researchers to
determine whether x is a causal factor of y, because the confounding
variables that would otherwise be present have been eliminated. Therefore,
any observed group differences may be attributed to the treatment.
Observational study
o Observational studies use information gathered from observed behaviour as it
naturally exists (just observing the results, confounding variables and all).
o You cannot claim causality in an observational study due to the confounding
variables
Placebo effect
The placebo effect is a phenomenon most commonly found in medical experiments
that impacts the ability to claim causality due to participants feeling the desired effect
regardless of whether the treatment was any good or not.
Describing Data
How we summarize and visualize variables, and any relationships between them
Proportions
o The proportion in some category is found by:
- number in that category / total number
Categorical Variables
o Usually visualized using a bar chart or a pie chart
Quantitative Variables
o Plots show the distribution of a variable and any outliers
o Bell-shaped curves are special, allowing us to use a number of special
rules.
WK3 MEASURES
Sample Statistic
o We use the information in a sample to calculate a statistic that estimates an
aspect of the population we are interested in
o How well our statistic, calculated from our sample, represents the underlying
truth in the population is what inference is about.
Statistical Measures
o Centre
o Spread
o Location
o Association
Notations
N The number of cases in the total population
n The number of cases in the sample, or sample size
x Represents a variable
i Index over observations (1, 2, 3, etc.)
xi The value of x for observation i
Sample Mean: x̄ = Σ xi / n (the average of the n sample values)
Population Mean: μ = Σ xi / N (the average over all N population values)
Median (m)
o Median is the middle value when the data is ordered (lowest to highest)
o The median splits the data in half.
o To find it, order x from lowest to highest and pick the middle value. If n is even,
the median is the mean of the two middle values
If the data is skewed to the right, the mean is higher than the median due to outliers
described below. Opposite if the data is skewed left.
Outliers
o Outliers skew the data by “pulling” the sample mean in the direction of
skewness.
o Outlier is a value that is notably distinct from the other values in a data set
Measures of spread
Standard Deviation
o Standard deviation measures the spread of a variable
o It measures the distance of a “typical case” from the mean (what’s the typical
variation from the mean)
o The larger the Standard deviation the more spread and the more variability in
the data
Rule of thumb
o if a sample histogram is roughly bell-shaped, around 95% of the values will be
within two standard deviations of the mean
o 95% of the sample values will be between x̄ − 2s and x̄ + 2s
o In a population, 95% of values will be between μ − 2σ and μ + 2σ
Measure of location
Z scores
o The purpose of a z-score is to take the units of measurements out of
consideration so we can compare results across samples that have different
scales.
o The z-score for a specific value xi is: z = (xi − x̄) / s
o This measures the distance between a value and its mean in units of standard
deviations
o mean of z is always 0 and its standard deviation is always 1
o in a bell-shaped distribution, 95% of z-scores lie between ±2
The z-score gives the position of an observation relative to its mean. The larger its
magnitude, the further the observation is from the sample mean, and vice versa.
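A quick illustration with made-up numbers: if x̄ = 70 and s = 10, then an observation
of xi = 90 has z = (90 − 70) / 10 = 2, i.e., two standard deviations above the mean.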
Correlation
o Correlation is a measure of the strength and direction of linear association
between two quantitative variables
o Allows us to measure the intensity of the relationship, whether they are strongly
or weakly associated and whether it is positively or negatively associated.
o Not so good at picking up non-linear association
Sample correlation
o Correlation values lie between +1 and -1
o r > 0 positive association
o r < 0 negative association
o r = 0 no correlation
o the closer r is to −1 or +1, the stronger the (linear) association
o r is unit-free (does not depend on the units of measurement)
o correlation between x and y = correlation between y and x - it doesn't matter
which axis each is plotted on, you will get the same value for the correlation
o Correlation is not very resistant
- Check for outliers
o r = 0 only means that there is no linear association
- Variables could have a non-linear relationship
- Use a scatterplot to check for non-linear relationships
o Even if there is a strong association, that does not imply causation. You can
only determine causation using an experiment.
WK4 INFERENCE
Inference
Population parameter
o In inference we are interested in an aspect or measure of the population, e.g.
the population mean (a population parameter/measure)
Sample statistic (p hat, p̂)
o A random sample is like a mirror image of the population except for its size: it
reflects the characteristics of the population, but it is much smaller.
o The sample statistic is the corresponding quantity calculated from the sample,
e.g., the proportion of cases of interest in the sample
o Using the sample, we calculate the counterpart of the population parameter
(e.g., μ), which is the sample statistic (e.g., x̄)
o Types of parameters (and their sample statistics)
- Proportion: p (estimated by p̂)
- Mean: μ (estimated by x̄)
- Difference in proportions: p1 − p2 (estimated by p̂1 − p̂2)
- Difference in means: μ1 − μ2 (estimated by x̄1 − x̄2)
Statistical inference
o Using the sample statistic as the best guess of what the population measure is.
o i.e. inference is the process of drawing a conclusion about a population using
information from a sample.
Inference notations
o Sample statistic noted as p^
o Population parameter noted as p
o In inference we want to figure out how reliable an indicator the sample
statistic is of the population parameter.
Sampling Distribution
o A sampling distribution collects the sample statistics from many random samples
of the same size taken from the same population.
o The plot shows how the sample statistic varies from sample to sample
The more spread out the sampling distribution is, the more the sample statistic varies
from one sample to the next.
Centre
o if the samples are randomly selected, the sampling distribution will be centered
around the population parameter
Shape
o for most of the statistics we consider, if the sample size is large enough the
sampling distribution will be symmetric and bell-shaped
q = 1 − p (the proportion not in the category of interest)
Statistical inference
Interval Estimate
o Use the SE to calculate an interval estimate: sample statistic ± margin of error.
Q: How can we be sure that our interval estimate contains the population
parameter?
Confidence intervals
o A confidence interval for a parameter is an interval computed from a sample that
contains the parameter for a specified proportion of all samples
o the proportion of sample intervals that contain the parameter is called the
confidence level.
- a 95% confidence interval will contain the true parameter in 95% of all
samples
-
Below is what the 95% confidence interval looks like. The green lines are the sample
means x̄ ± the margin of error whose intervals include the population parameter,
whereas the red ones are the x̄'s whose intervals do not include μ.
Or in other words:
Margin of error
= Z* x standard error
How to find n if you are given the margin of error and confidence level:
n = (z*)² × p̂(1 − p̂) / MOE²
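A quick worked illustration with assumed numbers: for a 95% level (z* ≈ 1.96),
p̂ = 0.5 and MOE = 0.03, n = 1.96² × 0.5 × 0.5 / 0.03² ≈ 1067.1, so round up and
take n = 1068.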
The issue with our confidence interval/sampling distribution is that we are only using
one sample. This sample is just resampled to create different combinations. To
estimate the SE from this single sample we use bootstrapping.
WK5 BOOTSTRAPPING
Bootstrapping
Key
o We use our sample to create an artificial or simulated population
o One way to do this is to make copies of the original sample
o e.g., if we made 1,000 copies we could get a simulated population of 100,000
Smarties
o as long as the original sample is randomly chosen, the simulated population
should be very similar to the actual population.
o We can then sample repeatedly from the same sample.
Bootstrap sample
From the 95% rule, the CI formed as the sample statistic + or − 2 × SE will include
the population parameter for 95% of samples.
The problem with this is that we are only using one sample; we don't always have
the luxury of taking repeated samples from the population to generate a sampling
distribution.
As we don't have a sampling distribution, an alternative way of finding the SE is
bootstrapping. Bootstrapping is our clever way of using a single sample taken
from the population to mimic what the resampling process would do if we were
able to take a large number of samples from the population. From the resulting
distribution of bootstrap estimates of the sample mean we calculate the standard
error and use it to form our confidence interval.
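A minimal Python sketch of the procedure just described (the unit itself uses
StatKey/Excel, and the data here are hypothetical): resample WITH replacement,
take the standard deviation of the bootstrap means as the SE, and form a 95%
interval.

import numpy as np

rng = np.random.default_rng(1)
sample = np.array([22, 15, 30, 18, 25, 20, 27, 16, 24, 19])  # hypothetical data

# resample with replacement many times and record the mean of each bootstrap sample
boot_means = np.array([rng.choice(sample, size=len(sample), replace=True).mean()
                       for _ in range(10000)])

se = boot_means.std()                                      # bootstrap SE
ci_se = (sample.mean() - 2 * se, sample.mean() + 2 * se)   # statistic +/- 2*SE
ci_pct = np.percentile(boot_means, [2.5, 97.5])            # percentile method
print(se, ci_se, ci_pct)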
o Interpretation: we are 95% certain that the true percentage who believe there is
‘solid evidence’ of global warming lies between 57% and 61%
Example continued:
Does belief in global warming differ by political party?
The sample proportion answering ‘yes’ was 79% among Democrats and 38% among
Republicans.
Calculate a 95% CI for the difference in proportions
o sample sizes are not given; assume n = 1,000 for each party
o We want to find pD − pR (Democrats − Republicans)
o The sample statistic is p̂D − p̂R = 0.79 − 0.38 = 0.41
o SE from StatKey = 0.02, so the CI is 0.41 ± 2 × 0.02 = (0.37, 0.45)
o Interpretation: we are 95% sure that the difference in the proportion of
Democrats and Republicans who believe in global warming is between 37% and
45%
- We are a long way from saying there is no difference between these
groups
- The CI is saying that the difference could be as low as 37% and could be as
high as 45%
o the confidence interval does not include zero. We are 95% sure that the
difference is not zero, it is positive.
Changing the CI
> or < 95% confident
o As long as the bootstrap distribution is roughly symmetric, we can construct any
confidence interval we want by finding the appropriate percentiles in the
bootstrap distribution.
If you raise the confidence level (95% → 99%) you get a wider confidence interval,
to be more certain that you include the true value. This means that to be 'more
confident' that the population parameter falls within the CI we need to have a larger
interval.
higher confidence level→ wider confidence interval
Method 2
o Generate a P% confidence interval as the range of the middle P% of bootstrap
statistics
o The two methods for creating a confidence interval only work if the bootstrap
distribution is smooth and symmetric.
o If the bootstrap distribution is highly skewed or looks ‘spiky’ with gaps, you need
to go beyond introductory statistics methods to create a confidence interval
To reduce the SE you need to increase the number of cases in the original sample
Hypothesis Tests
A hypothesis test is designed to measure evidence against the null hypothesis, and
determine whether this evidence is sufficient to conclude in favour of the alternative
hypothesis.
Can we refute the null hypothesis and support the alternative hypothesis?
NOTE:
Hypothesis tests are written in terms of population parameters, not sample statistics
A test is framed in terms of two competing hypotheses:
Null Hypothesis
o The null hypothesis (H0) is a claim that there is no effect or difference between
variables; any apparent result or cause and effect is just due to random chance.
- Usually includes an = sign
- E.g., There is no such thing as mind reading, the person saying they can
read your mind when picking 5 cards is just guessing which cards the
person is picking. If she is guessing you expect her to get 20% right as she
is choosing from 5 cards
Alternative Hypothesis
o The alternative hypothesis (Ha) is a claim for which we seek evidence
- Ha usually includes one of >, <, or ≠ (depends on context)
- E.g., If mind reading does exist we would expect the proportion (p) of
cards guessed to be greater than 20%
First thing to do when setting up the hypothesis is let the audience know what the
notation you are using is, i.e., for the Question: Does support for gun control differ by
gender?
Categorical Data
Pf: proportion of females supporting gun control
Pm: proportion of males supporting gun control
Ho: Pf=Pm
Ha: Pf≠Pm
Question: Do Uni students sleep less than 7 hours per night?
Quantitative Data
μ: average hours of sleep per night for students
Ho: μ=7
Ha: μ< 7
Use simulation
we simulate how unlikely it would be to observe 8 correct guesses (sample statistic
p̂ = 1) in a population where the probability of getting a correct guess is 0.5 (H0 is
true)
Once you have the randomization distribution, see where our sample statistic lies in
the distribution, i.e., how unlikely it would be to get that statistic in a distribution
where we have imposed the null hypothesis. If it is very unlikely, we claim we have
statistically significant evidence against the null.
P Value
The chance of obtaining a sample statistic as (or more) extreme than the
observed sample statistic, if H0 is true
The p-value is the proportion of statistics in a randomization distribution that are
at least as extreme as the observed sample statistic
You calculate the p-value to determine the likelihood of getting the sample
statistic if the null hypothesis is true in the population. If this probability is low, it is
very unlikely that the sample statistic took the value it did by chance/from
guessing - evidence against the null hypothesis
The smaller the p value the stronger the evidence against Ho
p-value is small → reject the null in favour of the alternative
p-value isn’t small → do not reject the null as the test is inconclusive
Example:
The p-value is the chance of getting p̂ = 1 (the observed sample statistic) when
p = 0.5 (H0 is true).
To get this, we find the proportion of sample statistics in the randomization
distribution that are as extreme as p^ = 1
Significance level
The significance level (α) is the threshold below which the p-value is deemed small
enough to reject the null.
o if p-value < α, reject H0 in favour of Ha
Examples:
A sample statistic gives a p-value of 0.035. Using α = 0.05, is it statistically
significant?
The p-value is lower than α = 0.05, so the result is statistically significant. Therefore,
we reject H0.
Calculating P Value
- the p-value is the proportion in the tail in the direction specified by Ha
o If you have > sign you are looking at the right tail, < the left tail and ≠ both
tails
- For a two-sided alternative, the p-value is twice the proportion in the smallest
tail.
o Therefore you need to multiply the tail proportion by 2 when calculating
Summary
1) Define the notation (p, µ, or a difference in p's or µ's) and formulate the
null (H0) and alternative (Ha) hypotheses
2) Work out how to construct the randomization distribution showing the statistics
that would be observed if H0 were true
3) Plot the randomization distribution
4) Calculate a p-value as the proportion in the randomization distribution as extreme
as the observed statistic
5) Reject H0 if p-value < α; otherwise the test is inconclusive.
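A brief Python sketch of these five steps for a single proportion, using the
mind-reading set-up (null p0 = 0.2 from guessing among 5 cards) with a hypothetical
observed result of 7 correct out of 20; the unit itself does this in StatKey.

import numpy as np

rng = np.random.default_rng(2)
n, p0 = 20, 0.2                         # hypothetical sample size; null proportion (guessing)
observed_p_hat = 7 / n                  # hypothetical observed sample statistic

# Steps 2-3: randomization distribution of p_hat generated assuming H0 (p = p0) is true
rand_stats = rng.binomial(n, p0, size=10000) / n

# Step 4: p-value = proportion of randomization statistics as extreme as observed (Ha: p > p0)
p_value = np.mean(rand_stats >= observed_p_hat)

# Step 5: compare with the significance level alpha = 0.05
print(p_value, p_value < 0.05)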
Experimental study
Example:
In an experiment evaluating a ‘cure’ for cocaine addiction, 48 people were randomly
assigned to take either Desipramine(a new drug), or Lithium (an existing drug). The
subjects were subsequently tested to see who relapsed into cocaine use.
RESEARCH QUESTION:
Is Desipramine more effective than Lithium at treating cocaine addiction? (lower
relapse rate)
Parameters
Pd - proportion who relapse after taking Desipramine
Pl - proportion who relapse after taking Lithium
Possible conclusions
Reject Ho: We have statistically significant evidence that Desipramine is more
effective than Lithium.
Do not reject Ho - We cannot determine, from the data, whether Desipramine is
more effective than Lithium. The test is inconclusive.
Original Sample
1) Randomly assign units into treatment groups - splitting the original experiment
group into two groups, one that will test using Desipramine and the other with
Lithium.
2) Conduct the experiment - Administer the drugs
3) Observe the relapse counts in each group - R= Relapse, N= No relapse.
Observational Study
Observational study: random sample from the population
To simulate what would happen if H0 were true, we simulate re-sampling from a
population in which H0 is true
We already know we can use the bootstrap to simulate re-sampling from a
population when we have only one sample
To make Ho true, we must remove the association between the two variables
We can “break” the association by randomly shuffling one of the variables
Each time we do this, we get a sample we might observe just by random chance,
if there really is no correlation
Statistical errors
There are 4 possible outcomes in a hypothesis test:
Type I error - We reject H0 based on
our sample, but in the underlying
population H0 is true.
Type II error - We do not reject H0
based on our sample, but in the
underlying population, H0 is false.
Analogy to law
A person is innocent (Ho) until proven guilty (Ha)
Evidence (p-value) must be beyond a shadow of doubt (significance level)
Possible mistakes in any verdict:
Convict an innocent person (type 1 error)
Release a guilty person (type 2 error)
The probability of making a Type II error (not rejecting a false H0) depends on:
effect size (magnitude of the effect being measured)
sample size (more information implies fewer mistakes)
variability in the data (more variability implies more mistakes)
significance level α (smaller α implies more non-rejections)
As there is a trade-off between the two errors, you need to identify which is worse to
decide which way you want to adjust the significance level.
If a Type I error (rejecting a true null) is much worse than a Type II error, we
may choose a smaller α, like α = 0.01
- e.g., death penalty for conviction (H0: accused is innocent)
If a Type II error (not rejecting a false null) is much worse than a Type I
error, we may choose a larger α, like α = 0.10
- e.g., adding fluoride to the water supply (H0: fluoride has no effect)
The context is going to tell you which level of significance to choose but the (social
sciences) conventional level is α = 0.05
Density Curves
How does this relate to what we have been doing for the past few weeks?
Example:
Atlanta commute times
A bootstrap distribution of x bar for Atlanta commute times has a mean of 29.11
and St. Dev. of 0.93.
The normal distribution N(29.11, 0.93) looks to be a very good approximation to the
bootstrap distribution.
We use StatKey for this instead.
Issue: each normal distribution is different.
A shortcut for this is the 95% rule.
The 95 percent rule is true for all normal distributions
95% of values lie within 2 SD of the mean
68% of values lie within 1 SD of the mean
In fact, the proportion of values that lie within any given number of SD from
the mean is the same for all normal distributions
What matters is the number of SD’s from the mean
Z-score (revisited): z = (value − mean) / SD, i.e., the number of SDs from the mean.
Normal Approximations
1) If we can approx. a bootstrap dist. with a norm dist.
You can construct a CI using a norm dist.
2) If we can approx. a randomized dist. with a norm dist.
We can compute a p-value using a norm dist.
Therefore, If you can find a way to compute SE, you may use normal approximations
without generating the distributions.
Example
In a random sample of n = 1771 12 -19 year olds, 345 had some hearing loss.
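Completing the arithmetic as an illustration (not shown in the notes):
p̂ = 345/1771 ≈ 0.195, SE = sqrt(p̂(1 − p̂)/n) ≈ sqrt(0.195 × 0.805 / 1771) ≈ 0.0094,
so an approximate 95% CI is 0.195 ± 2 × 0.0094 = (0.176, 0.214).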
Example
Following hypotheses using data on a SAT score.
Ho: µ first born - µ not first born = 0
Ha: µ first born - µ not first born > 0
From the sample
x bar first born - x bar not first born = 30.26
SE = 37
One way to do the test: (from rand dist.) area beyond 30.26 in N (0,37)
P-value = 0.207
Do not reject the null hypothesis (p-value is greater than any value we would
get for the significance level)
Another way to do the test: Standardized test statistic (z)
SE = 37
z = (30.26 − 0) ÷ 37 = 0.818
Area beyond z in N (0,1)
P-value = 0.207 (same as before)
Example
Census shows that 65.1% of homes are owner-occupied. Given random samples of
n = 100 homes, what is the standard error of p^?
The formula and the sampling distribution give the same answer
Note: the sampling distribution looks normal
CLT for p̂
The distribution of p̂ can be approximated by a norm dist. if:
n × p ≥ 10 and n × (1 − p) ≥ 10
n × p and n × (1 − p) are the expected counts in each of the two categories
(both counts have to be at least 10)
If both conditions are satisfied, then p̂ is approximately N(p, sqrt(p(1 − p)/n))
CI for p
From the generic CI formula
Statistic ± z* x SE
If n × p̂ ≥ 10 and n × (1 − p̂) ≥ 10, then a CI can be computed by:
p̂ ± z* × sqrt(p̂(1 − p̂)/n)
Hypothesis test for p: H0: p = p0 versus Ha: p > p0 (or <, or ≠)
Steps do this
1) Check n*po ≥ 10 and n(1-po) ≥ 10
2) If both values are satisfied, then the CLT applies and we may use the normal
approximation
3) Calculate z = (p̂ − p0) ÷ SE, where SE = sqrt(p0(1 − p0) ÷ n)
4) Find the p-value as the area in the tail(s) beyond z in the standard normal
distribution
5) Use the p-value to make the decision about Ho
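A short Python sketch of these steps with hypothetical numbers (n = 200, p0 = 0.5,
p̂ = 0.57) and a one-sided Ha: p > p0:

import math
from scipy.stats import norm

n, p0, p_hat = 200, 0.5, 0.57          # hypothetical sample size, null value, sample proportion

# Step 1: check n*p0 >= 10 and n*(1 - p0) >= 10 so the normal approximation applies
assert n * p0 >= 10 and n * (1 - p0) >= 10

# Step 3: standardized test statistic, imposing H0 in the SE
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Step 4: p-value = area beyond z in N(0,1) for Ha: p > p0
p_value = 1 - norm.cdf(z)
print(z, p_value)                      # Step 5: reject H0 if p_value < alpha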
CLT for x̄
If n ≥ 30, the distribution of x̄ can be approximated by a normal distribution.
This means we can bypass the randomization process by using this normal dist. as
an approx. of the randomization dist. and we can use it to carry out the hypothesis test.
Since we don't know σ (the population standard deviation), we approximate it
with the sample standard deviation, s
Problem: replacing σ with s means the standardized statistic no longer has a normal
distribution
Replacing σ with s in the SE formula changes the distribution of the standardized
statistic from a normal to a t-distribution.
T-distribution
The t-dist. is similar to the N (0,1), but with 'fatter' tails
A t-dist. is characterized by degrees of freedom (df)
Degrees of freedom (df) are based on the sample size (n)
As df increase, the t-dist becomes more similar to N (0,1)
CLT for µ
If n ≥ 30 the standardized test statistic for a mean (using s) follows a t-distribution
with n-1 degrees of freedom
If n < 30, the t-dist. only applies if x is approximately normal
Generic formula for CI: statistic ± t* × SE
Since we replace σ with s when calculating SE, we must use the t dist. instead of
the standard normal.
Therefore, the CI formula for a mean becomes x̄ ± t* × s/sqrt(n)
Note: We don't use a t distribution here because we are looking at proportions,
proportions are always represented approximately by a normal dist., the t dist.
only comes into play when looking at quantitative data and sample means
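A small Python sketch of the t-based CI for a mean, x̄ ± t* × s/sqrt(n), using
hypothetical summary statistics:

import math
from scipy.stats import t

n, x_bar, s = 36, 25.0, 9.0            # hypothetical sample size, sample mean, sample SD
se = s / math.sqrt(n)
t_star = t.ppf(0.975, df=n - 1)        # 95% level: 2.5% in each tail, n - 1 df
ci = (x_bar - t_star * se, x_bar + t_star * se)
print(ci)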
Testing p1 − p2: as usual we calculate z = (p̂1 − p̂2 − 0) ÷ SE
From the CLT and imposing H0 in the SE formula: SE = sqrt(p(1 − p)/n1 + p(1 − p)/n2)
Problem: what do we use for p1 and p2 if H0 is true? (they will be the same)
We can't use p̂1 and p̂2 because they will be different (won't impose H0)
Pooled proportion
If H0 is true, then p1 = p2 (which we call p)
Our best guess of p is the pooled proportion: the total count of 'successes' in the two
samples combined ÷ (n1 + n2)
Note: the pooled proportion will always be between p̂1 and p̂2
CLT for x̄1 − x̄2
If n1 ≥ 30 and n2 ≥ 30, the distribution of x̄1 − x̄2 is approximately normal with
SE = sqrt(σ1²/n1 + σ2²/n2)
Because we don't know σ1 and σ2 we use the sample equivalents s1 and s2
Use t dist. with df equal to the smaller of n1 − 1 and n2 − 1
CI for µ1 − µ2
Generic formula for CI: statistic ± t* × SE
(x̄1 − x̄2) ± t* × sqrt(s1²/n1 + s2²/n2)
t* is from a t dist. with the desired confidence level
for a 95% level and the relevant df,
require 2.5% in each tail of the t distribution:
for a 90% level and the relevant df,
require 5% in each tail of the distribution:
As usual calculate: t = (x̄1 − x̄2 − 0) ÷ sqrt(s1²/n1 + s2²/n2),
which follows a t dist. (assuming n1 ≥ 30 and n2 ≥ 30) with df equal to the smaller
of n1 − 1 and n2 − 1
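A brief Python sketch of this two-sample t statistic with the conservative df and
hypothetical summary statistics:

import math
from scipy.stats import t

n1, x1, s1 = 40, 6.0, 4.0              # hypothetical group 1: size, mean, SD
n2, x2, s2 = 35, 3.9, 3.5              # hypothetical group 2: size, mean, SD

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t_stat = (x1 - x2 - 0) / se            # H0: mu1 - mu2 = 0
df = min(n1 - 1, n2 - 1)               # conservative degrees of freedom
p_value = 2 * (1 - t.cdf(abs(t_stat), df=df))   # two-sided p-value
print(t_stat, p_value)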
Simple regression: using an explanatory variable (x) to predict a response
variable (y)
Cautions
1) Only use a regression to predict over the range of x
a. If none of the x values are close to 0, then the intercept has no meaningful
interpretation
b. Once you start to predict outside of the range of x, you run the risk of making
a non-sensical prediction
2) Only use a regression when the association between x and y is approximately
linear
3) Beware of the influence of outlier observations
Example
The model is predicting birth rate based on life expectancy, not the other
way around which rules out #1
You can only make conclusions about causality from a randomized
experiment. This rules out #2
Caution
X and y regression may indicate a linear association, but this does NOT mean
that changes in x cause changes in y
A regression may look more sophisticated than correlation, but that does not
give it special power to determine causality
Causality can only be determined if the values of the explanatory variable are
determined randomly (randomized experiment)
Hence
the intercept and slope in the population model are the population parameters
the intercept and slope fitted from a sample are sample statistics (or estimates)
Examples
Slope 95% CI for inkjet printers
Price and printing speed (PPM) for n = 20 inkjet printers
Conclude: 95% certain population slope is between $49.95 and $131.81 per PPM
Same procedure for the intercept
However we would expect this not to make any sense as that would
correspond to a printer that printed 0 pages per minute.
The null hypothesis used by the regression software for both the slope and the
intercept is that the coefficient equals 0.
Slope Hypothesis test for Inkjet printers
Confirm the calculation for the slope t-statistic
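A Python sketch of the slope estimate, t-statistic and 95% CI, using hypothetical
(speed, price) data rather than the actual inkjet dataset:

import numpy as np
from scipy.stats import linregress, t

speed = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])   # hypothetical PPM values
price = np.array([90, 105, 130, 150, 160, 185, 200, 220])    # hypothetical prices ($)

fit = linregress(speed, price)          # slope, intercept, r, p-value, SE of slope
t_stat = fit.slope / fit.stderr         # H0: population slope = 0
t_star = t.ppf(0.975, df=len(speed) - 2)                     # regression df = n - 2
ci = (fit.slope - t_star * fit.stderr, fit.slope + t_star * fit.stderr)
print(fit.slope, t_stat, fit.pvalue, ci)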
Correlation
If we want to test linear association between two variables:
We can use correlation
Ho: ρ = 0
Ha: ρ ≠ 0
Test statistic: t = r × sqrt(n − 2) / sqrt(1 − r²), with n − 2 df
Influence of outliers
The outliers are pulling the line upwards, need to be aware of outliers.
Coefficient of determination, R2
Recall for correlation -1 ≤ r ≤ +1
Squaring correlation, we get r ² (by convention we use R²)
Since 0 ≤ R² ≤ 1, it can be interpreted as a proportion
Example
R² for the inkjet printers
Interpret the value of R² for the inkjet printers regression
R² = r² = (0.74) ² = 0.5476
Interpretation: 54.76% of the variation in price can be explained by printing speed (PPM)
Calculating R²
To estimate it in practice, read R² off the regression output from Excel
Degrees of Freedom
Distinguishing between n-2 and n-1
It depends on the parameter you are estimating, as different parameters have
different rules for df. Week 11 lecture is about linear regression, in which the df
for a simple linear model is n-2. n-1 is for means.
WK12 PROBABILITY
Why probability theory is important
In these next two lectures we are finding the link between probability and statistics
A process is random if its outcome is uncertain
Frequentist definition
Depends on the frequency
The probability of an event A, written P(A), is the long run frequency or proportion of
times the event occurs.
P(A) = 0 means A will definitely NOT happen
P(A) = 1 means A will definitely happen
P(Y = 1): take the number of individuals with exactly 1 sibling ÷ the total population.
P(X ≥ 85): take the number of times a student has recorded a mark of 85% or above
(out of all the students that have studied ECMT in the last 10 years) ÷ the total
number of students.
Variables of interest: X, Y and the final grade in ECMT. We know from long run
frequency that the proportion of final grades equal to or greater than 85 is 9%
(0.09 is the frequency of times that event has occurred). Probability is the long run
frequency of an event occurring.
P(Gender = Female): take the number of females among U.S. college students in
2010 ÷ the total number of U.S. college students in 2010.
E.g., if we randomly select a card from a 52-card deck, what is the probability it is
a King?
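A quick worked answer: P(King) = 4/52 = 1/13 ≈ 0.077.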
Combinations of events
P (A and B) = probability that both A and B will happen
Additive rule
The probability of A or B occurring is not just the sum of the probabilities of the two
events. It is the sum of the two probabilities minus the probability that both A and B
occur: P(A or B) = P(A) + P(B) − P(A and B)
Complement rule: P(not A) = 1 − P(A)
Example - Caffeine
A survey finds that 52% of students drink coffee in the morning, 48% drink coffee in
the afternoon, while 37% drink coffee in the morning and afternoon.
What percent of students do not drink coffee in the morning or afternoon?
Need to apply the complement rule to find the probability of NOT doing something.
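Worked with the numbers above: P(morning or afternoon) = 0.52 + 0.48 − 0.37 =
0.63, so P(neither) = 1 − 0.63 = 0.37, i.e., 37% of students.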
Conditional Probability
A survey finds that 52% of students drink coffee in the morning, 48% drink coffee in
the afternoon, while 37% drink coffee in the morning and afternoon.
What percent of students who drink coffee in the morning also drink it in the
afternoon?
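Worked with the same numbers: P(afternoon | morning) = P(morning and afternoon)
÷ P(morning) = 0.37 ÷ 0.52 ≈ 0.71, i.e., about 71%.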
Tip
Sometimes it helps to put the information into a table:
Multiplicative rule
Using the definition of conditional probability: P(A | B) = P(A and B) ÷ P(B)
Multiply both sides by P(B) to get P(A and B) = P(A | B) × P(B)
It also follows that P(A and B) = P(B | A) × P(A)
(Changing the denominator ^^)
- Additive rule
- Complement rule
- Multiplicative rule
Since a coin cannot have H and T happening at the same time, the events are
disjoint (can't land on H and T at the same time).
For disjoint events, P(A and B) = 0, so P(A or B) = P(A) + P(B)
Independence
Events A and B are independent if P(A | B) = P(A)
- Probability of A happening is not affected by whether B happens
It then follows from the multiplicative rule that P(A and B) = P(A) × P(B)
E.g., if you toss 2 coins, what is the probability both land tails?
One coin landing on heads or tails has no impact on whether the other coin lands on
heads or tails.
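Worked: P(both tails) = P(T) × P(T) = 0.5 × 0.5 = 0.25.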
If A1, A2, ..., Ak are disjoint events that together make up all the possibilities, then
P(B) = P(B and A1) + P(B and A2) + ... + P(B and Ak) (the total probability rule).
A regional airport is served by two airlines. Airline C operates 70% of the flights and
is late about 20% of the time; Airline D operates the rest and is late on 10 % of its
flights.
What proportion of flights to the airport are late?
Tree Diagram
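The tree diagram itself isn't reproduced here; working the numbers from the question:
P(late) = P(late | C)P(C) + P(late | D)P(D) = 0.20 × 0.70 + 0.10 × 0.30 = 0.14 + 0.03
= 0.17, so 17% of flights to the airport are late.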
Bayes' Rule
For any two events A and D: P(A | D) = P(D | A) × P(A) ÷ P(D)
The tutor picks a question at random and tells you that you got it correct. What is the
probability she picked the easy question?
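The full answer needs the probabilities from the question (which aren't reproduced
here). As an illustration of the rule using the airline numbers above instead:
P(C | late) = P(late | C) × P(C) ÷ P(late) = 0.14 ÷ 0.17 ≈ 0.82.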
Random variable
Examples:
X = number of home-team wins in the World Series
Y = Sum of two dice rolls
G = Grade on next statistics exam
T = Time to run 1500m
W = Weight of a rat
What is the probability that there is exactly one H from two flips?
Example:
Quiz marks probability function
Example cont.
Quiz marks
This is the variance of quiz marks (1.8144)
Take the square root of the variance to get SD (σ)=1.347
A binomial random variable counts the number of 'successes' (for any outcome of
interest) in a sequence of trials where:
Number of trials (n) is fixed in advance
Probability of success (p) is the same in each trial
Successive trials are independent of one another
Examples
X = number of heads in two-coin flips –> binomial with n = 2, p = 0.5
Y = number of sixes in five dice rolls –> binomial with n = 5, p = 1/6
There is also an easier way to find the mean (and SD) of a binomial random variable:
mean = n × p and SD = sqrt(n × p × (1 − p))
The same player attempts 290 free throws in a season. Find the mean and standard
deviation for the number of free throws in the season.
Would you be surprised to learn that he made only 250 free throws in the 290
attempts?
WORKSHOP 1
Collecting Data
https://canvas.sydney.edu.au/courses/29433/files/15295167?wrap=1
Q1
a)- States
b) - Percent of residents with a college degree, Quantitative Variable.
c) - The residents from Connecticut
d) - Whether or not they had a college degree, Categorical Variable
Q2
a) - media usage - Categorical Variable, categorized by the use of television or
internet
- tiredness during the day - Quantitative Variable
b) - media usage explanatory variable
- tiredness response variable
Q3
a) - the 30,000 people that participated
b/c) - No, as sampling bias exists. This study used a bad method of sampling, as a
result of volunteer bias (the sample is made up of people that choose to
participate). Their opinions do not represent the broader population, as only people
that feel strongly towards the issue or have experience (positive or negative) will
participate online, creating sampling bias and not giving an accurate representation
of the wider population
d) If you do want to generalize the sample result use randomization while sampling
(take a random sample)
A random sample is ideal but very often they are not feasible as we don’t always
have the population we want. You may have to alter the target population to get a
feasible population to sample. This is just the reality, not everyone may have access
to the facilities that you are wanting to sample
e) Categorical Variable (whether or not they have driven with a dog on their lap)
Q4
a) Yes, it is biased due to a bad method of sampling: they are only surveying
students at the gym when they are trying to get an idea of how much university
students exercise. Exercise isn't limited to just the gym - you can swim, run, etc.
b) No, unbiased. It is a simple categorical question that not many people would feel
uncomfortable answering, as the legal drinking age in Australia is 18. If we were in
the USA it might be subject to bias due to inaccurate responses - first year students
aged 17/18 might lie because the legal age there is 21.
c) No, unbiased. Participants are choosing to take part themselves, but that does
not mean there is volunteer bias, because whether they volunteer or not does not
influence the outcome of giving away 5 textbooks - it doesn't matter who you give
the textbooks to.
Q5
a) Do you support the increase of additional overseas skilled workers in order to
stimulate economic growth
b) Considering the current Covid situation and the wide-spreading tendencies of the
virus, do you support the government increasing immigration levels in order to
stimulate economic growth?
c) the objective of the survey is to receive a truthful answer for whether they support
increasing immigration levels in order to stimulate economic growth or not. This
means the wording of the question must be neutral in order to get an accurate
reliable answer. It is important to avoid framing bias in these circumstances.
WORKSHOP 2
Correlation vs Causation
https://canvas.sydney.edu.au/courses/29433/files/13450417?wrap=1
Q1
a) Although the sampling method was random, the choice of dates selected for
the sample group was not ideal, as people tend to celebrate more leading into their
birthdays, resulting in biased data: these individuals would be more fatigued and
tired than usual.
This is an observational study as the group of researchers didn’t control the
explanatory variable, they simply gathered the information as it naturally exists,
resulting in a number of confounding variables still being present in the study.
Q2
a) Randomly assign the 42 volunteers into two evenly sized groups, a control group
and a treatment group (it doesn't matter which group is which - flip a coin). The
control group texts with their dominant hand and the treatment group with their non-
dominant hand
b) Ask all volunteers to text using both hands, in a randomized order - flip a coin to
decide which hand is used first (right or left), then follow with the other hand. Then
compare the results for the two hands.
Describing data
Q3
a/b/c) Frequency Table
Response Frequency Relative Frequency
Great deal 81 0.08
Fair amount 325 0.32
Not very much 397 0.39
Not at all 214 0.21
[Bar chart of Frequency and Relative Frequency by response: Great deal, Fair
amount, Not very much, Not at all]
d) 32%
e)
f) Democrats tend to be more positive about the media, as they have the highest
frequency of positive responses; Republicans tend to be more negative about the
media by comparison
g) GF
You can only talk about skewness for a quantitative variable, not for categorical
variables
Bins are the intervals of values for a histogram - we cannot plot every single value.
We need to know the range of the data when creating our bins.
Finding an interval:
If a distribution of data is approximately symmetric and bell-shaped, about 95% of
the data should fall within two standard deviations of the mean. This means that
about 95% of the data in a sample from a bell-shaped distribution should fall in the
interval from x̄ − 2s to x̄ + 2s.
WORKSHOP 3
https://canvas.sydney.edu.au/courses/29433/files/13450418?wrap=1
Q1
a) 2, 26, 24
b) Q1= 8.5
c)
Q3
It is important that the question says that it is bell-shaped, as it means we can use
the 95% rule
a)
b)
Z-scores and proportions are unit-free measures that are used to compare across
different scales
Q7
Q4
Q6
Explanatory variable must be placed on the horizontal axis and the
response variable on the vertical axis
Covariance of x and y: cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)
WORKSHOP 4
On Excel
WORKSHOP 5
Q1
a) Randomly select 6 cases from the original sample WITH REPLACEMENT,
and calculate X1 bar.
b) X1 bar 21.167, X2 bar 16, X3 bar 22.167 (first bootstrap statistic/first dot is an
estimate of the sample statistic)
i. X bar (21.167, 16, 22.167)/3 = 19.778
ii. Repeat the above steps
c) The shape is not bell shaped as there are too many spikes, it is skewed to the
right, therefore we cannot use the 95% rule as the data is not symmetrical or
bell shaped.
d) The sample mean is now 20
e) The data is now more symmetrical and bell shaped with a SE of 2.6
f) CI = 20 ± 2 × SE
i. LB = 14.8 UB = 25.2
ii. We are 95% confident that the average number of laughs by a
person in a day is between 14.8 and 25.2
Q2
a) μm (the mean number of hours spent watching television for males at this
university) and μf (the mean number of hours spent watching television for
females at this university)
b) Xm bar - Xf bar = 6 − 3.91 = 2.09
i. We want to use this statistic to estimate the population mean for
the two statistics
b. How to generate the sample statistic:
i. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar 1
ii. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar 1
iii. Calculate Xm bar 1 - Xf bar 1 = -0.45
iv. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar2
v. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar2
vi. Calculate Xm bar 2 - Xf bar 2 = 1.43
vii. Randomly select 13 males from the original sample WITH
REPLACEMENT, and calculate Xm Bar 3
viii. Randomly select 13 females from the original sample WITH
REPLACEMENT, and calculate Xf Bar 3
ix. Calculate Xm bar 3 - Xf bar 3 = 3.2
x. X bar = (-0.45+1.43+3.2)/3= 1.394
xi. Repeat 1000s of times until you are happy with the distribution
c. Estimate the variability of the bootstrap statistic, SE = 1.5
d. 2.09+- 2 x SE
i. LB = -0.91, UB = 5.09
ii. We are 95% confident that the difference in the mean number of
hours spent watching television for males and females at this
university is between -0.91 and 5.09.
Q3
Q4
a) (Mu) = the mean amount of time (in minutes) spent watching election
coverage for all U.S. adults.
b) 80.44
c) Randomly select 25 cases from the original sample WITH REPLACEMENT,
and calculate Xi Bar 1
d) The bootstrap distribution should be centered around the original sample
statistic of 80.44
e) Once we have constructed the bootstrap distribution we can use the
distribution to estimate the SE
f) S
g) 92% confidence interval: use the 4th percentile = 65.16 and the 96th percentile =
95.78, as a 92% interval leaves 4% in each tail.
Q5
a) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P1 hat = 0.193 (first dot on the bootstrap
distribution)
b) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P2 hat = 0.206
c) Randomly select 1017 people from the original sample WITH
REPLACEMENT (n must be 1017) count the number of ‘no confidence at all’
responses, and calculate P3 hat = 0.249
d) P hat (average of bootstrap samples) = (0.193+0.206+0.224)/3 = 0.207
e) Repeat the above many times and construct a bootstrap distribution
f) Check centre and shape
g) Estimate the variability of the bootstrap statistic, SE=0.013
h) Use the SE to construct confidence intervals for the statistical inference. SE =
0.21+-2*0.013
a. 0.184 to 0.236
b. We are 95% confident that the proportion of U.S. adults who have no
confidence in the media is between 0.18 and 0.24
i) We want the estimate to be as narrow as possible
Q6
a) Mu = 0.492, LB = 46.1, UB = 52.8.
i. There are 1000 dots so we count 5 dots from either end of the
tails
Q7
a) It is not appropriate to use the 95% rule on either distribution, as neither of
them is very symmetrical or bell-shaped; both are skewed to the left.
WORKSHOP 6