Lec 15

The lecture introduces statistical inference, focusing on making inferences about populations through random samples and the importance of probability. Key components include estimation, confidence intervals, and hypothesis testing, with practical examples such as the Monty Hall problem and Mendel's genetics experiments. The session outlines the process of assessing models and the significance of test statistics in determining the validity of hypotheses.


CS334: Principles and Techniques of Data Science
Lecture 15
Mobin Javed
Today
● Begin new module on Statistical Inference
Statistics of Data & Inferences
● Inferential statistics: making inferences about a
population by examining one or more random samples
drawn from that population

● Statistical view of data: data comes from a random process. The goal is to learn how this process works, to make predictions or to understand what plays a role in it
○ To understand randomness, we need probability!
Key Components of Statistical Inference
● Three ingredients of statistical inference
○ Estimation
■ “𝑝̂ is an estimator of 𝑝, the bias of a coin (i.e., the
probability of a head)”

○ Confidence Intervals
■ “[0.61,0.66] is a 95% confidence interval for 𝑝”

○ Hypothesis Testing
■ “We find statistical evidence that the coin is biased”
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Review:
1. Random Variables
2. Expectation
3. Distributions
Random Variables (RVs)
● A random variable is a numerical function of a random
sample (also known as a “statistic”)
○ It maps samples (or outcomes of an experiment) to a real number

Example: let s be a sample of size 3 drawn from a population, and let X be the number of blue people in our sample. X, then, is a random variable! Different samples give different values, e.g. X(s) = 0, X(s) = 1, or X(s) = 2.
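The blue-people example can be sketched in a few lines of Python (the population makeup below is hypothetical, chosen only to illustrate the idea):

```python
import random

random.seed(0)

# A hypothetical population: 4 "blue" people and 6 others.
population = ["blue"] * 4 + ["red"] * 6

def X():
    """X(s): the number of blue people in a random sample s of size 3."""
    s = random.sample(population, 3)  # draw the sample
    return s.count("blue")            # map it to a number

# Different samples give different values of the random variable.
print([X() for _ in range(5)])
```

Each call draws a fresh sample, so repeated calls show the randomness of X.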
Functions of Random Variables
● A function of a random variable is also a random variable!
○ i.e., it is a “function of a function of the sample”
○ If you create multiple random variables based on your sample,
then functions of them are also random variables
Probability Distribution
● Probabilities associated with each value that a random
variable can assume:
○ For instance, P(X = 10) is the chance that X has the value 10

● The distribution of X is a description of how the total probability of 100% is split over all possible values of X
○ The probabilities of the possible values must each be non-negative and must sum to 1: P(X = x) ≥ 0 for every x, and Σx P(X = x) = 1
Example Distribution
● Consider a random variable X with the following
distribution table:

probabilities
of those
values
values

To compute related probabilities, we add up


the probabilities belonging to that event. (For
instance, X < 6 happens if X = 3 or X = 4).
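As a sketch, with a hypothetical distribution table (the slide's own table is an image), computing P(X < 6) means adding up the probabilities of the values in that event:

```python
# Hypothetical distribution table for X: value -> probability.
dist = {3: 0.2, 4: 0.3, 6: 0.4, 8: 0.1}

# Sanity checks: non-negative probabilities that sum to 1.
assert all(p >= 0 for p in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-9

# P(X < 6) = P(X = 3) + P(X = 4), the values belonging to the event.
p_less_than_6 = sum(p for x, p in dist.items() if x < 6)
print(p_less_than_6)
```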
Expectation of a Random Variable
● The expectation of a random variable X is the weighted
average of the values of X, where the weights are the
probabilities of the values
○ The most common formulation applies the weights one possible value at a time: E[X] = Σx x · P(X = x)
○ However, an equivalent formulation applies the weights one sample at a time: E[X] = Σs X(s) · P(s)
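The two formulations can be checked against each other in code. The distribution below is hypothetical; the ten equally likely outcomes are constructed so that they realize the same probabilities:

```python
# Hypothetical distribution for X.
values = [3, 4, 6, 8]
probs  = [0.2, 0.3, 0.4, 0.1]

# Formulation 1: weight one possible value at a time.
ev_by_value = sum(x * p for x, p in zip(values, probs))

# Formulation 2: weight one sample at a time. Ten equally likely
# outcomes realize the same probabilities: 2 threes, 3 fours,
# 4 sixes, and 1 eight.
outcomes = [3] * 2 + [4] * 3 + [6] * 4 + [8] * 1
ev_by_sample = sum(outcomes) / len(outcomes)

print(ev_by_value, ev_by_sample)  # both equal 5.0 (up to float rounding)
```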
Empirical Distribution
● “Empirical”: based on observations
● Observations can be from repetitions of an experiment
● “Empirical Distribution”
○ All observed values
○ The proportion of times each value appears
Large Random Samples
Law of Averages
If a chance experiment is repeated many times,
independently and under the same conditions, then the
proportion of times that an event occurs gets closer to the
theoretical probability of the event

As you increase the number of rolls of a die, the proportion of times you see the face with five spots gets closer to 1/6
Empirical Distribution of a Sample
● If the sample size is large, then the empirical
distribution of a uniform random sample resembles the
distribution of the population, with high probability
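A sketch of this resemblance, using a hypothetical population with a 75% / 25% split (the composition is an illustration, not from the lecture):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical population: 75% purple, 25% white.
population = ["purple"] * 75 + ["white"] * 25

# A large uniform random sample from the population.
n = 50_000
sample = [random.choice(population) for _ in range(n)]

# Empirical distribution: proportion of times each value appears.
empirical = {color: count / n for color, count in Counter(sample).items()}
print(empirical)  # close to {'purple': 0.75, 'white': 0.25}
```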
Monty Hall Problem
● Game (“Let’s Make a Deal”):
○ There are three doors and behind two doors there is
a goat and behind one door there is a car
○ The game asks you to pick one door
○ Monty will then open any one of the other doors
containing a goat
○ Monty gives you an option to switch
● Question: Will switching improve your chances of
winning a car?
The correct answer is that switching increases the chance of winning, from 1/3 to 2/3.

Exercise: Can you prove this empirically by simulating several trials of the game?

For the solution, see:
https://www.inferentialthinking.com/chapters/09/4/Monty_Hall_Problem.html
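One possible simulation, as a sketch (the textbook chapter linked above gives the course's own solution):

```python
import random

random.seed(42)

def win_rate(switch, trials=100_000):
    """Proportion of Monty Hall games won under the given strategy."""
    wins = 0
    for _ in range(trials):
        doors = ["car", "goat", "goat"]
        random.shuffle(doors)
        pick = random.randrange(3)
        # Monty opens a goat door that is not the contestant's pick.
        monty = next(i for i in range(3) if i != pick and doors[i] == "goat")
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(i for i in range(3) if i != pick and i != monty)
        wins += doors[pick] == "car"
    return wins / trials

print(win_rate(switch=False))  # about 1/3
print(win_rate(switch=True))   # about 2/3
```

The simulated proportions land near 1/3 for staying and 2/3 for switching, matching the analytical answer.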
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Models
● A model is a set of assumptions about the data
● In data science, many models involve assumptions
about processes that involve randomness
○ “Chance models”
Approach to Assessment
● If we can simulate data according to the assumptions of
the model, we can learn what the model predicts
● We can then compare the predictions to the data that
were observed
● If the data and the model’s predictions are not
consistent, that is evidence against the model
Gregor Mendel, 1822-1884

• An Austrian monk who is widely recognized as the founder of the modern field of genetics
• He performed careful and large-scale experiments to come up with the fundamental laws of genetics
Mendel
● He formulated sets of assumptions about each variety
of pea plants, which were his models
● He then tested the validity of his models by growing
the plants and gathering data
● Let’s analyze the data from one such experiment to see
if Mendel’s model was good
A Model
● In a particular variety of pea plants, each plant has either
purple flowers or white flowers
● Mendel’s model:
○ Each plant is purple-flowering with chance 75%, regardless of the
colors of the other plants (Mendel’s hypothesis)
● We want to know whether the model is good, or not
● Discuss: How would you assess this model?
Assessing Models – General Idea
● We can simulate plants under the assumptions of the
model and see what it predicts
● Then we can compare the predictions with the data
that Mendel recorded
Steps in Assessing a Model
● Come up with a statistic that will help you decide whether the data
support the model or an alternative view of the world
● Simulate the statistic under the assumptions of the model
● Draw a histogram of the simulated values. This is the model’s
prediction for how the statistic should come out
● Compute the observed statistic from the sample in the study
● Compare this value with the histogram
● If the two are not consistent, that’s evidence against the model
Steps in Assessing a Model
● Start with the percentage of purple-flowering plants in
sample
○ If that percentage is much larger or much smaller than 75, there is evidence against the model
○ Distance from 75 is the key
● Statistic:
| sample percent of purple-flowering plants – 75 |
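A sketch of simulating this statistic under Mendel's model. The sample size of 929 plants is an assumption, not stated on this slide (it is consistent with the observed statistic and P-value quoted later in the lecture):

```python
import random

random.seed(0)

n = 929  # assumed number of plants in the sample

def one_simulated_statistic():
    # Each plant is purple-flowering with chance 75%, independently.
    purple = sum(random.random() < 0.75 for _ in range(n))
    return abs(100 * purple / n - 75)

# The model's prediction for how the statistic should come out.
simulated = [one_simulated_statistic() for _ in range(10_000)]
print(max(simulated))  # large distances from 75 are rare under the model
```

Drawing a histogram of `simulated` gives the model's predicted distribution, against which the observed statistic is compared.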
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Testing Hypotheses
● Choosing one of two viewpoints based on data
● “Each plant is purple-flowering with chance 75%” vs
“No, it isn’t”
Incomplete Information
● We are trying to choose between two views of the
world, based on data in a sample
● It is not always clear whether the data are consistent
with one view or the other
● Random samples can turn out quite extreme. It is
unlikely, but possible
Testing Hypotheses
● A test chooses between two views of how data were
generated
● The views are called hypotheses
● The test picks the hypothesis that is better supported
by the observed data
Null and Alternative
The method only works if we can simulate data under one
of the hypotheses
● Null hypothesis
○ A well defined chance model about how the data were
generated
○ We can simulate data under the assumptions of this model –
“under the null hypothesis”
● Alternative hypothesis
○ A different view about the origin of the data
Test Statistic
● The statistic that we choose to simulate, to decide between the two hypotheses, is known as the test statistic

Questions before choosing the statistic:
● What values of the statistic will make us lean towards the null hypothesis?
● What values will make us lean towards the alternative?
○ Preferably, the answer should be just “high”
Conclusion of the Test
Resolve choice between null and alternative hypotheses
● Compare the observed test statistic and its empirical
distribution under the null hypothesis
● If the observed value is not consistent with the
distribution, then the test favors the alternative –
“rejects the null hypothesis”
Statistical Significance
Conventions About Inconsistency
● “Inconsistent”: The observed test statistic is in the tail of
the empirical distribution under the null hypothesis
● “In the tail,” first convention:
○ The area in the tail is less than 5%
○ The result is “statistically significant”
● “In the tail,” second convention:
○ The area in the tail is less than 1%
○ The result is “highly statistically significant”
The cut-off value (e.g., 1% or 5%) is known as the significance level 𝛼
Critical Value
● The critical value is the value of the test statistic at
which the probability in the tail of the distribution
equals 𝛼
P-value
Formal name: observed significance level

The P-value is the chance,
● under the null hypothesis,
● that the test statistic
● is equal to the value that was observed in the data
● or is even further in the direction of the alternative
Recall Mendel's Model and Experiments
● Model: each plant is purple-flowering with chance 75%
● Statistic: | sample percent of purple-flowering plants – 75 |
Quantifying Conclusions
P-value = P(the test statistic would be equal to or more extreme than the observed test statistic, under the null hypothesis)

● Observed test statistic: 3.2
● P-value: 0.0243
● Using a significance level of 5% (0.05), we reject the null hypothesis (equivalent to formally saying that the test result is statistically significant)
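A sketch of how the quoted P-value could be reproduced empirically. The sample size n = 929 is an assumption; it is the value consistent with the observed statistic 3.2 yielding a P-value near 0.0243:

```python
import random

random.seed(0)

# Assumed setup: 929 plants, Mendel's 75% model, observed statistic 3.2.
n, observed = 929, 3.2

def one_statistic():
    purple = sum(random.random() < 0.75 for _ in range(n))
    return abs(100 * purple / n - 75)

sims = [one_statistic() for _ in range(20_000)]

# Empirical P-value: proportion of simulated statistics at least as
# extreme as the observed one.
p_value = sum(s >= observed for s in sims) / len(sims)
print(p_value)  # close to the quoted 0.0243
```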
An Error Probability
Can the Conclusion be Wrong?
Yes.

                                 Null is true    Alternative is true
Test rejects the null                 ❌                  ✅
Test doesn't reject the null          ✅                  ❌
An Error Probability
● The cutoff for the P-value (significance level) is an error probability
● If your cutoff is 5%, and the null hypothesis happens to be true, then there is about a 5% chance that your test will reject the null hypothesis
● This is called a Type I error: rejecting a true null hypothesis
● A Type II error occurs when you fail to reject the null hypothesis when it is actually false
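The Type I error rate can be demonstrated by simulation. The fair-coin test below is a hypothetical illustration, not an example from the lecture:

```python
import random

random.seed(0)

# Null hypothesis (true here): the coin is fair.
# Statistic: |percent of heads in 100 tosses - 50|.
n, alpha = 100, 0.05

def statistic():
    heads = sum(random.random() < 0.5 for _ in range(n))
    return abs(100 * heads / n - 50)

# Critical value: the (1 - alpha) point of the simulated null distribution.
null_stats = sorted(statistic() for _ in range(10_000))
critical = null_stats[int((1 - alpha) * len(null_stats))]

# With the null actually true, the test still rejects about alpha of the time.
trials = 4_000
type_i_rate = sum(statistic() >= critical for _ in range(trials)) / trials
print(type_i_rate)  # roughly 0.05
```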
