Lec 15

The lecture introduces statistical inference, focusing on making inferences about populations through random samples and the importance of probability. Key components include estimation, confidence intervals, and hypothesis testing, with practical examples such as the Monty Hall problem and Mendel's genetics experiments. The session outlines the process of assessing models and the significance of test statistics in determining the validity of hypotheses.


CS334: Principles and Techniques of Data Science
Lecture 15
Mobin Javed
Today
● Begin new module on Statistical Inference
Statistics of Data & Inferences
● Inferential statistics: making inferences about a
population by examining one or more random samples
drawn from that population

● Statistical view of data: data comes from a random process. The goal is to learn how this process works, to make predictions or to understand what plays a role in it
○ To understand randomness, we need probability!
Key Components of Statistical Inference
● Three ingredients of statistical inference
○ Estimation
■ “𝑝̂ is an estimator of 𝑝, the bias of a coin (i.e., the
probability of a head)”

○ Confidence Intervals
■ “[0.61,0.66] is a 95% confidence interval for 𝑝”

○ Hypothesis Testing
■ “We find statistical evidence that the coin is biased”
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Review:
1. Random Variables
2. Expectation
3. Distributions
Random Variables (RVs)
● A random variable is a numerical function of a random
sample (also known as a “statistic”)
○ It maps samples (or outcomes of an experiment) to a real number

Example: let s be a sample of size 3 drawn from a population, and let X be the number of blue people in our sample. X, then, is a random variable! Different samples give different values, e.g. X(s) = 0, X(s) = 1, or X(s) = 2.
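The blue-people example can be sketched in a few lines of Python (the population makeup below is hypothetical, chosen only to illustrate the idea):

```python
import random

random.seed(0)

# A hypothetical population: 4 "blue" people and 6 others.
population = ["blue"] * 4 + ["red"] * 6

def X():
    """X(s): the number of blue people in a random sample s of size 3."""
    s = random.sample(population, 3)  # draw the sample
    return s.count("blue")            # map it to a number

# Different samples give different values of the random variable.
print([X() for _ in range(5)])
```

Each call draws a fresh sample, so repeated calls show the randomness of X.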
Functions of Random Variables
● A function of a random variable is also a random variable!
○ i.e., it is a “function of a function of the sample”
○ If you create multiple random variables based on your sample,
then functions of them are also random variables
Probability Distribution
● Probabilities associated with each value that a random
variable can assume:
○ For instance, P(X = 10) is the chance that X has the value 10

● The distribution of X is a description of how the total probability of 100% is split over all possible values of X
○ The probabilities of the possible values must each be non-negative and must sum to 1: P(X = x) ≥ 0 for every x, and Σx P(X = x) = 1
Example Distribution
● Consider a random variable X with the following
distribution table:

probabilities
of those
values
values

To compute related probabilities, we add up


the probabilities belonging to that event. (For
instance, X < 6 happens if X = 3 or X = 4).
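As a sketch, with a hypothetical distribution table (the slide's own table is an image), computing P(X < 6) means adding up the probabilities of the values in that event:

```python
# Hypothetical distribution table for X: value -> probability.
dist = {3: 0.2, 4: 0.3, 6: 0.4, 8: 0.1}

# Sanity checks: non-negative probabilities that sum to 1.
assert all(p >= 0 for p in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-9

# P(X < 6) = P(X = 3) + P(X = 4), the values belonging to the event.
p_less_than_6 = sum(p for x, p in dist.items() if x < 6)
print(p_less_than_6)
```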
Expectation of a Random Variable
● The expectation of a random variable X is the weighted
average of the values of X, where the weights are the
probabilities of the values
○ The most common formulation applies the weights one possible value at a time: E[X] = Σx x · P(X = x)
○ However, an equivalent formulation applies the weights one sample at a time: E[X] = Σs X(s) · P(s)
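The two formulations can be checked against each other in code. The distribution below is hypothetical; the ten equally likely outcomes are constructed so that they realize the same probabilities:

```python
# Hypothetical distribution for X.
values = [3, 4, 6, 8]
probs  = [0.2, 0.3, 0.4, 0.1]

# Formulation 1: weight one possible value at a time.
ev_by_value = sum(x * p for x, p in zip(values, probs))

# Formulation 2: weight one sample at a time. Ten equally likely
# outcomes realize the same probabilities: 2 threes, 3 fours,
# 4 sixes, and 1 eight.
outcomes = [3] * 2 + [4] * 3 + [6] * 4 + [8] * 1
ev_by_sample = sum(outcomes) / len(outcomes)

print(ev_by_value, ev_by_sample)  # both equal 5.0 (up to float rounding)
```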
Empirical Distribution
● “Empirical”: based on observations
● Observations can be from repetitions of an experiment
● “Empirical Distribution”
○ All observed values
○ The proportion of times each value appears
Large Random Samples
Law of Averages
If a chance experiment is repeated many times,
independently and under the same conditions, then the
proportion of times that an event occurs gets closer to the
theoretical probability of the event

As you increase the number of rolls of a die, the proportion of times you see the face with five spots gets closer to 1/6
Empirical Distribution of a Sample
● If the sample size is large, then the empirical
distribution of a uniform random sample resembles the
distribution of the population, with high probability
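A sketch of this resemblance, using a hypothetical population with a 75% / 25% split (the composition is an illustration, not from the lecture):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical population: 75% purple, 25% white.
population = ["purple"] * 75 + ["white"] * 25

# A large uniform random sample from the population.
n = 50_000
sample = [random.choice(population) for _ in range(n)]

# Empirical distribution: proportion of times each value appears.
empirical = {color: count / n for color, count in Counter(sample).items()}
print(empirical)  # close to {'purple': 0.75, 'white': 0.25}
```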
Monty Hall Problem
● Game (“Let’s Make a Deal”):
○ There are three doors and behind two doors there is
a goat and behind one door there is a car
○ The game asks you to pick one door
○ Monty will then open any one of the other doors
containing a goat
○ Monty gives you an option to switch
● Question: Will switching improve your chances of
winning a car?
The correct answer is that switching increases the chance of winning, from 1/3 to 2/3.

Exercise: Can you prove this empirically by simulating several trials of the game?

For the solution, see:
https://www.inferentialthinking.com/chapters/09/4/Monty_Hall_Problem.html
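One possible simulation, as a sketch (the textbook chapter linked above gives the course's own solution):

```python
import random

random.seed(42)

def win_rate(switch, trials=100_000):
    """Proportion of Monty Hall games won under the given strategy."""
    wins = 0
    for _ in range(trials):
        doors = ["car", "goat", "goat"]
        random.shuffle(doors)
        pick = random.randrange(3)
        # Monty opens a goat door that is not the contestant's pick.
        monty = next(i for i in range(3) if i != pick and doors[i] == "goat")
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(i for i in range(3) if i != pick and i != monty)
        wins += doors[pick] == "car"
    return wins / trials

print(win_rate(switch=False))  # about 1/3
print(win_rate(switch=True))   # about 2/3
```

The simulated proportions land near 1/3 for staying and 2/3 for switching, matching the analytical answer.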
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Models
● A model is a set of assumptions about the data
● In data science, many models involve assumptions
about processes that involve randomness
○ “Chance models”
Approach to Assessment
● If we can simulate data according to the assumptions of
the model, we can learn what the model predicts
● We can then compare the predictions to the data that
were observed
● If the data and the model’s predictions are not
consistent, that is evidence against the model
Gregor Mendel, 1822-1884

• An Austrian monk who is widely recognized as the founder of the modern field of genetics
• He performed careful and large-scale experiments to come up with the fundamental laws of genetics
Mendel
● He formulated sets of assumptions about each variety
of pea plants, which were his models
● He then tested the validity of his models by growing
the plants and gathering data
● Let’s analyze the data from one such experiment to see
if Mendel’s model was good
A Model
● In a particular variety of pea plants, each plant has either
purple flowers or white flowers
● Mendel’s model:
○ Each plant is purple-flowering with chance 75%, regardless of the
colors of the other plants (Mendel’s hypothesis)
● We want to know whether the model is good, or not
● Discuss: How would you assess this model?
Assessing Models – General Idea
● We can simulate plants under the assumptions of the
model and see what it predicts
● Then we can compare the predictions with the data
that Mendel recorded
Steps in Assessing a Model
● Come up with a statistic that will help you decide whether the data
support the model or an alternative view of the world
● Simulate the statistic under the assumptions of the model
● Draw a histogram of the simulated values. This is the model’s
prediction for how the statistic should come out
● Compute the observed statistic from the sample in the study
● Compare this value with the histogram
● If the two are not consistent, that’s evidence against the model
Steps in Assessing a Model
● Start with the percentage of purple-flowering plants in
sample
○ If that percentage is much larger or much smaller than 75, there is evidence against the model
○ Distance from 75 is the key
● Statistic:
| sample percent of purple-flowering plants – 75 |
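A sketch of simulating this statistic under Mendel's model. The sample size of 929 plants is an assumption, not stated on this slide (it is consistent with the observed statistic and P-value quoted later in the lecture):

```python
import random

random.seed(0)

n = 929  # assumed number of plants in the sample

def one_simulated_statistic():
    # Each plant is purple-flowering with chance 75%, independently.
    purple = sum(random.random() < 0.75 for _ in range(n))
    return abs(100 * purple / n - 75)

# The model's prediction for how the statistic should come out.
simulated = [one_simulated_statistic() for _ in range(10_000)]
print(max(simulated))  # large distances from 75 are rare under the model
```

Drawing a histogram of `simulated` gives the model's predicted distribution, against which the observed statistic is compared.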
Outline
● Sampling, Random Variables and Distributions
● Assessing Models and Comparing Distributions
● Hypothesis Testing
Testing Hypotheses
● Choosing one of two viewpoints based on data
● “Each plant is purple-flowering with chance 75%” vs
“No, it isn’t”
Incomplete Information
● We are trying to choose between two views of the
world, based on data in a sample
● It is not always clear whether the data are consistent
with one view or the other
● Random samples can turn out quite extreme. It is
unlikely, but possible
Testing Hypotheses
● A test chooses between two views of how data were
generated
● The views are called hypotheses
● The test picks the hypothesis that is better supported
by the observed data
Null and Alternative
The method only works if we can simulate data under one
of the hypotheses
● Null hypothesis
○ A well defined chance model about how the data were
generated
○ We can simulate data under the assumptions of this model –
“under the null hypothesis”
● Alternative hypothesis
○ A different view about the origin of the data
Test Statistic
● The statistic that we choose to simulate, to decide between the two hypotheses, is known as the test statistic

Questions before choosing the statistic:
● What values of the statistic will make us lean towards the null hypothesis?
● What values will make us lean towards the alternative?
○ Preferably, the answer should be just “high”
Conclusion of the Test
Resolve choice between null and alternative hypotheses
● Compare the observed test statistic and its empirical
distribution under the null hypothesis
● If the observed value is not consistent with the
distribution, then the test favors the alternative –
“rejects the null hypothesis”
Statistical Significance
Conventions About Inconsistency
● “Inconsistent”: The observed test statistic is in the tail of
the empirical distribution under the null hypothesis
● “In the tail,” first convention:
○ The area in the tail is less than 5%
○ The result is “statistically significant”
● “In the tail,” second convention:
○ The area in the tail is less than 1%
○ The result is “highly statistically significant”
The cut-off value (e.g., 1% or 5%) is known as the significance level 𝛼
Critical Value
● The critical value is the value of the test statistic at
which the probability in the tail of the distribution
equals 𝛼
P-value
Formal name: observed significance level

The P-value is the chance,
● under the null hypothesis,
● that the test statistic
● is equal to the value that was observed in the data
● or is even further in the direction of the alternative
Recall Mendel's Model and Experiments
● Model: each plant is purple-flowering with chance 75%
● Statistic: | sample percent of purple-flowering plants – 75 |
Quantifying Conclusions
P-value = P(the test statistic would be equal to or more extreme than the observed test statistic, under the null hypothesis)

● Observed test statistic: 3.2
● P-value: 0.0243
● Using a significance level of 5% (0.05), we reject the null hypothesis (equivalent to formally saying that the test result is statistically significant)
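A sketch of how the quoted P-value could be reproduced empirically. The sample size n = 929 is an assumption; it is the value consistent with the observed statistic 3.2 yielding a P-value near 0.0243:

```python
import random

random.seed(0)

# Assumed setup: 929 plants, Mendel's 75% model, observed statistic 3.2.
n, observed = 929, 3.2

def one_statistic():
    purple = sum(random.random() < 0.75 for _ in range(n))
    return abs(100 * purple / n - 75)

sims = [one_statistic() for _ in range(20_000)]

# Empirical P-value: proportion of simulated statistics at least as
# extreme as the observed one.
p_value = sum(s >= observed for s in sims) / len(sims)
print(p_value)  # close to the quoted 0.0243
```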
An Error Probability
Can the Conclusion be Wrong?
Yes.

                                 Null is true    Alternative is true
Test rejects the null                 ❌                  ✅
Test doesn't reject the null          ✅                  ❌
An Error Probability
● The cutoff for the P-value (significance level) is an error probability
● If your cutoff is 5%, and the null hypothesis happens to be true, then there is about a 5% chance that your test will reject the null hypothesis
● This is called a Type I error: rejecting a true null hypothesis
● A Type II error occurs when you fail to reject the null hypothesis when it is actually false
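The Type I error rate can be demonstrated by simulation. The fair-coin test below is a hypothetical illustration, not an example from the lecture:

```python
import random

random.seed(0)

# Null hypothesis (true here): the coin is fair.
# Statistic: |percent of heads in 100 tosses - 50|.
n, alpha = 100, 0.05

def statistic():
    heads = sum(random.random() < 0.5 for _ in range(n))
    return abs(100 * heads / n - 50)

# Critical value: the (1 - alpha) point of the simulated null distribution.
null_stats = sorted(statistic() for _ in range(10_000))
critical = null_stats[int((1 - alpha) * len(null_stats))]

# With the null actually true, the test still rejects about alpha of the time.
trials = 4_000
type_i_rate = sum(statistic() >= critical for _ in range(trials)) / trials
print(type_i_rate)  # roughly 0.05
```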
