
PE 515 CS – DATA SCIENCE

UNIT – II
22.11.2022
STATISTICAL MODELING
RANDOM VARIABLE
SAMPLE STATISTICS
HYPOTHESIS TESTING
CONFIDENCE INTERVALS
P-HACKING
BAYESIAN INFERENCE

Objectives
Introduction to random variables
How they are characterized using probability measures and probability density functions
How the parameters of these density functions can be estimated
How to make decisions from data using the method of hypothesis testing
Characterizing random phenomena – what they are, and how probability can be used as a measure to describe them
Statistical Modeling
Random phenomena
1) Deterministic phenomenon
2) Stochastic phenomenon

1) Deterministic phenomenon – Phenomenon whose outcome can be predicted with a very high degree of confidence
Example – age of a person (using the date of birth stated in the Aadhaar card)
- predictable up to the number of days; if asked to predict the age of the person to an hour or a minute, the date of birth from an Aadhaar card is insufficient
- the information from the birth certificate may be needed
- but to predict the age with a higher degree of precision, up to the last minute, it is not possible to do so with the same level of confidence

2) Stochastic phenomenon – Phenomenon which can have many possible outcomes under the same experimental conditions. The outcome can be predicted only with limited confidence
Example – outcome of a coin toss
- might get a head or a tail (but can't say which with 90% or 95% confidence)
- might be able to say it only with 50% confidence if it is a fair coin

Why are we dealing with stochastic phenomena?
- Data obtained from experiments contains some errors.
- One reason for these errors is that not all the rules governing the data-generating process are known; in other words, not all the laws (knowledge of all the causes that affect the outcomes) are known.
- These are called modeling errors.
- The other kind of error is due to the sensor itself: the sensors used for observing the outcomes may contain errors.
- Such errors are called measurement errors.
- These 2 kinds of errors are modeled using probability density functions, and therefore the outcomes are also predicted with certain confidence intervals.
- Random phenomena can be discrete, where the outcomes are finite.
- Example: coin toss experiment – only two outcomes (either head or tail)
- Example: throw of a dice – 6 outcomes
- Continuous random phenomena – infinite number of outcomes
- Example: measurement of body temperature (varies from 96 to 105 deg F depending on whether the person is running a temperature or not)
- Such variables with a continuum of random outcomes are called continuous.

Characterizing random phenomena


Sources of error in observed outcome
o Lack of knowledge of generating process (model error)
o Errors in sensors used for observing outcomes (measurement error)
Types of random phenomena
o Discrete: Outcomes are finite
Coin Toss: {H,T}
Throw of a dice: {1, 2, 3, 4, 5, 6}
o Continuous: infinite number of outcomes
Body temperature measurement in deg F

Sample Space, Events (Discrete Phenomena)


Sample space
o Set of all possible outcomes of a random phenomenon
Coin toss: S = {H, T}
Two Coin Tosses: S = {HH, HT, TH, TT}
Event
o Subset of the sample space
Occurrence of a head in the first toss of a two coin toss experiment: A = {HH, HT}
Outcomes of a sample space are elementary events

- Random phenomena and all the notions of probability can be illustrated using just the coin toss experiment.
- A single coin toss has outcomes described by H and T.
- The sample space is the set of all possible outcomes; in this case it consists of the two outcomes denoted by the symbols H and T.
- On the other hand, if there are two successive coin tosses, then there can be 4 possible outcomes, denoted by the symbols HH, HT, TH and TT, and these constitute the sample space.
- Individual outcomes of the sample space, for example HH, HT, TH and TT, can also be considered as events. These events are known as elementary events.

Probability measure – a function that assigns a real value to every event of a random phenomenon, satisfying the following axioms:
0 ≤ P(A) ≤ 1 (probabilities are non-negative and at most 1 for any event A)
P(S) = 1 (probability of the entire sample space – one of the outcomes must occur)
For two mutually exclusive events A and B,
o P(A ∪ B) = P(A) + P(B)


Interpretation of a probability as a frequency:

Conduct an experiment (coin toss) N times.
If N_A is the number of times outcome A occurs, then P(A) ≈ N_A/N, with the approximation improving as N grows.
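As a quick illustration (a sketch added here, not part of the original notes), the frequency interpretation can be simulated in Python:

import random

# Simulate N coin tosses; the relative frequency N_A / N of heads
# approaches P(H) = 0.5 as N grows.
random.seed(0)
for N in (100, 10000, 1000000):
    n_heads = sum(1 for _ in range(N) if random.random() < 0.5)
    print(N, n_heads / N)

The estimates fluctuate for small N and settle near 0.5 as N increases.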

Exclusive and independent events


Independent events
o Two events are independent if the occurrence of one has no influence on the occurrence of the other

Formally, A and B are independent events if and only if P(A ∩ B) = P(A) × P(B)
In a two coin toss experiment, the occurrence of a head in the second toss can be assumed to be independent of the occurrence of a head or tail in the first toss; then P(HH) = P(H in first toss) × P(H in second toss) = 0.5 × 0.5 = 0.25
o All four outcomes in the two coin toss experiment will thus have a probability of 0.25

o Two events are said to be independent if the occurrence of one has no influence on the occurrence of the other. That is, if A and B are independent, knowing that event A has occurred does not improve the predictability of B.
o P(A ∩ B), the joint occurrence of A and B, is then obtained by multiplying the respective probabilities, P(A) × P(B).

Mutually exclusive events

o Two events are mutually exclusive if the occurrence of one implies the other does not occur
In a two coin toss experiment, events {HH} and {HT} are mutually exclusive: P(HH or HT) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5, by the basic law of probability for mutually exclusive events
o Mutually exclusive events are events that preclude each other.
o That is, if event A has occurred then it implies B has not occurred; such A and B are called mutually exclusive events.
o In two successive coin tosses, if two successive heads have occurred, it is clear that the event of a head followed by a tail has not occurred; these are mutually exclusive events.
o The probability of either getting two successive heads or a head followed by a tail can be obtained in this case by simply adding their respective probabilities, because they are mutually exclusive events.

Some rules of Probability


- Venn diagram to derive probability rules for the 2 coin toss experiment.

- In the two coin toss experiment the sample space consists of 4 outcomes, denoted HH, HT, TH and TT.
- The event A is a head in the first toss; it consists of the two outcomes HH and HT.
- The complement of A is the set of all outcomes that exclude A, which is the set of outcomes TH and TT.
- P(A complement) = probability of the entire sample space (which is one) − P(A)
- P(A) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5
- The probability of the complement {TH, TT} = 0.5
- P(A^c) = 1 − P(A)

- A and B are not mutually exclusive: the event of two successive heads (HH) belongs to both A and B
- To compute P(A or B) – a head in the first toss or a head in the second toss – there are three outcomes, which together give a probability of 0.75, counted from the respective probabilities of HT, HH and TH
Conditional Probability
If two events A and B are not independent, then information available about the
outcome of event A can influence the predictability of event B
Conditional probability

o P(B | A) = P(A B)/P(A) if P(A)>0
o P(A | B)P(B) = P(B | A)P(A) – Bayes Formula
o P(A) = P(A | B)P(B) + P(A | BC)P(BC)
Example: two (fair) coin toss experiment
o Event A : First toss in head = {HT, HH}
o Event B : Two successive heads = {HH}
o Pr(B)=0.25 (no information)

o Given event A has occurred Pr(B | A) = 0.5 = 0.25 / 0.5 = P(A B)/P(A)

EXAMPLE:
In a manufacturing process, 1000 parts are produced in a day, of which 50 are defective.
We randomly take a part from the day's production.

o Outcomes: {A = defective part, B = non-defective part}
o P(A) = 50/1000, P(B) = 950/1000

Suppose we draw a second part without replacing the first part.

o Outcomes: {C = defective part, D = non-defective part}
o P(C) = 50/1000 (no information about the outcome of the first draw)
o P(C | A) = 49/999 (given that the first draw is defective)
o P(C | B) = 50/999 (given that the first draw is non-defective)
o P(C) = 49/999 × 50/1000 + 50/999 × 950/1000 = 50/1000
o P(A | C) = P(A ∩ C)/P(C) = P(C | A)P(A)/P(C) = 49/999
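The arithmetic above can be checked with a short Python sketch (illustrative, added here; it uses exact fractions so no rounding hides the result):

from fractions import Fraction

P_A = Fraction(50, 1000)         # first part defective
P_B = Fraction(950, 1000)        # first part non-defective
P_C_given_A = Fraction(49, 999)  # second defective given first defective
P_C_given_B = Fraction(50, 999)  # second defective given first non-defective

# Total probability rule: P(C) = P(C|A)P(A) + P(C|B)P(B)
P_C = P_C_given_A * P_A + P_C_given_B * P_B
print(P_C)                       # 1/20, i.e., 50/1000

# Bayes: P(A|C) = P(C|A)P(A) / P(C)
print(P_C_given_A * P_A / P_C)   # 49/999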
Random Variables and Probability Mass/Density Functions

The notion of random variables and the idea of probability mass and density
functions
How to characterize these functions
How to work with them

Random Variable
A random variable (RV) is a map from the sample space to the real line such that there is a unique real number corresponding to every outcome of the sample space
o Example: coin toss sample space {H, T} mapped to {0, 1}. If the sample space outcomes are already real valued, there is no need for this mapping (e.g., throw of a dice)
o Allows numerical computations such as finding the expected value of an RV
o Discrete RV (throw of a dice or a coin)
o Continuous RV (sensor readings, time interval between failures)
o Associated with the RV is also a probability measure

Probability mass / Density Function
For a discrete RV the probability mass function assigns a probability to every outcome in the sample space
o Sample space of RV x for a coin toss experiment: {0, 1}
o P(x=0) = 0.5; P(x=1) = 0.5 (it is a fair coin, which is why the outcomes are given equal probability)
For a continuous RV the probability density function f(x) can be used to assign a probability to every interval on the real line
A continuous RV x can take any value in (−∞, ∞); the probability of an interval is the area under the density curve:
P(a < x < b) = ∫_a^b f(x) dx
The cumulative distribution function F(x):
F(b) = P(−∞ < x < b) = ∫_{−∞}^b f(x) dx

In the case of a continuous random variable, we define what is known as a probability density function, which can be used to compute the probability that the random variable falls within an interval.
Notice that in the case of a continuous random variable there are infinitely many outcomes, and therefore we cannot associate a probability with every individual outcome.
However, we can associate a probability with the random variable lying within some finite interval.
For a random variable x which can take any value on the real line from −∞ to ∞, the density function f(x) is such that the probability that the variable lies in an interval a to b is the integral of f from a to b.
The integral is an area, and the area represents the probability; for example, the probability that the random variable lies between −1 and 2 is denoted by the shaded area (in the figure).
The cumulative distribution function, denoted by capital F, is the probability that the random variable x lies in the interval −∞ to b: the integral between −∞ and b of the density function, ∫_{−∞}^b f(x) dx.
Other functions – binomial mass function, Gaussian or normal density function, chi-square density function
Other examples of pdfs – uniform density function & exponential density function
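For instance, the shaded-area probability mentioned above can be computed numerically (an illustrative sketch assuming SciPy and a standard normal density, since the notes do not specify the distribution in the figure):

from scipy.stats import norm

# P(-1 < x < 2) for a standard normal RV: area under f(x) from -1 to 2,
# obtained from the CDF as F(2) - F(-1).
p = norm.cdf(2) - norm.cdf(-1)
print(round(p, 4))   # ~0.8186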

Moments of a PDF
Similar to describing a function using its derivatives, a pdf can be described by its moments
o For continuous distributions
E[x^k] = ∫_{−∞}^{∞} x^k f(x) dx
o For discrete distributions
E[x^k] = ∑_{i=1}^{N} x_i^k p(x_i)
o Mean: µ = E[x]
o Variance: σ² = E[(x − µ)²] = E[x²] − µ²
o Standard deviation = square root of variance = σ
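A small sketch (added here, not from the notes; the N(70, 5) parameters are arbitrary illustrative values) showing the moment integrals evaluated numerically:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# E[x^k] = integral of x^k f(x) dx, here for a normal density
# with mu = 70 and sigma = 5 (assumed values for illustration).
mu, sigma = 70, 5
m1, _ = quad(lambda x: x * norm.pdf(x, mu, sigma), -np.inf, np.inf)
m2, _ = quad(lambda x: x**2 * norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(m1)            # mean, ~70
print(m2 - m1**2)    # variance E[x^2] - mu^2, ~25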


01.12.2022
Sample Statistics

Probability provides a theoretical framework for performing statistical analysis of data.
Statistics deals with the analysis of the experimental observations that have been obtained.

Need for sampling

PDFs of RVs establish the theoretical framework.
o But the entire sample space may not be known
o Parameters of the distribution may not be known
From a finite sample, derive conclusions about the pdf and its parameters
The sample (or observation) set is assumed to be sufficiently representative of the entire sample space
o Proper sampling procedures and design of experiments are to be used for obtaining the sample
While doing the analysis we do not know the entire sample space.
We may also not know all the parameters of the distribution from which the samples are being drawn.
Typically we obtain only a few samples out of the total population.
So, from this finite sample we have to derive conclusions about the probability density function of the entire population, and also make inferences about the parameters of the distribution.
The sample or observation set is therefore supposed to be sufficiently representative of the entire sample space.
Example – to find the average height of people in the world, one cannot take the heights of American people alone, because they are known to be much taller than, say, Asian people.
While taking samples, take them from Europe, Asia and so on, so that the sample is representative of the entire population of the world.
This is called proper sampling procedure, and such issues are dealt with in the design of experiments.

Basic concepts
Population – Set of all possible outcomes of a random experiment characterized
by f(x)
Sample set (realization) – Finite set of observations obtained through an
experiment
Inference – conclusion derived regarding the population (pdf, parameters) from
the sample set
o An inference made from a sample set is itself uncertain, since it depends on the sample set, which is one of many possible realizations; therefore also provide the confidence interval associated with the derived estimates
Statistical analysis
o Descriptive statistics (Analysis)
Graphical – organizing and presenting the data (ex: Box plots,
probability plots)
Numerical – summarizing the sample set (ex: mean, mode, range,
variance, moments)
o Inferential
Estimation – estimate parameters of the pdf along with its
confidence region
Hypothesis testing – making judgments about f(x) and its
parameters
Measures of central tendency
o Represent sample set by a single value
Mean (or average): x̅ = (1/N) ∑ x_i

Best estimate in the least squares sense
Unbiased estimate of the population mean: E[x̅] = µ
Affected by outliers
Ex: sample heights of 20 cherry trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Mean = 1385 / 20 = 69.25 (population mean used to generate the random sample was 70)
Mean = 1435 / 20 = 71.75 (after a bias of 50 was added to the first sample value)
The mean of a sample is defined as the sum of all the data points obtained divided by the number of data points.
It is denoted by the symbol x̅ and is also called the mean or the average of the sample.

Unbiased estimate – the expectation of x̅ is μ. This can be proven analytically for any kind of distribution.
Take a sample of N points and compute an estimate x̅. Repeat the experiment, draw another random sample of N points from the population, and compute another value of x̅.
If we average these values of x̅ from the different experimental sets, the average of these averages will tend to the population mean.
This is a useful and important property of the estimate.
One unfortunate aspect of this statistic is that if there is one bad data point in the sample, the estimate x̅ can be significantly affected by it. Such a bad value is called an outlier, and even a single outlier in the data can give rise to a bad estimate x̅.
A single biased value in the sample will lead to a poor estimate.

MEDIAN
Represent sample set by a single value
o Median – the value x_m such that 50% of the values are less than x_m and 50% of the observations are greater than x_m
Robust with respect to outliers in the data
Best estimate in the least absolute deviation sense
Ex: sample heights of 20 cherry trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Median = 69 (population mean used to generate the random sample was 70)
Median = 69 (after a bias of 50 was added to the first sample value)
Another measure of central tendency is the median.
The median is a value such that 50 percent of the data points lie below it and 50 percent of the experimental observations lie above it.
Order all the observations from smallest to largest and find the middle value.
Because there is an even number of points, take the average of the 10th point, 67, and the 11th point, 71, and call that the median: (67 + 71)/2 = 138/2 = 69.
If there is an odd number of points, take the middle point just as it is.
Add a bias to the first data point to make it 105, reorder the data, and find the median again; the median has not changed.
So, the presence of an outlier has not affected the median.
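The mean/median comparison above can be reproduced with a short sketch (illustrative, assuming NumPy):

import numpy as np

heights = np.array([55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
                    71, 71, 72, 73, 75, 75, 78, 81, 82, 83])
print(heights.mean(), np.median(heights))   # 69.25, 69.0

# Add a bias of 50 to the first sample value (55 -> 105):
# the mean shifts to 71.75 while the median stays at 69.
biased = heights.copy()
biased[0] += 50
print(biased.mean(), np.median(biased))     # 71.75, 69.0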

MODE
Represent sample set by a single value
o Mode – the value that occurs most often (the most probable value)
o Ex: sample heights of 20 cherry trees
o [55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
o Mode = 67 (3 occurrences)
The mode is another measure of central tendency: the value that occurs most often, also called the most probable value.
Sometimes a distribution may have two modes – a bimodal distribution – in which case sampling from it will give two clusters, one around each of the modes.

Measures of spread
Represents spread of sample set

https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php

https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php

Another set of measures that characterize a sample set are the measures of spread, which tell how widely the data ranges.
Sample variance: s² = (1/(N−1)) ∑ (x_i − x̅)²
The sample variance is an unbiased estimate of the population variance.
The square root of the sample variance is also known as the standard deviation.
The sample variance also happens to be very susceptible to outliers.
So, if there is a single outlier, the sample variance and sample standard deviation can become very poor estimates of the population parameters.

Another measure of spread, somewhat similar in spirit to the median, is the mean absolute deviation: MAD = (1/N) ∑ |x_i − median|
A third measure of spread is the range, which is simply the difference between the maximum and minimum values.

Example: refer to the measures-of-spread image (values highlighted in red).
A single outlier can cause the standard deviation and the variance to become very poor estimates, which therefore cannot be trusted as good estimates of the population standard deviation or variance.
The mean absolute deviation from the median is even better in terms of robustness with respect to the outlier.
The range of the data is obtained from the maximum and minimum values.
Even when the 20 data points are not given, the mean and standard deviation alone convey the properties of the sample (the power of sample statistics).
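The spread measures for the cherry-tree sample can be computed as follows (an illustrative sketch assuming NumPy):

import numpy as np

heights = np.array([55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
                    71, 71, 72, 73, 75, 75, 78, 81, 82, 83])

variance = heights.var(ddof=1)    # sample variance, N-1 divisor
std_dev = heights.std(ddof=1)     # sample standard deviation
mad = np.mean(np.abs(heights - np.median(heights)))  # mean abs. deviation from median
rng = heights.max() - heights.min()                  # range

print(variance, std_dev, mad, rng)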

03.12.2022
Graphical Analysis – Histograms, Box Plot, Probability Plot, Scatter Plot

Histograms
o Divide the range of values in the sample set into small intervals and count how many observations fall within each interval
o For each interval plot a rectangle with width equal to the interval size and height equal to the number of observations in the interval
o Example – sample of 20 heights of black cherry trees
[73 75 55 60 66 71 81 67 83 75 82 71 63 55 72 78 67 65 67 59]

Given a sample set, first divide its range into small intervals and count how many observations fall within each interval.
Plot the interval (interval size) on the x axis and the number of data points in that interval on the y axis.
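A histogram of this sample can be produced with a short sketch (illustrative, assuming Matplotlib; a bin width of 5 is an arbitrary choice):

import matplotlib.pyplot as plt

heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

# Divide the range into intervals of width 5 and count the observations
# in each; plt.hist performs both steps.
plt.hist(heights, bins=range(55, 90, 5), edgecolor='black')
plt.xlabel('Height')
plt.ylabel('Number of observations')
plt.show()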

Box Plot

Another kind of plot is the box plot, often used, for example, in visualizing stock prices.
Compute the quantities called quartiles, Q1, Q2 and Q3, and the minimum and maximum values of the data.
What are quartiles?
Quartiles are an extension of the idea of the median.
Q2 is exactly the median, which means half the points fall below the value of Q2 and half the points lie above it.
Similarly, Q1 represents the 25 percent value, which means 25 percent of the observations fall below Q1 and 75 percent above it; Q3 implies that 75 percent of the data points fall below Q3 and 25 percent above it.
Once you have these values – the median, the quartiles and the minimum and maximum – you can plot what is called the box and whisker plot.
The lines extending to the lowest and highest observations are called the whiskers.
This gives a little more information about the spread of the data.
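The quartiles and the box plot itself can be obtained as follows (an illustrative sketch assuming NumPy and Matplotlib):

import numpy as np
import matplotlib.pyplot as plt

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

q1, q2, q3 = np.percentile(heights, [25, 50, 75])
print(q1, q2, q3, min(heights), max(heights))  # Q2 = 69, the median

# For this data the whiskers reach the min and max
# (no points lie beyond 1.5 IQR from the box).
plt.boxplot(heights)
plt.show()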

Probability Plot

The third kind of plot, which is very useful for learning about the distribution of the data, is the probability plot: the p-p plot or the q-q plot.
Standardization means removing the mean and dividing by the standard deviation.
Sort the 20 values (the standardized values) from lowest to highest.

https://mathcracker.com/normal-probability-plot-maker#results

55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83

The theoretical frequencies f_i need to be computed, as well as the associated z-scores z_i, for i = 1, 2, ..., 20.
The theoretical frequencies f_i are approximated using the formula: f_i = (i − 0.375) / (n + 0.25)
where i corresponds to the position in the ordered dataset, and z_i is the corresponding associated z-score, computed as z_i = Φ⁻¹(f_i).

Position (i)   X (asc. order)   f_i      z_i
1              55               0.0309   -1.868
2              55               0.0802   -1.403
3              59               0.1296   -1.128
4              60               0.1790   -0.919
5              63               0.2284   -0.744
6              65               0.2778   -0.589
7              66               0.3272   -0.448
8              67               0.3765   -0.315
9              67               0.4259   -0.187
10             67               0.4753   -0.062
11             71               0.5247    0.062
12             71               0.5741    0.187
13             72               0.6235    0.315
14             73               0.6728    0.448
15             75               0.7222    0.589
16             75               0.7716    0.744
17             78               0.8210    0.919
18             81               0.8704    1.128
19             82               0.9198    1.403
20             83               0.9691    1.868

The normal probability plot is obtained by plotting the X values (sample data) on the horizontal axis and the corresponding z_i values on the vertical axis.

The following normality plot is obtained:
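The table and plot can be reproduced with a short sketch (illustrative, assuming NumPy, SciPy and Matplotlib):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.sort([55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
             71, 71, 72, 73, 75, 75, 78, 81, 82, 83])
n = len(x)
i = np.arange(1, n + 1)
f = (i - 0.375) / (n + 0.25)   # theoretical frequencies f_i
z = norm.ppf(f)                # z_i = inverse standard normal CDF of f_i

plt.scatter(x, z)              # a roughly straight line suggests normality
plt.xlabel('Sample value')
plt.ylabel('Theoretical z-score')
plt.show()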

Scatter plot

The scatter plot plots one random variable against another.
If there are two random variables, say y and x, and we want to know whether there is any relationship between them, one way of visually verifying this dependency is to plot y versus x.
Example: data corresponding to 100 students – the time students spent preparing for a quiz and the marks they obtained in that quiz.
If more time was spent on study, then higher marks might have been scored.
X axis: time spent; Y axis: marks obtained.
If the random variables have a dependency, an alignment of the data can be seen.
If there is no dependency, the data will spread randomly with no clear pattern.
This plot is helpful for assessing the dependency between two variables before proceeding to further analysis.

http://www.alcula.com/calculators/statistics/scatter-plot/

Statistics Calculator: Scatter Plot


Make a scatter plot from a set of data.

Data
1. 3, 100
2. 3, 100
3. 2, 75
4. 1, 50
5. 1, 45
6. 3, 100
7. 3, 100
8. 2, 75
9. 1, 50
10. 1, 45
11. 3, 100
12. 3, 100
13. 2, 75
14. 1, 50
15. 1, 45
16. 3, 100
17. 3, 100
18. 2, 75
19. 1, 50
20. 1, 45
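A sketch that draws this scatter plot (illustrative, assuming Matplotlib; the 20 data pairs are taken from the list above, where the 5-pair pattern repeats 4 times):

import matplotlib.pyplot as plt

hours = [3, 3, 2, 1, 1] * 4            # time spent preparing
marks = [100, 100, 75, 50, 45] * 4     # marks obtained

plt.scatter(hours, marks)
plt.xlabel('Time spent')
plt.ylabel('Marks obtained')
plt.show()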

08.12.2022
Hypothesis testing
The basics of hypothesis testing, an important activity when making decisions from a set of data.

Motivation for Hypothesis Testing


Business – will an investment in a mutual fund yield annual returns greater than a desired value? (based on past performance of the fund)
Medical – is the incidence of diabetes greater among males than females?
Social – are women more likely to change mobile service provider than men?
Engineering – has the efficiency of the pump decreased from its original value due to aging?

Hypothesis testing
The hypothesis is generally converted to a test of the mean or variance parameter of a population (or of differences in means or variances of populations)
A hypothesis is a statement or postulate about the parameters of a distribution
(or model)
o Null hypothesis H0 – The default or status quo postulate that we wish to
reject if the sample set provides sufficient evidence
o Alternative hypothesis H1 – The alternative postulate that is accepted if
the null hypothesis is rejected

Hypothesis testing procedure

Identify the parameter of interest (mean, variance, proportion) which you wish to test
Construct the null and alternative hypotheses
Compute a test statistic, which is a function of the sample set of observations
Derive the distribution of the test statistic under the null hypothesis assumption
Choose a test criterion (threshold) against which the test statistic is compared to reject / not reject the null hypothesis

No hypothesis test is perfect. There are inherent errors since the test is based on observations which are random.
The performance of a hypothesis test depends on
o Extent of variability in the data
o Number of observations (sample size)
o Test statistic (function of observations)
o Test criterion (threshold)
There are 2 types of hypothesis tests – two sided and one sided.
A two sided test has a lower criterion threshold and an upper threshold, selected from the appropriate distribution.
Depending on the type of test, thresholds are chosen and the test statistic is compared against those thresholds.

Errors in hypothesis testing
Two types of errors (Type I & Type II)

Typically the Type I error probability α (also called the level of significance of the test) is controlled by choosing the criterion from the distribution of the test statistic under the null hypothesis.
A Type I error is a false alarm (rejecting the null hypothesis when it is true).
A Type II error (failing to reject a false null hypothesis) has a probability denoted by β.
The probability of the corresponding correct decision is known as the power of the statistical test and is denoted by 1 − β.
Summary of useful hypothesis tests

https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/#Hypothesis

What is a Hypothesis?
A hypothesis is an educated guess about something in the world around you. It should
be testable, either by experiment or observation. For example:
A new medicine you think might work.
A way of teaching you think might be better.

What is a Hypothesis Statement?


If you are going to propose a hypothesis, it’s customary to write a statement. Your
statement will look like this:
“If I…(do this to an independent variable)….then (this will happen to the dependent
variable).”
For example:
If I (decrease the amount of water given to herbs) then (the herbs will increase in
size).
If I (give patients counseling in addition to medication) then (their overall
depression scale will decrease).
If I (give exams at noon instead of 7) then (student test scores will improve).

A good hypothesis statement should:


Include an “if” and “then” statement (according to the University of California).
Include both the independent and dependent variables.
Be testable by experiment, survey or other scientifically sound technique.
Be based on information in prior research (either yours or someone else’s).

Have design criteria (for engineering or programming projects).

What is Hypothesis Testing?


Hypothesis testing in statistics is a way for you to test the results of a survey or
experiment to see if you have meaningful results. You’re basically testing whether your
results are valid by figuring out the odds that your results have happened by chance. If
your results may have happened by chance, the experiment won’t be repeatable and so
has little use.

What is the Null Hypothesis?


If you trace back the history of science, the null hypothesis is always the accepted fact.
Simple examples of null hypotheses that are generally accepted as being true are:
1. DNA is shaped like a double helix.
2. There are 8 planets in the solar system (excluding Pluto).
3. Taking Vioxx can increase your risk of heart problems (a drug now taken off the
market).

Hypothesis Testing
Example #1: Basic Example
A researcher thinks that if knee surgery patients go to physical therapy twice a week
(instead of 3 times), their recovery period will be longer.
Average recovery time for knee surgery patients is 8.2 weeks.
The hypothesis statement in this question is that the researcher believes the
average recovery time is more than 8.2 weeks.
It can be written in mathematical terms as: H1: μ > 8.2
Next, state the null hypothesis.
That’s what will happen if the researcher is wrong.
In the above example, if the researcher is wrong then the recovery time is less than
or equal to 8.2 weeks.
In math, that’s: H0: μ ≤ 8.2

Example #2: Basic Example


A principal at a certain school claims that the students in his school are above
average intelligence.
A random sample of thirty students’ IQ scores has a mean score of 112.5.
Is there sufficient evidence to support the principal’s claim?
The mean population IQ is 100 with a standard deviation of 15.

Step 1: State the null hypothesis.
The accepted fact is that the population mean is 100, so: H0: μ = 100.

Step 2: State the alternate hypothesis.
The claim is that the students have above average IQ scores, so: H1: μ > 100.
The fact that we are looking for scores “greater than” a certain point means that this is a one-tailed test.

Step 3: Draw a picture to help you visualize the problem.

Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).

Step 5: Find the rejection region (given by your alpha level above) from the z-table. An area of 0.05 corresponds to a z-score of 1.645.

Step 6: Find the test statistic using the formula z = (x̅ − μ)/(σ/√n).
For this set of data: z = (112.5 – 100) / (15/√30) = 4.56

Step 7: If the statistic from Step 6 is greater than the threshold from Step 5, reject the null hypothesis; if it is less, you cannot reject the null hypothesis. In this case it is greater (4.56 > 1.645), so you can reject the null.
https://youtu.be/N5Wdfd3exmc
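The whole test can be condensed into a few lines (an illustrative sketch assuming SciPy):

from math import sqrt
from scipy.stats import norm

# One-tailed z-test for the school IQ example.
mu0, sigma, n, xbar, alpha = 100, 15, 30, 112.5, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))   # test statistic, ~4.56
z_crit = norm.ppf(1 - alpha)           # rejection threshold, ~1.645

print(z, z_crit, z > z_crit)           # True -> reject H0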

11.12.2022
https://youtu.be/cL5ie-669rc

Confidence Interval

Estimation and confidence interval


Levels of confidence
Interpretation of interval estimation
Margin of error
Interval estimation of a population mean

Estimation and confidence interval


1. A point estimate is a single value used to estimate a population value
2. An interval estimate states the range within which a population parameter
probably lies
3. A confidence interval is a range of values within which the population parameter
is expected to occur
4. The two confidence levels that are used extensively are 95% and 99%

Level of confidence (1 − α)

1. The level of confidence is the probability that the unknown population parameter falls within the interval
2. Denote the level of confidence by (1 − α)%
3. Examples – 90%, 95%, 99%
4. α (alpha) is the probability that the parameter is not within the interval

Interpretation of interval estimation


1. For a 95% confidence interval, about 95% of similarly constructed intervals will contain the parameter being estimated
2. 95% of the sample means for a specified sample size will lie within 1.96 standard deviations of the hypothesized population mean
3. For the 99% confidence interval, 99% of the sample means for a specified sample size will lie within 2.58 standard deviations of the hypothesized population mean

Margin of error & Interval estimate


1. An interval estimate can be calculated by adding or subtracting the margin of error to/from the point estimate
2. The purpose of an interval estimate is to provide information about how close the point estimate is to the value of the parameter
3. The general form of an interval estimate of a population mean is:
X̅ ± Margin of Error
4. Interval estimate of a population mean: X̅ ± Z (s / √n), where
X̅ = sample mean
Z = number of standard deviations from the sample mean
s = standard deviation of the sample
n = size of the sample

Example:

Suppose a student measuring the boiling temperature of a certain liquid observes the readings (in degrees Celsius) 101.7, 102.4, 106.3, 105.9, 101.2 and 102.7 on 6 different samples of the liquid. He calculates the sample mean to be 103.4. If he knows that the standard deviation for this procedure is 1.2 degrees, what is the confidence interval for the population mean at a 95% confidence level?

Use the interval estimate formula above:

= 103.4 ± 1.96 × (1.2 / √6)
= 103.4 ± 0.96
= (102.44, 104.36)
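The same interval can be computed programmatically (an illustrative sketch assuming SciPy):

from math import sqrt
from scipy.stats import norm

readings = [101.7, 102.4, 106.3, 105.9, 101.2, 102.7]
n = len(readings)
xbar = sum(readings) / n       # ~103.4
sigma = 1.2                    # known standard deviation
z = norm.ppf(0.975)            # ~1.96 for a 95% confidence level

margin = z * sigma / sqrt(n)   # ~0.96
print(xbar - margin, xbar + margin)   # ~ (102.4, 104.3)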

https://www.samlau.me/test-textbook/ch/18/hyp_phacking.html

P-hacking
A p-value, or probability value, is the chance, computed under the model in the null hypothesis, that the test statistic is equal to the value observed in the data or is even further in the direction of the alternative.
If a p-value is small, the tail beyond the observed statistic is small, and so the observed statistic is far away from what the null predicts.
This implies that the data support the alternative hypothesis better than they support the null.
By convention, when the p-value is below 0.05, the result is called statistically significant and the null hypothesis is rejected.
There are dangers that present themselves when the p-value is misused.
P-hacking is the act of misusing data analysis to show that patterns in data are statistically significant when in reality they are not.
This is often done by performing multiple tests on the data and only focusing on the tests that return significant results.

Multiple Hypothesis Testing


One of the biggest dangers of blindly relying on the p-value to determine “statistical significance” arises when we are just trying to find the results that give us “good” p-values.
This is common with “food frequency questionnaires” (FFQs), used to study the correlation of eating habits with other characteristics (diseases, weight, religion, etc.).
FiveThirtyEight, an online blog that focuses on opinion poll analysis, made their own FFQ; using this data, analyses were carried out to find results that could be considered “statistically significant.”

import pandas as pd

data = pd.read_csv('raw_anonymized_data.csv')
# Recode categorical values to 1s and 0s
data.replace('Yes', 1, inplace=True)
data.replace('Innie', 1, inplace=True)
data.replace('No', 0, inplace=True)
data.replace('Outie', 0, inplace=True)

# These are some of the columns that give us characteristics of FFQ-takers
characteristics = ['cat', 'dog', 'right_hand', 'left_hand']

# These are some of the columns that give us the quantities/frequencies of
# different foods the FFQ-takers ate
ffq = ['EGGROLLQUAN', 'SHELLFISHQUAN', 'COFFEEDRINKSFREQ']

Look specifically at whether people own cats or dogs, and their handedness.
data[characteristics].head()
cat dog right_hand left_hand
0 0 0 1 0
1 0 0 1 0
2 0 1 1 0
3 0 0 1 0
4 0 0 1 0
Additionally, look at how much shellfish, eggrolls, and coffee people consumed.
data[ffq].head()
EGGROLLQUAN SHELLFISHQUAN COFFEEDRINKSFREQ
0 1 3 2
1 1 2 3
2 2 3 3
3 3 2 1
4 2 2 2
Calculate the p-value for every pair of characteristic and food frequency/quantity features.

# Calculate the p-value between every characteristic and food
# frequency/quantity pair (findpvalue is defined in the source textbook)
pvalues = {}
for c in characteristics:
    for f in ffq:
        pvalues[(c, f)] = findpvalue(data, c, f)
pvalues
{('cat', 'EGGROLLQUAN'): 0.69295273146288583,
('cat', 'SHELLFISHQUAN'): 0.39907214094767007,
('cat', 'COFFEEDRINKSFREQ'): 0.0016303467897390215,
('dog', 'EGGROLLQUAN'): 2.8476184473490123e-05,
('dog', 'SHELLFISHQUAN'): 0.14713568495622972,
('dog', 'COFFEEDRINKSFREQ'): 0.3507350497291003,
('right_hand', 'EGGROLLQUAN'): 0.20123440208411372,
('right_hand', 'SHELLFISHQUAN'): 0.00020312599063263847,
('right_hand', 'COFFEEDRINKSFREQ'): 0.48693234457564749,
('left_hand', 'EGGROLLQUAN'): 0.75803051153936374,
('left_hand', 'SHELLFISHQUAN'): 0.00035282554635466211,
('left_hand', 'COFFEEDRINKSFREQ'): 0.1692235856830212}
The study finds that:

Eating/Drinking   Is linked to       P-value
Egg rolls         Dog ownership      <0.0001
Shellfish         Right-handedness   0.0002
Shellfish         Left-handedness    0.0004
Coffee            Cat ownership      0.0016

Clearly this is flawed.

Aside from the fact that some of these correlations seem to make no sense, shellfish is found to be linked to both right- and left-handedness.
This is because all columns were blindly tested against each other for statistical significance, and only the pairs giving “statistically significant” results were chosen.
This shows the dangers of blindly following the p-value without a care for proper experimental design.

Example:
A simple example of this is rolling a pair of dice and getting two 6s.
Under the null hypothesis that the dice are fair and not weighted, taking the test statistic to be the sum of the dice, the p-value of this outcome is 1/36 ≈ 0.028, which gives a “statistically significant” result that the dice are weighted.
But obviously a single roll is nowhere near enough rolls to provide good evidence either way, and this shows that blindly applying the p-value without properly designing a good experiment can result in bad conclusions.
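The multiple-testing danger is easy to demonstrate by simulation (an illustrative sketch, not from the source; assumes NumPy and SciPy):

import numpy as np
from scipy.stats import ttest_ind

# Test 100 pairs of samples drawn from the SAME distribution, so the
# null hypothesis is true every time; ~5% still fall below p = 0.05
# purely by chance. Reporting only those pairs is p-hacking.
rng = np.random.default_rng(0)
false_hits = 0
for _ in range(100):
    a = rng.normal(size=50)
    b = rng.normal(size=50)
    if ttest_ind(a, b).pvalue < 0.05:
        false_hits += 1
print(false_hits, "of 100 null tests were 'significant'")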

Bayesian Inference
https://towardsdatascience.com/what-is-bayesian-inference-4eda9f9e20a6

Frequentism is based on frequencies of events.


Bayesianism is based on our knowledge of events.
The prior represents your knowledge of the parameters before seeing data.
The likelihood is the probability of the data given values of the parameters.
The posterior is the probability of the parameters given the data.
Bayes’ theorem relates the prior, likelihood, and posterior distributions.
MLE is the maximum likelihood estimate, which is what frequentists use.
MAP is the maximum a posteriori estimate, which is what Bayesians use.

Illustration of how our prior knowledge affects our posterior knowledge

Machine learning is mainly concerned with prediction, and prediction is very much concerned with probability.
There are two main interpretations of probability: 1) frequentism and 2) Bayesianism.
The frequentist (or classical) definition of probability is based on frequencies of events, whereas the Bayesian definition of probability is based on our knowledge of events. (What the data say versus what we know from the data.)
Analogy: where did you lose your phone?
Both the frequentist and the Bayesian use their ears when inferring where to look for the phone, but the Bayesian also incorporates prior knowledge about the lost phone into their inference.

Bayes’ theorem
We have two sets of outcomes A and B (also called events); denote the probabilities of each event P(A) and P(B) respectively.
The probability of both events is denoted by the joint probability P(A, B), which expands in terms of conditional probabilities as
P(A, B) = P(A|B) P(B)    (1)
i.e., the conditional probability of A given B times the probability of B gives the joint probability of A and B. It follows likewise that
P(A, B) = P(B|A) P(A)    (2)
Since the left-hand sides of (1) and (2) are the same, the right-hand sides are equal:
P(A|B) P(B) = P(B|A) P(A)
P(A|B) = P(B|A) P(A) / P(B)
This is Bayes’ theorem.

The evidence (the denominator above) ensures that the posterior distribution on the left-hand side is a valid probability density and is called the normalizing constant.
In words, the theorem states: Posterior ∝ Likelihood × Prior, where ∝ means “proportional to”.

Example: coin flipping

A coin flips heads up with probability θ and tails with probability 1 − θ (where θ is unknown).
You flip the coin 11 times and it comes up heads 8 times. Would you bet for or against the event that the next two tosses turn up heads?
Let X be a random variable representing the coin, where X = 1 is heads and X = 0 is tails, such that P(X=1) = θ and P(X=0) = 1 − θ. Furthermore, let D denote our data (8 heads, 3 tails).
Estimate the value of the parameter θ so that we can calculate the probability of seeing 2 heads in a row.
If that probability is less than 0.5 we will bet against seeing 2 heads in a row, and if it is above 0.5 we bet for.

Frequentist approach
As the frequentist, maximize the likelihood, which is to ask: what value of θ maximizes the probability that we got D given θ? More formally, we want to find
θ̂_MLE = argmax_θ P(D|θ)
This is called maximum likelihood estimation (MLE).

The 11 coin flips follow a binomial distribution with n = 11 trials, k = 8 successes, and θ the probability of success.
Using the likelihood of a binomial distribution, find the value of θ that maximizes the probability of the data:
P(D|θ) = C(n, k) θ^k (1 − θ)^(n−k)    (3)
Note that (3) expresses the likelihood of θ given D, which is not the same as saying the probability of θ given D.

(Illustration: the likelihood function P(D|θ), as a function of θ, with a vertical line at the maximum likelihood estimate.)

The value of θ that maximizes the likelihood is k/n, i.e., the proportion of successes in the trials.
The maximum likelihood estimate is therefore k/n = 8/11 ≈ 0.73.
Assuming the coin flips are independent, the probability of seeing 2 heads in a row is θ̂² = (8/11)² ≈ 0.53.
Since the probability of seeing 2 heads in a row is larger than 0.5, we would bet for!

Bayesian approach

As the Bayesian, maximize the posterior, which is to ask: what value of θ maximizes the probability of θ given D?
θ̂_MAP = argmax_θ P(θ|D)
This is called maximum a posteriori (MAP) estimation.
To answer the question, use Bayes’ theorem:
P(θ|D) = P(D|θ) P(θ) / P(D)
Since the evidence P(D) is a normalizing constant not dependent on θ, ignore it. This now gives
P(θ|D) ∝ P(D|θ) P(θ)
The likelihood P(D|θ) was found in the frequentist approach, equation (3); drop the binomial coefficient, since it is not dependent on θ.
The only thing left is the prior distribution P(θ), which describes our initial (prior) knowledge of θ.
A convenient distribution to choose is the Beta distribution, because it is defined on the interval [0, 1], and θ is a probability, which has to be between 0 and 1:
P(θ) = [Γ(α+β) / (Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1)
where Γ is the Gamma function. Since the fraction is not dependent on θ, ignore it, which gives
P(θ) ∝ θ^(α−1) (1 − θ)^(β−1)
Set the prior distribution in such a way that it incorporates what we know about θ prior to seeing the data.
We know that coins are usually fairly fair, and choosing α = β = 2 gives a Beta distribution that favors θ = 0.5 more than θ = 0 or θ = 1.
The illustration below shows this prior Beta(2, 2), the normalized likelihood, and the resulting posterior distribution.

(Illustration: the prior P(θ), likelihood P(D|θ), and posterior distribution P(θ|D), with a vertical line at the maximum a posteriori estimate.)

The posterior distribution ends up being dragged a little towards the prior distribution, which makes the MAP estimate slightly different from the MLE estimate:
θ̂_MAP = (k + α − 1) / (n + α + β − 2) = 9/13 ≈ 0.69
which is a little lower than the MLE estimate. If we now use the MAP estimate to calculate the probability of seeing 2 heads in a row, we get (9/13)² ≈ 0.48 < 0.5, so we will bet against it.
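The two estimates and the resulting bets can be checked in a few lines (an illustrative sketch; it uses the closed-form mode of the Beta posterior rather than numerical optimization):

# 8 heads in 11 flips, Beta(2, 2) prior on theta.
k, n = 8, 11
a, b = 2, 2

theta_mle = k / n                          # 8/11 ~ 0.727
theta_map = (k + a - 1) / (n + a + b - 2)  # 9/13 ~ 0.692, mode of Beta(k+a, n-k+b)

print(theta_mle**2)   # ~0.53 > 0.5 -> the frequentist bets for two heads
print(theta_map**2)   # ~0.48 < 0.5 -> the Bayesian bets against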
