Data Science Unit-2
UNIT – II
22.11.2022
STATISTICAL MODELING
RANDOM VARIABLE
SAMPLE STATISTICS
HYPOTHESIS TESTING
CONFIDENCE INTERVALS
P HACKING
BAYESIAN INFERENCE
Objectives
- Introduction to random variables
- How random variables are characterized using probability measures and probability density functions
- How the parameters of these density functions can be estimated
- How to make decisions from data using the method of hypothesis testing
- Characterizing random phenomena: what they are, and how probability can be used as a measure to describe them
Statistical Modeling
Random phenomena
1) Deterministic phenomenon
2) Stochastic phenomenon
Why are we dealing with stochastic phenomena?
- Data obtained from experiments contains some errors.
- One reason for these errors is that not all the rules governing the data-generating process are known; in other words, we lack knowledge of all the laws and causes that affect the outcomes.
- These are called modeling errors.
- The other kind of error is due to the sensor itself: the sensors used for observing the outcomes may contain errors.
- Such errors are called measurement errors.
- Both kinds of error are modeled using probability density functions, and therefore the outcomes are also predicted with certain confidence intervals.
- Random phenomena can be discrete, where the number of outcomes is finite.
- Example: coin toss experiment – only two outcomes (either head or tail)
- Example: throw of a die – 6 outcomes
- Continuous random phenomena have an infinite number of outcomes.
- Example: measurement of body temperature (varies from 96 to 105 degrees Fahrenheit, depending on whether the person is running a fever or not)
- Such variables with a continuum of random outcomes are called continuous random variables.
- All the notions of probability can be illustrated using the coin toss experiment.
- A single coin toss has outcomes denoted by the symbols H and T.
- The sample space is the set of all possible outcomes; in this case it consists of the two outcomes H and T.
- On the other hand, for two successive coin tosses there are 4 possible outcomes, denoted by the symbols HH, HT, TH and TT, and these constitute the sample space.
- Outcomes of the sample space, for example HH, HT, TH and TT, can also be considered as events. These events are known as elementary events.
o Two events are said to be independent if the occurrence of one has no influence on the occurrence of the other. That is, even if event A occurs, it is not possible to make any improvement in the predictability of B when A and B are independent.
o Formally, P(A ∩ B), the probability of the joint occurrence of A and B, is obtained by multiplying their respective probabilities: P(A ∩ B) = P(A) P(B).
- In the two-coin-toss experiment the sample space consists of 4 outcomes denoted by HH, HT, TH and TT.
- Let event A be a head in the first toss; it consists of the two outcomes HH and HT.
- The complement of A is the set of all outcomes that exclude A, which is the set of outcomes TH and TT.
- The probability of the complement = the probability of the entire sample space (which is one) minus P(A).
- P(A) = P(HH) + P(HT) = 0.25 + 0.25 = 0.5.
- The probability of the complement {TH, TT} = 0.5.
- In general, P(Ac) = 1 − P(A).
- Let event B be a head in the second toss. A and B are not mutually exclusive: the common outcome HH (two successive heads) belongs to both A and B.
- To compute P(A ∪ B) – a head in the first toss or a head in the second toss – count the three outcomes HH, HT and TH; adding their respective probabilities gives 0.75.
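A minimal sketch in Python (illustrative, not part of the original notes) that enumerates this sample space and verifies the probabilities above:

from itertools import product
from fractions import Fraction

# Enumerate the two-coin-toss sample space: ['HH', 'HT', 'TH', 'TT']
sample_space = [''.join(p) for p in product('HT', repeat=2)]
p = Fraction(1, len(sample_space))             # each elementary event has probability 1/4

A = {s for s in sample_space if s[0] == 'H'}   # head on the first toss: {HH, HT}
B = {s for s in sample_space if s[1] == 'H'}   # head on the second toss: {HH, TH}

print(float(len(A) * p))                       # P(A) = 0.5
print(float(len(A & B) * p))                   # P(A ∩ B) = 0.25 = P(A) * P(B): independent
print(float(len(A | B) * p))                   # P(A ∪ B) = 0.75
print(float(1 - len(A) * p))                   # P(Ac) = 0.5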
Conditional Probability
If two events A and B are not independent, then information available about the
outcome of event A can influence the predictability of event B
Conditional probability formulas:
o P(B | A) = P(A ∩ B) / P(A), if P(A) > 0
o P(A | B) P(B) = P(B | A) P(A) – Bayes' formula
o P(A) = P(A | B) P(B) + P(A | Bc) P(Bc) – total probability rule
Example: two (fair) coin toss experiment
o Event A: first toss is a head = {HT, HH}
o Event B: two successive heads = {HH}
o Pr(B) = 0.25 (with no other information)
o Given that event A has occurred: Pr(B | A) = P(A ∩ B) / P(A) = 0.25 / 0.5 = 0.5
EXAMPLE:
In a manufacturing process, 1000 parts are produced in a day, of which 50 are defective. If we randomly take a part from the day's production, the probability that it is defective is 50/1000 = 0.05.
Next: the notion of random variables, the idea of probability mass and density functions, how to characterize these functions, and how to work with them.
Random Variable
A random variable (RV) is a map from the sample space to the real line such that there is a unique real number corresponding to every outcome of the sample space.
o Example: the coin toss sample space {H, T} mapped to {0, 1}. If the sample space outcomes are already real valued, no mapping is needed (e.g., the throw of a die).
o The mapping allows numerical computations such as finding the expected value of an RV.
o Discrete RV (throw of a die, toss of a coin)
o Continuous RV (sensor readings, time interval between failures)
o Associated with the RV is also a probability measure.
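A minimal sketch in Python (illustrative, not from the lecture) of this mapping and of the numerical computation it enables:

import random

# Map the coin-toss sample space {H, T} to the real numbers {1, 0}
rv_map = {'H': 1, 'T': 0}

tosses = [random.choice('HT') for _ in range(100_000)]  # simulate fair-coin tosses
values = [rv_map[t] for t in tosses]                    # apply the random-variable map

# The sample mean approaches the expected value E[x] = 0.5 for a fair coin
print(sum(values) / len(values))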
Probability Mass / Density Function
For a discrete RV, the probability mass function assigns a probability to every outcome in the sample space.
o Sample space of the RV x for a coin toss experiment: {0, 1}
o P(x = 0) = 0.5; P(x = 1) = 0.5
For a continuous RV, the probability density function f(x) can be used to assign a probability to every interval on the real line.
A continuous RV x can take any value in (−∞, ∞); the probability of an interval [a, b] is P(a ≤ x ≤ b), the area under the curve of f(x) between a and b.
Moments of a PDF
Similar to describing a function using its derivatives, a pdf can be described by its moments.
o For continuous distributions: E[x^k] = ∫ x^k f(x) dx, integrated from −∞ to ∞
o For discrete distributions: E[x^k] = Σ (i = 1 to N) x_i^k p(x_i)
o Mean: µ = E[x]
o Variance: σ² = E[(x − µ)²] = E[x²] − µ²
o Standard deviation = square root of the variance = σ
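A minimal sketch in Python (illustrative) applying the discrete moment formula to a fair die:

# Moments of a fair die via E[x^k] = sum_i x_i^k p(x_i)
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6

mean = sum(x * p for x in outcomes)              # mu = E[x] = 3.5
second_moment = sum(x**2 * p for x in outcomes)  # E[x^2] = 91/6 ~ 15.17
variance = second_moment - mean**2               # sigma^2 = E[x^2] - mu^2 ~ 2.92
std_dev = variance ** 0.5                        # sigma ~ 1.71

print(mean, variance, std_dev)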
01.12.2022
Sample Statistics
From the population of interest, samples are drawn.
Typically we obtain only a few samples out of the total available population.
From this finite sample we have to derive conclusions about the probability density function of the entire population, and also make inferences about the parameters of that distribution.
The sample (observation set) is therefore supposed to be sufficiently representative of the entire sample space.
Example: to find the average height of people in the world, we cannot take the heights of American people alone, because they are known to be much taller than, say, Asian people.
While taking samples, sample from Europe, Asia and so on, so that the result is representative of the entire population of the world.
Such proper sampling procedures are dealt with in the design of experiments.
Basic concepts
Population – Set of all possible outcomes of a random experiment characterized
by f(x)
Sample set (realization) – Finite set of observations obtained through an
experiment
Inference – conclusion derived regarding the population (pdf, parameters) from
the sample set
o An inference made from a sample set is also uncertain, since it depends on the sample set, which is only one of many possible realizations; therefore we also provide the confidence interval associated with the estimates that are derived
Statistical analysis
o Descriptive statistics (Analysis)
Graphical – organizing and presenting the data (ex: Box plots,
probability plots)
Numerical – summarizing the sample set (ex: mean, mode, range,
variance, moments)
o Inferential
Estimation – estimate parameters of the pdf along with its
confidence region
Hypothesis testing – making judgments about f(x) and its
parameters
Measures of central tendency
o Represent sample set by a single value
Mean (or average): x̄ = (1/N) Σ (i = 1 to N) x_i
MEDIAN
Represent sample set by a single value
o Median – value of xi such that 50% of the values are less than xi and 50%
of observations are greater than xi
Robust with respect to outliers in data
Best estimate in the least absolute deviation sense
Ex: Sample Heights of 20 Cherry Trees
[55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Median = 69 (population mean used to generate random sample
was 70)
Median = 69 (after a bias of 50 was added to first sample value)
Another measure of central tendency is the median.
The median is a value such that 50 percent of the data points lie below it and 50 percent of the experimental observations are greater than it.
Order all the observations from smallest to largest and then find the middle value.
Because there are an even number of points, take the average of the 10th point (67) and the 11th point (71) and call that the median: (67 + 71) / 2 = 138 / 2 = 69.
If there are an odd number of points, take the middle point just as it is.
Add a bias of 50 to the first data point, reorder the data, and find the median again; the median has not changed.
So the presence of an outlier has not affected the median.
MODE
Represent sample set by a single value
o Mode – the value that occurs most often (the most probable value)
o Ex: sample heights of 20 cherry trees
o [55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
o Mode = 67 (3 occurrences)
The mode is another measure of central tendency: the value that occurs most often, also called the most probable value.
Sometimes a distribution may have two modes – a bimodal distribution – in which case sampling from such a distribution gives two clusters, one around each of the two modes.
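A minimal sketch in Python (illustrative) computing all three measures of central tendency for this sample and checking the median's robustness to an outlier:

import statistics

# The sample in the order it was collected (the sorted form appears above)
heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

print(statistics.mean(heights))    # 69.25
print(statistics.median(heights))  # 69.0 -> (67 + 71) / 2
print(statistics.mode(heights))    # 67   -> occurs 3 times

biased = heights.copy()
biased[0] += 50                    # add a bias of 50 to the first data point
print(statistics.mean(biased))     # 71.75 -> the mean shifts by 2.5
print(statistics.median(biased))   # 69.0  -> the median is unchanged (robust)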
Measures of spread
Represents the spread of the sample set: sample variance s² = (1/(N−1)) Σ (x_i − x̄)², standard deviation s = √s², mean absolute deviation from the median, and range = maximum − minimum
https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php
https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php
Example: refer to the measures-of-spread figure in the original notes (values highlighted in red).
A single outlier can inflate the standard deviation and the variance so badly that they can no longer be trusted as good estimates of the population standard deviation or variance.
The mean absolute deviation from the median is even better in terms of robustness with respect to outliers.
The range of the data is obtained from the maximum and minimum values.
Even when the 20 data points themselves are not given, the mean and standard deviation alone reveal the main properties of the sample (the power of sample statistics).
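A minimal sketch in Python (illustrative) of these measures of spread, including the outlier-robust mean absolute deviation from the median:

import statistics

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

var = statistics.variance(heights)   # sample variance (divides by n - 1)
std = statistics.stdev(heights)      # sample standard deviation
med = statistics.median(heights)
mad = sum(abs(x - med) for x in heights) / len(heights)  # mean abs. deviation from median
rng = max(heights) - min(heights)    # range

print(var, std, mad, rng)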
03.12.2022
Graphical Analysis – Histograms, Box Plot, Probability Plot, Scatter Plot
Histograms
o Divide the range of values in the sample set into small intervals and count how many observations fall within each interval
o For each interval plot a rectangle with width equal to the interval size and height equal to the number of observations in the interval
o Example – Sample of 20 heights of black cherry trees
[73 75 55 60 66 71 81 67 83 75 82 71 63 55 72 78 67 65 67 59]
Given a sample set, first divide its range into small intervals, and count how many observations fall within each interval.
Plot the interval on the x-axis and the number of data points falling in that interval on the y-axis, as in the sketch below.
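A minimal sketch in Python using matplotlib (the 5-unit bin width is an illustrative choice, not from the notes):

import matplotlib.pyplot as plt

heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

# Bin edges at 55, 60, ..., 85 give 5-unit intervals covering the data
plt.hist(heights, bins=range(55, 90, 5), edgecolor='black')
plt.xlabel('Height')
plt.ylabel('Number of observations in interval')
plt.show()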
Box Plot
Another kind of plot is the box plot, often used for visualizing stock prices.
Compute the quartiles Q1, Q2 and Q3 and the minimum and maximum values of the data.
What are quartiles?
Quartiles are basically an extension of the idea of the median.
Q2 is exactly the median, which means half the points fall below the value of Q2 and half the points lie above it.
Similarly, Q1 represents the 25 percent value: 25 percent of the observations fall below Q1 and 75 percent above it. Q3 implies that 75 percent of the data points fall below Q3 and 25 percent above it.
Once you have these values – the median, the quartiles and the minimum and maximum – you can plot the box-and-whisker plot; a sketch follows below.
The lines extending to the lowest observation and the highest observation are called the whiskers.
This gives a little more information about the spread of the data.
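A minimal sketch in Python (illustrative) computing the quartiles and drawing the box-and-whisker plot, with whiskers at the minimum and maximum as described above:

import matplotlib.pyplot as plt
import numpy as np

heights = [73, 75, 55, 60, 66, 71, 81, 67, 83, 75,
           82, 71, 63, 55, 72, 78, 67, 65, 67, 59]

q1, q2, q3 = np.percentile(heights, [25, 50, 75])
print(q1, q2, q3)                    # the three quartiles (Q2 is the median)

plt.boxplot(heights, whis=(0, 100))  # whiskers at the 0th and 100th percentiles (min, max)
plt.ylabel('Height')
plt.show()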
Probability Plot
The third kind of plot, which is very useful for learning about the distribution of the data, is called the probability plot (the P-P plot or the Q-Q plot).
Standardization means removing the mean and dividing by the standard deviation.
Sort the 20 standardized values from lowest to highest.
https://mathcracker.com/normal-probability-plot-maker#results
55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83
Compute the z-scores z_i, for i = 1, 2, ..., 20, as follows.
The theoretical frequencies f_i are approximated using the formula: f_i = (i − 0.375) / (n + 0.25)
where i is the position in the ordered dataset, and z_i is the corresponding associated z-score, computed as z_i = Φ⁻¹(f_i).
The normal probability plot is obtained by plotting the X-values (sample data) on the horizontal axis and the corresponding z_i values on the vertical axis.
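A minimal sketch in Python (illustrative; scipy's norm.ppf plays the role of Φ⁻¹) reproducing this construction:

import matplotlib.pyplot as plt
from scipy.stats import norm

x = sorted([55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
            71, 71, 72, 73, 75, 75, 78, 81, 82, 83])
n = len(x)

f = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]  # theoretical frequencies f_i
z = [norm.ppf(fi) for fi in f]                           # z_i = inverse normal CDF of f_i

plt.plot(x, z, 'o')              # roughly a straight line if the data are normal
plt.xlabel('Sample values')
plt.ylabel('Theoretical z-scores')
plt.show()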
The resulting normality plot is approximately a straight line, which is what one expects when the data come from a normal distribution.
Scatter plot
The scatter plot plots one random variable against another.
If there are two random variables, say y and x, and we want to know whether there is any relationship between them, then one way of visually verifying this dependency or interdependency is to plot y versus x.
Example: data corresponding to students who have spent time preparing for a quiz and have obtained marks in that quiz (20 data points are listed below).
If more time was spent studying, students might be expected to score more marks.
X-axis: time spent; Y-axis: marks obtained.
If the random variables are dependent, an alignment of the data can be seen.
If there is no dependency, the data spread randomly and there is no clear pattern.
This plot is helpful for assessing the dependency between two variables before proceeding to further analysis; a sketch follows after the data listing below.
http://www.alcula.com/calculators/statistics/scatter-plot/
Data
1. 3, 100
2. 3, 100
3. 2, 75
4. 1, 50
5. 1, 45
6. 3, 100
7. 3, 100
8. 2, 75
9. 1, 50
10. 1, 45
11. 3, 100
12. 3, 100
13. 2, 75
14. 1, 50
15. 1, 45
16. 3, 100
17. 3, 100
18. 2, 75
19. 1, 50
20. 1, 45
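A minimal sketch in Python (illustrative) drawing the scatter plot for the data listed above:

import matplotlib.pyplot as plt

# The 5 (time, marks) pairs above repeat 4 times across the 20 rows
time_spent = [3, 3, 2, 1, 1] * 4
marks      = [100, 100, 75, 50, 45] * 4

plt.scatter(time_spent, marks)   # an upward alignment suggests dependency
plt.xlabel('Time spent')
plt.ylabel('Marks obtained')
plt.show()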
08.12.2022
Hypothesis testing
The basics of hypothesis testing, an important activity when making decisions from a set of data.
A hypothesis is a statement or postulate about the parameters of a distribution (or model).
The hypothesis is generally converted to a test of the mean or variance parameter of a population (or of differences in the means or variances of populations).
o Null hypothesis H0 – the default or status-quo postulate that we wish to reject if the sample set provides sufficient evidence
o Alternative hypothesis H1 – the alternative postulate that is accepted if the null hypothesis is rejected
Errors in hypothesis testing
Two types of errors (Type I and Type II)
Typically the Type I error probability α (also called the level of significance of the test) is controlled by choosing the rejection criterion from the distribution of the test statistic under the null hypothesis.
Type I error (false alarm) – rejecting H0 when it is actually true.
Type II error – failing to reject H0 when it is false; its probability is denoted by β.
The probability of correctly rejecting a false null hypothesis, 1 − β, is known as the power of the statistical test.
Summary of useful hypothesis tests
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/#Hypothesis
What is a Hypothesis?
A hypothesis is an educated guess about something in the world around you. It should
be testable, either by experiment or observation. For example:
A new medicine you think might work.
A way of teaching you think might be better.
Have design criteria (for engineering or programming projects).
Hypothesis Testing
Example #1: Basic Example
A researcher thinks that if knee surgery patients go to physical therapy twice a week
(instead of 3 times), their recovery period will be longer.
Average recovery time for knee surgery patients is 8.2 weeks.
The hypothesis statement in this question is that the researcher believes the
average recovery time is more than 8.2 weeks.
It can be written in mathematical terms as: H1: μ > 8.2
Next, state the null hypothesis.
That’s what will happen if the researcher is wrong.
In the above example, if the researcher is wrong then the recovery time is less than
or equal to 8.2 weeks.
In math, that’s: H0: μ ≤ 8.2
The claim is that the students have above-average IQ scores, so: H1: μ > 100.
The fact that we are looking for scores “greater than” a certain point means that this is a one-tailed test.
Step 7: If the test statistic from Step 6 is greater than the critical value from Step 5, reject the null hypothesis. If it’s less than Step 5, you cannot reject the null hypothesis. In this case it is greater (4.56 > 1.645), so you can reject the null.
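The intermediate Steps 1–6 are not reproduced in these notes. Below is a minimal sketch of the underlying one-tailed z-test; the sample values (n = 30, x̄ = 112.5, σ = 15) are assumptions chosen only to reproduce the quoted statistic of 4.56, not values given in the notes:

from math import sqrt
from scipy.stats import norm

mu0 = 100                                 # hypothesized population mean (H0)
x_bar, sigma, n = 112.5, 15, 30           # ASSUMED sample values, for illustration only
alpha = 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))     # test statistic ~ 4.56
z_crit = norm.ppf(1 - alpha)              # one-tailed critical value ~ 1.645

print(z, z_crit, z > z_crit)              # True -> reject H0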
https://youtu.be/N5Wdfd3exmc
11.12.2022
https://youtu.be/cL5ie-669rc
Confidence Interval
3. 95% of the sample means for a specified sample size will lie within 1.96 standard deviations of the hypothesized population mean.
4. For the 99% confidence interval, 99% of the sample means for a specified sample size will lie within 2.58 standard deviations of the hypothesized population mean.
The confidence interval is X̅ ± Z · s / √n, where
X̅ = sample mean
Z = number of standard deviations from the sample mean
s = standard deviation of the sample
n = size of the sample
Example:
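The worked example itself is not reproduced in these notes; below is a minimal sketch of the formula above, reusing the cherry-tree heights as illustrative data:

from math import sqrt
import statistics

heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

x_bar = statistics.mean(heights)
s = statistics.stdev(heights)
n = len(heights)
z = 1.96                                  # 95% confidence level

half_width = z * s / sqrt(n)              # Z * s / sqrt(n)
print(x_bar - half_width, x_bar + half_width)   # the 95% confidence interval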
https://www.samlau.me/test-textbook/ch/18/hyp_phacking.html
P-hacking
A p-value, or probability value, is the chance, based on the model in the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative.
If a p-value is small, that means the tail beyond the observed statistic is small
and so the observed statistic is far away from what the null predicts.
This implies that the data support the alternative hypothesis better than they
support the null.
By convention, when the p-value is below 0.05, the result is called statistically
significant, and the null hypothesis is rejected.
There are dangers that present themselves when the p-value is misused.
P-hacking is the act of misusing data analysis to show that patterns in data are
statistically significant, when in reality they are not.
This is often done by performing multiple tests on data and only focusing on the
tests that return results that are significant.
import pandas as pd

data = pd.read_csv('raw_anonymized_data.csv')
# Do some EDA on the data so that categorical values get changed to 1s and 0s
data.replace({'Yes': 1, 'No': 0, 'Innie': 1, 'Outie': 0}, inplace=True)
Example:
A simple example of this would be the case of rolling a pair of dice and getting two 6s.
If we take the null hypothesis to be that the dice are fair and not weighted, and take the test statistic to be the sum of the dice, then the p-value of this outcome is 1/36, or about 0.028, which gives a statistically significant result that the dice are weighted.
But obviously a single roll is not nearly enough rolls to provide good evidence for whether the dice are fair or not, and this shows that blindly applying the p-value without properly designing a good experiment can result in bad conclusions.
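A minimal sketch in Python (illustrative, not from the notes): the exact p-value of the double-six outcome, plus a simulation showing that about 1 in 36 honest experiments with fair dice would look “significant” at this threshold:

import random

p_value = 1 / 36                          # P(sum == 12 | fair dice) ~ 0.028 < 0.05

trials = 100_000
significant = sum(
    1 for _ in range(trials)
    if random.randint(1, 6) + random.randint(1, 6) == 12  # a "significant" single roll
)
print(p_value, significant / trials)      # fair dice look "weighted" ~2.8% of the time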
Bayesian Inference
https://towardsdatascience.com/what-is-bayesian-inference-4eda9f9e20a6
Illustration of how our prior knowledge affects our posterior knowledge
Bayes’ theorem
We have two sets of outcomes A and B (also called events), and we denote the probabilities of the events by P(A) and P(B) respectively.
The probability of both events occurring is denoted by the joint probability P(A, B), which we can expand with conditional probabilities:
P(A, B) = P(A|B) P(B) (1)
i.e., the conditional probability of A given B times the probability of B gives the joint probability of A and B. It likewise follows that
P(A, B) = P(B|A) P(A) (2)
P (A, B) = P (B|A) P (A) (2)
Since the left-hand sides of (1) and (2) are the same, the right-hand sides must be equal:
P(A|B) P(B) = P(B|A) P(A)
Rearranging gives
P(A|B) = P(B|A) P(A) / P(B)
This is Bayes’ theorem.
The evidence (the denominator above) ensures that the posterior distribution on the left-hand side is a valid probability density, and is called the normalizing constant.
In words, the theorem states: Posterior ∝ Likelihood × Prior, where ∝ means “proportional to”.
The running example (from the linked article): a coin is flipped n = 11 times and lands heads k = 8 times; this observation set is our data D. We must bet on whether the next 2 flips will both be heads. If the probability of seeing 2 heads in a row is less than 0.5, we will bet against it, but if it’s above 0.5, then we bet for.
Frequentist approach
As the frequentist, we maximize the likelihood, which is to ask the question: what value of θ will maximize the probability that we got D given θ? More formally, we want to find
θ_MLE = argmax over θ of P(D | θ) (3)
Note that (3) expresses the likelihood of θ given D, which is not the same as saying the probability of θ given D.
The image in the original article shows the likelihood function P(D | θ) (as a function of θ) and the maximum likelihood estimate.
The value of θ that maximizes the likelihood is k/n, i.e., the proportion of successes in the trials.
The maximum likelihood estimate is therefore k/n = 8/11 ≈ 0.73.
Assuming the coin flips are independent, the probability of seeing 2 heads in a row is θ_MLE² = (8/11)² ≈ 0.53.
Since the probability of seeing 2 heads in a row is larger than 0.5, we would bet for!
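A minimal sketch in Python (illustrative) of the frequentist calculation:

k, n = 8, 11                   # observed heads and total flips

theta_mle = k / n              # maximum likelihood estimate ~ 0.727
p_two_heads = theta_mle ** 2   # P(2 heads in a row) ~ 0.529 > 0.5 -> bet for
print(theta_mle, p_two_heads)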
Bayesian approach
As the Bayesian, we maximize the posterior, which is to ask the question: what value of θ will maximize the probability of θ given D? That is, we want to find
θ_MAP = argmax over θ of P(θ | D)
which is called maximum a posteriori (MAP) estimation.
To answer the question, we use Bayes’ theorem. Since the evidence P(D) is a normalizing constant that does not depend on θ, we can ignore it. This now gives
P(θ | D) ∝ P(D | θ) P(θ)
The likelihood of k heads in n independent flips is P(D | θ) ∝ θ^k (1 − θ)^(n−k), and for the prior we choose a Beta(α, β) distribution. This gives
P(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)
where Γ is the Gamma function. Since the fraction is not dependent on θ, we can ignore it, which gives
P(θ | D) ∝ θ^(k+α−1) (1 − θ)^(n−k+β−1)
We set the prior distribution in such a way that we incorporate what we know about θ prior to seeing the data.
Now, we know that coins are usually pretty fair, and if we choose α = β = 2, we get a beta distribution that favors θ = 0.5 more than θ = 0 or θ = 1.
The illustration in the original article shows this prior Beta(2, 2), the normalized likelihood, and the resulting posterior distribution.
Illustration (from the original article) of the prior P(θ), likelihood P(D | θ), and posterior distribution P(θ | D), with a vertical line at the maximum a posteriori estimate.
The posterior distribution ends up being dragged a little more towards the prior distribution, which makes the MAP estimate a little different from the MLE estimate. The mode of the Beta(k + α, n − k + β) posterior gives
θ_MAP = (k + α − 1) / (n + α + β − 2) = 9/13 ≈ 0.69
which is a little lower than the MLE estimate; and if we now use the MAP estimate to calculate the probability of seeing 2 heads in a row, θ_MAP² ≈ 0.48 < 0.5, we find that we will bet against it.
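A minimal sketch in Python (illustrative) of the Bayesian calculation, for comparison with the MLE sketch above:

k, n = 8, 11
alpha, beta = 2, 2                                    # prior Beta(2, 2) favoring fairness

# Posterior is Beta(k + alpha, n - k + beta) = Beta(10, 5); its mode is the MAP estimate
theta_map = (k + alpha - 1) / (n + alpha + beta - 2)  # 9/13 ~ 0.692
p_two_heads = theta_map ** 2                          # ~ 0.479 < 0.5 -> bet against
print(theta_map, p_two_heads)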