MLS 2 - Statistics For Data Science
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Topics covered so far
1. Statistical Inference
a. Distributions - Binomial, Uniform, Normal
b. Sampling
c. Central Limit Theorem
d. Confidence Intervals
2. Hypothesis Testing
a. Hypothesis Formulation
b. One-Tailed Test vs Two-Tailed Test
c. Type I and Type II Errors
Gauge your Understanding
1. What is a random variable and how is it related to a probability distribution?
2. What are some of the most commonly used distributions?
3. What is the Central Limit Theorem (CLT) and when is it used?
4. What is meant by estimation?
What is a Random Variable?
A random variable is a function that assigns a numerical value to each outcome of an experiment. It
takes different values with different probabilities. It is usually denoted by a capital letter such as X, and the
probability that X takes a particular value x is denoted by P(X=x).
Example: Suppose that a fair coin is tossed twice, so the possible outcomes are {HH, HT, TH, TT}. Let X be
the random variable representing the number of heads that come up. Then X can take values from the
set {0, 1, 2}.
Only one of the four equally likely outcomes (HH) has two heads, so P(X=2) = ¼.
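The coin example above can be checked by listing every outcome and counting heads. The sketch below is only an illustration: the names `outcomes`, `number_of_heads`, and `pmf` are chosen here and do not come from the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing a fair coin twice: HH, HT, TH, TT.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=2)]


def number_of_heads(outcome):
    """The random variable X: maps an outcome to its number of heads."""
    return outcome.count("H")


# Count how many outcomes map to each value of X, then divide by the
# total number of outcomes to get P(X = x).
counts = Counter(number_of_heads(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X={x}) = {p}")
```

The resulting table of values and probabilities is the probability distribution of X.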
What is a Probability Distribution?
The probability distribution of a random variable describes the values that the random variable can
take along with the probabilities of those values.
A discrete random variable has a discrete probability distribution, which is described by a probability mass function.
Distributions around us (commonly occurring)
Normal: the IQ distribution of all seven-year-old children in New York
Binomial Distribution
The binomial distribution is the probability distribution of the number of successes of an experiment that is
conducted multiple times and has only two possible outcomes.
Example: Suppose you have purchased 10 lottery tickets and each ticket either wins or does not win. The binomial
distribution can then answer a question such as: what is the probability that exactly 6 of the 10 tickets win?
Assumptions:
1. There are only two possible outcomes (success or failure) for each trial.
2. The number of trials is fixed.
3. The outcome of each trial is independent. In other words, no trial affects the probability of the outcome
of any other trial.
4. The probability of success is exactly the same for each trial.
Note: A binomial distribution with a single trial (number of trials equal to 1) is called a Bernoulli
distribution.
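To make the lottery example concrete, here is a minimal sketch of the binomial probability mass function, P(X=k) = C(n, k) p^k (1-p)^(n-k). The slide does not give a winning probability, so `p_win = 0.1` below is an assumed value for illustration only, and `binomial_pmf` is a name chosen for this sketch.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with the same success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_tickets = 10
p_win = 0.1  # assumed for illustration; not given in the slides

# Probability that exactly 6 of the 10 tickets win.
p_six_wins = binomial_pmf(6, n_tickets, p_win)
print(f"P(X=6) = {p_six_wins:.6f}")

# With a single trial the binomial reduces to the Bernoulli distribution.
p_bernoulli_success = binomial_pmf(1, 1, p_win)
```

The probabilities for k = 0 to 10 sum to 1, which is a quick sanity check on any probability mass function.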
Uniform Distribution
Normal Distribution
The normal distribution is a continuous probability distribution that is symmetric about the mean. It is also known
as the bell curve because the graph of its probability density function is shaped like a bell.
Properties:
Empirical Rule
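The empirical rule says that for a normal distribution roughly 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean. The sketch below checks those figures with Python's standard-library `statistics.NormalDist`; the helper name `prob_within` is chosen here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that a normal value lies within k standard
    deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


coverage = {k: prob_within(k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"Within {k} standard deviation(s): {prob:.4f}")
```

Because any normal distribution can be standardized, the same proportions hold for every mean and standard deviation.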
Sampling Distributions
What is the need for sampling?
Given limited resources and time, it is not always possible to study the entire population. That is why we select a sample from the
population and use it to make inferences about the population.
Example: Suppose a new drug is manufactured and needs to be tested for adverse side effects on a country’s population. It is
almost impossible to conduct a research study that involves everyone.
A sampling distribution is the distribution of a particular sample statistic obtained from all possible samples of a given size drawn from a specific population.
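As a small illustration of this definition, the sketch below lists every possible sample of size 2 from a tiny population and tabulates the sample means. The population values are made up for the example, and sampling is done without replacement.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8]  # hypothetical population, for illustration only
sample_size = 2

# Every possible sample of the given size (without replacement),
# and the mean of each one.
all_samples = list(combinations(population, sample_size))
sample_means = [mean(sample) for sample in all_samples]

# The sampling distribution of the mean: each possible value of the
# sample mean together with how often it occurs.
sampling_distribution = Counter(sample_means)

for value, count in sorted(sampling_distribution.items()):
    print(f"sample mean {value}: {count} of {len(all_samples)} samples")

print("mean of sample means:", mean(sample_means))
print("population mean:", mean(population))
```

In this example the mean of all the sample means equals the population mean, which is why the sample mean is a reasonable estimator of the population mean.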
Central Limit Theorem
Assumptions
1. Data must be randomly sampled.
2. Sample values must be independent of each other.
3. Samples should come from the same distribution.
4. Sample size must be sufficiently large (a common rule of thumb is n ≥ 30).
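When these assumptions hold, the Central Limit Theorem says that the distribution of the sample mean is approximately normal, centered at the population mean, with standard deviation σ/√n. The simulation below is a rough sketch under assumed settings: the exponential population, the seed, and the number of repetitions are choices made for this illustration, not part of the course material.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30    # the rule-of-thumb sample size
NUM_SAMPLES = 2000  # number of repeated samples (assumed)


def draw_sample_mean(n):
    """Mean of n independent draws from an exponential distribution
    with rate 1 (population mean 1, standard deviation 1, right-skewed)."""
    return mean([random.expovariate(1.0) for _ in range(n)])


sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
expected_sd = 1.0 / sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.0)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory {expected_sd:.3f})")
```

Even though the population is strongly skewed, the sample means cluster symmetrically around the population mean, and their spread shrinks as the sample size grows.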
Case Study
Inferential Statistics
Gauge Your Understanding
1. What is hypothesis testing and what are the different types of hypotheses?
2. What are some of the key terms involved in hypothesis testing?
3. What is the difference between one-tailed and two-tailed tests?
4. What are the steps to perform a hypothesis test?
Introduction to Hypothesis Testing
Key terms in Hypothesis Testing
Level of Significance
● The significance level (denoted by α) is the probability of rejecting the null hypothesis when it is true.
● It is a measure of the strength of the evidence that must be present in the sample data to reject the null hypothesis.
Acceptance or Rejection Region
● The total area under the distribution curve of the test statistic is partitioned into an acceptance region and a rejection region.
● Reject the null hypothesis when the test statistic lies in the rejection region; otherwise, fail to reject it.
Types of Error
● There are two types of errors: Type I and Type II.
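To show how the significance level fixes the rejection region, the sketch below computes critical values for a test statistic that is assumed to follow the standard normal distribution under the null hypothesis (a z-test). The value α = 0.05 and the helper names are choices made for this illustration.

```python
from statistics import NormalDist

alpha = 0.05  # assumed significance level for illustration
std_normal = NormalDist()  # assumes a z-test statistic

# Right-tailed test: the whole rejection area alpha sits in the upper tail.
right_tail_critical = std_normal.inv_cdf(1 - alpha)

# Two-tailed test: alpha is split equally between the two tails.
two_tail_critical = std_normal.inv_cdf(1 - alpha / 2)


def reject_right_tailed(z):
    """True when z falls in the right-tailed rejection region."""
    return z > right_tail_critical


def reject_two_tailed(z):
    """True when z falls in either tail of the two-tailed rejection region."""
    return abs(z) > two_tail_critical


print(f"right-tailed critical value: {right_tail_critical:.3f}")
print(f"two-tailed critical values: ±{two_tail_critical:.3f}")
```

A test statistic of 1.8, for example, would fall in the rejection region of the right-tailed test but not of the two-tailed test at the same α.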
Type I and Type II errors
Level of significance = α
Confidence level = (1 - α)
Power of the test = (1 - β)

Decision             H0 True             H0 False
Reject H0            Type I Error (α)    Correct decision
Fail to reject H0    Correct decision    Type II Error (β)
Let’s go through an example
Problem Statement: The store manager believes that the average waiting time for customers
at checkouts has risen above 15 minutes. Formulate the null and the alternate
hypotheses.
Null Hypothesis (H0): The average waiting time at checkouts is less than or equal to 15 minutes.
Alternate Hypothesis (Ha): The average waiting time at checkouts is more than 15 minutes.
Type I error (False Positive): Reject the null hypothesis when it is in fact true. “The
average waiting time at checkouts is actually less than or equal to 15 minutes, but the store manager has
concluded that it is more than 15 minutes”.
Type II error (False Negative): Fail to reject the null hypothesis when it is in fact false. “The
average waiting time at checkouts is actually more than 15 minutes, but the store manager has
concluded that it is less than or equal to 15 minutes”.
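One way the manager could test these hypotheses is a right-tailed one-sample z-test. The sketch below is only an illustration: the sample mean, sample size, and population standard deviation are invented numbers, the standard deviation is assumed to be known, and `right_tailed_z_test` is a name chosen here.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Test H0: mu <= mu0 against Ha: mu > mu0, assuming the population
    standard deviation sigma is known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # p-value: probability of a z at least this large when mu = mu0.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


# Hypothetical data: 50 customers with an average wait of 16.2 minutes,
# and an assumed known standard deviation of 4 minutes.
z, p_value, reject_h0 = right_tailed_z_test(
    sample_mean=16.2, mu0=15, sigma=4, n=50
)

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With these made-up numbers the null hypothesis is rejected at α = 0.05. If the true average were in fact 15 minutes or less, that decision would be a Type I error.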
One-tailed vs Two-tailed Test
Hypothesis Testing Steps
Formulate H0 and Ha
Draw Conclusion
Case Study
Hypothesis Testing
Happy Learning!