Chapter 5

Brief Overview of Probability


SAMPLING DISTRIBUTIONS
• An important application of statistics in machine learning is
how to draw a conclusion about a set or population based on
the probability model of random samples of the set.
• For example, based on the malignancy sample test results of
some random tumour cases we want to estimate the
proportion of all tumours which are malignant and thus advise
the doctors on the requirement or non-requirement of biopsy
on each tumour case.
• Different random samples may give different estimates.
• If we can get some knowledge about the variability of all
possible estimates derived from the random samples, then we
should be able to arrive at reasonable conclusions.
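The variability of estimates across random samples can be seen in a small simulation. This is only a sketch: the population below (30% malignant), the sample size of 50 and the helper name `estimate_proportion` are all made-up assumptions for illustration, not data from the chapter.

```python
import random

def estimate_proportion(population, n, rng):
    """Estimate the proportion of 1s from a simple random sample of size n."""
    sample = rng.sample(population, n)  # drawn without replacement
    return sum(sample) / n

# Hypothetical population: 1 = malignant, 0 = benign (30% malignant by construction).
population = [1] * 300 + [0] * 700
rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw many random samples; each one gives its own estimate.
estimates = [estimate_proportion(population, 50, rng) for _ in range(1000)]
mean_estimate = sum(estimates) / len(estimates)

print("smallest estimate:", min(estimates))
print("largest estimate: ", max(estimates))
print("average estimate: ", mean_estimate)
```

Individual estimates differ from sample to sample, while their average stays close to the true proportion. The spread of these estimates is what a sampling distribution describes.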
Terminologies used in Sampling
• Population: a finite set of objects being investigated.
• Sampling: Many machine learning models cannot be trained
on a very large data set because of the limitation
of the computer memory. To solve this problem we
pick a part of the data set which represents the whole data
set. This process of picking a representative part of the data set
is known as sampling.
• Random sample: a sample of objects drawn from a population
in a way that every member of the population has the same
chance of being chosen.
• Sampling distribution refers to the probability distribution of a
random variable defined in a space of random samples.
Sampling with replacement
• In sampling with replacement, each object chosen
is returned to the population before the next
object is chosen.
• In this case, repetitions are allowed.
• That means, if a sample of size n is drawn from a
population of size N, then the number of such
(ordered) samples is:
• $N \times N \times \cdots \times N = N^n$, because each object can be
repeated.
• The probability of each sample being chosen is: $1/N^n$
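The count $N^n$ can be checked by listing every ordered sample for a tiny population. This is a minimal sketch; the three-element population and the sample size are hypothetical.

```python
from itertools import product

population = ["a", "b", "c"]  # hypothetical population, N = 3
n = 2                         # sample size

# product(..., repeat=n) lists every ordered sample with repetition allowed.
samples = list(product(population, repeat=n))
num_samples = len(samples)
prob_each = 1 / num_samples   # every sample is equally likely

print(num_samples)            # N**n
print(prob_each)              # 1 / N**n
```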
Sampling without replacement
• The object chosen is not returned to
the population before the next object is
chosen, so no object can appear twice in a sample.
• Choose n elements out of N elements: $\binom{N}{n} = \dfrac{N!}{n!\,(N-n)!}$ possible samples.
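If samples drawn without replacement are treated as unordered, counting them amounts to a binomial coefficient. The sketch below compares a brute-force listing with `math.comb`; the four-element population is a hypothetical example.

```python
import math
from itertools import combinations

population = ["a", "b", "c", "d"]  # hypothetical population, N = 4
n = 2                              # sample size

# combinations() lists unordered samples in which no object repeats.
samples = list(combinations(population, n))
num_samples = len(samples)

print(num_samples)                       # number of listed samples
print(math.comb(len(population), n))     # N choose n
```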
Mean and variance of sample with replacement
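A standard result for sampling with replacement is that the sample mean has expected value μ and variance σ²/n, where μ and σ² are the population mean and variance. The sketch below checks this exactly by listing all $N^n$ samples of a small hypothetical population; the numbers are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]  # hypothetical population, N = 4
n = 2                      # sample size
N = len(population)

mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N  # population variance

# All ordered samples with replacement are equally likely.
means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print(mean_of_means == mu)           # E[sample mean] equals mu
print(var_of_means == sigma2 / n)    # Var[sample mean] equals sigma^2 / n
```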
Sampling Without Replacement
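When sampling without replacement from a finite population, a standard result is that the sample mean still has expected value μ, but its variance shrinks by the finite population correction: σ²/n · (N − n)/(N − 1). The sketch below verifies this by enumeration on a small hypothetical population.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8]  # hypothetical population, N = 4
n = 2                      # sample size
N = len(population)

mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N  # population variance

# All samples without replacement are equally likely.
means = [Fraction(sum(s), n) for s in combinations(population, n)]
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

# Variance predicted by the finite population correction.
expected_var = sigma2 / n * Fraction(N - n, N - 1)

print(mean_of_means == mu)
print(var_of_means == expected_var)
```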
HYPOTHESIS TESTING
• A fundamental statistical method to make informed
decisions based on empirical evidence.
• It involves formulating assumptions about population
parameters using sample statistics and rigorously
evaluating these assumptions against collected data.
(For example, a judge assumes a person is innocent and
verifies this by reviewing evidence and hearing
testimony before reaching a verdict.)
• A systematic approach that allows researchers to assess
the validity of a statistical claim about an unknown
population parameter.
• A hypothesis is an assumption that
we make about a population parameter. Hypothesis
testing evaluates two mutually exclusive statements
about a population to determine which statement
is best supported by the sample data.
• To test the validity of the claim or assumption
about the population parameter:
– A sample is drawn from the population and analyzed.
– The results of the analysis are used to decide whether
the claim is supported by the data or not.
Defining Hypotheses

• Null hypothesis (H0): In statistics, the null hypothesis is a general
statement or default position that there is no relationship between
two measured cases or no relationship among groups. In other words,
it is a basic assumption made based on knowledge of the problem.
Example: A company’s mean production is 50 units per day, i.e. H0: μ = 50.
• Alternative hypothesis (H1): The alternative hypothesis is the
hypothesis used in hypothesis testing that is contrary to the null
hypothesis.
Example: A company’s mean production is not equal to 50 units per day, i.e.
H1: μ ≠ 50.
• The null hypothesis and alternative hypothesis are complementary
statistical hypotheses that are used to test a claim or statement about
a population.
Null vs. Alternative Hypothesis
Key Terms of Hypothesis Testing

• Level of significance: the probability of rejecting
the null hypothesis when it is actually true.
100% certainty is not possible when deciding
on a hypothesis, so we select a level of
significance, which is usually 5%.
• This is normally denoted with α and generally,
it is 0.05 or 5%, which means we accept a 5% risk
of wrongly rejecting a true null hypothesis
(equivalently, a 95% confidence level).
Key Terms of Hypothesis Testing
• P-value: the calculated probability of obtaining
results at least as extreme as those observed,
assuming the null hypothesis (H0) of the
problem is true. If your P-value is
less than the chosen significance level then
you reject the null hypothesis, i.e. the
sample supports the alternative
hypothesis.
Key Terms of Hypothesis Testing
• Test Statistic: A numerical value calculated from sample data
during a hypothesis test, used to determine whether to reject
the null hypothesis. It is compared to a critical value or p-value
to make decisions about the statistical significance of the
observed results.
– Z-test: Used when the population standard deviation is known
(or the sample is large).
– t-test: If the population standard deviation is unknown and the sample
size is small, then the t-statistic is more appropriate.
– Chi-square test: Chi-square test is used for categorical data or for
testing independence in contingency tables
– F-test: F-test is often used in analysis of variance (ANOVA) to compare
variances or test the equality of means across multiple groups.
Key Terms of Hypothesis Testing
Calculating test statistic

• Z-statistic: used when the population standard
deviation σ is known.
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• t-statistic: used when σ is unknown and n < 30.
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
x̄ = sample mean, μ = population mean, σ = population standard
deviation, s = standard deviation of the sample, n = sample size
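The two formulas translate directly into code. This is a minimal sketch: the function names `z_statistic` and `t_statistic`, the daily production figures and the assumed σ = 3 are hypothetical choices for illustration.

```python
import math
import statistics

def z_statistic(sample, mu0, sigma):
    """z = (mean - mu0) / (sigma / sqrt(n)), for a known population sigma."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))

def t_statistic(sample, mu0):
    """t = (mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    s = statistics.stdev(sample)  # divides by n - 1
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(n))

# Hypothetical daily production figures, tested against mu0 = 50.
data = [52, 48, 51, 53, 49, 50, 54, 47, 52]
z = z_statistic(data, mu0=50, sigma=3)  # sigma = 3 is an assumed known value
t = t_statistic(data, mu0=50)

print(z, t)
```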
Key Terms of Hypothesis Testing
• Critical value: The critical value in statistics is a
threshold or cutoff point used to determine
whether to reject the null hypothesis in a
hypothesis test.
Key Terms of Hypothesis Testing
• Degrees of freedom: The variability or freedom one
has in estimating a parameter. The degrees of freedom
are related to the sample size and determine the shape
of the test distribution (for example, the t-distribution).
• Degrees of freedom are the maximum number of
logically independent values, which may vary in a data
sample. Degrees of freedom are calculated by
subtracting one from the number of items within the
data sample.
• $df = N - 1$, where N is the number of items in the data sample


Comparing Test Statistic

• There are two ways to decide whether we should
reject or fail to reject the null hypothesis.
• Method A: (using Critical Value)
– If Test Statistic>Critical Value: Reject the null hypothesis.
– If Test Statistic≤Critical Value: Fail to reject the null
hypothesis.
(For a two-tailed test, compare the absolute value of the
test statistic with the critical value.)
(Critical values are predetermined threshold values that
are used to make a decision in hypothesis testing. To
determine critical value for hypothesis testing, we
typically refer to a statistical distribution table)
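For a z-test, the critical value can also be computed instead of read from a table. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names `z_critical` and `reject_h0` are hypothetical, and the right-tailed branch is just one of the possible one-tailed cases.

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Critical z value for significance level alpha."""
    tail = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail)

def reject_h0(z, alpha=0.05, two_tailed=True):
    """Method A: compare the test statistic with the critical value."""
    crit = z_critical(alpha, two_tailed)
    if two_tailed:
        return abs(z) > crit
    return z > crit  # right-tailed test

print(round(z_critical(0.05), 2))          # two-tailed critical value
print(round(z_critical(0.05, False), 3))   # one-tailed critical value
print(reject_h0(2.04), reject_h0(1.5))
```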
Comparing Test Statistic
• Method B: Using P-values
We can also come to a conclusion using the
p-value,
• If the p-value is less than or equal to the
significance level i.e. (p≤α), you reject the null
hypothesis.
• If the p-value is greater than the significance
level i.e. (p>α), you fail to reject the null
hypothesis.
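For a z-statistic, the p-value follows from the standard normal distribution. A minimal sketch, assuming the test statistic is standard normal under H0; the function name `z_p_value` and its `tail` argument are hypothetical.

```python
from statistics import NormalDist

def z_p_value(z, tail="two"):
    """p-value of a z-statistic for a two-, right- or left-tailed test."""
    cdf = NormalDist().cdf
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    if tail == "right":
        return 1 - cdf(z)
    if tail == "left":
        return cdf(z)
    raise ValueError("tail must be 'two', 'right' or 'left'")

alpha = 0.05
p = z_p_value(2.5)
print(p, "reject H0" if p <= alpha else "fail to reject H0")
```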
One-Tailed and Two-Tailed Test
One-Tailed

• One tailed test focuses on one direction, either greater than or less than a
specified value. We use a one-tailed test when there is a clear directional
expectation based on prior knowledge or theory. The critical region is located
on only one side of the distribution curve. If the sample falls into this critical
region, the null hypothesis is rejected in favor of the alternative hypothesis.
• There are two types of one-tailed test:
• Left-Tailed (Left-Sided) Test: The alternative hypothesis asserts that the true
parameter value is less than the value stated in the null hypothesis. Example: H0: μ ≥ 50 and H1: μ < 50
• Right-Tailed (Right-Sided) Test: The alternative hypothesis asserts that the true
parameter value is greater than the value stated in the null hypothesis. Example: H0: μ ≤ 50 and
H1: μ > 50
One-Tailed and Two-Tailed Test
Two-Tailed Test

• Considers both directions, greater than and
less than a specified value.
• We use a two-tailed test when there is no
specific directional expectation, and want to
detect any significant difference.
• Example: H0: μ = 50 and H1: μ ≠ 50
Error in Hypothesis testing
• Type I error: When we reject the null
hypothesis, although that hypothesis was true.
The probability of a Type I error is denoted by alpha (α).
• Type II error: When we fail to reject the null
hypothesis, but it is false. The probability of a Type II
error is denoted by beta (β).
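The link between α and the Type I error can be illustrated by simulation: when H0 is true, a two-tailed z-test at level α should reject in roughly a fraction α of repeated samples. This is only a sketch; the population (normal, mean 50, σ = 5), sample size and trial count are assumed values.

```python
import math
import random
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n=25, trials=4000, seed=1):
    """Fraction of samples that wrongly reject a true H0: mu = 50."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # H0 is true here: data really come from a mean-50, sigma-5 population.
        sample = [rng.gauss(50, 5) for _ in range(n)]
        z = (sum(sample) / n - 50) / (5 / math.sqrt(n))
        if abs(z) > crit:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # should be close to alpha
```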
Real life Examples of Hypothesis Testing

• Example-1: Does a New Drug Affect Blood Pressure?


Imagine a pharmaceutical company has developed a new
drug that they believe can effectively lower blood pressure
in patients with hypertension. Before bringing the drug to
market, they need to conduct a study to assess its impact
on blood pressure.
Data: Before Treatment:
120, 122, 118, 130, 125, 128, 115, 121, 123, 119
After Treatment:
115, 120, 112, 128, 122, 125, 110, 117, 119, 114
Step 1: Define the Hypothesis
Null Hypothesis (H0): The new drug has no effect
on blood pressure.
Alternate Hypothesis (H1): The new drug has an
effect on blood pressure.
• Step 2: Define the Significance level
Let’s set the significance level at 0.05: we
reject the null hypothesis if there is less than
a 5% chance of observing results this extreme
due to random variation alone.
• Step 3: Compute the test statistic
• Using the paired t-test on the differences (after − before) for the above problem:
m = -3.9, s ≈ 1.37 (s² ≈ 1.88) and n = 10
T-statistic = -9 based on the formula for the paired t-test,
t = m/(s/√n)
Paired T-Test – A Detailed Overview

• The t-test is the statistical method used to determine if
there is a difference between the means of two
samples.
• The t-test is further divided into 3 types based on
your data and the result you need.
– One sample t-test: the mean of a single population is
compared against a known mean.
– Independent sample t-test: the means of two different
populations are compared.
– Paired sample t-test: the means of the same group or
population are compared at separate times.
Paired t-test
• Each object is measured twice, and the two measurements form the pairs of
observations for the paired t-test.
• Used to test whether the mean of the dependent variable is the same in two
related groups.
• For example: measuring the weight of a person before and after
breakfast.
• The hypothesis can be represented as:
– Null Hypothesis, H0: μ1 = μ2 or H0: μ1 – μ2 = 0
– Alternative hypothesis, H1: μ1 ≠ μ2 or H1: μ1 – μ2 ≠ 0
(μ1 is the mean of variable 1, μ2 is the mean of variable 2)
• $t = \dfrac{m}{s / \sqrt{n}}$,
m = mean of the differences $d_i = X_{after,i} - X_{before,i}$,
s = standard deviation of the differences $d_i$,
n = sample size (number of pairs)
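Applied to the blood pressure data from Example-1, the paired t formula can be evaluated as follows. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
import statistics

# Blood pressure readings from Example-1.
before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

diffs = [a - b for a, b in zip(after, before)]  # d_i = after - before
n = len(diffs)
m = statistics.mean(diffs)    # mean of the differences
s = statistics.stdev(diffs)   # sample standard deviation (divides by n - 1)
t = m / (s / math.sqrt(n))    # paired t-statistic

print(m, round(s, 2), round(t, 2), "df =", n - 1)
```

Note that s here is the standard deviation of the differences (about 1.37); the variance s² is about 1.88.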
• Step 4: Find the p-value
The calculated t-statistic is -9 and degrees of
freedom df = 9,
you can find the p-value using statistical
software or a t-distribution table.
Thus, the two-tailed p-value = 8.538051223166285e-06 (about 8.5 × 10⁻⁶).
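Statistical software normally provides this p-value directly. As a sketch of what such software computes, the two-tailed p-value can be approximated by numerically integrating the t-distribution density with Simpson's rule. The function names and the step count are assumptions; a library routine would be preferable in practice.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value 2 * P(T > |t|), using Simpson's rule on [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 2 * (0.5 - area)

p = t_two_tailed_p(-9, df=9)
print(p)
```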
• Step 5: Result
• If the p-value is less than or equal to 0.05, the
researchers reject the null hypothesis.
• If the p-value is greater than 0.05, they fail to reject the
null hypothesis.
• Conclusion: Since the p-value (8.538051223166285e-06)
is less than the significance level (0.05), the researchers
reject the null hypothesis. There is statistically significant
evidence that the average blood pressure before and
after treatment with the new drug is different.
Example-2: Cholesterol level in a population

• Data: A sample of 25 individuals is taken, and
their cholesterol levels are measured.
• Cholesterol Levels (mg/dL):
• 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202,
208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192,
205.
• Hypothesized population mean (μ): 200 mg/dL
• Population Standard Deviation (σ): 5
mg/dL (given for this problem)
• Step 1: Define the Hypothesis
• Null Hypothesis (H0): The average cholesterol
level in a population is 200 mg/dL.
• Alternate Hypothesis (H1): The average
cholesterol level in a population is different
from 200 mg/dL.
• Step 2: Define the Significance level
• We use a significance level of 0.05. As the direction
of deviation is not given, we assume a two-tailed test.
Based on a normal distribution table (z-table), the
critical values for a significance level of 0.05
(two-tailed) are approximately -1.96 and 1.96.
• Step 3: Compute the test statistic
Z-statistic: used when the population standard
deviation is known.
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• The sample mean of the 25 values is x̄ = 5051/25 = 202.04, so the z
formula gives $Z = \dfrac{202.04 - 200}{5 / \sqrt{25}}$
• We get Z ≈ 2.04.
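The calculation can be reproduced from the listed cholesterol values. This is a minimal sketch using the standard library; the hypothesized mean and σ come from the problem statement, and the two-tailed p-value is an extra check alongside the critical-value method.

```python
import math
from statistics import NormalDist, mean

levels = [205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202,
          208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192,
          205]
mu0 = 200    # hypothesized population mean (mg/dL)
sigma = 5    # population standard deviation (given)

n = len(levels)
x_bar = mean(levels)
z = (x_bar - mu0) / (sigma / math.sqrt(n))

critical = NormalDist().inv_cdf(1 - 0.05 / 2)     # two-tailed, alpha = 0.05
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(x_bar, round(z, 2), round(critical, 2), round(p_value, 4))
print("reject H0" if abs(z) > critical else "fail to reject H0")
```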
• Step 4: Result
• Since the absolute value of the test statistic
(2.04) is greater than the critical value (1.96),
we reject the null hypothesis and conclude
that there is statistically significant evidence
that the average cholesterol level in the
population is different from 200 mg/dL.
