
Unit III - Update

Unit III of the Fundamentals of Data Science and Analytics course covers inferential statistics, including concepts such as populations, samples, random sampling, hypothesis testing, and the central limit theorem. Key topics include the definitions of probability, mutually exclusive events, dependent and independent events, and the procedures for conducting z-tests. The unit also discusses confidence intervals, point estimates, and the implications of Type I and Type II errors in statistical analysis.

AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS Unit-1 Mailam Engineering College

UNIT III – INFERENTIAL STATISTICS


Populations – samples – random sampling – Sampling distribution- standard error of the mean - Hypothesis
testing – z-test – z-test procedure –decision rule – calculations – decisions – interpretations - one-tailed and
two-tailed tests – Estimation – point estimate – confidence interval – level of confidence – effect of sample
size.
PART A
1. What is Statistics?
Define Population and its types.
 Population
 Any complete set of observations (or potential observations).
Types of Population
 Real Populations
o A real population is one in which all potential observations
are accessible at the time of sampling.
 Hypothetical Populations
o A hypothetical population is one in which all
potential observations are not accessible at the time
of sampling.

2. Define Sample and Random Sampling.


 Sample
 Any subset of observations from a population.
 The sample size is small relative to the population size.
 Random Sampling
 A selection process that guarantees all potential
observations in the population have an equal chance of being
selected.
 Inferential statistics requires that samples be random.

2. Define the term probability.


 Probability
 The proportion or fraction of times that a particular event is likely to occur.

3. What is meant by Mutually Exclusive Events? State the Addition Rule for Mutually Exclusive Events.
Mutually Exclusive Events
• Events that cannot occur together.
Addition Rule
• Add together the separate probabilities of several mutually exclusive events to find the probability that any one of these events will occur.

Pr(A or B) = Pr(A) + Pr(B)

where Pr( ) refers to the probability of the event in parentheses and A and B are mutually exclusive events.

PREPARED BY: Ms.G.RAMYA,AP / CSBS

4. What is meant by Dependent and Independent Events? State the Multiplication Rule for Independent Events.
Dependent Events
 When the occurrence of one event affects the probability of
the other event, these events are dependent.
 Although the heights of randomly selected pairs of men are
independent, the heights of brothers are dependent.

Independent Events
 The occurrence of one event has no effect on the probability that
the other event will occur.

Multiplication Rule
• Multiply together the separate probabilities of several independent events to find the probability that these events will occur together.

Pr(A and B) = [Pr(A)][Pr(B)]

where A and B are independent events.
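
The addition and multiplication rules can be checked with a short Python sketch. The die-rolling events used here are illustrative assumptions and do not come from the notes.

```python
from fractions import Fraction

# Illustrative events (assumptions, not from the notes):
# outcomes of rolling a fair six-sided die.
pr_one = Fraction(1, 6)   # Pr(roll shows 1)
pr_two = Fraction(1, 6)   # Pr(roll shows 2)

# Addition rule: "1" and "2" cannot occur together on a single roll,
# so Pr(1 or 2) = Pr(1) + Pr(2).
pr_one_or_two = pr_one + pr_two

# Multiplication rule: two separate rolls are independent,
# so Pr(1 on the first roll and 2 on the second) = Pr(1) * Pr(2).
pr_one_then_two = pr_one * pr_two

print(pr_one_or_two)    # 1/3
print(pr_one_then_two)  # 1/36
```

Exact fractions are used instead of floats so that the results match hand calculation.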

5. Define Conditional Probability and Alternative Approach to Conditional Probabilities
Conditional Probability
 The probability of one event, given the occurrence of another event.

Alternative Approach to Conditional Probabilities


 Conditional probabilities can be easily misinterpreted.
 Convert probabilities to frequencies (which, for example, total 100);
solve the problem with frequencies; and then convert the answer back to a
probability
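
A minimal sketch of the frequency approach follows. All figures (100 people, a 10% rate for the condition, a 90% detection rate, a 20% false-alarm rate) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical figures, for illustration only.
total = 100                                   # imagine 100 people
with_condition = 10                           # 10% have the condition
without_condition = total - with_condition    # the remaining 90 do not

# Step 1: convert probabilities to frequencies.
true_positives = with_condition * 9 // 10       # 90% of 10 test positive -> 9
false_positives = without_condition * 2 // 10   # 20% of 90 test positive -> 18

# Step 2: solve the problem with frequencies.
all_positives = true_positives + false_positives  # 27 positive tests in all

# Step 3: convert the answer back to a probability:
# Pr(condition, given a positive test).
pr_condition_given_positive = Fraction(true_positives, all_positives)
print(pr_condition_given_positive)  # 1/3
```

Working with counts makes it clear that the conditional probability is taken over the 27 positive tests, not over all 100 people.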
6. Define sampling distribution of the mean.
 The sampling distribution of the mean refers to the probability
distribution of means for all possible random samples of a given size
from some population.

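7. Narrate the symbols used for the mean and standard deviation of three types of Distributions.

Type of distribution                  Mean    Standard deviation
Sample                                X̄       s
Population                            μ       σ
Sampling distribution of the mean     μ_X̄     σ_X̄ (standard error of the mean)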

8. Define mean of all sample means.

 MEAN OF ALL SAMPLE MEANS


 The mean of all sample means always equals the population mean.

μ_X̄ = μ

where μ_X̄ represents the mean of the sampling distribution and μ represents the mean of the population.

10. Define Standard error of the mean.

 STANDARD ERROR OF THE MEAN


 The distribution of sample means also has a standard
deviation, referred to as the standard error of the mean.
 The standard error of the mean equals the standard deviation
of the population divided by the square root of the sample
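size.

σ_X̄ = σ / √n

where σ_X̄ represents the standard error of the mean, σ the standard deviation of the population, and n the sample size.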


11. Define Shape of the sampling distribution or state the central limit theorem.
 SHAPE OF THE SAMPLING DISTRIBUTION
Central Limit Theorem
 The central limit theorem states that, regardless of the shape of
the population, the shape of the sampling distribution of the mean
approximates a normal curve if the sample size is sufficiently
large.

12. Define Hypothesis Testing and its types.
Hypothesis Testing
• Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to draw conclusions about a population.
• It can be used, for example, to assess the relationship between two statistical variables.
• It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (H1), and then collecting data to assess the evidence.
• Hypothesis testing evaluates two mutually exclusive statements about a population to determine which statement is better supported by the sample data.

13. Define Null Hypothesis and Alternative Hypothesis.

• Null hypothesis (H0):
In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption made on the basis of knowledge of the problem.
Example:
A company's mean production is 50 units per day, i.e. H0: μ = 50.
• Alternative hypothesis (H1):
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
Example:
A company's mean production is not equal to 50 units per day, i.e. H1: μ ≠ 50.


14. Explain testing of Null Hypothesis. Define Common Outcome and Rare Outcome.
Testing Null Hypothesis
• The null hypothesis is tested by determining whether the one observed sample mean qualifies as a common outcome or a rare outcome in the hypothesized sampling distribution.
Common Outcomes
o An observed sample mean qualifies as a common outcome if the difference between its value and that of the hypothesized population mean is small enough to be viewed as a probable outcome under the null hypothesis.
o Because there is no compelling reason for rejecting the null hypothesis, it is retained.
Rare Outcomes
o An observed sample mean qualifies as a rare outcome if the difference between its value and the hypothesized population mean is too large to be reasonably viewed as a probable outcome under the null hypothesis.
o Because there are grounds for suspecting the null hypothesis, it is rejected.

15. Discuss z test for a population mean.

Z TEST FOR A POPULATION MEAN

• A hypothesis test that evaluates how far the observed sample mean deviates, in standard error units, from the hypothesized population mean.
• This z test is accurate only when
(1) the population is normally distributed or the sample size is large enough to satisfy the requirements of the central limit theorem, and
(2) the population standard deviation is known.

16. List the z-test step by step procedure

Step 1 - State the research problem.
Step 2 - Identify the statistical hypotheses.
Step 3 - Specify a decision rule.
Step 4 - Calculate the value of the observed z.
Step 5 - Make a decision.
Step 6 - Interpret the decision.
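
Steps 3 to 5 of the procedure can be sketched in Python for a two-tailed test. The function name and the numbers in the example (sample mean 533, hypothesized mean 500, σ = 110, n = 100) are assumptions made for illustration; `NormalDist` from the standard library stands in for a printed table of the standard normal curve.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu_hyp, sigma, n, alpha=0.05):
    """Return (observed z, critical z, decision) for a two-tailed z test."""
    # Step 3: decision rule -> reject H0 if |z| reaches the critical z.
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 4: observed z = (sample mean - hypothesized mean) / standard error.
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu_hyp) / standard_error
    # Step 5: make the decision.
    decision = "reject H0" if abs(z) >= critical_z else "retain H0"
    return z, critical_z, decision

# Hypothetical data: H0: mu = 500, H1: mu != 500.
z, critical_z, decision = z_test_two_tailed(533, 500, 110, 100)
print(round(z, 2), round(critical_z, 2), decision)  # 3.0 1.96 reject H0
```

Steps 1, 2 and 6 (stating the problem, naming the hypotheses, interpreting the decision) remain the analyst's job; the code only automates the arithmetic.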

17. Define Critical z Score


A z score that separates common from rare outcomes and hence dictates
whether H0 should be retained or rejected.

18. Define Level of Significance (α)


 The degree of rarity required of an observed outcome in order to reject the null
hypothesis (H0).

19. What is the use of one-tailed and two-tailed tests in hypothesis testing? When to use it?
• One-tailed and two-tailed tests differ in where the rejection region is placed in the sampling distribution.
• Use a one-tailed test when the alternative hypothesis specifies a single direction (left or right), that is, when only a difference in that direction is of concern.
• Use a two-tailed test when a difference in either direction is of concern.

20. Define One-Tailed or Directional Test

• A one-tailed test is based on a uni-directional hypothesis where the area of rejection is on only one side of the sampling distribution.
• It determines whether a particular population parameter is larger than (or, alternatively, smaller than) the hypothesized value. It uses one single critical value to test the data.


21. Define Two-Tailed or Non-directional Test

• Rejection regions are located in both tails of the sampling distribution.
• Use a two-tailed test to check whether the population parameter differs from the hypothesized value in either direction, that is, whether it is greater than or less than that value.
• It is used for null hypothesis testing.

Figure 3.9 – Two tailed test

22. Define Point Estimate.
POINT ESTIMATE
• A single value that represents some unknown population characteristic, such as the population mean.
• The best single point estimate for the unknown population mean is simply the observed value of the sample mean.

23. Define Confidence interval

CONFIDENCE INTERVAL (CI) FOR μ
A confidence interval for μ uses a range of values that, with a known degree of certainty, includes an unknown population characteristic, such as a population mean.

Confidence Interval for μ Based on z

X̄ ± (z_conf)(σ_X̄)

where X̄ represents the sample mean; z_conf represents a number from the standard normal table that satisfies the confidence specifications for the confidence interval; and σ_X̄ represents the standard error of the mean.
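
As a rough sketch, a confidence interval for μ based on z can be computed as follows. The function name and the example figures (sample mean 533, σ = 110, n = 100) are assumptions for illustration, and `NormalDist.inv_cdf` replaces the lookup in the standard normal table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for mu."""
    standard_error = sigma / sqrt(n)
    # z_conf leaves (1 - confidence) / 2 in each tail of the normal curve.
    z_conf = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_conf * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical data: sample mean 533, sigma 110, n 100.
low, high = z_confidence_interval(533, 110, 100)
print(round(low, 2), round(high, 2))  # 511.44 554.56
```

Raising the confidence level widens the interval, while increasing the sample size shrinks the standard error and narrows it.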

24. Explain the term “Normal Distribution” (Nov/Dec 2023)

• The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean occur more frequently than data far from the mean.
• In graphical form, the normal distribution appears as a "bell curve".

The Empirical Rule states that
• approximately 68% of the data fall within one standard deviation of the mean, i.e. between (Mean – One Standard Deviation) and (Mean + One Standard Deviation);
• approximately 95% of the data fall within two standard deviations of the mean, i.e. between (Mean – Two Standard Deviations) and (Mean + Two Standard Deviations);
• approximately 99.7% of the data fall within three standard deviations of the mean, i.e. between (Mean – Three Standard Deviations) and (Mean + Three Standard Deviations).
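
The three percentages of the empirical rule can be verified numerically. This is only a quick check using the standard normal curve from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```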

25. Brief about the Type I and Type II errors in Statistics. Identify the relationship between standard error and margin of error. (Nov/Dec 202)
Type I and Type II errors
• A type I error (false positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.
α = P(Type I error) and β = P(Type II error)


Example
 Type I error (false positive): the test result says you have
coronavirus, but you actually don’t.
 Type II error (false negative): the test result says you don’t
have coronavirus, but you actually do.

Relationship between standard error and margin of error
• The margin of error is the maximum expected difference between a sample estimate and the true population value at a given level of confidence.
• The standard error measures how precisely a sample statistic, such as the sample mean, estimates the corresponding population value.
• The margin of error is obtained from the standard error: it equals the critical value (for example, z = 1.96 for 95% confidence) multiplied by the standard error.
• Standard error and standard deviation are both measures of variability:
o The standard deviation describes variability within a single sample.
o The standard error estimates the variability across multiple samples of a population.
26. Compare between one-tailed and two-tailed test (Apr/May 2024)

• A one-tailed test may be either left-tailed or right-tailed; a two-tailed test has rejection regions in both tails.
• For a one-tailed test, the alternative hypothesis uses either the > or the < sign. For a two-tailed test, it uses the ≠ sign.
• When the alternative hypothesis specifies a direction, use a one-tailed test. If no direction is given, use a two-tailed test.
• A 100(1−α)% confidence interval corresponds to a two-tailed test at level α, because α is split equally between the two tails (α/2 in each).
• At the 0.05 level of significance, the critical value for a one-tailed test is 1.645, so the critical region is Z < −1.645 for a left-tailed test and Z > 1.645 for a right-tailed test. For a two-tailed test, the critical value is 1.96, so the null hypothesis is retained when |Z| < 1.96 and the critical regions are where |Z| > 1.96.
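
The critical values quoted above can be reproduced from the standard normal curve. The sketch assumes a 0.05 level of significance; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal distribution

right_tailed = z.inv_cdf(1 - alpha)      # one-tailed, upper tail
left_tailed = z.inv_cdf(alpha)           # one-tailed, lower tail
two_tailed = z.inv_cdf(1 - alpha / 2)    # alpha split between both tails

print(round(right_tailed, 3))  # 1.645
print(round(left_tailed, 3))   # -1.645
print(round(two_tailed, 3))    # 1.96
```

The two-tailed critical value is larger because only α/2 = 0.025 lies in each tail.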


27. State the central limit theorem (Apr/May 2024)

• The central limit theorem says that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough.
• That is, irrespective of a random variable's distribution, if large enough samples are drawn from the population, then the sampling distribution of the mean for that random variable will approximate a normal distribution.
• As a common rule of thumb, a sample size of 30 or more is treated as large enough.


PART B

1. Give a detailed introduction about Population, Sample and Probability.


 Population
 Any complete set of observations (or potential observations).
Types of Population
 Real Populations
o A real population is one in which all potential observations
are accessible at the time of sampling.
 Hypothetical Populations
o A hypothetical population is one in which all
potential observations are not accessible at the time
of sampling.

 Sample
 Any subset of observations from a population.
 The sample size is small relative to the population size.

Example 3.1
For each of the following pairs, indicate with a Yes or No
whether the relationship between the first and second
expressions could describe that between a sample and its
population, respectively.
(a) students in the last row; students in class
(b) citizens of Wyoming; citizens of New York
(c) 20 lab rats in an experiment; all lab rats, similar to
those used, that could undergo the same experiment
(d) all U.S. presidents; all registered Republicans
(e) two tosses of a coin; all possible tosses of a coin
Solution
(a) Yes
(b) No. Citizens of Wyoming aren’t a subset of citizens of New York.
(c) Yes
(d) No. All U.S. presidents aren’t a subset of all registered Republicans.
(e) Yes


Example 3.2
Identify all of the expressions from Example 3.1 that involve
a hypothetical population.
Solution
Expressions in 3.1(c) and 3.1(e) involve hypothetical populations.
 Random Sampling
 A selection process that guarantees all potential observations in
the population have an equal chance of being selected.
 Inferential statistics requires that samples be random.

Example 3.3
Indicate whether each of the following statements is True or False.
A random selection of 10 playing cards from a deck of 52 cards implies that
(a) the random sample of 10 cards accurately represents the
important features of the whole deck.
(b) each card in the deck has an equal chance of being selected.
(c) it is impossible to get 10 cards from the same suit (for
example, 10 hearts).
(d) any outcome, however unlikely, is possible.
Solution
a. False. Sometimes, just by chance, a random sample of 10 cards
fails to represent the important features of the whole deck.
b. True
c. False. Although unlikely, 10 hearts could appear in a random
sample of 10 cards.
d. True

 Tables Of Random Numbers


 Tables of random numbers can be used to obtain a random sample.
 These tables are generated by a computer designed to equalize
the occurrence of any one of the 10 digits: 0, 1, 2, . . . , 8,
9


Example 3.4

Describe how you would use the table of random numbers to take
a. a random sample of five statistics students in a classroom
where each of nine rows consists of nine seats.
b. a random sample of size 40 from a large directory consisting
of 3041 pages, with 480 lines per page.

Solution

a. There are many ways. For instance, consult the tables of random numbers, using the first digit of each 5-digit random number to identify the row (previously labelled 1, 2, 3, and so on), and the second digit of the same random number to locate a particular student's seat within that row. Repeat this process until five students have been identified. (If the classroom is larger, use additional digits so that every student can be sampled.)

b. Once again, there are many ways. For instance, use the initial 4 digits of each random number (between 0001 and 3041) to identify the page number of the telephone directory and the next 3 digits (between 001 and 480) to identify the particular line on that page. Repeat this process, using 7-digit numbers, until 40 telephone numbers have been identified.
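
Where a computer is available, the same two samples can be drawn with a pseudo-random number generator instead of a printed table. This is a swapped-in technique, not the table method described above; the seed value and variable names are arbitrary assumptions.

```python
import random

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable

# (a) Five students from a classroom of 9 rows x 9 seats.
seats = [(row, seat) for row in range(1, 10) for seat in range(1, 10)]
students = rng.sample(seats, 5)

# (b) Forty lines from a directory of 3041 pages x 480 lines per page.
pages, lines_per_page = 3041, 480
line_ids = rng.sample(range(pages * lines_per_page), 40)
# Convert each running line number back to (page, line on that page).
entries = [(i // lines_per_page + 1, i % lines_per_page + 1) for i in line_ids]

print(students)
print(entries[:3])
```

`random.sample` draws without replacement, so no student or directory line can be picked twice, and every seat or line has the same chance of being selected.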

 Probability
 The proportion or fraction of times that a particular event is
likely to occur.

Mutually Exclusive Events


 Events that cannot occur together.
Addition Rule
• Add together the separate probabilities of several mutually exclusive events to find the probability that any one of these events will occur.

Pr(A or B) = Pr(A) + Pr(B)

where Pr( ) refers to the probability of the event in parentheses and A and B are mutually exclusive events.


Independent Events
 The occurrence of one event has no effect on the probability that
the other event will occur.

Multiplication Rule
• Multiply together the separate probabilities of several independent events to find the probability that these events will occur together.

Pr(A and B) = [Pr(A)][Pr(B)]

where A and B are independent events.

Dependent Events
 When the occurrence of one event affects the probability of the
other event, these events are dependent.
 Although the heights of randomly selected pairs of men are
independent, the heights of brothers are dependent.

Conditional Probability
 The probability of one event, given the occurrence of another event.

Alternative Approach to Conditional Probabilities


 Conditional probabilities can be easily misinterpreted.
 Convert probabilities to frequencies (which, for example, total 100);
solve the problem with frequencies; and then convert the answer back to a
probability


2. Discuss in detail about sampling distribution and creating sampling distribution in inferential statistics.

 Sampling distribution of the mean


 Creating a sampling distribution

 Mean of all sample means

 Standard error of the mean

 Shape of the sampling distribution

 SAMPLING DISTRIBUTION OF THE MEAN


 The sampling distribution of the mean refers to the probability
distribution of means for all possible random samples of a given size
from some population.

CREATING A SAMPLING DISTRIBUTION

• Imagine a small population of four observations with values of 2, 3, 4, and 5, as shown in Figure 3.2.

Figure 3.2 - Graph of a miniature population.

 Itemize all possible random samples, each of size two, that could
be taken from this population.


 There are four possibilities on the first draw from the population and also four
possibilities on the second draw from the population, as indicated in Table
3.1.

• The two sets of possibilities combine to yield a total of 16 possible samples.
• Table 3.1 also lists a sample mean (found by adding the two observations and dividing by 2) and its probability of occurrence (expressed as 1⁄16, since each of the 16 possible samples is equally likely).

Table 3.1 - All possible samples of size two from a miniature population

• When cast into a relative frequency or probability distribution, as in Table 3.2, the 16 sample means constitute the sampling distribution of the mean, previously defined as the probability distribution of means for all possible random samples of a given size from some population.
• Not all values of the sample mean occur with equal probabilities in Table 3.2, since some values occur more than once among the 16 possible samples.
• For instance, a sample mean value of 3.5 appears among 4 of 16 possibilities and has a probability of 4⁄16.

Table 3.2 – Sampling Distribution of the Mean (samples of size two from a miniature population)

 Probability of a Particular Sample Mean


 The distribution in Table 3.2 can be consulted to determine the
probability of obtaining a particular sample mean or set of sample
means.
 The probability of a randomly selected sample mean of either 5.0 or
2.0 equals 1⁄16 + 1⁄16 = 2⁄16 = .1250.
 This type of probability statement, based on a sampling
distribution, assumes an essential role in inferential statistics
 Refer Figure 3.3


FIGURE 3.3
Emergence of the sampling distribution of the mean from all possible
samples
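
The construction of the sampling distribution for the miniature population 2, 3, 4, 5 can be reproduced with a short sketch. It assumes, as the count of 16 samples implies, that the two observations are drawn with replacement and that order matters; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 5]

# All ordered samples of size two, drawn with replacement (4 x 4 = 16).
samples = list(product(population, repeat=2))
means = [Fraction(a + b, 2) for a, b in samples]

# Sampling distribution: each distinct sample mean with its probability.
counts = Counter(means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m in distribution:
    print(float(m), f"{counts[m]}/{len(samples)}")
# 2.0 1/16
# 2.5 2/16
# 3.0 3/16
# 3.5 4/16
# 4.0 3/16
# 4.5 2/16
# 5.0 1/16

# Probability of a sample mean of either 5.0 or 2.0.
print(float(distribution[Fraction(5)] + distribution[Fraction(2)]))  # 0.125
```

The mean of the 16 sample means, `sum(means) / len(means)`, comes out as 7/2 = 3.5, the same as the population mean.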


Example 3.8
Without peeking, list the special symbols for the
(a) mean of the population
(b) mean of the sampling distribution of the mean
(c) mean of the sample
(d) standard error of the mean
(e) standard deviation of the sample
(f) standard deviation of the population.

Example 3.9
Imagine a very simple population consisting of only five observations: 2, 4, 6, 8, 10.
(a) List all possible samples of size two.
(b) Construct a relative frequency table showing the sampling distribution of the mean.


 MEAN OF ALL SAMPLE MEANS


 The mean of all sample means always equals the population mean.

μ_X̄ = μ

where μ_X̄ represents the mean of the sampling distribution and μ represents the mean of the population.

Example 3.10
Indicate whether the following statements are True or False.
The mean of all sample means, μ_X̄, . . .
(a) always equals the value of a particular sample mean.
(b) equals 100 if, in fact, the population mean equals 100.
(c) usually equals the value of a particular sample mean.
(d) is interchangeable with the population mean.
Solution
a. False. It always equals the value of the population mean.
b. True
c. False. Because of chance, most sample means tend to be either larger or smaller than the mean of all sample means.
d. True

STANDARD ERROR OF THE MEAN

• The distribution of sample means also has a standard deviation, referred to as the standard error of the mean.
• The standard error of the mean serves as a special type of standard deviation that measures variability in the sampling distribution.
• It is a rough measure of the average amount by which sample means deviate from the mean of the sampling distribution or from the population mean.
• The standard error of the mean equals the standard deviation of the population divided by the square root of the sample size.

σ_X̄ = σ / √n

where σ_X̄ represents the standard error of the mean, σ the standard deviation of the population, and n the sample size.

Example 3.10
Indicate whether the following statements are True or False.
The standard error of the mean, σ_X̄, . . .
(a) roughly measures the average amount by which sample means deviate from the population mean.
(b) measures variability in a particular sample.
(c) increases in value with larger sample sizes.
(d) equals 5, given that σ = 40 and n = 64.
Solution
(a) True
(b) False. It measures variability among sample means.
(c) False. It decreases in value with larger sample sizes.
(d) True
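
Answers (c) and (d) can be checked with a two-line calculation. The function name is an illustrative assumption; the second call, with n = 256, is an extra case added to show the effect of a larger sample.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: population SD divided by sqrt(sample size)."""
    return sigma / sqrt(n)

print(standard_error(40, 64))   # 5.0  (statement d)
print(standard_error(40, 256))  # 2.5  (a larger n gives a smaller standard error)
```

Quadrupling the sample size halves the standard error, because n enters the formula through its square root.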

 SHAPE OF THE SAMPLING DISTRIBUTION


Central Limit Theorem
• The central limit theorem states that, regardless of the shape of the population, the shape of the sampling distribution of the mean approximates a normal curve if the sample size is sufficiently large.
Example - For the two non-normal populations in the top panel of Figure 3.4, the shapes of the sampling distributions in the middle panel show essentially the same preliminary drift toward normality when the sample size equals only 2, while the shapes of the sampling distributions in the bottom panel closely approximate normality when the sample size equals 25.


Figure 3.4 – Effect of Central limit theorem
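
The drift toward normality can also be seen in a small simulation. This is a sketch under stated assumptions: the skewed population is an exponential distribution with mean 1 (not one of the populations in Figure 3.4), 5000 samples are drawn for each sample size, and skewness is used as a rough indicator of how far a shape is from the symmetric normal curve.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable

def sample_means(n, repetitions=5000):
    """Means of `repetitions` random samples of size n from a skewed population."""
    # Exponential population with mean 1: strongly skewed, far from normal.
    return [mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repetitions)]

def skewness(values):
    """Rough skewness: close to 0 for a symmetric, normal-looking shape."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

means_2 = sample_means(2)
means_25 = sample_means(25)

print("n = 2 :", round(mean(means_2), 2), round(skewness(means_2), 2))
print("n = 25:", round(mean(means_25), 2), round(skewness(means_25), 2))
```

Both sets of sample means centre near the population mean of 1, but the skewness for n = 25 should come out much closer to 0 than for n = 2, matching the sample sizes discussed in the text.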

Example 3.11
Indicate whether the following statements are True or False.
The central limit theorem
a. states that, with sufficiently large sample sizes, the shape of the population is normal.
b. states that, regardless of sample size, the shape of the sampling distribution of the mean is normal.
c. ensures that the shape of the sampling distribution of the mean equals the shape of the population.
d. applies to the shape of the sampling distribution, not to the shape of the population and not to the shape of the sample.
Solution
a. False. The shape of the population remains the same regardless of sample size.
b. False. It requires that the sample size be sufficiently large, usually between 25 and 100.
c. False. It ensures that the shape of the sampling distribution approximates a normal curve, regardless of the shape of the population (which remains intact).
d. True

3. Explain in detail about Hypothesis Testing and its types.
Hypothesis Testing
• Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to draw conclusions about a population.
• It can be used, for example, to assess the relationship between two statistical variables.
• It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (H1), and then collecting data to assess the evidence.
• Hypothesis testing evaluates two mutually exclusive statements about a population to determine which statement is better supported by the sample data.

Defining Hypotheses
• Null hypothesis (H0):
In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption made on the basis of knowledge of the problem.
Example:
A company's mean production is 50 units per day, i.e. H0: μ = 50.
• Alternative hypothesis (H1):
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
Example:
A company's mean production is not equal to 50 units per day, i.e. H1: μ ≠ 50.

Key Terms of Hypothesis Testing

• Level of significance:
o The level of significance is the degree of rarity required of an observed outcome in order to reject the null hypothesis. Because a decision based on a sample can never be 100% certain, a level of significance is chosen in advance, usually 5%.
o It is denoted by α and is generally set at 0.05 (5%), which means accepting a 5% risk of rejecting the null hypothesis when it is actually true.
• P-value:
o The P-value, or calculated probability, is the probability of obtaining results at least as extreme as those observed when the null hypothesis (H0) is true.
o If the P-value is less than the chosen significance level, reject the null hypothesis, i.e. conclude that the sample supports the alternative hypothesis.
• Test statistic:
o The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis.
o It is compared with a critical value, or converted to a P-value, to decide on the statistical significance of the observed results.
• Critical value:
o The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
• Degrees of freedom:
o Degrees of freedom are associated with the variability or freedom one has in estimating a parameter.
o The degrees of freedom are related to the sample size and determine the shape of the distribution of the test statistic.
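
To make the link between the test statistic, the P-value and the level of significance concrete, here is a minimal sketch for a two-tailed z test. The function name and the three z values are assumptions chosen for illustration.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Probability of a z at least this extreme (in either tail) if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # level of significance
for z in (0.5, 2.0, 3.0):
    p = two_tailed_p_value(z)
    verdict = "reject H0" if p < alpha else "retain H0"
    print(z, round(p, 4), verdict)
# 0.5 0.6171 retain H0
# 2.0 0.0455 reject H0
# 3.0 0.0027 reject H0
```

Comparing the P-value with α gives the same decision as comparing the observed z with the critical value of 1.96.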

Testing Null Hypothesis


The null hypothesis is tested by determining whether the one
observed sample mean qualifies as a common outcome or a rare
outcome in the hypothesized sampling distribution of Figure 3.5.


Figure 3.5 - Hypothesized sampling distribution of the mean centred about a hypothesized population mean of 500.
• Common Outcomes
o An observed sample mean qualifies as a common outcome if the difference between its value and that of the hypothesized population mean is small enough to be viewed as a probable outcome under the null hypothesis.
o Because there is no compelling reason for rejecting the null hypothesis, it is retained.
• Rare Outcomes
o An observed sample mean qualifies as a rare outcome if the difference between its value and the hypothesized population mean is too large to be reasonably viewed as a probable outcome under the null hypothesis.
Boundaries for Common and Rare Outcomes

Figure 3.6 - One possible set of common and rare outcomes (values of the sample mean, X̄).
 Figure 3.6 shows one possible set of boundaries for common and rare
outcomes, expressed in values of the sample mean, X̄.
 If the one observed sample mean is located between 478 and 522, it
will qualify as a common outcome, and the null hypothesis will be
retained.
 If, however, the one observed sample mean is greater than 522 or less
than 478, it will qualify as a rare outcome, and the null hypothesis
will be rejected.

4. Discuss in detail about z test for a population mean and z test procedure.
Converting a Raw Score to z
 To convert a raw score into a standard score, express the raw score
as a distance from its mean (by subtracting the mean from the raw
score), and then split this distance into standard deviation units (by
dividing by the standard deviation):

$z = \dfrac{X - \mu}{\sigma}$

Converting a Sample Mean to z

$z = \dfrac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}}$

where

$\bar{X}$ - observed sample mean;

$\mu_{hyp}$ - the hypothesized population mean;

$\sigma_{\bar{X}}$ - the standard error of the mean.

Z TEST FOR A POPULATION MEAN


 A hypothesis test that evaluates how far the observed sample
mean deviates, in standard error units, from the hypothesized
population mean.
 This z test is accurate only when
(1) the population is normally distributed or the sample size is large
enough to satisfy the requirements of the central limit
theorem
(2) the population standard deviation is known.


Z - TEST STEP BY STEP PROCEDURE

Step 1 - State the research problem.


 State the problem to be resolved by the investigation.

Step 2 - Identify the statistical hypotheses.


 The statistical hypotheses consist of a null hypothesis
(H0) and an alternative (or research) hypothesis ( H1).
Null Hypothesis (H0)
 A statistical hypothesis that usually asserts that nothing
special is happening with respect to some characteristic
of the underlying population.

H0: μ = some number, where μ is the population mean


Alternative Hypothesis (H1)
 The opposite of the null hypothesis.

H1: μ ≠ some number

 Depending on the outcome of the hypothesis test, H0 will either be
retained or rejected.

Step 3 - Specify a decision rule.


 This rule indicates precisely when H0 should be rejected.

Step 4 - Calculate the value of the observed z.


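Express the one observed sample mean as an observed z,

$z = \dfrac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}}$

where $\bar{X}$ is the observed sample mean, $\mu_{hyp}$ is the hypothesized population mean, and $\sigma_{\bar{X}}$ is the standard error of the mean.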

Critical z Score
 A z score that separates common from rare outcomes and
hence dictates whether H0 should be retained or rejected.

Level of Significance (α)


 The degree of rarity required of an observed outcome in order
to reject the null hypothesis (H0).

Step 5 - Make a decision.


 Either retain or reject H0 at the specified level of
significance, justifying this decision by noting the
relationship between observed and critical z scores.
Retaining H0 is a Weak Decision
 H0 is retained whenever the observed z qualifies as a
common outcome on the assumption that H0 is true.
Rejecting H0 is a Strong Decision
 H0 is rejected whenever the observed z qualifies as a
rare outcome on the assumption that H0 is true.

Step 6 - Interpret the decision.


 Using words, interpret the decision in terms of the original
research problem.
Rejection of the null hypothesis supports the research hypothesis, while
retention of the null hypothesis fails to support the research hypothesis.

Example 3.14
First using words, then symbols, identify the null hypothesis
for each of the following situations.
a. A school administrator wishes to determine whether sixth-grade
boys in her school district differ, on average, from the national
norms of 10.2 pushups for sixth-grade boys.
b. A consumer group investigates whether, on average, the true
weights of packages of ground beef sold by a large supermarket
chain differ from the specified 16 ounces.
c. A marriage counselor wishes to determine whether, during
a standard conflict-resolution session, his clients differ, on
average, from the 11 verbal interruptions reported for
“well-adjusted couples.”

(a) Sixth-grade boys in her school district average 10.2 pushups.


H0: μ = 10.2
(b) On average, weights of packages of ground beef sold by a
large supermarket chain equal 16 ounces.
H0: μ = 16
(c) The marriage counselor’s clients average 11 interruptions
per session.
H0: μ = 11


Example 3.15
For each of the following situations, indicate whether H0 should
be retained or rejected and justify your answer by specifying the
precise relationship between observed and critical z scores. Should
H0 be retained or rejected, given a hypothesis test with critical z
scores of ±1.96 and
(a) z = 1.74 (b) z = 0.13 (c) z = –2.51

a. Retain H0 at the .05 level of significance because z = 1.74 is less
positive than 1.96.
b. Retain H0 at the .05 level of significance because z = 0.13 is less
positive than 1.96.
c. Reject H0 at the .05 level of significance because z = –2.51 is
more negative than –1.96.

5. Discuss and differentiate between one-tailed and two-tailed tests in hypothesis testing.

 One and Two-Tailed Tests are ways to identify the relationship between
the statistical variables.
 For checking the relationship between variables in a single direction
(left or right), use a one-tailed test.
 A two-tailed test is used to check for a relationship between variables
in either direction.

One-Tailed or Directional Test


 A one-tailed test is based on a uni-directional hypothesis where the area
of rejection is on only one side of the sampling distribution.
 It determines whether a particular population parameter is larger or
smaller than the predefined parameter. Refer to Figures 3.7 and 3.8.
 It uses one single critical value to test the data.

Figure 3.7 – One tailed test


Figure 3.8 a. One-Tailed or Directional Test (Lower Tail Critical)
Figure 3.8 b. One-Tailed or Directional Test (Upper Tail Critical)

 Figure 3.8 a illustrates a rejection region that is associated with only
the lower tail of the hypothesized sampling distribution.
 The corresponding decision rule, with its critical z of –1.65, is
referred to as a one-tailed or directional test with the lower tail
critical.
 Figure 3.8 b, illustrates one-tailed or directional test with the upper
tail critical. This one-tailed test is the mirror image of the previous
test.
 The corresponding decision rule, with its critical z of 1.65, is
referred to as a one-tailed or directional test with the upper tail
critical.

Two-Tailed or Non-directional Test


 Rejection regions are located in both tails of the sampling distribution.
 For checking whether the sample is greater or less than a range of
values, use a two-tailed test.
 It is used when the alternative hypothesis does not specify a direction.

Figure 3.9 – Two tailed test


Figure 3.10 – Two-Tailed or Nondirectional Test

 Figure 3.10 shows rejection regions that are associated with both
tails of the hypothesized sampling distribution.
 The corresponding decision rule, with its pair of critical z scores of
±1.96, is referred to as a two-tailed or nondirectional test.

Difference Between One and Two-Tailed Test:

One-Tailed Test | Two-Tailed Test
A test of a statistical hypothesis where the alternative hypothesis is one-tailed, either right-tailed or left-tailed. | A test of a statistical hypothesis where the alternative hypothesis is two-tailed.
For one-tailed, use either > or < sign for the alternative hypothesis. | For two-tailed, use ≠ sign for the alternative hypothesis.
When the alternative hypothesis specifies a direction, use a one-tailed test. | If no direction is given, use a two-tailed test.
Critical region lies entirely on either the right side or the left side of the sampling distribution. | Critical region is given by the portion of the area lying in both the tails of the probability curve of the test statistic.
Here, the entire level of significance (α), e.g. 5%, lies in either the left tail or the right tail. | It splits the level of significance (α) into half, with α/2 in each tail.
Rejection region is either on the left side or the right side of the sampling distribution. | Rejection region is on both sides, i.e. left and right, of the sampling distribution.
It checks the relation between the variables in a single direction. | It checks the relation between the variables in any direction.
It is used to check whether one mean is greater than (or less than) another mean. | It is used to check whether two means are different from one another or not.

Example 3.17
For each of the following situations, indicate whether H0 should
be retained or rejected.
Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
a. Reject H0 at the .01 level of significance because z = –2.34 is more
negative than –2.33.
b. Reject H0 at the .01 level of significance because z = –5.13 is more
negative than –2.33.
c. Retain H0 at the .01 level of significance because z = 4.04 is less negative
than –2.33. (The value of the observed z is in the direction of no concern.)
d. Reject H0 at the .05 level of significance because z = 2.00 is more
positive than 1.65.
e. Retain H0 at the .05 level of significance because z = –1.80 is less positive
than 1.65. (The value of the observed z is in the direction of no concern.)
f. Retain H0 at the .05 level of significance because z = 1.61 is less
positive than 1.65.
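
The decisions in Example 3.17 can be checked mechanically. The following is a minimal sketch, not part of the original notes: the function name `decide` and the tail labels "lower", "upper" and "two" are hypothetical, and the critical z is passed in as a positive magnitude taken from the table.

```python
def decide(z_observed, z_critical, tail):
    """Return 'reject' or 'retain' for H0, given an observed z.

    z_critical is the positive magnitude of the critical z score.
    tail is 'lower', 'upper', or 'two' (labels assumed for this sketch).
    """
    if tail == "lower":
        rare = z_observed <= -z_critical      # rejection region in lower tail only
    elif tail == "upper":
        rare = z_observed >= z_critical       # rejection region in upper tail only
    elif tail == "two":
        rare = abs(z_observed) >= z_critical  # rejection regions in both tails
    else:
        raise ValueError("tail must be 'lower', 'upper', or 'two'")
    return "reject" if rare else "retain"


# Example 3.17 (a)-(c): lower tail critical, alpha = .01, critical z = 2.33
lower_decisions = [decide(z, 2.33, "lower") for z in (-2.34, -5.13, 4.04)]
# Example 3.17 (d)-(f): upper tail critical, alpha = .05, critical z = 1.65
upper_decisions = [decide(z, 1.65, "upper") for z in (2.00, -1.80, 1.61)]

print(lower_decisions)
print(upper_decisions)
```

An observed z in the "direction of no concern" (such as z = 4.04 with the lower tail critical) can never fall in the rejection region, however extreme it is.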

Example 3.18
Specify the decision rule for each of the following situations
(referring to Table to find critical z values):
(a) a two-tailed test with α = .05
(b) a one-tailed test, upper tail critical, with α = .01
(c) a one-tailed test, lower tail critical, with α = .05
(d) a two-tailed test with α = .01
a. Reject H0 at the .05 level of significance if z equals or is more positive than
1.96 or if z equals or is more negative than –1.96.
b. Reject H0 at the .01 level of significance if z equals or is more
positive than 2.33.
c. Reject H0 at the .05 level of significance if z equals or is more negative than
–1.65.
d. Reject H0 at the .01 level of significance if z equals or is more positive
than 2.58 or if z equals or is more negative than –2.58.
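
The critical z values read from the table in Example 3.18 can also be computed from α. This is a sketch under the assumption that Python's standard `statistics.NormalDist` is an acceptable stand-in for the table; the function name `critical_z` is hypothetical. Note that the exact one-tailed value for α = .05 is about 1.645, which the table rounds to 1.65.

```python
from statistics import NormalDist


def critical_z(alpha, tail):
    """Return the positive magnitude of the critical z for a given alpha.

    tail is 'two' for a two-tailed test, or 'upper' / 'lower' for a
    one-tailed test (labels assumed for this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split in half between the two tails
        return std_normal.inv_cdf(1 - alpha / 2)
    if tail in ("upper", "lower"):
        # the entire alpha sits in one tail
        return std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")


for alpha, tail in [(0.05, "two"), (0.01, "upper"), (0.05, "lower"), (0.01, "two")]:
    print(alpha, tail, round(critical_z(alpha, tail), 3))
```

For a lower tail critical test, the decision rule uses the negative of the returned magnitude.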

6. Discuss in detail about Estimation.


 POINT ESTIMATE
 A single value that represents some unknown
population characteristic, such as the population
mean.

 The best single point estimate for the unknown population mean
is simply the observed value of the sample mean.

Example 3.19
A random sample of 200 graduates of U.S. colleges reveals a
mean annual income of $62,600. What is the best estimate of the
unknown mean annual income for all graduates of U.S.
colleges?
$62,600

 CONFIDENCE INTERVAL (CI) FOR μ


 A confidence interval for μ uses a range of values that, with a
known degree of certainty, includes an unknown population
characteristic, such as a population mean.

Confidence Interval for μ Based on z

$\bar{X} \pm (z_{conf})(\sigma_{\bar{X}})$

where

$\bar{X}$ represents the sample mean;

$z_{conf}$ represents a number from the standard normal table that
satisfies the confidence specifications for the confidence interval;
and
$\sigma_{\bar{X}}$ represents the standard error of the mean.


Example 3.20
Reading achievement scores are obtained for a group of fourth graders. A score of 4.0
indicates a level of achievement appropriate for fourth grade, a score
below 4.0 indicates underachievement, and a score above
4.0 indicates overachievement. Assume that the population
standard deviation equals 0.4. A random sample of 64 fourth graders reveals a mean
achievement score of 3.82.
a. Construct a 95 percent confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.)
b. Interpret this confidence interval; that is, do you find any consistent evidence either of
overachievement or of underachievement?
Solution
a. The standard error is 0.4/√64 = 0.05, so the interval is 3.82 ± (1.96)(0.05) = 3.82 ± 0.098, that is, from 3.72 to 3.92.
b. We can claim, with 95 percent confidence, that the interval between 3.72 and 3.92 includes the true
population mean reading score for the fourth graders. All of these values suggest that, on average, the fourth
graders are underachieving.
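
The arithmetic of Example 3.20 can be reproduced with a short script. This is an illustrative sketch only: the function name `z_confidence_interval` is hypothetical, and the z value is taken from Python's `statistics.NormalDist` rather than from the standard normal table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    standard_error = sigma / sqrt(n)                           # sigma of the sample mean
    z_conf = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # about 1.96 for 95 percent
    margin = z_conf * standard_error
    return sample_mean - margin, sample_mean + margin


# Example 3.20: sample mean 3.82, population standard deviation 0.4, n = 64
low, high = z_confidence_interval(3.82, 0.4, 64)
print(round(low, 2), round(high, 2))
```

Because the whole interval lies below 4.0, the code agrees with the interpretation that the fourth graders are, on average, underachieving.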

Example 3.21
Before taking the GRE, a random sample of college seniors received special training on how to take
the test. After analysing their scores on the GRE, the investigator reported a dramatic gain, relative to
the national average of 500, as indicated by a 95 percent confidence interval of 507 to 527. Are the
following interpretations true or false?
a. About 95 percent of all subjects scored between 507 and 527.
b. The interval from 507 to 527 refers to possible values of the population mean for all students who
undergo special training.
c. The true population mean definitely is between 507 and 527.
d. This particular interval describes the population mean about 95 percent of the time.
e. In practice, we never really know whether the interval from 507 to 527 is true or false.
f. We can be reasonably confident that the population mean is between 507 and 527.

Solution:

(a) False. We can be 95 percent confident that the mean for all subjects will be between 507 and 527.

(b) True

(c) False. We can be reasonably confident, but not absolutely confident, that the true population mean
lies between 507 and 527.
(d) False. This particular interval either describes the one true population mean or fails to describe the
one true population mean.
(e) True

(f) True

 LEVEL OF CONFIDENCE
 The level of confidence indicates the percent of time that a series
of confidence intervals includes the unknown population
characteristic, such as the population mean.
 Any level of confidence may be assigned to a confidence interval merely
by substituting an appropriate value for z_conf in the confidence interval formula.

Choosing a Level of Confidence


 Although many different levels of confidence have been used,
95 percent and 99 percent are the most prevalent.

 EFFECT OF SAMPLE SIZE


 The larger the sample size, the smaller the standard error and,
hence, the more precise (narrower) the confidence interval will be.
 Indeed, as the sample size grows larger, the standard error will
approach zero and the confidence interval will shrink to a point
estimate.
 Given this perspective, the sample size for a confidence interval,
unlike that for a hypothesis test, never can be too large.
 Factors to select the sample size
i. Experience – Small samples can result in wide confidence
intervals and a greater risk of error.
ii. Confidence Level – The higher the confidence level, the larger the
sample size needed to keep the interval equally narrow.
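
The effect of sample size on the width of a confidence interval can be seen numerically. The sketch below assumes a population standard deviation of 0.4 (borrowed from Example 3.20) and a 95 percent level with z = 1.96; the function name `margin_of_error` and the chosen sample sizes are illustrative only.

```python
from math import sqrt


def margin_of_error(sigma, n, z_conf=1.96):
    """Half-width of a z-based confidence interval: z_conf times the standard error."""
    return z_conf * sigma / sqrt(n)


# Quadrupling the sample size halves the standard error and hence the half-width.
for n in (16, 64, 256, 1024):
    print(n, round(margin_of_error(0.4, n), 4))
```

The half-width shrinks in proportion to 1/√n, so each further gain in precision requires a disproportionately larger sample.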

Example 3.22
On the basis of a random sample of 120 adults, a pollster
reports, with 95 percent confidence, that between 58 and 72
percent of all Americans believe in life after death.
a. If this interval is too wide, what, if anything, can be
done with the existing data to obtain a narrower
confidence interval?
b. What can be done to obtain a narrower 95 percent
confidence interval if another similar investigation is
being planned?
Solution
a. Switch to an interval having a lesser degree of confidence,
such as 90 percent or 75 percent.
b. Increase the sample size.

HYPOTHESIS TESTS OR CONFIDENCE INTERVALS?


 Hypothesis tests merely indicate whether or not an effect is present,
whereas Confidence intervals indicate the possible size of the
effect.
 Confidence intervals tend to be more informative than hypothesis tests.

Example 3.23
In a recent scientific sample of about 900 adult Americans, 70 percent favour stricter gun control of
assault weapons, with a margin of error of ±4 percent for a 95 percent confidence interval. Therefore, the
95 percent confidence interval equals 66 to 74 percent. Indicate whether the following interpretations are
true or false:
a. The interval from 66 to 74 percent refers to possible values of the sample percent.

b. The true population percent is between 66 and 74 percent.

c. In the long run, a series of intervals similar to this one would fail to include the
population percent about 5 percent of the time.
d. We can be reasonably confident that the population percent is between 66 and 74
percent.

Solution
(a) False. The interval from 66 to 74 percent refers to possible values of the population
proportion.
(b) False. We can be reasonably confident, but not absolutely confident, that the true
population proportion is between 66 and 74 percent.
(c) True

(d) True


Example 3.23
For the population at large, the Wechsler Adult Intelligence Scale
is designed to yield a normal distribution of test scores with a mean
of 100 and a standard deviation of 15. School district officials
wonder whether, on the average, an IQ score different from 100
describes the intellectual aptitudes of all students in their district.
Wechsler IQ scores are obtained for a random sample of 25 of their
students, and the mean IQ is found to equal 105. Using the step-by-
step procedure, test the null hypothesis at the .05 level of
significance.
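
The notes give no worked solution for this example, so the following is a hedged sketch of how the step-by-step procedure could be carried out in code. It assumes a two-tailed test (the officials ask whether the mean is "different from 100") with critical z scores of ±1.96 at the .05 level; the function name `one_sample_z` is hypothetical.

```python
from math import sqrt


def one_sample_z(sample_mean, mu_hyp, sigma, n):
    """Observed z: distance of the sample mean from mu_hyp in standard error units."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu_hyp) / standard_error


# Step 2: H0: mu = 100, H1: mu != 100
# Step 3: reject H0 at the .05 level if z >= 1.96 or z <= -1.96
critical = 1.96

# Step 4: observed z for a sample mean of 105, sigma = 15, n = 25
z = one_sample_z(sample_mean=105, mu_hyp=100, sigma=15, n=25)

# Step 5: decision
decision = "reject H0" if abs(z) >= critical else "retain H0"
print(round(z, 2), decision)
```

Under these assumptions the standard error is 15/√25 = 3 and the observed z is 5/3, about 1.67, which is less positive than 1.96. H0 is therefore retained: there is no evidence that the mean IQ of all students in the district differs from 100.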

Example 3.24
Consult the power curves in Figure 11.7 to estimate the approximate
detection rates, rounded to the nearest tenth, for the
following situations:
(a) a three-point effect, with a sample size of 29
(b) a six-point effect, with a sample size of 13
(c) a twelve-point effect, with a sample size of 13

(a) .3
(b) .4
(c) .9


Example 3.25
An investigator consults a chart to determine the sample size
required to detect an eight-point effect with a probability of .80.
What happens to this detection rate of .80—will it actually be smaller,
the same, or larger—if, unknown to the investigator, the true effect
actually equals
(a) twelve points?
(b) five points?
a. The power for the 12-point effect is larger than .80 because the true
sampling distribution is shifted further into the rejection region for
the false H0.
b. The power for the 5-point effect is smaller than .80 because the true
sampling distribution is shifted further into the retention region for
the false H0.

Example 3.26

In Question 10.5 on page 191, it was concluded that the mean
salary among the population of female members of the American
Psychological Association is less than that ($82,500) for all
comparable members who have a doctorate and teach full time.
(a) Given a population standard deviation of $6,000 and a sample
mean salary of $80,100 for a random sample of 100 female
members, construct a 99 percent confidence interval for the mean
salary for all female members.
(b) Given this confidence interval, is there any consistent evidence
that the mean salary for all female members falls below $82,500,
the mean salary for all members?

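Solution
(a) The standard error is 6,000/√100 = 600, so the interval is
80,100 ± (2.58)(600) = 80,100 ± 1,548, that is, from $78,552 to $81,648.
(b) We can claim, with 99 percent confidence, that the interval between
$78,552 and $81,648 includes the true population mean salary for all
female members of the American Psychological Association. All of
these values fall below $82,500, suggesting that, on average, females’
salaries are less than the mean salary for all members.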


Example 3.27

Imagine that one of the following 95 percent confidence intervals estimates the
effect of vitamin C on IQ scores:

95% Confidence Interval Lower Limit Upper Limit
1 100 102
2 95 99
3 102 106
4 90 111
5 91 98

(a) Which one most strongly supports the conclusion that
vitamin C increases IQ scores?
(b) Which one implies the largest sample size?
(c) Which one most strongly supports the conclusion that
vitamin C decreases IQ scores?
(d) Which one would most likely stimulate the investigator to
conduct an additional experiment using larger sample
sizes?
Solution:
(a) 3 (b) 1 (c) 5 (d) 4


7. Among 100 couples who had undergone marital counseling, 60 couples
described their relationships as improved, and among this latter
group, 45 couples had children. The remaining couples
described their relationships as unimproved, and among this group,
5 couples had children.
(a) What is the probability of randomly selecting a couple who described
their relationship as improved? (2)
(b) What is the probability of randomly selecting a couple with children?
(2)
(c) What is the conditional probability of randomly selecting a couple with
children, given that their relationship was described as improved? (2)
(d) What is the conditional probability of an improved relationship,
given that a couple has children? (Nov/Dec 2023)
Solution
(a) The probability of a couple describing their relationship as
improved is 0.6 or 60%.
(b) The probability of a couple having children is 0.5 or 50%.
(c) The conditional probability of a couple having children given that their
relationship was described as improved is 0.75 or 75%.
(d) The conditional probability of an improved relationship, given a
couple has children, is 0.9 or 90%.
Step 1: Calculate the total number of Couples
The total number of couples is given as 100. So, this will act as the
denominator for all the probability calculations.
Step 2: Probability of a Couple describing their relationship as Improved
Out of the total 100 couples, 60 couples described their relationships as improved. So,
the probability of randomly selecting a couple who described their relationship as
improved is calculated as: 60/100=0.6 or 60%.

Step 3: Probability of a Couple having Children


Out of the total 100 couples, 45+5=50 couples had children. So, the probability of randomly
selecting a couple with children is calculated as: 50/100=0.5 or 50%.
Step 4:
Conditional Probability of a Couple having Children given their relationship is improved
Out of the 60 couples who described their relationships as improved, 45 couples had children. So, the
conditional probability of randomly selecting a couple with children, given their relationship was
described as improved, is calculated as: 45/60=0.75 or 75%.


Step 5:
Conditional Probability of an improved relationship, given a couple has children
Out of the 50 couples who had children, 45 couples described their relationships as improved. So, the
conditional probability of an improved relationship, given a couple has children, is calculated as:
45/50=0.9 or 90%.
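
The four answers above can be verified by building the 2 x 2 table of counts and dividing. This is a small sketch; the dictionary layout and the helper names `prob` and `conditional` are illustrative choices, and the two "no children" counts (15 and 35) are derived from the figures given in the question.

```python
from fractions import Fraction

# Counts of couples by (relationship, children); 60 improved, 40 unimproved
counts = {
    ("improved", "children"): 45,
    ("improved", "no children"): 15,
    ("unimproved", "children"): 5,
    ("unimproved", "no children"): 35,
}
total = sum(counts.values())  # 100 couples


def is_improved(key):
    return key[0] == "improved"


def has_children(key):
    return key[1] == "children"


def prob(event):
    """P(event): share of all couples whose category satisfies event."""
    return Fraction(sum(c for key, c in counts.items() if event(key)), total)


def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    joint = prob(lambda key: event(key) and given(key))
    return joint / prob(given)


p_improved = prob(is_improved)
p_children = prob(has_children)
p_children_given_improved = conditional(has_children, is_improved)
p_improved_given_children = conditional(is_improved, has_children)
print(p_improved, p_children, p_children_given_improved, p_improved_given_children)
```

Using `Fraction` keeps the results exact, so they can be compared directly with 60/100, 50/100, 45/60 and 45/50.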
8. The probability of a boy being born equals 0.50, or 1/2, as does the
probability of a girl being born. For a randomly selected family with two
children, what's the probability of
(a) Two boys, that is, a boy and a boy? (3)
(b) Two girls? (2)
(c) Either two boys or two girls? (2) (Nov/Dec 2023)
Solution
Step 1: Analyze the probability of each event
According to the problem, the probability for a child to be a boy or a girl is
0.50 or 1/2. Since having a boy or a girl are independent events, the probability
of having two boys or two girls is the product of their individual probabilities
(multiplication rule).
Step 2: Compute the probability of two boys
The probability of having a boy is 1/2 and since the births are independent, the
probability of having two boys is (1/2)∗(1/2)=1/4.
Step 3: Compute the probability of two girls
Similar to Step 2, the probability of having two girls is
(1/2)∗(1/2)=1/4.
Step 4: Compute the probability of either two boys or two girls
Two boys and two girls are mutually exclusive events, meaning they can't occur
simultaneously. Hence, the probability of having either two boys or two girls is the sum
of their individual probabilities (addition rule). So, the probability is 1/4+1/4=1/2
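
The multiplication and addition rules used in Steps 2 to 4 can be checked by listing the sample space. The sketch below assumes, as the solution does, that the two births are independent and equally likely to be a boy ("B") or a girl ("G"); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two children: BB, BG, GB, GG
outcomes = list(product("BG", repeat=2))

# Multiplication rule: each ordered outcome has probability (1/2) * (1/2)
p_each = Fraction(1, 2) * Fraction(1, 2)

p_two_boys = sum(p_each for outcome in outcomes if outcome == ("B", "B"))
p_two_girls = sum(p_each for outcome in outcomes if outcome == ("G", "G"))

# Addition rule for mutually exclusive events
p_same_sex = p_two_boys + p_two_girls
print(p_two_boys, p_two_girls, p_same_sex)
```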

9. The normal range for a widely accepted measure of body size, the Body Mass
Index (BMI), ranges from 18.5 to 25. Using the midrange BMI score of 21.75
as the null hypothesized value for the population mean, test this
hypothesis at the .01 level of significance given a random sample of 30
weight-watcher participants who show a mean BMI = 22. and a
standard deviation of 3.1. (6) (Nov/Dec 2023)
Solution
 Decision: Do not reject (retain) the null hypothesis.
 At the .01 level of significance, the p-value is 0.433, which is
greater than the level of significance of .01.
 Interpretation: There is no evidence that the mean BMI of the
population of weight-watcher participants differs from the
hypothesized value of 21.75.

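10. State any two reasons why the research hypothesis is not tested directly.
Explain them in brief. (7) (Nov/Dec 2023)
 The research hypothesis is not tested directly because it is
difficult to prove that a specific effect or relationship exists; it
usually does not specify a single value against which the sample can
be evaluated.
 Instead, researchers test the null hypothesis, which states that
there is no effect or relationship. Rejecting the null hypothesis then
provides indirect support for the research hypothesis.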

11. Imagine that one of the following 95 percent confidence intervals estimates the effect of
vitamin C on IQ score.
(Apr/May 2024)
95% Confidence Interval Lower Limit Upper Limit
1 100 102
2 95 99
3 102 106
4 90 111
5 91 98

i) Which one most strongly supports the conclusion that vitamin C increases IQ scores? (4)
ii) Which one implies the largest sample size? (3)
iii) Which one most strongly supports the conclusion that vitamin C decreases IQ scores?
(3)
iv) Which one would most likely stimulate the investigator to conduct an additional
experiment using a larger sample size? (3)
Solution :

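95% Confidence Intervals:


 SE stands for the standard error. We use the standard error in our
calculations for a confidence interval because they are based on a sampling
distribution.
 The formula for calculating a 95% confidence interval is as follows:
o Lower Bound = x̄ - (1.96) x SE
o Upper Bound = x̄ + (1.96) x SE
 Answers:
i) Interval 3 (102 to 106), because it is the only interval that lies entirely above 100.
ii) Interval 1 (100 to 102), because it is the narrowest interval, which implies the smallest standard error and hence the largest sample size.
iii) Interval 5 (91 to 98), because it lies entirely below 100 and has the lowest upper limit.
iv) Interval 4 (90 to 111), because it is the widest interval and is consistent with both an increase and a decrease.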
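
The choice among the five intervals can be expressed as simple comparisons of limits and widths. The sketch below rests on two assumptions that are not stated in the question: that an IQ of 100 represents "no effect", and that "most strongly supports" means the interval lying furthest beyond 100 in the relevant direction. All names are illustrative.

```python
# (lower limit, upper limit) for each numbered 95 percent confidence interval
intervals = {1: (100, 102), 2: (95, 99), 3: (102, 106), 4: (90, 111), 5: (91, 98)}
NO_EFFECT = 100  # assumed baseline IQ, not given in the question


def width(limits):
    lower, upper = limits
    return upper - lower


# Intervals lying entirely above or entirely below the baseline
above = {k: v for k, v in intervals.items() if v[0] > NO_EFFECT}
below = {k: v for k, v in intervals.items() if v[1] < NO_EFFECT}

strongest_increase = max(above, key=lambda k: above[k][0])        # highest lower limit
strongest_decrease = min(below, key=lambda k: below[k][1])        # lowest upper limit
largest_sample = min(intervals, key=lambda k: width(intervals[k]))  # narrowest interval
needs_more_data = max(intervals, key=lambda k: width(intervals[k])) # widest interval

print(strongest_increase, largest_sample, strongest_decrease, needs_more_data)
```

A narrow interval signals a small standard error and therefore a large sample, while a very wide interval that spans both sides of the baseline is the one most likely to prompt a larger follow-up study.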


12. Exemplify in detail the significance of the z-test, its procedure and decision rule with an
example. (Apr/May 2024) (6)
Z-Test
 A z-test is a statistical test used to determine whether two population means are different
when the variances are known and the sample size is large.
 It can also be used to compare one mean to a hypothesized value.
 A z-test is a hypothesis test for data that follows a normal distribution.
 A z-statistic, or z-score, is a number representing the result from the z-test.
 Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has
a small sample size.
 Z-tests assume the standard deviation is known, while t-tests assume it is unknown.

The Z-score is calculated with the formula:


 z=(x -μ)/σ
 Where:
 z = Z-score
 x = the value being evaluated
 μ = the mean
 σ = the standard deviation

The z-test is a statistical test of significance used in hypothesis testing.

There are 3 steps in hypothesis testing:


 State the null and alternative hypotheses
 Perform the statistical test
 Retain or reject the null hypothesis

z-test Example


 A gym trainer claimed that all the new boys in the gym are above average weight. A random
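sample of thirty boys has a mean weight of 112.5 kg; the population mean weight is
100 kg and the population standard deviation is 15 kg.

 Is there sufficient evidence to support the claim of the gym trainer?

Step 1: State the hypotheses. H0: μ = 100; H1: μ > 100.
Step 2: Specify the decision rule. For a one-tailed test with the upper tail critical at the .05 level of significance, the critical z is 1.645.
Step 3: Calculate the observed z: z = (112.5 - 100)/(15/√30) = 12.5/2.74 ≈ 4.56.
Step 4: Make a decision.

From step 3, we have


4.56 > 1.645
So, we have to reject the null hypothesis.
That is, there is sufficient evidence to support the gym trainer's claim that the mean weight of the new boys is above 100 kg.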
Conclusion:
 The z-test is a statistical test of significance used in hypothesis testing (null and
alternative hypotheses) when the sample size is large and the population parameters (mean and
variance) are known.


13. A study finds that racism in cricket more often takes place when the game is played in
England or Australia or New Zealand (say EAN countries). Given that
 Racism takes place or Game is played in EAN is 9/13.
 Racism takes place or Game is played in EAN is 5/7.
 Game is played in EAN given that racism takes place is 4/5

Find the probability of


o No Racism takes place
o Game is played in EAN
o Racism takes place given that Game is played in EAN (7) ( Apr/May 2024)

Step 1: Define Variables


Let R be the event that racism takes place.
Let EAN be the event that the game is played in EAN countries.
Step 2: List Given Probabilities


14. Indicate whether each of the following distributions is positively or
negatively skewed. (Apr/May 2024)

(a) The distribution of incomes of taxpayers has a mean of
$48,000 and a median of $43,000.

Solution: Mean (48,000) > Median (43,000): Positively Skewed

(b) GPAs for all students at some college have a mean of 3.01 and a
median of 3.20.

Solution: Mean (3.01) < Median (3.20): Negatively Skewed

(c) Daily TV viewing times for preschool children have a mean of 55
minutes and a median of 73 minutes.

Solution: Mean (55 minutes) < Median (73 minutes): Negatively Skewed

15. Assume that we have a stream of items of large and unknown length
that we can only iterate over once. Devise an effective sampling
algorithm that randomly chooses an item from this stream such that
each item is equally likely to be selected. (Apr/May2024)

Reservoir Sampling
Let us assume we have to sample 5 objects out of an infinite stream
such that each element has an equal probability of getting selected.
import random

def generator(max_value):
    # Yield the integers 2, 3, ..., max_value as a stand-in for a stream
    number = 1
    while number < max_value:
        number += 1
        yield number

# Create a stream generator
stream = generator(10000)

# Do reservoir sampling from the stream
k = 5
reservoir = []
for i, element in enumerate(stream):
    if i + 1 <= k:
        # Fill the reservoir with the first k items
        reservoir.append(element)
    else:
        # Keep item number i+1 with probability k/(i+1)
        probability = k / (i + 1)
        if random.random() < probability:
            # Select item in stream and remove one of the k items already selected
            reservoir[random.choice(range(0, k))] = element

print(reservoir)

[1369, 4108, 9986, 828, 5589]

So, let us think of a stream of only 3 items, of which we have to keep 2.
We see the first item, and we hold it in the list as our reservoir has
space. We see the second item, and we hold it in the list as our
reservoir has space.
We see the third item. We choose the third item to be in the list with
probability 2/3.
Let us now see the probability of the first item getting removed. It is the
probability of the third item being selected multiplied by the probability
of the first item being chosen as the one to replace:
2/3*1/2 = 1/3
Thus the probability of the first item remaining selected is:
1–1/3 = 2/3
We can have the exact same argument for the second item,
and we can extend it to many items.
Thus each item has the same probability of getting selected: 2/3 or in general k/n.
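
The question asks for a single item, which is the special case k = 1 of the reservoir above. The following is a minimal sketch of that case, not the listing from the notes: the function name `sample_one` is hypothetical, and the empirical check at the end uses a short stream of five items purely for illustration.

```python
import random
from collections import Counter


def sample_one(stream):
    """Return one item chosen uniformly at random from a single pass over stream.

    The item seen at position count (1-based) replaces the current choice
    with probability 1/count, so after n items each has probability 1/n.
    Returns None for an empty stream.
    """
    chosen = None
    for count, item in enumerate(stream, start=1):
        if random.randrange(count) == 0:  # true with probability 1/count
            chosen = item
    return chosen


# Empirical check: each of 5 items should be picked in roughly 1/5 of the trials
random.seed(0)
tally = Counter(sample_one(iter(range(5))) for _ in range(50_000))
print(sorted(tally.items()))
```

Only the current choice and a counter are kept in memory, so the stream can be arbitrarily long and its length need not be known in advance.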
