
Unit IV Update

The document outlines the fundamentals of data science and analytics, focusing on statistical methods such as t-tests, ANOVA, and chi-square tests. It provides definitions, procedures, and calculations related to these statistical tests, including sampling distributions, degrees of freedom, and confidence intervals. Additionally, it discusses the differences between t-tests and ANOVA, as well as the significance of p-values in hypothesis testing.

AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS Unit-IV Mailam Engineering College

UNIT IV – ANALYSIS OF VARIANCE


t-test for one sample – sampling distribution of t – t-test procedure – t-test for two independent samples – p-value – statistical significance – t-test for two related samples. F-test – ANOVA – Two-factor experiments – three F-tests – two-factor ANOVA – Introduction to chi-square tests.
PART A
1. Define Sampling Distribution of t.
 The distribution that would be obtained if a value of t were calculated for
each sample mean for all possible random samples of a given size from some
population.

2. Define Degrees of Freedom.

• Degrees of freedom (df) refers to the number of values free to vary when, for example, sample variability is used to estimate the unknown population variability.

• For a single sample, $df = n - 1$, where df represents degrees of freedom and n equals the sample size.

3. What is t-test or t-ratio?

• A replacement for the z ratio whenever the unknown population standard deviation must be estimated:

$t = \dfrac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}}$

• The ratio has a t sampling distribution with n − 1 degrees of freedom.

4. Formulate the estimation of standard error and estimated standard error of the mean.

• The estimated standard error of the mean is

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$

where $s_{\bar{X}}$ represents the estimated standard error of the mean; n equals the sample size; and s has been defined as

$s = \sqrt{\dfrac{SS}{n - 1}} = \sqrt{\dfrac{SS}{df}}$

where s is the sample standard deviation; df refers to the degrees of freedom; and SS has been defined as

$SS = \sum (X - \bar{X})^2 = \sum X^2 - \dfrac{(\sum X)^2}{n}$

• This new version of the standard error, the estimated standard error of the mean, is used whenever the unknown population standard deviation must be estimated.

PREPARED BY: Ms.G.Ramya,AP / CSBS

5. Discuss the steps in Calculation for the t Test.

Panel I
• This panel generates values for the sample mean, $\bar{X}$, and the sample standard deviation, s.
Panel II
• Dividing the sample standard deviation, s, by the square root of the sample size, n, gives the value for the estimated standard error, $s_{\bar{X}}$.
Panel III
• Finally, dividing the difference between the sample mean, $\bar{X}$, and the null hypothesized value, $\mu_{hyp}$, by the estimated standard error, $s_{\bar{X}}$, yields the value of the t ratio.

6. Define confidence intervals for μ based on t.

• When the population standard deviation is unknown and, therefore, must be estimated, as in the present case, t replaces z in the new formula for a confidence interval:

$\bar{X} \pm (t_{conf})(s_{\bar{X}})$

• where $\bar{X}$ represents the sample mean; $t_{conf}$ represents a number (distributed with n – 1 degrees of freedom) from the t tables, which satisfies the confidence specifications for the confidence interval; and $s_{\bar{X}}$ represents the estimated standard error of the mean.

7. Define two independent samples.


 Observations in each sample are based on different (and unmatched)
subjects.
 When samples are independent, observations in one sample are not paired, on a
one-to-one basis, with observations in the other sample.

8. Define Sampling Distribution of $\bar{X}_1 - \bar{X}_2$.

• Differences between sample means based on all possible pairs of random samples from two underlying populations.
• It represents the entire spectrum of differences between sample means based on all possible pairs of random samples from the two underlying populations.


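9. Define Mean of the Sampling Distribution, $\mu_{\bar{X}_1 - \bar{X}_2}$.

• The mean of the new sampling distribution of $\bar{X}_1 - \bar{X}_2$ equals the difference between population means, that is,

$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$

• where $\mu_{\bar{X}_1 - \bar{X}_2}$ is the mean of the new sampling distribution and $\mu_1 - \mu_2$ is the difference between population means.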

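10. Define Standard Error of the Sampling Distribution, $\sigma_{\bar{X}_1 - \bar{X}_2}$.

• A rough measure of the average amount by which any sample mean difference deviates from the difference between population means:

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

• where $\sigma_{\bar{X}_1 - \bar{X}_2}$ is the new standard error, $\sigma_1^2$ and $\sigma_2^2$ are the two population variances, and $n_1$ and $n_2$ are the two sample sizes.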

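11. Define t – ratio for two population means or two independent samples.

• The t ratio divides the difference between the observed and hypothesized mean differences by the estimated standard error:

$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}}$

• It has a t sampling distribution with $n_1 + n_2 - 2$ degrees of freedom.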

12. List the steps for calculating t – ratio for two population means or two independent samples.
Panel I
• Requiring the most computational effort, this panel produces values for the two sample means, $\bar{X}_1$ and $\bar{X}_2$, and for the two sample sums of squares, $SS_1$ and $SS_2$.

Panel II - Pooled Variance Estimate, $s_p^2$
• The most accurate estimate of the population variance (assumed to be the same for both populations) based on a combination of two sample sums of squares and their degrees of freedom:

$s_p^2 = \dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}$


Panel III - Estimated Standard Error, $s_{\bar{X}_1 - \bar{X}_2}$
• The standard deviation of the sampling distribution for the difference between means whenever the unknown variance common to both populations must be estimated:

$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

Panel IV
• Finally, dividing the difference between the two sample means, $\bar{X}_1 - \bar{X}_2$, and the null hypothesized population mean difference, $(\mu_1 - \mu_2)_{hyp}$ (of zero), by the estimated standard error, $s_{\bar{X}_1 - \bar{X}_2}$, generates a value for the t ratio.

13. Define p-values.

p-value
 The p-value for a test result represents the degree of rarity of that result,
given that the null hypothesis is true.
 Smaller p-values tend to discredit the null hypothesis and to support the
research hypothesis.
 The p-value represents the proportion of area, beyond the observed result,
in the tail of the sampling distribution.

14. Which should be used: Level of Significance or p-Value?


 Specified before the test result has been observed, the level of significance
describes a degree of rarity that, if attained subsequently by the test result,
triggers the decision to reject H0.
 Specified after the test result has been observed, a p-value describes the
most impressive degree of rarity actually attained by the test result.

15. What is Statistical Significance?


 Statistical significance between pairs of sample means implies only that the null
hypothesis is probably false, and not whether it’s false because of a large or small
difference between population means.
16. Define t-test for two related samples.
• The null hypothesis for two related samples can be tested with a t ratio:

$t = \dfrac{\bar{D} - D_{hyp}}{s_{\bar{D}}}$

• which has a t sampling distribution with n – 1 degrees of freedom, where n is the number of difference scores. $\bar{D}$ represents the sample mean of the difference scores; $D_{hyp}$ represents the hypothesized population mean (of zero) for the difference scores; and $s_{\bar{D}}$ represents the estimated standard error of $\bar{D}$.

17. List the steps in calculations for the t test (two related samples).

Panel I
• Panel I involves most of the computational labour, and it generates values for the sample mean difference, $\bar{D}$, and the sample standard deviation for the difference scores, $s_D$.

Panel II
• Dividing the sample standard deviation, $s_D$, by the square root of its sample size, n, gives the estimated standard error, $s_{\bar{D}}$.

Panel III
• Finally, dividing the difference between the sample mean, $\bar{D}$, and the null hypothesized value, $D_{hyp}$ (of zero), by the estimated standard error, $s_{\bar{D}}$, culminates in the value for the t ratio.

18. What is F test?

• F reflects the ratio of the observed differences between all sample means (measured as variability between groups) in the numerator and the estimated error term or pooled variance estimate (measured as variability within groups) in the denominator term, that is,

$F = \dfrac{\text{variability between groups}}{\text{variability within groups}}$

19. What is ANOVA? Discuss in detail about one factor ANOVA.

Analysis of Variance (ANOVA)
• When data are quantitative, an overall test of the null hypothesis for more than two population means is known as analysis of variance.
One-Factor ANOVA
• The simplest type of ANOVA that tests for differences among population means categorized by only one independent variable.
• In a one-factor ANOVA, a single null hypothesis is tested with one F ratio.

20. Define Two-Factor ANOVA

• A more complex type of analysis that tests whether differences exist among population means categorized by two factors or independent variables.
• In two-factor ANOVA, three different null hypotheses are tested, one at a time, with three F ratios: $F_{column}$, $F_{row}$, and $F_{interaction}$.

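21. What are the three F tests?

• In two-factor ANOVA, the three F tests are $F_{column}$, $F_{row}$, and $F_{interaction}$; each tests its own null hypothesis, one at a time.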

22. What is Chi-Square Test?


 A Chi-square test is a hypothesis testing method.
 There are two commonly used Chi-square tests:
o Chi-square goodness of fit test
o Chi-square test of independence.

23. With an assumption of a null hypothesis as correct, what does it mean when the p-values are high and low? (Nov/Dec 2023)
 Low P values: data are unlikely with a true null.
 A low p value means that the sample result would be unlikely if the null
hypothesis were true and leads to the rejection of the null hypothesis.
 High P values: data are likely with a true null.
 A high p value means that the sample result would be likely if the null
hypothesis were true and leads to the retention of the null hypothesis.


24. Define the term one-factor ANOVA. (Nov/Dec 2023)


 The most basic method is the single-factor analysis of variance, which is
also known as the one-way ANOVA simply because this method contains
just one factor (single factor).
 A single factor with a maximum of two levels can still be analyzed using
the t-test or z-test or other appropriate tests.
 However, the single factor with more than two levels will need ANOVA
with advanced methods depending on the experimental situations.
 The most basic single factor with more than two levels is the completely
randomized design (CRD).

25. Differences Between ANOVA & T-Test (Apr/May 2024)

| Comparison variable | T-TEST | ANOVA |
|---|---|---|
| Definition | The t-test is a statistical hypothesis test used to compare the means of two population groups. | ANOVA is a statistical technique used to compare the means of more than two population groups. |
| Feature | The t-test compares two samples, with sample sizes (n) both below 30. | ANOVA compares three or more such groups. |
| Example | Samples of class A and class B students who have taken a mathematics course may have different means and standard deviations. | When one crop is being cultivated from various seed varieties. |
| Test | The t-test can be performed as a two-sided or a one-sided test. | ANOVA is a one-sided test because a variance cannot be negative. |

26. Write short note on F-Test (Apr/May 2024)

• A test whose test statistic has an F-distribution under the null hypothesis is known as an F test.

• It is used to compare statistical models fitted to the available data set.

• The F-statistic, or F-value, is calculated as follows: $F = \sigma_1^2 / \sigma_2^2$, or Variance 1 / Variance 2.

• Hypothesis testing of variance relies directly upon the F-distribution for its comparison.

• Example: If a researcher wants to test whether or not two independent samples have been drawn from a normal population with the same variability, then the researcher generally employs the F-test.


PART B

1. Discuss in detail about t-test for one sample – sampling distribution of t and t-test procedure with a case study – Gas Mileage Investigation.
 Sampling Distribution of t
 The distribution that would be obtained if a value of t were
calculated for each sample mean for all possible random
samples of a given size from some population.
 Degrees of freedom
 Degrees of freedom (df) refers to the number of values free to
vary when, for example, sample variability is used to estimate
the unknown population variability.

$df = n - 1$, where df represents degrees of freedom and n equals the sample size.

Figure 4.1 - Various t distributions.


 Figure 4.1 shows three t distributions.
 When there is an infinite (∞) number of degrees of freedom, the
distribution of t is the same as the standard normal
distribution of z.
 Notice that even with only four or ten degrees of freedom, a t
distribution shares a number of properties with the normal
distribution.
 All t distributions are symmetrical, unimodal, and bell-shaped, with a
dense concentration that peaks in the middle (when t equals 0) and tapers
off both to the right and left of the middle (as t becomes more positive or
negative, respectively).


Figure 4.2
Hypothesized sampling distribution of t (gas mileage investigation).

Problem 4.1
Find the critical t values for the following hypothesis tests:
(a) two-tailed test, α = .05, df = 12
(b) one-tailed test, lower tail critical, α = .01, df = 19
(c) one-tailed test, upper tail critical, α = .05, df = 38
(d) two-tailed test, α = .01, df = 48



 t TEST

t Ratio
• A replacement for the z ratio whenever the unknown population standard deviation must be estimated:

$t = \dfrac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}}$

with its t sampling distribution and n − 1 degrees of freedom.

• ESTIMATING THE STANDARD ERROR ($s_{\bar{X}}$)

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$

where $s_{\bar{X}}$ represents the estimated standard error of the mean; n equals the sample size; and s has been defined as

$s = \sqrt{\dfrac{SS}{n - 1}} = \sqrt{\dfrac{SS}{df}}$

• where s is the sample standard deviation;
• df refers to the degrees of freedom; and SS has been defined as

$SS = \sum (X - \bar{X})^2 = \sum X^2 - \dfrac{(\sum X)^2}{n}$

• This new version of the standard error, the estimated standard error of the mean, is used whenever the unknown population standard deviation must be estimated.


Problem 4.2
A consumers’ group randomly samples 10 “one-pound”
packages of ground beef sold by a supermarket. Calculate
(a) the mean and (b) the estimated standard error of the mean
for this sample, given the following weights in ounces: 16,
15, 14, 15, 14, 15, 16, 14, 14, 14.
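
A minimal sketch of how Problem 4.2 could be worked numerically, assuming the computation formula for the sum of squares, SS = ΣX² − (ΣX)²/n. The variable names are illustrative and not part of the original notes.

```python
import math

# Package weights in ounces, as listed in Problem 4.2
weights = [16, 15, 14, 15, 14, 15, 16, 14, 14, 14]

n = len(weights)
mean = sum(weights) / n

# Sum of squares via the computation formula: SS = sum(X^2) - (sum(X))^2 / n
ss = sum(x * x for x in weights) - sum(weights) ** 2 / n

# Sample standard deviation uses n - 1 degrees of freedom
s = math.sqrt(ss / (n - 1))

# Estimated standard error of the mean
se = s / math.sqrt(n)

print(f"mean = {mean:.2f} oz, SS = {ss:.2f}, s = {s:.3f}, SE = {se:.3f}")
```

The resulting mean and estimated standard error agree with the 14.7 ounces and 0.26 ounce quoted in Problem 4.3.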

 CALCULATIONS FOR THE t TEST


Panel I
• This panel generates values for the sample mean, $\bar{X}$, and the sample standard deviation, s.
• The sample standard deviation is obtained by first using the formula

$SS = \sum X^2 - \dfrac{(\sum X)^2}{n}$

and, after dividing the sum of squares, SS, by its degrees of freedom, n − 1, extracting the square root:

$s = \sqrt{\dfrac{SS}{n - 1}}$

Panel II
• Dividing the sample standard deviation, s, by the square root of the sample size, n, gives the value for the estimated standard error:

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$

Panel III
• Finally, dividing the difference between the sample mean, $\bar{X}$, and the null hypothesized value, $\mu_{hyp}$, by the estimated standard error yields the value of the t ratio:

$t = \dfrac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}}$


Problem 4.3
The consumers’ group suspects that a supermarket makes extra
money by supplying less than the specified weight of 16 ounces in
its “one-pound” packages of ground beef. Given that a random
sample of 10 packages yields a mean of 14.7 ounces and an
estimated standard error of the mean of 0.26 ounce, use the
customary step-by-step procedure to test the null hypothesis at the
.05 level of significance with t.
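
One way to sketch the decision in Problem 4.3, assuming a one-tailed test with the lower tail critical (H0: μ ≥ 16 against H1: μ < 16, which is a reading of the problem statement rather than something the notes spell out). The critical value of −1.833 is the standard t-table entry for α = .05, one tail, df = 9; variable names are illustrative.

```python
# Values given in Problem 4.3
sample_mean = 14.7   # ounces
mu_hyp = 16.0        # null hypothesized population mean, ounces
se = 0.26            # estimated standard error of the mean
df = 10 - 1          # n - 1 degrees of freedom

# t ratio: (sample mean - hypothesized mean) / estimated standard error
t = (sample_mean - mu_hyp) / se

# Tabled critical t for a lower-tail test, alpha = .05, df = 9 (assumed setup)
t_crit = -1.833

# Reject H0 when the observed t is at least as extreme as the critical t
reject_h0 = t <= t_crit

print(f"t = {t:.2f}, df = {df}, critical t = {t_crit}, reject H0: {reject_h0}")
```

Under these assumptions the observed t of about −5.00 falls well beyond the critical value, so H0 would be rejected at the .05 level.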

CONFIDENCE INTERVALS FOR μ BASED ON t

• When the population standard deviation is unknown and, therefore, must be estimated, as in the present case, t replaces z in the new formula for a confidence interval:

$\bar{X} \pm (t_{conf})(s_{\bar{X}})$

where $\bar{X}$ represents the sample mean; $t_{conf}$ represents a number (distributed with n – 1 degrees of freedom) from the t tables, which satisfies the confidence specifications for the confidence interval; and $s_{\bar{X}}$ represents the estimated standard error of the mean.
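
As an illustration only, the interval can be sketched with the ground beef figures from Problem 4.3 (mean 14.7, estimated standard error 0.26, n = 10). Reusing those numbers here is an assumption; the notes do not work this interval. The value 2.262 is the standard t-table entry for 95 percent confidence with df = 9.

```python
# Figures borrowed from Problem 4.3 (an assumption for this illustration)
sample_mean = 14.7   # ounces
se = 0.26            # estimated standard error of the mean

# Tabled t for a 95 percent confidence interval with df = 9
t_conf = 2.262

# Confidence interval: sample mean plus or minus t_conf times the standard error
margin = t_conf * se
lower = sample_mean - margin
upper = sample_mean + margin

print(f"95% CI for mu: {lower:.2f} to {upper:.2f} ounces")
```

Under these assumptions the interval runs from roughly 14.11 to 15.29 ounces, which excludes the specified weight of 16 ounces.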


2. Discuss in detail about t-test for two independent samples using the
case study – EPO Experiment.
TWO INDEPENDENT SAMPLES
 Observations in each sample are based on different (and unmatched)
subjects.
 When samples are independent, observations in one sample are not
paired, on a one-to-one basis, with observations in the other sample.

Sampling Distribution of $\bar{X}_1 - \bar{X}_2$
• Differences between sample means based on all possible pairs of random samples from two underlying populations.
• It represents the entire spectrum of differences between sample means based on all possible pairs of random samples from the two underlying populations.

Mean of the Sampling Distribution, $\mu_{\bar{X}_1 - \bar{X}_2}$

• The mean of the new sampling distribution of $\bar{X}_1 - \bar{X}_2$ equals the difference between population means, that is,

$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$

where $\mu_{\bar{X}_1 - \bar{X}_2}$ is the mean of the new sampling distribution and $\mu_1 - \mu_2$ is the difference between population means.

Standard Error of the Sampling Distribution, $\sigma_{\bar{X}_1 - \bar{X}_2}$

• A rough measure of the average amount by which any sample mean difference deviates from the difference between population means:

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

• where $\sigma_{\bar{X}_1 - \bar{X}_2}$ is the new standard error, $\sigma_1^2$ and $\sigma_2^2$ are the two population variances, and $n_1$ and $n_2$ are the two sample sizes.
t TEST


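Expressed in symbols,

$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}}$

which has a t sampling distribution with $n_1 + n_2 - 2$ degrees of freedom.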

Example 4.5
Find the critical t values for each of the following hypothesis tests:
(a) two-tailed test; α = .05; n1 = 12; n2 = 11
(b) one-tailed test, upper tail critical; α = .05; n1 = 15; n2 = 13
(c) one-tailed test, lower tail critical; α = .01; n1 = n2 = 25
(d) two-tailed test; α = .01; n1 = 8; n2 = 10

CALCULATIONS FOR THE t TEST

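Panel I
Requiring the most computational effort, this panel produces values for the two sample means, $\bar{X}_1$ and $\bar{X}_2$, and for the two sample sums of squares, $SS_1$ and $SS_2$, where

$SS_1 = \sum X_1^2 - \dfrac{(\sum X_1)^2}{n_1}$ and $SS_2 = \sum X_2^2 - \dfrac{(\sum X_2)^2}{n_2}$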

Panel II

Pooled Variance Estimate, $s_p^2$

• The most accurate estimate of the population variance (assumed to be the same for both populations) based on a combination of two sample sums of squares and their degrees of freedom:

$s_p^2 = \dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}$

Panel III

Estimated Standard Error, $s_{\bar{X}_1 - \bar{X}_2}$

• The standard deviation of the sampling distribution for the difference between means whenever the unknown variance common to both populations must be estimated.

• The estimated standard error, $s_{\bar{X}_1 - \bar{X}_2}$, is calculated by substituting the pooled variance, $s_p^2$, twice, once as an estimate for $\sigma_1^2$ and once as an estimate for $\sigma_2^2$; then dividing each term by its sample size, either $n_1$ or $n_2$; and finally, taking the square root of the entire expression, that is,

$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

Panel IV
• Finally, dividing the difference between the two sample means, $\bar{X}_1 - \bar{X}_2$, and the null hypothesized population mean difference, $(\mu_1 - \mu_2)_{hyp}$ (of zero), by the estimated standard error, $s_{\bar{X}_1 - \bar{X}_2}$, generates a value for the t ratio.
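
The four panels can be sketched as a single function. This is an illustrative sketch, not the notes' own worked table: the two score lists are hypothetical, and the function name `two_sample_t` is made up for this example.

```python
import math

def sum_of_squares(xs):
    """Computation formula: SS = sum(X^2) - (sum(X))^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def two_sample_t(x1, x2, hyp_diff=0.0):
    """Pooled-variance t ratio for two independent samples; returns (t, df)."""
    n1, n2 = len(x1), len(x2)
    # Panel I: sample means and sums of squares
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    ss1, ss2 = sum_of_squares(x1), sum_of_squares(x2)
    # Panel II: pooled variance estimate
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    # Panel III: estimated standard error of the mean difference
    se = math.sqrt(pooled_var / n1 + pooled_var / n2)
    # Panel IV: t ratio
    t = ((mean1 - mean2) - hyp_diff) / se
    return t, df

# Hypothetical scores for two independent groups (not taken from the notes)
group1 = [12, 5, 11, 11, 9, 18]
group2 = [7, 3, 4, 6, 3, 13]

t, df = two_sample_t(group1, group2)
print(f"t = {t:.3f} with df = {df}")
```

The observed t would then be compared with the critical t for df = n1 + n2 − 2 from the t table, as in Example 4.5.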


3. Discuss in detail about p-values.

p-value
• The p-value for a test result represents the degree of rarity of that result, given that the null hypothesis is true.
 Smaller p-values tend to discredit the null hypothesis and to support
the research hypothesis.
 The p-value represents the proportion of area, beyond the observed
result, in the tail of the sampling distribution.
 In the left panel of Figure 4.3, a relatively deviant (from zero)
observed t is associated with a small p-value that makes the null
hypothesis suspect, while in the right panel, a relatively non-deviant
observed t is associated with a large p-value that does not make the
null hypothesis suspect.

Figure 4.3 - Shaded sectors showing small and large p-values.

• Figure 4.3 illustrates one-tailed p-values that are appropriate whenever the investigator has an interest only in deviations in a particular direction, as with a one-tailed hypothesis test.
• Otherwise, two-tailed p-values are appropriate.
• Two-tailed p-values would require equivalent shaded areas to be located in both tails of the sampling distribution, and the resulting two-tailed p-value would be twice as large as its corresponding one-tailed p-value.


Level of Significance or p-Value?


 Specified before the test result has been observed, the level of
significance describes a degree of rarity that, if attained subsequently
by the test result, triggers the decision to reject H0.

 Specified after the test result has been observed, a p-value describes the most
impressive degree of rarity actually attained by the test result.

Example 4.7
Find the approximate p-value for each of the following test results:
(a) one-tailed test, upper tail critical; df = 12; t = 4.61
(b) one-tailed test, lower tail critical; df = 19; t = –2.41
(c) two-tailed test; df = 15; t = 3.76
(d) two-tailed test; df = 42; t = 1.305
(e) one-tailed test, upper tail critical; df = 11; t = –4.23 (Be careful!)
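
The notes expect approximate p-values to be read from t tables. As a cross-check, the tail area can also be approximated numerically. The sketch below integrates the t density with Simpson's rule using only the standard library; this is a swapped-in numerical approach, not the table lookup the notes describe, and the function names and the `tail` labels are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about zero."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def p_value(t, df, tail):
    """tail is 'upper', 'lower', or 'two' (labels invented for this sketch)."""
    if tail == "upper":
        return 1 - t_cdf(t, df)
    if tail == "lower":
        return t_cdf(t, df)
    return 2 * (1 - t_cdf(abs(t), df))

# Test results (a) to (e) from Example 4.7
cases = [("a", 4.61, 12, "upper"), ("b", -2.41, 19, "lower"),
         ("c", 3.76, 15, "two"), ("d", 1.305, 42, "two"),
         ("e", -4.23, 11, "upper")]
for label, t, df, tail in cases:
    print(f"({label}) p = {p_value(t, df, tail):.4f}")
```

Case (e) shows why the exercise says to be careful: the observed t lies in the tail opposite to the critical one, so the upper-tail area is close to 1 rather than close to 0.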

Example 4.8
Indicate which member of each of the following pairs of p-values
describes the more rare test result:
(a1) p > .05 (a2) p < .05
(b1) p < .001 (b2) p < .01
(c1) p < .05 (c2) p < .01
(d1) p < .10 (d2) p < .20
(e1) p = .04 (e2) p = .02
Answers: a2, b1, c2, d1, e2
Example 4.9
Treating each of the p-values in the previous exercise separately,
indicate those that would cause you to reject the null hypothesis at
the .05 level of significance.
Answers: a2, b1, b2, c1, c2, e1, e2


4. What is Statistical Significance? Discuss in detail.

 Tests of hypotheses often are referred to as tests of significance, and


test results are described as being statistically significant (if the null
hypothesis has been rejected) or as not being statistically significant
(if the null hypothesis has been retained).

 Statistical significance between pairs of sample means implies only


that the null hypothesis is probably false, and not whether it’s false
because of a large or small difference between population means.

 Rejecting the null hypothesis always refers to the population, such as


rejecting the hypothesized zero difference between two population means, while
statistically significant always refers to the sample, such as assigning statistical
significance to the observed difference between two sample means.

 Using excessively large sample sizes can produce statistically


significant results that lack importance.

 Statistical significance merely indicates that an observed effect, such


as an observed difference between the sample means, is sufficiently
large, relative to the standard error, to be viewed as a rare outcome.
 (Statistical significance also implies that the observed outcome is
reliable, that is, it would reappear as a similarly rare outcome in a
repeat experiment.)

• Rejecting H0 at, for instance, the .05 level of significance signifies that the probability of the observed, or a more extreme, result is less than or equal to .05, assuming H0 is true. This is a conditional probability that takes the form:

• Pr(the observed result, given H0 is true) ≤ .05.

• The probability of .05 depends entirely on the assumption that H0 is true, since that probability of .05 originates from the hypothesized sampling distribution centered about H0.


5. Discuss in detail about t-test for two related samples with case study.
t TEST
• The null hypothesis for two related samples can be tested with a t ratio:

$t = \dfrac{\bar{D} - D_{hyp}}{s_{\bar{D}}}$

• which has a t sampling distribution with n – 1 degrees of freedom, where n is the number of difference scores.
• $\bar{D}$ represents the sample mean of the difference scores; $D_{hyp}$ represents the hypothesized population mean (of zero) for the difference scores; and $s_{\bar{D}}$ represents the estimated standard error of $\bar{D}$.

CALCULATIONS FOR THE t TEST:

 The three panels show the computational steps that produce a t of


7.35 in the current experiment.
Panel I

• Panel I involves most of the computational labour, and it generates values for the sample mean difference, $\bar{D}$, and the sample standard deviation for the difference scores, $s_D$.

• To obtain the sample standard deviation, first use a variation on the computation formula for the sum of squares where X has been replaced with D, that is,

$SS_D = \sum D^2 - \dfrac{(\sum D)^2}{n}$

and then, after dividing the sum of squares, $SS_D$, by its degrees of freedom, n − 1, extract the square root, that is,

$s_D = \sqrt{\dfrac{SS_D}{n - 1}}$


Panel II
• Dividing the sample standard deviation, $s_D$, by the square root of its sample size, n, gives the estimated standard error, $s_{\bar{D}}$, that is,

$s_{\bar{D}} = \dfrac{s_D}{\sqrt{n}}$

Panel III

• Finally, dividing the difference between the sample mean, $\bar{D}$, and the null hypothesized value, $D_{hyp}$ (of zero), by the estimated standard error, $s_{\bar{D}}$, culminates in the value for the t ratio:

$t = \dfrac{\bar{D} - D_{hyp}}{s_{\bar{D}}}$
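
The three panels for related samples can be sketched as follows. The paired scores are hypothetical (they are not the data behind the t of 7.35 mentioned above, which the notes do not list), and `related_samples_t` is an illustrative name.

```python
import math

def related_samples_t(x1, x2, d_hyp=0.0):
    """t ratio for two related samples; returns (t, df)."""
    # Difference score for each matched pair
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    # Panel I: mean difference and standard deviation of difference scores
    mean_d = sum(d) / n
    ss_d = sum(v * v for v in d) - sum(d) ** 2 / n
    s_d = math.sqrt(ss_d / (n - 1))
    # Panel II: estimated standard error of the mean difference
    se = s_d / math.sqrt(n)
    # Panel III: t ratio
    t = (mean_d - d_hyp) / se
    return t, n - 1

# Hypothetical paired scores, one pair per position (not from the notes)
first = [10, 8, 7, 9, 6]
second = [8, 7, 5, 8, 6]

t, df = related_samples_t(first, second)
print(f"t = {t:.3f} with df = {df}")
```

Note that df is based on the number of pairs, not on the total number of observations.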


Example 4.10
An investigator tests a claim that vitamin C reduces the severity of
common colds. To eliminate the variability due to different family
environments, pairs of children from the same family are randomly
assigned to either a treatment group that receives vitamin C or a
control group that receives fake vitamin C. Each child estimates, on
a 10-point scale, the severity of their colds during the school year.
The following scores are obtained for ten pairs of children:

Using t, test the null hypothesis at the .05 level


6. What is F test? Discuss in detail about the purpose of F test with the case study.
• In the two-sample case, t reflects the ratio between the observed difference between the two sample means in the numerator and the estimated standard error in the denominator.
• For three or more samples, the null hypothesis is tested with a new ratio, the F ratio.
• Essentially, F reflects the ratio of the observed differences between all sample means (measured as variability between groups) in the numerator and the estimated error term or pooled variance estimate (measured as variability within groups) in the denominator term, that is,

$F = \dfrac{\text{variability between groups}}{\text{variability within groups}}$

• Like t, F has its own family of sampling distributions that can be consulted to test the null hypothesis. The resulting test is known as an F test.
• An F test of the null hypothesis is based on the notion that if the null hypothesis is true, both the numerator and the denominator of the F ratio would tend to be about the same, but if the null hypothesis is false, the numerator would tend to be larger than the denominator.

If Null Hypothesis Is True

• If the null hypothesis is true (because there is no treatment effect due to different sleep deprivation periods), the two estimates of variability (between and within groups) would reflect only random error. In this case,

$F = \dfrac{\text{random error}}{\text{random error}}$

• Except for chance, estimates in both the numerator and the denominator are similar, and generally, F varies about a value of 1.
If Null Hypothesis Is False
• If the null hypothesis is false (because there is a treatment effect due to different sleep deprivation periods), both estimates still would reflect random error, but the estimate for between groups would also reflect the treatment effect. In this case,

$F = \dfrac{\text{random error} + \text{treatment effect}}{\text{random error}}$

• When the null hypothesis is false, the presence of a treatment effect tends to cause a chain reaction:
o The observed differences between group means tend to be large, as does the variability between groups.
o Accordingly, the numerator term tends to exceed the denominator term, producing an F whose value is larger than 1.
• When the null hypothesis is false because of a large treatment effect, there is an even more pronounced chain reaction, beginning with very large observed differences between group means and ending with an F whose value tends to be considerably larger than 1.


• Full-fledged F tests for Outcomes A and B agree with the earlier intuitive decisions.
 Given the .05 level of significance, the null hypothesis should be
retained for Outcome A, since the observed F of 0.75 is smaller than
the critical F of 5.14.
 However, the null hypothesis should be rejected for Outcome B,
since the observed F of 7.36 exceeds the critical F.
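
A sketch of how an F ratio of this kind is computed for a one-factor design. The outcome table itself did not survive in these notes, so the scores below are hypothetical values chosen to be consistent with the figures quoted in the text (group values of 5, 6, 4 and 2, 5, 8, and observed F ratios of 0.75 and 7.36); the function name `f_ratio` is illustrative.

```python
def f_ratio(groups):
    """One-factor ANOVA F ratio; returns (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability between groups: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability within groups: scores around their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    # Mean squares are sums of squares divided by their degrees of freedom
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

# Hypothetical scores, three groups of three subjects each (assumed data)
outcome_a = [[3, 5, 7], [4, 8, 6], [2, 4, 6]]
outcome_b = [[0, 4, 2], [3, 6, 6], [6, 8, 10]]

critical_f = 5.14  # critical F quoted in the notes for the .05 level
for name, data in [("A", outcome_a), ("B", outcome_b)]:
    f, df_b, df_w = f_ratio(data)
    decision = "reject H0" if f >= critical_f else "retain H0"
    print(f"Outcome {name}: F({df_b}, {df_w}) = {f:.2f} -> {decision}")
```

With these assumed scores the sketch reproduces the decisions stated above: H0 is retained for Outcome A and rejected for Outcome B.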

Example 4.11
If the null hypothesis is true, both the numerator and denominator
of the F ratio would reflect only (a) . If the null hypothesis is false, the
numerator of the F ratio would also reflect the (b). If the null
hypothesis is false because of a large treatment effect, the value of F
would tend to be considerably larger than (c).

Example 4.12
Find the critical values for the following F tests:
(a) α = .05, dfbetween = 1, dfwithin = 18
(b) α = .01, dfbetween = 3, dfwithin = 56
(c) α = .05, dfbetween = 2, dfwithin = 36
(d) α = .05, dfbetween = 4, dfwithin = 95

Example 4.13
Find the approximate p-value for the following observed F ratios,
where the numbers in parentheses refer to the degrees of freedom
in the numerator and denominator, respectively.
(a) F (2, 11) = 4.56
(b) F (1, 13) = 11.25
(c) F (3, 20) = 2.92
(d) F (2, 29) = 3.66


7. What is ANOVA? Discuss in detail about one factor ANOVA.

Analysis of Variance (ANOVA)

o When data are quantitative, an overall test of the null hypothesis for more than two population means is known as analysis of variance.

 One-Factor ANOVA
o The simplest type of ANOVA that tests for differences among population means
categorized by only one independent variable.

 Two Possible Outcomes Example

 Table shows two fictitious experimental outcomes that, when analysed with ANOVA,
produce different decisions about the null hypothesis: It is retained for one outcome but
rejected for the other.
TWO SOURCES OF VARIABILITY
Differences between Group Means
 Differences of 5, 6, and 4 appear between group means in Outcome A, and these
relatively small differences might reflect only chance.

 Even though the null hypothesis is true (because sleep deprivation does not affect the
subjects’ aggression scores), group means tend to differ merely because of chance
sampling variability.
 It’s reasonable to expect, therefore, that the null hypothesis for Outcome A should not
be rejected.
 There appears to be a lack of evidence that sleep deprivation affects the subjects’
aggression scores in Outcome A.
 On the other hand, differences of 2, 5, and 8 appear between the group means for
Outcome B, and these relatively large differences might not be attributable to chance.
 Instead, they indicate that the null hypothesis probably is false (because sleep
deprivation affects the subjects’ aggression scores). It’s reasonable to expect,
therefore, that the null hypothesis for Outcome B should be rejected.
 There appears to be evidence of a treatment effect, that is, the existence of at least one
difference between the population means defined by the independent variable (sleep
deprivation).

Two-Factor ANOVA
 A more complex type of analysis that tests whether differences exist among population
means categorized by two factors or independent variables.

Example
 For computational simplicity, assume that the social psychologist randomly
assigns two subjects to be tested (one at a time) with crowds of either zero, two,
or four people and either the nondangerous or dangerous conditions.
 The resulting six groups, each consisting of two subjects, represent all possible
combinations of the two factors.

Note: Shaded numbers are means.


The shaded numbers represent four different types of means:
 The three column means (9, 12, 15) represent the mean reaction times for each crowd
size when degree of danger is ignored. Any differences among these column means not
attributable to chance are referred to as the main effect of crowd size on reaction time.
In ANOVA, main effect always refers to the effect of a single factor, such as crowd
size, when any other factor, such as degree of danger, is ignored.
 The two row means (8, 16) represent the mean reaction times for degree of danger when
crowd size is ignored. Any difference between these row means not attributable to
chance is referred to as the main effect of degree of danger on reaction time.
 The mean of the reaction times for each group of two subjects yields the six means (8, 7, 9,
10, 17, 21) for each combination of the two factors. Often referred to as cell means or
treatment-combination means, these means reflect not only the main effects for crowd
size and degree of danger described earlier but, more importantly, any effect due to the
interaction between crowd size and degree of danger, as described below.

 Finally, the mean of all three column means (or of both row means)
yields the overall or grand mean (12) for all subjects in the study.
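
The four types of means described above can be reproduced from the six cell means alone. The following is a minimal sketch, assuming that the first three cell means (8, 7, 9) belong to the nondangerous row and the last three (10, 17, 21) to the dangerous row, which is the only arrangement consistent with the stated row and column means. The variable names are illustrative. Because every cell holds the same number of subjects, averaging cell means gives the same result as averaging raw scores.

```python
# Cell means for the smoke alarm example.
# Assumed layout: rows = degree of danger, columns = crowd size (0, 2, 4 people).
cell_means = {
    "nondangerous": [8, 7, 9],
    "dangerous": [10, 17, 21],
}


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Row means: effect of degree of danger when crowd size is ignored.
row_means = {danger: mean(cells) for danger, cells in cell_means.items()}

# Column means: effect of crowd size when degree of danger is ignored.
column_means = [mean(column) for column in zip(*cell_means.values())]

# Grand mean: the mean of the column means (or, equally, of the row means).
grand_mean = mean(column_means)

# A quick look at interaction: the dangerous-minus-nondangerous difference
# at each crowd size. Unequal differences hint at an interaction.
danger_effect_by_crowd = [
    dangerous - nondangerous
    for nondangerous, dangerous in zip(
        cell_means["nondangerous"], cell_means["dangerous"]
    )
]

print(row_means, column_means, grand_mean, danger_effect_by_crowd)
```

The last list shows that the gap between dangerous and nondangerous conditions is not the same at every crowd size, which is the pattern an interaction test examines formally.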

Main Effect
 The effect of a single factor when any other factor is ignored.
Graphs for Main Effects
 The slanted line in panel A of Figure 4.4 depicts the large differences between column
means, that is, between mean reaction times for subjects, regardless of degree of danger,
with crowds of zero, two, and four people.
 The relatively steep slant of this line suggests that the null hypothesis for crowd size
might be rejected.
 The steeper the slant is, the larger the observed differences between column means
and the greater the suspected main effect of crowd size.
 On the other hand, a fairly level line in panel A of Figure 4.4 would have reflected the
relative absence of any main effect due to crowd size.
 The slanted line in panel B of Figure 4.4 depicts the large difference between row
means, that is, between mean reaction times for dangerous and non dangerous
conditions, regardless of crowd size.
 The relatively steep slope of this line suggests that the null hypothesis for degree of
danger also might be rejected; that is, there might be a main effect due to degree of
danger.

 Figure 4.4 depicts the large differences between column means
Example 4.14
A college dietician wishes to determine whether students prefer a particular pizza
topping (either plain, vegetarian, salami, or everything) and one type of crust
(either thick or thin). A total of 160 volunteers are randomly assigned to one of
the eight cells in this two-factor experiment. After eating their assigned pizza, the
20 subjects in each cell rate their preference on a scale ranging from 0 (inedible)
to 10 (the best). The results, in the form of means for cells, rows, and columns, are
as follows:

 Construct graphs for each of the three possible effects, and use this information to
make preliminary interpretations about pizza preferences. Ordinarily, of course, you
would verify these speculations by performing an ANOVA—a task that cannot be
performed for these data, since only means are supplied.

Figure 4.5, F ratios in both a one- and a two-factor ANOVA


 As suggested in Figure 4.5, F ratios in both a one- and a two-factor ANOVA always
consist of a numerator (shaded) that measures some aspect of variability between groups
or cells and a denominator that measures variability within groups or cells.
 In a one-factor ANOVA, a single null hypothesis is tested with one F ratio.
 In two-factor ANOVA, three different null hypotheses are tested, one at a time,
with three F ratios: Fcolumn, Frow, and Finteraction.
 The numerator of each of these three F ratios reflects a different aspect of variability
between cells:
 variability between columns (crowd size),
 variability between rows (degree of danger),
 interaction: any remaining variability between cells not attributable to either variability
between columns (crowd size) or rows (degree of danger).
 The shaded numerator terms for the three F ratios in the bottom panel of Figure 4.5 estimate
random error and, if present, a treatment effect (for subjects treated differently by the
investigator).
 The denominator term always estimates only random error (for subjects treated similarly in
the same cell).
 In practice, a sufficiently large F value is viewed as rare, given that the null hypothesis
is true, and therefore, it leads to the rejection of the null hypothesis.
 Otherwise, the null hypothesis is retained.

Test Results for Two-Factor Experiment


 As indicated in the boxed summary for the hypothesis test for a smoke alarm
experiment, test results agree with our preliminary interpretations based on graphs. Each
of the three null hypotheses is rejected at the .05 level of significance. The significant
main effects indicate that crowd size and degree of danger, in turn, influence the reaction
times of subjects to smoke. The significant interaction, however, indicates that the effect
of crowd size on reaction times differs for nondangerous and dangerous conditions.

8. What is Chi-Square Test?


 A Chi-square test is a hypothesis testing method.
 Two common Chi-square tests involve checking if observed frequencies
in one or more categories match expected frequencies.
 The chi-square test is a statistical test used to determine if there is a
significant association between two categorical variables.
 It is a non-parametric test, meaning it does not make assumptions
about the underlying distribution of the data.
 It compares the observed frequencies of the categories in a contingency
table with the expected frequencies that would occur under the
assumption of independence between the variables.
 The test calculates a chi-square statistic, which measures the
discrepancy between the observed and expected frequencies.

Types of Chi-square tests


 There are two commonly used Chi-square tests:
 the Chi-square goodness of fit test
 the Chi-square test of independence
Steps to perform a Chi-square test
 For both the Chi-square goodness of fit test and the Chi-square test of independence,
the same analysis steps apply, as listed below.
 Define your null and alternative hypotheses before collecting your data.
 Decide on the alpha value. For example, suppose we set α = 0.05 when testing for
independence. Here, we have decided on a 5% risk of concluding that the two variables are
related when in reality they are independent.
 Check the data for errors.
 Check the assumptions for the test.
 Perform the test and draw your conclusion

Properties of Chi-Square Test


 The chi-square test possesses several important properties that make it a valuable
statistical tool:
Aspect | Description
Non-parametric Test | The chi-square test is non-parametric, making no assumptions about the data’s underlying distribution. Applicable to categorical data.
Test for Independence | Examines association between categorical variables, determining significance of relationship or dependency, not strength or direction.
Goodness of Fit Test | Assesses how well observed data fit an expected distribution, comparing observed frequencies to expected frequencies.
Chi-Square Statistic | Measures discrepancy between observed and expected frequencies in a contingency table, indicating association or goodness of fit.
Degrees of Freedom | Depend on the number of categories in variables. Determine critical values and influence test result interpretation.
Null and Alternative Hypotheses | Null hypothesis assumes no association or difference, alternative hypothesis suggests presence of association or difference.
Test Statistic and P-value | Produces test statistic and corresponding p-value. Compare test statistic to critical value, p-value indicates probability under null hypothesis.
Interpretation | Null hypothesis rejected if test statistic exceeds critical value or p-value is less than chosen significance level. Indicates significant association or deviation.

9. Twenty-three overweight male volunteers are randomly assigned to three
different treatment programs designed to produce a weight loss by focusing on
either diet, exercise, or the modification of eating behavior. Weight changes
were recorded, to the nearest pound, for all participants who
completed the two-month experiment. Positive scores signify a weight
drop; negative scores, a weight gain.
Weight Change
Diet | Exercise | Behavior Modification
3 | -1 | 7
4 | 8 | 1
0 | 4 | 10
-3 | 2 | 0
5 | 2 | 18
10 | -3 | 12
3 | | 4
0 | | 6
 | | 5
T: 22 | 12 | 63
N: 8 | 6 | 9
ΣX = G = 97; N = 23; ΣX² = 961

Summarize the results with an ANOVA table. (Nov/Dec 2023)


Solution
Step 1: Create ANOVA table
 First, let's set up an ANOVA table.
 We need five columns labeled
 Source of Variation (which includes Between Groups, Within Groups, and Total),
 Sum of Squares (SS),
 Degrees of Freedom (df),
 Mean Square (MS), and
 F-ratio (F).
 To fill out the table, we will need the group totals, the group sizes, and the overall total
and count for the three methods of weight loss.
Note that degrees of freedom for Between Groups is the number of groups minus 1 (3 − 1 = 2) and for
Within Groups is the total size minus the number of groups (23 − 3 = 20).
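
The remaining entries of the ANOVA table can be filled in with the usual computational formulas (SS total = ΣX² − G²/N, SS between = Σ(T²/n) − G²/N, SS within = SS total − SS between). The sketch below is one way to carry this out; it assumes the assignment of scores to groups that reproduces the stated totals (T = 22, 12, 63; N = 8, 6, 9; ΣX² = 961), and the function name `one_way_anova` is illustrative.

```python
# Weight changes (pounds) per treatment group. Assumption: scores are
# assigned to groups so that they reproduce the stated T, N and sum of squares.
groups = {
    "diet": [3, 4, 0, -3, 5, 10, 3, 0],
    "exercise": [-1, 8, 4, 2, 2, -3],
    "behavior_modification": [7, 1, 10, 0, 18, 12, 4, 6, 5],
}


def one_way_anova(groups):
    """Return the entries of a one-factor ANOVA summary table."""
    scores = [x for group in groups.values() for x in group]
    n_total = len(scores)
    grand_total = sum(scores)
    correction = grand_total ** 2 / n_total  # G^2 / N

    ss_total = sum(x * x for x in scores) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction
    ss_within = ss_total - ss_between

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "ss_total": ss_total, "df_total": n_total - 1,
        "F": ms_between / ms_within,
    }


table = one_way_anova(groups)

# Critical F for df = (2, 20) at the .05 level, read from a standard F table.
CRITICAL_F = 3.49
reject_null = table["F"] > CRITICAL_F

for key, value in table.items():
    print(f"{key:>11}: {value:.3f}")
print("reject H0:", reject_null)
```

With these scores the observed F falls short of the critical F, so the null hypothesis of equal population mean weight changes would be retained at the .05 level.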

10. The F test describes the ratio of two sources of variability: that for subjects treated
differently and that for subjects treated similarly. Is there any sense in which
the t test for two independent groups can be viewed likewise? (Nov/Dec 2023)
Short Answer
 Yes, the t test for two independent groups can be viewed similarly to the F test in the
sense that both are examining variability.
 They differ in their specifics: the t test is comparing the mean difference against the
variability within groups, while the F test is comparing the variability between groups
against the variability within groups.
Step by step solution
Step 1: Understanding The F Test
 The F test is normally used in the context of Analysis of Variance (ANOVA) to compare
the variances between different groups. It is essentially a ratio of two estimates of variance:
the variance between groups (numerator), and the variance within groups (denominator). If
the between-group variance is significantly greater than the within-group variance, it
would suggest that the means of the groups differ.
Step 2: Understanding The t Test
 The t-test for two independent groups is used to compare the means of those groups to
determine if they are significantly different. The t-test is calculated using the mean
difference between the two groups (numerator) and the variability within the groups
(denominator).
Step 3: Identifying the Link between The t Test and The F Test
 In the t-test, we are technically comparing variability too, though we are specifically
interested in whether the difference between the two group means is greater than what we
would expect by chance. In the F test, we're more broadly comparing variability to examine if
the amount of variability between group means is larger than the variability within groups.
In fact, with two independent groups and a pooled variance estimate, squaring the t ratio
gives the F ratio (F = t²), so the two tests lead to the same decision. The F test, however,
extends to more than two groups, so the two tests are not quite serving the same purpose.
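
The link between the two ratios can be checked numerically. The sketch below uses two small groups of hypothetical scores (invented purely for illustration) and computes both the pooled-variance t ratio and the one-factor F ratio; the helper names are assumptions, not part of the original solution.

```python
import math
from statistics import mean, variance

# Hypothetical scores for two independent groups (illustration only).
group_1 = [12, 15, 11, 14, 13]
group_2 = [9, 10, 8, 12, 11]


def pooled_t(a, b):
    """t ratio: mean difference divided by its pooled standard error."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / standard_error


def one_way_f(a, b):
    """F ratio: variability between groups over variability within groups."""
    grand_mean = mean(a + b)
    ss_between = (len(a) * (mean(a) - grand_mean) ** 2
                  + len(b) * (mean(b) - grand_mean) ** 2)
    ss_within = (sum((x - mean(a)) ** 2 for x in a)
                 + sum((x - mean(b)) ** 2 for x in b))
    ms_between = ss_between / 1  # df between = 2 groups - 1
    ms_within = ss_within / (len(a) + len(b) - 2)
    return ms_between / ms_within


t_ratio = pooled_t(group_1, group_2)
f_ratio = one_way_f(group_1, group_2)
print(t_ratio ** 2, f_ratio)  # the two values agree up to rounding error
```

The denominator of t is the square root of the same within-group variance estimate that forms the denominator of F, which is why the squared t ratio matches F.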

11. Brief about Partial squared curvilinear correlation. What is its purpose? (Nov/Dec 2023)
 The partial correlation coefficient (or partial correlation) is a statistical measure that
quantifies the linear association between two variables, while controlling for the effects of
one or more other variables.
 It measures the strength of the association between two variables, while accounting for
the effects of other variables that may also be related to both of them.
 The formula for the partial correlation coefficient is similar to the formula for the
Pearson correlation coefficient, but it includes a term that adjusts for the effects of other
variables.
 The formula can be represented as:

 rxy.z = (rxy - rxz*ryz) / sqrt((1-rxz^2)(1-ryz^2))


 where x, y and z are variables, rxy is the correlation coefficient between x and y, rxz is
the correlation coefficient between x and z, and ryz is the correlation coefficient between y
and z.
 The value of the partial correlation coefficient ranges from -1 to 1, where a value of -1
indicates a perfect negative association, a value of 0 indicates no association, and a
value of 1 indicates a perfect positive association.
 The partial correlation coefficient is a measure of association and not a measure of
causality.
 A high partial correlation coefficient does not imply that one variable causes the other, only
that the two variables have a strong association when controlling for the effect of other
variables.
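
The formula above translates directly into a short function. This is a minimal sketch; the function name and the three input correlations are hypothetical values chosen only to show the calculation.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after controlling for z.

    Implements r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)).
    """
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical correlations (illustration only).
r_controlled = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.4)

# If z is unrelated to both x and y, controlling for it changes nothing.
r_unchanged = partial_correlation(r_xy=0.6, r_xz=0.0, r_yz=0.0)

print(round(r_controlled, 3), r_unchanged)
```

In the first call the association between x and y shrinks once the shared influence of z is removed; in the second call it stays at 0.6 because z carries no shared influence.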

12. Brief about TUKEY'S HSD Test. Additionally, explain in brief about two-factor
ANOVA. (Nov/Dec 2023)

TUKEY'S HSD Test

 The Tukey HSD ("honestly significant difference" or "honest significant difference")
test is a statistical tool used to determine whether the difference between two group
means is statistically significant, that is, whether an observed difference between a
pair of means is larger than would be expected by chance.
 The Tukey test is a way to test an experimental hypothesis.
 The Tukey's honestly significant difference test (Tukey's HSD) is used to test
differences among sample means for significance.
 The Tukey's HSD tests all pairwise differences while controlling the probability of making
one or more Type I errors.
 The Tukey's HSD test is one of several tests designed for this purpose and fully controls
this Type I error rate.
 The value of the Tukey test is given by taking the absolute value of the difference between
pairs of means and dividing it by the standard error of the mean (SE) as determined by
a one-way ANOVA test.
 The SE is in turn the square root of (variance divided by sample size).
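
The statistic described in the last two points can be sketched as follows. The group means, the within-group mean square and the group size are hypothetical numbers, and the sketch assumes equal group sizes with the within-group mean square from the ANOVA used as the variance. Each resulting value would then be compared with a critical value from a studentized range table, which is not computed here.

```python
import math
from itertools import combinations


def tukey_statistics(group_means, ms_within, n_per_group):
    """Absolute pairwise mean differences divided by SE = sqrt(variance / n)."""
    standard_error = math.sqrt(ms_within / n_per_group)
    return {
        (a, b): abs(group_means[a] - group_means[b]) / standard_error
        for a, b in combinations(group_means, 2)
    }


# Hypothetical ANOVA results (illustration only).
group_means = {"A": 10.0, "B": 14.0, "C": 11.0}
pairwise = tukey_statistics(group_means, ms_within=8.0, n_per_group=8)

for pair, value in pairwise.items():
    print(pair, value)
```

Testing every pair against the same critical value is what lets the procedure control the probability of one or more Type I errors across all comparisons.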

Two-factor ANOVA.

 ANOVA (Analysis of Variance) is a statistical test used to analyze the difference
between the means of more than two groups.
 A two-way ANOVA is used to estimate how the mean of a quantitative variable changes
according to the levels of two categorical variables.
 The two-way ANOVA compares the mean differences between groups that have been split
on two independent variables (called factors).
 The primary purpose of a two-way ANOVA is to understand if there is an interaction
between the two independent variables on the dependent variable.
 For example, one could use a two-way ANOVA to understand whether there is an
interaction between gender and educational level on test anxiety amongst university

students, where gender (males/females) and education level (undergraduate/postgraduate)
are independent variables, and test anxiety is the dependent variable.
 Alternatively, one may want to determine whether there is an interaction between physical
activity level and gender on blood cholesterol concentration in children, where physical
activity (low/moderate/high) and gender (male/female) are independent variables, and
cholesterol concentration is the dependent variable.
 The interaction term in a two-way ANOVA indicates whether the effect of one of the
independent variables on the dependent variable is the same for all values of the other
independent variable (and vice versa).

How does the ANOVA test work?

 ANOVA tests for significance using the F test for statistical significance.
 The F test is a groupwise comparison test, which means it compares the variance in each
group mean to the overall variance in the dependent variable.
 If the variance within groups is smaller than the variance between groups, the F test
will find a higher F value, and therefore a higher likelihood that the difference
observed is real and not due to chance.
 A two-way ANOVA with interaction tests three null hypotheses at the same time:
 There is no difference in group means at any level of the first independent
variable.
 There is no difference in group means at any level of the second independent variable.
 The effect of one independent variable does not depend on the level of the other
independent variable.

13. A manufacturer of a gas additive claims that it improves gas mileage. A random sample of 30
drivers tests this claim by determining their gas mileage for a full tank of gas that contains the
additive (X1) and for a full tank of gas that does not contain the additive (X2). The sample mean
difference, Dbar, equals 2.12 miles (in favour of the additive), and the estimated standard error
equals 1.50 miles. (Apr/May 2024)

a) Using t, test the null hypothesis at the 0.05 level of significance.
b) Specify the p-value for this result.
c) Are there any special precautions that should be taken with the present experimental design?

Setting up the Hypotheses

 The null hypothesis, $H_0$, is that the gas additive makes no difference, hence the mean
difference $\mu_D = 0$.
 The alternative hypothesis, $H_1$, is that the gas additive improves gas mileage, hence $\mu_D > 0$.

Calculating the Test Statistic
 The sample mean difference, $\bar{D}$, is given as 2.12 miles and the estimated standard error is
1.50 miles.
 We can use these to calculate the test statistic, $t = (\bar{D} - \mu_D)/s_{\bar{D}}$, where
$s_{\bar{D}} = s_D/\sqrt{n}$ is the standard error of $\bar{D}$, $s_D$ is the sample standard
deviation of $D$ and $n$ is the sample size.
Plugging in the given values, we get $t = (2.12 - 0)/1.50 = 1.413$.

Getting the Critical Value


 We are working with a .05 level of significance.
 The degrees of freedom 𝑑𝑓 for a paired samples t-test is equal to the number of pairs minus
one (n − 1), in this case, 𝑑𝑓=30−1=29.
 Looking up the t-distribution critical values table (one-tailed, as we're interested only in
whether the additive improves mileage, not whether it decreases it), we find that the critical
value is approximately 1.699.

Decision
 Since the calculated t (1.413) is less than the critical value (1.699), we cannot reject the null
hypothesis.
 This means that, statistically, the gas additive does not significantly improve mileage.
Calculating P-Value
 The p-value is the probability of observing a t-score as extreme as the calculated t given that
the null hypothesis is true.

 While it's typically found using statistical software or a calculator, let's just note that it will
be greater than the significance level of 0.05 since we failed to reject the null hypothesis.
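
The test statistic, decision and approximate p-value above can be reproduced with a short sketch. The standard library has no t distribution, so the upper-tail area is approximated here by numerically integrating the t density with Simpson's rule; that numerical approach and the helper names `t_pdf` and `upper_tail_p` are assumptions made for illustration, and statistical software would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coefficient = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coefficient * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=10_000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


d_bar = 2.12           # sample mean difference (miles)
standard_error = 1.50  # estimated standard error of the mean difference
n_pairs = 30

t_observed = (d_bar - 0) / standard_error
df = n_pairs - 1
CRITICAL_T = 1.699     # one-tailed, .05 level, df = 29 (from the t table)

reject_null = t_observed > CRITICAL_T
p_value = upper_tail_p(t_observed, df)

print(f"t = {t_observed:.3f}, df = {df}, reject H0: {reject_null}")
print(f"one-tailed p is approximately {p_value:.3f}")
```

The computed p-value lies between .05 and .10, which matches what a t table shows for df = 29: the observed t falls between the tabled values for those two levels.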

Precautions about Experimental Design


 This test assumes that the difference scores of the drivers are independent of one another
and derive from a normally distributed population.
 Furthermore, factors such as different types of roads, traffic conditions, and weather can
affect gas mileage.
 The drivers should ideally be driving under similar conditions for fair comparison. It would
also be good to randomize the order in which each driver uses the additive and non-additive
gas.

Conclusion:

 The null hypothesis that the additive does not improve gas mileage cannot be rejected at the
0.05 significance level since the t-statistic (1.413) is less than the critical value (1.699)
 The p-value is greater than 0.05. Special precautions for experimental design include having a
large and representative sample size, controlling for external factors influencing gas mileage, and
randomizing the order of utilizing the additive and non-additive gas.

14. A library system lends books for periods of 21 days. This policy is being reevaluated in view of
a possible new loan period that could be either longer or shorter than 21 days. To aid in making
this decision, book-lending records were consulted to determine the loan periods actually used
by the patrons. A random sample of eight records revealed the following loan periods in days:
and 16. Test the null hypothesis with t, using the .05 level of significance. (Apr/May 2024)

Calculation of Sample Mean and Sample Standard Deviation

 To begin the t-test, first calculate the sample mean (𝑥¯) and sample standard deviation (𝑠).
 The sample mean is the average of the data points, which can be found by summing all of the data
and dividing by the number of data points, which is 8.
 The sample standard deviation is a measure of the amount of variance or dispersion in the data set.
Calculation of Sample Standard Error
 Once we have the standard deviation, we can calculate the standard error of the mean. The
standard error is the standard deviation divided by the square root of the number of data
points, which is 8 in this case.

Calculation of t-value
 The calculated t-value, which is the test statistic, can now be calculated using the
formula $t = (\bar{x} - \mu)/(s/\sqrt{n})$, where $\mu$ is the assumed population mean (21 days in
our case), $\bar{x}$ is the sample mean calculated in step 1, $s$ is the standard deviation
calculated in step 1, and $n$ is the number of data points.

Determination of Critical t-value


 The next step is to determine the critical t-value. The critical t-value is dependent on the
specified significance level (0.05 in our case) and degrees of freedom.
 For a two-tailed test with 7 degrees of freedom (n-1), and a 0.05 significance, the critical t-
value can be obtained from the t-distribution table.

Comparison of Calculated t-value and Critical t-value


 If the absolute value of the calculated t-value is greater than the critical t-value, we reject the
null hypothesis and conclude that the average loan period is significantly different from 21
days.
 If the absolute value of the calculated t-value is less than or equal to the critical t-value, we
retain the null hypothesis and conclude that the average loan period is not significantly
different from 21 days.

Conclusion:
 The result of the hypothesis test (whether the average loan period is significantly different
from 21 days or not) is determined by comparing the absolute value of the calculated t-
value with the critical t-value from the t-distribution table for a significance level of 0.05
and 7 degrees of freedom.
 If the absolute value of the calculated t-value is greater than the critical t-value, it means
there is sufficient evidence to reject the null hypothesis.
 If it's less than or equal to the critical t-value, it means there is not enough evidence to
reject the null hypothesis.
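
The procedure above can be sketched in code. Because the full list of loan periods is not reproduced in these notes, the eight values below are hypothetical and serve only to illustrate the steps; the critical value 2.365 is the tabled two-tailed value for the .05 level with 7 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Hypothetical loan periods in days (illustration only, not the exam data).
loan_periods = [17, 21, 14, 24, 19, 20, 13, 16]
HYPOTHESIZED_MEAN = 21  # current loan policy, in days
CRITICAL_T = 2.365      # two-tailed, .05 level, df = 7 (from the t table)

n = len(loan_periods)
sample_mean = mean(loan_periods)
sample_sd = stdev(loan_periods)            # uses n - 1 in the denominator
standard_error = sample_sd / math.sqrt(n)  # s / sqrt(n)

t_observed = (sample_mean - HYPOTHESIZED_MEAN) / standard_error
df = n - 1

# Two-tailed decision rule: compare |t| with the critical value.
reject_null = abs(t_observed) > CRITICAL_T

print(f"mean = {sample_mean}, s = {sample_sd:.3f}, SE = {standard_error:.3f}")
print(f"t = {t_observed:.3f}, df = {df}, reject H0: {reject_null}")
```

For these hypothetical values the absolute t falls just below the critical value, so the null hypothesis would be retained; the same code applies unchanged once the actual eight loan periods are substituted.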

15. A random sample of 90 college students indicates whether they most desire love, wealth, power, health,
fame, or family happiness. Using the 0.05 level of significance and the following results, test the null hypothesis
that, in the underlying population, the various desires are equally popular, using the chi-square test. (Apr/May
2024)

Desires of college students

Frequency | Love | Wealth | Power | Health | Fame | Family Happiness | Total
Observed (fo) | 25 | 10 | 5 | 25 | 10 | 15 | 90

Solution: Formula For Chi-Square Test

$$\chi^2_c = \sum \frac{(O - E)^2}{E}$$

Where

c = Degrees of freedom

O = Observed Value

E = Expected Value

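Step 1: Define the Hypothesis

 H0: In the underlying population, the six desires are equally popular, so each
proportion equals 1/6.

 H1: H0 is false; at least one proportion differs from 1/6.

Step 2: Calculate the Expected Values

 To calculate the expected frequency for a category, multiply the sample size by the
proportion hypothesized for that category under H0.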
 For example, the expected value for Love is:

Expected value = 90 × 1/6 = 15

Since all six desires are hypothesized to be equally popular, the expected value for each of the
other cells is also 15.

Step 3: Calculate (O − E)² / E for Each Cell in the Table

 Calculate (O − E)² / E for each cell in the table, where O is the observed value and E is
the expected value.

Step 4: Calculate the Test Statistic χ²

χ² is the sum of all the values calculated in Step 3.

Null Hypothesis:

p1 = p2 = p3 = p4 = p5 = p6 = 1/6

Alternative Hypothesis:

H0 is not true

Conclusion:

 Based on our data, college students do not prefer love, wealth, power, health, fame, and
family happiness equally.
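
The four steps of this goodness of fit test can be sketched as follows. The critical value 11.07 is the tabled chi-square value for the .05 level with 5 degrees of freedom; the variable names are illustrative.

```python
# Observed frequencies for the six desires (from the question).
observed = {
    "love": 25,
    "wealth": 10,
    "power": 5,
    "health": 25,
    "fame": 10,
    "family happiness": 15,
}

sample_size = sum(observed.values())

# Under H0 every desire has proportion 1/6, so each expected frequency is n/6.
expected = sample_size / len(observed)

# (O - E)^2 / E for each cell, then the sum gives the chi-square statistic.
contributions = {
    desire: (count - expected) ** 2 / expected
    for desire, count in observed.items()
}
chi_square = sum(contributions.values())

df = len(observed) - 1  # number of categories minus one
CRITICAL_CHI_SQUARE = 11.07  # .05 level, df = 5 (from the chi-square table)
reject_null = chi_square > CRITICAL_CHI_SQUARE

print(f"expected per cell = {expected}")
print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_null}")
```

The observed chi-square exceeds the critical value, which is consistent with the conclusion that the desires are not equally popular.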

16. An investigator polls common cold sufferers, asking them to estimate the number of hours of
physical discomfort caused by their most recent colds. Assume that their estimates approximate a
normal curve with a mean of 83 hours and a standard deviation of 20 hours. (Apr/May 2024)

i. What is the estimated number of hours for the shortest-suffering 5 percent? (3)

ii. What proportion of sufferers estimate that their colds lasted longer than 48 hours? (2)

iii. What proportion suffered for fewer than 61 hours? (2)

iv. What is the estimated number of hours suffered by the extreme 1 percent either above
or below the mean? (2)

v. What proportion suffered for between 1 and 3 days, that is, between 24 and 72 hours? (3)

vi. What proportion suffered for between 2 and 4 days?

Solution

i) What is the estimated number of hours for the shortest-suffering 5 percent?
Here the mean is 83 and the standard deviation is 20. The z score that cuts off the lowest
5 percent is -1.645, so X = 83 + (-1.645)(20) = 50.1;
that is, these people suffer 50.1 hours or less.
ii) What proportion of sufferers estimate that their colds lasted longer than 48 hours?
z = (48 - 83)/20 = -1.75. Above = 0.96 = 96%.
iii) What proportion suffered for fewer than 61 hours?
z = (61 - 83)/20 = -1.10. Below = 0.1357 = 13.57%

iv) What is the estimated number of hours suffered by the extreme 1 percent either above or
below the mean?

Below = 31.48 hours and above = 134.52 hours (83 ± 2.576 × 20). Here we assume that the 1% is the
sum of the extreme low and extreme high, that is, 0.5 percent in each tail.

Conclusion:
(a) The estimated number of hours for the shortest-suffering 5 percent is 50.1 hours.
(b) The proportion of sufferers who estimate their colds lasted longer than 48 hours is 0.9599
(95.99%).
(c) The proportion who suffered for fewer than 61 hours is 0.1357 (13.57%).
(d) The estimated number of hours suffered by the extreme 1 percent either above or below the
mean is between 31.48 and 134.52 hours.
(e) The proportion who suffered for between 1 and 3 days (24 and 72 hours) is 0.2896 (28.96%).
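
These normal curve answers can be checked with Python's `statistics.NormalDist`. The sketch below is only a cross-check of the table lookups; small differences in the last decimal place arise because the tables round z to two places. Part vi, which is not worked above, is computed the same way for 48 to 96 hours.

```python
from statistics import NormalDist

# Estimated hours of discomfort: normal with mean 83 and standard deviation 20.
colds = NormalDist(mu=83, sigma=20)

# i) Score that cuts off the shortest-suffering 5 percent.
shortest_5_percent = colds.inv_cdf(0.05)

# ii) Proportion lasting longer than 48 hours.
longer_than_48 = 1 - colds.cdf(48)

# iii) Proportion suffering fewer than 61 hours.
fewer_than_61 = colds.cdf(61)

# iv) Extreme 1 percent, split as 0.5 percent in each tail.
extreme_low = colds.inv_cdf(0.005)
extreme_high = colds.inv_cdf(0.995)

# v) Proportion between 1 and 3 days (24 and 72 hours).
between_24_and_72 = colds.cdf(72) - colds.cdf(24)

# vi) Proportion between 2 and 4 days (48 and 96 hours).
between_48_and_96 = colds.cdf(96) - colds.cdf(48)

print(f"i)   {shortest_5_percent:.1f} hours")
print(f"ii)  {longer_than_48:.4f}")
print(f"iii) {fewer_than_61:.4f}")
print(f"iv)  {extreme_low:.2f} and {extreme_high:.2f} hours")
print(f"v)   {between_24_and_72:.4f}")
print(f"vi)  {between_48_and_96:.4f}")
```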

17. Admission to a state university depends partially on the applicant's high school GPA.
Assume that the applicants' GPAs approximate a normal curve with a mean of 3.20 and a
standard deviation of 0.30 .
(i) If applicants with GPAs of 3.50 or above are automatically admitted, what proportion
of applicants will be in this category?
(ii) If applicants with GPAs of 2.50 or below are automatically denied admission, what
proportion of applicants will be in this category?
(iii) A special honors program is open to all applicants with GPAs of 3.75 or better. What
proportion of applicants are eligible?
(iv) If the special honors program is limited to students whose GPAs rank in the upper 10
percent, what will Brittany's GPA have to be for admission to this program? (Apr/May 2024)

Solution

(i) If applicants with GPAs of 3.50 or above are automatically admitted, what proportion
of applicants will be in this category?

 To find the proportion of applicants with GPAs of 3.50 or above who are automatically
admitted, convert the GPA 3.50 to a z-score first.
 The formula for z-score is
o Z = (X − µ)/σ,
o where X is the value for which we want to find the Z-score, µ is the Mean, and σ is the standard
deviation.
o Here X=3.50, µ=3.20, and σ=0.30.
o Substituting these values in the formula gives Z = (3.50 − 3.20)/0.30 = 1.00.
o Look up this Z-score in the standard normal distribution table. The result gives the
proportion of students who have GPAs less than 3.5.

o Subtract the proportion below 3.50 from 1 to get the proportion of students who have GPAs
of 3.50 or above, which is the desired answer.

(ii) If applicants with GPAs of 2.50 or below are automatically denied admission, what
proportion of applicants will be in this category?

 Calculate the z-score for X=2.50. Again, use the standard normal distribution table or calculator
function to find the proportion p for this z-score.
 The result gives the proportion of students that have GPAs less than 2.5, and this is the
proportion of students who will be denied admission automatically.

(iii) A special honors program is open to all applicants with GPAs of 3.75 or better. What
proportion of applicants are eligible?

 Calculate the z-score for X=3.75, which is the GPA for a special honors program.
 Then use the standard normal distribution table or function to find the proportion of students
who have GPAs less than 3.75.
 Subtract this number from 1 to get the proportion of students who have GPAs greater than 3.75
and this number is the proportion of applicants that are eligible for the honors program.

(iv) If the special honors program is limited to students whose GPAs rank in the upper 10
percent, what will Brittany's GPA have to be for admission to this program?

 In this problem, the proportion of students is known (0.10 or 10%), and the GPA score needs to
be found.
 First, find the z-score that cuts off the upper 0.10 of the standard normal distribution
(equivalently, a cumulative proportion of 0.90), using the table or a calculator function.
 Next, convert this z-score to the GPA score using the z-score formula in reverse: X = Zσ + µ.
 Here, Z is the calculated Z-score, σ is standard deviation and µ is the mean.
 This gives the GPA a student should have in order to be in the top 10% and hence be eligible
for the special honors program.

Conclusion:

 The proportion of students automatically admitted, automatically denied, eligible for honors
program and GPA needed to be in top 10% can be found using the z-score method and using the
standard normal distribution.
 The exact numbers may vary based on the values retrieved from the standard normal
distribution table or calculator function.
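
The four lookups described in this solution can be carried out with `statistics.NormalDist`, as in the sketch below. It is offered as one way to obtain the numbers rather than the method used in the original solution, and results may differ slightly from table values because tables round z to two decimal places.

```python
from statistics import NormalDist

# Applicants' GPAs: normal with mean 3.20 and standard deviation 0.30.
gpas = NormalDist(mu=3.20, sigma=0.30)

# (i) Automatically admitted: GPA of 3.50 or above.
z_admit = (3.50 - 3.20) / 0.30
proportion_admitted = 1 - gpas.cdf(3.50)

# (ii) Automatically denied: GPA of 2.50 or below.
proportion_denied = gpas.cdf(2.50)

# (iii) Eligible for honors: GPA of 3.75 or better.
proportion_honors = 1 - gpas.cdf(3.75)

# (iv) GPA that cuts off the upper 10 percent (cumulative proportion 0.90),
# equivalent to X = Z * sigma + mu with Z taken at the 90th percentile.
gpa_cutoff_top_10 = gpas.inv_cdf(0.90)

print(f"(i)   z = {z_admit:.2f}, proportion = {proportion_admitted:.4f}")
print(f"(ii)  proportion = {proportion_denied:.4f}")
print(f"(iii) proportion = {proportion_honors:.4f}")
print(f"(iv)  GPA cutoff = {gpa_cutoff_top_10:.2f}")
```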
