Advanced Statistics Project Report
DSBA
ABHISHEK.V.
CONTENTS:
1.Problem 1.
1.1
1.2
1.3
1.4
1.5
2.Problem 2.
2.1
2.2
2.3
3.Problem 3.
3.1
3.2
3.3
3.4
4.Problem 4.
4.1
4.2
4.3
5.Problem 5.
5.1
5.2
6.Problem 6.
7.Problem 7.
7.1
7.2
7.3
7.4
7.5
7.6
7.7
Problem 1
A physiotherapist with a male football team is interested in studying the relationship between
foot injuries and the positions at which the players play from the data collected.
1.1 What is the probability that a randomly chosen player would suffer an injury?
The probability that a randomly chosen player would suffer an injury is the total number of
injured players divided by the total number of players.
Probability of injury = (Total Players Injured) / (Total Players)
= 145/235 ≈ 0.617
(61.7%).
Therefore, there is a 61.7% chance that a randomly chosen player would suffer an injury.
1.3 What is the probability that a randomly chosen player plays in a striker position and has a
foot injury?
The probability that a randomly chosen player plays in a striker position and has a foot injury
is the number of strikers with foot injuries divided by the total number of players.
Probability of Striker with foot injury = (Strikers Injured) / (Total Players)
= 45/235 ≈ 0.191
(19.1%).
Therefore, there is a 19.1% chance that a randomly chosen player plays in a striker position
and has a foot injury.
1.4 What is the probability that a randomly chosen injured player is a striker?
The probability that a randomly chosen injured player is a striker is the number of injured
strikers divided by the total number of injured players.
Probability of Striker among injured players = (Strikers Injured) / (Total Players Injured)
= 45/145 ≈ 0.310
(31.0%).
Therefore, there is a 31% chance that a randomly chosen injured player is a striker.
1.5 What is the probability that a randomly chosen injured player is either a forward or an
attacking midfielder?
The probability that a randomly chosen injured player is either a forward or an attacking
midfielder is the number of injured forwards plus the number of injured attacking midfielders,
divided by the total number of injured players.
Probability of Forward or Attacking Midfielder among Injured Players = (Forwards Injured +
Attacking Midfielders Injured) / (Total Players Injured) = (56+24)/145 ≈ 0.552
(55.2%).
Therefore, there is a 55% chance that a randomly chosen injured player is either a forward
or an attacking midfielder.
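The Problem 1 answers can be cross-checked with a short Python sketch. It only uses the counts quoted in the answers above (235 players, 145 injured, 45 injured strikers, 56 injured forwards, 24 injured attacking midfielders); the variable names are illustrative assumptions, not taken from the original notebook.

```python
from fractions import Fraction

# Counts quoted in the answers above (the full table is not reproduced here).
total_players = 235
injured_total = 145
injured_strikers = 45
injured_forwards = 56
injured_att_mid = 24

# 1.1 marginal probability of injury
p_injured = Fraction(injured_total, total_players)
# 1.3 joint probability: striker AND injured
p_striker_and_injured = Fraction(injured_strikers, total_players)
# 1.4 conditional probability: striker GIVEN injured
p_striker_given_injured = Fraction(injured_strikers, injured_total)
# 1.5 conditional probability: forward OR attacking midfielder GIVEN injured
p_fwd_or_am_given_injured = Fraction(injured_forwards + injured_att_mid, injured_total)

results = {
    "1.1 P(injured)": p_injured,
    "1.3 P(striker and injured)": p_striker_and_injured,
    "1.4 P(striker | injured)": p_striker_given_injured,
    "1.5 P(forward or att. midfielder | injured)": p_fwd_or_am_given_injured,
}
for label, p in results.items():
    print(f"{label} = {float(p):.3f}")
```

Using exact fractions makes it easy to confirm that the conditional probability in 1.4 equals the joint probability in 1.3 divided by the marginal probability in 1.1.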
Problem 2.
An independent research organization is trying to estimate the probability that an accident at a
nuclear power plant will result in radiation leakage. The types of accidents possible at the plant
are, fire hazards, mechanical failure, or human error. The research organization also knows that
two or more types of accidents cannot occur simultaneously.
According to the studies carried out by the organization, the probability of a radiation leak in
case of a fire is 20%, the probability of a radiation leak in case of a mechanical failure is 50%,
and the probability of a radiation leak in case of a human error is 10%. The studies also showed
the following:
1.The probability of a radiation leak occurring simultaneously with a fire is 0.1%.
2.The probability of a radiation leak occurring simultaneously with a mechanical failure is
0.15%.
3.The probability of a radiation leak occurring simultaneously with a human error is 0.12%.
On the basis of the information available, answer the questions below:
2.1 What are the probabilities of a fire, a mechanical failure, and a human error respectively?
The 20%, 50% and 10% figures given in the problem are the conditional probabilities of a
radiation leak given each type of accident, not the probabilities of the accidents themselves.
The probability of each type of accident can be recovered from the definition of conditional
probability, which is the relation underlying Bayes' Theorem.
Let P(R) be the probability of Radiation leak
P(F) be the probability of Fire.
P(M) be the probability of Mechanical Failure.
P(H) be the probability of Human Error
According to the given data,
P(R | F) = 0.20
P(R | M) = 0.50
P(R | H) = 0.10
P( F∩ R ) = 0.1% = 0.001
P( M∩R ) = 0.15% = 0.0015
P( H∩R ) = 0.12% = 0.0012
As we know, P( A|B ) = P( A∩B ) / P(B) for any two events A and B with P(B) > 0; the events
do not need to be independent. Rearranging gives P(B) = P( A∩B ) / P( A|B ). Note that the
accident types cannot occur simultaneously, which makes them mutually exclusive rather than
independent.
P( F ) = P( F∩R ) / P( R|F ) = 0.001/0.20 = 0.005
Thus, the probability of a fire at the plant is 0.5%.
P( M ) = P( M∩R ) / P( R|M ) = 0.0015/0.50 = 0.003
Thus, the probability of a mechanical failure at the plant is 0.3%.
P( H ) = P( H∩R ) / P( R|H ) = 0.0012/0.10 = 0.012
Thus, the probability of a human error at the plant is 1.2%.
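The three divisions above follow one pattern, P(accident) = P(accident ∩ leak) / P(leak | accident). A minimal Python sketch of that step is given here; the dictionary names are illustrative assumptions and the numbers are the ones stated in the problem.

```python
# Conditional probabilities of a leak given each accident type (from the problem).
p_leak_given = {"fire": 0.20, "mechanical failure": 0.50, "human error": 0.10}

# Joint probabilities of a leak occurring together with each accident type.
p_leak_and = {"fire": 0.001, "mechanical failure": 0.0015, "human error": 0.0012}

# Rearranged definition of conditional probability:
# P(accident) = P(accident and leak) / P(leak | accident)
p_accident = {name: p_leak_and[name] / p_leak_given[name] for name in p_leak_given}

for name, p in p_accident.items():
    print(f"P({name}) = {p:.4f} ({p:.2%})")
```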
2.3 Suppose there has been a radiation leak in the reactor for which the definite cause is not
known. What is the probability that it has been caused by :
A Fire.
A Mechanical Failure.
A Human Error.
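Given that the definite cause of the radiation leak is not known, we can calculate the
probability that it has been caused by each type of accident. Since the accident types cannot
occur simultaneously, the total probability of a radiation leak is the sum of the joint probabilities:
P( R ) = P( F∩R ) + P( M∩R ) + P( H∩R ) = 0.001 + 0.0015 + 0.0012 = 0.0037
P( F | R ) = P( R∩F ) / P( R ) = 0.001/0.0037 = 0.2703
Thus, the probability that a radiation leak has happened due to a fire is 27.0%.
P( M | R ) = P( R∩M ) / P( R ) = 0.0015/0.0037 = 0.4054
Thus, the probability that a radiation leak has happened due to a mechanical failure is 40.5%.
P( H | R ) = P( R∩H ) / P( R ) = 0.0012/0.0037 = 0.3243
Thus, the probability that a radiation leak has happened due to a human error is 32.4%.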
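The posterior probabilities for 2.3 can be sketched in Python as follows. The sketch assumes, as the 0.0037 denominator does, that a leak can only arise from one of the three mutually exclusive accident types; the variable names are illustrative.

```python
# Joint probabilities P(accident and leak) stated in the problem.
p_leak_and = {"fire": 0.001, "mechanical failure": 0.0015, "human error": 0.0012}

# Accident types are mutually exclusive, so P(leak) is the sum of the joint terms.
p_leak = sum(p_leak_and.values())

# Bayes' theorem: P(accident | leak) = P(accident and leak) / P(leak)
posterior = {name: joint / p_leak for name, joint in p_leak_and.items()}

print(f"P(leak) = {p_leak:.4f}")
for name, p in posterior.items():
    print(f"P({name} | leak) = {p:.4f}")
```

The three posterior probabilities sum to 1, which is a quick sanity check on the arithmetic.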
Problem 3.
The breaking strength of gunny bags used for packaging cement is normally distributed with a
mean of 5 kg per sq. centimeter and a standard deviation of 1.5 kg per sq. centimeter. The
quality team of the cement company wants to know the following about the packaging material
to better understand wastage or pilferage within the supply chain. Answer the questions below
based on the given information. (Provide an appropriate visual representation of your
answers, without which marks will be deducted.)
3.1 What proportion of the gunny bags have a breaking strength less than 3.17 kg per sq cm?
We have:
μ (mean) = 5 kg/cm²
σ (standard deviation) = 1.5 kg/cm²
X = 3.17 kg/cm² (value of interest)
We need to find the z-score: Z = (X - μ) / σ = (3.17 - 5) / 1.5 ≈ -1.22
In Python, scipy.stats.norm.cdf(-1.22) returns approximately 0.111.
Therefore, we can say that about 11% of the gunny bags have a breaking strength less than
3.17 kg per square cm.
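The 3.1 result can be cross-checked with a few lines of Python. This is a minimal sketch that assumes the standard library's statistics.NormalDist as a dependency-free stand-in for scipy.stats.norm; the variable names are illustrative.

```python
from statistics import NormalDist

# Breaking strength model stated in the problem: mean 5, standard deviation 1.5.
strength = NormalDist(mu=5, sigma=1.5)

# Left-tail probability P(X < 3.17), evaluated directly on the original scale.
p_below = strength.cdf(3.17)

# The same probability via the z-score and the standard normal distribution.
z = (3.17 - 5) / 1.5
p_below_via_z = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(X < 3.17) = {p_below:.4f}")
```

Working on the original scale avoids rounding the z-score by hand, and both routes give the same probability.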
3.2 What proportion of the gunny bags have a breaking strength of at least 3.6 kg per sq cm?
We have:
μ (mean) = 5 kg/cm²
σ (standard deviation) = 1.5 kg/cm²
X = 3.6 kg/cm² (value of interest)
We need to find the z-score: Z = (X - μ) / σ = (3.6 - 5) / 1.5 ≈ -0.933
Since we are asked for the proportion of bags with a breaking strength of at least 3.6 kg/cm²,
we want to find the area under the curve to the right of X = 3.6, that is, P( X >= 3.6 ).
P( X >= 3.6 ) = 1 - stats.norm.cdf(-0.933)
Therefore, P( Z > -0.933 ) = 0.8247
Therefore, we can say that 82.5% of the gunny bags have a breaking strength of at least
3.6 kg per square cm.
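The right-tail area for 3.2 is the complement of the cumulative probability. A small sketch, again assuming statistics.NormalDist from the standard library in place of scipy.stats.norm:

```python
from statistics import NormalDist

# Breaking strength model stated in the problem.
strength = NormalDist(mu=5, sigma=1.5)

# Right-tail probability P(X >= 3.6) = 1 - P(X < 3.6).
p_at_least = 1 - strength.cdf(3.6)

print(f"P(X >= 3.6) = {p_at_least:.4f}")
```

With SciPy, the survival function scipy.stats.norm.sf computes the same right-tail area in one call.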
3.3 What proportion of the gunny bags have a breaking strength between 5 and 5.5 kg per sq
cm?
We have:
μ (mean) = 5 kg/cm²
σ (standard deviation) = 1.5 kg/cm²
X1 = 5 kg/cm² (lower limit)
X2 = 5.5 kg/cm² (upper limit)
We need to find the z-scores for both values:
Z1 = (X1 - μ) / σ = (5 - 5) / 1.5 = 0
Z2 = (X2 - μ) / σ = (5.5 - 5) / 1.5 ≈ 0.333
We compute stats.norm.cdf(Z2) - stats.norm.cdf(Z1), which gives the area between these
two points.
P( Z1 < Z < Z2 ) = P( Z < 0.3333 ) - P( Z < 0 )
This comes out to be 0.1306.
Therefore, we can say that 13% of the gunny bags have a breaking strength between 5 and
5.5 kg/cm².
3.4 What proportion of the gunny bags have a breaking strength NOT between 3 and 7.5 kg per
sq cm?
To find the proportion of bags with breaking strength not in the specified range, we find the
cumulative probabilities for the z-scores corresponding to 3 and 7.5, take their difference to get
the proportion inside the range, and subtract that proportion from 1:
For Z1 = (3 - 5) / 1.5 ≈ -1.333: Cumulative Probability for Z1 ≈ 0.0912
For Z2 = (7.5 - 5) / 1.5 ≈ 1.667: Cumulative Probability for Z2 ≈ 0.9522
We compute 1 - (stats.norm.cdf(Z2) - stats.norm.cdf(Z1)).
This gives the area outside these two points.
This comes out to be 0.1390. Therefore, we can say that 13.9% of the gunny bags have a
breaking strength that is not between 3 and 7.5 kg/cm².
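Questions 3.3 and 3.4 both reduce to an interval probability, either taken directly or subtracted from 1. The sketch below assumes statistics.NormalDist as a stand-in for scipy.stats.norm; the helper name prob_between is hypothetical.

```python
from statistics import NormalDist

# Breaking strength model stated in the problem.
strength = NormalDist(mu=5, sigma=1.5)


def prob_between(dist, low, high):
    """Return P(low < X < high) as a difference of two cumulative probabilities."""
    return dist.cdf(high) - dist.cdf(low)


# 3.3: proportion between 5 and 5.5
p_5_to_5_5 = prob_between(strength, 5, 5.5)

# 3.4: proportion NOT between 3 and 7.5 (complement of the interval probability)
p_outside_3_to_7_5 = 1 - prob_between(strength, 3, 7.5)

print(f"P(5 < X < 5.5) = {p_5_to_5_5:.4f}")
print(f"P(X < 3 or X > 7.5) = {p_outside_3_to_7_5:.4f}")
```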
Problem 4:
Grades of the final examination in a training course are found to be normally distributed, with a
mean of 77 and a standard deviation of 8.5. Based on the given information answer the
questions below.
We can use the properties of the normal distribution to answer these questions. Remember that
the standard normal distribution has a mean (μ) of 0 and a standard deviation (σ) of 1. We can
transform values to the standard normal distribution using the formula:
Z = (X - μ) / σ
Where:
X is the value you're interested in
μ is the mean of the distribution
σ is the standard deviation of the distribution
Z is the standard score (how many standard deviations the value is from the mean).
4.1 What is the probability that a randomly chosen student gets a grade below 85 on this exam?
X = 85 (the grade we're interested in)
μ = 77 (mean)
σ = 8.5 (standard deviation)
Z = (85 - 77) / 8.5 ≈ 0.941
In Python, we use the norm.cdf function of scipy.stats to evaluate the cumulative distribution
function, which gives the area under the curve below a grade of 85. This comes out to
approximately 0.8267.
Therefore, we can say that there is about an 82.7% probability that a randomly chosen
student gets a grade below 85 on this exam.
4.2 What is the probability that a randomly selected student scores between 65 and 87?
X1 = 65 (lower limit)
X2 = 87 (upper limit)
Z1 = (65 - 77) / 8.5 ≈ -1.4118
Z2 = (87 - 77) / 8.5 ≈ 1.1765
P( 65 < X < 87 ) = P( Z < Z2 ) - P( Z < Z1 )
This comes out to be 0.80128. Therefore, we can say that there is an 80% probability that a
randomly chosen student gets a grade between 65 and 87 on this exam.
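The probabilities in 4.1 and 4.2 can be reproduced with the sketch below. It assumes statistics.NormalDist from the standard library rather than scipy.stats.norm, and the variable names are illustrative.

```python
from statistics import NormalDist

# Grade distribution stated in the problem: mean 77, standard deviation 8.5.
grades = NormalDist(mu=77, sigma=8.5)

# 4.1: probability of a grade below 85
p_below_85 = grades.cdf(85)

# 4.2: probability of a grade between 65 and 87
p_65_to_87 = grades.cdf(87) - grades.cdf(65)

print(f"P(X < 85) = {p_below_85:.4f}")
print(f"P(65 < X < 87) = {p_65_to_87:.4f}")
```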
4.3 What should be the passing cut-off so that 75% of the students clear the exam?
Since 75% of the students must score above the cut-off, we need to find the grade (X) below
which 25% of the students fall, i.e. the 25th percentile of the normal distribution.
To calculate this, we use the ppf function (percent point function) of the scipy.stats library,
which returns the value x such that P( X <= x ) equals the requested probability.
Here, we call it with a probability of 0.25, so that the area to the right of the returned value,
i.e. the proportion of students who pass, is 0.75.
Thus, the passing cut-off score so that 75% of students clear the exam is approximately 71.27.
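A short sketch of the cut-off calculation, assuming statistics.NormalDist.inv_cdf as the standard-library counterpart of scipy.stats.norm.ppf (both invert the cumulative distribution function):

```python
from statistics import NormalDist

# Grade distribution stated in the problem.
grades = NormalDist(mu=77, sigma=8.5)

# 75% of students should score above the cut-off, so 25% fall below it.
pass_rate = 0.75
cutoff = grades.inv_cdf(1 - pass_rate)

# Sanity check: the share of students at or above the cut-off.
share_passing = 1 - grades.cdf(cutoff)

print(f"cut-off = {cutoff:.2f}, share passing = {share_passing:.2%}")
```

Passing 0.75 instead of 0.25 to the inverse CDF would return the grade that only 25% of students exceed, which is a common slip in this kind of question.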
Problem 5:
Zingaro stone printing is a company that specializes in printing images or patterns on polished
or unpolished stones. However, for the optimum level of printing of the image the stone surface
has to have a Brinell's hardness index of at least 150. Recently, Zingaro has received a batch of
polished and unpolished stones from its clients. Use the data provided to answer the following
(assuming a 5% significance level):
5.1 Earlier experience of Zingaro with this particular client is favorable as the stone surface was
found to be of adequate hardness. However, Zingaro has reason to believe now that the
unpolished stones may not be suitable for printing. Do you think Zingaro is justified in thinking
so?
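Step 1:
Defining the hypotheses:
Null Hypothesis Ho: μ >= 150 (the mean hardness of the unpolished stones is adequate for printing)
Alternate Hypothesis Ha: μ < 150 (the mean hardness of the unpolished stones is not adequate
for printing)
Here, level of significance α = 0.05, sample size n = 75 (derived from the dataset), sample
mean x̄ = 134.11, standard deviation σ = 33.04, and hypothesized mean μ = 150.
Step 2:
Define the test statistic based on the information in the question. Here, we use the Z statistic:
Zstat = (x̄ - μ) / (σ / √n) = (134.11 - 150) / (33.04 / √75) ≈ -4.165
Because the alternate hypothesis is Ha: μ < 150, this is a lower-tailed test.
Step 3:
Let us check the critical value with respect to α for the test statistic.
Using norm.ppf for an alpha value of 0.05, the critical value is -1.6449.
Let’s calculate the p-value as well. The p-value comes out to be 1.5567 x 10^-5.
Since Zstat is below the critical value and the p-value is far smaller than α = 0.05, we reject
the null hypothesis. Zingaro is justified in thinking that the unpolished stones may not be
suitable for printing.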
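The lower-tailed Z test can be sketched from the summary statistics quoted in Step 1. This is an illustration under the report's own assumption that the Z statistic applies; it uses statistics.NormalDist in place of scipy.stats.norm, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in Step 1.
n = 75          # sample size
x_bar = 134.11  # sample mean hardness of unpolished stones
sigma = 33.04   # standard deviation used in the Z statistic
mu_0 = 150      # hypothesized minimum hardness
alpha = 0.05    # significance level

# Test statistic for H0: mu >= 150 versus Ha: mu < 150.
z_stat = (x_bar - mu_0) / (sigma / sqrt(n))

# Lower-tailed test: critical value and p-value both come from the left tail.
z_crit = NormalDist().inv_cdf(alpha)
p_value = NormalDist().cdf(z_stat)

reject_null = p_value < alpha
print(f"z = {z_stat:.3f}, critical value = {z_crit:.4f}, p-value = {p_value:.3e}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are equivalent decision rules, so both lead to the same conclusion.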
5.2 Is the mean hardness of the polished and unpolished stones the same?
H0: mu(polished) = mu(unpolished)
Ha: mu(polished) != mu(unpolished)
The hypothesis test result is below:
Mu1=134.1105
Mu2=147.7881
Std1=1091.76
Std2=242.96
t_stats = 1.9148542155126753
P_value = 0.05839774428202243
Since the reported p-value (about 0.058) is greater than the 0.05 significance level, we fail to
reject the null hypothesis based on this result.
Problem 7:
Dental implant data: The hardness of metal implant in dental cavities depends on multiple
factors, such as the method of implant, the temperature at which the metal is treated, the alloy
used as well as on the dentists who may favour one method above another and may work better
in his/her favourite method. The response is the variable of interest.
1. Test whether there is any difference among the dentists on the implant hardness. State the null
and alternative hypotheses. Note that both types of alloys cannot be considered together. You
must state the null and alternative hypotheses separately for the two types of alloys.
2. Before the hypotheses may be tested, state the required assumptions. Are the assumptions
fulfilled? Comment separately on both alloy types.
3. Irrespective of your conclusion in 2, we will continue with the testing procedure. What do you
conclude regarding whether implant hardness depends on dentists? Clearly state your
conclusion. If the null hypothesis is rejected, is it possible to identify which pairs of dentists
differ?
4. Now test whether there is any difference among the methods on the hardness of dental implant,
separately for the two types of alloys. What are your conclusions? If the null hypothesis is
rejected, is it possible to identify which pairs of methods differ?
5. Now test whether there is any difference among the temperature levels on the hardness of dental
implant, separately for the two types of alloys. What are your conclusions? If the null
hypothesis is rejected, is it possible to identify which levels of temperatures differ?
6. Consider the interaction effect of dentist and method and comment on the interaction plot,
separately for the two types of alloys?
7. Now consider the effect of both factors, dentist, and method, separately on each alloy. What do
you conclude? Is it possible to identify which dentists are different, which methods are
different, and which interaction levels are different?
Solution:1.
Test whether there is any difference among the dentists on the implant hardness. State the null
and alternative hypotheses. Note that both types of alloys cannot be considered together. You
must state the null and alternative hypotheses separately for the two types of alloys?
Step 1:
Defining the hypotheses separately for both cases.
We will perform a one-way ANOVA on the response variable, separately for Alloy 1 and
Alloy 2.
Case 1:
Null Hypothesis: Ho: Mean hardness is the same across all dentists for Alloy 1.
Alternate Hypothesis: Ha: Mean hardness is not the same for at least one pair of dentists for
Alloy 1.
Case 2:
Null Hypothesis: Ho: Mean hardness is the same across all dentists for Alloy 2.
Alternate Hypothesis: Ha: Mean hardness is not the same for at least one pair of dentists for
Alloy 2.
Here, level of significance α = 0.05 and sample size n = 90 (derived from the dataset).
The corresponding p-value is greater than alpha (0.05),
so we fail to reject the null hypothesis.
Thus there is no significant evidence that the mean hardness differs across the dentists.
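To make the one-way ANOVA step concrete, the sketch below computes the F statistic by hand from the between-group and within-group sums of squares. The hardness readings are made-up illustrative numbers, NOT the project data, and the function name one_way_anova_f is hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical hardness readings for three dentists (illustration only).
hardness_by_dentist = [
    [700, 710, 720],
    [710, 720, 730],
    [720, 730, 740],
]

f_stat, df_between, df_within = one_way_anova_f(hardness_by_dentist)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

The p-value is the right-tail area of the F distribution at this statistic; in practice a library routine such as scipy.stats.f_oneway returns both the F statistic and the p-value, and the test would be run once per alloy.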
2.Before the hypotheses may be tested, state the required assumptions. Are the assumptions
fulfilled?
Comment separately on both alloy types?
These are the assumptions required before the test:
The responses within each group, for each type of alloy, are normally distributed.
The groups have the same variance.
The observations are independent.
Let’s look at the boxplot of the response variable to see the distribution.
The boxplot suggests that the distribution is not normal, so the normality assumption does not
appear to be fulfilled.
3.Irrespective of your conclusion in 2, we will continue with the testing procedure. What do you
conclude regarding whether implant hardness depends on dentists? Clearly state your
conclusion. If the null hypothesis is rejected, is it possible to identify which pairs of dentists
differ?
The conclusion is that implant hardness does not depend on dentists.
Yes, if the null hypothesis were rejected, we could identify which pairs of dentists differ using
the Tukey HSD test.
Below is the result.
Here, False means there is no significant difference for that pair.
4.Now test whether there is any difference among the methods on the hardness of dental
implant, separately for the two types of alloys. What are your conclusions? If the null
hypothesis is rejected, is it possible to identify which pairs of methods differ?
Step 1:
Defining the hypothesis:
Defining the Separate hypothesis for both the cases.
Case 1:
Null Hypothesis:
Ho: The mean hardness is same across all methods for Alloy1.
Alternate Hypothesis:
Ha: The mean Hardness is not same for at least one pair of methods for Alloy1.
Case 2:
Null Hypothesis:
Ho: The mean hardness is same across all methods for Alloy2.
Alternate Hypothesis:
Ha: The mean Hardness is not same for at least one pair of methods for Alloy2.
We can say that the mean hardness for both Alloy1 and Alloy2 is different for at least one pair
of method of dental implant.
So, there is a significant difference among the methods on implant hardness for alloy2
The conclusion is that implant hardness doesn't depend on methods for alloy1 but depends for
alloy2.
Yes, we can identify which pairs of methods differ using the Tukey HSD test.
By the Tukey HSD result, there is no significant difference between methods 1 and 2, but
there is a significant difference between methods 1 and 3 and between methods 2 and 3.
5.Now test whether there is any difference among the temperature levels on the hardness of
dental implant, separately for the two types of alloys. What are your conclusions? If the null
hypothesis is rejected, is it possible to identify which levels of temperatures differ?
Step 1:
Defining the hypothesis:
Defining the Separate hypothesis for both the cases.
Case 1:
Null Hypothesis:
Ho: The mean hardness is same across all temperature levels for Alloy1.
Alternate Hypothesis:
Ha: The mean Hardness is not same for different temperature levels for Alloy1.
Case 2:
Null Hypothesis:
Ho: The mean hardness is same across all temperature levels for Alloy2.
Alternate Hypothesis:
Ha: The mean Hardness is not same for different temperature levels for Alloy2.
The corresponding p-value is greater than alpha (0.05) for both Alloy1 and Alloy2. Thus we
fail to reject the null hypothesis in both cases.
There is no significant evidence that the mean hardness differs across temperature levels for
either Alloy1 or Alloy2.
The conclusion is that implant hardness does not depend on temperature for Alloy1 or Alloy2.
Pairs of temperature levels that differ could be identified with the Tukey HSD test.
By the Tukey HSD result, there is no significant difference between any pair of
temperature levels.
6.Consider the interaction effect of dentist and method and comment on the interaction plot,
separately for the two types of alloys?
For Alloy1
Ho: There is no interaction between dentist and method for Alloy 1.
Ha: There is an interaction between dentist and method for Alloy 1.
As the p-value of the interaction term C(Dentist):C(Method) is less than the significance
level, we reject the null hypothesis.
So, there is a significant interaction between dentist and method for Alloy 1.
For Alloy2
Ho: There is no interaction between dentist and method for Alloy 2.
Ha: There is an interaction between dentist and method for Alloy 2.
As the p-value of the interaction term C(Dentist):C(Method) is more than the significance
level, we fail to reject the null hypothesis.
So, there is no significant interaction between dentist and method for Alloy 2.
7.Now consider the effect of both factors, dentist, and method, separately on each alloy. What
do you conclude? Is it possible to identify which dentists are different, which methods are
different, and which interaction levels are different?
Conclusion for Alloy 1: As the p-value for C(Dentist):C(Method) is lower than 0.05, we can
say that there is a significant interaction between dentist and method for Alloy 1.
Conclusion for Alloy 2: Consistent with part 6, the p-value for C(Dentist):C(Method) is higher
than 0.05, so there is no significant overall interaction between dentist and method for Alloy 2.
Alloy 1
Conclusion:
1. For the dentists, the p-value is higher than 0.05, so there is no significant difference
between them.
2. For the methods, the p-value is higher than 0.05, so there is no significant difference in the
methods.
3. For the interaction terms C(Dentist)[T.4]:C(Method)[T.3] and
C(Dentist)[T.5]:C(Method)[T.3], the p-values are lower than 0.05, so these interaction
levels are different.
Alloy 2
Conclusion:
1. For the dentists, the p-value is higher than 0.05, so there is no significant
difference between them.
2. For the methods, the p-value is lower than 0.05 for C(Method)[T.3], so there is a
significant difference in the methods.
3. For the interaction term C(Dentist)[T.5]:C(Method)[T.3], the p-value is lower
than 0.05, so this interaction level is different.