r lang-Unit-04
Statistical modeling uses mathematics and statistics to make assumptions about, and draw conclusions from, data. The use of statistical models is ubiquitous in scientific fields, including engineering, business, and the life sciences. It's a valuable tool for drawing inferences and making quantitative predictions about data.
This article defines statistical modeling and shows where and why it’s used. It then examines
the finer points of the most common forms of modeling, with actual published examples so
you can see regression in action.
With statistical inference, you can estimate a population parameter by measuring it on only part of the population. Choosing which part of the population to collect information from is known as statistical sampling. Election polls do precisely that to predict winners (or losers) from a sample of U.S. voters.
Statistical models are particularly good at explaining the relationships between variables: they seek some understanding of the structure underlying the observed data.
Perhaps the “simplest” statistical model is the arithmetic mean, or average. With this measure, you're attempting to estimate the expected value of a random sample drawn from the population.
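A minimal illustration in R (using the built-in iris data set, which also appears later in this unit), comparing the population mean with the mean of one random sample:
# Treat the full iris data as the population
population_mean <- mean(iris$Petal.Length)
# Mean of a single random sample of 10 observations, an estimate of the expected value
set.seed(42)   # illustrative seed for reproducibility
sample_mean <- mean(sample(iris$Petal.Length, 10))
population_mean
sample_mean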
For example, in one published study, the factors associated with the COVID-19 mortality rate (the dependent variable) in 169 countries were identified using linear regression.
The researchers performed a simple linear regression analysis to test the correlation between COVID-19 mortality and the number of tests performed (the independent variable). They then used multiple regression analysis to predict mortality rates while accounting for other explanatory factors (e.g., case numbers, critical cases, hospital bed numbers).
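A minimal sketch of this kind of analysis in R with lm(); the data frame covid_df and its columns below are placeholder values for illustration only, not the study's actual data:
# Placeholder country-level data (assumed structure, not the real study data)
set.seed(1)
covid_df <- data.frame(
  tests          = rpois(169, 50000),
  cases          = rpois(169, 10000),
  critical_cases = rpois(169, 500),
  hospital_beds  = rpois(169, 30000),
  mortality_rate = runif(169, 0, 10)
)
# Simple linear regression: mortality vs. number of tests
simple_fit <- lm(mortality_rate ~ tests, data = covid_df)
summary(simple_fit)
# Multiple regression: add the other explanatory factors
multi_fit <- lm(mortality_rate ~ tests + cases + critical_cases + hospital_beds, data = covid_df)
summary(multi_fit)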
The results suggested that an increase in testing was effective at attenuating mortality when other means of control were insufficient; higher mortality was found to be associated with other explanatory factors identified in the study.
In general, statistical tests are used for:
• Analyzing data
• Making predictions
Statistical testing can also refer to a testing method in software engineering that aims to identify unreliable software products.
Common statistical tests include:
• t-Test
• Chi-square test
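As a brief sketch in R, these can be run with the base functions t.test() and chisq.test(); the data sets below (iris and mtcars, both built into R) are just convenient illustrations:
# Two-sample t-test: compare Petal.Length between two iris species
t.test(iris$Petal.Length[iris$Species == "setosa"],
       iris$Petal.Length[iris$Species == "versicolor"])
# Chi-square test of independence on a contingency table (gear vs. cylinders in mtcars)
# (a warning about small expected counts may appear with this small data set)
chisq.test(table(mtcars$gear, mtcars$cyl))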
Demonstrate Sampling and Sampling Distribution using the Iris Data set
# Inspect the structure of the built-in iris data set
str(iris)
iris_df <- data.frame(iris)
View(iris_df)
# Draw 100 random samples of size 5 and record the mean of each sample
iter <- 100
n <- 5
means <- rep(NA, iter)
for (i in 1:iter) {
  sample_values <- sample(iris$Petal.Length, n)  # one random sample of Petal.Length
  means[i] <- mean(sample_values)                # store its sample mean
}
# Plot the sampling distribution of the sample means
hist(means)
7. Hypothesis Testing
What is Hypothesis Testing? (2 marks)
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.
What are Inferential Statistics? (2 marks)
Inferential statistics use data collected from samples to draw conclusions about the population from which the samples were taken.
[Figure: population → samples (Sample 1 … Sample n) → sampling distribution → data → conclusion]
The conclusion drawn from the population data, i.e., the hypothesis, is then tested using the various types of hypothesis tests.
8. Components of Hypothesis Testing
List the Components of Hypothesis Testing. (2 marks)
The components of hypothesis testing are:
1. Null Hypothesis
2. Alternate Hypothesis
3. Confidence Interval (CI)
4. Significance Value
5. Decision Boundary
Define Null Hypothesis. (2 marks)
The null hypothesis is a general statement, or default position, which says that there is no relationship between two measured phenomena or no association among groups.
A null hypothesis is denoted by the symbol H0 in statistics. It is usually pronounced “H-nought” or “H-null”. The subscript in H0 is the digit 0.
Define Alternate Hypothesis. (2 marks)
The alternate hypothesis, denoted H1 (or Ha), is the statement that contradicts the null hypothesis: it says that there is a relationship between the measured phenomena or an association among the groups.
Explain the Null and Alternate Hypothesis with an Example (8 marks) (Very Important)
Consider an experiment in which a coin is tossed a number of times and the percentage of heads is recorded. If the experiment is repeated many times, the outcomes are approximately normally distributed, as the following graph represents.
If we get 50% heads and 50% tails, the outcome lies exactly at the centre of the curve.
If we get 60% heads and 40% tails, the outcome lies between 40 and 60.
If we get 70% heads and 30% tails, the outcome lies between 30 and 70.
We can now explain the components of hypothesis testing for this problem.
Figure 2: Representing the normal distribution curve for hypothesis testing and its components
1. Null Hypothesis
If we toss the coin and the percentage of heads lies between 30 and 70, we accept the null hypothesis, which is justified by the experiment.
2. Alternate Hypothesis
If we toss the coin and the percentage of heads falls beyond the 30 to 70 range, we accept the alternate hypothesis.
3. Confidence Interval
A confidence interval of 95 percent means that if we repeat the coin-flip experiment 100 times, then in 95 of those repetitions the probability of getting heads will fall within that confidence interval.
In other words, in those 95 repetitions the percentage of heads and tails lies between 30 and 70.
4. Significance Value
α = 0.05
This means that in about 5% of the 100 repetitions of the coin-toss experiment the outcome may fall beyond the 30 to 70 range; the probability of such an outcome is 0.05 (i.e., 5%).
5. Decision Boundary
The confidence interval itself serves as the decision boundary. From repeating the coin-toss experiment 100 times it was found that in 95 of them the percentage of heads and tails lies between 30 and 70, so the decision boundary is the 95% region between 30 and 70.
Conclusion
Suppose that in the 5th repetition of the experiment we get 65% heads and 35% tails; then we accept the null hypothesis, since the outcome lies within the decision boundary shown in Figure 2 above.
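A minimal sketch of this decision rule in R, assuming the 30 to 70 acceptance region from the example above (the observed percentage of heads is a made-up illustrative value):
# Decision boundary from the example: accept H0 if the percentage of heads lies between 30 and 70
lower_bound <- 30
upper_bound <- 70
observed_heads_pct <- 65   # e.g., 65% heads in one repetition of the experiment
if (observed_heads_pct >= lower_bound && observed_heads_pct <= upper_bound) {
  print("Accept the null hypothesis (outcome lies within the decision boundary)")
} else {
  print("Reject the null hypothesis and accept the alternate hypothesis")
}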
When do we use Z-test? (2 marks) (just write the below diagram for this question)
Consider an example
The average height of all residents in a city is 168 cm, with a population standard deviation σ = 3.9. A doctor believes the mean to be different. He measured the heights of 36 individuals and found the average to be 169.5 cm.
Given:
Sample size n = 36
Sample mean x̄ = 169.5 cm
Population mean μ = 168 cm, population standard deviation σ = 3.9
H0: μ = 168 cm
H1: μ ≠ 168 cm
Significance value α = 1 − CI = 1 − 0.95 = 0.05
Because the test is two-tailed, α/2 = 0.025, and we look up the 1 − 0.025 = 0.9750 quantile of the standard normal distribution, which gives a critical value of z = 1.96. So the decision boundary is −1.96 to +1.96, as shown in the graph.
Z = (x̄ − μ) / (σ / √n) = (169.5 − 168) / (3.9 / √36) = 1.5 / 0.65 ≈ 2.31
Statistical Inference:
When a sample of 36 individuals with mean 169.5 cm is drawn from a population with mean 168 cm and standard deviation 3.9, the Z-test formula gives Z ≈ 2.31. Comparing 2.31 with the decision boundary of ±1.96, we see that 2.31 > 1.96, so the test statistic lies outside the decision boundary. In this case we reject the null hypothesis and accept the alternate hypothesis.
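A minimal sketch of this calculation in R, using only the numbers from the example above (there is no z.test function in base R, so the statistic and critical value are computed directly):
# Values from the example
pop_mean    <- 168     # population mean (cm)
pop_sd      <- 3.9     # population standard deviation
sample_mean <- 169.5   # observed sample mean (cm)
n           <- 36      # sample size
alpha       <- 0.05    # significance level
# Z statistic
z <- (sample_mean - pop_mean) / (pop_sd / sqrt(n))
z                                # ≈ 2.31
# Two-tailed critical value and decision
z_crit <- qnorm(1 - alpha / 2)   # ≈ 1.96
if (abs(z) > z_crit) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
# Two-tailed p-value
2 * (1 - pnorm(abs(z)))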