Unit - 4
Modelling:
Statistical modelling involves creating mathematical representations of relationships in data.
Example: Predicting house prices based on factors like square footage, location, and number
of bedrooms using regression analysis is a statistical modelling task. The model helps
estimate how these factors influence the price and makes predictions for new houses.
Sampling distributions:
A sampling distribution, also known as a finite-sample distribution, is the probability
distribution of a statistic computed from random samples of a given population. It describes
how spread out the values of that statistic will be across repeated samples from a specific
population.
Since the population is usually too large to analyse in full, you can select smaller groups and
repeatedly sample and analyse them. The gathered data, or statistic, is used to calculate the
likely occurrence, or probability, of an event.
Example: Consider a factory that manufactures light bulbs. The firm strives to keep its bulbs
at a consistent degree of brightness. All bulbs manufactured have a normal distribution of
brightness levels, with a mean (μ) of 800 lumens and a standard deviation (σ) of 20 lumens.
4) Plot the frequency distribution of each sample statistic that you developed from the step
above. The resulting graph will be the sampling distribution.
3) T-Distribution:
This type of sampling distribution is common in cases of small sample sizes. It may also be
used when there is very little information about the entire population. T-distributions are
used to make estimates about the mean and other statistical points.
The central “balance” point of a sampling distribution is its mean, while the standard
deviation of a sampling distribution is referred to as the standard error.
The theoretical formulas for various sampling distributions therefore depend upon:
a) the original probability distributions that are assumed to have generated the raw data
b) the size of the sample itself.
Note:
The central balance point of a sampling distribution is its mean.
The standard deviation of a sampling distribution is referred to as the standard error.
If X itself is normal, the sampling distribution of X̄ is a normal distribution, with mean µX
and standard error σX/√n.
If X is not normal, the sampling distribution of X̄ is still approximately normal with mean
µX and standard error σX/√n, and this approximation improves arbitrarily as n → ∞. This
is known as the central limit theorem (CLT).
Example2:
#Population parameters
population_mean < 75
population std < 10
sample_size <25
num_samples <-1000 #Number of samples to take
# Generating samples and calculating sample means
sample_means <numeric(num_samples)
for (i in 1num_samples) {
sample <- norm(sample_size, mean population_mean, Sd = population_std)
sample_means[i] <- mean(sample)
}
#Calculating the mean of sample means
4
sampling_distribution_mean <- mean(sample_means)
print(paste("Mean of the sampling distribution of the mean", sampling
_distribution_mean))
Output:
Mean of the sampling distribution of the mean: 74.9392512095679
Sampling Distribution of Proportion
Sampling distribution of proportion focuses on proportions in a population. Here, you
select samples and calculate their corresponding proportions. The mean of the sample
proportions across the groups approximates the proportion of the entire population.
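As a sketch of this idea, the following R simulation draws repeated samples from a population with an assumed true proportion of 0.4 and checks that the mean of the sample proportions approximates it; the proportion and sample sizes are illustrative assumptions.

```r
# Sampling distribution of the proportion (population proportion assumed 0.4)
set.seed(1)
true_p      <- 0.4
sample_size <- 100
num_samples <- 1000

# For each sample, record the proportion of "successes" it contains
sample_props <- replicate(num_samples,
                          mean(rbinom(sample_size, size = 1, prob = true_p)))

# The mean of the sample proportions approximates the population proportion
mean(sample_props)
```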
Central Limit Theorem
According to the central limit theorem, if X1, X2, ..., Xn is a random sample of size n
taken from a population with mean µ and variance σ², then the sampling distribution of
the sample mean tends to a normal distribution with mean µ and variance σ²/n as the
sample size becomes large.
This indicates that as the sample size increases, the spread of the sample means
around the population mean decreases, with the standard deviation of the sample means
shrinking proportionally to the square root of the sample size, and the variate Z,
Z = (x - μ)/(σ/√n)
where,
z is z-score
x is Value being Standardized (either an individual data point or the sample mean)
μ is Population Mean
σ is Population Standard Deviation
n is Sample Size
This formula quantifies how many standard deviations a data point (or sample mean) is
away from the population mean. Positive z-scores indicate values above the mean, while
negative z-scores indicate values below the mean. Z follows the normal distribution with
mean 0 and unit variance; that is, the variate Z follows the standard normal distribution.
According to the central limit theorem, the sampling distribution of the sample mean
tends to a normal distribution as the sample size becomes large (n > 30).
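Applying the Z formula to the light-bulb population from earlier (μ = 800 lumens, σ = 20 lumens); the observed sample mean of 820 from a sample of 25 bulbs is an assumption for illustration.

```r
# Z-score of a sample mean under the CLT (sample values are illustrative)
x_bar <- 820   # observed sample mean
mu    <- 800   # population mean
sigma <- 20    # population standard deviation
n     <- 25    # sample size

z <- (x_bar - mu) / (sigma / sqrt(n))
z  # 5: the sample mean lies 5 standard errors above the population mean
```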
Confidence Intervals
A confidence interval (CI) is an interval defined by a lower limit l and an upper limit u,
used to describe possible values of a corresponding true population parameter in light of
observed sample data.
Interpreting a confidence interval therefore allows you to state a "level of
confidence" that the true parameter of interest falls between this lower and upper limit,
often expressed as a percentage.
As such, it is a common and useful tool built directly from the sampling distribution of
the statistic of interest.
The following are the important points to note:
The level of confidence is usually expressed as a percentage, such that we construct a
100 × (1 − α) percent confidence interval, where 0 < α < 1 is an "amount of tail probability".
The three most common intervals are defined with either α = 0.1 (a 90 percent interval),
α = 0.05 (a 95 percent interval), or α = 0.01 (a 99 percent interval).
Confidence intervals may be constructed in different ways depending on the type of
statistic, but they generally combine the statistic itself, a critical value, and the standard
error.
A critical value is the value of the test statistic that defines the upper and lower bounds
of a confidence interval, or that defines the threshold of statistical significance in a
statistical test. The statistic estimates the population parameter, and the standard error is
the standard deviation of the sampling distribution.
The formula for a confidence interval for the mean in a sampling distribution, assuming
a normal distribution or a sufficiently large sample size (central limit theorem), is:
Confidence Interval = Sample Mean ± (Critical Value × Standard Deviation / √Sample Size)
For example, let's say we want to find a 95% confidence interval for the mean weight of
apples sampled from a farm. We collect a sample of 50 apples, measure their weights, and
find the sample mean weight to be 150 grams. Let's assume the population standard
deviation is 10 grams.
Solution:
Given a 95% confidence level and a normal distribution (z-distribution), the critical value
for a 95% confidence interval is 1.96.
Confidence Interval = 150 ± (1.96 × 10/√50) = 150 ± 2.78
Therefore, the 95% confidence interval for the mean weight of apples in the population
would be approximately from 147.22 grams to 152.78 grams.
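The arithmetic of the apple example can be checked directly in R:

```r
# 95% CI for the mean apple weight: n = 50, x-bar = 150 g, sigma = 10 g
sample_mean <- 150
sigma       <- 10
n           <- 50

margin <- 1.96 * sigma / sqrt(n)   # about 2.77
c(lower = sample_mean - margin, upper = sample_mean + margin)
```

The slight difference from ±2.78 comes from rounding the margin of error to two decimal places.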
# Given data
sample_mean <- 2.5
sample_standard_deviation <- 0.8
sample_size <- 500

# Calculate critical value for 95% confidence level
confidence_level <- 0.95
alpha <- 1 - confidence_level
critical_value <- qnorm(1 - alpha / 2)

# Calculate margin of error
margin_of_error <- critical_value * (sample_standard_deviation / sqrt(sample_size))

# Calculate confidence interval
lower_bound <- sample_mean - margin_of_error
upper_bound <- sample_mean + margin_of_error

# Print the confidence interval
cat("The 95% confidence interval for the average time spent on social media is [",
    lower_bound, ",", upper_bound, "]\n")
Output:
The 95% confidence interval for the average time spent on social media is
[2.429878, 2.570122]
a) An Interval for a Mean
A confidence interval for a mean gives us a range of possible values for the population mean.
If a confidence interval does not include a particular value, we can say that it is not likely that
the particular value is the true population mean.
b) An Interval for a Proportion
In order to construct a confidence interval for a population proportion, we must be able to
assume that the sample proportions follow a normal distribution. We can then calculate and
interpret confidence intervals for estimating a population proportion.
Calculating the Margin of Error
The margin of error for a confidence interval with confidence level C for an unknown
population proportion p is
Margin of error = z* × √(p̂ × (1 − p̂)/n)
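A minimal sketch of this formula in R; the sample proportion (0.55) and sample size (200) are assumptions for illustration.

```r
# Margin of error for a population-proportion CI at 95% confidence
p_hat  <- 0.55            # sample proportion (assumed)
n      <- 200             # sample size (assumed)
z_star <- qnorm(0.975)    # critical value for 95% confidence, about 1.96

margin_of_error <- z_star * sqrt(p_hat * (1 - p_hat) / n)
c(lower = p_hat - margin_of_error, upper = p_hat + margin_of_error)
```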
Hypothesis Testing
Hypothesis testing is a tool for making statistical inferences about the population data. It is an
analysis tool that tests assumptions and determines how likely something is within a given
standard of accuracy. Hypothesis testing provides a way to verify whether the results of an
experiment are valid.
A null hypothesis and an alternative hypothesis are set up before performing the hypothesis
testing. This helps to arrive at a conclusion regarding the sample obtained from the
population. In this unit, we will learn more about hypothesis testing, its types, the steps to
perform the testing, and associated examples.
What is Hypothesis Testing in Statistics?
Hypothesis testing uses sample data from the population to draw useful conclusions regarding
the population probability distribution. It tests an assumption made about the data using
different types of hypotheses testing methodologies. The hypothesis testing results in either
rejecting or not rejecting the null hypothesis.
Hypothesis Testing Definition
Hypothesis testing can be defined as a statistical tool that is used to identify if the results of
an experiment are meaningful or not. It involves setting up a null hypothesis and an
alternative hypothesis. These two hypotheses will always be mutually exclusive. This means
that if the null hypothesis is true then the alternative hypothesis is false and vice versa. An
example of hypothesis testing is setting up a test to check if a new medicine works on a
disease in a more efficient manner.
FOUR STEP PROCESS OF HYPOTHESIS TESTING
There are 4 major steps in hypothesis testing:
a) State the hypotheses - This step starts by stating the null and alternative hypotheses;
the null hypothesis is presumed to be true.
b) Formulate an analysis plan and set the criteria for decision- In this step, a significance
level of test is set. The significance level is the probability of a false rejection in a
hypothesis test.
c) Analyze sample data- In this, a test statistic is used to formulate the statistical
comparison between the sample mean and the mean of the population or standard
deviation of the sample and standard deviation of the population.
d) Interpret decision - The value of the test statistic is used to make the decision based on
the significance level. For example, if the significance level is set to 0.1, then the null
hypothesis is rejected when the p-value falls below 0.1. Otherwise, the null hypothesis is
retained as true.
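The four steps above can be sketched as a one-sample z-test in R; the hypothesised mean, population standard deviation, and simulated sample below are all assumptions for illustration.

```r
# Step (a): state the hypotheses, H0: mu = 100 vs HA: mu != 100
mu0   <- 100
sigma <- 15    # population standard deviation (assumed known)

# Step (b): set the significance level
alpha <- 0.05

# Step (c): analyse sample data and compute the test statistic
set.seed(7)
x <- rnorm(40, mean = 115, sd = sigma)   # simulated sample of size 40
z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
p_value <- 2 * pnorm(-abs(z))            # two-tailed p-value

# Step (d): interpret the decision at the chosen significance level
decision <- if (p_value < alpha) "Reject H0" else "Retain H0"
decision
```

Because the simulated sample is drawn from a population whose mean differs substantially from mu0, the test should reject H0 here.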
COMPONENTS OF A HYPOTHESIS TEST
1) Hypotheses
As the name would suggest, in hypothesis testing, formally stating a claim and the subsequent
hypothesis test is done with a null and an alternative hypothesis.
The null hypothesis is interpreted as the baseline or no change hypothesis and is the claim
that is assumed to be true.
Null Hypothesis
The null hypothesis is a concise mathematical statement used to indicate that there is
no difference between two possibilities. In other words, there is no difference between certain
characteristics of the data. This hypothesis assumes that the outcomes of an experiment are
based on chance alone. It is denoted as H0. Hypothesis testing is used to conclude whether the
null hypothesis can be rejected or not. Suppose an experiment is conducted to check if girls
are shorter than boys at the age of 5. The null hypothesis will say that they are the same height.
Alternative Hypothesis
The alternative hypothesis is an alternative to the null hypothesis. It is used to show that the
observations of an experiment are due to some real effect. It indicates that there is a statistical
significance between two possible outcomes and can be denoted as H1 or HA. For the above-
mentioned example, the alternative hypothesis would be that girls are shorter than boys at the
age of 5.
In general, null and alternative hypotheses are denoted H0 and HA,
respectively, and they are written as follows:
H0: ….
HA: …..
When HA is defined in terms of a less-than statement, with <, it is one-sided; this is also
called a lower-tailed test.
When HA is defined in terms of a greater-than statement, with >, it is one-sided; this is also
called an upper-tailed test.
When HA is merely defined in terms of a different-to statement, with ≠, it is two-sided; this is
also called a two-tailed test.
2) Test Statistic
Once the hypotheses are formed, sample data are collected, and statistics are calculated
according to the parameters detailed in the hypotheses.
The test statistic is the statistic that's compared to the appropriate standardized sampling
distribution to yield the p-value.
A test statistic is typically a standardized or rescaled version of the sample statistic of interest.
3) Hypothesis Testing P Value
In hypothesis testing, the p-value is used to indicate whether the results obtained after
conducting a test are statistically significant or not. It also indicates the probability of making
an error in rejecting or not rejecting the null hypothesis. This value is always a number
between 0 and 1. The p-value is compared to an alpha level, α, or significance level. The
alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis.
The alpha level is usually chosen between 1% and 5%.
Hypothesis Testing Critical region
All sets of values that lead to rejecting the null hypothesis lie in the critical region.
Furthermore, the value that separates the critical region from the non-critical region is known
as the critical value.
Hypothesis Testing Formula
Depending upon the type of data available and its size, different types of hypothesis tests
are used to determine whether the null hypothesis can be rejected or not. The hypothesis-
testing formulas for some important test statistics are given below:
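The formulas themselves are not reproduced in this extract; as a sketch, the two most common mean-based test statistics can be computed as follows (the sample values and the assumed σ are illustrative).

```r
# One-sample test statistics for H0: mu = 12 (sample values are assumed)
x   <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)
mu0 <- 12

# z-test statistic: population sigma known (assumed sigma = 0.25)
z <- (mean(x) - mu0) / (0.25 / sqrt(length(x)))

# t-test statistic: population sigma unknown, use the sample sd instead
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

c(z = z, t = t_stat)
```

The z statistic is compared with the standard normal distribution, while the t statistic is compared with a t-distribution with n − 1 degrees of freedom.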
4) Significance Level
For every hypothesis test, a significance level, denoted α, is assumed.
This is used to qualify the result of the test.
The significance level defines a cutoff point at which you decide whether there is
sufficient evidence to view H0 as incorrect and favor HA instead.
If the p-value is greater than or equal to α, then you conclude there is insufficient
evidence against the null hypothesis, and therefore you retain H0 when compared to HA.
If the p-value is less than α, then the result of the test is statistically significant. This
implies there is sufficient evidence against the null hypothesis, and therefore you reject
H0 in favor of HA.
Common or conventional values of α are α = 0.1, α = 0.05, and α = 0.01.