
UNIT – 4

Statistical testing and modelling


Statistical testing and modelling, sampling distributions, hypothesis testing, components
of hypothesis test, testing means, testing proportions, testing categorical variables, errors
and power, Analysis of variance.

Statistical testing and modelling


Statistical testing:
 Statistical testing involves analyzing data to make decisions about a population based
on a sample.
 It helps to determine whether observed differences or relationships in the data are
statistically significant or due to random chance.
 Example: Suppose a company develops a new method for producing light bulbs. It
collects and compares data on the lifespan of bulbs produced using the new method
against the lifespan of bulbs produced using the old method. By running a hypothesis test
(such as a t-test or ANOVA), it can statistically assess whether the mean lifespan of bulbs
produced using the new method differs substantially from that of bulbs produced using
the old method. If the test reveals a considerable difference in favour of the new
procedure, the company can confidently implement it, knowing that it results in longer-
lasting bulbs.

Modelling:
Statistical modelling involves creating mathematical representations of relationships in data.
Example: Predicting house prices based on factors like square footage, location, and number
of bedrooms using regression analysis is a statistical modelling task. The model helps
estimate how these factors influence the price and make predictions for new houses.
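As a sketch of such a model, a linear regression can be fitted in R with lm(). The simulated data, variable names, and coefficient values below are illustrative assumptions, not values from the text:

```r
# Illustrative house-price regression on simulated data (hypothetical numbers)
set.seed(1)
sqft <- runif(100, 500, 3000)                 # square footage
bedrooms <- sample(2:5, 100, replace = TRUE)  # number of bedrooms
price <- 50000 + 120 * sqft + 8000 * bedrooms + rnorm(100, sd = 20000)
model <- lm(price ~ sqft + bedrooms)          # fit the statistical model
print(summary(model)$coefficients)            # estimated influence of each factor
# Predict the price of a new house
print(predict(model, newdata = data.frame(sqft = 1800, bedrooms = 3)))
```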

Sampling distributions:
Sampling distribution is also known as a finite-sample distribution. Sampling distribution is
the probability distribution of a statistic based on random samples of a given population. It
represents the distribution of frequencies on how spread apart various outcomes will be for a
specific population.
Since the population is too large to analyse, you can select a smaller group and repeatedly
sample or analyse it. The gathered data, or statistic, is used to calculate the likely occurrence,
or probability, of an event.
Example: Consider a factory that manufactures light bulbs. The firm strives to keep its bulbs
at a consistent level of brightness. All bulbs manufactured have a normal distribution of
brightness levels, with a mean (μ) of 800 lumens and a standard deviation (σ) of 20 lumens.

There are a few steps involved with sampling distribution


These include:
1) Select a random sample of a specific size from a given population.
2) Calculate a statistic for the sample, such as the mean, median, or standard deviation.
3) Develop a frequency distribution of each sample statistic that you calculated from the step
above.

4) Plot the frequency distribution of each sample statistic that you developed from the step
above. The resulting graph will be the sampling distribution.

Types of Sampling Distributions

1) Sampling Distribution of the Mean:


This method shows a normal distribution in which the centre is the mean of the sampling
distribution. As such, it represents the mean of the overall population. To reach this point,
the researcher must compute the mean of each sample group and map out the individual
data.

2) Sampling Distribution of Proportion:


This method involves choosing a sample set from the overall population to get the proportion
of the sample. The mean of the proportions ends up becoming the proportions of the larger
group.

3) T-Distribution:
This type of sampling distribution is common in cases of small sample sizes. It may also be
used when there is very little information about the entire population. T-distributions are used
to make estimates about the mean and other statistical points.
The central “balance” point of a sampling distribution is its mean, and the standard deviation
of a sampling distribution is referred to as the standard error.
The theoretical formulas for various sampling distributions therefore depend upon
 The original probability distributions that are assumed to have generated the raw data
 The size of the sample itself.
Note:
 The central balance point of a sampling distribution is its mean.
 The standard deviation of a sampling distribution is referred to as the standard error.

1. DISTRIBUTION FOR A SAMPLE MEAN:


 The sampling distribution of the mean represents the distribution of sample means taken
from a population.
 It helps understand how sample means vary and approach the population mean as sample
size increases, following the Central Limit Theorem.
 Mathematically, the variability inherent in an estimated sample mean is described as
follows. Formally, denote the random variable of interest as X̄. This represents the mean
of a sample of n observations x1, x2, …, xn from the raw observation random variable X.
 Those observations are assumed to have a true finite mean −∞ < μX < ∞ and a true finite
standard deviation 0 < σX < ∞.
 The conditions for finding the probability distribution of a sample mean vary depending
on whether you know the value of the standard deviation.
Situation 1: Standard Deviation Known
 When the true value of the standard deviation σX is known, then the following are true:

 If X itself is normal, the sampling distribution of X̄ is a normal distribution, with mean μX
and standard error σX/√n.
 If X is not normal, the sampling distribution of X̄ is still approximately normal with mean
μX and standard error σX/√n, and this approximation improves arbitrarily as n → ∞. This
is known as the central limit theorem (CLT).

Situation 2: Standard Deviation Unknown


 In this situation, the standard deviation of the raw measurement distribution that
generated your sample data is unknown.
 In this eventuality, it is usual to simply replace σX with sX, which is the standard
deviation of the sampled data.
 Standardized values of the sampling distribution of X̄ follow a t-distribution with ν = n − 1
degrees of freedom; standardization is performed using the standard error sX/√n.
 In addition, if n is small, then it is necessary to assume the distribution of X is normal for
the validity of this t-based sampling distribution of X̄.
 The nature of the sampling distribution of X̄ therefore depends upon whether the true
standard deviation of the observations is known, as well as on the sample size n.
 The CLT states that approximate normality occurs even if the raw observation
distribution is itself not normal, but this approximation is less reliable if n is small; it is a
common rule of thumb to rely on the CLT only if n ≥ 30.
 If sX, the sample standard deviation, is used to calculate the standard error of X̄, then the
sampling distribution is the t-distribution (following standardization). Again, this is
generally taken to be reliable only if n ≥ 30.
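The two situations can be contrasted in R by comparing the normal and t critical values for a 95% interval; the sample size of 15 is an arbitrary assumption:

```r
# Critical values for a 95% interval: sigma known (z) vs. sigma unknown (t)
n <- 15                          # arbitrary small sample size
z_crit <- qnorm(0.975)           # normal critical value, about 1.96
t_crit <- qt(0.975, df = n - 1)  # t critical value with n - 1 degrees of freedom
print(c(z = z_crit, t = t_crit)) # the t value is larger, giving wider intervals
```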

Example:
# Population parameters
population_mean <- 75
population_std <- 10
sample_size <- 25
num_samples <- 1000 # Number of samples to take
# Generating samples and calculating sample means
sample_means <- numeric(num_samples)
for (i in 1:num_samples) {
sample <- rnorm(sample_size, mean = population_mean, sd = population_std)
sample_means[i] <- mean(sample)
}
# Calculating the mean of sample means
sampling_distribution_mean <- mean(sample_means)
print(paste("Mean of the sampling distribution of the mean:", sampling_distribution_mean))
Output:
"Mean of the sampling distribution of the mean: 74.9392512095679"
Sampling Distribution of Proportion
Sampling distribution of proportion focuses on proportions in a population. Here, you
select samples and calculate their corresponding proportions. The means of the sample
proportions from each group represent the proportion of the entire population.

The formula for the sampling distribution of a proportion (often denoted as p̂) is:
p̂ = x/n
where:
 p̂ is Sample Proportion
 x is Number of "successes" or occurrences of Event of Interest in Sample
 n is Sample Size
This formula calculates the proportion of occurrences of a certain event (e.g., success,
positive outcome) within a sample.
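The sampling distribution of a proportion can be simulated in R; the true proportion of 0.6 and sample size of 100 below are illustrative assumptions:

```r
# Simulate the sampling distribution of p-hat = x / n (assumed true p = 0.6)
set.seed(1)
p_true <- 0.6
n <- 100             # sample size
num_samples <- 1000  # number of repeated samples
p_hats <- rbinom(num_samples, size = n, prob = p_true) / n
print(mean(p_hats))  # the mean of the sample proportions is close to p
```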
T-Distribution
This sampling distribution arises with a small population or a population about which you
don't know much. It is used to estimate the mean of the population and other statistics such as
confidence intervals, statistical differences and linear regression. The t-distribution uses a t-
score to evaluate data that wouldn't be appropriate for a normal distribution. The formula for
the t-score, denoted as t, is:
t = [x - μ] / [s /√(n)]
where:
 x is Sample Mean
 μ is Population Mean (or an estimate of it)
 s is Sample Standard Deviation
 n is Sample Size
This formula calculates the difference between the sample mean and the population mean,
scaled by the standard error of the sample mean. The t-score helps to assess whether the
observed difference between the sample and population means is statistically significant.
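As a numerical sketch of this formula, with illustrative values (a sample mean of 152 against a hypothesized population mean of 150, not numbers from the text):

```r
# t-score for a sample mean (illustrative numbers)
x_bar <- 152  # sample mean
mu <- 150     # population mean (or an estimate of it)
s <- 10       # sample standard deviation
n <- 50       # sample size
t_score <- (x_bar - mu) / (s / sqrt(n))
print(t_score)  # about 1.41 standard errors above the hypothesized mean
```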
Central Limit Theorem [CLT]
Central Limit Theorem is the most important theorem of Statistics.

According to the central limit theorem, if X1, X2, ..., Xn is a random sample of size n
taken from a population with mean µ and variance σ² then the sampling distribution of
the sample mean tends to the normal distribution with mean µ and variance σ²/n as the
sample size becomes large.
This indicates that as the sample size increases, the spread of the sample means
around the population mean decreases, with the standard deviation of the sample means
shrinking proportionally to the square root of the sample size. The standardized variate Z is
Z = (x - μ)/(σ/√n)
where,
 z is z-score
 x is Value being Standardized (either an individual data point or the sample mean)
 μ is Population Mean
 σ is Population Standard Deviation
 n is Sample Size
This formula quantifies how many standard deviations a data point (or sample mean) is
away from the population mean. Positive z-scores indicate values above the mean, while
negative z-scores indicate values below the mean. The variate Z follows the normal
distribution with mean 0 and unit variance; that is, Z follows the standard normal distribution.
According to the central limit theorem, the sampling distribution of the sample mean
tends to the normal distribution as the sample size becomes large (n > 30).
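The theorem can be illustrated in R by sampling from a skewed distribution; the exponential(1) population and n = 30 below are illustrative assumptions:

```r
# CLT sketch: means of samples from a skewed exponential(1) population
set.seed(42)
n <- 30             # sample size (rule-of-thumb threshold)
num_samples <- 2000
means <- replicate(num_samples, mean(rexp(n, rate = 1)))
# The exponential(1) population has mean 1 and variance 1, so the sample
# means should centre near 1 with standard deviation near 1/sqrt(30)
print(c(mean = mean(means), sd = sd(means), expected_sd = 1 / sqrt(n)))
```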
Confidence Intervals
 A confidence interval (CI) is an interval defined by a lower limit l and an upper limit u,
used to describe possible values of a corresponding true population parameter in light of
observed sample data.
 Interpretation of a confidence interval therefore allows you to state a "level of
confidence" that a true parameter of interest falls between this upper and lower limit,
often expressed as a percentage.
 As such, it is a common and useful tool built directly from the sampling distribution of
the statistic of interest.
 The following are the important points to note:
 The level of confidence is usually expressed as a percentage, such that we'd construct a
100 × (1 − α) percent confidence interval, where 0 < α < 1 is an "amount of tail probability".
 The three most common intervals are defined with either α = 0.1 (a 90 percent interval),
α = 0.05 (a 95 percent interval), or α = 0.01 (a 99 percent interval).
 Confidence intervals may be constructed in different ways, depending on the type of
statistic; each construction uses the statistic itself, a critical value, and a standard error.

 A critical value is the value of the test statistic that defines the upper and lower bounds
of a confidence interval, or that defines the threshold of statistical significance in a
statistical test. The standard error is the standard deviation of the sampling distribution of
the statistic used to estimate the population parameter.

The formula for a confidence interval for the mean in a sampling distribution, assuming
a normal distribution or a sufficiently large sample size (Central Limit Theorem), is
Confidence Interval = Sample Mean ± (Critical Value × Standard Deviation / √Sample Size)
For example, let's say we want to find a 95% confidence interval for the mean weight of
apples sampled from a farm. We collect a sample of 50 apples, measure their weight, and find
the sample mean weight to be 150 grams. Let's assume the population standard deviation is
10 grams.
Solution:
Given a 95% confidence level and a normal distribution (z-distribution) with a critical value
of 1.96 for a 95% confidence interval:
Confidence Interval = 150 ± (1.96 × 10/√50) = 150 ± 2.77

Therefore, the 95% confidence interval for the mean weight of apples in the population
would be approximately from 147.23 grams to 152.77 grams.
# Given data
sample_mean <- 2.5
sample_standard_deviation <- 0.8
sample_size <- 500
# Calculate critical value for 95% confidence level
confidence_level <- 0.95
alpha <- 1 - confidence_level
critical_value <- qnorm(1 - alpha / 2)
# Calculate margin of error
margin_of_error <- critical_value * (sample_standard_deviation / sqrt(sample_size))
# Calculate confidence interval
lower_bound <- sample_mean - margin_of_error
upper_bound <- sample_mean + margin_of_error
# Print the confidence interval
cat("The 95% confidence interval for the average time spent on social media is [",
lower_bound, ",", upper_bound, "]\n")

Output: The 95% confidence interval for the average time spent on social media is
[ 2.429878 , 2.570122 ]
a) An Interval for a Mean
A confidence interval for a mean gives us a range of possible values for the population mean.
If a confidence interval does not include a particular value, we can say that it is not likely that
the particular value is the true population mean.
b) An Interval for a Proportion
In order to construct a confidence interval for a population proportion, we must be able to
assume the sample proportions follow a normal distribution. We then calculate and interpret
confidence intervals for estimating the population proportion.
Calculating the Margin of Error
The margin of error for a confidence interval with confidence level C for an unknown
population proportion p is
Margin of error = z* × √(p̂(1 − p̂)/n)
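This margin-of-error formula can be sketched in R; the counts (130 successes in 200 trials) are illustrative assumptions:

```r
# 95% confidence interval for a proportion (hypothetical counts)
x <- 130  # number of successes
n <- 200  # sample size
p_hat <- x / n                   # sample proportion, 0.65
z_star <- qnorm(0.975)           # critical value for 95% confidence
margin <- z_star * sqrt(p_hat * (1 - p_hat) / n)
cat("95% CI: [", p_hat - margin, ",", p_hat + margin, "]\n")
```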
Hypothesis Testing
Hypothesis testing is a tool for making statistical inferences about the population data. It is an
analysis tool that tests assumptions and determines how likely something is within a given
standard of accuracy. Hypothesis testing provides a way to verify whether the results of an
experiment are valid.
A null hypothesis and an alternative hypothesis are set up before performing the hypothesis
testing. This helps to arrive at a conclusion regarding the sample obtained from the
population. The sections below cover the steps of hypothesis testing and the components of a
hypothesis test.
What is Hypothesis Testing in Statistics?
Hypothesis testing uses sample data from the population to draw useful conclusions regarding
the population probability distribution. It tests an assumption made about the data using
different hypothesis-testing methodologies. The hypothesis testing results in either
rejecting or not rejecting the null hypothesis.
Hypothesis Testing Definition
Hypothesis testing can be defined as a statistical tool that is used to identify if the results of
an experiment are meaningful or not. It involves setting up a null hypothesis and an
alternative hypothesis. These two hypotheses will always be mutually exclusive. This means
that if the null hypothesis is true then the alternative hypothesis is false and vice versa. An
example of hypothesis testing is setting up a test to check if a new medicine works on a
disease in a more efficient manner.
FOUR STEP PROCESS OF HYPOTHESIS TESTING
There are 4 major steps in hypothesis testing:
a) State the hypotheses- This step starts by stating the null and alternative hypotheses; the
null hypothesis is presumed to be true.
b) Formulate an analysis plan and set the criteria for decision- In this step, a significance
level for the test is set. The significance level is the probability of a false rejection in a
hypothesis test.
c) Analyze sample data- In this, a test statistic is used to formulate the statistical
comparison between the sample mean and the mean of the population or standard
deviation of the sample and standard deviation of the population.

d) Interpret decision- The value of the test statistic is used to make the decision based on
the significance level. For example, if the significance level is set at 0.1, the null
hypothesis is rejected when the p-value falls below 0.1; otherwise, the null hypothesis is
retained.
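The four steps above can be sketched in R with a one-sample t-test; the simulated data, hypothesized mean of 100, and significance level are illustrative assumptions:

```r
# Step (a): state H0: mu = 100 versus HA: mu != 100
# Step (b): set the significance level
alpha <- 0.05
# Step (c): analyze sample data (simulated here) with a t-test
set.seed(7)
data <- rnorm(40, mean = 103, sd = 8)
result <- t.test(data, mu = 100)
# Step (d): interpret the decision by comparing the p-value with alpha
if (result$p.value < alpha) {
  cat("Reject H0 (p-value =", result$p.value, ")\n")
} else {
  cat("Retain H0 (p-value =", result$p.value, ")\n")
}
```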
COMPONENTS OF A HYPOTHESIS TEST
1) Hypotheses
As the name would suggest, in hypothesis testing a claim is formally stated, and the
subsequent hypothesis test is carried out with a null and an alternative hypothesis.
The null hypothesis is interpreted as the baseline or no-change hypothesis and is the claim
that is assumed to be true.

 Null Hypothesis
The null hypothesis is a concise mathematical statement that is used to indicate that there is
no difference between two possibilities. In other words, there is no difference between certain
characteristics of data. This hypothesis assumes that the outcomes of an experiment are based
on chance alone. It is denoted as H0. Hypothesis testing is used to conclude whether the null
hypothesis can be rejected or not. Suppose an experiment is conducted to check if girls are
shorter than boys at the age of 5. The null hypothesis will say that they are the same height.
 Alternative Hypothesis
The alternative hypothesis is an alternative to the null hypothesis. It is used to show that the
observations of an experiment are due to some real effect. It indicates that there is a statistical
significance between two possible outcomes and is denoted as H1 or HA. For the above-
mentioned example, the alternative hypothesis would be that girls are shorter than boys at the
age of 5.
In general, null and alternative hypotheses are denoted H0 and HA,
respectively, and they are written as follows:
H0: ….
HA: …..
When HA is defined in terms of a less-than statement, with <, it is one-sided; this is also
called a lower-tailed test.
When HA is defined in terms of a greater-than statement, with >, it is one-sided; this is also
called an upper-tailed test.
When HA is merely defined in terms of a different-to statement, with ≠, it is two-sided; this is
also called a two-tailed test.
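In R, the alternative argument of t.test() selects among these three forms of HA; the data below are simulated for illustration:

```r
# Lower-, upper-, and two-tailed tests of H0: mu = 5 on simulated data
set.seed(3)
x <- rnorm(30, mean = 5.2, sd = 1)
lower <- t.test(x, mu = 5, alternative = "less")       # HA: mu < 5
upper <- t.test(x, mu = 5, alternative = "greater")    # HA: mu > 5
two   <- t.test(x, mu = 5, alternative = "two.sided")  # HA: mu != 5
print(c(lower = lower$p.value, upper = upper$p.value, two.sided = two$p.value))
```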
2) Test Statistic
Once the hypotheses are formed, sample data are collected, and statistics are calculated
according to the parameters detailed in the hypotheses.

The test statistic is the statistic that's compared to the appropriate standardized sampling
distribution to yield the p-value.
A test statistic is typically a standardized or rescaled version of the sample statistic of interest.
3) Hypothesis Testing P Value
In hypothesis testing, the p-value is used to indicate whether the results obtained after
conducting a test are statistically significant or not. It also indicates the probability of making
an error in rejecting or not rejecting the null hypothesis. This value is always a number
between 0 and 1. The p-value is compared to an alpha level, α, or significance level. The
alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis.
The alpha level is usually chosen between 1% and 5%.
Hypothesis Testing Critical region
All sets of values that lead to rejecting the null hypothesis lie in the critical region.
Furthermore, the value that separates the critical region from the non-critical region is known
as the critical value.
Hypothesis Testing Formula
Depending upon the type of data available and the size, different types of hypothesis testing
are used to determine whether the null hypothesis can be rejected or not. The hypothesis
testing formula for some important test statistics are given below:

4) Significance Level
 For every hypothesis test, a significance level, denoted α, is assumed.
 This is used to qualify the result of the test.
 The significance level defines a cutoff point, at which you decide whether there is
sufficient evidence to view H0 as incorrect and favor HA instead.
 If the p-value is greater than or equal to α, then you conclude there is insufficient
evidence against the null hypothesis, and therefore you retain H0 when compared to HA.
 If the p-value is less than α, then the result of the test is statistically significant. This
implies there is sufficient evidence against the null hypothesis, and therefore you reject
H0 in favor of HA.
 Common or conventional values of α are α = 0.1, α = 0.05, and α = 0.01.
