Unit-5 Biostatistics Descriptive
Population:
A population in statistics refers to the entire set of individuals or items that the researcher is
interested in studying. For example, if a researcher is studying the income of households in Karachi, the
population would be all households in the city, and their corresponding incomes would be the
measurements of interest. Because it is often impractical to gather data from every single unit in the
population, researchers typically select a sample (a smaller, representative subset of the population)
and use the sample data to estimate population characteristics, such as the average income. These
estimates rely on statistical principles like the sampling distribution and the Central Limit Theorem,
which help researchers draw conclusions about the population based on sample data.
Sample:
A sample is a subset of a population that is selected for study, and it is used to make inferences or
estimates about the entire population. When researchers want to estimate certain characteristics (such
as the average income) of a population, they typically select a random sample. A random sample is
chosen in such a way that every member of the population has an equal chance of being selected,
which helps ensure that the sample is a good representative of the population. This reduces the risk of
bias and increases the likelihood that the sample's characteristics reflect the true characteristics of the
population, allowing researchers to make accurate generalizations.
Unit 5: The sampling distribution and the Central Limit Theorem
Sampling Distribution
A sampling distribution is the probability distribution of a statistic obtained from repeated random
samples of a given size drawn from a population. Also known as a finite-sample distribution, it
describes how the values of the statistic vary from one sample to the next.
The sampling distribution depends on multiple factors: the statistic, the sample size, the sampling
process, and the overall population. A sampling distribution can be constructed for statistics such as
the mean, range, variance, and standard deviation of a sample.
T-distribution
The t-distribution is used when the sample size is very small or not much is known about the population.
It is used to estimate the mean of the population, confidence intervals, statistical differences, and linear
regression. (The t-distribution is not needed for this unit.)
Practical Example
Suppose you want to find the average height of children at the age of 10 from each continent. You
take random samples of 100 children from each continent, and you compute the mean for each sample
group.
For example, in South America, you randomly select data about the heights of 10-year-old children, and
you calculate the mean for 100 of the children. You also randomly select data from North America and
calculate the mean height for one hundred 10-year-old children.
As you continue to find the average heights for each sample group of children from each continent, you
can calculate the mean of the sampling distribution by finding the mean of all the average heights of
each sample group. Not only can it be computed for the mean, but it can also be calculated for other
statistics such as standard deviation and variance.
A parameter is a descriptive characteristic of a population, such as the population mean. A statistic,
on the other hand, is a descriptive characteristic that summarizes the data from a sample. It
is an estimate of the corresponding population parameter. For example, if a researcher takes a random
sample of households in Karachi and calculates the average income of the sampled households, this
sample mean is a statistic. The sample mean serves as an estimate of the population mean, but it will
likely differ from the true population mean due to sampling variability.
So, parameters describe populations, while statistics describe samples, and sample statistics (like
the sample mean) are used to estimate population parameters (like the population mean).
Parameter (population)               Statistic (sample estimate)
Population mean (μ)                  Sample mean (x̄)
Population standard deviation (σ)    Sample standard deviation (s)
The sampling distribution of the mean refers to the probability distribution of the sample mean
calculated from multiple random samples taken from the population. Because no two samples are
exactly the same, each sample will likely have a slightly different sample mean. Therefore, the sample
mean itself is a random variable, and like all random variables, it has its own distribution.
Key points about the sampling distribution of the mean:
1. Variability of the Sample Mean: Each sample will produce a different sample mean, meaning that
the sample mean will vary from sample to sample. This variability is due to the randomness of the
sampling process.
2. Random Variable: Since the sample mean is a random variable, it has a distribution that can be
described. This distribution is known as the sampling distribution of the sample mean.
3. Shape of the Distribution: According to the Central Limit Theorem (CLT), if the sample size n is
sufficiently large, the sampling distribution of the sample mean will approximate a normal
distribution, regardless of the shape of the population distribution (as long as the population has a
finite variance). If the sample size is large enough (typically n ≥ 30), we can assume
that the sample means follow a normal distribution.
4. Properties of the Normal Distribution: If the sampling distribution of the sample mean is normal,
we can use the properties of the normal distribution to compute probabilities. For instance, we can
calculate the likelihood that the sample mean will fall within a certain range or estimate how far
the sample mean is likely to differ from the population mean.
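The key points above can be checked with a small simulation. The sketch below is only an illustration, not part of the original notes: it assumes a skewed (exponential) population with mean 1 and standard deviation 1, and the function name `simulate_sample_means` is hypothetical. It draws many samples of size 30 and summarizes their sample means.

```python
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples random samples of size n from an exponential
    population (mean 1, SD 1) and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means(n=30, num_samples=5000)
mean_of_means = statistics.fmean(means)   # expected to be close to mu = 1
sd_of_means = statistics.pstdev(means)    # expected to be close to sigma / sqrt(n)
theoretical_se = 1 / 30 ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

Even though the assumed population is strongly skewed, the sample means cluster around the population mean, and their spread is close to σ/√n.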
Interpretation: On average, the sample mean is an unbiased estimator of the population mean,
meaning that repeated sampling will not systematically overestimate or underestimate the true
population mean.
Interpretation: As the sample size n increases, the standard error decreases, meaning that the
sample mean becomes a more precise estimate of the population mean. This is why larger samples tend
to yield more reliable estimates of the population parameters.
So:
The Mean of the sampling distribution of the sample mean is equal to the population mean.
The Standard Deviation of the sampling distribution (also called the standard error) is smaller than the
population standard deviation and is given by σ/√n. The larger the sample size n, the smaller the
standard error, meaning more precise estimates of the population mean.
These properties allow us to understand the behavior of sample means and to make statistical
inferences about the population, such as estimating the population mean and determining how likely it
is that a sample mean falls within a certain range.
The standard deviation of the sample means is called the standard error of the mean.
σ_x̄ = σ/√n
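As a quick illustration of the standard error formula, the sketch below evaluates σ/√n for a few sample sizes. The value σ = 10 and the helper name `standard_error` are assumptions chosen only for the example.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)

# Assumed population standard deviation of 10 (illustrative only).
for n in (4, 16, 64, 100):
    print(n, standard_error(10.0, n))
```

Quadrupling the sample size halves the standard error, which is why larger samples give more precise estimates of the population mean.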
Example 1
The sample mean, x̄, is to be calculated from a random sample of size 2 taken from a population
consisting of the five values ($2, $3, $4, $5, $6).
Find the sampling distribution of x̄ based on a sample of size 2. First note that the population mean (μ)
is: (2+3+4+5+6)/5 = $4
Solution:
1. Population Information:
The population consists of the following values:
2,3,4,5,6
To calculate the population mean (μ):
μ = (2+3+4+5+6)/5 = 20/5 = 4
So, the population mean is μ = 4.
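The 10 possible samples of size 2 (drawn without replacement, ignoring order) give the sample means
2.5, 3, 3.5, 4, 4.5, 5, and 5.5, with frequencies 1, 1, 2, 2, 2, 1, and 1. The mean of these sample means is:
μ_x̄ = [(2.5×1)+(3×1)+(3.5×2)+(4×2)+(4.5×2)+(5×1)+(5.5×1)]/10
μ_x̄ = (2.5+3+7+8+9+5+5.5)/10
μ_x̄ = 40/10 = 4
Thus, the mean of the sampling distribution is μ_x̄ = 4.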
So;
As expected, the mean of the sampling distribution of the sample mean equals the population mean
(μ=4), and we have also calculated the variance and standard deviation of the sampling distribution.
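The calculation in Example 1 can be reproduced by listing every possible sample. The sketch below assumes that samples are drawn without replacement and that order is ignored, which gives the 10 samples counted in the example; the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations
import statistics

population = [2, 3, 4, 5, 6]

# Every possible sample of size 2 (without replacement, order ignored).
samples = list(combinations(population, 2))
sample_means = [sum(s) / len(s) for s in samples]

distribution = Counter(sample_means)                     # frequency of each sample mean
mean_of_means = statistics.fmean(sample_means)           # mean of the sampling distribution
variance_of_means = statistics.pvariance(sample_means)   # variance of the sample means
sd_of_means = variance_of_means ** 0.5                   # standard error

for value in sorted(distribution):
    print(value, distribution[value])
print(mean_of_means, variance_of_means, round(sd_of_means, 3))
```

Because the samples are drawn without replacement from a very small population, the resulting standard deviation of the sample means is smaller than σ/√n. The σ/√n formula assumes sampling with replacement or a population much larger than the sample.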
Percentile
In statistics, a percentile is a term that describes how a score compares to other scores from the same
set.
Percentiles are position measures used in education and health-related fields to indicate the position of
an individual in a group.
• Percentiles divide the data set into 100 equal groups. About n% of the data lie at or below the nth
percentile, and about (100-n)% of the data lie above it. E.g. the 90th percentile indicates
that about 90% of the data lie at or below it, and about 10% of the data lie above it.
For example:
If a test score is in the 90th percentile, it means that the score is higher than 90% of all the other scores,
and only 10% of the scores are higher.
Percentile Rank:
This refers to the percentage of scores in a distribution that fall below a particular score.
Percentile Rank tells you how a specific data point compares to all the other points in the dataset.
Percentile Value:
The percentile value is the actual value or observation in the data set that corresponds to a given
percentile.
For example, the 25th percentile (often called Q1, the first quartile) is the value below which 25% of
the data points fall.
Percentiles divide a data set into 100 equal parts, so there are 100 percentiles (each representing 1% of
the data). For example:
The 1st percentile is the value below which 1% of the data points fall.
The 25th percentile is the value below which 25% of the data points fall (this is also called the first
quartile or Q1).
The 50th percentile is the median value, where 50% of the data points fall below it (and 50% are
above it).
The 75th percentile is the value below which 75% of the data points fall (this is the third quartile or
Q3).
The 100th percentile is the maximum value in the data set.
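The difference between a percentile rank and a percentile value can be sketched in code. The scores below are hypothetical, and `percentile_rank` is an illustrative helper that follows the definition above (percentage of scores below a given score). The quartiles use Python's `statistics.quantiles`; other software may interpolate slightly differently.

```python
import statistics

def percentile_rank(data, score):
    """Percentage of values in data that fall strictly below score."""
    below = sum(1 for x in data if x < score)
    return 100.0 * below / len(data)

# Hypothetical test scores (illustrative only).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Percentile rank: start from a score, get a percentage.
rank_of_90 = percentile_rank(scores, 90)   # 7 of the 10 scores are below 90

# Percentile values: start from a percentage, get a score.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(rank_of_90)
print(q1, q2, q3)   # 25th, 50th (median), and 75th percentile values
```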
Normal Distribution
The normal distribution is one of the most fundamental concepts in statistics, as it is widely used in a
variety of fields, including psychology, economics, biology, and social sciences. It describes how data
values are distributed in many real-world scenarios, especially when the data is symmetrically
distributed around the mean.
1. Shape: The normal distribution is a bell-shaped curve that is symmetrical around the mean. It is
sometimes referred to as the Gaussian distribution.
2. Symmetry: The distribution is perfectly symmetrical around the mean. This means that the left half
of the distribution mirrors the right half. Therefore, the mean, median, and mode of a normal
distribution are all equal and located at the center.
4. Standard Normal Distribution: A standard normal distribution is a normal distribution with a mean
of 0 and a standard deviation of 1. The Z-score formula is used to standardize data points in any
normal distribution to fit the standard normal distribution:
Z = (X − μ)/σ
Where:
Z is the Z-score (how many standard deviations X is from the mean),
X is the value of the data point,
μ is the mean,
σ is the standard deviation.
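A minimal sketch of the Z-score formula follows. The function name `z_score` is illustrative, and the example values (mean 76, standard deviation 4) are taken from the tutorial at the end of this unit.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

print(z_score(78, 76, 4))   # a score above the mean gives a positive Z
print(z_score(67, 76, 4))   # a score below the mean gives a negative Z
```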
1. Bell-shaped Curve:
The curve is symmetrical, with the highest point at the mean (μ), and it tapers off towards
both ends. The two tails never touch the horizontal axis, but they get infinitely close to it.
2. Asymptotic:
The tails of the normal distribution extend infinitely in both directions, approaching but never
actually touching the horizontal axis. This indicates that extreme values (both high and low) are still
possible, but less likely.
Examples
Suppose Z has a standard normal distribution.
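a) Find the 84th percentile of this distribution.
[Figure: standard normal curve with area 0.50 to the left of 0 and area 0.34 between 0 and z, for a total of 84% below z = ?]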
Examples (continued)
b) Find the 50th percentile, or the median, of the standard normal distribution.
[Figure: standard normal curve with area 0.5 on each side of z = 0]
Examples
c) Find the 16th percentile of this distribution.
[Figure: standard normal curve with area 0.34 between z = ? and 0, and area 0.50 to the right of 0]
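The three standard normal examples can be checked numerically. The areas in the figures (0.50 and 0.34) come from the rule that about 68% of a normal distribution lies within one standard deviation of the mean, so the expected answers are roughly z = 1, z = 0, and z = −1. The sketch below uses `NormalDist` from Python's standard library; it is one way to check the answers, not the method used in the notes.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Percentile (as a proportion) -> z value with that much area below it.
z_values = {p: standard_normal.inv_cdf(p) for p in (0.84, 0.50, 0.16)}

for p, z in z_values.items():
    print(f"{p:.0%} percentile: z = {z:.2f}")
```

The computed values for the 84th and 16th percentiles are very close to +1 and −1, in line with the rounded areas shown in the figures.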
Example:
Suppose the reaction time of a particular drug, X, has a normal distribution with a mean of 10 min and a
standard deviation of 2 min.
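[Figure: normal curve for the reaction time X. The 50th percentile is at the mean (Xi = 10, z = 0), with area 0.5 to its left and area 0.34 between the mean and the unknown value (Xi = ?, z = ?).]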
Suppose the age in a population has a normal distribution with a mean of 50 years and a standard
deviation of 10 years.
a) Find the 50th percentile of the variable age
b) Find the 65th percentile of the variable age
c) Find the 10th percentile of the variable age
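One way to check answers to the age exercise is to convert each percentile to a z value and then apply X = μ + zσ. The sketch below lets Python's `NormalDist` do both steps. Software uses exact z values, so results may differ slightly in the second decimal place from answers based on a printed Z table.

```python
from statistics import NormalDist

age = NormalDist(mu=50, sigma=10)

# Percentile -> age value with that percentage of the population below it.
percentiles = {p: age.inv_cdf(p / 100) for p in (50, 65, 10)}

for p, value in percentiles.items():
    print(f"{p}th percentile of age: {value:.2f} years")
```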
Tutorial 1.
If a set of scores on an epidemiology examination is approximately normally distributed with a mean of
76 and a standard deviation of 4, find:
a. The 33rd percentile of the exam scores.
Given Information: μ = 76, σ = 4.
So:
The 33rd percentile corresponds to z ≈ −0.44, so the 33rd percentile of the exam scores is approximately:
X = 76 + (−0.44)(4) = 74.24
Thus, the score at the 33rd percentile is approximately 74.24.
b. What percent of the students who take this examination score at most 78? What
percentile is a score of 78?
Solution:
Given Information: μ = 76, σ = 4, X = 78.
c. What percent of the students who take this examination get a score of at least 67?
What percentile is a score of 67?
Results:
Percent of students who score at least 67: 98.78%.
Percentile for a score of 67: 1.22nd percentile.
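The tutorial answers can be checked with a short script. The sketch below uses `NormalDist` from Python's standard library with the given mean of 76 and standard deviation of 4. Because it works with exact z values rather than a rounded Z table, its results may differ slightly in the second decimal place from hand calculations.

```python
from statistics import NormalDist

exam = NormalDist(mu=76, sigma=4)

# a. Score at the 33rd percentile.
p33 = exam.inv_cdf(0.33)

# b. Percent scoring at most 78 (this is also the percentile of a score of 78).
at_most_78 = exam.cdf(78) * 100

# c. Percent scoring at least 67, and the percentile of a score of 67.
percentile_of_67 = exam.cdf(67) * 100
at_least_67 = 100 - percentile_of_67

print(f"a. 33rd percentile: {p33:.2f}")
print(f"b. at most 78: {at_most_78:.2f}%")
print(f"c. at least 67: {at_least_67:.2f}%, percentile of 67: {percentile_of_67:.2f}")
```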