Agricultural Statistics and Biometry (Agr 304) - 2021/2022
LECTURE WEEK 3 (May 22-26) - SECTION 2: Dr. Mrs C. P. Anyanwu
Introduction to Statistical Inference
Measures of Location - Mean, Median, Mode
Dispersion – Range, Standard Deviation, Variance, etc.
Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. In other words, statistical inference is the process through which conclusions about a population are drawn from statistics calculated on a sample of data taken from that population.
What is statistical inference?
The practice of statistics falls broadly into two categories: (1) descriptive and (2) inferential. When we are just describing or exploring the observed sample data, we are doing descriptive statistics. Descriptive statistics is important for:
i. Summarizing data
ii. Making comparisons or determining the relationships between variables
iii. Checking assumptions, e.g. whether there are outliers or whether the data are normally distributed
However, we are often also interested in understanding something that is unobserved in the wider population; this could be the average fruit yield of many varieties of cucumber, the true effect of poultry droppings applied at different rates, or whether a new rate will perform better or worse than the standard treatment. In these situations we have to recognize that we almost always observe only one sample or do one experiment. If we assessed another sample or conducted another experiment, we might observe a different result. This means that there is uncertainty in our result: if we took another sample or did another experiment and based our conclusion solely on the observed sample data, we might even end up drawing a different conclusion.
Statistical inference is applied in many fields, including:
Business Analysis
Artificial Intelligence
Financial Analysis
Fraud Detection
Machine Learning
Share Market
Pharmaceutical Sector
Hypothesis testing and confidence intervals are the main applications of statistical inference. Statistical inference helps to assess the relationship between dependent and independent variables. Its purpose is to estimate the uncertainty or sample-to-sample variation. It allows us to provide a probable range of values for the true value of something in the population.
The components used for making statistical inference are:
Sample Size
Variability in the sample
A point estimate is a statistic that is calculated from the sample data and serves as a best guess of an unknown population parameter. For example, we might be interested in the height of male students taking AGR 304, and we randomly select 5 males per department since it might be difficult for us to measure all the males taking AGR 304. In this example, the population mean is the population parameter and the sample mean is the point estimate, which is our best guess of the population mean. Population parameters are typically unknown because we rarely measure the whole population.
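As a small illustration of a point estimate (a minimal sketch in Python; the heights below are made up, not real AGR 304 data), the mean of a random sample of 5 students serves as the point estimate of the unknown population mean:

import random

# Hypothetical heights in cm (made-up numbers, not real AGR 304 data)
population = [165, 170, 172, 168, 175, 180, 169, 171, 177, 174,
              166, 173, 178, 170, 176, 168, 172, 179, 167, 175]

random.seed(1)                           # make the example reproducible
sample = random.sample(population, 5)    # randomly select 5 students

point_estimate = sum(sample) / len(sample)           # sample mean
population_mean = sum(population) / len(population)  # normally unknown

print("sample:", sample)
print("point estimate (sample mean):", point_estimate)
print("population parameter (population mean):", population_mean)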
Statistical inference commonly makes use of procedures such as:
Pearson Correlation
Bi-variate regression
Multi-variate regression
ANOVA or T-test
These procedures are used to conduct statistical tests to see whether the collected sample properties are sufficiently different from what would be expected under the null hypothesis to justify rejecting the null hypothesis.
Estimating uncertainty:
Almost all of the statistical methods you will come across are based on something called the sampling distribution. This is the theoretical distribution of a sample statistic, such as the sample mean, over infinitely many independent random samples. We most often do one experiment and do not replicate it so many times that we could directly observe the sampling distribution. However, we can estimate what the sampling distribution looks like for our sample statistic or point estimate of interest based on only one sample or one experiment.
We can measure the spread of the sampling distribution (the theoretical distribution of the sample statistic, e.g. the mean, which we do not observe) by calculating its standard deviation, which is called the standard error, just as the spread of a sample distribution (the distribution of the individual observations that we actually observe or measure) is captured by the standard deviation. This is useful because the standard deviation of the sampling distribution captures the error due to sampling; it is thus a measure of the precision of the point estimate or, put another way, a measure of the uncertainty of our estimate. Since we often want to draw conclusions about something in a population based on only one study, understanding how our sample statistic may vary from sample to sample, as captured by the standard error, is also really useful.
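The following sketch (all numbers are hypothetical, generated from a normal distribution) shows that the standard error estimated from a single sample, s/sqrt(n), approximates the standard deviation of the sampling distribution of the mean obtained by repeated sampling:

import random
import statistics

random.seed(2)
# Hypothetical population (all values made up)
population = [random.gauss(50, 10) for _ in range(100000)]

n = 25
one_sample = random.sample(population, n)

# Standard error estimated from a single sample: s / sqrt(n)
se_from_one_sample = statistics.stdev(one_sample) / n ** 0.5

# Empirical check: draw many samples and look at the spread (standard
# deviation) of their means, i.e. the sampling distribution of the mean
many_means = [statistics.mean(random.sample(population, n))
              for _ in range(2000)]
empirical_se = statistics.stdev(many_means)

print("SE estimated from one sample :", round(se_from_one_sample, 3))
print("SD of 2000 sample means      :", round(empirical_se, 3))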
Confidence intervals:
Confidence intervals are computed from a random sample and therefore they are also random.
The long run behavior of a 95% confidence interval is such that we’d expect 95% of the
confidence intervals estimated from repeated independent sampling to contain the true
population parameter. The population parameter (e.g. the population mean) is not random; it is fixed (but unknown), while the point estimate of the parameter (e.g. the sample mean) is random (but observable).
A 95% confidence interval is defined by the mean plus or minus 2 standard errors. If the
estimate is likely to be within two standard errors of the parameter, then the parameter is likely
to be within two standard errors of the estimate. This is the foundation on which the correct
interpretation and understanding of a confidence interval lies.
Therefore it is okay to interpret a 95% confidence interval as "a range of plausible values for our parameter of interest" or "we're 95% confident that the true value lies between these limits". It is not okay to say "there's a 95% probability that the true population value lies between these limits". The true population value is fixed, so it is either in those limits or not in those limits.
Hypothesis tests:
A hypothesis test asks the question, could the difference we observed in our study be due to
chance?
We can never prove a hypothesis, only falsify it, or fail to find evidence against it.
The statistical hypothesis is called the null hypothesis and is typically stated as no effect or no difference; this is often the opposite of the research hypothesis that motivated the study.
You can see a hypothesis test as a way of quantifying the evidence against the null hypothesis.
The evidence against the null hypothesis is estimated based on the sample data and expressed
using a probability (p-value).
A p-value is the probability of getting a result more extreme than was observed if the null
hypothesis is true. All correct interpretations of a p-value concur with this statement.
Therefore, if p = 0.04, it is correct to say "the chance (or probability) of getting a result more extreme than the one we observed is 4% if the null hypothesis is true". It is not correct to say "there's a 4% chance that the null hypothesis is true". The hypothesis is fixed and the data (from the sample) are random, so the hypothesis is either true or it isn't true; it has no probability other than 0 (not true) or 1 (true). As with confidence intervals, understanding this means you have reached a milestone in your understanding of statistical concepts.
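As a sketch of a hypothesis test in practice (the yields and the standard value 4.7 are hypothetical), a one-sample t-test compares a sample mean against a null-hypothesis value and returns a p-value:

from scipy import stats

# Hypothetical plot yields (t/ha) under a new poultry-droppings rate
new_rate = [5.2, 4.8, 5.6, 5.1, 4.9, 5.4, 5.3, 5.0]

# Null hypothesis: the true mean yield equals the standard rate's 4.7
t_stat, p_value = stats.ttest_1samp(new_rate, popmean=4.7)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p means a result this extreme would rarely occur if the null
# hypothesis were true; it is NOT the probability that the null is true.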
All point estimates (statistics calculated from the sample data) are subject to sampling
variation, and all methods of statistical inference seek to quantify this uncertainty in some
way.
The ideas of a confidence interval and a hypothesis test form the basis of quantifying uncertainty. Almost all statistics in the published literature (excluding descriptive statistics) will report a p-value and/or a measure of effect or association with a confidence interval.
Much of the critical appraisal of the methodology of a study can be seen as a special case
of evaluating bias or precision.
Example: A card is drawn at random from a well-shuffled pack 400 times, with the following results: a spade appeared 90 times, a club 100 times, a heart 120 times and a diamond 90 times. Find the probability of getting (1) a diamond card, (2) a black card, and (3) a card other than a spade.
Solution:
Using the relative-frequency (empirical) definition of probability,
Total number of events = 400, i.e. 90 + 100 + 120 + 90 = 400
(1) The probability of getting a diamond card:
Number of trials in which a diamond card was drawn = 90
Therefore, P(diamond card) = 90/400 = 0.225
(2) The probability of getting a black card:
Number of trials in which a black card showed up = 90 + 100 = 190
Therefore, P(black card) = 190/400 = 0.475
(3) The probability of getting a card other than a spade:
Number of trials in which a card other than a spade showed up = 90 + 100 + 120 = 310
Therefore, P(not a spade) = 310/400 = 0.775
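The same relative-frequency calculation can be sketched in a few lines of code:

counts = {"spade": 90, "club": 100, "heart": 120, "diamond": 90}
total = sum(counts.values())                           # 400 trials

p_diamond = counts["diamond"] / total                  # 90/400
p_black = (counts["spade"] + counts["club"]) / total   # 190/400
p_not_spade = (total - counts["spade"]) / total        # 310/400

print(p_diamond, p_black, p_not_spade)                 # 0.225 0.475 0.775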
Measures of Location
Often it is not possible to list all the data or draw a histogram; it would be nice to have one
number which best represents a data set. Measures of location describe the central tendency of
the data, thus summarizing quantitative data (a list of numbers) by a typical value. Summarizing data can help us understand them, especially when the number of observations is large. The three most
common measures of location are the mean, the median, and the mode. Analysis of data
distribution determines whether the data have a strong or a weak central tendency based on
their dispersion. When the data distribution is symmetrical and the mean = median = mode, the
data are said to have a normal distribution.
1. Mean or Average
The arithmetic mean (arithmetic average) is computed by adding up the scores in the distribution
and dividing this sum by the sample size. In plain words, and using the summation operator, the
arithmetic mean of the $x_i$ is calculated as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
While the underlying calculations are the same, we do draw a distinction between the population
mean and the sample mean, using different symbols when calculating each. The formulas are
given below in (2.1) and (2.2), respectively:
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$   (2.1)
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$   (2.2)
2. Median
The median is defined as the middle point of the ordered data. It is estimated by first ordering the
data from smallest to largest, and then counting upwards for half the observations. The estimate
of the median is either the observation at the centre of the ordering in the case of an odd number
of observations, or the simple average of the middle two observations if the total number of
observations is even. More specifically, if there are an odd number of observations, it is the
[(n+1)/2]th observation, and if there are an even number of observations, it is the average of the
[n/2]th and the [(n/2)+1]th observations.
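A short sketch of both definitions (the data are made up), computing the mean and applying the odd/even rule for the median:

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    ordered = sorted(xs)           # order from smallest to largest
    n = len(ordered)
    if n % 2 == 1:                 # odd n: the [(n+1)/2]th observation
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the [n/2]th and [(n/2)+1]th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data = [2.1, 3.4, 1.8, 2.9, 3.0, 2.5, 4.2]   # hypothetical values
print(mean(data), median(data))              # 2.842857... and 2.9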
A natural question is why we have more than one measure of the typical value. The
following example helps to explain why these alternative definitions are useful and
necessary. This is usually evident from a histogram of the data.
[Figure: histograms of 10,000 random numbers generated from a normal, an exponential, a Cauchy, and a lognormal distribution.]
The first histogram is a sample from a normal distribution. The mean is 0.005, the median is -0.010, and the mode is -0.144 (the mode is computed as the midpoint of the histogram interval with the highest peak).
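That mode-from-a-histogram rule can be sketched as follows (the 50-bin choice is an assumption; the original plot's binning is not stated):

import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(0, 1, 10000)     # 10,000 normal random numbers

counts, edges = np.histogram(data, bins=50)
peak = counts.argmax()                        # interval with highest peak
mode = (edges[peak] + edges[peak + 1]) / 2    # midpoint of that interval

print(f"mean = {data.mean():.3f}, "
      f"median = {np.median(data):.3f}, mode = {mode:.3f}")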
The normal distribution is a symmetric distribution with well-behaved tails and a single peak
at the center of the distribution. By symmetric, we mean that the distribution can be folded
about an axis so that the 2 sides coincide. That is, it behaves the same to the left and right of
some center point. For a normal distribution, the mean, median, and mode are actually
equivalent. The histogram above generates similar estimates for the mean, median, and mode.
Therefore, if a histogram or normal probability plot indicates that your data are approximated
well by a normal distribution, then it is reasonable to use the mean as the location estimator.
4. Measures of Dispersion or Variability
The most common measurements of central tendency are the mean, median, and mode.
Identifying the central value allows other values to be compared to it, showing the spread or
cluster of the sample, which is known as the dispersion or distribution. These measurements of
dispersion are categorized into two groups: measures of dispersion based on percentiles and measures
of dispersion based on the mean (standard deviations). Measures of dispersion describe the
spread of the data. They include the range, interquartile range, standard deviation and variance.
The expression $\sum_{i=1}^{n} (x_i - \bar{x})^2$ is interpreted as: from each individual observation ($x_i$) subtract the mean ($\bar{x}$), then square this difference. Next add each of the $n$ squared differences. This sum is then divided by $(n-1)$, giving the sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
The variance is expressed in square units, so we take the square root to return to the original units, which gives the standard deviation, $s$. Examining this expression it can be seen that if all the observations were the same (i.e. $x_1 = x_2 = x_3 = \dots = x_n$), then they would equal the mean, and so $s$ would be zero. If the x's were
widely scattered about, then s would be large. In this way, s reflects the variability in the data.
The calculation of the standard deviation is described in Example 3. The standard deviation is vulnerable to outliers, so if the 2.1 were replaced by 21 in Example 3 we would get a very different result.
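A sketch of the variance and standard deviation calculation (Example 3's actual data are not reproduced in this section, so the values below are hypothetical; 2.1 is included so the outlier substitution described in the text can be mimicked):

import statistics

# Hypothetical data (Example 3's actual values are not shown here);
# 2.1 is included so the outlier substitution can be demonstrated
data = [2.1, 3.4, 1.8, 2.9, 3.0, 2.5, 4.2]

s2 = statistics.variance(data)   # sample variance, divisor (n - 1)
s = statistics.stdev(data)       # standard deviation = sqrt(variance)
print(f"s^2 = {s2:.3f}, s = {s:.3f}")

# Replace 2.1 with the outlier 21: s changes dramatically
outlier_data = [21 if x == 2.1 else x for x in data]
print(f"with outlier: s = {statistics.stdev(outlier_data):.3f}")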