Agricultural Statistics and Biometry (Agr 304) - 2021/2022


AGRICULTURAL STATISTICS AND BIOMETRY (AGR 304).

2021/2022
LECTURE WEEK 3 (May 22-26) - SECTION 2: Dr. (Mrs) C. P. Anyanwu
Introduction to Statistical Inference
Measures of Location – Mean, Median, Mode
Dispersion – Range, Standard Deviation, Variance, etc.

 Statistical inference is the process of using data analysis to infer properties of an underlying
probability distribution. Inferential statistical analysis infers properties of a population, for
example by testing hypotheses and deriving estimates.

 Statistical inference is the process through which inferences about a population are made
based on certain statistics calculated from a sample of data drawn from that population.
What is statistical inference?

The practice of statistics falls broadly into two categories: (1) descriptive and (2) inferential.
When we are just describing or exploring the observed sample data, we are doing descriptive
statistics. Descriptive statistics is therefore important for:

i. Summarizing data
ii. Making comparisons or determining the relationships between variables

iii. Checking assumptions, e.g. whether there are outliers or whether the data are properly distributed

iv. Answering research questions and objectives


Examples:
- Finding the mean weight of a particular population
- Gathering demographic data
- All other descriptive statistical techniques

However, we are often also interested in understanding something that is unobserved in the wider
population; this could be the average fruit yield of many varieties of cucumber, or the true effect of
poultry droppings applied at different rates, or whether a new rate will perform better or worse than
the standard treatment. In these situations we have to recognize that we almost always observe only
one sample or do one experiment. If we assessed another sample or conducted another experiment, we
might observe a different result. This means that there is uncertainty in our result: if we took another
sample or did another experiment and based our conclusion solely on the observed sample data, we
might even end up drawing a different conclusion.

 The purpose of statistical inference is to estimate this sample-to-sample variation or
uncertainty. Understanding how much our results may differ if we did the study again, or
how uncertain our findings are, allows us to take this uncertainty into account when
drawing conclusions.
 It allows us to provide a plausible range of values for the true value of something in the
population, such as the mean or the size of an effect, and it allows us to make statements about
whether our study provides evidence to reject a hypothesis.

Importance of Statistical Inference


Inferential statistics is important for examining data properly. To reach accurate conclusions, proper
data analysis is needed to interpret the research results. It is widely used to make predictions about
future observations in different fields, and it helps us to draw inferences about the data. Statistical
inference has a wide range of applications in fields such as:

 Business Analysis
 Artificial Intelligence

 Financial Analysis

 Fraud Detection

 Machine Learning

 Share Market

 Pharmaceutical Sector

Applications of Statistical Inference:

Hypothesis testing and confidence intervals are the main applications of statistical inference.

 It helps to assess the relationship between the dependent and independent variables.
 The purpose of statistical inference is to estimate the uncertainty or sample-to-sample
variation.
 It allows us to provide a probable range of values for the true value of something in the
population.
The components used for making statistical inference are:

 Sample Size
 Variability in the sample

 Size of the observed differences

Two key terms are point estimates and population parameters.

A point estimate is a statistic that is calculated from the sample data and serves as a best guess of an
unknown population parameter. For example, we might be interested in the height of male students
taking AGR 304, and we randomly select 5 males per department, since it might be difficult for us to
measure all the males taking AGR 304. In this example, the population mean is the population
parameter and the sample mean is the point estimate, which is our best guess of the population mean.
Population parameters are typically unknown because we rarely measure the whole population.

Types of Statistical Inference


There are different types of statistical inference that are extensively used for drawing conclusions.
They are:

 One-sample hypothesis testing

 Confidence Interval

 Pearson Correlation

 Bivariate regression

 Multivariate regression

 Chi-square statistics and contingency table

 ANOVA or t-test

Statistical Inference Procedure


The steps involved in inferential statistics are:

 Begin with a theory


 Create a research hypothesis

 Operationalize the variables

 Recognize the population to which the study results should apply

 Formulate a null hypothesis for this population

 Accumulate a sample from the population and continue the study

 Conduct statistical tests to see if the collected sample properties are sufficiently different from
what would be expected under the null hypothesis to justify rejecting it (a worked sketch follows this list)

Estimating uncertainty:

 Almost all of the statistical methods you will come across are based on something called the
sampling distribution.

It is the theoretical distribution of a sample statistic, such as the sample mean, over infinitely
many independent random samples. We most often do one experiment and don't replicate it so
many times that we could directly observe the sampling distribution. However, we can estimate
what the sampling distribution looks like for our sample statistic, or point estimate of interest,
based on only one sample or one experiment.

We can quantify the spread of the sampling distribution (the theoretical distribution of the
sample statistic, e.g. the mean, that we don't observe) by calculating its standard deviation,
which is called the standard error, just as the spread of the sample distribution (the
distribution of the individual observations that we observe or measure) is captured by the
standard deviation.

This is useful because the standard deviation of the sampling distribution captures the error due
to sampling; it is thus a measure of the precision of the point estimate or, put another way, a
measure of the uncertainty of our estimate. Since we often want to draw conclusions about
something in a population based on only one study, understanding how our sample statistics
may vary from sample to sample, as captured by the standard error, is also really useful.

Confidence intervals:

 Confidence intervals are computed from a random sample and are therefore also random.
The long-run behavior of a 95% confidence interval is such that we'd expect 95% of the
confidence intervals estimated from repeated independent sampling to contain the true
population parameter. The population parameter (e.g. the population mean) is not random; it is
fixed (but unknown), while the point estimate of the parameter (e.g. the sample mean) is
random (but observable).
 A 95% confidence interval is defined by the mean plus or minus approximately 2 standard
errors. If the estimate is likely to be within two standard errors of the parameter, then the
parameter is likely to be within two standard errors of the estimate. This is the foundation on
which the correct interpretation and understanding of a confidence interval lies (a worked
sketch follows this list).

 Therefore it is okay to interpret a 95% confidence interval as "a range of plausible values for our
parameter of interest" or "we're 95% confident that the true value lies between these limits". It is
not okay to say "there's a 95% probability that the true population value lies between these
limits". The true population value is fixed, so it is either in those limits or not in those limits.

Hypothesis tests:

 A hypothesis test asks the question, could the difference we observed in our study be due to
chance?

 We can never prove a hypothesis, only falsify it, or fail to find evidence against it.

 The statistical hypothesis is called the null hypothesis and is typically stated as no effect or no
difference; this is often the opposite of the research hypothesis that motivated the study.
 You can see a hypothesis test as a way of quantifying the evidence against the null hypothesis.
The evidence against the null hypothesis is estimated based on the sample data and expressed
using a probability (p-value).

 A p-value is the probability of getting a result at least as extreme as the one observed if the null
hypothesis is true. All correct interpretations of a p-value concur with this statement.

 Therefore, if p = 0.04, it is correct to say "the chance (or probability) of getting a result more
extreme than the one we observed is 4% if the null hypothesis is true". It is not correct to say
"there's a 4% chance that the null hypothesis is true". The hypothesis is fixed and the data (from
the sample) are random, so the hypothesis is either true or it isn't; it has no probability other
than 0 (not true) or 1 (true). As with confidence intervals, understanding this means you have
reached a milestone in your understanding of statistical concepts (a simulation sketch follows
this list).

 Statistical significance is not the same as practical (or clinical) significance.

Connections with other material

 All point estimates (statistics calculated from the sample data) are subject to sampling
variation, and all methods of statistical inference seek to quantify this uncertainty in some
way.
 The ideas of a confidence interval and a hypothesis test form the basis of quantifying
uncertainty. Almost all statistics in the published literature (excluding descriptive statistics)
will report a p-value and/or a measure of effect or association with a confidence interval.

 The probability distribution of a statistic is actually the sampling distribution.

 Much of the critical appraisal of the methodology of a study can be seen as a special case
of evaluating bias or precision.

Statistical Inference Examples


An example of statistical inference is given below.
Question: From a shuffled pack of cards, a card is drawn. This trial is repeated 400 times, and the
suits obtained are given below:

Suit                  Spade   Clubs   Hearts   Diamonds
No. of times drawn       90     100      120         90


When a card is drawn at random, what is the probability of getting
1. A diamond card
2. A black card
3. Any card except a spade

Solution:
Total number of trials = 400,
i.e. 90 + 100 + 120 + 90 = 400
(1) The probability of getting a diamond card:
Number of trials in which a diamond card was drawn = 90
Therefore, P(diamond card) = 90/400 = 0.225
(2) The probability of getting a black card:
Number of trials in which a black card showed up = 90 + 100 = 190
Therefore, P(black card) = 190/400 = 0.475
(3) The probability of getting any card except a spade:
Number of trials in which a card other than a spade showed up = 100 + 120 + 90 = 310
Therefore, P(not a spade) = 310/400 = 0.775
Measures of Location

Often it is not possible to list all the data or draw a histogram; it would be nice to have one
number that best represents a data set. Measures of location describe the central tendency of
the data, summarizing quantitative data (a list of numbers) by a typical value. Summarizing
data can help us understand them, especially when the number of observations is large. The
three most common measures of location are the mean, the median, and the mode. Analysis of
the data distribution determines whether the data have a strong or a weak central tendency
based on their dispersion. When the data distribution is symmetrical and the mean = median =
mode, the data are said to have a normal distribution.

1. Mean or Average
The arithmetic mean (arithmetic average) is computed by adding up the scores in the distribution
and dividing this sum by the sample size. In plain words, using the summation operator, the
arithmetic mean of the xᵢ is calculated as

x̄ = (∑xᵢ)/n = (x₁ + x₂ + x₃ + … + xₙ)/n

While the underlying calculations are the same, we do draw a distinction between the population
mean and the sample mean, using different symbols when calculating each. The formulas are
given below in (2.1) and (2.2), respectively:

Population Mean: μ = (∑xᵢ)/N    (2.1)

Sample Mean: x̄ = (∑xᵢ)/n    (2.2)

where N is the population size and n is the sample size.
2. Median
The median is defined as the middle point of the ordered data. It is estimated by first ordering the
data from smallest to largest, and then counting upwards for half the observations. The estimate
of the median is either the observation at the centre of the ordering in the case of an odd number
of observations, or the simple average of the middle two observations if the total number of
observations is even. More specifically, if there are an odd number of observations, it is the
[(n+1)/2]th observation, and if there are an even number of observations, it is the average of the
[n/2]th and the [(n/2)+1]th observations.

Example 1 Calculation of mean and median


Consider the following 5 birth weights, in kilograms, recorded to 1 decimal place:
1.2, 1.3, 1.4, 1.5, 2.1
The mean is defined as the sum of the observations divided by the number of observations. Thus
mean = (1.2+1.3+…+2.1)/5 = 1.50kg. It is usual to quote 1 more decimal place for the mean than
the data recorded.
There are 5 observations, which is an odd number, so the median value is the (5+1)/2 = 3rd
observation, which is 1.4kg. Remember that if the number of observations was even, then the
median is defined as the average of the [n/2]th and the [(n/2)+1]th. Thus, if we had observed an
additional value of 3.5kg in the birth weights sample, the median would be the average of the 3rd
and the 4th observation in the ranking, namely the average of 1.4 and 1.5, which is 1.45kg.

Advantages and disadvantages of the mean and median


The major advantage of the mean is that it uses all the data values, and is, in a statistical sense,
efficient.
The main disadvantage of the mean is that it is vulnerable to outliers. Outliers are single
observations which, if excluded from the calculations, have a noticeable influence on the results.
For example, if we had entered '21' instead of '2.1' in the calculation of the mean in Example 1,
we would find the mean changed from 1.50 kg to 5.28 kg. It does not necessarily follow, however,
that outliers should be excluded from the final data summary, or that they always result from an
erroneous measurement.
The median has the advantage that it is not affected by outliers, so for example the median in the
example would be unaffected by replacing '2.1' with '21'. However, it is not statistically efficient,
as it does not make use of all the individual data values.
3. Mode
A third measure of location is the mode. This is the value that occurs most frequently or, if the
data are grouped, the grouping with the highest frequency. It is not used much in statistical
analysis, since its value depends on the accuracy with which the data are measured, although it
may be useful for categorical data to describe the most frequent category. The expression
'bimodal' distribution is used to describe a distribution with two peaks in it. This can be caused
by mixing populations. For example, height might appear bimodal if one had men and women in
the population. Some illnesses may raise a biochemical measure, so in a population containing
healthy and ill people one might expect a bimodal distribution. However, some illnesses are
defined by the measure (e.g. obesity or high blood pressure) and in this case the distributions are
usually unimodal.

A natural question is why we have more than one measure of the typical value. The
following example helps to explain why these alternative definitions are useful and
necessary; the answer is usually evident from a histogram of the data.

Consider histograms of 10,000 random numbers generated from a normal, an
exponential, a Cauchy, and a lognormal distribution.

For the sample from the normal distribution, the mean is 0.005, the median is
-0.010, and the mode is -0.144 (the mode is computed as the midpoint of the histogram
interval with the highest peak).
The normal distribution is a symmetric distribution with well-behaved tails and a single peak
at the center of the distribution. By symmetric, we mean that the distribution can be folded
about an axis so that the two sides coincide; that is, it behaves the same to the left and right of
some center point. For a normal distribution, the mean, median, and mode are actually
equivalent, and the sample above gives similar estimates for all three. Therefore, if a histogram
or normal probability plot indicates that your data are approximated well by a normal
distribution, then it is reasonable to use the mean as the location estimator.
4. Measures of Dispersion or Variability
The most common measures of central tendency are the mean, median, and mode.
Identifying the central value allows other values to be compared to it, showing the spread or
cluster of the sample, which is known as the dispersion or distribution. Measures of
dispersion fall into two groups: measures based on percentiles and measures based on the
mean (standard deviations). Measures of dispersion describe the spread of the data. They
include the range, interquartile range, standard deviation and variance.

5. Range


The range is given as the smallest and largest observations. This is the simplest measure of
variability. Note that in statistics (unlike physics) a range is given by two numbers, not the
difference between the smallest and largest. For some data it is very useful, because one would
want to know these numbers, for example knowing the ages of the youngest and oldest
participants in a sample. If outliers are present it may give a distorted impression of the
variability of the data, since only two observations are included in the estimate.

6. Quartiles and Interquartile Range


The quartiles, namely the lower quartile, the median and the upper quartile, divide the data into
four equal parts; that is there will be approximately equal numbers of observations in the four
sections (and exactly equal if the sample size is divisible by four and the measures are all
distinct). Note that there are in fact only three quartiles and these are points, not proportions. It
is a common misuse of language to refer to being 'in the top quartile'. Instead one should refer
to being 'in the top quarter' or 'above the top quartile'. However, the meaning of the first
statement is clear, and so the distinction is really only useful to display a superior knowledge of
statistics!
The quartiles are calculated in a similar way to the median; first arrange the data in size order and
determine the median, using the method described above. Now split the data in two (the lower
half and upper half, based on the median). The first quartile is the middle observation of the
lower half, and the third quartile is the middle observation of the upper half. This process is
demonstrated in Example 2, below.
The interquartile range is a useful measure of variability and is given by the lower and upper
quartiles. The interquartile range is not vulnerable to outliers and, whatever the distribution of the
data, we know that 50% of observations lie within the interquartile range.

Example 2 Calculation of the quartiles


Suppose we had 18 birth weights arranged in increasing order:
1.51, 1.53, 1.55, 1.55, 1.79, 1.81, 2.10, 2.15, 2.18,
2.22, 2.35, 2.37, 2.40, 2.40, 2.45, 2.78, 2.81, 2.85
The median is the average of the 9th and 10th observations (2.18+2.22)/2 = 2.20 kg. The first half
of the data has 9 observations so the first quartile is the 5th observation, namely 1.79kg. Similarly
the 3rd quartile would be the 5th observation in the upper half of the data, or the 14th
observation, namely 2.40 kg. Hence the interquartile range is 1.79 to 2.40 kg.

7. Standard Deviation and Variance


The standard deviation of a sample (s) is calculated as follows:

s = √( ∑(xᵢ − x̄)² / (n − 1) )

The expression ∑(xᵢ − x̄)² is interpreted as: from each individual observation (xᵢ) subtract the
mean (x̄), then square this difference. Next add each of the n squared differences. This sum is
then divided by (n − 1). This expression is known as the sample variance (s²). The variance is
expressed in squared units, so we take the square root to return to the original units, which gives
the standard deviation, s. Examining this expression, it can be seen that if all the observations
were the same (i.e. x₁ = x₂ = x₃ = … = xₙ), then they would equal the mean, and so s would be
zero. If the x's were widely scattered about, then s would be large. In this way, s reflects the
variability in the data. The calculation of the standard deviation is described in Example 3. The
standard deviation is vulnerable to outliers, so if the 2.1 were replaced by 21 in Example 3 we
would get a very different result.

Example 3 Calculation of the standard deviation


Consider the data from Example 1. The calculations required to determine the sum of the squared
differences from the mean are given in Table 1, below. We found the mean to be 1.5 kg. We
subtract this from each of the observations. Note the mean of this column is zero. This will
always be the case: the positive deviations from the mean cancel the negative ones. A convenient
method for removing the negative signs is squaring the deviations, which is given in the next
column. These values are then summed to get a value of 0.50 kg². We need to find the average
squared deviation. Common sense would suggest dividing by n, but it turns out that this actually
gives an estimate of the population variance that is too small. This is because we are using the
estimated mean in the calculation when we should really be using the true population mean. It
can be shown that it is better to divide by the degrees of freedom, which is n minus the number of
estimated parameters, in this case n − 1. An intuitive way of looking at this is to suppose one
had n telegraph poles each 100 meters apart. How much wire would one need to link them? As
with variation, here we are not interested in where the telegraph poles are, but simply how far
apart they are. A moment's thought should convince one that n − 1 lengths of wire are required to
link n telegraph poles.

Table 1 Calculation of the mean squared deviation

Weight (kg)    Deviation from the mean (kg)    Squared deviation (kg²)
1.2                        -0.3                      0.09
1.3                        -0.2                      0.04
1.4                        -0.1                      0.01
1.5                         0.0                      0.00
2.1                         0.6                      0.36
Total                       0.0                      0.50


From the results calculated thus far, we can determine the variance and standard deviation, as
follows:
n=5
Variance = 0.50/(5-1) = 0.125 kg2
Standard deviation = √(0.125) = 0.35 kg
Why is the standard deviation useful?
It turns out in many situations that about 95% of observations will be within two standard
deviations of the mean; this is known as a reference interval. It is this characteristic of the
standard deviation which makes it so useful. It holds for a large number of measurements
commonly made in medicine. In particular, it holds for data that follow a Normal distribution.
Standard deviations should not be used for highly skewed data, such as counts or bounded
data, since they do not provide a meaningful measure of variation; an IQR or range should be
used instead. In particular, if the standard deviation is of a similar size to the mean, then the
SD is not an informative summary measure, save to indicate that the data are skewed.
Standard deviation is often abbreviated to SD in the medical literature.
