Statistical Method
Introduction
Origin and Development of Statistics: The word ‘Statistics’ seems to have been derived from the Latin word ‘Status’, the Italian word ‘Statista’ or the German word ‘Statistik’, each of which means a “political state”. In ancient times, the government used it to collect information regarding the population and the “property or wealth” of the country.
Sir Ronald A. Fisher (1890-1962), known as the father of statistics, applied statistics to various fields such as Genetics, Biometry, Education and Agriculture.
Definition of statistics: “Statistics are the aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner, for a predetermined purpose and placed in relation to each other.” — Prof. Horace Secrist.
When the word is used in the plural, it means quantitative data.
When it is used in the singular, it is defined as the “science which deals with collection, presentation, analysis and interpretation of numerical data.” — Croxton and Cowden.
Limitations of statistics:
Statistics is not suited to the study of qualitative phenomenon.
Statistics does not study individuals.
Statistical laws are not exact.
Statistics is liable to be misused.
Frequency distribution
It is an arrangement of variate values along with their respective frequencies. While forming a frequency distribution, the following points should be kept in mind:
(i) The classes should be clearly defined and free from ambiguity.
(ii) The classes should be exhaustive, i.e. each of the given value should be included in one
of the class.
(iii) The classes should be mutually exclusive and non-overlapping.
(iv) The classes should be of equal width.
(v) Indeterminate classes, open end classes: less than or greater than should be avoided as
far as possible.
(vi) The number of classes should neither be too large nor too small. It should preferably lie between 5 and 15. Sturges gave the following formula for determining the approximate number of classes: K = 1 + 3.322 log10 N, where N is the total frequency.
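As an illustrative sketch (not part of the original notes), Sturges' rule can be applied directly in Python; the number of observations used below is hypothetical:

import math

def sturges_classes(n_observations):
    # Approximate number of classes: K = 1 + 3.322 * log10(N)
    return round(1 + 3.322 * math.log10(n_observations))

print(sturges_classes(100))   # 1 + 3.322*2 = 7.644, i.e. about 8 classes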
Graphical Representation
In a graphical representation the data are shown by points plotted on graph paper, which makes unwieldy data intelligible and conveys to the eye the general run of the observations. Graphical representation also facilitates the comparison of two or more frequency distributions.
Some important types of graphical representation are:
(i) Histogram
(ii) Frequency Polygon
(iii) Frequency curve
Histogram: If the frequency distribution is not continuous, it is first converted into a continuous distribution by subtracting 0.5 from the lower limit and adding 0.5 to the upper limit of each class. In drawing the histogram of a continuous frequency distribution we first mark off the class intervals on the x-axis and the corresponding frequencies on the y-axis, on a suitable scale. On each class interval we erect a rectangle with height proportional to the frequency of that class, so that the area of the rectangle is proportional to the frequency of the class. If, however, the classes are of unequal width, the heights of the rectangles are made proportional to the ratio of the frequency to the width of the class. The diagram of contiguous rectangles so obtained is called a histogram.
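A minimal sketch of drawing a histogram in Python is given below; matplotlib is not mentioned in the notes, and the observations and class boundaries are hypothetical:

import matplotlib.pyplot as plt

# hypothetical observations and equal-width class boundaries
data = [12, 15, 21, 22, 25, 28, 31, 33, 34, 38, 41, 45, 47, 52, 55, 58, 61, 67, 72, 78]
class_boundaries = [10, 20, 30, 40, 50, 60, 70, 80]

# with equal-width classes the height (and hence the area) of each bar is proportional to the class frequency
plt.hist(data, bins=class_boundaries, edgecolor="black")
plt.xlabel("Class interval")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()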
Median
Step-I: In case of ungrouped data, if the number of observations is odd, the median is the middle value after the values have been arranged in ascending or descending order of magnitude.
Step-II: In case of even number of observations there are two middle terms and median
is obtained by taking the arithmetic mean of these middle terms after arranging the series
in ascending or descending order.
Step-III: In case of a discrete frequency distribution, the median is obtained by locating the value of the variable corresponding to the cumulative frequency just greater than N/2, where N is the total frequency.
Mode: This is that value of the variable which occurs most frequently or whose frequency is
maximum.
In case of a continuous distribution, mode is given by:
Mode = l + h (fm − f1) / (2fm − f1 − f2)
where l = lower limit of the modal class, fm = frequency of the modal class, f1 and f2 are the frequencies of the classes preceding and following the modal class respectively, and
h = magnitude (width) of the modal class.
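The grouped-data mode formula above can be illustrated with the following Python sketch; the class limits and frequencies are hypothetical:

# Mode = l + h*(fm - f1) / (2*fm - f1 - f2)
lower_limits = [0, 10, 20, 30, 40]      # lower class limits, class width h = 10
freqs        = [5, 8, 15, 10, 4]        # class frequencies
h = 10

m  = freqs.index(max(freqs))            # index of the modal class
l, fm = lower_limits[m], freqs[m]
f1 = freqs[m - 1] if m > 0 else 0       # frequency of the preceding class
f2 = freqs[m + 1] if m < len(freqs) - 1 else 0   # frequency of the following class

mode = l + h * (fm - f1) / (2 * fm - f1 - f2)
print(mode)                             # 20 + 10*(15 - 8)/(30 - 8 - 10) ≈ 25.83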
Merits:
1. It is readily comprehensible and easy to calculate.
2. It is not at all affected by the extreme values
3. It can be obtained simply by inspection.
4. It can be computed in case of open end class.
Demerits:
1. It is not rigidly defined. A distribution with two modes is called bi-modal and the
distribution with more than two modes is called multi-modal.
2. It is not suitable for further mathematical treatment.
3. It is not based on all the observations.
4. It is affected to a great extent by fluctuation of sampling.
Uses: Mode is the average to be used for finding the ideal size, e.g., in business forecasting, in the manufacture of ready-made garments, shoe sizes, etc.
For a symmetrical distribution, mean, median and mode coincide. If the distribution is moderately asymmetrical, the mean, median and mode obey the following empirical relation:
Mode = 3 Median − 2 Mean
DISPERSION
“Dispersion is the measure of the extent to which individual items vary.” — L.R. Connor.
Consider the series (i) 7, 8, 9, 10, 11 (ii) 3, 6, 9, 12, 15 (iii) 1, 5, 9, 13, 17
In all these cases we see that the number of observations is 5 and the mean is 9. We cannot tell whether the mean belongs to the 1st series, the 2nd series, the 3rd series or any other series of 5 observations whose sum is 45. Thus we see that measures of central tendency are inadequate to give a complete idea of the distribution. They must be supported and supplemented by some other measures. One such measure is dispersion.
Shortcut method
Standard deviation, σ = √[(1/N) Σ fi di² − ((1/N) Σ fi di)²], where di = xi − A
Step-deviation method: σ = h √[(1/N) Σ fi di′² − ((1/N) Σ fi di′)²], where di′ = (xi − A)/h
where A = arbitrary value (assumed mean)
h = class interval
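A small Python sketch of the step-deviation calculation (the mid-values, frequencies, A and h below are hypothetical):

mid_values = [5, 15, 25, 35, 45]
freqs      = [3, 7, 12, 6, 2]
A, h = 25, 10                                   # arbitrary origin and class interval

N  = sum(freqs)
d  = [(x - A) / h for x in mid_values]          # step deviations d' = (x - A)/h
sum_fd  = sum(f * di for f, di in zip(freqs, d))
sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))

sigma = h * ((sum_fd2 / N) - (sum_fd / N) ** 2) ** 0.5
print(sigma)                                    # standard deviation by the shortcut method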
Moments.
The rth moment of a variable X about the point x = A, usually denoted by µr′, is given by:
µr′ = (1/N) Σ fi (xi − A)^r, where N = Σ fi
= h^r (1/N) Σ fi di^r, where di = (xi − A)/h
The rth moment of a variable X about the mean x̄, usually denoted by µr, is given by:
µr = (1/N) Σ fi (xi − x̄)^r
In particular, µ0 = (1/N) Σ fi (xi − x̄)^0 = (1/N) Σ fi = 1
Also, µ2 = (1/N) Σ fi (xi − x̄)² = σ²
i.e. µ0 = 1, µ1 = 0 and µ2 = σ²
β1 = µ3²/µ2³, β2 = µ4/µ2²
Kurtosis enables us to have an idea about the ‘flatness or peakedness’ of the frequency curve. It is measured by the coefficient β2 or its derivate γ2, given by:
β2 = µ4/µ2², γ2 = β2 − 3
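The central moments and the coefficients β1, β2 and γ2 can be computed with the following Python sketch; the data set is hypothetical and all frequencies are taken as 1:

x = [2, 4, 4, 5, 7, 9, 10, 12, 15, 22]
n = len(x)
mean = sum(x) / n

def moment(r):
    # r-th central moment: mu_r = (1/n) * sum (x_i - mean)^r
    return sum((xi - mean) ** r for xi in x) / n

mu2, mu3, mu4 = moment(2), moment(3), moment(4)
beta1  = mu3 ** 2 / mu2 ** 3      # skewness coefficient
beta2  = mu4 / mu2 ** 2           # kurtosis coefficient
gamma2 = beta2 - 3                # zero for the normal curve
print(beta1, beta2, gamma2)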
Theory of Probability
Deterministic experiments: the result can be predicted with certainty, e.g., the volume V of a perfect gas under given conditions. Random experiments: the result cannot be predicted with certainty, e.g., in tossing a coin one may not be sure whether a head or a tail will turn up.
Mathematical (classical) probability: P(E) = p = (number of favourable cases)/(exhaustive number of cases) = m/n
Statistical (empirical) probability: p = P(E) = lim (m/n) as n → ∞
Definitions of various terms:
Trial and Event: Consider an experiment which, though repeated under essentially homogeneous and identical conditions, does not give unique results but may result in any one of several possible outcomes. The experiment is known as a trial and the outcomes are known as events or cases. For example, throwing a die is a trial and getting 1 (or 2, …, or 6) is an event; tossing a coin is a trial and getting a head or a tail is an event.
Exhaustive Events:
(i) The total number of possible outcomes in any trial is known as exhaustive events or exhaustive cases. In tossing a coin there are two exhaustive cases, head and tail (the possibility of the coin standing on its edge being ignored).
(ii) In throwing of a die, there are 6 exhaustive cases since any one of six faces 1, 2, …,6
may come uppermost.
(iii) In drawing two cards from a pack of 52 cards, the exhaustive number of cases is 52C2.
(iv) In throwing of two dice, the exhaustive number of cases is 6² = 36.
(v) In general, in throwing of n dice, the exhaustive number of cases is 6^n.
Favourable Events:
The number of cases favourable to an event in a trial is the number of outcomes which entail the happening of the event. For example:
i. In drawing a card from a pack of 52 cards the number of cases favourable to drawing
of an ace is 4, for drawing a spade is 13 and for drawing a red card is 26.
Mutually Exclusive Events: Events are said to be mutually exclusive or incompatible if the
happening of any one of them precludes the happening of all the others i.e., if no two or more
of them can happen simultaneously in the same trial. For example,
1. In throwing a die all the 6 faces numbered 1 to 6 are mutually exclusive since if
any one of these faces comes, the possibility of others, in the same trial, is ruled
out.
2. Similarly in tossing a coin the events head and tail are mutually exclusive.
Equally Likely Events: Outcomes of a trial are said to be equally likely if, taking into consideration all the relevant evidence, there is no reason to expect one in preference to the others. For example, in a random toss of an unbiased or uniform coin, head and tail are equally likely events.
Independent Events: Several events are said to be independent if the happening (or non-happening) of an event is not affected by the supplementary knowledge concerning the occurrence of any number of the remaining events. For example, in tossing an unbiased coin, the event of getting a head in the first toss is independent of getting a head in the second, third or any subsequent toss.
Addition law of probability:
If A and B are any two events (subsets of sample space S) and are not disjoint, then
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Multiplication law of probability:
For two events A and B,
P (A ∩ B) = P (A). P (B | A), P (A) > 0
= P (B). P (A | B), P (B) > 0
where P (B | A) represents the conditional probability of occurrence of event B when event A has already happened and P (A | B) is the conditional probability of happening of event A, given that B has already happened.
P (B | A) = P (A ∩ B) / P (A), P (A | B) = P (A ∩ B) / P (B)
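Both laws can be checked numerically; the following Python sketch uses a single throw of a fair die with two hypothetical events:

from fractions import Fraction

S = set(range(1, 7))                     # sample space of one throw of a die
A = {2, 4, 6}                            # event: an even number turns up
B = {4, 5, 6}                            # event: a number greater than 3 turns up
P = lambda E: Fraction(len(E), len(S))   # classical probability: favourable/exhaustive

# Addition law: P(A U B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Multiplication law: P(A n B) = P(A) * P(B | A)
P_B_given_A = Fraction(len(A & B), len(A))
assert P(A & B) == P(A) * P_B_given_A
print(P(A | B), P(A & B), P_B_given_A)   # 2/3, 1/3, 2/3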
NORMAL DISTRIBUTION
If X ~ N(µ, σ²), then Z = (X − µ)/σ is a standard normal variate with E(Z) = 0 and Var(Z) = 1, and we write Z ~ N(0, 1).
BINOMIAL DISTRIBUTION
Binomial distribution was discovered by James Bernoulli (1654-1705) in the year 1700 and
was first published posthumously in 1713.
Definition:
A random variable X is said to follow binomial distribution if it assumes only non-
negative values and its probability mass function is given by:
P(X = x) = p(x) = nCx p^x q^(n−x); x = 0, 1, 2, …, n; q = 1 − p
= 0, otherwise
The two independent constants n and p in the distribution are known as the parameters of the distribution; ‘n’ is also, sometimes, known as the degree of the binomial distribution.
Binomial distribution is a discrete distribution, as X can take only the integral values 0, 1, 2, …, n. Any random variable which follows the binomial distribution is known as a binomial variate.
We shall use the notation X ~ B(n, p) to denote that the random variable X follows the binomial distribution with parameters n and p.
Mean of binomial distribution = np
Variance of binomial distribution = npq
Physical conditions for Binomial Distribution. We get the binomial distribution under the following experimental conditions:
(i) Each trial results in two exhaustive and mutually disjoint outcomes, termed success and failure.
(ii) The number of trials ‘n’ is finite.
(iii) The trials are independent of each other.
(iv) The probability of success ‘p’ is constant for each trial.
The trials satisfying the conditions (i), (iii) and (iv) are also called Bernoulli trials.
The problems relating to tossing of a coin or throwing of dice or drawing cards from a pack
of cards with replacement lead to binomial probability distribution.
Binomial distribution is important not only because of its wide applicability, but because it
gives rise to many other probability distributions.
Example: Ten coins are thrown simultaneously. Find the probability of getting at least seven heads.
Solution: p = probability of getting a head = 1/2, q = 1 − p = 1/2, n = 10.
P(at least 7 heads) = P(X ≥ 7) = (10C7 + 10C8 + 10C9 + 10C10) (1/2)^10 = (120 + 45 + 10 + 1)/1024 = 176/1024 = 11/64.
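The arithmetic of this example can be verified with a short Python sketch (not part of the original notes):

from math import comb

n, p = 10, 0.5
prob = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(7, n + 1))
print(prob)     # 176/1024 = 0.171875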
POISSON DISTRIBUTION
Poisson distribution was discovered by the French mathematician and physicist Simeon
Denis Poisson (1781-1840) who published it in 1837. Poisson distribution is a limiting case
of the binomial distribution under the following conditions.
(i) n, the number of trials, is indefinitely large, i.e., n → ∞.
(ii) p, the constant probability of success for each trial, is indefinitely small, i.e., p → 0.
(iii) np = λ (say) is finite.
Thus p = λ/n, q = 1 − λ/n, where λ is a positive real number.
The probability of x successes is then given by the limiting form of the binomial probability:
P(X = x) = lim B(x; n, p) = e^(−λ) λ^x / x!; x = 0, 1, 2, …
Here λ is known as the parameter of the distribution. We shall use the notation X ~ P(λ) to denote that X is a Poisson variate with parameter λ.
Mean and variance of the Poisson distribution are equal, each being λ.
Poisson distribution occurs when there are events which do not occur as outcomes of a
definite number of trials (unlike that in binomial distribution) of an experiment but which
occur at random points of time and space wherein our interest lies only in the number of
occurrences of the event, not in its non-occurrences.
Following are some instances where the Poisson distribution may be successfully employed:
(i) Number of deaths from a disease (not in the form of an epidemic) such as heart attack or
cancer or due to snake bite.
(ii) Number of suicides reported in a particular city.
(iii) The number of defective material in a packing manufactured by a good concern.
(iv) Number of faulty blades in a packet of 100.
(v) Number of air accidents in some unit of time.
(vi) Number of printing mistakes at each page of the book.
(vii) Number of telephone calls received at a particular telephone exchange in some unit of
time or connections to wrong numbers in a telephone exchange.
(viii) Number of cars passing a crossing per minute during the busy hours of a day.
(ix) The number of fragments received by a surface area ‘A’ from a fragment atom bomb.
(x) The emission of radioactive (alpha) particles.
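As an illustrative sketch, Poisson probabilities can be computed directly from the formula; the value of λ below (e.g. an average of 3 telephone calls per minute) is hypothetical:

from math import exp, factorial

lam = 3.0
def poisson_pmf(x):
    # P(X = x) = e^(-lam) * lam^x / x!
    return exp(-lam) * lam ** x / factorial(x)

# probability of receiving at most 2 calls in a minute
print(sum(poisson_pmf(x) for x in range(3)))

# mean and variance are both (approximately) equal to lam
mean = sum(x * poisson_pmf(x) for x in range(100))
var  = sum((x - mean) ** 2 * poisson_pmf(x) for x in range(100))
print(mean, var)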
Introduction to sampling
Parameter: It is a characteristic of the population, such as the population mean (µ) and the population variance (σ²).
Statistic: It is a function of the sample values only and is used as an estimate of a parameter, e.g., the sample mean (x̄), the sample variance (S²).
Standard Error: The standard deviation of the sampling distribution of a statistic is known
as standard error and denoted by S.E.
Standard error of mean: It is the positive square root of the variance of sampling
distribution of mean
S.E. of mean = √(σ²/n) = σ/√n, where σ = population standard deviation and n = sample size.
Utility of S.E.: S.E. plays a very important role in large sample theory and forms the basis of the testing of hypotheses. If t is any statistic, then for large samples
Z = (t − E(t)) / S.E.(t) ~ N(0, 1)
Sampling vs complete enumeration
Sample survey: A survey involving only a part of the population is called a sample survey. A sample is a subset of the population.
Complete enumeration/census survey: A survey in which each and every unit of the population is considered is known as complete enumeration. The money, manpower and time required to carry out a complete enumeration are generally much larger than for a sample survey.
The main merits of sampling technique over the complete enumeration may be outlined as
follows:
1. Less time. There is considerable saving in time and labour since only a part of the
population has to be examined. The sampling results can be obtained more rapidly
and the data can be analysed much faster since relatively fewer data have to be
collected and processed.
2. Reduced cost of the survey. Sampling usually results in a reduction in cost in terms of money and in terms of man-hours. Although the amount of labour and the expenses involved in collecting information are generally greater per unit of sample than in complete enumeration, the total cost of the sample survey is expected to be much smaller than that of the complete census.
3. Greater Accuracy of Results. The results of a sample survey are usually much more
reliable than those obtained from a complete census.
4. Greater Scope. A sample survey generally has greater scope as compared with a complete census. Complete enumeration is impracticable, rather inconceivable, if the survey requires highly trained personnel and sophisticated equipment for the collection and analysis of the data. Since a sample survey saves time and money, it is possible to have a thorough and intensive enquiry, because more detailed information can be obtained from a small group of respondents.
5. If the population is too large, as for example, trees in a jungle, we are left with no way
but to resort to sampling.
6. If testing is destructive, i.e., if the quality of an article can be determined only by destroying the article in the process of testing, sampling must be resorted to, as for example:
(i) Testing the quality of milk or chemical salt by analysis,
(ii) Testing the breaking strength of chalks,
(iii) Testing of crackers and explosives,
(iv) Testing the life of an electric tube or bulb, etc.
Random number table: In a table of random numbers each of the digits 0, 1, 2, …, 9 occurs with approximately the same frequency and independently of each other, and so does each of the pairs 00 to 99, triplets 000 to 999, quadruplets 0000 to 9999, and so on.
The method of drawing the random number consists in the following steps:
(i) To identify the N units in the population with the numbers from 1 to N.
(ii) To select at random, any page of the “random number table” and pick up the
numbers in any row or column at random.
The population units corresponding to the numbers selected in step (ii) constitute the random
sample.
Test of Significance
It is the statistical procedure for deciding whether the difference under study is significant or not. Common tests of significance are the t-test, F-test and Chi-square (χ²) test.
Error in Sampling:
The main object of sampling is to draw a valid inference about the population parameter on the basis of a sample drawn from it, and in doing so we are liable to commit two types of error: Type-I error and Type-II error.
Type-I error: Rejecting H0 when it is true; α = P{Reject H0 when it is true} = P{Reject H0 | H0}. It is also known as “producer’s risk”; α is the size of the Type-I error.
Type-II error: Accepting H0 when it is false; β = P{Accept H0 | H1}. It is also known as “consumer’s risk”; β is the size of the Type-II error.
If the calculated value ≥ the tabulated value at the given degrees of freedom and level of significance, the result is significant and we reject H0, i.e., we accept H1. If the calculated value < the tabulated value, the result is non-significant and we accept H0, i.e., we reject H1, and conclusions are drawn accordingly.
Degrees of freedom: number of observations (n) minus the number of restrictions (k) imposed upon them, i.e., degrees of freedom = n − k.
Student's t-test
The t-test was first given by W.S. Gosset in 1908 and modified by R.A. Fisher in 1926.
Definition: Let xi (i = 1, 2, …, n) be a random sample of size n drawn from a normal population with mean µ and variance σ². Then Student’s t is defined by the statistic:
t = (x̄ − µ) / (S/√n)
where x̄ = (1/n) Σ xi is the sample mean and S² = (1/(n − 1)) Σ (xi − x̄)² is an unbiased estimate of the population variance σ².
Fisher’s ‘t’ (Definition). It is the ratio of a standard normal variate to the square root of an independent chi-square variate divided by its degrees of freedom. If ξ is a N(0, 1) variate and χ² is an independent chi-square variate with n d.f., then Fisher’s t is given by:
t = ξ / √(χ²/n)
Applications of the t-test:
To test the significance of the difference of a sample mean from a hypothetical value of the population mean.
To test the significance of the difference between two sample means.
To test the significance of an observed sample correlation coefficient and regression coefficient.
First application, under H0: (i) the sample has been drawn from the population with mean µ0, or (ii) there is no significant difference between the sample mean x̄ and the population mean µ0.
Second application, under H0: (i) µx = µy, or (ii) the sample means x̄ and ȳ do not differ significantly. [The samples are of sizes n1 and n2, and σx² = σy² = σ², i.e., the population variances are equal and unknown.]
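A minimal Python sketch of the first two applications, using scipy (which is not part of the notes) and hypothetical samples:

from scipy import stats

sample_x = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
sample_y = [11.4, 11.9, 11.6, 11.8, 11.5, 11.7, 12.0, 11.6]

# First application: H0: the population mean equals the hypothetical value mu0 = 12
t1, p1 = stats.ttest_1samp(sample_x, popmean=12.0)

# Second application: H0: mu_x = mu_y (equal, unknown population variances assumed)
t2, p2 = stats.ttest_ind(sample_x, sample_y, equal_var=True)

print(t1, p1)
print(t2, p2)    # reject H0 at the 5% level of significance if the p-value < 0.05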
F-test
Definition of F-test: If X and Y are two independent chi-square variates with υ1 and υ2 degrees of freedom respectively, then the F statistic is given by:
F = (X/υ1)/(Y/υ2) with (υ1, υ2) d.f.
Applications of F-test :
1. To test the equality of two population variances.
2. To test the significance of an observed sample correlation coefficient.
3. To test the significance of an observed multiple correlation co-efficient.
4. To test the significance of the equality of several means (analysis of variance in design of experiments).
5. To test the linearity of regression.
To test the equality of two population variances:
H0: σx² = σy² = σ² (say)
F = Sx² / Sy²
where Sx² = (1/(n1 − 1)) Σ (xi − x̄)²
and Sy² = (1/(n2 − 1)) Σ (yi − ȳ)²
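An illustrative Python sketch of the variance-ratio test with hypothetical samples (scipy is used only for the tail probability of the F distribution, and is not part of the notes):

from scipy import stats

x = [23.1, 24.5, 22.8, 25.0, 23.9, 24.2, 22.5, 24.8]
y = [23.8, 24.0, 23.9, 24.1, 23.7, 24.2]

def sample_var(a):
    # unbiased estimate S^2 = sum((a_i - mean)^2) / (n - 1)
    m = sum(a) / len(a)
    return sum((ai - m) ** 2 for ai in a) / (len(a) - 1)

sx2, sy2 = sample_var(x), sample_var(y)
F   = max(sx2, sy2) / min(sx2, sy2)       # larger variance in the numerator
df1 = (len(x) if sx2 >= sy2 else len(y)) - 1
df2 = (len(y) if sx2 >= sy2 else len(x)) - 1

p_value = 2 * stats.f.sf(F, df1, df2)     # two-sided test of H0: sigma_x^2 = sigma_y^2
print(F, p_value)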
Chi-square test
If X ~ N(µ, σ²), then Z = (X − µ)/σ is a standard normal variate, and the square of a standard normal variate is known as a chi-square variate with 1 d.f.
Chi-square test was first discovered by Karl Pearson in 1900.
Definition of Chi-square test of Goodness of fit: If Oi (i = 1,2,3...,n) is a set of observed
frequencies and Ei (i = 1,2,3...,n) is the corresponding set of expected frequencies, then chi-
square is given by
χ² = Σ [(Oi − Ei)²/Ei], summed over i = 1 to n, with (n − 1) d.f.
Conditions for the validity of χ2 test:
(i) The sample observations should be independent, i.e., the sample should be random.
(ii) No theoretical cell frequency should be less than 5.
(iii) The total number of frequencies should be reasonably large (> 50).
(iv) Σ Oi = ΣEi.
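A short Python sketch of the goodness-of-fit test, using scipy (not part of the notes); the example of 120 throws of a die is hypothetical:

from scipy import stats

observed = [22, 17, 20, 26, 21, 14]       # observed frequencies of the faces 1..6
expected = [20, 20, 20, 20, 20, 20]       # H0: the die is unbiased; note sum(Oi) = sum(Ei)

chi2, p_value = stats.chisquare(observed, f_exp=expected)   # d.f. = 6 - 1 = 5
print(chi2, p_value)                      # reject H0 if p_value < 0.05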
Applications of the χ² test:
1. Test of goodness of fit: It enables us to test whether the deviations of the observed frequencies from those expected under theory are just due to chance.
2. Test of independence of attributes: It enables us to test whether two or more attributes are independent of each other.
Contingency Table.
We consider two attributes A and B, A divided into r classes A1, A2... A r and B divided into s
classes B1, B2,...,B S. Such a classification in which attributes are divided into more than two
classes is known as manifold classification. The various cell frequencies can be expressed in a table known as an r × s manifold contingency table, where (Ai) is the number of persons possessing the attribute Ai (i = 1, 2, …, r), (Bj) is the number of persons possessing the attribute Bj (j = 1, 2, …, s) and (AiBj) is the number of persons possessing both the attributes Ai and Bj (i = 1, 2, …, r; j = 1, 2, …, s).
Also, Σi (Ai) = Σj (Bj) = N, where N is the total frequency.
CONTINGENCY TABLE (r × s)
Yates’ correction for continuity: In a 2 × 2 contingency table, the number of d.f. is (2 − 1)(2 − 1) = 1. If any one of the theoretical cell frequencies is less than 5, then use of the pooling method for the χ² test results in a χ² with 0 d.f. (since 1 d.f. is lost in pooling), which is meaningless. In this case we apply the correction due to F. Yates (1934), which is usually known as “Yates’ correction for continuity” [as we know, χ² is a continuous distribution and it fails to maintain its character of continuity if any of the expected frequencies is less than 5; hence the name ‘correction for continuity’]. This consists in adding 0.5 to the cell frequency which is less than 5 and then adjusting the remaining cell frequencies accordingly. The χ² test of goodness of fit is then applied without the pooling method.
For a 2 × 2 contingency table with cell frequencies
a b
c d
we have χ² = N (ad − bc)² / [(a + b)(c + d)(a + c)(b + d)], where N = a + b + c + d.
According to Yates’ correction, as explained above, we subtract (or add) ½ from ‘a’ and ‘d’ and add (subtract) ½ to ‘b’ and ‘c’, so that the marginal totals are not disturbed at all.
The corrected value of χ² is then given as:
χ² = N (|ad − bc| − N/2)² / [(a + b)(c + d)(a + c)(b + d)]
CORRELATION
If the change in one variable affects a change in the other variables, the variables are
said to be correlated. If the increase (or decrease) in one results in a corresponding increase
(or decrease) in the other, correlation is said to be direct or positive e.g., (i) height and weight
of a group of persons (ii) income and expenditure.
If increase (or decrease) in one results in corresponding decrease (or increase) in the
other, correlation is said to be indirect or negative. e.g., (i) volume and pressure of a perfect
gas. (ii) price and demand of a commodity.
r (X, Y) = Cov(X, Y) / (σX σY) = Σ (xi − x̄)(yi − ȳ) / √[Σ (xi − x̄)² Σ (yi − ȳ)²]
Scatter diagram: It is the simplest way of representing bivariate data diagrammatically; by looking at the scatter of the plotted points we can form a rough idea as to whether the variables are correlated or not. If the points are very dense, i.e. very close to each other, we should expect a fairly good amount of correlation between the variables, and if the points are widely scattered, a poor correlation is expected. This method, however, is not suitable if the number of observations is fairly large.
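The formula for r can be applied directly, as in the following Python sketch with hypothetical paired observations:

x = [65, 66, 67, 67, 68, 69, 70, 72]      # e.g. heights (hypothetical)
y = [67, 68, 65, 68, 72, 72, 69, 71]      # e.g. weights (hypothetical)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sdx = (sum((xi - mx) ** 2 for xi in x) / n) ** 0.5
sdy = (sum((yi - my) ** 2 for yi in y) / n) ** 0.5

r = cov / (sdx * sdy)
print(r)        # lies between -1 and +1; positive here, so the correlation is direct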
a. Lines of Regression: If the variables in a bivariate distribution are related, we will find that the points in the scatter diagram cluster around some curve, called the curve of regression. If the curve is a straight line, it is called the line of regression.
REGRESSION
The literal meaning of regression is “stepping back towards the average”. The term was first used by Sir Francis Galton. Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data. In regression analysis there are two types of variables: the dependent variable and the independent variable.
r = ±√(bxy · byx), where byx and bxy are the regression coefficients of Y on X and of X on Y respectively.
4. If one regression coefficient is greater than unity the other must be less than unity.
Line of regression of X on Y:
x − x̄ = r (σx/σy) (y − ȳ)
Line of regression of Y on X:
y − ȳ = r (σy/σx) (x − x̄)
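A Python sketch of fitting both regression coefficients to a hypothetical bivariate data set:

x = [1, 2, 3, 4, 5, 6, 7]
y = [9, 8, 10, 12, 11, 13, 14]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sx2 = sum((xi - mx) ** 2 for xi in x) / n
sy2 = sum((yi - my) ** 2 for yi in y) / n

byx = sxy / sx2          # regression coefficient of Y on X = r * (sigma_y / sigma_x)
bxy = sxy / sy2          # regression coefficient of X on Y = r * (sigma_x / sigma_y)
r = (byx * bxy) ** 0.5   # |r| is the geometric mean of the two regression coefficients

# line of regression of Y on X:  y - my = byx * (x - mx)
print(f"y = {my - byx * mx:.3f} + {byx:.3f} x")
print(f"r = {r:.3f}")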
Design of Experiment
Basic Principles of Experimental Designs:
The basic principles of experimental designs are randomization, replication and local
control. These principles make a valid test of significance possible. Each of them is described
briefly in the following subsections.
(1) Randomization: The first principle of an experimental design is randomization, which is a random process of assigning treatments to the experimental units. The principle of randomization asserts that each treatment has an equal chance/probability of being allotted to any plot.
(2) Replication: Repetition of the treatment is called Replication. The number, the shape and
the size of replicates depend upon the nature of the experimental material.
(3) Local Control: The process of dividing the whole experimental material into groups of homogeneous plots (called blocks) in such a manner that the plots within a block are homogeneous and the plots between blocks are heterogeneous. Blocking is done perpendicular to the direction of the fertility gradient.
The name ‘basic design’ is due to the fact that these were the first designs to be discovered by
Prof.R.A.Fisher (Father of Design of Experiments).
Blocks: In agricultural experiments, most of the times we divide the whole experimental unit
(field) into relatively homogeneous sub-groups or strata. These strata, which are more
uniform amongst themselves than the field as a whole are known as blocks.
COMPLETELY RANDOMIZED DESIGN (CRD):
When the treatments are arranged randomly over the predetermined homogeneous set of experimental units, the design is known as a Completely Randomized Design. Incidentally, CRD is the only design in which the treatments need not be applied an equal number of times. However, this relaxation should not be used indiscriminately.
Applicability:
When the experimental material is homogeneous, CRD is adopted. Normally this condition is
not achieved in the field experiments. Thus, CRD is applied in Laboratory experiments or Pot
experiments or in the Greenhouse.
Mathematical Model
Yij = µ + Ti + eij
Where µ = General Effect
Ti = Effect due to applying ith treatment in the jth plot
eij = Error due to applying ith treatment in the jth plot
Yij = Yield due to applying ith treatment in the jth plot
LAYOUT
T1 T3 T4 T1
T5 T2 T5 T3
T2 T4 T1 T2
T3 T3 T5 T4
T4 T1 T2 T3
ANOVA
Sources of Variation    D.F.                          S.S.    M.S.S. = S.S./D.F.            F
Treatment               t − 1                         S1      S1/(t − 1) = VT               VT/VE
Error                   (N − 1) − (t − 1) = N − t     S2      S2/(N − t) = VE
Total                   N − 1                         S
Correction Factor (C.F.) = G2/N
Total Sum of Squares (T.S.S.) = ∑Y2ij - C.F. = S
Treatment Sum of Squares (Tr.S.S.) = ∑Ti2/r - C.F. = S1
Error Sum of Squares (E.S.S.) = T.S.S.- Tr.S.S.= S2
S.E. per plot = √VE
S.E. of the difference between two treatment means = √(2VE/r)
C.D. = t0.05 (for error d.f.) × S.E.(d)
C.V. (%) = (S.E. per plot / Grand Mean) × 100
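A minimal Python sketch of the CRD analysis with three hypothetical treatments, using scipy (not part of the notes):

from scipy import stats

t1 = [20, 22, 23, 21]      # yields under treatment 1 (4 replications, hypothetical)
t2 = [25, 27, 26, 28]      # treatment 2
t3 = [19, 18, 21, 20]      # treatment 3

F, p_value = stats.f_oneway(t1, t2, t3)   # F = VT/VE with (t - 1, N - t) d.f.
print(F, p_value)          # treatment differences are significant if p_value < 0.05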
RANDOMIZED BLOCK DESIGN (RBD)
[The RBD layout, the treatment × replication yield table (yields Yij, treatment totals Ti and means ti, replication totals Rj, grand total G and grand mean G.M.) and the corresponding ANOVA table are not reproduced here.]
Advantages:
Increased precision is obtained due to the use of local control.
Any number of treatments can be included; if a large number of homogeneous units is available, a large number of treatments can be included.
The analysis is simple; even if some observations are missing, the analysis can still be carried out (using the missing plot technique).
The amount of information in RBD is more than that of CRD. Thus RBD is more
efficient than CRD
Disadvantages
RBD is not suitable for a large number of treatments, because a large number of treatments increases the block size and hence the heterogeneity within blocks, which increases the experimental error.
Despite this disadvantage, RBD is a versatile design and is the one most frequently used in agricultural experiments.