RM Record
Classification: The collected data, also known as raw data or ungrouped data, are always in an unorganized form and need to be organized and presented in a meaningful and readily comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential for an investigator to condense a mass of data into a comprehensible and assimilable form. The process of grouping data into different classes or subclasses according to some characteristics is known as classification. Thus, classification is the first step in tabulation.
Objects of classification: The following are main objectives of classifying the data:
1. It condenses the mass of data in an easily assimilable form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspects of the data.
4. It enables one to get a mental picture of the information collected.
5. It helps in the statistical treatment of the information collected.
Types of classification: Statistical data are classified in respect of their characteristics. Broadly, there are four basic types of classification, namely chronological, geographical, qualitative and quantitative classification.
• Chronological classification: In this type of classification the data are classified on the basis of time.
Eg: Data related to population, the sales of a firm, and the imports and exports of a country are always subjected to chronological classification.
• Geographical classification: In this type of classification the data are classified according to geographical region or place. For instance, the production of paddy in different states of India, the production of wheat in different countries, etc. This classification is usually listed in alphabetical order for easy reference.
Eg:
Country:                  America   China   Denmark   France   India
Yield of wheat (kg/acre): 1925      893     225       439      862
• Qualitative classification: In this type of classification data are classified on the basis of some attribute or quality like sex, literacy, religion, employment etc. Such attributes cannot be measured along a scale. For example, if the population is to be classified in respect of one attribute, say sex, then we can classify it into two classes, namely males and females. Similarly, it can also be classified into 'employed' and 'unemployed' on the basis of another attribute, 'employment'. Thus, when the classification is done with respect to one attribute which is dichotomous in nature, two classes are formed, one possessing the attribute and the other not possessing it. This type of classification is called simple or dichotomous classification.
The classification, where two or more attributes are considered and several classes are formed,
is called a manifold classification.
Eg: If we classify the population simultaneously with respect to two attributes, e.g. sex and employment, then the population is first classified with respect to 'sex' into males and females. Each of these classes may then be further classified into 'employed' and 'unemployed' on the basis of the attribute 'employment', and as such the population is classified into four classes, namely
i. Male employed
ii. Male unemployed
iii. Female employed
iv. Female unemployed
Still, the classification may be further extended by considering other attributes like marital status etc. This can be shown by a chart in which the population is first divided into male and female, and each of these classes is then divided into employed and unemployed.
Tabulation is the process of summarizing classified or grouped data in the form of a table so
that it is easily understood, and an investigator is quickly able to locate the desired information.
A table is a systematic arrangement of classified data in columns and rows. Thus, a statistical
table makes it possible for the investigator to present a huge mass of data in a detailed and
orderly form. It facilitates comparison and often reveals certain patterns in data which are
otherwise not obvious. Classification and tabulation, as a matter of fact, are not two distinct processes; they go together. Before tabulation, data are classified and then displayed under different columns and rows of a table. Tabulation is essential for the following reasons.
1. It conserves space and reduces explanatory and descriptive statement to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of errors and omissions.
4. It provides a basis for various statistical computations.
Generally accepted principles of tabulation: Such principles of tabulation, particularly of constructing statistical tables, can be briefly stated as follows:
1. Every table should have a clear, concise and adequate title so as to make the table
intelligible without reference to the text and this title should always be placed just above
the body of the table.
2. Every table should be given a distinct number to facilitate easy reference.
3. The column headings (captions) and the row headings (stubs) of the table should be clear
and brief.
4. The units of measurements under each heading or sub-heading must always be indicated.
5. Explanatory footnotes, if any, concerning the table should be placed directly beneath
the table, along with the reference symbols used in the table.
6. Source or sources from where the data in the table have been obtained must be indicated
just below the table.
7. Usually, the columns are separated from one another by lines which make the table more
readable and attractive. Lines are always drawn at the top and bottom of the table and
below the captions.
8. There should be thick lines to separate the data under one class from the data under another class, and the lines separating the sub-divisions of the classes should be comparatively thin lines.
9. The columns may be numbered to facilitate reference.
10. Those columns whose data are to be compared should be kept side by side. Similarly,
percentages and/or averages must also be kept close to the data.
11. It is generally considered better to approximate figures before tabulation, as this reduces unnecessary detail in the table itself.
12. In order to emphasize the relative significance of certain categories, different kinds of
type, spacing and indentations may be used.
13. It is important that all column figures be properly aligned. Decimal points and (+) or
(-) signs should be in perfect alignment.
14. Abbreviations should be avoided to the extent possible and ditto marks should not be
used in the table.
15. Miscellaneous and exceptional items, if any, should be usually placed in the last row of
the table.
16. A table should be made as logical, clear, accurate and simple as possible. If the data happen to be very large, they should not be crowded into a single table, for that would make the table unwieldy and inconvenient.
17. Total of rows should normally be placed in the extreme right column and that of columns
should be placed at the bottom.
18. The arrangement of the categories in a table may be chronological, geographical,
alphabetical or according to magnitude to facilitate comparison.
Above all, the table must suit the needs and requirements of an investigation.
Advantages Of Tabulation:
Statistical data arranged in a tabular form serve following objectives:
1. It simplifies complex data and the data presented is easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages, dispersion,
correlation etc.
4. It presents facts in minimum possible space and unnecessary repetitions and
explanations are avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references, and they make it easier to present the information
in the form of graphs and diagrams.
Preparing A Table: The making of a compact table is itself an art. It should contain all the information needed within the smallest possible space. The purpose of tabulation and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table.
Format Of a Table: A table consists of the table number, title and head note at the top; the captions (column headings) across the top of the body; the stubs (row headings) down the left side; the body containing the entries; and the footnotes and source at the bottom.
1. Table number: A table should be numbered for easy reference and identification. This
number, if possible, should be written in the centre at the top of the table. Sometimes it is
also written just before the title of the table.
2. Title of the table: A good table should have a clearly worded, brief but unambiguous title
explaining the nature of the data contained in the table. It should also state arrangement
of data and the period covered. The title should be placed centrally on the top of a table
just below the table number (or just after table number in the same line).
3. Captions or column headings: Captions in a table stand for brief and self-explanatory headings of vertical columns. Captions may involve headings and sub-headings as well. The unit of the data contained should also be given for each column. Usually, a relatively less important and shorter classification should be tabulated in the columns.
4. Stubs or row designations: Stubs stand for brief and self-explanatory headings of horizontal rows. Normally, a relatively more important classification is given in rows. Also, a variable with a large number of classes is usually represented in rows. For example, rows may stand for score classes and columns for data related to the sex of students. In that case, there will be many rows for the score classes but only two columns, for male and female students.
5. Body of the table: The body of the table contains the numerical information of frequency
of observations in different cells.
6. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or
information included in the table which needs some explanation. Thus, they are meant for
explaining or providing further details about the data, that have not been covered in title,
captions and stubs.
7. Sources of data: Lastly one should also mention the source of information from which
data is taken. This may preferably include the name of the author, volume, page and the
year of publication. This should also state whether the data contained in the table is of
'primary or secondary' nature.
Requirements of a Good Table: A good statistical table is not merely a careless grouping of columns and rows; it should summarize the total information in an easily accessible form in the minimum possible space. Thus, while preparing a table, one must have a clear idea of the information to be presented, the facts to be compared and the points to be stressed. Though there is no hard and fast rule for forming a table, the generally accepted principles stated above should be kept in mind.
Types Of Tables:
Tables can be classified according to their purpose, stage of enquiry, nature of data or number
of characteristics used.
On the basis of the number of characteristics, tables may be classified as follows:
a. Simple and complex tables
b. General purpose and special purpose (summary) tables
Simple or one-way table: In this type only one characteristic is shown, for example, the number of adults in different occupations in a locality.
Complex or two-way table: A table which contains data on two characteristics is called a two-way table. In such a case, therefore, either the stub or the caption is divided into two coordinate parts. In the given table, as an example, the caption may be further divided in respect of 'sex'. This subdivision is shown in a two-way table, which now contains two characteristics, namely occupation and sex.
Manifold or higher order table: More and more complex tables can be formed by including
3 or more characteristics. For example, we may further classify the caption sub-heading in the
above table in respect of 'marital status', 'religion' and 'socio-economic status' etc. A table which
has more than two characteristics is termed a manifold table.
General purpose tables, also known as reference tables or repository tables, provide information for general use or reference. They usually contain detailed information and are not constructed for a specific discussion.
Special purpose tables, also known as summary tables, provide information for particular
discussion. When attached to a report they are found in the body of the text.
DIAGRAMMATIC AND GRAPHICAL REPRESENTATION
OF RAW DATA
One of the most convincing and appealing ways in which statistical results may be presented is through diagrams and graphs. A single diagram can often represent given data more effectively than a thousand words, and diagrams are easy to understand even for ordinary people. A diagram is a visual form for the presentation of statistical data, highlighting their basic facts and relationships. If we draw diagrams on the basis of the data collected, they will easily be understood and appreciated by all. A diagram is readily intelligible and saves a considerable amount of time and energy.
Types of Diagrams
One-dimensional (bar) diagrams: In such diagrams only one dimension, i.e. the height, is used; the width is not considered. These include:
1. Line Diagram.
2. Simple Bar Diagram.
3. Multiple Bar Diagram.
4. Sub-divided Bar Diagram.
5. Percentage Bar Diagram.
A line diagram is used where there are many items to be shown and there is not much difference in their values. Such a diagram is prepared by drawing a vertical line for each item according to the scale. The distance between the lines is kept uniform. A line diagram makes comparison easy, but it is less attractive.
In two-dimensional diagrams, the area represents the data and so the length and breadth
have both to be considered. Such diagrams are also called Area Diagrams or Surface
Diagrams.
Rectangles
• Rectangles are used to represent the relative magnitude of two or more values.
• The area of the rectangles is kept in proportion to the values.
• Rectangles are placed side by side for comparison.
• We may represent the figures as they are given or may convert them to percentages and
then subdivide the length into various components.
Squares
• The rectangular method of diagrammatic presentation is difficult to use where the
values of items vary widely.
• The method of drawing a square diagram is very simple.
• One has to take the square root of the values of various items that are to be shown in
the diagrams and then select a suitable scale to draw the squares.
Pie Diagram or Circular Diagram
• In such diagram, both the total and the component parts or sectors can be shown. The
area of a circle is proportional to the square of its radius.
• While making comparisons, pie diagrams should be used on a percentage basis and not
on an absolute basis.
Three-dimensional diagrams consist of cubes, cylinders, spheres, etc. In such diagrams three things, namely length, width and height, have to be taken into account. Of all the figures, cubes are the easiest to make: the side of a cube is drawn in proportion to the cube root of the magnitude of the data. Cube roots can be ascertained with the help of logarithms: the logarithm of the figure is divided by 3, and the antilog of that value is the cube root.
❖ PICTOGRAMS AND CARTOGRAMS
Pictograms are not abstract presentations such as lines or bars but actually depict the kind of data we are dealing with. Pictures are attractive and easy to comprehend, and as such this method is particularly useful in presenting statistics to the layman. When pictograms are used, data are represented through pictorial symbols that are carefully selected.
Graphs
A graph is a visual form of presentation of statistical data. A graph is more attractive than a table of figures. Even a common man can understand the message of data from a graph, and comparisons can be made between two or more phenomena very easily with its help.
Graphs are divided into:
▪ Histogram.
▪ Frequency Polygon.
▪ Frequency Curve.
▪ Ogive.
▪ Lorenz Curve.
A. Histogram: In a histogram the class intervals are marked along the x-axis and, on each interval, a rectangle is erected whose height represents the corresponding frequency.
B. Frequency Polygon: If we mark the midpoints of the top horizontal sides of the rectangles in the histogram and then join them by straight lines, the figure so formed is called a frequency polygon. This is done under the assumption that the frequencies in a class interval are evenly distributed throughout the class. The area of the polygon is equal to the area of the histogram, because the area left outside it is equal to the area included in it.
C. Frequency Curve: If the middle points of the upper boundaries of the rectangles of a histogram are connected by a smooth freehand curve, the diagram so obtained is called a frequency curve. The curve should begin and end at the base line.
D. Ogives: For a set of observations, we know how to construct a frequency distribution. In some cases we may require the number of observations less than a given value or more than a given value. This is obtained by accumulating (adding) the frequencies up to (or above) the given value. This accumulated frequency is called cumulative frequency. These cumulative frequencies are then listed in a table called a cumulative frequency table. The curve obtained by plotting cumulative frequencies against the class boundaries is called a Cumulative Frequency Curve or an Ogive.
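As a minimal sketch of how a "less than" cumulative frequency table is built, the following Python snippet accumulates the class frequencies of the grouped distribution used later in this section (classes 10–20 up to 120–150); the variable names are illustrative.

```python
# "Less than" cumulative frequencies for an ogive: a running total of the
# class frequencies, paired with the upper class boundaries.
import itertools

upper_limits = [20, 30, 50, 90, 120, 150]   # upper boundaries of the classes
frequencies  = [2, 8, 20, 25, 16, 9]        # class frequencies

cumulative = list(itertools.accumulate(frequencies))

for limit, cf in zip(upper_limits, cumulative):
    print(f"Less than {limit}: {cf}")
# Plotting the points (upper limit, cumulative frequency) and joining them
# gives the "less than" ogive.
```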
ARITHMETIC MEAN
In ungrouped or raw data, individual items are given. The average of n numbers is obtained by finding their sum and then dividing it by n:

$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Merits
1. It can be easily calculated.
2. Its calculations are based on all the observations.
3. It is easy to understand.
Demerits
1. It may not be represented in actual data and so it is theoretical.
2. The extreme values have greater effect on mean.
3. It cannot be calculated if all the values are not known.
Uses of Arithmetic mean
1. A common man uses mean for calculating average marks obtained by a student.
2. It is extensively used in practical statistics.
3. Estimates are always obtained by mean.
Direct method: If the n observations in the raw data consist of distinct values denoted by x1, x2, x3, ..., xn of the observed variable x, occurring with frequencies f1, f2, f3, ..., fn respectively, then the arithmetic mean of the variable x is given by

$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$
The word "average" implies a value in the distribution, around which the other values are
distributed, gives a mental picture of the central value. There are several kinds of averages, of
which of which the commonly used are
3. The mode.
Many of the questions in patient satisfaction surveys include rating scales, which require calculating means and standard deviations for data analysis. This can be done using popular spreadsheet software, such as Microsoft Excel, or even online calculators. If neither of these is readily available, both the mean and the standard deviation of a data set can be calculated using arithmetic formulas. Following are brief descriptions of the mean and standard deviation, with examples of how to calculate each.
The Mean: For a data set, the mean is the sum of the observations divided by the number of observations. It identifies the central location of the data, sometimes referred to in English as the average.
The arithmetic mean is widely used in statistical calculation; it is sometimes simply called the mean. To obtain the mean, the individual values are first added together and then divided by the number of observations. The operation of adding together is called 'summation' and is denoted by the sign Σ. The individual observations are denoted by the sign X, and the mean is denoted by X̄ (X bar).
The arithmetic mean works well for values that fit the normal distribution. It is sensitive to extreme values, which makes it perform poorly for data that are highly skewed.
Eg: The diastolic blood pressures of 10 individuals are 83, 75, 81, 79, 71, 95, 75, 77, 84 and 90. The total is 810, so the mean is 810/10 = 81.
The advantage of the mean is that it is easy to calculate. The disadvantage is that it may sometimes be unduly influenced by abnormal values in the distribution.
The mean is the average of all the numbers and is sometimes called the arithmetic mean. To calculate it, add together all of the numbers in a set and then divide the sum by the total count of numbers.
Calculation of arithmetic mean for individual observations: The arithmetic mean is computed by summing up the observations and dividing the sum by the number of observations.
Eg: The following are the monthly incomes of 10 employees in an office (in Rs): 1780, 1760, 1810, 1680, 1940, 1790, 1890, 1960, 1810, 1050.
Calculate the mean income.
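A minimal Python sketch of this calculation (the list name is illustrative); the ten incomes sum to 17470, so the mean works out to Rs 1747.

```python
# Arithmetic mean of ungrouped data: sum the observations and divide by n.
incomes = [1780, 1760, 1810, 1680, 1940, 1790, 1890, 1960, 1810, 1050]

mean_income = sum(incomes) / len(incomes)
print(mean_income)   # 17470 / 10 = 1747.0
```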
For a frequency distribution,

$\bar{X} = \frac{\sum fX}{N}$

so that, for example, X̄ = 756/60 = 12.6 when ΣfX = 756 and N = 60.
Eg: Consider the following continuous frequency distribution.

X               CL          f     Mid value m    fm
Less than 20    10 – 20     2     15             30
20 – 30         20 – 30     8     25             200
30 – 50         30 – 50     20    40             800
50 – 90         50 – 90     25    70             1750
90 – 120        90 – 120    16    105            1680
Above 120       120 – 150   9     135            1215
                            N = 80               Σfm = 5675

X̄ = Σfm/N = 5675/80 = 70.94
Note: The first interval has been taken equal to the second, and the last equal to the penultimate one. Any of the methods used for a continuous series can be applied to find X̄.
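A short Python check of the direct method on the table above; it reproduces N = 80, Σfm = 5675 and a mean of about 70.94 (the variable names are illustrative).

```python
# Direct method for a continuous series: mean = sum(f * m) / N,
# where m is the mid-value of each class.
mid_values  = [15, 25, 40, 70, 105, 135]
frequencies = [2, 8, 20, 25, 16, 9]

N = sum(frequencies)                                            # 80
total_fm = sum(f * m for f, m in zip(frequencies, mid_values))  # 5675
print(total_fm / N)                                             # 70.9375
```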
Uses of the arithmetic mean
a. A common man can use it to calculate averages.
b. It is extensively used in practical statistics.
c. Estimates are always obtained by the mean.
d. A businessman uses it to find out the operation cost, profit per unit of article, and output per man and per machine.
Merits:
a. It can be easily calculated
b. Its calculation is based on all the observations
c. It is easy to understand
d. It is rightly defined by the mathematical formula
e. It is least affected by sampling fluctuations
f. It is the best average to compare two or more series.
g. It is the average obtained by calculations and it does not depend upon any position
Demerits:
1. It may not be represented in actual data
2. The extreme values have greater effect on mean
3. It cannot be calculated if values are not known
4. It cannot be determined for qualitative data such as love, beauty, honesty
5. It may lead to fallacious conclusions in the absence of the original observations.
Standard Deviation
The standard deviation is the most common measure of variability, measuring the spread of the data and the relationship of the mean to the rest of the data. If the data points are close to the mean, indicating that the responses are fairly uniform, then the standard deviation will be small. Conversely, if many data points are far from the mean, indicating that there is a wide variance in the responses, then the standard deviation will be large. If all the data values are equal, then the standard deviation will be zero.
The standard deviation is calculated using the following formula:

$s = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$

where Σ = sum of, X = individual score, M = mean of all scores, and n = sample size (number of scores).
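A minimal sketch of this formula in Python, reusing the blood-pressure readings from the mean example above (the helper names are illustrative).

```python
import math

# Sample standard deviation: s = sqrt(sum((X - M)^2) / (n - 1)).
scores = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]

n = len(scores)
M = sum(scores) / n                              # mean of all scores (81.0 here)
squared_devs = sum((x - M) ** 2 for x in scores) # sum of squared deviations
s = math.sqrt(squared_devs / (n - 1))
print(round(s, 2))   # approx 7.32 for this data
```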
Median is defined as the middle most or the central value of the variable in a set of
observations, when the observations are arranged either in ascending or in descending order of
their magnitudes. It divides the arranged series into two equal parts. The median is a positional average, whereas the arithmetic mean is a calculated average.
Calculation of Median
If the given data is ungrouped, arrange the n values of the given variable in ascending (or
descending) order of magnitudes.
When the data is Ungrouped
Case 1: When n is odd
In this case the ((n+1)/2)th term is the median:

$M_d = \text{value of the } \left(\frac{n+1}{2}\right)\text{th term}$

Case 2: When n is even
In this case there are two middle terms, the (n/2)th term and the (n/2 + 1)th term. The median is the average of these two terms:

$M_d = \frac{\left(\frac{n}{2}\right)\text{th term} + \left(\frac{n}{2}+1\right)\text{th term}}{2}$
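The two cases fold into one small Python function; this is a sketch with illustrative data, using the 1-based positions from the formulas above.

```python
def median(values):
    """Median of ungrouped data: arrange in ascending order, take the
    ((n+1)/2)th term when n is odd, or the average of the (n/2)th and
    (n/2 + 1)th terms when n is even (1-based positions)."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                       # ((n+1)/2)th term
    return (s[n // 2 - 1] + s[n // 2]) / 2     # average of the two middle terms

print(median([7, 3, 9, 1, 5]))       # 5   (n odd)
print(median([7, 3, 9, 1, 5, 11]))   # 6.0 (n even: (5 + 7) / 2)
```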
MODE
Modal class: It is that class in a grouped frequency distribution in which the mode lies.
The modal class can be determined either by inspection or with the help of grouping table.
$\text{Mode} = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times i = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i$

where l = lower limit of the modal class, i = width of the modal class, fm = frequency of the modal class, f1 = frequency of the class preceding the modal class, f2 = frequency of the class succeeding the modal class, Δ1 = fm − f1, and Δ2 = fm − f2.
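A sketch of the grouped-data mode formula in Python, on an illustrative distribution (the classes and frequencies are not from the text, and the modal class is assumed not to be the first or last class).

```python
# Mode = l + (fm - f1) / (2*fm - f1 - f2) * i for grouped data.
classes     = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [4, 9, 16, 11, 6]

m = frequencies.index(max(frequencies))   # modal class located by inspection
l = classes[m][0]                         # lower limit of the modal class
i = classes[m][1] - classes[m][0]         # class width
fm, f1, f2 = frequencies[m], frequencies[m - 1], frequencies[m + 1]

mode = l + (fm - f1) / (2 * fm - f1 - f2) * i
print(mode)   # 20 + 7/12 * 10 = 25.83...
```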
MERITS, DEMERITS AND USES OF MODE
Merits
1. It can be easily understood.
2. It can be located in some cases by inspection.
3. It is capable of being ascertained graphically.
Demerits
1. There are different formulae for its calculations which ordinarily give different answers.
2. Mode is indeterminate in some cases; some series have two or more than two modes.
3. It is an unsuitable measure as it is affected more by sampling fluctuations.
Uses
The mode is used to find the most typical or common value of a series, e.g. the most common size of ready-made garments or shoes.

DECILES
The value of the variable which divides the series, when arranged in ascending order, into 10 equal parts is called a decile. Deciles are denoted by D1, D2, D3, ..., D9. The fifth decile, D5, is the median of the given data.
Computation of Deciles
Case 1: Computation of Deciles for Individual series
In this case, the kth decile is given by

$D_k = \text{value of the } \left[\frac{k(n+1)}{10}\right]\text{th term}, \quad k = 1, 2, 3, \ldots, 9,$

when the series is arranged in ascending order.
Computation of Percentiles
Case 1: Computation of Percentile of Individual series
In this case, the kth percentile is given by

$P_k = \text{value of the } \left[\frac{k(n+1)}{100}\right]\text{th term}, \quad k = 1, 2, 3, \ldots, 99.$
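A minimal sketch of both formulas in Python, interpolating when the computed position is fractional; the data and helper names are illustrative.

```python
def value_at_position(sorted_vals, pos):
    """Value at a (possibly fractional) 1-based position, by linear interpolation."""
    lower = int(pos)
    frac = pos - lower
    if frac == 0 or lower >= len(sorted_vals):
        return sorted_vals[min(lower, len(sorted_vals)) - 1]
    return sorted_vals[lower - 1] + frac * (sorted_vals[lower] - sorted_vals[lower - 1])

def decile(values, k):        # D_k = value of the [k(n+1)/10]th term
    return value_at_position(sorted(values), k * (len(values) + 1) / 10)

def percentile(values, k):    # P_k = value of the [k(n+1)/100]th term
    return value_at_position(sorted(values), k * (len(values) + 1) / 100)

data = [12, 18, 25, 30, 31, 40, 44, 50, 61]   # n = 9
print(decile(data, 5))       # 31: the 5th decile, equal to the median
print(percentile(data, 25))  # 21.5: midway between the 2nd and 3rd terms
```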
KARL PEARSON'S COEFFICIENT OF CORRELATION
When deviations are taken from the actual means, the correlation coefficient is given by

$r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$

where x = (X − X̄) and y = (Y − Ȳ). It is obvious that while applying this formula we do not have to calculate separately the standard deviations of the X and Y series, as is required by the basic formula. This greatly simplifies the task of calculating the correlation coefficient.
Steps In Calculating Correlation Coefficient
1. Take the deviations of the X series from the mean of X and denote these deviations by x.
2. Square these deviations and obtain the total, i.e., Σx².
3. Take the deviations of the Y series from the mean of Y and denote these deviations by y.
4. Square these deviations and obtain the total, i.e., Σy².
5. Multiply the deviations of the X and Y series together and obtain the total, i.e., Σxy.
6. Substitute the values of Σxy, Σx² and Σy² in the above formula.
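These six steps translate directly into a few lines of Python; the paired data below are illustrative.

```python
import math

# Karl Pearson's r via deviations from the means:
# r = sum(xy) / sqrt(sum(x^2) * sum(y^2)).
X = [10, 20, 30, 40, 50]
Y = [12, 22, 35, 38, 55]

mean_x, mean_y = sum(X) / len(X), sum(Y) / len(Y)
x = [xi - mean_x for xi in X]              # step 1
y = [yi - mean_y for yi in Y]              # step 3

sum_xy = sum(a * b for a, b in zip(x, y))  # step 5
sum_x2 = sum(a * a for a in x)             # step 2
sum_y2 = sum(b * b for b in y)             # step 4

r = sum_xy / math.sqrt(sum_x2 * sum_y2)    # step 6
print(round(r, 3))
```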
RANK CORRELATION
The coefficient of rank correlation is based on the ranks of the values of the variates and is denoted by R. It is applied to problems in which the data cannot be measured quantitatively but qualitative assessment is possible, such as beauty, honesty, etc. In this case the best individual is given rank number 1, the next rank 2, and so on. The coefficient of rank correlation is given by the formula:
$R = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$

where D is the difference between the corresponding ranks of the two series and n is the number of pairs of observations.
• When the ranks are given, the difference of the ranks of X from the corresponding ranks of Y is calculated to obtain the column D; these terms are squared to obtain the column D², and the resulting values are substituted in the formula above.
• When only the data are given and the ranks are not mentioned, ranks are first assigned to both series X and Y by giving rank 1 to the highest value in each series, and so on.
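A sketch of rank correlation in Python, assigning rank 1 to the highest value as described above; the scores are illustrative and assumed free of ties (tied ranks need the usual correction).

```python
def ranks(values):
    # Rank 1 for the highest value, rank 2 for the next, and so on (no ties).
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def rank_correlation(X, Y):
    # R = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), D = difference of paired ranks.
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(X), ranks(Y)))
    n = len(X)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

X = [86, 72, 95, 60, 78]   # e.g. scores given by judge 1
Y = [80, 70, 90, 65, 85]   # scores given by judge 2
print(rank_correlation(X, Y))   # 0.9 for this data
```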
CALCULATION OF "t" TEST AND ITS INTERPRETATION
In biological experiments it becomes essential to compare the means of two samples to draw a conclusion. Visual inspection of the difference between two sample means usually fails to establish whether that difference is significant. Therefore, the degree or level of significance of the difference between two means has to be quantified to reach a definite conclusion.
To test the significance of the difference between the means of two samples, W. S. Gosset (1908) applied a statistical tool called the t test. The pen name of Gosset was 'Student', and hence this test is called Student's t test. The t ratio is the ratio of the difference between two means to the standard error of that difference. R. A. Fisher later developed Student's t test further and explained it in various ways.
In Student's t test we make a choice between two alternatives:
1. To accept the null hypothesis (no difference between two means)
2. To reject the null hypothesis that is the difference between the means of two samples is
statistically significant.
Determination of Significance:
Probability of occurrence of any calculated value of t is determined by comparing it with the
value given in the t table corresponding to the combined degree of freedom derived from the
number of observations in the samples under study. If the calculated value of t exceeds the value given at p = 0.05 in the table (the 5% level), it is said to be significant.
If the calculated value is less than the value given in the table, it is not significant.
Degree of Freedom.
The quantity in the denominator, which is one less than the independent number of observations in a sample, is called the degree of freedom.
In a paired t test, df = N − 1; in an unpaired t test, df = N1 + N2 − 2 (where N1 and N2 are the numbers of observations in each of the two series).
Application of the t-distribution: The following are some examples of testing the significance of various results obtained from small samples.
For a single sample, the standard deviation based on the deviations d from the assumed mean is

$S = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n - 1}}$

where d = deviation from the assumed mean, and the test statistic is

$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \quad df = n - 1.$
Interpretation of the results:
If the calculated value of t is more than the table value t0.05, the difference between X̄ and μ is significant at the 5% level of significance.
If the calculated value of t is less than the table value t0.05, the difference between X̄ and μ is not significant at the 5% level of significance; hence the sample might have been drawn from a population with mean μ.
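As a minimal sketch, a one-sample t test in pure Python using the formulas above (the sample and the hypothesised mean mu are illustrative):

```python
import math

# t = (mean - mu) / (S / sqrt(n)), with df = n - 1.
sample = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]
mu = 75                      # hypothesised population mean

n = len(sample)
mean = sum(sample) / n
S = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

t = (mean - mu) / (S / math.sqrt(n))
print(f"t = {t:.2f}, df = {n - 1}")
# Compare |t| with the table value at p = 0.05 for df = 9 (2.262):
# if |t| exceeds it, the difference between the sample mean and mu is significant.
```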
Fiducial limits of the population mean: Assuming that the sample is a random sample from a normal population of unknown mean, the 95% fiducial limits of the population mean (μ) are:

$\bar{X} \pm t_{0.05} \frac{S}{\sqrt{n}}$
THE CHI-SQUARE (χ²) TEST
The χ² test is one of the simplest and most widely used non-parametric tests in statistical work. It was first used by Karl Pearson in the year 1900. The quantity χ² describes the magnitude of the discrepancy between theory and observation. It is defined as

$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where,
O = Observed frequencies
E = Expected frequencies
STEPS: To determine the value of χ², the steps required are:
a) Calculate the expected frequencies. In general, the expected frequency for any cell can be calculated from the following equation:
E = (RT × CT) / N
E = Expected frequency
RT = The row total for the row containing the cell
CT = The column total for the column containing the cell
N= The total number of observations.
b) Take the difference between the observed and expected frequencies and obtain the squares of these differences, i.e., obtain the values of (O − E)².
c) Divide the values of (O − E)² obtained in step (b) by the respective expected frequencies and obtain the total, Σ(O − E)²/E. This gives the value of χ², which can range from zero to infinity. If χ² is zero, the observed and expected frequencies completely coincide. The greater the discrepancy between the observed and expected frequencies, the greater the value of χ².
The calculated value of χ² is compared with the table value of χ² for the given degrees of freedom at a certain specified level of significance. If, at the stated level (generally the 5% level is selected), the calculated value of χ² is more than the table value of χ², the difference between theory and observation is considered significant. If, on the other hand, the calculated value of χ² is less than the table value, the difference between theory and observation is not considered significant.
The computed value of χ² is a random variable which takes on different values from sample to sample; that is, χ² has a sampling distribution. It should be noted that the value of χ² is always positive and its upper limit is infinity. Also, since χ² is derived from observations, it is a statistic and not a parameter; the chi-square (χ²) test is, therefore, termed non-parametric.
Degrees of Freedom
While comparing the calculated value of χ² with the table value, we must determine the degrees of freedom. By degrees of freedom we mean the number of classes to which values can be assigned arbitrarily, or at will, without violating the restrictions or limitations placed. The number of degrees of freedom is obtained by subtracting from the number of classes the number of degrees of freedom lost in fitting; for a contingency table with r rows and c columns, df = (r − 1)(c − 1).
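A short Python sketch of the test of independence on an illustrative 2 × 2 table, computing E = (RT × CT)/N for every cell as in the steps above:

```python
# Chi-square test of independence for a contingency table.
observed = [[30, 20],     # illustrative counts, e.g. attribute A vs attribute B
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, O in enumerate(row):
        E = row_totals[i] * col_totals[j] / N   # expected frequency for the cell
        chi2 += (O - E) ** 2 / E

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.2f}, df = {df}")
# The table value at the 5% level for df = 1 is 3.841; here chi-square = 4.0,
# so the hypothesis of independence would be rejected.
```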
Interpretation
The chi square test is one of the most popular statistical inference procedures today.
It is applicable to very large number of problems in practice which can be summed up under
the following heads:
a) x2 test as a test of independence:
With the help of chi square test, we can find out whether two or more attributes are associated
or not.
Suppose we have N observations classified according to some attributes; we may then ask whether the attributes are related or independent.
In order to test whether or not the attributes are associated we take the null hypothesis that there
is no association in the attributes under study or, in other words, the two attributes are
independent.
If the calculated value of x2 is less than the table value at a certain level of significance
(generally 5% level), we say that the results of the experiment provide no evidence for doubting
the hypothesis or, in other words, the hypothesis that the attributes are not associated holds
good.
On the other hand, if the calculated value of x2 is greater than the table value at a certain level
of significance, we say that the results of the experiment do not support the hypothesis or, in
other words the attributes are associated.
It should be noted that χ² is not a measure of the degree or form of relationship. It only tells us whether two principles of classification are or are not significantly related, without reference to any assumptions concerning the form of the relationship.
b) χ² test as a test of goodness of fit:
The χ² test is popularly known as a test of goodness of fit because it enables us to ascertain how closely theoretical distributions such as the Binomial, Poisson, Normal, etc., fit empirical distributions, i.e., those obtained from sample data.
When an ideal frequency curve whether normal or some other type is fitted to the data, we are
interested in finding out how well this curve fits with the observed facts.
A test of the goodness of fit of the two can be made just by inspection, but such a test is
obviously inadequate. Precision can be secured by applying the x2 test.
The following are the steps in testing the goodness of fit:
1. A null and alternative hypothesis are established, and a significance level is selected for
rejection of the null hypothesis.
2. A random sample of observations is drawn from a relevant statistical population.
3. A set of expected or theoretical frequencies is derived under the assumption that the null
hypothesis is true. This generally takes the form of assuming that a particular probability
distribution is applicable to the statistical population under consideration.
4. The observed frequencies are compared with the expected, or theoretical frequencies.
5. If the calculated value of χ² is less than the table value at a certain level of significance (generally the 5% level) and for the relevant degrees of freedom, the fit is considered to be good, i.e., the divergence between the actual and expected frequencies is attributed to fluctuations of simple sampling. On the other hand, if the calculated value of χ² is greater than the table value, the fit is poor, i.e., the divergence cannot be attributed to fluctuations of simple sampling; rather it is due to the inadequacy of the theory to fit the observed facts.
6. It should be borne in mind that in repeated sampling, too good a fit is just as likely as too bad a fit. When the computed chi-square is too close to zero, we should suspect the possibility that the two sets of frequencies have been manipulated to force them to agree, and the design of our experiment should therefore be thoroughly checked.
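A minimal goodness-of-fit sketch in Python, testing illustrative die-roll counts against a uniform (equal-frequency) distribution:

```python
# Goodness of fit: do the observed counts fit the assumed distribution?
observed = [8, 12, 9, 11, 13, 7]                 # 60 rolls of a die (illustrative)
expected = [sum(observed) / len(observed)] * 6   # 10 per face under uniformity

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                           # one df lost fitting the total
print(f"chi-square = {chi2:.2f}, df = {df}")
# The 5% table value for df = 5 is 11.07; chi-square = 2.8 is well below it,
# so the fit is considered good.
```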
The F-distribution (named after the famous statistician R. A. Fisher) measures the ratio of the variance between groups to the variance within groups. The variance between the sample means is the numerator and the variance within the samples is the denominator. If there is no real difference from group to group, any sample difference will be explainable by random variation, and the variance between groups should be close to the variance within groups. However, if there is a real difference between the groups, the variance between groups will be significantly larger than the variance within groups.
4. Compare the calculated value of F with the table value of F at a certain critical level (generally we take the 5% level of significance). If the calculated value of F is greater than the table value, it is concluded that the difference in sample means is significant, i.e., it could not have arisen due to fluctuations of simple sampling; in other words, the samples do not come from the same population. On the other hand, if the calculated value of F is less than the table value, the difference is not significant and has arisen due to fluctuations of simple sampling.
It is customary to summarise the calculations for the sums of squares, together with the corresponding numbers of degrees of freedom and mean squares, in a table called the analysis of variance table, generally abbreviated ANOVA. A specimen ANOVA table for a one-way classification model is given below:

Analysis of variance (ANOVA) table: One-way classification model

Source of variation    Sum of squares   Degrees of freedom   Mean square (MS)      F ratio
Between samples        SSB              k − 1                MSB = SSB/(k − 1)     F = MSB/MSW
Within samples         SSW              N − k                MSW = SSW/(N − k)
Total                  SST              N − 1

(Here k = number of samples and N = total number of observations.)
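As a sketch of the quantities that go into such a table, the following Python snippet computes the between-group and within-group sums of squares, the mean squares and the F ratio for three illustrative groups:

```python
# One-way ANOVA: F = MS between / MS within.
groups = [[23, 25, 21, 27],
          [30, 28, 32, 26],
          [22, 20, 24, 18]]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / N

# Between-samples SS: weighted squared deviations of group means from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-samples SS: squared deviations of observations from their own group mean.
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)   # df1 = k - 1
ms_within = ss_within / (N - k)     # df2 = N - k
F = ms_between / ms_within
print(f"F = {F:.2f}, df = ({k - 1}, {N - k})")
# Compare with the table value of F at the 5% level for (2, 9) df.
```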