RM Record

The document discusses the classification and tabulation of raw data, emphasizing the importance of organizing ungrouped data into meaningful formats for statistical analysis. It outlines various types of classification, including chronological, geographical, qualitative, and quantitative, and explains the process of tabulation as a means to summarize data in tables for easy comprehension. Additionally, it covers the principles and requirements for creating effective tables and diagrams to present statistical information clearly and efficiently.


CLASSIFICATION AND TABULATION OF RAW DATA

Classification: The collected data, also known as raw data or ungrouped data, are always in an
unorganized form and need to be organized and presented in a meaningful and readily
comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential
for an investigator to condense a mass of data into a more comprehensible and assimilable
form. The process of grouping data into different classes or sub-classes according to some
characteristics is known as classification. Thus, classification is the first step in tabulation.

Objectives of classification: The following are the main objectives of classifying data:
1. It condenses the mass of data into an easily assimilable form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspects of the data.
4. It enables one to get a mental picture of the information collected.
5. It helps in the statistical treatment of the information collected.
Types of classification: Statistical data are classified in respect of their characteristics.
Broadly, there are four basic types of classification, namely:

• Chronological classification: In chronological classification the collected data are
arranged according to the order of time, expressed in years, months, weeks, etc. The data
are generally classified in ascending order of time.

Eg: Data related to population, sales of a firm, or imports and exports of a country are
always subjected to chronological classification.

• Geographical classification: In this type of classification the data are classified according
to geographical region or place. For instance, the production of paddy in different states of
India, production of wheat in different countries, etc. This classification is usually listed in
alphabetical order for easy reference.

Eg:
Country:                   America  China  Denmark  France  India
Yield of wheat (kg/acre):  1925     893    225      439     862

• Qualitative classification: In this type of classification data are classified on the basis of
some attribute or quality, like sex, literacy, religion, employment, etc. Such attributes cannot
be measured on a scale. For example, if the population is to be classified in respect of one
attribute, say sex, then we can classify it into two classes, namely males and females.
Similarly, it can also be classified into 'employed' and 'unemployed' on the basis of another
attribute, 'employment'. Thus, when the classification is done with respect to one attribute
which is dichotomous in nature, two classes are formed, one possessing the attribute and
the other not possessing it. This type of classification is called simple dichotomous
classification.
The classification where two or more attributes are considered and several classes are formed
is called a manifold classification.

Eg: If we classify the population simultaneously with respect to two attributes, e.g. sex and
employment, then the population is first classified with respect to 'sex' into males and females.
Each of these classes may then be further classified into 'employed' and 'unemployed' on the
basis of the attribute 'employment', and as such the population is classified into four classes,
namely:

i. Male employed
ii. Male unemployed
iii. Female employed
iv. Female unemployed

The classification may be extended further by considering other attributes like marital status,
etc. This can be shown in a chart: the population splits into male and female, and each of
these classes splits into employed and unemployed.
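The four classes above are simply the cross product of the two dichotomous attributes; a minimal Python sketch (attribute names taken from the example, the rest illustrative):

```python
from itertools import product

# Dichotomous attributes from the example above.
sex = ["male", "female"]
employment = ["employed", "unemployed"]

# Manifold classification: the Cartesian product yields the four classes.
classes = [" ".join(combo) for combo in product(sex, employment)]
print(classes)
# ['male employed', 'male unemployed', 'female employed', 'female unemployed']

# Adding a third attribute such as marital status doubles the count to 8.
marital = ["married", "unmarried"]
extended = list(product(sex, employment, marital))
print(len(extended))  # 8
```

Each added dichotomous attribute doubles the number of classes, which is why manifold classifications grow quickly.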

• Quantitative classification: Quantitative classification refers to the classification of data
according to some characteristic that can be measured, such as height, weight, etc.
A variable may be either continuous, e.g. heights of persons (160.2-165.2, 150.1-152.4) and
weights (53.2-54.2 kg, 66.4-67.4 kg), or discrete, e.g. number of rooms or number of
machines (2, 3, 4, 6).
TABULATION OF RAW DATA

Tabulation is the process of summarizing classified or grouped data in the form of a table so
that it is easily understood, and an investigator is quickly able to locate the desired information.
A table is a systematic arrangement of classified data in columns and rows. Thus, a statistical
table makes it possible for the investigator to present a huge mass of data in a detailed and
orderly form. It facilitates comparison and often reveals certain patterns in data which are
otherwise not obvious. Classification and tabulation, as a matter of fact, are not two distinct
processes; they go together. Before tabulation, data are classified and then displayed under
the different columns and rows of a table. Tabulation is essential for the following reasons.
1. It conserves space and reduces explanatory and descriptive statement to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of errors and omissions.
4. It provides a basis for various statistical computations.
Generally accepted principles of tabulation: The principles of tabulation, particularly of
constructing statistical tables, can be briefly stated as follows:
1. Every table should have a clear, concise and adequate title so as to make the table
intelligible without reference to the text and this title should always be placed just above
the body of the table.
2. Every table should be given a distinct number to facilitate easy reference.
3. The column headings (captions) and the row headings (stubs) of the table should be clear
and brief.
4. The units of measurements under each heading or sub-heading must always be indicated.
5. Explanatory footnotes, if any, concerning the table should be placed directly beneath
the table, along with the reference symbols used in the table.
6. Source or sources from where the data in the table have been obtained must be indicated
just below the table.
7. Usually, the columns are separated from one another by lines which make the table more
readable and attractive. Lines are always drawn at the top and bottom of the table and
below the captions.
8. There should be thick lines to separate the data under one class from the data under
another class, and the lines separating the sub-divisions of the classes should be
comparatively thin.
9. The columns may be numbered to facilitate reference.
10. Those columns whose data are to be compared should be kept side by side. Similarly,
percentages and/or averages must also be kept close to the data.
11. It is generally considered better to approximate figures before tabulation, as this
reduces unnecessary detail in the table itself.
12. In order to emphasize the relative significance of certain categories, different kinds of
type, spacing and indentations may be used.
13. It is important that all column figures be properly aligned. Decimal points and (+) or
(-) signs should be in perfect alignment.
14. Abbreviations should be avoided to the extent possible and ditto marks should not be
used in the table.
15. Miscellaneous and exceptional items, if any, should be usually placed in the last row of
the table.
16. The table should be made as logical, clear, accurate and simple as possible. If the data
happen to be very large, they should not be crowded into a single table, for that would
make the table unwieldy and inconvenient.
17. Total of rows should normally be placed in the extreme right column and that of columns
should be placed at the bottom.
18. The arrangement of the categories in a table may be chronological, geographical,
alphabetical or according to magnitude to facilitate comparison.
Above all, the table must suit the needs and requirements of an investigation.
Advantages Of Tabulation:
Statistical data arranged in tabular form serve the following objectives:
1. It simplifies complex data and the data presented is easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages, dispersion,
correlation etc.
4. It presents facts in minimum possible space and unnecessary repetitions and
explanations are avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references, and they make it easier to present the information
in the form of graphs and diagrams.
Preparing A Table: The making of a compact table is itself an art. It should contain all the
information needed within the smallest possible space. What the purpose of tabulation is and
how the tabulated information is to be used are the main points to be kept in mind while
preparing a statistical table.
Format Of a Table:

Table number
Title
Head note
Stub heading | Caption heading (column headings)
Stub entries | Body
Footnotes
Source

An ideal table should consist of the following main parts:

1. Table number: A table should be numbered for easy reference and identification. This
number, if possible, should be written in the centre at the top of the table. Sometimes it is
also written just before the title of the table.
2. Title of the table: A good table should have a clearly worded, brief but unambiguous title
explaining the nature of the data contained in the table. It should also state arrangement
of data and the period covered. The title should be placed centrally on the top of a table
just below the table number (or just after table number in the same line).
3. Captions or column headings: Captions in a table stand for brief and self-explanatory
headings of vertical columns. Captions may involve headings and sub-headings as
well. The unit of the data contained should also be given for each column. Usually, a relatively
less important and shorter classification should be tabulated in the columns.
4. Stubs or row designations: Stubs stand for brief and self-explanatory headings of
horizontal rows. Normally, a relatively more important classification is given in the rows.
Also, a variable with a large number of classes is usually represented in rows. For
example, rows may stand for classes of scores and columns for data related to the sex of
students. In that case, there will be many rows for the score classes but only two columns,
for male and female students.
5. Body of the table: The body of the table contains the numerical information of frequency
of observations in different cells.
6. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or
information included in the table which needs some explanation. Thus, they are meant for
explaining or providing further details about the data, that have not been covered in title,
captions and stubs.
7. Sources of data: Lastly one should also mention the source of information from which
data is taken. This may preferably include the name of the author, volume, page and the
year of publication. This should also state whether the data contained in the table is of
'primary or secondary' nature.

Requirements of a Good Table: A good statistical table is not merely a careless grouping of
columns and rows; it should summarize the total information in an easily accessible form in
the minimum possible space. Thus, while preparing a table, one must have a clear idea of the
information to be presented, the facts to be compared and the points to be stressed. Though
there is no hard and fast rule for forming a table, a few general points should be kept in mind:

1. A table should be formed in keeping with the objects of statistical enquiry.


2. A table should be carefully prepared so that it is easily understandable.
3. A table should be formed so as to suit the size of the paper. But such an adjustment
should not be at the cost of legibility.
4. If the figures in the table are large, they should be suitably rounded or approximated.
The method of approximation and units of measurements too should be specified.
5. Rows and columns in a table should be numbered and certain figures to be stressed may
be put in 'box' or 'circle' or in bold letters.
6. The arrangement of rows and columns should be in a logical and systematic order. This
arrangement may be alphabetical, chronological or according to size.
7. The rows and columns are separated by single, double or thick lines to represent various
classes and sub-classes used. The corresponding proportions or percentages should be
given in adjoining rows and columns to enable comparison. A vertical expansion of the
table is generally more convenient than the horizontal one.
8. The averages or totals of different rows should be given at the right of the table and
those of columns at the bottom of the table. Totals for every sub-class should also be
mentioned.
9. In case it is not possible to accommodate all the information in a single table, it is better
to have two or more related tables.

Types Of Tables:

Tables can be classified according to their purpose, stage of enquiry, nature of data or number
of characteristics used.
On the basis of the number of characteristics, tables may be classified as follows:
a. Simple and complex tables
b. General purpose and special purpose (summary) tables

Simple or one-way table: In this type only one characteristic is shown, e.g. the number of
adults in different occupations in a locality.

Eg: Occupations vs. no. of adults.

Two-way table: A table which contains data on two characteristics is called a two-way
table. In such a case either the stub or the caption is divided into two coordinate parts.
In the table given above, as an example, the caption may be further divided in respect of 'sex'.
This subdivision gives a two-way table, which now contains two characteristics, namely
occupation and sex.

Manifold or higher order table: More and more complex tables can be formed by including
3 or more characteristics. For example, we may further classify the caption sub-heading in the
above table in respect of 'marital status', 'religion' and 'socio-economic status' etc. A table which
has more than two characteristics is termed as manifold table.
General purpose tables, also known as reference tables or repository tables, provide
information for general use or reference. They usually contain detailed information and are
not constructed for a specific discussion.

Special purpose tables, also known as summary tables, provide information for particular
discussion. When attached to a report they are found in the body of the text.
DIAGRAMMATIC AND GRAPHICAL REPRESENTATION
OF RAW DATA

One of the most convincing and appealing ways in which statistical results may be presented
is through diagrams and graphs. Often a single diagram represents given data more
effectively than a thousand words. Diagrams are easy to understand, even for ordinary people.
A diagram is a visual form for the presentation of statistical data, highlighting their basic facts
and relationships. If we draw diagrams on the basis of the data collected, they will easily be
understood and appreciated by all. A diagram is readily intelligible and saves a considerable
amount of time and energy.

Significance of Diagrams and Graphs

1. They are attractive and impressive.


2. They make the data simple and intelligible.
3. They make comparison possible.
4. They save time and labour.
5. They have universal utility.
6. They give more information.
7. They have a great memorizing effect.

General rules for constructing diagrams

1. A diagram should be neatly drawn and be attractive.


2. The measurements of geometrical figures used in diagram should be accurate and
proportional.
3. The size of the diagrams should match the size of the paper.
4. Every diagram must have a suitable but short heading.
5. The scale should be mentioned in the diagram.
6. Diagrams should be neatly as well as accurately drawn with the help of drawing
instruments.
7. Index must be given for identification so that the reader can easily make out the meaning
of the diagram.
8. Footnote must be given at the bottom of the diagram.
9. Economy in cost and energy should be exercised in drawing the diagram.

Types of Diagrams

• One Dimensional Diagrams


• Two Dimensional Diagrams
• Three Dimensional Diagrams
• Pictograms and Cartograms
❖ ONE DIMENSIONAL DIAGRAMS

In such diagrams, only one-dimensional measurements, i.e., height is used, and the width
is not considered.

One dimensional is divided into

1. Line Diagram.
2. Simple Bar Diagram.
3. Multiple Bar Diagram.
4. Sub-divided Bar Diagram.
5. Percentage Bar Diagram.

A Line Diagram is used in cases where there are many items to be shown and there is not much
difference in their values. Such a diagram is prepared by drawing a vertical line for each item
according to the scale. The distance between the lines is kept uniform. A line diagram makes
comparison easy, but it is less attractive.

Simple Bar Diagram can be drawn either on horizontal or vertical base.


• Bars must be of uniform width and the intervening space between the bars must be equal.
• While constructing a simple bar diagram, the scale is determined on the basis of the
highest value of series.
• To make the diagram attractive, the bars can be coloured.
• Bar diagrams are used in business and economics.
• However, an important limitation of such diagrams is that they can present only one
classification and one category of data.

Multiple Bar Diagram


• Multiple bar diagram is used for comparing two or more sets of statistical data.
• Bars are constructed side by side to represent the set values for comparison.
• In order to distinguish the bars, they may be differently coloured or there may be
different types of crossings or dottings, etc.
• An index is also prepared to identify the meaning of the different colours or dottings.

Sub-divided Bar Diagram


• In a sub-divided bar diagram, the bar is sub-divided into various parts in proportion to
the value that is given in the data and the whole bar represents the total.
• Such diagrams are also called Component Bar Diagrams.
• The subdivisions are distinguished by different colours or crossings or dotting.
• The main defect of such a diagram is that all the parts do not have a common base to
enable one to compare accurately the various components of the data.
Percentage Bar Diagram
• The components are not the actual values but percentages of the whole.
• The main difference between the sub-divided and the percentage bar diagram is that
in the former the bars are of different heights, since their totals may differ, whereas in
the latter the bars are of equal height, since each bar represents 100 percent.
• In the case of data having sub-divisions, a percentage bar diagram is more appealing
than a sub-divided bar diagram.

❖ TWO DIMENSIONAL DIAGRAMS

In two-dimensional diagrams, the area represents the data and so the length and breadth
have both to be considered. Such diagrams are also called Area Diagrams or Surface
Diagrams.

The important types of area diagrams are: Rectangles, Squares, Pie-diagrams.

Rectangles
• Rectangles are used to represent the relative magnitude of two or more values.
• The area of the rectangles is kept in proportion to the values.
• Rectangles are placed side by side for comparison.
• We may represent the figures as they are given or may convert them to percentages and
then subdivide the length into various components.
Squares
• The rectangular method of diagrammatic presentation is difficult to use where the
values of items vary widely.
• The method of drawing a square diagram is very simple.
• One has to take the square root of the values of various items that are to be shown in
the diagrams and then select a suitable scale to draw the squares.
Pie Diagram or Circular Diagram
• In such a diagram, both the total and the component parts or sectors can be shown. The
area of a circle is proportional to the square of its radius.
• While making comparisons, pie diagrams should be used on a percentage basis and not
on an absolute basis.
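The sector of a pie diagram for each component is proportional to its share of the total, i.e. its central angle is 360° times its fraction of the whole. A small sketch with made-up component values:

```python
# Hypothetical expenditure components; the values are illustrative only.
components = {"food": 50, "rent": 30, "other": 20}
total = sum(components.values())

# Each sector's central angle is proportional to its share of the total.
angles = {item: value * 360 / total for item, value in components.items()}
print(angles)  # {'food': 180.0, 'rent': 108.0, 'other': 72.0}
```

The angles always sum to 360, which is a quick consistency check when constructing a pie diagram by hand.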

❖ THREE DIMENSIONAL DIAGRAMS (volume diagrams)

These consist of cubes, cylinders, spheres, etc. In such diagrams three things, namely
length, width and height, have to be taken into account. Of all these figures, cubes are the
easiest to make. The side of a cube is drawn in proportion to the cube root of the magnitude
of the data. Cube roots of figures can be ascertained with the help of logarithms: the
logarithm of a figure is divided by 3, and the antilog of that value is its cube root.
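The logarithm procedure can be checked numerically: dividing log x by 3 and taking the antilog does give the cube root, as this small sketch verifies:

```python
import math

def cube_root_via_logs(x):
    # Divide the (base-10) logarithm by 3, then take the antilog.
    return 10 ** (math.log10(x) / 3)

# The result agrees with the direct cube root for a few sample magnitudes.
for value in [8, 27, 1000]:
    assert math.isclose(cube_root_via_logs(value), value ** (1 / 3))
print(cube_root_via_logs(1000))  # approximately 10.0
```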
❖ PICTOGRAMS AND CARTOGRAMS

Pictograms are not abstract presentations such as lines or bars but actually depict the kind
of data we are dealing with. Pictures are attractive and easy to comprehend, and as such this
method is particularly useful in presenting statistics to the layman. When pictograms are used,
data are represented through carefully selected pictorial symbols.

Cartograms or statistical maps are used to give quantitative information on a geographical
basis. They are used to represent spatial distributions. The quantities on the map can be
shown in many ways, such as through shades, colours, dots, or by placing a pictogram in
each geographical unit.

Graphs
A graph is a visual form of presentation of statistical data. A graph is more attractive than a
table of figures. Even a common man can understand the message of data from a graph, and
comparisons can be made between two or more phenomena very easily with its help.
Graphs are divided into:

▪ Histogram.
▪ Frequency Polygon.
▪ Frequency Curve.
▪ Ogive.
▪ Lorenz Curve.

A. Histogram: A histogram is a bar chart or graph showing the frequency of occurrence of
each value of the variable being analyzed. In a histogram, data are plotted as a series of
rectangles. Class intervals are shown on the X-axis and the frequencies on the Y-axis.
The height of each rectangle represents the frequency of the class interval. Each rectangle
is joined with the next so as to give a continuous picture. Such a graph is also called a
staircase or block diagram. We cannot construct a histogram for a distribution with open-end
classes. A histogram is also quite misleading if the distribution has unequal intervals and
suitable adjustments in the frequencies are not made.

B. Frequency Polygon: If we mark the midpoints of the top horizontal sides of the rectangles
in the histogram and then join them by a straight line, the figure so formed is called a
Frequency Polygon. This is done under the assumption that the frequencies in a class
interval are evenly distributed throughout the class. The area of the polygon is equal to the
area of the histogram, because the area left outside is equal to the area included in it.

C. Frequency Curve: If the midpoints of the upper boundaries of the rectangles of a
histogram are connected by a smooth freehand curve, the resulting diagram is called a
Frequency Curve. The curve should begin and end at the base line.
D. Ogives: For a set of observations, we know how to construct a frequency distribution. In
some cases we may require the number of observations less than a given value or more
than a given value. This is obtained by accumulating (adding) the frequencies up to (or
above) the given value. The accumulated frequency is called the Cumulative Frequency,
and these cumulative frequencies are listed in a Cumulative Frequency Table. The curve
obtained by plotting cumulative frequencies is called a Cumulative Frequency Curve or
an Ogive.

An ogive is classified into:

a. The 'less than' ogive method starts with the upper limits of the classes and goes on
adding the frequencies. When these cumulative frequencies are plotted, we get a rising
curve.
b. The 'more than' ogive method starts with the lower limits of the classes and subtracts
the frequency of each class from the total frequency. When these cumulative frequencies
are plotted, we get a declining curve.
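Both ogives can be tabulated from the same frequency distribution; a sketch with illustrative class frequencies:

```python
from itertools import accumulate

# Illustrative class frequencies (say, for classes 0-10, 10-20, ..., 50-60).
freqs = [2, 8, 20, 25, 16, 9]
total = sum(freqs)

# 'Less than' ogive: running totals, plotted against upper class limits.
less_than = list(accumulate(freqs))
# 'More than' ogive: total minus everything below, against lower limits.
more_than = [total - prev for prev in [0] + less_than[:-1]]

print(less_than)  # [2, 10, 30, 55, 71, 80]  -> rising curve
print(more_than)  # [80, 78, 70, 50, 25, 9]  -> declining curve
```

Note that the last 'less than' value and the first 'more than' value both equal the total frequency, as the definitions require.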

E. Lorenz Curve: The Lorenz Curve is a graphical method of studying dispersion. It was
introduced by Max O. Lorenz, a great economist and statistician, to study the distribution
of wealth and income. It is also used to study the variability in the distribution of profits,
wages, revenue, etc. It is especially used to study the degree of inequality in the
distribution of income and wealth between countries or between different periods.
Cumulative percentage values of one variable are plotted against the cumulative
percentage values of the other variable, and the Lorenz curve is drawn through the points.
The curve starts from the origin (0, 0) and ends at (100, 100). If wealth, revenue, land,
etc. were equally distributed among the people of a country, the Lorenz curve would be
the diagonal of the square, but in practice this rarely happens. The deviation of the Lorenz
curve from the diagonal shows how unequally wealth, revenue, land, etc. are distributed
among the people.
MEAN

Arithmetic Mean of Raw Data

In ungrouped or raw data, individual items are given. The average of n numbers is
obtained by finding their sum (by adding) and then dividing it by n.

Let x1, x2, ..., xn be n numbers; then their average or arithmetic mean is

    x̄ = (x1 + x2 + x3 + ... + xn) / n = (Σ xi) / n, summing over i = 1 to n
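A quick check of this formula in Python (which also has it built into the standard library):

```python
from statistics import mean

data = [3, 2, 4, 1, 4, 4]        # illustrative raw observations
x_bar = sum(data) / len(data)    # the formula above: sum divided by n
assert x_bar == mean(data)       # agrees with the library implementation
print(x_bar)  # 3.0
```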

Merits and Demerits of Arithmetic Mean

Merits
1. It can be easily calculated.
2. Its calculations are based on all the observations.
3. It is easy to understand.
Demerits
1. It may not be represented in actual data and so it is theoretical.
2. The extreme values have greater effect on mean.
3. It cannot be calculated if all the values are not known.
Uses of Arithmetic mean
1. A common man uses mean for calculating average marks obtained by a student.
2. It is extensively used in practical statistics.
3. Estimates are always obtained by mean.

Arithmetic Mean of Ungrouped Data

Direct method: If the n observations in the raw data consist of distinct values x1, x2,
x3, ..., xn of the observed variable x, occurring with frequencies f1, f2, f3, ..., fn
respectively, then the arithmetic mean of the variable x is given by

    x̄ = (f1x1 + f2x2 + f3x3 + ... + fnxn) / (f1 + f2 + f3 + ... + fn) = (Σ fixi) / (Σ fi) = (Σ fixi) / N

where N = Σ fi = f1 + f2 + ... + fn = sum of the frequencies.
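The direct method translates line by line into code; a sketch with illustrative values and frequencies:

```python
# Illustrative distinct values and their frequencies.
x = [4, 8, 12]
f = [2, 3, 5]

N = sum(f)                                         # N = sum of frequencies
x_bar = sum(fi * xi for fi, xi in zip(f, x)) / N   # (sum of f*x) / N
print(x_bar)  # 9.2
```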

Short-cut method for grouped data

If the given data is grouped data, the mean x̄ is given by

    x̄ = A + (Σ fd) / N

where A is the assumed mean, d = x − A is the deviation of each value (or class midpoint)
from the assumed mean, fd is the product of the frequency and the corresponding deviation,
and N = Σ f is the sum of all the frequencies.
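The short-cut method can be verified against the direct method: with any convenient assumed mean A, the two agree. A sketch with illustrative figures:

```python
# Illustrative values (or class midpoints) and frequencies.
x = [10, 20, 30, 40]
f = [3, 5, 8, 4]
A = 30                         # assumed mean, any convenient value

N = sum(f)
direct = sum(fi * xi for fi, xi in zip(f, x)) / N
shortcut = A + sum(fi * (xi - A) for fi, xi in zip(f, x)) / N
assert direct == shortcut      # both give the same mean
print(shortcut)  # 26.5
```

The short-cut form was valuable for hand computation because the deviations d are small numbers, even when the raw values are large.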


CALCULATION OF MEAN AND STANDARD DEVIATION

The word "average" implies a value in the distribution around which the other values are
distributed; it gives a mental picture of the central value. There are several kinds of
averages, of which the commonly used ones are:

1. The Arithmetic mean

2. The Median and

3. The Mode.

Many of the questions in patient satisfaction surveys include rating scales, which require
calculating means and standard deviations for data analysis. This can be done using
popular spreadsheet software, such as Microsoft Excel, or even online calculators. If neither
of these is readily available, both the mean and standard deviation of a data set can be
calculated using arithmetic formulas. Following are brief descriptions of the mean and
standard deviation, with examples of how to calculate each.

The Mean: For a data set, the mean is the sum of the observations divided by the number of
observations. It identifies the central location of the data, sometimes referred to in English
as the average.

The mean is calculated using the following formula:

    M = (Σ X) / N

where Σ = sum of, X = individual data points, and N = sample size (number of data points).

Example: To find the mean of the data set 3, 2, 4, 1, 4, 4:

    M = (3 + 2 + 4 + 1 + 4 + 4) / 6 = 18 / 6 = 3

The arithmetic mean is widely used in statistical calculation. It is sometimes simply called
the mean. To obtain the mean, the individual values are first added together and then divided
by the number of observations. The operation of adding together is called 'summation' and
is denoted by the sign Σ. The individual observations are denoted by X, and the mean is
denoted by the sign X̄ (X bar).

The arithmetic mean works well for values that fit the normal distribution.
It is sensitive to extreme values, so it does not work well for data that are highly skewed.

Eg: The diastolic blood pressures of 10 individuals are 83, 75, 81, 79, 71, 95, 75, 77, 84, 90.
The total is 810, so the mean is 810/10 = 81.

The advantage of the mean is that it is easy to calculate.

The disadvantage is that it may sometimes be unduly influenced by abnormal values in the
distribution.

The mean is the average of all the numbers and is sometimes called the arithmetic mean. To
calculate the mean, add together all of the numbers in a set and then divide the sum by the
total count of numbers.

I. Calculation of Arithmetic Mean for Individual Observations

The arithmetic mean is computed by summing up the observations and dividing the sum by
the number of observations.

Symbolically:

    X̄ = (X1 + X2 + X3 + ... + Xn) / n = (Σ X) / n

where X̄ = arithmetic mean, Σ X = sum of all values of the variable X, that is X1, X2,
X3, ..., Xn, and n = number of observations.
a. The following data represent the number of productive tillers per plant of a wheat variety.
Calculate the mean number of tillers per plant.
Number of productive tillers: 10, 11, 10, 11, 9, 7, 9, 12, 11, 10

b. The following are the monthly incomes of 10 employees in an office, in Rs: 1780,
1760, 1810, 1680, 1940, 1790, 1890, 1960, 1810, 1050.
Calculate the mean income.
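Both exercises use the raw-data formula directly; a sketch of the computation:

```python
# Exercise (a): productive tillers per plant.
tillers = [10, 11, 10, 11, 9, 7, 9, 12, 11, 10]
print(sum(tillers) / len(tillers))   # mean number of tillers = 10.0

# Exercise (b): monthly incomes in Rs.
income = [1780, 1760, 1810, 1680, 1940, 1790, 1890, 1960, 1810, 1050]
print(sum(income) / len(income))     # mean income = Rs 1747.0
```

Note how the single low income (1050) pulls the mean below most of the observed values, illustrating the mean's sensitivity to extreme values mentioned earlier.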

II. Calculation of Arithmetic Mean - Discrete Series

In a discrete series, the arithmetic mean is calculated as follows.

Symbolically:

    X̄ = (Σ fX) / N

where f = frequency, X = the value of the variable in question, and N = Σ f = the sum of the
frequencies (total number of observations).

Steps:
1. Multiply the frequency of each row by the value of the variable and obtain the total Σ fX.
2. Divide the total obtained in step 1 by the number of observations, that is, the total
frequency.

E.g. Calculate the A.M. from the following data:
Marks obtained:  4   8   12   16   20
No. of students: 6   12  18   15   9

Marks X | No. of students f | fX
      4 |                 6 |  24
      8 |                12 |  96
     12 |                18 | 216
     16 |                15 | 240
     20 |                 9 | 180
        |            N = 60 | Σ fX = 756

As X̄ = (Σ fX) / N,
X̄ = 756/60 = 12.6

Series of Continuous series:


The difference in continuous series is midpoints of various el intervals are required to be
obtained. Midpoint L1+L2/2. Here L1 presents lower limit and L2 upper limit
Symbolically
∑ 𝑓𝑚
X=
𝑁
e.g:
Find the X from the following data
X: Less than 20 20 - 30 30 - 50 50 - 90 90 - 120 Above 120
f: 2 8 20 25 16 9

X CL f Mid value m fm
Less than 20 10 – 20 2 15 30
20 – 30 20 – 30 8 25 200
30 – 50 30 – 50 20 40 800
50 – 90 50 – 90 25 70 1750
90 – 120 90 – 120 16 105 1680
Above 120 120 – 150 9 135 1215
N = 80, Σfm = 5675
X̄ = Σfm/N = 5675/80 = 70.94
Note: The first open-ended interval has been taken equal in width to the second, and the last equal to the penultimate one. Any of the methods used for continuous series can then be applied to find X̄.
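A short sketch of the continuous-series method on the distribution above (after replacing the open-ended classes as the note describes):

```python
# Continuous series: mean = sum(f*m) / N, where m = (L1 + L2) / 2 is the class midpoint
classes = [(10, 20), (20, 30), (30, 50), (50, 90), (90, 120), (120, 150)]
freq = [2, 8, 20, 25, 16, 9]

midpoints = [(l + u) / 2 for l, u in classes]   # 15, 25, 40, 70, 105, 135
N = sum(freq)                                   # 80
mean = sum(f * m for f, m in zip(freq, midpoints)) / N
print(mean)  # 70.9375  (= 5675/80)
```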
Uses of the arithmetic mean
a. A common man can use it to calculate averages.
b. It is used extensively in practical statistics.
c. Estimates are commonly obtained by the mean.
d. A businessman uses it to find the operation cost, profit per unit of article, and output per man or per machine.
Merits:
a. It can be easily calculated
b. Its calculation is based on all the observations
c. It is easy to understand
d. It is rightly defined by the mathematical formula
e. It is least affected by sampling fluctuations
f. It is the best average to compare two or more series.
g. It is the average obtained by calculations and it does not depend upon any position
Demerits:
1. It may not correspond to any actual value in the data.
2. Extreme values have a greater effect on the mean.
3. It cannot be calculated if some values are unknown.
4. It cannot be determined for qualitative data such as love, beauty or honesty.
5. It may lead to fallacious conclusions in the absence of the original observations.
Standard Deviation
The standard deviation is the most common measure of variability, measuring the spread of the data and the relationship of the mean to the rest of the data. If the data points are close to the mean, indicating that the responses are fairly uniform, then the standard deviation will be small. Conversely, if many data points are far from the mean, indicating that there is a wide variance in the responses, then the standard deviation will be large. If all the data values are equal, then the standard deviation will be zero.
The standard deviation is calculated using the following formula:
S² = Σ(X - M)² / (n - 1)

where Σ = sum of, X = individual score, M = mean of all scores, and n = sample size (number of scores).

Example: To find the standard deviation of the data set: 3, 2, 4, 1, 4, 4


Step 1: Calculate the mean and the deviations
X M X-M (X-M)2
3 3 0 0
2 3 -1 1
4 3 +1 1
1 3 -2 4
4 3 +1 1
4 3 +1 1
Step 2: Using the deviations, calculate the standard deviation
S² = (0 + 1 + 1 + 4 + 1 + 1) / (6 - 1) = 8/5 = 1.6
S = √1.6 = 1.265
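The two steps of the worked example can be sketched directly:

```python
# Sample standard deviation: S^2 = sum((X - M)^2) / (n - 1)
data = [3, 2, 4, 1, 4, 4]
n = len(data)
M = sum(data) / n                                      # mean = 3.0
variance = sum((x - M) ** 2 for x in data) / (n - 1)   # 8/5 = 1.6
S = variance ** 0.5
print(round(S, 3))  # 1.265
```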
The standard deviation is a measure of dispersion.
Standard deviation is the square root of the arithmetic average of the squares of the deviations measured from the mean.
It is an important and popular measure of dispersion and was introduced by Karl Pearson.
It is known as the root mean square deviation because it is the square root of the mean of the squared deviations from the arithmetic mean.
It is denoted by the small Greek letter sigma (σ).

Calculation of the standard deviation:


A. Series of individual observations:
1. Calculate the actual mean of the observations.
2. Obtain the deviations of the values from the mean (calculate X - X̄).
3. Denote these deviations by x.
4. Square the deviations and obtain the total Σx².
5. Divide Σx² by the number of observations and take the square root: σ = √(Σx²/N).
Standard Deviation in Discrete Series
Characteristics and Uses of Standard Deviation
• It is rightly defined
• Its computation is based on all the observations
• If all the variate values are the same S.D = 0
• S.D is least affected by fluctuation of sampling
• It is affected by the change of scale, but not affected by the change of the origin.
Uses:
It is used in the computation of different statistical quantities such as regression coefficients, the correlation coefficient, etc.
It is also used in testing the reliability of certain statistical measures.
Merits and Demerits:
Standard deviation summarizes the deviation of a large distribution from the mean in one figure, used as a unit of variation. It indicates whether the variation of an individual from the mean is real or merely by chance.
Standard deviation helps in finding the suitable size of sample for valid conclusions, and helps in calculating the standard error.
Demerits:
• Standard deviation gives greater weightage to extreme values.
• The process of squaring deviations and taking the square root is lengthy.
• It is complex and difficult to understand.
Inference from the Mean and Standard Deviation
If the data points are close to the mean, indicating that the responses are fairly uniform,
then the standard deviation will be small. Conversely, if many data points are far from the
mean, indicating that there is wide variance in the responses, then the standard deviation will
be large.
However, the standard deviation alone is not particularly useful without a context within which one can determine its meaning.
A standard deviation of 1.265 with a mean of 3, as calculated in the example, is much different than a standard deviation of 1.265 with a mean of 12.
By calculating how the standard deviation relates to the mean, otherwise known as the coefficient of variation (CV), you will have a more uniform method of determining the relevance of the standard deviation and what it indicates about the responses of your sample.
The closer the CV is to 0, the greater the uniformity of the data.
The closer the CV is to 1, the greater the variability of the data.
CV = S/M
Using our example of a standard deviation of 1.265 and a mean of 3, you will see that the coefficient of variation is rather large, indicating that the data has a great deal of variability with respect to the mean and there is no general consensus among the sample.
CV = S/M = 1.265/3 = 0.42 (about 42%)
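The CV calculation for the worked example is a one-liner:

```python
# Coefficient of variation: CV = S / M (standard deviation relative to the mean)
M = 3.0        # mean from the worked example
S = 1.265      # standard deviation from the worked example
CV = S / M
print(round(CV, 2))  # 0.42, i.e. about 42%
```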
MEDIAN

Median is defined as the middle most or the central value of the variable in a set of
observations, when the observations are arranged either in ascending or in descending order of
their magnitudes. It divides the arranged series in two equal parts. Median is a position average,
whereas the arithmetic mean is a calculated average.
Calculation of Median
If the given data is ungrouped, arrange the n values of the given variable in ascending (or
descending) order of magnitudes.
When the data is Ungrouped
Case 1: When n is odd
In this case the ((n + 1)/2)th term is the median:
Median, Md or M = value of the ((n + 1)/2)th term
Case 2: When n is even
In this case there are two middle terms, the (n/2)th term and the (n/2 + 1)th term. The median is the average of these two terms:
Median, Md or M = [(n/2)th term + (n/2 + 1)th term] / 2
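The odd and even cases can be sketched in a single helper (the function name is illustrative):

```python
# Median of ungrouped data, following the odd/even cases above
def median(values):
    s = sorted(values)                  # arrange in ascending order
    n = len(s)
    if n % 2 == 1:                      # n odd: the ((n + 1)/2)th term
        return s[(n + 1) // 2 - 1]
    # n even: average of the (n/2)th and (n/2 + 1)th terms
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(median([7, 9, 10, 12, 8]))        # odd n
print(median([7, 9, 10, 12, 8, 11]))    # even n
```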
MODE

Modal class: It is the class in a grouped frequency distribution in which the mode lies.
The modal class can be determined either by inspection or with the help of a grouping table.

Mode = l + (fm - f1) / (2fm - f1 - f2) × i

where, l = the lower limit of the modal class
i = the width of the modal class
f1 = the frequency of the class preceding the modal class
fm = the frequency of the modal class
f2 = the frequency of the class succeeding the modal class.
Sometimes, the above formula fails to give the mode. In this case, the modal value lies in a
class other than the one containing maximum frequency. In such cases the following
formula is used:
Mode = l + Δ1 / (Δ1 + Δ2) × i

where Δ1 = fm - f1 and Δ2 = fm - f2.
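A sketch of the first mode formula, applied for illustration to the continuous-series distribution used earlier in this section (modal class 50-90, so fm = 25, f1 = 20, f2 = 16, i = 40):

```python
# Mode for a grouped distribution: Mode = l + (fm - f1) / (2*fm - f1 - f2) * i
def grouped_mode(l, i, fm, f1, f2):
    return l + (fm - f1) / (2 * fm - f1 - f2) * i

# Modal class 50-90 from the continuous-series example
mode = grouped_mode(l=50, i=40, fm=25, f1=20, f2=16)
print(round(mode, 2))  # 64.29
```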
MERITS, DEMERITS AND USES OF MODE
Merits
1. It can be easily understood.
2. It can be located in some cases by inspection.
3. It is capable of being ascertained graphically.

Demerits

1. There are different formulae for its calculation which ordinarily give different answers.
2. Mode may be indeterminate: some series have two or more modes.
3. It is an unsuitable measure as it is affected more by sampling fluctuations.

Uses

1. It is used for the study of mass popular fashion.


2. It is extensively used by businessmen and commercial managements.
DECILES

The value of the variable which divides the series, when arranged in ascending order, into 10
equal parts is called a decile.

Deciles are denoted by D1, D2, D3, …, D9. The fifth decile, D5, is the median of the given data.

Computation of Deciles
Case 1: Computation of Deciles for Individual series
In this case, the kth decile is given by
Dk = value of the [k(n + 1)/10]th term, k = 1, 2, 3, …, 9,
when the series is arranged in ascending order.

Case 2: Computation of Deciles for a Discrete Frequency Distribution


Step 1: Arrange the given data in ascending order.
Step 2: Compute the cumulative frequencies.
Step 3: Find N = Σfi.
Step 4: Compute iN/10 to find Di, the ith decile, i = 1, 2, 3, …, 9.
Step 5: Find the cumulative frequency just greater than iN/10. The corresponding value of the variable is the ith decile Di, i = 1, 2, 3, …, 9.

Case 3: Computation of Deciles for a Frequency Distribution with Class Intervals


Step 1: Compute the cumulative frequency table. Let N = Σfi.
Step 2: Compute iN/10 to find Di, the ith decile, i = 1, 2, 3, …, 9.
Step 3: Find the cumulative frequency just greater than iN/10 and the corresponding class. This class is called the decile class.

Use the formula
Di = L + (iN/10 - C) / f × h
L = lower limit of the decile class
C = cumulative frequency of the class preceding the decile class
f = frequency of the decile class
h = width of the decile class
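The three steps and the interpolation formula can be sketched together; the example reuses the continuous-series distribution from earlier in the section (a sketch assuming continuous, exhaustive class intervals):

```python
# ith decile for a frequency distribution with class intervals:
# Di = L + (i*N/10 - C) / f * h
def decile(i, classes, freqs):
    N = sum(freqs)
    target = i * N / 10
    C = 0                                    # cumulative frequency so far
    for (L, U), f in zip(classes, freqs):
        if C + f >= target:                  # first cumulative freq >= iN/10: the decile class
            return L + (target - C) / f * (U - L)
        C += f

classes = [(10, 20), (20, 30), (30, 50), (50, 90), (90, 120), (120, 150)]
freqs = [2, 8, 20, 25, 16, 9]
print(decile(5, classes, freqs))  # D5, the median: 66.0
```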
PERCENTILES

Computation of Percentiles
Case 1: Computation of percentiles for an individual series
In this case, the kth percentile is given by
Pk = value of the [k(n + 1)/100]th term,
when the series is arranged in ascending order, k = 1, 2, 3, …, 99.
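A sketch of the individual-series percentile rule, interpolating linearly when k(n + 1)/100 falls between two terms (the interpolation convention is an assumption; the text gives only the position formula):

```python
# kth percentile of an individual series: Pk = value of the k(n+1)/100-th term
def percentile(values, k):
    s = sorted(values)
    pos = k * (len(s) + 1) / 100       # 1-based position of the term
    lo = int(pos)
    if lo < 1:
        return s[0]
    if lo >= len(s):
        return s[-1]
    # linear interpolation between the lo-th and (lo+1)-th terms
    return s[lo - 1] + (pos - lo) * (s[lo] - s[lo - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(percentile(data, 50))   # 5.0, the median
print(percentile(data, 25))   # 2.5
```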


CALCULATION OF COEFFICIENT OF CORRELATION AND ITS
INTERPRETATION

Of the several mathematical methods of measuring correlation, Karl Pearson's


method, popularly known as Pearson's coefficient of correlation, is the most widely used in practice.
The Pearson's coefficient of correlation is denoted by the symbol ‘r’.
It is one of the few symbols that are used universally for describing the degree of correlation
between two series.
The formula for computing Pearson's r is:
r = Σxy / (N σx σy)
Here,
x = (X - X̄)
y = (Y - Ȳ)
σx = standard deviation of series X
σy = standard deviation of series Y
N = number of pairs of observations
r = the (product moment) correlation coefficient.
This method is to be applied only where deviations of items are taken from the actual mean and not from an assumed mean.
The value of the coefficient of correlation obtained by the above formula always lies between +1 and -1. When r = +1, there is perfect positive correlation between the variables. When r = -1, there is perfect negative correlation between the variables. When r = 0, there is no relationship between the two variables.
However, in practice such values of r as +1, -1 and 0 are rare. We normally get values lying between +1 and -1, such as +0.8, -0.26, etc.
The coefficient of correlation describes not only the magnitude of correlation but also its
direction. Thus, + 0.8 would mean the correlation is positive because the sign of r is + and the
magnitude of correlation is 0.8. Similarly, -0.26, means low degree of negative correlation.
The above formula for computing Pearson's coefficient of correlation can be transformed into the following form, which is easier to apply:
r = Σxy / √(Σx² × Σy²)
where x = (X - X̄) and y = (Y - Ȳ).
While applying this formula we do not have to calculate separately the standard deviations of the X and Y series, as the first formula requires.
This greatly simplifies the task of calculating the correlation coefficient.
Steps In Calculating Correlation Coefficient
1. Take the deviations of the X series from the mean of X and denote these deviations by x.
2. Square these deviations and obtain the total, i.e., Σx².
3. Take the deviations of the Y series from the mean of Y and denote these deviations by y.
4. Square these deviations and obtain the total, i.e., Σy².
5. Multiply the deviations of the X and Y series and obtain the total, i.e., Σxy.
6. Substitute the values of Σxy, Σx² and Σy² in the above formula.
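The six steps translate directly into a short sketch of the deviation-form formula:

```python
# Pearson's r via the deviation form: r = sum(xy) / sqrt(sum(x^2) * sum(y^2))
def pearson_r(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    x = [v - mx for v in X]            # deviations from the mean of X
    y = [v - my for v in Y]            # deviations from the mean of Y
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0, perfect positive correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0, perfect negative correlation
```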

Interpretation of coefficient of correlation


The coefficient of correlation measures the degree of relationship between two sets of figures.
As the reliability of estimates depends upon the closeness of the relationship it is imperative
that utmost care be taken while interpreting the value of coefficient of correlation, otherwise
fallacious conclusions can be drawn.
Unfortunately, the interpretation of the coefficient of correlation depends very much on
experience.
The full significance of ‘r’ can only be grasped after working out a number of correlation
problems and seeing the kinds of data that give rise to various values of ‘r’.
The investigator must know his data thoroughly in order to avoid errors of interpretation and
emphasis.
He must be familiar, or become familiar, with all the relationships and theory which bear upon
the data and should reach a conclusion based on logical reasoning and intelligent investigation
on significantly related matters.
However, the following general rules are given which would help in interpreting the value of
‘r’:
1. When r = +1, there is a perfect positive relationship between the variables.
2. When r = -1, there is a perfect negative relationship between the variables.
3. When r = 0, there is no relationship between the variables, i.e., the variables are uncorrelated.
4. The closer r is to +1 or -1, the closer the relationship between the variables, and the closer r is to 0, the less close the relationship. Beyond this it is not safe to go.
The full interpretation of r depends upon circumstances, one of which is the size of the sample. All that can really be said is that when estimating the value of one variable from the value of another, the higher the value of r, the better the estimates.
5. The closeness of the relationship is not proportional to r. If the value of r is 0.8, it does not indicate a relationship twice as close as one of 0.4; it is in fact very much closer.
SPEARMAN'S RANK CORRELATION COEFFICIENT

The coefficient of rank correlation is based on the ranks of the values of the variates and is denoted by R. It is applied to problems in which the data cannot be measured quantitatively but qualitative assessment is possible, such as beauty, honesty, etc. In this case the best individual is given rank number 1, the next rank 2, and so on. The coefficient of rank correlation is given by the formula:
R = 1 - 6ΣD² / (n(n² - 1))

where,
D² is the square of the difference of the corresponding ranks and
n is the number of pairs of observations.
• When the ranks are given, the differences of the ranks of X from the corresponding ranks of Y are calculated to obtain column D; by squaring these terms the D² column is obtained, and all these values are substituted in the given formula.
• When only the data is given and the ranks are not mentioned, then first the ranks are to
be assigned accordingly to both the series X and Y by giving rank 1 to the highest values
in both the series and so on.
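The ranking convention (rank 1 to the highest value) and the formula can be sketched together; this minimal sketch assumes no tied values, since ties require a correction term not covered here:

```python
# Spearman's rank correlation: R = 1 - 6*sum(D^2) / (n*(n^2 - 1))
def spearman_R(X, Y):
    def ranks(v):
        # rank 1 for the highest value, rank 2 for the next, and so on
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(X), ranks(Y)
    n = len(X)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_R([80, 91, 65, 72], [75, 88, 60, 70]))  # 1.0, identical rankings
```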
CALCULATION OF "t" TEST AND ITS INTERPRETATION

In biological experiments it becomes essential to compare the means of two samples to draw a conclusion. Visual inspection of the difference between two sample means usually fails to establish whether the difference is significant. Therefore, the degree or level of significance of the difference between two means has to be quantified to reach a definite conclusion.
To test the significance of the difference of the means of two samples, W.S. Gosset (1908) applied a statistical tool called the t test. The pen name of Gosset was 'Student', and hence this test is called Student's t test. The t ratio is the ratio of the difference between two means to the standard error of that difference. R.A. Fisher later developed Student's t test and explained it in various ways.
In Student's t test we make a choice between two alternatives:
1. To accept the null hypothesis (no difference between the two means).
2. To reject the null hypothesis, that is, the difference between the means of the two samples is statistically significant.

Determination of Significance:
The probability of occurrence of any calculated value of t is determined by comparing it with the value given in the t table corresponding to the combined degrees of freedom derived from the number of observations in the samples under study. If the calculated value of t exceeds the value given at p = 0.05 in the table (5% level), it is said to be significant.
If the calculated value is less than the value given in the table, it is not significant.

Degrees of Freedom
The quantity in the denominator, which is one less than the independent number of observations in a sample, is called the degrees of freedom.
In the paired t test df = N - 1; in the unpaired t test df = N1 + N2 - 2 (N1 and N2 are the numbers of observations in each of the two series).
Application of the t-distribution: The following are some examples of testing the significance of various results obtained from small samples.

1. To test the significance of the mean of a random sample:


In determining whether the mean of a sample drawn from a normal population deviates
significantly from a stated value (Hypothetical value of the populations mean), when variance
of the population is unknown, the formula is
t = (x̄ - μ)√n / s
where,
x̄ = the mean of the sample
μ = the actual or hypothetical mean of the population
n = the sample size
s = the standard deviation of the sample:
S = √( Σ(X - X̄)² / (n - 1) )

or, equivalently,

S = √( (Σd² - n d̄²) / (n - 1) )

where d = deviation from the assumed mean.
Interpretation of the results:

If the calculated value of t is more than the table value t0.05, the difference between X̄ and μ is significant at the 5% level of significance.

If the calculated value of t is less than the table value t0.05, the difference between X̄ and μ is not significant at the 5% level of significance; hence the sample might have been drawn from a population with mean μ.
Fiducial limits of the population mean: Assuming that the sample is a random sample from a normal population of unknown mean, the 95% fiducial limits of the population mean (μ) are:
X̄ ± t0.05 × S/√n
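The one-sample t statistic can be sketched as follows (the sample data are made up for illustration):

```python
# One-sample t test: t = (xbar - mu) * sqrt(n) / s, with s from the (n - 1) formula
def one_sample_t(sample, mu):
    n = len(sample)
    xbar = sum(sample) / n
    s = (sum((x - xbar) ** 2 for x in sample) / (n - 1)) ** 0.5
    t = (xbar - mu) / s * n ** 0.5
    return t, n - 1                     # t statistic and df = n - 1

# hypothetical sample tested against a stated population mean of 14
t, df = one_sample_t([14, 16, 15, 17, 13, 15], mu=14)
print(round(t, 3), df)
```

The calculated t is then compared with the table value at the chosen level for df degrees of freedom.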

2. Testing the difference between means of two samples (independent samples):


Given two independent random samples of sizes n1 and n2, with means X̄1 and X̄2 and standard deviations S1 and S2, the hypothesis is that the samples come from the same normal population. The formula to calculate t is:
t = (X̄1 - X̄2)/S × √( n1n2 / (n1 + n2) )
where,
X̄1 = mean of the first sample
X̄2 = mean of the second sample
n1 = number of observations in the first sample
n2 = number of observations in the second sample
S = combined (pooled) standard deviation.
The value of S is calculated by the following formula:
S = √( [Σ(X1 - X̄1)² + Σ(X2 - X̄2)²] / (n1 + n2 - 2) )
When we are given the number of observations and the standard deviations of the samples, the pooled estimate of the standard deviation can be obtained as follows:
S = √( [(n1 - 1)S1² + (n2 - 1)S2²] / (n1 + n2 - 2) )
Interpretation of the results:
If the calculated value of t is more than t0.05 (or t0.01), the difference between the sample means is said to be significant at the 5% (1%) level of significance; otherwise, if the calculated value of t is less than the table value, no significant difference exists between the sample means.
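The unpaired test with the pooled S can be sketched as follows (the two samples are made up for illustration):

```python
# Unpaired t test for two independent samples, using the pooled standard deviation
def two_sample_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    S = ((ss1 + ss2) / (n1 + n2 - 2)) ** 0.5          # pooled standard deviation
    t = (m1 - m2) / S * (n1 * n2 / (n1 + n2)) ** 0.5
    return t, n1 + n2 - 2                             # t and df = n1 + n2 - 2

t, df = two_sample_t([2, 4, 6], [1, 2, 3])
print(round(t, 3), df)
```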
3. Testing the difference between means of two samples (Dependent sample or matched
paired observations)
Two samples are said to be dependent when the elements in one sample are related to those in
the other in any significant or meaningful manner. E.g., to find out the effect of training on
some employees, find out the efficacy of a coaching class or determine whether there is a
significant difference in the efficacy of two drugs- one made within country and other imported.
The t-test based on paired observations is defined by the following formula:
t = d̄√n / S
where,
d̄ = the mean of the differences
S = the standard deviation of the differences.
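The paired formula can be sketched as follows; the before/after scores are made up to illustrate the training-effect example mentioned above:

```python
# Paired t test: t = dbar * sqrt(n) / S, with d the pairwise differences
def paired_t(before, after):
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    dbar = sum(d) / n
    S = (sum((x - dbar) ** 2 for x in d) / (n - 1)) ** 0.5
    return dbar * n ** 0.5 / S, n - 1          # t and df = n - 1

# hypothetical scores of the same four employees before and after training
t, df = paired_t(before=[10, 12, 14, 16], after=[12, 13, 17, 18])
print(round(t, 3), df)
```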

4. Testing the significance of an observed correlation coefficient:


Given a random sample from a bivariate normal population, to test the hypothesis that the population correlation coefficient is zero, i.e., that the variables in the population are uncorrelated, the formula is:
t = r√(n - 2) / √(1 - r²)
df = n - 2
If the calculated value of t is more than the t0.05, the value of ‘r’ is significant at 5%.
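The test of an observed r is a direct transcription of the formula:

```python
# Significance of an observed r: t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
def t_for_r(r, n):
    return r * (n - 2) ** 0.5 / (1 - r * r) ** 0.5

# e.g. r = 0.5 observed over n = 11 pairs (df = 9)
print(round(t_for_r(0.5, 11), 3))
```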
A t-test is an analysis of two population means through the use of statistical examination; a t-test with two samples is commonly used with small sample sizes, testing the difference between the samples when the variances of the two normal distributions are not known.
Significance Testing: The terms "significance level" or "level of significance" refer to the likelihood that the random sample you choose (for example, test scores) is not representative of the population. The lower the significance level, the more confident you can be in replicating your results. The significance levels most commonly used in educational research are the .05 and .01 levels.
E.g., .05 is another way of saying that 95 out of 100 times the sample represents the population; similarly, .01 suggests that 99 out of 100 times the sample represents the population. These numbers and signs come from significance testing, which begins with the null hypothesis.
Part I: The Null Hypothesis
The traditional way to test this question involves:
Step 1. Develop a research question.
Step 2. Find previous research to support, refute, or suggest ways of testing the question.
Step 3. Construct a hypothesis by revising research question:

Hypothesis | Summary | Type

H1: A = B | There is no relationship between A and B. | Null

H2: A ≠ B | There is a relationship between A and B, but we don't know whether it is positive or negative. | Alternative

H3: A < B | There is a negative relationship between A and B: the less A is involved, the better B. | Alternative

H4: A > B | There is a positive relationship between A and B: the more A is involved, the better B. | Alternative

Step4. Test the null hypothesis.


To test the null hypothesis, A = B, we use a significance test. The italicized lowercase p you often see, followed by a > or < sign and a decimal (e.g., p ≤ .05), indicates significance. In most cases the researcher tests the null hypothesis, A = B, because it is easier to show there is some sort of effect of A on B than to have to determine a positive or negative effect prior to conducting the research. This way, you leave yourself room without having the burden of proof on your study from the beginning.
Step5. Analyze data and draw a conclusion.
Testing the null hypothesis leaves two possibilities:

Outcome | Wording | Type

A = B | Fail to reject the null. We find no relationship between A and B. | Null

A <, >, or ≠ B | Reject the null. We find a relationship between A and B. | Alternative

Step 6. Communicate results. See Wording results, below.


Part II: Conducting a t-test (for Independent Means)
So how do we test a null hypothesis? One way is with a t-test. A t-test asks the question: "Is the difference between the means of two samples different (significant) enough to say that some other characteristic (teaching method, teacher, gender, etc.) could have caused it?"
To conduct a t-test using an online calculator, complete the following steps:
Step 1. Compose the Research Question.
Step 2. Compose a Null and an Alternative Hypothesis.
Step 3. Obtain two random samples of at least 30, preferably 50, from each group.
Step 4. Conduct a t-test:
Step 5. Interpret the results (see below).
Step 6. Report results in text or table format (see below).
• Get p from "P value and statistical significance:" Note that this is the actual value.
• Get the confidence interval from "Confidence interval:"
• Get the t and df values from "Intermediate values used in calculations:"
• Get Mean, and SD from "Review your data."
Part III. Interpreting a t-test (Understanding the Numbers):
t | Tells you a t test was used.
(98) | Tells you the degrees of freedom (the total sample size minus the number of groups).
3.09 | Is the t statistic, the result of the calculation.
p ≤ .05 | Is the probability of getting the observed score from the sample groups. This is the most important part of this output to you.
If this sign appears | It means all of these things
p ≥ .05 | Likely to be a result of chance (the same as saying A = B)
The difference is not significant
The null is retained (we fail to reject the null)
There is no relationship between A and B
If this sign appears | It means all of these things
p ≤ .05 | Not likely to be a result of chance (the same as saying A ≠ B)
The difference is significant
The null is incorrect (we reject the null)
There is a relationship between A and B
Note: We acknowledge that the average scores are different. With a t-test we are deciding if
that difference is significant (is it due to sampling error or something else).
Understanding the Confidence Interval (CI)
The Confidence Interval (CI) of a mean is a region within which a score (like mean test score)
may be said to fall with a certain amount of "confidence." The CI uses sample size and standard
deviation to generate a lower and upper number that you can be 95% sure will include any
sample you take from a set of data.
Consider Georgia's AYP measure, the CRCT. For a science CRCT score, we take several
samples and compare the different means. After a few calculations, we could determine
something like the average difference (mean) between samples is -7.5, with a 95% CI of -22.08
to 6.72. In other words, among all students' science CRCT scores, 95 out of 100 times we take
group samples for comparison (for example by year, or gender etc...) one of the groups, on
average will be 7.5 points lower than the other group. We can be fairly certain that the
difference in scores will be between -22.08 and 6.72 points.
Part IV. Wording Results
Wording Results in Text
In text, the basic format is to report: population (N), mean (M) and standard deviation (SD) for both samples, t value, degrees of freedom (df), significance (p) and the 95% confidence interval (CI95).
Example 1: p ≤ .05, or significant results
Among 7th graders in Lowndes County schools taking the CRCT reading exam (N = 336), there was a statistically significant difference between the two teaching teams, team 1 (M = 818.92, SD = 16.11) and team 2 (M = 828.28, SD = 14.09), t(98) = 3.09, p < .05, CI95 = -15.37, -3.35. Therefore, we reject the null hypothesis that there is no difference in reading scores between teaching teams 1 and 2.
Example 2: p > .05, or non-significant results
Among 7th graders in Lowndes County schools taking the CRCT reading exam (N = 336), there was no statistically significant difference between female students (M = 834.00, SD = 32.81) and male students (M = 841.08, SD = 28.76), t(98) = 1.15, p > .05, CI95 = -19.32, 5.16. Therefore, we fail to reject the null hypothesis that there is no difference in reading scores between females and males.
CALCULATION OF CHI SQUARE TEST AND IT'S INTERPRETATION

The χ² test is one of the simplest and most widely used nonparametric tests in statistical work. χ² was first used by Karl Pearson in the year 1900.
The quantity χ² describes the magnitude of the discrepancy between theory and observation.
It is defined as χ² = Σ(O - E)²/E
where,
O = observed frequencies
E = expected frequencies
STEPS: To determine the value of χ², the steps required are:
a) Calculate the expected frequencies. In general, the expected frequency for any cell can be calculated from the following equation:
E = (RT × CT)/N
E = expected frequency
RT = the row total for the row containing the cell
CT = the column total for the column containing the cell
N = the total number of observations.
b) Take the difference between the observed and expected frequencies and obtain the squares of these differences, i.e., obtain the values of (O - E)².
c) Divide the values of (O - E)² obtained in step (b) by the respective expected frequencies and obtain the total, Σ(O - E)²/E. This gives the value of χ², which can range from zero to infinity. If χ² is zero, the observed and expected frequencies completely coincide. The greater the discrepancy between the observed and expected frequencies, the greater the value of χ².
The calculated value of χ² is compared with the table value of χ² for the given degrees of freedom at a certain specified level of significance.
If, at the stated level (generally the 5% level is selected), the calculated value of χ² is more than the table value of χ², the difference between theory and observation is considered significant.
If, on the other hand, the calculated value of χ² is less than the table value, the difference between theory and observation is not considered significant.
The computed value of χ² is a random variable which takes on different values from sample to sample; that is, χ² has a sampling distribution.
It should be noted that the value of χ² is always positive and its upper limit is infinity. Also, since χ² is derived from observations, it is a statistic and not a parameter.
The chi square (χ²) test is, therefore, termed non-parametric.
Degrees of Freedom
While comparing the calculated value of χ² with the table value we must determine the degrees of freedom. By degrees of freedom we mean the number of classes to which values can be assigned arbitrarily, or at will, without violating the restrictions or limitations placed.
The number of degrees of freedom is obtained by subtracting from the number of classes the number of degrees of freedom lost in fitting.
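The steps above (expected frequencies from row and column totals, then the χ² sum) can be sketched for a contingency table; the 2×2 observed frequencies are made up for illustration:

```python
# Chi square for a contingency table: E = RT * CT / N, chi2 = sum((O - E)^2 / E),
# with df = (rows - 1) * (columns - 1)
def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, O in enumerate(row):
            E = row_totals[i] * col_totals[j] / N   # expected frequency for this cell
            chi2 += (O - E) ** 2 / E
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# hypothetical 2x2 table of observed frequencies
chi2, df = chi_square([[20, 10], [10, 20]])
print(round(chi2, 3), df)
```

The resulting chi2 is then compared with the table value for df degrees of freedom at the chosen level of significance.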
Interpretation
The chi square test is one of the most popular statistical inference procedures today.
It is applicable to very large number of problems in practice which can be summed up under
the following heads:
a) χ² test as a test of independence:
With the help of the chi square test, we can find out whether two or more attributes are associated or not.
Suppose we have N observations classified according to some attributes. We may ask whether the attributes are related or independent.
In order to test whether or not the attributes are associated, we take the null hypothesis that there is no association between the attributes under study or, in other words, that the two attributes are independent.
If the calculated value of χ² is less than the table value at a certain level of significance (generally the 5% level), we say that the results of the experiment provide no evidence for doubting the hypothesis or, in other words, the hypothesis that the attributes are not associated holds good.
On the other hand, if the calculated value of χ² is greater than the table value at a certain level of significance, we say that the results of the experiment do not support the hypothesis or, in other words, the attributes are associated.
It should be noted that χ² is not a measure of the degree or form of relationship. It only tells us whether two principles of classification are or are not significantly related, without reference to any assumptions concerning the form of the relationship.
b) χ² test as a test of goodness of fit:
The χ² test is popularly known as a test of goodness of fit because it enables us to ascertain how closely theoretical distributions such as the Binomial, Poisson, Normal, etc., fit empirical distributions, i.e., those obtained from sample data.
When an ideal frequency curve, whether normal or of some other type, is fitted to the data, we are interested in finding out how well this curve fits the observed facts.
A test of the goodness of fit of the two can be made just by inspection, but such a test is obviously inadequate. Precision can be secured by applying the χ² test.
The following are the steps in testing the goodness of fit:
1. A null and alternative hypothesis are established, and a significance level is selected for
rejection of the null hypothesis.
2. A random sample of observations is drawn from a relevant statistical population.
3. A set of expected or theoretical frequencies is derived under the assumption that the null
hypothesis is true. This generally takes the form of assuming that a particular probability
distribution is applicable to the statistical population under consideration.
4. The observed frequencies are compared with the expected, or theoretical frequencies.
5. If the calculated value of χ² is less than the table value at a certain level of significance
(generally the 5% level) for the given degrees of freedom, the fit is considered to be good,
i.e., the divergence between the actual and expected frequencies is attributed to fluctuations
of simple sampling. On the other hand, if the calculated value of χ² is greater than the table
value, the fit is poor, i.e., the divergence cannot be attributed to fluctuations of simple
sampling; rather it is due to the inadequacy of the theory to fit the observed facts.
6. It should be borne in mind that in repeated sampling too good a fit is just as likely as too
bad a fit. When the computed chi-square is too close to zero, we should suspect the
possibility that the two sets of frequencies have been manipulated to force them to agree,
and the design of our experiment should therefore be thoroughly checked.
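The steps above can be sketched in pure Python. The die-roll counts below are invented for illustration; the critical value 11.07 is the tabulated χ² value for 5 degrees of freedom at the 5% level.

```python
# Chi-square goodness-of-fit sketch: do 120 rolls of a die fit a
# uniform (fair-die) distribution? The observed counts are invented.
observed = [22, 17, 20, 26, 22, 13]          # faces 1..6
expected = [sum(observed) / 6] * 6           # fair die: 20 per face

# chi2 = sum over all categories of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                       # k - 1 = 5
CRITICAL_5PCT = 11.07                        # chi-square table, df = 5, 5% level

print(round(chi2, 2))
if chi2 < CRITICAL_5PCT:
    print("Good fit: divergence attributed to fluctuations of simple sampling")
else:
    print("Poor fit: the theory does not fit the observed facts")
```

Here the computed χ² (about 5.1) falls below the table value, so the fair-die hypothesis is not rejected.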
c) χ² test as a test of homogeneity:
The chi-square test of homogeneity is an extension of the chi-square test of independence. Tests
of homogeneity are designed to determine whether two or more independent random samples
are drawn from the same population or from different populations.
Instead of one sample, as in the independence problem, we now have two or more samples.
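Since the computation mirrors the independence test, a minimal pure-Python sketch with invented counts for two samples (say, respondents from two cities classified by preference) looks like this; the critical value 3.841 is the tabulated χ² for 1 degree of freedom at the 5% level:

```python
# Chi-square test of homogeneity sketch; the counts are invented.
samples = [
    [30, 20],   # city A: prefer / do not prefer
    [20, 30],   # city B
]

row_totals = [sum(row) for row in samples]
col_totals = [sum(col) for col in zip(*samples)]
grand_total = sum(row_totals)

# Expected frequency of each cell = (row total * column total) / grand total
chi2 = 0.0
for i, row in enumerate(samples):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - exp) ** 2 / exp

df = (len(samples) - 1) * (len(samples[0]) - 1)   # (r-1)(c-1) = 1
CRITICAL_5PCT = 3.841                              # chi-square table, df = 1, 5% level
print(round(chi2, 2), chi2 > CRITICAL_5PCT)
```

For these invented counts χ² = 4.0 exceeds the table value, so the two samples would be judged to come from different populations.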
CALCULATION OF ANOVA AND INTERPRETATION

The analysis of variance, frequently referred to as ANOVA, is a statistical tool and technique
specially designed to test whether the means of more than two quantitative populations are
equal. This technique, developed by R.A. Fisher in the 1920s, is capable of fruitful
application to a diversity of practical problems. Basically, it consists of classifying and cross-
classifying statistical results and testing whether the means of a specified classification differ
significantly. The word treatment in the analysis of variance refers to any factor in the
experiment that is controlled at different levels or values. ANOVA finds application in nearly
every type of experimental design, in the natural sciences as well as the social sciences.
It should be kept in mind that the analysis of variance test is not intended to serve
the ultimate purpose of testing for the significance of the difference between two sample
variances; rather its purpose is to test for the significance of the differences among sample
means.
Assumptions in calculating ANOVA:
1. Normality
2. Homogeneity
3. Independence of error.
It may be noted that whenever any of these assumptions is not met, the ANOVA technique
cannot be employed to yield valid inferences.
Technique of ANOVA:
The technique of analysis of variance has been discussed under
a) one-way classification
b) two-way classification
a) one-way classification:
In one-way analysis the data are classified according to one criterion. The hypotheses are:
H0: μ1 = μ2 = μ3 = ... = μk, i.e., the arithmetic means of the populations from which the k
samples were randomly drawn are equal to one another.
H1: not all the means are equal.
The steps in carrying out the analysis are:
1. Calculate the variance between the samples:
The variance between samples (groups) measures the differences between the sample
mean of each group and the overall mean, weighted by the number of observations in each
group. The variance between samples takes into account the random variation from
observation to observation. It also measures the difference from one group to another. The
sum of squares between samples is denoted by SSC.
To calculate the variance between the samples, take the total of the squares of the deviations
of the means of the various samples from the grand average and divide this total by the degrees
of freedom.
The steps in calculating variance between samples will be:
a. Calculate the mean of each sample: X̄1, X̄2, etc.
b. Calculate the grand average X̄; its value is obtained as follows:
X̄ = (N1X̄1 + N2X̄2 + N3X̄3 + ...) / (N1 + N2 + N3 + ...)
c. Take the difference between the means of the various samples and the grand average.
d. Square these deviations and obtain the total, which gives the sum of squares between the
samples, and
e. Divide the total obtained in step (d) by the degrees of freedom. The degrees of freedom will
be one less than the number of samples, i.e., if there are 4 samples the degrees of freedom
will be 4 - 1 = 3, or ν = k - 1, where k = number of samples.
2. Calculate the variance within the samples:
The variance (or sum of squares) within the samples measures the differences within each
sample that are due to chance only. It is denoted by SSE. The variance within samples (groups)
measures variability around the mean of each group. Since this variability is not affected by
group differences, it can be considered a measure of the random variation of values within a
group. To calculate the variance within the samples, take the deviations of the individual items
from the means of the respective samples, square and total them, and divide this total by the
degrees of freedom. Thus, the steps in calculating the variance within the samples will be:
a. Calculate the mean of each sample: X̄1, X̄2, etc.
b. Take the deviations of the various items in a sample from the mean values of the respective
samples.
c. Square these deviations and obtain the total, which gives the sum of squares within the
samples, and
d. Divide the total obtained in step (c) by the degrees of freedom. The degrees of freedom are
obtained by deducting the number of samples from the total number of items, i.e.,
ν = N - k, where k refers to the number of samples and N refers to the total number of
observations.
3. Calculate the F ratio as follows:
F = Between-column variance / Within-column variance
Symbolically,
F = S1² / S2²
The F distribution (named after the famous statistician R.A. Fisher) measures the ratio of the
variance between groups to the variance within groups. The variance between the sample
means is the numerator and the variance within the samples is the denominator. If there
is no real difference from group to group, any sample difference will be explainable by random
variation and the variance between groups should be close to the variance within groups.
However, if there is a real difference between the groups, the variance between groups will be
significantly larger than the variance within groups.
4. Compare the calculated value of F with the table value of F for the given degrees of
freedom at a certain level of significance
(generally the 5% level is taken). If the calculated value of F is greater than the
table value, it is concluded that the difference in sample means is significant, i.e., it could not
have arisen due to fluctuations of simple sampling or, in other words, the samples do not come
from the same population. On the other hand, if the calculated value of F is less than the table
value, the difference is not significant and could have arisen due to fluctuations of simple sampling.
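Steps 1 to 4 can be sketched together in pure Python. The three samples are invented for illustration; the critical value 5.14 is the tabulated F(2, 6) at the 5% level.

```python
# One-way ANOVA sketch following steps 1-4 above; the data are invented.
samples = [
    [10, 12, 14],
    [14, 16, 18],
    [18, 20, 22],
]
k = len(samples)                          # number of samples
n = sum(len(s) for s in samples)          # total number of observations
means = [sum(s) / len(s) for s in samples]
grand_mean = sum(sum(s) for s in samples) / n

# Step 1: variance between samples (SSC, df = k - 1)
ssc = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
msc = ssc / (k - 1)

# Step 2: variance within samples (SSE, df = n - k)
sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
mse = sse / (n - k)

# Step 3: F ratio = between-column variance / within-column variance
f_ratio = msc / mse

# Step 4: compare with the tabulated F(2, 6) at the 5% level
F_TABLE_5PCT = 5.14
print(round(f_ratio, 2))
print("significant" if f_ratio > F_TABLE_5PCT else "not significant")
```

For these invented samples SSC = 96, SSE = 24, so F = 48/4 = 12, which exceeds the table value: the difference in sample means would be judged significant.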
It is customary to summarise the calculations for the sums of squares, together with the
corresponding numbers of degrees of freedom and mean squares, in a table called the analysis
of variance table, generally abbreviated ANOVA. A specimen ANOVA table is given below:
Analysis of variance (ANOVA) table: one-way classification model

Source of variation    SS (sum of squares)    df            MS (mean square)       Variance ratio F
Between the samples    SSC                    ν1 = c - 1    MSC = SSC/(c - 1)      MSC/MSE
Within the samples     SSE                    ν2 = n - c    MSE = SSE/(n - c)
Total                  SST                    n - 1
SST = total sum of squares of variations
SSC = sum of squares between the samples (columns)
SSE = sum of squares within the samples
MSC = mean sum of squares between samples
MSE = mean sum of squares within samples
Interpretation of the ANOVA test: if the calculated value of F is less than the table value,
the difference in the mean values of the samples is not significant, i.e., the samples
could have come from the same universe.
Analysis of variance in two-way classification model:
When two independent factors have an effect on the response variable of interest, it is possible
to design the test so that an analysis of variance can be used to test for the effects of the two
factors simultaneously. Such a test is called a two-factor analysis of variance. Two sets of
hypotheses can be tested with the same data at the same time.
In a two-way classification the data are classified according to two different criteria or factors.
The procedure for the analysis of variance differs from that of the one-way classification.

Source of variation    SS (sum of squares)    df                MS (mean square)              Variance ratio F
Between columns        SSC                    c - 1             MSC = SSC/(c - 1)             MSC/MSE
Between rows           SSR                    r - 1             MSR = SSR/(r - 1)             MSR/MSE
Residual or error      SSE                    (c - 1)(r - 1)    MSE = SSE/((c - 1)(r - 1))
Total                  SST                    n - 1
SST = total sum of squares of variations
SSC = sum of squares between columns
SSR = sum of squares between rows
SSE = sum of squares due to error (residual)
The sum of squares for the source 'residual' is obtained by subtracting the sum of squares
between columns and the sum of squares between rows from the total sum of squares, i.e.,
SSE = SST - (SSC + SSR)
The total number of df = n - 1 = cr - 1,
where c refers to the number of columns and r refers to the number of rows.
Number of df between columns = (c - 1)
Number of df between rows = (r - 1)
Number of df for residual = (c - 1)(r - 1)
F values are calculated as follows:
For columns: F(ν1, ν2) = MSC/MSE, where ν1 = (c - 1) and ν2 = (c - 1)(r - 1)
For rows: F(ν1, ν2) = MSR/MSE, where ν1 = (r - 1) and ν2 = (c - 1)(r - 1)
It should be carefully noted that ν1 is not the same in both cases: in one case it is (c - 1) and
in the other it is (r - 1).
Interpretation of the data:
If the calculated value of F is greater than the table value, the null hypothesis is rejected.
If the calculated value of F is less than the table value, the null hypothesis is accepted.
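The two-way layout above can be sketched in pure Python for the case of one observation per cell. The 3x3 data grid is invented for illustration (rows might be, say, varieties and columns treatments); the F values would then be compared with the tabulated F(2, 4) at the chosen level.

```python
# Two-way ANOVA sketch (one observation per cell); the data are invented.
data = [
    [10, 14, 18],
    [12, 16, 20],
    [17, 18, 19],
]
r, c = len(data), len(data[0])
n = r * c
grand_mean = sum(sum(row) for row in data) / n

row_means = [sum(row) / c for row in data]
col_means = [sum(col) / r for col in zip(*data)]

sst = sum((x - grand_mean) ** 2 for row in data for x in row)
ssr = c * sum((m - grand_mean) ** 2 for m in row_means)   # between rows
ssc = r * sum((m - grand_mean) ** 2 for m in col_means)   # between columns
sse = sst - (ssc + ssr)                                   # residual by subtraction

msc = ssc / (c - 1)
msr = ssr / (r - 1)
mse = sse / ((c - 1) * (r - 1))

f_columns = msc / mse    # F with df (c-1) and (c-1)(r-1)
f_rows = msr / mse       # F with df (r-1) and (c-1)(r-1)
print(round(f_columns, 2), round(f_rows, 2))
```

For this invented grid SST = 90, SSC = 54, SSR = 24, SSE = 12, giving F = 9 for columns and F = 4 for rows, with ν1 differing between the two tests exactly as noted above.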