Unit 1: Types of Data
Structure
1.0 Introduction
1.1 Objectives
1.2 Types of Data: Quantitative and Qualitative
1.3 Quantitative Data
1.3.1 Types of Quantitative Data
1.3.2 Tabulation and Organisation of Quantitative Data
A. Frequency Distribution
B. Cumulative Frequency Distribution
1.3.3 Graphical Presentation of Quantitative Data
1.3.4 Analysis of Quantitative Data
A. Measures of Central Tendency
B. Measures of Variability
C. Measures of Relative Position
D. Measures of Relationship
1.4 Qualitative Data
1.4.1 Organisation of Qualitative Data
1.4.2 Analysis of Qualitative Data
A. Content Analysis
B. Inductive Analysis
C. Logical Analysis
1.5 Let Us Sum Up
1.6 Glossary
1.7 Check Your Progress: The Key
1.0 INTRODUCTION
In Block 3, we dealt with the nature of various tools used in the collection of data. These
data are mostly expressed in quantified terms. However, quantitative data may not be
available in certain cases. In such a situation, the researcher has to consider the
phenomenon as a whole without breaking it down into measurable or quantifiable
components. Indeed, he/she should be familiar not only with the two types of data –
quantitative and qualitative, but also with the process of classifying data, graphical
representation and the various methods of data analysis.
The aim of this Unit is to make you understand the nature of quantitative and qualitative
data, the procedures for classifying and tabulating quantitative data, presenting them
graphically, and the various methods used in data analysis.
1.1 OBJECTIVES
The data collected through the administration of various types of tools on the selected
samples are of two types - qualitative and quantitative. In quantitative data,
numerical values are assigned to the characteristics or properties of objects or events,
according to logically accepted rules. It is a process wherein a number system like figures,
ratings or scores is imposed on empirical data. However, when the researcher takes into
consideration the phenomenon as a whole and does not attempt to analyse it in
measurable or quantifiable terms, the approach becomes ‘qualitative’. Generally, in
educational and behavioural sciences, both types of data, (i) quantitative and
(ii) qualitative, are recognised. We will look at the characteristics of both in the following
sections. Section 1.3 deals with quantitative data, their types, tabulation, frequency
distribution and cumulative frequency distributions, the need to represent data graphically,
the various types of graphs, and the methods of analysing quantitative data, viz., measures
of central tendency, variability, relative positions and relationship. In section 1.4, we shall
briefly discuss qualitative data and their analysis with reference to content analysis, logical
analysis and inductive analysis. The application of various parametric and non-parametric
tests is discussed in more detail in Unit 2 of this Block.
1.3 QUANTITATIVE DATA
Quantitative data indicate how much or how little of a given characteristic or attribute is present. For example, the difference in the amount of an attribute possessed by two individuals can be expressed as a difference in their test scores.
1.3.1 Types of Quantitative Data
Quantitative data are classified into two categories: (i) parametric and (ii) non-parametric.
Parametric data are obtained by applying interval or ratio scales of measurement. Scores
of the tests of ability, achievement, attitude, interest, values, personality etc. are examples
of interval scales of measurement. In the study of reaction time we use ratio scale. In this
type of experiment, the zero point in the absolute sense is known and it makes sense to
look at the ratio of the time taken to respond in different treatment situations.
Non-parametric data are either counted or ranked. In counted data, we make use of the
nominal scale. Each individual can be a member of only one category and all the members
of that category have the same, defined characteristics. For example, we may categorise
a group as a sample of ‘female students’ of a particular ‘study centre’ of an open
university. The categorisation of teachers at different educational levels—school, college
and university—is another example of nominal data.
In ranked data, we apply the ordinal scale of measurement. The sets or classes of objects
are ordered on a continuum in a series ranging from the lowest to the highest according to
the characteristics we wish to measure. The ranking of students in a class for height,
weight or academic achievement are examples of ordinal data.
1.3.2 Tabulation and Organisation of Quantitative Data
A. Frequency Distribution
Data collected from a test and by using other gathering/measuring tools are raw and may
have little meaning to the researcher until they are tabulated and organised in a systematic
order. One of the ways of doing so is to prepare a frequency distribution. The method
for tabulating the quantified data in a frequency distribution can be illustrated by
considering the following scores of 40 students of MA (DE) of the Indira Gandhi National
Open University in course MDE-412.
Table 1.1: Scores of 40 Students of MA (DE) in Course MDE-412
57 70 80 82 87
60 72 80 82 88
64 73 80 82 87
67 70 78 80 93
67 76 77 84 95
62 76 78 85 97
61 75 80 85 98
63 70 78 85 90
It is difficult to see from the above table how the scores are distributed. Inspection of
scores, however, shows that many scores occur more than once.
You will observe that there are one 98, one 97, one 95, one 88, two 87s, three 85s, and so
on. For our convenience, you can arrange the data in columns as shown in Table 1.2. In
one column, you can arrange the marks in class-intervals and in the other, you can record
the number of students who have scored these marks by tallies. Inspection of the scores
in Table 1.1 shows that the highest score is 98 and the lowest is 57. The range is 41 (i.e.
98-57). Therefore, the distribution of scores can be conveniently arranged by dividing the
range of 41 into eight or more class-intervals if the classes are taken to be of 5 points
each. If you take the starting point at 56, the scores within the range 56 to 60, that is, all scores with values from 56 to 60, will be grouped together to form the lowest class-
interval. All scores from 61 to 65, that is, 61, 62, 63, 64 and 65 will form the next class-
interval. Similarly you shall group all scores within the ranges 66 to 70, 71 to 75 and so on.
The highest class interval will be 96-100.
Table 1.2: Frequency Distribution of the Scores Given in Table 1.1
Class-Interval    Tallies          Frequency (f)
96 - 100          II                     2
91 - 95           II                     2
86 - 90           IIII                   4
81 - 85           IIII II                7
76 - 80           IIII IIII I           11
71 - 75           III                    3
66 - 70           IIII                   5
61 - 65           IIII                   4
56 - 60           II                     2
Total number of scores N = 40
In Table 1.2, the class-intervals have been arranged serially from the smallest at the
bottom to the largest at the top, each class-interval covering 5 scores. For each score, we
have marked a ‘tally’ against the corresponding class-interval. The first score, 57, is
represented by a tally placed against the class interval 56-60, the second score of 60 by a
‘tally’ marked against the class interval 56-60, and the third score 64 by a tally against the
class interval 61-65. The remaining scores have been tabulated in the same way. When all
the 40 scores are listed, the total number of tallies in each class-interval are counted and
written in the next column f. The total of ‘f’ gives the total number of scores (in the
present case 40) and is denoted by N.
It may be noted that the interval 56-60 takes care of all the scores from 56 up to 60. The score of 56 ordinarily means the interval 55.5 to 56.5 and the score of 60 means the interval 59.5 to 60.5. The mid-point of the bottom-most class-interval is therefore 58. Hence, the distribution represented in Table 1.2 may also be expressed in terms of exact class limits (55.5 - 60.5, 60.5 - 65.5, ..., 95.5 - 100.5) or in terms of the mid-points of the class-intervals (58, 63, ..., 98).
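As a computational cross-check of the tallying described above, the following minimal sketch (Python is used here purely for illustration and is not prescribed by the unit) tabulates the 40 scores of Table 1.1 into the same 5-point class-intervals as Table 1.2.

```python
# Illustrative sketch: tabulating the 40 scores of Table 1.1 into
# 5-point class-intervals (56-60, 61-65, ..., 96-100), as in Table 1.2.
scores = [57, 70, 80, 82, 87, 60, 72, 80, 82, 88, 64, 73, 80, 82, 87,
          67, 70, 78, 80, 93, 67, 76, 77, 84, 95, 62, 76, 78, 85, 97,
          61, 75, 80, 85, 98, 63, 70, 78, 85, 90]

frequencies = {}
for lower in range(56, 101, 5):            # class-intervals of width 5
    upper = lower + 4
    f = sum(1 for s in scores if lower <= s <= upper)
    frequencies[(lower, upper)] = f

# Print from the highest class-interval down, with crude tally marks.
for (lower, upper), f in sorted(frequencies.items(), reverse=True):
    print(f"{lower:3d}-{upper:3d}  {'I' * f:12s} {f:2d}")
print("N =", sum(frequencies.values()))    # should equal 40
```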
B. Cumulative Frequency Distribution
In some cases, you may not be concerned with the frequencies within the class-intervals, but rather with the number or the percentage of values greater than or less than a given value. For this purpose, a cumulative frequency distribution is prepared by adding the frequencies of the class-intervals successively, starting from the bottom of the distribution (for the data of Table 1.2, N = 40).
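The cumulative frequencies and cumulative percentages can be built up mechanically from the ordinary frequency distribution. The sketch below is a minimal illustration, assuming the class-intervals and frequencies of Table 1.2 listed from the lowest interval upwards.

```python
# Illustrative sketch: cumulative frequencies and cumulative percentages
# for the distribution of Table 1.2 (lowest class-interval first).
intervals   = ["56-60", "61-65", "66-70", "71-75", "76-80",
               "81-85", "86-90", "91-95", "96-100"]
frequencies = [2, 4, 5, 3, 11, 7, 4, 2, 2]
N = sum(frequencies)                      # 40

cumulative = 0
for interval, f in zip(intervals, frequencies):
    cumulative += f                       # add frequencies from the bottom up
    percent = 100 * cumulative / N
    print(f"{interval:7s} f={f:2d}  cum f={cumulative:2d}  cum %={percent:5.1f}")
```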
1.3.3 Graphical Presentation of Quantitative Data
Graphical presentation often facilitates understanding of a set of data. With the help of a
well-drawn graph, the data can be read and interpreted very easily. Brief descriptions of
the various types of graph which are useful in visualizing the important properties of a
frequency distribution are given below.
The following three types of graph are commonly used for the above mentioned purposes:
i) Histogram or column diagram
ii) Frequency polygon
iii) Cumulative percentage curve or ogive.
i) Histogram or Column Diagram
Step 1: A horizontal line is drawn at the bottom of a graph paper. Units representing
class-intervals are marked along this line.
Step 2: A vertical line is drawn at the left hand extreme of the horizontal axis. Along
this vertical axis, units representing individual frequencies of the class-intervals
are marked.
Step 3: Taking class units as bases, rectangles are drawn, such that the areas of
rectangles are proportional to the frequencies of the corresponding classes.
Let us consider the following data for drawing a histogram as an illustration of what you
have read above.
Table 1.5
Class-Interval    Frequency (f)
34.5 - 39.5             4
29.5 - 34.5             8
24.5 - 29.5            11
19.5 - 24.5             8
14.5 - 19.5             6
9.5 - 14.5              3
N = 40
[Fig. 1: Histogram plotted from the data of Table 1.5]
ii) Frequency Polygon
The horizontal and vertical axes are drawn and marked as for the histogram (Steps 1 and 2 above).
Step 3: Directly above the mid-point of each class-interval along the horizontal axis, plot the points at a height proportional to the respective frequencies. Join these points by straight lines. The frequency polygon for the distribution of Table 1.5 is shown in Figure 2.
[Fig. 2: Frequency polygon plotted from the data of Table 1.5]
iii) Cumulative Percentage Curve or Ogive
[Fig. 3: Cumulative percentage curve or ogive plotted from the data of Table 1.5]
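For readers who wish to reproduce the three graphs, the following sketch shows one possible way of doing so. It assumes the matplotlib plotting library (an assumption, not something prescribed by the unit) and uses the class boundaries and frequencies of Table 1.5.

```python
# Illustrative sketch: histogram, frequency polygon and ogive
# for the data of Table 1.5 (class boundaries 9.5-39.5, N = 40).
import matplotlib.pyplot as plt

boundaries  = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]   # exact class limits
frequencies = [3, 6, 8, 11, 8, 4]
midpoints   = [(a + b) / 2 for a, b in zip(boundaries[:-1], boundaries[1:])]
N = sum(frequencies)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: rectangles over the class-intervals, heights = frequencies
axes[0].bar(midpoints, frequencies, width=5, edgecolor="black")
axes[0].set_title("Histogram")

# Frequency polygon: points above the mid-points joined by straight lines
axes[1].plot(midpoints, frequencies, marker="o")
axes[1].set_title("Frequency polygon")

# Ogive: cumulative percentages plotted against the upper class boundaries
cum = [sum(frequencies[:k + 1]) * 100 / N for k in range(len(frequencies))]
axes[2].plot(boundaries[1:], cum, marker="o")
axes[2].set_title("Cumulative percentage curve (ogive)")

plt.tight_layout()
plt.show()
```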
1.3.4 Analysis of Quantitative Data
Analysis of quantified data means studying the organised or tabulated data in order to discover the inherent facts. The data are studied from as many angles as possible to explore new facts. Two types of statistical methods are used in the analysis of tabulated data measured/expressed in quantified terms. The first category of methods pertains to 'descriptive analysis' and the second to 'inferential analysis' of data.
In this Unit, you will be concerned with 'descriptive analysis'. Analysis of quantitative data can also be done by using computer software such as SPSS, SAS, Stata and XLSTAT. If you are interested in knowing more about these packages, you can check their websites; using them, however, requires a thorough understanding of the program and of the computer.
The following methods are generally used in descriptive statistical analysis of the tabulated
data:
i) Measures of central tendency
ii) Measures of variability
iii) Measures of relative positions
iv) Measures of relationships.
A. Measures of Central Tendency
The three most commonly used measures of central tendency are the Mean, the Median and the Mode.
I) The Mean
The formula for finding the mean for ungrouped data is:
M = ∑X / N        ...................................................................(1)
in which
M = Mean
∑ = Sum of
X = Observations in a distribution
N = Total number of observations.
For example, the mean of the scores 16, 14, 12, 18, 21, 22, 13, 15, 16, 18 is:
M = ∑X / N = (16 + 14 + 12 + 18 + 21 + 22 + 13 + 15 + 16 + 18) / 10 = 165/10 = 16.5
When the number of observations or measures is large, the data is grouped in a frequency
distribution.
The formula for computing the mean from grouped data is:
M = AM + (∑fx′ / N) × i        ..........................................(2)
in which
M = Mean
AM = Assumed Mean
x′ = (mid-point of the class-interval - AM) / (width of the class-interval)
∑fx′ = Sum of the products of the frequencies and the deviations (x′) of the class mid-points from the assumed mean
i = Width of the class-interval
N = Total number of observations
To illustrate the use of formula (2), consider the grouped data given in table 1.5.
Computations
Step 1: Put the class-intervals in exact limits
Step 2: Find the mid-point of each class interval and take the assumed mean at the
interval which has the maximum frequency.
Step 3: Find the difference between each mid-point score and the assumed mean and divide it by the width of the class-interval to get the deviation x′.
Step 4: Compute fx′ for each class-interval (fx′ is the product of the frequency and the deviation x′ for that class-interval).
Table 1.6: The Calculation of the Mean from Data Grouped into a Frequency Distribution (Ref. Table 1.5)

Class-Interval (Exact Limits)    Mid-point (x)    f    x′     fx′
34.5 - 39.5                           37           4    2*       8
29.5 - 34.5                           32           8    1        8
24.5 - 29.5                        27 (AM)        11    0        0    (+16)
19.5 - 24.5                           22           8   -1       -8
14.5 - 19.5                           17           6   -2      -12
9.5 - 14.5                            12           3   -3       -9    (-29)
N = 40                                                       ∑fx′ = -13

* For example, x′ for the top interval = (37 - 27)/5 = 2

Using formula (2):
M = AM + (∑fx′/N) × i
  = 27.0 + (-13/40) × 5
  = 27.0 - 1.625
  = 25.375 (approximately 25.38)
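The worked example can be verified computationally. The sketch below implements the assumed-mean method of formula (2) for the data of Table 1.6; it is an illustration only.

```python
# Illustrative sketch: mean of grouped data by the assumed-mean method (formula 2)
midpoints   = [12, 17, 22, 27, 32, 37]     # mid-points of the class-intervals (Table 1.6)
frequencies = [3, 6, 8, 11, 8, 4]
i  = 5                                     # width of the class-interval
AM = 27                                    # assumed mean (interval with the largest frequency)
N  = sum(frequencies)

x_prime = [(x - AM) // i for x in midpoints]             # deviations in class-interval units
sum_fx  = sum(f * xp for f, xp in zip(frequencies, x_prime))
mean    = AM + (sum_fx / N) * i
print(sum_fx, mean)                        # -13 and 25.375, as in the worked example
```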
II) The Median
The median is a point in an array, above and below which one half of the observations fall. It is a measure of position rather than magnitude.
If the observations are ungrouped and their number is small, they are arranged in order of magnitude and the middle observation is located by counting up to half the value of N. When the number of observations (N) is odd, the middle observation is the median. For example, 10 is the median of the scores 7, 8, 9, 10, 11, 12, 13. When the number of observations (N) is even, the median is the mid-point between the two middle scores. For example, (10 + 11)/2 = 10.5 is the median of the scores 7, 8, 9, 10, 11, 12, 13, 14.
In the case of grouped data, cumulative frequency distribution is prepared and the
median is calculated by the formula:
Mdn = l + [(N/2 - F) / f] × i        ...............(3)
in which
Mdn = Median
l = Exact lower limit of the class-interval upon which the median lies.
N/2 = One half of the total number of observations
F = Sum of all frequencies below l.
f = Frequency within the class-interval upon which the median lies.
i = Width of the class interval in which the median lies.
To illustrate the use of formula (3) consider the data of table 1.5 once again.
Table 1.7: The Calculation of the Median from Data Grouped into a Frequency
Distribution
Class-Interval                 Frequency (f)    Cumulative Frequency (F)
34.5 - 39.5                         4                   40
29.5 - 34.5                         8                   36
24.5 - 29.5 (median class)         11                   28
19.5 - 24.5                         8                   17
14.5 - 19.5                         6                    9
9.5 - 14.5                          3                    3
N = 40

Mdn = l + [(N/2 - F) / f] × i
    = 24.5 + [(20 - 17) / 11] × 5
    = 24.5 + 15/11
    = 24.5 + 1.36
    = 25.86
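A similar check can be made for the median. The sketch below simply applies formula (3) to the distribution of Table 1.7; it is only an illustration.

```python
# Illustrative sketch: median of grouped data (formula 3), Table 1.7
lower_limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]   # exact lower limits
frequencies  = [3, 6, 8, 11, 8, 4]
i = 5
N = sum(frequencies)

cum_below = 0
for l, f in zip(lower_limits, frequencies):
    if cum_below + f >= N / 2:            # class-interval containing the N/2-th case
        median = l + ((N / 2 - cum_below) / f) * i
        break
    cum_below += f

print(round(median, 2))                   # 25.86, as in the worked example
```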
III) The Mode
In a simple ungrouped series of measures, the crude or empirical mode is that single measure which occurs most frequently. For example, in the series 7, 8, 9, 9, 9, 10, 11 and 12, the most often recurring measure, namely 9, is the crude or empirical mode.
When data are grouped into a frequency distribution, the crude or empirical mode is usually taken to be the mid-point of the interval which contains the largest frequency. In the example given in Table 1.5, the interval 24.5 - 29.5 contains the largest frequency and hence 27, its mid-point, is the crude mode.
The true mode, that is, the point of greatest concentration in the distribution, or the point at
which more measures fall than at any other point, is calculated by the formula:
Mode = l + [(fm - f1) / (2fm - f1 - f2)] × i        ...................................(4)
Where
l = Lower limit of the modal class i.e., the class interval having maximum
frequency
fm = Frequency of the modal class.
f1 = Frequency of the class-interval preceding the modal class.
f2 = Frequency of the class-interval following the modal class.
i = Width of the modal class.
To illustrate, let us make use of formula (4) for the data in Table 1.5. Here the maximum frequency is 11, which lies in the class-interval 24.5 - 29.5.
Mode = 24.5 + [(11 - 8) / (2 × 11 - 8 - 8)] × 5
     = 24.5 + (3/6) × 5
     = 24.5 + 2.5
     = 27.00
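Formula (4) can likewise be verified. The sketch below locates the modal class of Table 1.5 and applies the formula; it is purely illustrative.

```python
# Illustrative sketch: the "true" mode of grouped data (formula 4), Table 1.5
lower_limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]
frequencies  = [3, 6, 8, 11, 8, 4]
i = 5

m  = frequencies.index(max(frequencies))   # modal class (largest frequency)
l  = lower_limits[m]
fm = frequencies[m]
f1 = frequencies[m - 1] if m > 0 else 0                      # preceding class
f2 = frequencies[m + 1] if m < len(frequencies) - 1 else 0   # following class

mode = l + ((fm - f1) / (2 * fm - f1 - f2)) * i
print(mode)                                # 27.0, as in the worked example
```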
B. Measures of Variability
The measures of central tendency are very useful in describing the nature of a distribution
of measures, but they do not give the researcher a complete picture of the data. These
measures will not tell the researcher how the scores tend to be distributed. For this, you
use a different set of measures which are called measures of ‘variability’ or measures of
‘spread’ or ‘dispersion’. The most commonly used measures of variability include the
range, the variance and standard deviation.
I. The Range
The range is defined as the difference between the two extreme measures or values in a
distribution. Suppose the scores of 10 learners in the course MDE -412 are :
50, 40, 39, 35, 29, 28, 24, 27, 19, 18.
The range for this distribution will be (50-18) = 32. Although the range has the advantage
of being easily calculated, it has the following serious limitations:
1) As the value of range is based on only two extreme values in the total distribution, it
does not give any idea of the variation of many other values of the distribution.
2) It is not a stable statistic as its value can differ from sample to sample drawn from
the same population.
II. The Variance and Standard Deviation
The average of the squared deviations of the measures or values from their mean is known as the variance. The standard deviation is the positive square root of the variance.
The variance for ungrouped data is found by using the formula:
σ² = ∑x² / N        ..........................................(5)
in which
σ² = Variance of the sample
x = Deviation of the raw measures or values from the mean
N = Number of values or measures
Let us consider the following data of scores for the application of formula (5):
10, 10, 9, 9, 8, 8, 7, 7, 6, 6.
As the deviation of each score from the mean is required, the first thing to do is to calculate the mean. Using formula (1):
M = ∑X / N = 80/10 = 8
Now, from each raw score, the mean is subtracted to get the value of x.
Table 1.8: Distribution of the Test Scores of Ten Learners of Course MDE-412
Score (X)    x = X - M    x²
10               2         4
10               2         4
 9               1         1
 9               1         1
 8               0         0
 8               0         0
 7              -1         1
 7              -1         1
 6              -2         4
 6              -2         4
                        ∑x² = 20
Applying formula (5):
σ² = ∑x² / N = 20/10 = 2
Now, to get the standard deviation, you take the positive square root of the variance σ²:
Standard Deviation = σ = √(∑x² / N) = √2 = 1.41
The raw scores, instead of deviation scores, may also be used. The raw-score formulae for variance and standard deviation are:
Variance = σ² = [N∑X² - (∑X)²] / N²        ..........................................(6)
Standard Deviation = σ = √[N∑X² - (∑X)²] / N        ..........................................(7)
In which
X = Raw score
N = The number of scores in the distribution.
Using the same set of data, you can calculate variance and standard deviation with the
help of raw-score formulae:
Table 1.9: The calculation of variance and standard deviation from original (raw) scores when the assumed mean is taken at zero and the data are ungrouped
Score (X)        X²
10 100
10 100
9 81
9 81
8 64
8 64
7 49
7 49
6 36
6 36
∑X = 80        ∑X² = 660

Variance = σ² = [N∑X² - (∑X)²] / N²
= [10 × 660 - (80)²] / 100
= (6600 - 6400) / 100
= 2

Standard Deviation = σ = √[N∑X² - (∑X)²] / N
= √[10 × 660 - (80)²] / 10
= √(6600 - 6400) / 10
= 14.14 / 10
= 1.414
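The two ways of computing variance and standard deviation for ungrouped data (formulas 5, 6 and 7) can be checked with a few lines of code. The sketch below, offered only as an illustration, uses the ten scores of Tables 1.8 and 1.9.

```python
# Illustrative sketch: variance and standard deviation of ungrouped scores,
# by the deviation formula (5) and the raw-score formulas (6) and (7).
scores = [10, 10, 9, 9, 8, 8, 7, 7, 6, 6]
N = len(scores)
M = sum(scores) / N                                   # mean = 8

# Deviation formula: sigma^2 = sum(x^2) / N, where x = X - M
variance_dev = sum((X - M) ** 2 for X in scores) / N

# Raw-score formulas: no deviations needed
sum_X  = sum(scores)
sum_X2 = sum(X ** 2 for X in scores)
variance_raw = (N * sum_X2 - sum_X ** 2) / N ** 2
sd = (N * sum_X2 - sum_X ** 2) ** 0.5 / N

print(variance_dev, variance_raw, round(sd, 3))       # 2.0  2.0  1.414
```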
In the case of grouped data in a frequency distribution, the variance and standard
deviation are calculated by using the formulae:
Variance = σ² = (i²/N²) [N∑fx′² - (∑fx′)²]        ..........................(8)
Standard Deviation = σ = (i/N) √[N∑fx′² - (∑fx′)²]        ..........................(9)
Where
i = Width of the class-interval
N = Total number of measures
f = Frequency of class-interval
x′ = Deviation of the raw measure from the assumed mean divided by the width of the class-interval.
To illustrate the use of these formulae let us consider the distribution given in table 1.10.
Table 1.10: The Calculation of Variance and Standard Deviation from Data
Grouped in a Frequency Distribution
Class-Interval    Mid-point (x)     f     x′     fx′    fx′²
71 - 75                73           3      3      9      27
66 - 70                68           4      2      8      16
61 - 65                63           9      1      9       9
56 - 60             58 (AM)        15      0      0       0
51 - 55                53           8     -1     -8       8
46 - 50                48           6     -2    -12      24
41 - 45                43           5     -3    -15      45
N = 50                                  ∑fx′ = -9    ∑fx′² = 129
Variance = σ² = (i²/N²) [N∑fx′² - (∑fx′)²]
= [(5)² / (50)²] [50 × 129 - (-9)²]
= (25/2500) × (6450 - 81)
= (25/2500) × 6369
= 63.69

Standard Deviation = σ = (i/N) √[N∑fx′² - (∑fx′)²]
= (5/50) √[50 × 129 - (-9)²]
= (1/10) √6369
= (1/10) × 79.81
= 7.98
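Formulas (8) and (9) can be verified in the same way. The sketch below, again only illustrative, uses the distribution of Table 1.10.

```python
# Illustrative sketch: variance and standard deviation of grouped data
# by formulas (8) and (9), using the distribution of Table 1.10.
midpoints   = [43, 48, 53, 58, 63, 68, 73]
frequencies = [5, 6, 8, 15, 9, 4, 3]
i  = 5
AM = 58                                    # assumed mean (largest frequency)
N  = sum(frequencies)                      # 50

x_prime = [(x - AM) // i for x in midpoints]
sum_fx  = sum(f * xp      for f, xp in zip(frequencies, x_prime))   # -9
sum_fx2 = sum(f * xp ** 2 for f, xp in zip(frequencies, x_prime))   # 129

variance = (i ** 2 / N ** 2) * (N * sum_fx2 - sum_fx ** 2)
sd       = (i / N) * (N * sum_fx2 - sum_fx ** 2) ** 0.5
print(round(variance, 2), round(sd, 2))    # 63.69 and 7.98
```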
The standard deviation is a very useful device for comparing characteristics that may be
different or expressed in different units of measurement. It is also used in describing the
status or position of an individual in a group. But before this concept is developed further,
it is essential to understand the nature of the ‘normal probability distribution’.
Check Your Progress 3
Compute (i) the Mean, (ii) the Variance and (iii) the Standard Deviation for the following frequency distribution:
Class-Interval            f
195-199 1
190-194 2
185-189 4
180-184 5
175-179 8
170-174 10
165-169 6
160-164 4
155-159 4
150-154 2
145-149 3
140-144 1
The normal probability distribution is based upon the law of probability. It is not an
actual distribution of measures or scores; instead, it is a mathematical model. It is
represented by a curve which is called the Normal Probability Curve. Figure 4 represents
an ideal normal probability curve.
1. The curve is symmetrical around its vertical axis called ordinate. It implies that the
size, shape and slope of the curve on one side of the ordinate is identical to that on its
other side.
2. The values of mean, mode and median computed for a distribution following this
curve, are always the same.
3. The height of the vertical line called ordinate is maximum at mean and in the unit
normal curve it is equal to 0.3989.
4. The curve is asymptotic. It approaches but does not meet the horizontal axis, and extends from -∞ (minus infinity) to +∞ (plus infinity).
5. The points of inflection of the curve occur at ±1 standard deviation (±1σ) above and below the mean. Thus the curve changes from convex to concave in relation to the horizontal axis at these points.
6. About 68.26 percent of the total area falls between the limits M + 1σ and M - 1σ; 95.44 percent of the total area of the curve falls between the limits M + 2σ and M - 2σ; and 99.73 percent of the total area of the curve falls between M + 3σ and M - 3σ.
However, these calculations are rarely necessary, as a Normal Table is available from which information about the area under the curve can be readily obtained. For this reason, it is essential that the use of the Normal Table (Table I, Appendix, Unit 2) be clearly understood. Table I gives the fractional parts of the total area under the normal curve found between the
mean and the ordinates (y's) erected at various distances from the mean. The total area under the curve is taken arbitrarily to be 10,000, because of the greater convenience with which fractional parts of the total area may then be calculated. You know that x = (X - M) measures the deviation of a raw score (X) from the mean (M). If x is divided by σ, this deviation is expressed in σ units. These deviation scores are called sigma scores or z-scores, i.e. z = (X - M)/σ = x/σ. The first column of the table, headed x/σ, gives the distance from the mean in tenths of σ; the distance from the mean in hundredths of σ is given by the headings of the other columns.
To find the number of cases in the normal distribution between the mean and the ordinate erected at a distance of 1σ from the mean, you go down the x/σ column until 1.0 is reached, and in the next column, under .00, you take the entry opposite 1.0, namely 34.13. This figure means that 3413 cases in 10,000, or 34.13 percent of the entire area of the curve, lie between the mean and +1σ. Similarly, if you have to find out the percentage of the distribution between the mean and 1.65σ, you go down the x/σ column till 1.6, then across horizontally to the column headed .05, and take the entry 45.05. This shows that in a normal distribution, 45.05 percent of the total area lies between the mean and 1.65σ.
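If a Normal Table is not at hand, the same areas can be computed from the error function available in Python's standard math module. The sketch below is only an illustrative check of the two table entries read above, not a replacement for Table I.

```python
# Illustrative sketch: area under the unit normal curve between the mean
# and an ordinate z sigma-units away, P(0 < Z < z) = erf(z / sqrt(2)) / 2.
import math

def area_mean_to_z(z):
    return 0.5 * math.erf(z / math.sqrt(2))

print(round(100 * area_mean_to_z(1.00), 2))   # 34.13 percent, as read from the table
print(round(100 * area_mean_to_z(1.65), 2))   # 45.05 percent
```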
C. Measures of Relative Position
A raw score on a test, taken by itself, has no meaning. It acquires meaning only by comparison with some reference group or groups. For example, if a student scores 50 in Maths and 30 in Science, it does not necessarily mean that the student did better in Maths: 50 may be the lowest score on the Maths test and 30 the highest score on the Science test. The comparison may be made with the help of the following measures:
1. Sigma Scores
2. Standard Scores
3. Percentiles
4. Percentile Ranks.
1. Sigma Scores
A sigma score makes a realistic comparison of scores possible and provides a basis for
equal weighting of the scores as the scores on different tests are expressed on a scale
with a mean of zero and standard deviation of 1.
Let us suppose that the mean of a test is 75 and the standard deviation is 5.0. Then, if A earns a score of 85 on this test, his/her deviation from the mean is 85 - 75 = 10. Dividing this deviation of 10 by the standard deviation σ, i.e. 5.0, we give him/her a sigma score of 10/5 = 2. If B's score on this test is 64, his/her deviation from the mean is 64 - 75 = -11 and the score in σ units is -11/5 = -2.20. Deviations from the mean expressed in terms of σ are called sigma scores.
Since half of the scores in a distribution lie below the mean and half above it, about half of the sigma scores are negative and half are positive.
2. Standard Scores
The sigma scores, which are often small decimal fractions, half of them negative, are somewhat inconvenient to deal with. Hence, the scores are usually converted into a new
distribution with mean and standard deviation so selected that it makes all scores positive
and relatively easy to handle in computation. Such scores are called ‘standard scores’.
The formula for the conversion of a raw score to a standard score is as follows:
X′ = (σ′/σ)(X - M) + M′        .......................................(10)
in which
X′ = Standard score in the new distribution
M′ = Mean of the new distribution
σ′ = Standard deviation of the new distribution
X, M, σ = Raw score, mean and standard deviation of the original distribution
When the mean (M′) and standard deviation (σ′) of the new distribution are taken to be 50 and 10 respectively, the standard score is called a T-score,
i.e.  T = (10/σ)(X - M) + 50        ................................(11)
Example: To illustrate, let us consider a distribution with its mean 67 and σ = 12.5. Let us also suppose that A's score is 76 and B's score is 54. Express these scores as (i) standard scores in a distribution with a mean of 250 and σ′ of 50, and (ii) T-scores.
(i) Standard scores:
X′ = (50/12.5)(X - 67) + 250
For A:  X′ = (50/12.5)(76 - 67) + 250 = (50 × 9)/12.5 + 250 = 36 + 250 = 286
Substituting B's score of 54 in the same equation:
X′ = (50/12.5)(54 - 67) + 250 = -52 + 250 = 198
(ii) T-scores:
T = (10/12.5)(X - 67) + 50
For A:  T = (10/12.5)(76 - 67) + 50 = 0.8 × 9 + 50 = 57.2
For B:  T = (10/12.5)(54 - 67) + 50 = 0.8 × (-13) + 50 = 39.6
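The conversions of this worked example can be expressed as small functions. The sketch below implements formulas (10) and (11) for the distribution with M = 67 and σ = 12.5; the function names are only illustrative.

```python
# Illustrative sketch: converting raw scores to sigma (z) scores, standard
# scores and T-scores, using the worked example (M = 67, sigma = 12.5).
def z_score(X, M, sigma):
    return (X - M) / sigma

def standard_score(X, M, sigma, M_new, sigma_new):
    return (sigma_new / sigma) * (X - M) + M_new      # formula (10)

def t_score(X, M, sigma):
    return standard_score(X, M, sigma, 50, 10)        # formula (11)

for X in (76, 54):
    print(X,
          round(z_score(X, 67, 12.5), 2),
          round(standard_score(X, 67, 12.5, 250, 50), 1),
          round(t_score(X, 67, 12.5), 1))
# 76 -> z = 0.72, standard score = 286.0, T = 57.2
# 54 -> z = -1.04, standard score = 198.0, T = 39.6
```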
3. Percentiles
Percentiles are the points which divide the entire scale of measurement into 100 equal
parts. They are denoted by P0, P1, P2, P3, P4, P5 ……………… P99, and P100.
The first percentile may be defined as that point in a frequency distribution below which
lie 1 percent of the total measures or scores. Similarly, twentieth percentile may be
defined as that point in a frequency distribution below which 20 percent of the total
measures or scores fall. It is evident that the median, expressed as a percentile, is P50. It
should be noted that P0 lies at the beginning of the distribution and P100 at the end of the
distribution.
Pp = l + [(pN - F) / fp] × i        .......................................(12)
in which
Pp = The percentile to be calculated
l = Exact lower limit of the class-interval in which Pp lies
pN = The part of N to be counted off to reach Pp (p expressed as a proportion)
F = Cumulative frequency below l
fp = Frequency of the class-interval in which Pp lies
i = Width of the class-interval

To illustrate the use of formula (12), consider the following distribution of the scores of 80 learners:

Class-Interval    Frequency (f)    Cumulative Frequency (F)
81.5 - 86.5             1                   80
76.5 - 81.5             4                   79
71.5 - 76.5             5                   75
66.5 - 71.5            10                   70
61.5 - 66.5            35                   60
56.5 - 61.5            12                   25
51.5 - 56.5             9                   13
46.5 - 51.5             2                    4
41.5 - 46.5             2                    2
N = 80
P25 = 56.5 + [(20 - 13)/12] × 5 = 56.5 + 2.92 = 59.42
Similarly,
P45 = 61.5 + [(36 - 25)/35] × 5 = 61.5 + 1.57 = 63.07
P95 = 76.5 + [(76 - 75)/4] × 5 = 76.5 + 1.25 = 77.75
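Formula (12) also lends itself to a short computational sketch. The one below applies it to the distribution of 80 learners given above; the function name is illustrative only.

```python
# Illustrative sketch: percentiles from grouped data (formula 12),
# using the distribution of 80 learners given above.
lower_limits = [41.5, 46.5, 51.5, 56.5, 61.5, 66.5, 71.5, 76.5, 81.5]
frequencies  = [2, 2, 9, 12, 35, 10, 5, 4, 1]
i = 5
N = sum(frequencies)                       # 80

def percentile(p):                         # p as a proportion, e.g. 0.25 for P25
    target = p * N
    cum_below = 0
    for l, f in zip(lower_limits, frequencies):
        if cum_below + f >= target:
            return l + ((target - cum_below) / f) * i
        cum_below += f

for p in (0.25, 0.45, 0.95):
    print(f"P{round(p * 100)} = {percentile(p):.2f}")
# P25 = 59.42, P45 = 63.07, P95 = 77.75
```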
4. Percentile Ranks
The percentile rank of a given score is the percentage of scores in the distribution that fall below that score. If a score of 65 has a percentile rank of 80, then 80 percent of the scores fall below 65. The median is the 50th percentile, for 50 percent of the scores fall below it.
The process of calculating percentile ranks is the reverse process of calculating percentile
points. You have to calculate ranks corresponding to particular scores. If R is the rank and
N is the total number of cases, then:
Percentile Rank = 100 - (100R - 50)/N        ................................... (13)
Suppose A ranks 13th in a class of 80 learners; 12 learners rank above him/her and 67 below. His/her percentile rank is:
= 100 - (100 × 13 - 50)/80
= 100 - 15.625
= 84.375, i.e. approximately 84
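Formula (13) can be checked with a one-line function; the sketch below simply reproduces the worked example.

```python
# Illustrative sketch: percentile rank from a rank (formula 13).
def percentile_rank(R, N):
    return 100 - (100 * R - 50) / N

print(percentile_rank(13, 80))   # 84.375, i.e. approximately 84
```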
D. Measures of Relationship
The data in which we secure measures of two variables for each individual are called bivariate data. The essential feature of bivariate data is that one measure can be compared with another measure for each member of the group. When bivariate data are
studied, you may like to know the degree of relationship between the variables of such
data. This degree of relationship is known as correlation. It can be quantitatively
represented by the coefficient of correlation. Its value ranges from -1.00 to +1.00. A
value of -1.00 describes a perfect negative correlation and +1.00 describes perfect
positive correlation. A zero value describes complete lack of correlation between the two
variables. The sign of the co-efficient indicates the direction of relationship and numerical
value is its strength/magnitude.
There are various methods of correlating variables. Their use is relative to the situation
and type of data. Product Moment Correlation and Rank Order Correlation are
mostly used for computing correlation between two variables.
1. Product-moment correlation
In some situations, the data for two variables X and Y are expressed in interval or ratio
level of measurement and the distributions of these variables have a linear relationship.
Moreover, the distributions of variables are uni-modal and their variances are
approximately equal. In such situations, product moment method of correlation is used
generally. It is also called Pearson’s r.
When the size of the sample is small, there is no need of grouping the data and Pearson’s
r may be calculated with the help of the following formula:
rxy = [N∑xy - (∑x)(∑y)] / √{[N∑x² - (∑x)²][N∑y² - (∑y)²]}        ................(14)
in which
x and y = Deviations of the X and Y scores from their respective (assumed) means
N = Number of paired observations
To illustrate the use of formula (14), let us compute the product moment ‘r’ from the
following data for the two variables X and Y for 10 learners who are enrolled in a Study
Centre of an Open University.
X : 45 54 52 58 62 46 55 49 50 54
Y : 42 50 55 46 59 41 46 48 45 48
X        Y        x     y     x²    y²    xy
45       42      -7    -6    49    36    42
54       50       2     2     4     4     4
52 (AM)  55       0     7     0    49     0
58       46       6    -2    36     4   -12
62       59      10    11   100   121   110
46       41      -6    -7    36    49    42
55       46       3    -2     9     4    -6
49       48 (AM) -3     0     9     0     0
50       45      -2    -3     4     9     6
54       48       2     0     4     0     0
              ∑x = 5  ∑y = 0  ∑x² = 251  ∑y² = 276  ∑xy = 186

rxy = [N∑xy - (∑x)(∑y)] / √{[N∑x² - (∑x)²][N∑y² - (∑y)²]}
    = (10 × 186 - 5 × 0) / √{[10 × 251 - (5)²][10 × 276 - (0)²]}
    = 1860 / √(2485 × 2760)
    = 1860 / 2618.89
    = 0.71
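Pearson's r for the ungrouped case can also be computed directly from the raw scores, since formula (14) gives the same value whether raw scores or deviations from assumed means are used. The sketch below reproduces the worked example; it is an illustration only.

```python
# Illustrative sketch: Pearson's product-moment r (formula 14) for the
# scores of the 10 learners in the worked example.
from math import sqrt

X = [45, 54, 52, 58, 62, 46, 55, 49, 50, 54]
Y = [42, 50, 55, 46, 59, 41, 46, 48, 45, 48]
N = len(X)

sum_x, sum_y   = sum(X), sum(Y)
sum_x2, sum_y2 = sum(x * x for x in X), sum(y * y for y in Y)
sum_xy         = sum(x * y for x, y in zip(X, Y))

r = (N * sum_xy - sum_x * sum_y) / sqrt(
        (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2))
print(round(r, 2))   # 0.71, the same value as obtained with assumed means
```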
When N is large or even moderate in size, and when no calculating machine is available,
the best procedure is to group data in both variables X and Y and to form a scattergram.
The values from the scattergram may be used in the following formula:
rxy = [N∑fxy - (∑fx)(∑fy)] / √{[N∑fx² - (∑fx)²][N∑fy² - (∑fy)²]}        ..............(15)
To illustrate the use of the formula (15) consider the data of 50 learners enrolled with
IGNOU in Course X and in Course Y in the following scattergram:
The computation of the values ∑fxy, ∑fx², ∑fy², etc. may be done in the following steps, in the order given below.
Step 1
The distribution of Course X scores for the 50 learners is found in the f(y) column on the
right of the scattergram. Assume a mean for the distribution of scores of course X (the
mid-point of that interval which contains the largest frequency), and draw double lines to
mark off the row in which the assumed mean falls. In the present example, the mean
score for course X has been taken at 46 (mid point of interval 45-47) and y’s (deviations
from the assumed mean) have been taken from this point.
Step 2
The distribution of the Course Y of 50 learners is found in the f(x) row at the bottom of
the scattergram. Assume a mean for this distribution and draw double lines to designate
the column under the assumed mean. The mean for the Course Y scores is taken at 26.5
(mid-point of interval 26-27), and the x’s (deviations from assumed mean) are taken from
this point. Fill in the fx and then fx2 rows.
Step 3
The fxy for a cell is computed by multiplying the frequency given in the particular cell with
the corresponding x and y. For example, there is a frequency 1 corresponding with Course
Y score 24-25 and Course X score 51-53. The corresponding x for this cell frequency is -
1 and corresponding y is +2. Thus fxy for this cell is (1) (-1) (+2) = -2. Similarly the value
for fxy is computed for all the cells and their sum is calculated row-wise as well
as column-wise. The two sums should equal each other. In the present example, it has
come to be 4.
Step 4
rxy = [50 × 4 - (-33)(-2)] / √{[50 × 109 - (-33)²][50 × 46 - (-2)²]}
    = 134 / √(4361 × 2296)
    = 0.042
2. Rank Order Correlation
It is also known as the Spearman rank-order coefficient of correlation and is denoted by ρ (rho). When the data are available in ordinal (rank) form of measurement rather than in interval or ratio form, this type of correlation is useful.
To find the Spearman rank-order coefficient of correlation, the following formula is used:
ρ = 1 - 6∑D² / [N(N² - 1)]        .........................................(16)
in which
∑D² = Sum of the squared differences between ranks
N = Number of paired ranks.
To make use of formula (16) let us consider the following data. Two judges X and Y
ranked 10 distance learners in a declamation contest. The ranks given to them by the
judges are given in table 1.13.
Table 1.13: Ranks Awarded by the Two Judges to Ten Distance Learners
Learner    Rank by Judge X    Rank by Judge Y     D     D²
A                2                  3            -1      1
B                4                  5            -1      1
C                5                  4             1      1
D               10                  9             1      1
E                8                  7             1      1
F                1                  2            -1      1
G                3                  1             2      4
H                9                  8             1      1
I                6                 10            -4     16
J                7                  6             1      1
                                               ∑D² = 28
ρ = 1 - 6∑D² / [N(N² - 1)]
  = 1 - (6 × 28) / [10(100 - 1)]
  = 1 - 168/990
  = 0.83
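The rank-order coefficient of the worked example can be reproduced as follows; the sketch simply applies formula (16) to the ranks of Table 1.13 and is purely illustrative.

```python
# Illustrative sketch: Spearman's rank-order correlation (formula 16)
# for the ranks awarded by the two judges in Table 1.13.
ranks_X = [2, 4, 5, 10, 8, 1, 3, 9, 6, 7]
ranks_Y = [3, 5, 4, 9, 7, 2, 1, 8, 10, 6]
N = len(ranks_X)

sum_D2 = sum((x - y) ** 2 for x, y in zip(ranks_X, ranks_Y))   # 28
rho = 1 - (6 * sum_D2) / (N * (N ** 2 - 1))
print(sum_D2, round(rho, 2))   # 28 and 0.83
```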
1.4 QUALITATIVE DATA
The responses to open-ended questions on a questionnaire are often quite extensive; they are neither systematic nor standardised. However, they permit the researcher to understand situations as seen and felt by the respondent. The data gathered through participant observation or an open-ended/unstructured interview are also descriptive in nature. These descriptions may be in the form of field notes specifying some basic information pertaining to the place where the observation has taken place, as well as descriptions of the people who participated in the activities and their extrinsic behaviour in the course of those activities. However, it is not possible to know what is going on in people's minds merely by observing their extrinsic behaviour. Through an open-ended/unstructured interview, you can know more about
those events which had occurred earlier or could not be observed during participant
observation. It provides a framework within which the researcher should be able to gather
information from people conveniently and accurately. The information may pertain to a
programme, the reaction of participants about the programme and the type of change the
participants perceive in themselves after their involvement in the programme. The data
are mostly in the form of responses to structured and unstructured questions put to
respondents by the researcher during conversation. The responses are generally direct
quotations from respondents in their own words and provide details about the situations,
events, people, experiences, behaviours, values, customs, etc.
1.4.1 Organisation of Qualitative Data
Actual classification or organisation of the data can begin only after the copies are made.
There are no formal or universal rules for organising the data in various units, patterns, or
categories. It requires a creative approach and a lot of perseverance to give a meaningful
look to the data. The contents of field notes about interviews or observations may be read
carefully by the researcher and he/she may note down his/her comments on the margins
or attach small pieces of paper with his/her written comments/notes using staples or tags.
The arrangement of data in topics, using abbreviations, is the next step. The abbreviated
topics are written down either on the margins of the relevant data or on slips of paper
which may be attached to the relevant pages. The process of classifying or labelling
various kinds of data helps in the preparation of a data index. Sometimes the data are very large. In such situations, computers are helpful in developing systematic and
comprehensive classification schemes using code numbers for different categories and
sub-categories. The computerized classification system permits the use of organised data
by several groups of people over a long period of time. It permits easy cross-classification
and cross- comparison of descriptive narrations for complex analysis.
1.4.2 Analysis of Qualitative Data
Analysis of qualitative data means studying the organised material available in the form of detailed descriptions, direct quotations or case documentation in order to discover inherent facts. These data are studied from as many angles as possible, either to explore new facts or to reinterpret already known or existing facts.
The following methods are generally used in the analysis of qualitative data:
i) Content analysis
ii) Inductive analysis
iii) Logical analysis
A. Content Analysis
Content analysis is concerned with the classification, organisation and comparison of the
content of the document or communication. The terms, content analysis and coding are
sometimes used interchangeably as both the processes involve objective, systematic, and
qualitative description of any symbolic behaviour. Since content analysis involves the
classification, evaluation and comparison of the content of communication or document, it
is sometimes referred to as ‘documentary activity’ or ‘information analysis’. The
communication may be in the form of responses to an open-ended questionnaire,
conversation as a result of an interview, or description of an observed activity. It may also
be in the form of official records (census, birth, accident, crime, school, institutional and
personal records), judicial decisions, laws, budget and financial records, cumulative
records, courses of study, content of textbooks, reference works, newspapers,
periodicals or journals, prospectus of various educational institutions or universities, direct
quotations, and notes of an interview.
There are three approaches that a researcher may adopt in content analysis. These
include: (i) characteristics of content, (ii) procedures or causes of content, and (iii)
audience or effects of content. In the first approach, the researcher is interested primarily
in the characteristics of the content itself. He/she may focus either on the ‘substantive
nature’ of the content or upon the ‘form’ of the content. In the second approach, the
researcher attempts to draw valid inferences about the nature of the procedures of the
content or the causes of the symbolic material from the characteristics of the material
itself. In the third approach to content analysis, the researcher interprets the content so as
to reveal something about the nature of its ‘audience’ or its ‘effects’. He/she takes the
content material as a basis for drawing inference about the characteristics of the
‘audience’ for whom the material (content) is designed or about the effects of
communication, which it brings about.
The steps involved in the process of content analysis include (i) defining the unit of
analysis, (ii) specifying variables and categories, (iii) frequency, direction and intensity of
units, (iv) contingency analysis, (v) sampling of units, and (vi) constructing the content
analysis outline. Defining the unit of analysis indicates whether the unit (material) is
confined to single words, phrases, complete sentences, paragraphs, or to even larger
amounts of materials. Once the unit is defined, the researcher conducts its analysis so
as to create reproducible or objective data for scientific treatment and generalisation
beyond the specific set of symbolic material analysed. For converting symbolic material
into objective data, it is necessary to specify the “variables” explicitly in terms of which
descriptions are to be made. Once the unit is defined and the variables along with their
categories specified, the researcher will classify units in the material to be analysed
according to : (i) the number of units (frequency), (ii) favourableness/ unfavourableness of
the content (direction), and (iii) the emotional impact of the units (intensity). The
contingency analysis aims at considering the context within which the unit is found. The researcher considers the favourableness or unfavourableness of a single unit in the light of the remainder of the communication so that its real meaning is not lost.
1. The first step is to understand your data. Read the data carefully, again and again, to get a feel for their quality.
2. The second step is to identify the purpose of the evaluation and to see how the respondents have answered the questions.
3. Categorise the data into themes or patterns and then organise them into categories. This is the most important step in qualitative analysis. You can assign codes, which may be a few letters, words or symbols, to the categories (a small illustrative sketch follows this list).
4. In the next step, you have to find out the patterns and connections within and between the categories identified.
5. The last step is interpreting the data. You have to think through the findings and design an outline for presenting them.
6. You can also feed the data into the computer by entering the text into a word-processing program. These days, software packages like Ethnograph, MODIST, etc. can be used to analyse qualitative data; if the data set is not large, you can also analyse it manually. Other software packages are also available, and you can choose according to your convenience.
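Steps 3 and 4 above can be supported with very simple computational tools. The sketch below is purely illustrative: the responses, the theme names and the keyword lists are hypothetical and are not drawn from this unit; it merely shows one way of assigning codes to responses and counting how often each theme occurs.

```python
# Illustrative sketch: assigning codes (themes) to open-ended responses and
# counting how often each theme occurs (steps 3 and 4 of the list above).
# The responses, keywords and theme names here are hypothetical examples.
from collections import Counter

responses = [
    "The study material reached me late but the counsellor was helpful.",
    "Assignments were returned with useful comments.",
    "The study centre is too far from my home.",
    "Counselling sessions clashed with my office hours.",
]

themes = {                                  # code -> keywords (hypothetical)
    "MATERIAL": ["study material", "assignments"],
    "SUPPORT":  ["counsellor", "counselling", "comments"],
    "ACCESS":   ["far", "clashed", "late"],
}

counts = Counter()
for response in responses:
    text = response.lower()
    for code, keywords in themes.items():
        if any(k in text for k in keywords):
            counts[code] += 1               # one tally per theme per response

for code, n in counts.most_common():
    print(code, n)
```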
B. Inductive Analysis
Inductive analysis means that patterns, themes, and categories of analysis emerge out of
the data. In this analysis, the researcher looks for natural variation in the data. The study of
natural variation involves particular attention to variations in programme processes and
how participants respond to and are affected by programmes. Two ways of representing
the patterns emerge from the analysis of data. First, the researcher can use the categories
developed and articulated in the programme studied to organize presentation of particular
themes. Second, the researcher may also become aware of categories or patterns for
which the people in the programme did not have labels or terms, and the analyst develops
terms to describe these inductively generated categories.
C. Logical Analysis
Logical analysis is used for representing patterns as dimensions or categories using either
participant-generated constructions or evaluator generated constructions. It is sometimes
useful to cross-classify different dimensions to generate new insights about how the data
can be organized and to look for patterns that may not have been recognised in the initial
inductive analysis. Logical analysis aims at creating potential categories by crossing one
typology with another, and then moving back and forth between the logical construction
and the actual data for creating a “new typology” using cross-classification matrices.
There are other ways of analysing qualitative data. We have not discussed all of them.
The idea is to give you a feel of qualitative data analysis and show how it differs from
quantitative data analysis.
1.5 LET US SUM UP
1. The data collected through the administration of various tools on the selected
samples are of (i) quantitative and (ii) qualitative nature.
2. Quantitative data are expressed in nominal, ordinal, interval or ratio scales of
measurement. These data are classified into two categories: (i) parametric and (ii)
non-parametric. The parametric data are obtained by applying interval or ratio
scales of measurement, whereas non-parametric data are either enumerated or
ranked. In the enumerated data we make use of nominal scale and in the ranked
one we apply ordinal scale.
3. The quantified data is tabulated in ‘frequency distribution’ and can be represented
graphically with the help of a histogram, a frequency polygon, and/or an ogive.
4. Measures of (i) central tendency, (ii) variability, (iii) relative positions, and
(iv) relationship are the four types of descriptive statistical measures.
5. Mean, median and mode are the three measures of central tendency.
6. Mean is the arithmetic average of a distribution. It is obtained by dividing the sum of
all values of observation by the total number of values. The formula for finding the
mean for ungrouped data is:
M = ∑X / N
When the number of observations is large, the data is grouped in a frequency distribution. The formula for computing the mean here is:
M = AM + (∑fx′ / N) × i
7. Median is a point in an array, above and below which one half of the values or
measures fall. If the values are ungrouped and their number is small, the values are
arranged in order of magnitude and the middle value is determined by counting up
half the value of N.
When the number of values is odd, the mid-value is the median. When the number of
values is even, the median is the mid-point between the two middle values.
Mdn = l + [(N/2 - F) / f] × i
8. Mode is the most frequently occurring value in a distribution. If only one value
occurs a maximum number of times, the distribution is said to have one mode (uni-
modal). A two mode distribution is bi-modal, and more than a two mode distribution
is called multimodal.
In a simple ungrouped series of measures or values, the crude mode is that single measure
or value which occurs most frequently.
Mode = l + [(fm - f1) / (2fm - f1 - f2)] × i
9. The range, variance and standard deviation are the most commonly used measures
of variability.
10. The range is the difference between the two extreme values or measures in a
distribution.
11. The average of the squared deviations of the measures or values from their mean is
known as variance. Standard deviation is the positive square root of variance.
Variance and standard deviation for the ungrouped data are found by the formulae:
Variance = σ² = [N∑X² - (∑X)²] / N²
Standard Deviation = σ = √[N∑X² - (∑X)²] / N
When the data are grouped in a frequency distribution, the variance and standard
deviation are computed by the formulae:
Variance = σ² = (i²/N²) [N∑fx′² - (∑fx′)²]
Standard Deviation = σ = (i/N) √[N∑fx′² - (∑fx′)²]
12. The normal probability distribution is represented by a curve which has the
following characteristics:
i) The curve is symmetrical around its vertical axis called ordinate.
ii) The mean, mode and median of the distribution have the same values.
iii) The height of the vertical line called ordinate is maximum at the mean and in
the unit normal curve, it is equal to 0.3989.
iv) The curve is asymptotic.
v) The points of inflection of the curve occur at ±1 standard deviation (±1σ) above and below the mean.
vi) About 68.26 percent of the total area of the curve falls between the limits M ± 1σ, 95.44 percent of the total area falls between M ± 2σ, and 99.73 percent of the total area falls between M ± 3σ.
13. Sigma scores, standard scores, percentiles and percentile ranks are the measures of
relative positions.
14. A sigma score makes it possible to obtain a realistic comparison of scores and
provides a basis for equal weighting of the scores as the scores on different tests
are expressed on a scale with a mean of zero and standard deviation 1.
15. When the sigma scores are converted into a new distribution with mean and
standard deviation so selected as to make all scores positive, the scores are called
standard scores.
The formula for the conversion of a raw score to a standard score is:
X′ = (σ′/σ)(X - M) + M′
When the mean (M′) and standard deviation (σ′) are taken to be 50 and 10 respectively, the standard score is called a T-score. It is expressed by the formula:
T = (10/σ)(X - M) + 50
16. Percentiles are the points which divide the entire scale of measurement into 100
equal parts.
Pp = l + [(pN - F) / fp] × i
17. Percentile rank is the percentage of scores in the distribution that fall below a given score. If R is the rank of an individual and N is the total number of cases, then:
Percentile Rank = 100 - (100R - 50)/N
18. Product Moment correlation and rank-difference correlation are the commonly used
measures of relationship between any two variables.
19. When the size of sample is small and the variables are measured in interval scales
of measurement, the product-moment correlation is computed by the formulae:
rxy = [N∑xy - (∑x)(∑y)] / √{[N∑x² - (∑x)²][N∑y² - (∑y)²]}
When the size of the sample is large, the product-moment correlation is found by the
formulae:
rxy = [N∑fxy - (∑fx)(∑fy)] / √{[N∑fx² - (∑fx)²][N∑fy² - (∑fy)²]}
20. When the data are available in ordinal (rank) form of measurement and the size of
the sample is small, the formula for computing the rank-difference correlation is:
ρ = 1 - 6∑D² / [N(N² - 1)]
To assess your learning, see for yourself whether you are now able to:
• name various types of data
• define quantitative data
• describe quantitative data
• describe various types of quantitative data with examples
• tabulate given quantitative data into a frequency distribution
• illustrate the methods of expressing the class intervals with the help of examples
• compute the cumulative frequencies and cumulative percents for a given
frequency distribution
• name the methods of representing data graphically
• construct a histogram, a frequency polygon and an ogive for a given distribution
• name the four descriptive statistical measures
• name and define the three measures of central tendency or averages (mean,
median and mode)
• compute mean, median and mode from a given (i) ungrouped data (ii) grouped
data
• name and define the three measures of variability (range, variance and standard
deviation)
• compute range for ungrouped data
• compute variance and standard deviation for a given (i) ungrouped and (ii)
grouped data
• describe the nature and characteristics of Normal Distribution
• use the Normal Table
• name and define the various measures of comparing individuals on the basis of
different types of scores (Sigma Scores, Standard Scores, Percentiles and
Percentile ranks)
• convert a raw score into a sigma score corresponding to the mean and standard
deviation of a distribution
• convert a given raw score into a standard score (Z-score or T-score)
corresponding to the mean and standard deviation of a distribution
• define a percentile
• compute certain percentiles for a given distribution of scores
• define percentile rank
• compute the percentile rank of an individual corresponding to his/her rank in the
group to which he/she belongs
• name the various measures of relationships
• compute the product moment correlation for a given (i) ungrouped and (ii)
grouped data
• compute rank order correlation for a given ungrouped data
• define qualitative data with examples
• name and describe some methods used in the analysis of qualitative data
1.6 GLOSSARY
1. Quantitative Data: Data which are expressed in nominal, ordinal,
interval or ratio scales of measurement.
2. Qualitative Data: Data which are available in the form of detailed
descriptions of situations, events, people,
interactions, and observed behaviour, direct
quotations from people about their experiences,
attitudes, beliefs, and thoughts, and excerpts from
documents, correspondence, records, and case
histories.
3. Parametric Data: These are data which are got by applying interval or
ratio scales of measurement.
4. Nonparametric Data: These are data which are got by applying nominal or
ordinal scales of measurement. These types of data
are either counted or ranked.
5. Central Tendency: A measure of central tendency provides a single most typical value as representative of a group of values or measures.

1.7 CHECK YOUR PROGRESS: THE KEY
3. (i) Mean = A.M. + (∑fx′/N) × i
       = 172 + (-12/50) × 5
       = 172 - 1.2
       = 170.80
   (ii) Variance = σ² = (i²/N²) [N∑fx′² - (∑fx′)²]
       = [(5)² / (50)²] [50 × 322 - (-12)²]
       = (25/2500) × 15956
       = 159.56
   (iii) Standard Deviation = σ = (i/N) √[N∑fx′² - (∑fx′)²]
       = (5/50) √[50 × 322 - (-12)²]
       = (1/10) × 126.32
       = 12.63
4. The normal probability curve is symmetrical around its vertical axis, i.e. the ordinate.
   The values of the mean, median and mode coincide, i.e. they have the same value.
   The height of the ordinate is maximum at the mean.
   The curve is asymptotic.
   The points of inflection of the curve occur at ±1 standard deviation (±1σ) above and below the mean.
   About 68.26 percent of the total area falls between the limits M + 1σ and M - 1σ; 95.44 percent of the total area of the curve falls between the limits M + 2σ and M - 2σ; and 99.73 percent of the total area of the curve falls between M + 3σ and M - 3σ.
5. For the given data (N = 11):
   ∑x = 2, ∑y = 6, ∑x² = 1414, ∑y² = 1650, ∑xy = 1403
   rxy = [N∑xy - (∑x)(∑y)] / √{[N∑x² - (∑x)²][N∑y² - (∑y)²]}
       = (11 × 1403 - 2 × 6) / √{[11 × 1414 - (2)²][11 × 1650 - (6)²]}
       = 0.92