Types of Data and Data Sources Mba
Research Methodology
Figure 1
In statistics, data are classified into two broad categories: quantitative data and qualitative data. This
classification is based on the kind of characteristics that are measured.
Qualitative data refer to qualitative characteristics of a subject or an object. They are data that can
be placed into distinct categories according to some characteristic or attribute, and they cannot be
expressed numerically. A characteristic is qualitative in nature when its observations are defined and
noted in terms of the presence or absence of a certain attribute. These data are further classified as
nominal and rank data.
(i) Nominal data are the outcome of classifying the items or units of a sample or a population into
two or more categories according to some quality characteristic. Classification of students
according to gender (as males and females), of workers according to skill (as skilled, semi-skilled,
and unskilled), and of employees according to the level of education (as matriculates,
undergraduates, and post-graduates) all result in nominal data. Given any such basis of
classification, it is always possible to assign each item to a particular class and to count the
items belonging to each class. The count data so obtained are called nominal data.
(ii) Rank data are the result of assigning ranks to specify order in terms of the integers 1, 2, 3, ..., n.
Ranks may be assigned according to the level of performance in a test, a contest, a competition, an
interview, or a show. The candidates appearing in an interview, for example, may be assigned ranks
in integers ranging from 1 to n, depending on their performance in the interview. Ranks so assigned
can be viewed as the ordered values of a variable with performance as the quality
characteristic.
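As a small illustration of the two kinds of qualitative data, the sketch below counts class membership for a nominal characteristic and assigns ranks 1, 2, ..., n by performance. The list of workers and the interview scores are hypothetical values made up for this example; only the skill categories come from the text.

```python
from collections import Counter

# Hypothetical sample: skill category recorded for eight workers.
skills = ["skilled", "unskilled", "semi-skilled", "skilled",
          "skilled", "unskilled", "semi-skilled", "skilled"]

# Nominal data: the count of items falling in each class.
nominal_counts = Counter(skills)

# Hypothetical interview scores for four candidates.
scores = {"Asha": 72, "Bala": 88, "Chitra": 65, "Dev": 80}

# Rank data: rank 1 goes to the best performer, rank n to the weakest.
ordered = sorted(scores, key=scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(dict(nominal_counts))
print(ranks)
```

The counts only say how many items fall in each class, while the ranks only say who is ahead of whom; neither tells us by how much.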
Dr.Bhavana Likhitkar/Faculty/LNCTU
LNCT GROUP OF COLLEGES
Quantitative data are those that can be quantified in definite units of measurement. These refer to
characteristics whose successive measurements yield quantifiable observations. Depending on the
nature of the variable observed for measurement, quantitative data can be further categorized as
continuous and discrete data. The level of measurement for quantitative data is either interval or ratio.
Obviously, a variable may be a continuous variable or a discrete variable.
(i) Continuous data represent the numerical values of a continuous variable. A continuous variable
is the one that can assume any value between any two points on a line segment, thus representing
an interval of values. The values are quite precise and close to each other, yet distinguishably
different. All characteristics such as weight, length, height, thickness, velocity, temperature,
tensile strength, etc., represent continuous variables. Thus, the data recorded on these and similar
other characteristics are called continuous data.
(ii) Discrete data are the values assumed by a discrete variable. A discrete variable is the one whose
outcomes are measured in fixed numbers. Such data are essentially count data. These are derived
from a process of counting, such as the number of items possessing or not possessing a certain
characteristic. The number of customers visiting a departmental store every day, the incoming
flights at an airport, and the defective items in a consignment received for sale, are all examples of
discrete data.
TYPES OF DATA ACCORDING TO DATA SOURCE
Data sources could be seen as of two types as follows:
(i) Primary data: These are data which do not already exist in any form, and thus have to be collected
for the first time from the primary source(s). By their very nature, these data require fresh and
first-time collection covering the whole population or a sample drawn from it. They may be collected
by any one of the following methods:
Observation Method
Interview Method
Questionnaire Method
Schedule Method
Focus group
(ii) Secondary data: These already exist in some form, published or unpublished, in an identifiable
secondary source. They are generally available from the following sources, though not necessarily in
the form actually required:
Official publications of Govt., International Organizations, Banks, Trade Unions, CII, BSE, etc.
Internet, Journals, Newspapers, Magazines, etc.
Unpublished Research Reports
The secondary data can be relied upon only after examining the following factors:
(1) the source from which they have been obtained;
(2) their true significance;
(3) their completeness; and
(4) the method of collection.
Choice between Primary and Secondary Data
An investigator has to decide whether to collect fresh (primary) data or to compile data from
published sources. The following factors should be considered while choosing between primary and
secondary data:
(i) Nature and scope of enquiry.
(ii) Availability of time and money.
(iii) Degree of accuracy required, and
(iv) The status of the investigator, i.e., individual, Pvt. Co., Govt., etc.
However, in certain investigations both primary and secondary data may have to be used, with one
supplementing the other.
Table 1: Difference between Primary & Secondary Data
Basis     Primary Data                           Secondary Data
Nature    Primary data are original and are      Data which are collected earlier by someone else, and
          collected for the first time.          which are now in published or unpublished state.
ORGANIZATION
So far, we know how to collect data. Now we have to organise the collected data so that they can be
analysed. The collected data (also known as raw data) are always in an unorganized form and need to be
organized. But before organizing the data we need to edit them by correcting errors (such as missing
entries, extreme values, etc.), if any. After editing, we first classify the data and then tabulate
them. Thus, the organisation of data consists of the following steps:
Figure 2
CLASSIFICATION OF DATA
Arranging data into sequences and groups according to their common characteristics is known as
classification. For example, letters in the post office are classified according to their destinations, viz.,
Delhi, Jaipur, Agra, Kanpur, etc.; the televisions in a shop may be classified according to their screen
sizes; the individuals in a group may be classified into various income groups according to their income.
Main objectives of classifying the data are:
1. It condenses the mass of data in a simple form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspect of data.
4. It enables one to get a mental picture of the information and helps in drawing inferences.
5. It helps in the statistical treatment of the information collected.
TYPES OF CLASSIFICATION
1. Chronological or Temporal classification: In chronological classification the collected data are
arranged according to the order of time expressed in years, months, weeks etc. The data are
generally classified in ascending order of time. For example, data relating to population, sales
of a firm, or imports and exports of a country are usually subjected to chronological classification.
Specifically, data relating to the road accidents (in thousands) in Madhya Pradesh during
2003–2010 are
Table 2 : Number of Road Accidents in MP
Table 3: State wise Computer and Internet users (North east India)
State        Users
Sikkim       4228
AP           5232
Mizoram      5527
Nagaland     6799
Meghalaya    8074
Tripura      8428
Manipur      10650
3. Qualitative or categorical classification: In this type of classification, data are classified on the
basis of some attribute or quality such as sex, literacy, religion, employment, etc. Such attributes
cannot be measured on a scale.
When the classification is done with respect to one attribute, which is dichotomous in nature, two
classes can be formed, one possessing the attribute and the other not possessing the attribute. This
type of classification is called simple or dichotomous classification. A simple classification may be
shown as under:
Figure 3
A classification in which two or more attributes are considered and several classes are formed is
called a manifold classification. An example of a manifold classification is
explained by the following chart:
Figure 4
4. Quantitative classification: In quantitative classification, the collected data are grouped with
reference to the characteristics, which can be measured and numerically described such as height,
weight, sales, imports, age, income, etc. The first step in the direction of putting observations in
some ordered form is to arrange them in ascending or descending order of magnitude. The data are
then said to be in an array. For example, the following table presents the number of smartphones
sold on 40 consecutive days by a cell phone dealer. The data displayed here are in raw form, that is,
the numerical observations are not arranged in any particular order or sequence.
Table 4: Raw Data Pertaining to Number of Smartphones Sold in 40 Consecutive Days
7 8 5 10 9 10 5 12 8 6 8 12 8 8 10 15 7 6 8 8
10 11 6 5 10 11 10 5 9 13 5 6 9 7 14 8 7 5 5 14
The raw data can be reorganized in a data array and frequency distribution. Such an arrangement
enables us to see quickly some of the characteristics of the data we have collected. When a raw data
set is arranged in rank order, from the smallest to the largest observation or vice-versa, the ordered
sequence obtained is called an ordered array. The following table reorganizes the above raw data:
Table 5 : Ordered Array
5 5 5 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 8
8 8 8 9 9 9 10 10 10 10 10 10 11 11 12 12 13 14 14 15
It may be observed that an ordered array does not summarize the data in any way as the number of
observations in the array remains the same. To overcome this problem we reorganize the data in the
form of a frequency distribution.
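The point that an ordered array rearranges but does not condense the data can be checked with a few lines of Python. This is only a sketch; the variable names are chosen for this example, and the values are the 40 daily smartphone sales from the raw data table.

```python
# Raw data: smartphones sold on 40 consecutive days.
raw = [7, 8, 5, 10, 9, 10, 5, 12, 8, 6, 8, 12, 8, 8, 10, 15, 7, 6, 8, 8,
       10, 11, 6, 5, 10, 11, 10, 5, 9, 13, 5, 6, 9, 7, 14, 8, 7, 5, 5, 14]

# Ordered array: the same observations in ascending order of magnitude.
ordered_array = sorted(raw)

# The descending version is an equally valid ordered array.
descending_array = sorted(raw, reverse=True)

print(ordered_array)
print(len(raw), len(ordered_array))  # the number of observations is unchanged
```

Sorting makes the smallest and largest values easy to read off, but all 40 observations are still listed one by one.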
A frequency distribution shows the frequency (number of occurrences) of different values of a
single phenomenon (here, the number of smartphones sold per day); alternatively, one may divide the
observations in the data set into conveniently established, numerically ordered classes (groups or
categories). The number of observations in each class is referred to as the frequency of that class.
There are two different types of frequency distributions:
1. Discrete Frequency Distribution: When the raw data relate to a discrete variable, we
arrange the different values in ascending or descending order along with their number of
occurrences. The frequency distribution of the number of smartphones sold is shown in the
following table:
Table 6 : Array and Tallies
Number of Smart Phones    Tally       Frequency (Number of Days)
5                         |||| ||     7
6                         ||||        4
7                         ||||        4
8                         |||| |||    8
9                         |||         3
10                        |||| |      6
11                        ||          2
12                        ||          2
13                        |           1
14                        ||          2
15                        |           1
Total                                 40
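The tallying process above can be sketched in Python with `collections.Counter` from the standard library, which counts how often each distinct value occurs. The variable names are illustrative; the data are the 40 daily smartphone sales from the raw data table.

```python
from collections import Counter

# Raw data: smartphones sold on 40 consecutive days.
raw = [7, 8, 5, 10, 9, 10, 5, 12, 8, 6, 8, 12, 8, 8, 10, 15, 7, 6, 8, 8,
       10, 11, 6, 5, 10, 11, 10, 5, 9, 13, 5, 6, 9, 7, 14, 8, 7, 5, 5, 14]

# Discrete frequency distribution: value -> number of days it occurred.
frequency = Counter(raw)

# Print the values in ascending order with one stroke per occurrence.
for value in sorted(frequency):
    print(f"{value:>2}  {'|' * frequency[value]:<8}  {frequency[value]}")

total_days = sum(frequency.values())
print("Total", total_days)
```

The 40 raw observations are condensed to 11 distinct values with their frequencies, and the frequencies add back to 40.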
As the number of observations obtained gets larger, the above method to condense the data
becomes difficult and time-consuming. Thus, to further condense the data into frequency
distribution tables, we create a continuous frequency distribution.
2. Continuous Frequency Distribution: When the data relate to a continuous variable, they are
classified on the basis of class intervals which are exhaustive and mutually exclusive. This is
accomplished by performing the following steps:
Select an appropriate number of non-overlapping class intervals.
The decision on the number of class groupings depends largely on the judgment of the
individual investigator and/or the range that will be used to group the data, although there
are certain guidelines that can be used. As a general rule, a frequency distribution should
have at least five class intervals (groups), but not more than fifteen.
Determine the width of the class intervals.
Width of the class interval = (Largest value − Smallest value) / Number of class intervals
Determine class limits (or boundaries) for each class interval to avoid overlapping.
The limits of each class interval should be clearly defined so that each observation (element) of
the data set belongs to one and only one class. Each class has two limits: a lower limit and an
upper limit. The usual practice is to let the lower limit of the first class be a convenient number
slightly below or equal to the lowest value in the data set. For example, the data given below
relate to the time (in minutes) that 30 different customers had to wait at a two-wheeler service
centre:
Table 7: Customer Waiting Time (in Minutes)
11.8 3.6 16.6 4.8 8.3 8.9 9.1 7.7 2.3 12.1 6.1 6.2 11.0 10.4 13.5
10.2 8.1 11.4 6.8 9.6 19.5 15.3 12.3 8.5 15.9 18.7 7.2 5.5 14.5 11.7
Step 1: Number of class intervals taken = 6
Step 2: Class width = (19.5 – 2.3)/6 ≈ 2.87, rounded up to 3
Step 3: Class boundaries are taken as 2 and 5, 5 and 8, 8 and 11, ……, up to 17 and 20
Thus, the frequency distribution table is as follows:
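Table 8: Exclusive Frequency Distribution
Waiting Time (Minutes)    Tally       Frequency (Number of Customers)
2 and under 5             III         3
5 and under 8             IIII I      6
8 and under 11            IIII III    8
11 and under 14           IIII II     7
14 and under 17           IIII        4
17 and under 20           II          2
Total                                 30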
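The three steps of the worked example can be reproduced with a short Python sketch. It assumes, as in the example, six classes, a width rounded up to the next whole number, and a first lower limit equal to the smallest value rounded down; these are choices made to match the example, not a general rule. The variable names are illustrative.

```python
import math

# Waiting times (in minutes) of the 30 customers in Table 7.
waiting = [11.8, 3.6, 16.6, 4.8, 8.3, 8.9, 9.1, 7.7, 2.3, 12.1,
           6.1, 6.2, 11.0, 10.4, 13.5, 10.2, 8.1, 11.4, 6.8, 9.6,
           19.5, 15.3, 12.3, 8.5, 15.9, 18.7, 7.2, 5.5, 14.5, 11.7]

# Step 1: number of class intervals.
num_classes = 6

# Step 2: class width = (largest - smallest) / number of classes, rounded up.
raw_width = (max(waiting) - min(waiting)) / num_classes
width = math.ceil(raw_width)

# Step 3: lower limits start at a convenient number at or below the minimum.
start = math.floor(min(waiting))
lower_limits = [start + i * width for i in range(num_classes)]

# Count observations per class; a value equal to an upper limit
# falls in the next class (exclusive method).
counts = [0] * num_classes
for x in waiting:
    counts[int((x - start) // width)] += 1

for lower, count in zip(lower_limits, counts):
    print(f"{lower:>2} and under {lower + width:<2}  {count}")
print("Total", sum(counts))
```

Because the classification depends only on the values themselves, the order in which the observations are listed in Table 7 does not affect the result.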
There are two ways in which observations in the data set are classified on the basis of class
intervals, namely the exclusive method and the inclusive method.
Exclusive Method (above example): When the data are classified in such a way that the upper
limit of a class interval is the lower limit of the succeeding class interval, the classification is
said to follow the exclusive method. In this method, any observation equal to the upper limit
of a class interval is included in the next class interval. For example, in the above problem the
observation 11 is not included in the class “8 and under 11” but in the class “11 and under 14”.
Inclusive Method: When the data are classified in such a way that both the lower
and upper limits of a class interval are included in the interval itself, the classification is said
to follow the inclusive method. The frequency distribution formed by this method looks
like the following table:
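The difference between the two methods comes down to whether the upper limit is tested with "less than" or "less than or equal to". The sketch below makes this explicit. The function names are invented for this illustration, and the inclusive classes 5–7, 8–10, 11–13, 14–16 are a hypothetical grouping for a whole-number variable; they are not taken from the tables in these notes.

```python
def exclusive_class(x, lower_limits, width):
    """Return (lower, upper) of the class containing x; the upper limit is excluded."""
    for lower in lower_limits:
        if lower <= x < lower + width:
            return (lower, lower + width)
    return None


def inclusive_class(x, classes):
    """Return the (lower, upper) class containing x; both limits are included."""
    for lower, upper in classes:
        if lower <= x <= upper:
            return (lower, upper)
    return None


# Exclusive classes from the waiting-time example: 2-5, 5-8, ..., 17-20.
print(exclusive_class(11.0, [2, 5, 8, 11, 14, 17], 3))   # (11, 14)

# Hypothetical inclusive classes for a whole-number variable.
inclusive_classes = [(5, 7), (8, 10), (11, 13), (14, 16)]
print(inclusive_class(10, inclusive_classes))            # (8, 10)
```

With the exclusive method the value 11.0 goes to "11 and under 14", whereas with inclusive classes a value equal to an upper limit, such as 10, stays in its own class.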
Table 9 : Inclusive Frequency Distribution
4.5 – 8.5 22
8.5 – 12.5 13
12.5 – 16.5 8
16.5 – 20.5 2
Cumulative frequency distribution:
Cumulative frequency distribution is used to determine the number of observations that lie
above (or below) a particular value in a data set. The cumulative frequency is calculated using a
frequency distribution table. There are two types of cumulative frequency distributions:
1. Less than type Cumulative Frequency Distribution
The less than cumulative frequencies are related to upper limits of the classes and form an
increasing sequence. For this type of distribution, the cumulative frequency is calculated by
adding each frequency from a frequency distribution table to the sum of its predecessors.
The last value will always be equal to the total for all observations, since all frequencies will
already have been added to the previous total. For example, the less than type cumulative
frequency distribution obtained from the frequency distribution given in Table 8 will be as follows:
Table 11: Less than Type Cumulative Distribution
Class Interval      Frequency    More than Type    Cumulative Frequency
5 and under 8       6            More than 5       27 (6 + 8 + … + 2)
8 and under 11      8            More than 8       21 (8 + 7 + … + 2)
11 and under 14     7            More than 11      13 (7 + 4 + 2)
14 and under 17     4            More than 14      6 (4 + 2)
17 and under 20     2            More than 17      2 (2)
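Both kinds of cumulative frequency are running totals, taken from the top for the less than type and from the bottom for the more than type. A minimal sketch using `itertools.accumulate` from the standard library follows. The class frequencies 3, 6, 8, 7, 4, 2 are those obtained by grouping the 30 waiting times into classes of width 3 starting at 2; the variable names are illustrative.

```python
from itertools import accumulate

lower_limits = [2, 5, 8, 11, 14, 17]
upper_limits = [5, 8, 11, 14, 17, 20]
frequencies = [3, 6, 8, 7, 4, 2]

# Less than type: add each frequency to the sum of its predecessors.
less_than = list(accumulate(frequencies))

# More than type: add each frequency to the sum of its successors.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for upper, cf in zip(upper_limits, less_than):
    print(f"Less than {upper:>2}: {cf}")
for lower, cf in zip(lower_limits, more_than):
    print(f"More than {lower:>2}: {cf}")
```

The less than series is related to the upper limits and increases to the total of 30, while the more than series is related to the lower limits and decreases from 30.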
TABULATION
Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is
easily understood and an investigator is quickly able to locate the desired information. A Table is a
systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it
possible for the investigator to present a huge mass of data in a detailed and orderly form. It facilitates
comparison and often reveals certain patterns in data, which are otherwise not obvious. Before
tabulation, data are classified and then displayed under different columns and rows of a table.
Advantages of Tabulation
Statistical data arranged in a tabular form serve the following objectives:
1. It simplifies complex data, and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages, dispersion, correlation etc.
4. It presents facts in minimum possible space and unnecessary repetitions and explanations are
avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references and they make it easier to present the information in the
form of graphs and diagrams.
Preparing a Table
The making of a compact table is itself an art. This should contain all the information needed within the
smallest possible space. What the purpose of tabulation is and how the tabulated information is to be
used are the main points to be kept in mind while preparing for a statistical table. An ideal table should
consist of the following main parts:
1. Table number: A table should be numbered for easy identification and reference in future. The table
number may be given either in the centre or side of the table but above the top of the title of the
table. If the number of columns in a table is large, then these can also be numbered so that easy
reference to these is possible.
2. Title of the table: Each table must have a brief, self-explanatory, and complete title which can
indicate the nature of data contained;
explain the locality (i.e., geographical or physical) of data covered;
indicate the time (or period) of data obtained;
contain the source of the data to indicate the authority for the data, as a means of verification
and as a reference. The source is always placed below the table.
3. Caption and stubs: The headings for columns and rows are called caption and stub, respectively.
They must be clear and concise.
4. Body: The body of the table should contain the numerical information. The numerical information is
arranged according to the descriptions given for each column and row.
5. Prefatory or head note: If needed, a prefatory note is given just below the title, in a prominent
type, to describe it further. It is usually enclosed in brackets and states the unit of
measurement.
6. Footnotes: Anything written below the table is called a footnote. It is written to further clarify
the title, captions, or stubs. For example, if the data described in the table pertain to profits earned by
a company, then the footnote may define whether it is profit before tax or after tax. There are
various ways of identifying footnotes:
numbering footnotes consecutively with small numbers 1, 2, 3, …, or letters a, b, c, …, or stars *, **,
…, or symbols like @ or $.
7. Sources of data: The source of data should be given below the table.
A model structure of the table consisting of above parts is given below:
Table 14 : Newspaper Preferred by Various Occupations
                    Occupation                                      (Caption)
Newspaper*          Private Sector   Public Sector   Self           (Sub-captions)
(Stub)              Employee         Employee        Employed
Hindustan Times     43               18              51
Indian Express      21               38              22
The Hindu           15               37              20
Times of India      29               27              33
(The newspaper names are the sub-stubs; the numerical entries form the body.)
Footnote - * All newspapers are in English language
Source Note - PEW Research Center’s Indian Life Project, Tracking Study, July 25–August 26, 2011
Type of Tables
Tables can be classified according to their purpose, stage of enquiry, nature of data or number of
characteristics used. On the basis of the number of characteristics, tables may be classified as follows:
1. Simple or one-way table: A simple or one-way table is the simplest table, which contains data on
one characteristic only. A simple table is easy to construct and simple to follow. For example, the
following table shows the number of readers of various English newspapers in a locality.
Table 15: Readers of various Newspapers
                    Occupation
Newspaper*          Private Sector   Public Sector   Self
                    Employee         Employee        Employed
Hindustan Times     43               18              51
Times of India      29               27              33
Indian Express      21               38              22
The Hindu           15               37              20
3. Manifold table: A table, in which more than two characteristics of data are considered, is called a
manifold table. For instance, the table below shows three characteristics, namely, occupation,
newspaper and marital status.
Table 17: Newspaper Readers from different Occupations and Marital status
                    Occupation
Newspaper*          Private Sector    Public Sector    Self
                    Employee          Employee         Employed
                    M      S          M      S         M      S
Hindustan Times     21     22         8      10        40     11
Times of India      20     9          20     7         20     13
Indian Express      12     9          20     18        13     9
The Hindu           10     5          25     12        15     5
Foot note- M stands for married and S for single.
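The three kinds of table are related: summing a manifold table over one characteristic gives a two-way table, and summing that over another gives a one-way table. The sketch below shows this with the newspaper data from Table 17. The nested-dictionary layout and the variable names are choices made for this example, and the percentages are rounded to whole numbers.

```python
# Manifold table: readers[newspaper][occupation] = (married, single).
readers = {
    "Hindustan Times": {"Private": (21, 22), "Public": (8, 10), "Self": (40, 11)},
    "Times of India": {"Private": (20, 9), "Public": (20, 7), "Self": (20, 13)},
    "Indian Express": {"Private": (12, 9), "Public": (20, 18), "Self": (13, 9)},
    "The Hindu": {"Private": (10, 5), "Public": (25, 12), "Self": (15, 5)},
}

# Two-way table: add married and single readers within each occupation.
two_way = {
    paper: {occupation: m + s for occupation, (m, s) in by_occupation.items()}
    for paper, by_occupation in readers.items()
}

# One-way table: add the occupations to get total readers per newspaper.
one_way = {paper: sum(by_occupation.values()) for paper, by_occupation in two_way.items()}

# Percentage share of each newspaper in the total readership.
grand_total = sum(one_way.values())
percentages = {paper: round(100 * n / grand_total) for paper, n in one_way.items()}

print(two_way)
print(one_way)
print(percentages)
```

Each step discards one characteristic, so a simpler table can always be derived from a more detailed one, but not the other way round.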
Figure: Number of Readers of Popular Newspapers (bar chart). Hindustan Times 112, Times of India 89,
Indian Express 81, The Hindu 72.
Figure: Percentage distribution of Readers (pie chart). Indian Express 23%, Times of India 25%.
Figure 9: Frequency Distribution of No. of Smartphones Sold in 40 Days (horizontal axis: No. of
Smartphones, 5 to 15; vertical axis: Frequency (No. of Days))
6. Histogram: For a frequency distribution of continuous quantitative data, one of the most useful
graphs is a histogram. It is a diagram consisting of rectangles whose area is proportional to the
frequency of a variable and whose width is equal to the class interval.
Making a Histogram Using a Frequency Distribution Table
To make a histogram, follow these steps:
1. On the vertical axis (Y – axis), place frequencies. Label this axis "Frequency".
2. On the horizontal axis (X – axis), place the lower value of each interval.
3. Draw a bar extending from the lower value of each interval to the lower value of the next
interval. The height of each bar should be equal to the frequency of its corresponding
interval.
Example: A histogram showing the frequency distribution of the waiting time of customers at a
two-wheeler service station (see table 13)
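The construction steps can be sketched without any plotting library by printing one text bar per class. This is a rough stand-in for a drawn histogram, not a replacement for one. The frequencies 3, 6, 8, 7, 4, 2 are those obtained by grouping the 30 waiting times into classes of width 3 starting at 2; the variable names are illustrative.

```python
# Class boundaries and frequencies for the customer waiting times.
boundaries = [2, 5, 8, 11, 14, 17, 20]
frequencies = [3, 6, 8, 7, 4, 2]

# Each bar extends from the lower value of an interval to the lower value
# of the next interval; its height is the frequency of that interval.
bars = list(zip(boundaries[:-1], boundaries[1:], frequencies))

for lower, upper, freq in bars:
    print(f"{lower:>2} - {upper:<2} | {'#' * freq} {freq}")

# The tallest bar marks the class where most observations are located.
modal_class = max(bars, key=lambda bar: bar[2])
print("Tallest bar:", modal_class)
```

Since all classes here have the same width, bar height and bar area are proportional, so reading heights as frequencies is safe.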
Figure 10: Histogram (horizontal axis: Waiting Time (Min), class boundaries 2, 5, 8, 11, 14, 17, 20;
vertical axis: Frequency)
A Histogram provides a visual representation so you can see where most of the measurements
(values of the variable) are located and how spread out they are. In the above histogram most of
the values are located in the interval 8 – 11 and they are almost symmetrically distributed on both
the sides of this interval.
7. Frequency Polygon
We create a frequency polygon from a histogram. If the middle top points of the bars of the
histogram are joined, a frequency polygon is formed. A frequency polygon and a histogram serve the
same purpose. However, the former is more useful for comparing different data sets. Following is
an example of a frequency polygon created from the histogram (Figure 15) of waiting times of
customers.
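The points of a frequency polygon can be computed directly from the class boundaries and frequencies. In the sketch below the polygon is closed with zero-frequency points one class width beyond each end; this is a common convention and an assumption here, not something stated in the text. The frequencies are those of the waiting-time distribution (classes of width 3 starting at 2), and the variable names are illustrative.

```python
boundaries = [2, 5, 8, 11, 14, 17, 20]
frequencies = [3, 6, 8, 7, 4, 2]
width = boundaries[1] - boundaries[0]

# Middle top point of each bar: (class midpoint, class frequency).
midpoints = [(lower + upper) / 2 for lower, upper in zip(boundaries[:-1], boundaries[1:])]

# Assumed convention: close the polygon on the horizontal axis with
# zero-frequency points one class width before and after the data.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

polygon = list(zip(xs, ys))
print(polygon)
```

Joining these points in order with straight lines gives the polygon; several such polygons can be drawn on the same axes to compare data sets.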
Figure 11: Frequency Polygon (horizontal axis: class midpoints 0.5, 3.5, 6.5, 9.5, 12.5, 15.5, 18.5,
21.5; vertical axis: Frequency)
8. Ogive
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency
distribution. An ogive uses class boundaries along the horizontal scale and cumulative frequencies
along the vertical scale. Ogives are useful for determining the number of values below or above
some particular value. The following two diagrams represent a less than type ogive and a more than
type ogive created from the cumulative frequency distributions given in table 11 and table 12:
Figure 12 : Ogive (two panels, less than type and more than type; horizontal axis: Waiting Time
(Minutes), 2 to 20; vertical axis: Cumulative Frequency, 0 to 30)
9. Scatter Plot
A scatterplot (or scatter diagram) is a plot of paired (x, y) quantitative data with a horizontal x-axis
and a vertical y-axis. The horizontal axis is used for the first (x) variable, and the vertical axis is
used for the second variable. The pattern of the plotted points is often helpful in determining
whether there is a relationship between two quantitative variables such as price and demand, sales
and price, temperature and volume, etc. For example, doctors are interested in the possible
relationship between the dosage of a medicine and the time required for a patient’s recovery. The
following table shows, for a sample of 10 patients, dosage levels (in grams) and recovery times (in
hours).
Table 18
Dosage level     1.2   1.3   1.0   1.4   1.5   1.8   1.2   1.3   1.4   1.3
Recovery time    25    28    40    38    10    9     27    30    16    18
We can describe the data graphically with a scatter plot as shown below:
Figure 13: Dosage Level and Recovery time (horizontal axis: Dosage Level; vertical axis: Recovery
Time (Hours))
The above graph shows that the recovery time decreases as the dosage level increases i.e. we can
say that there is a relationship between the two variables.