Chapter Three Data Organization and Classification and Presentation

The document outlines the processes involved in data organization, classification, and presentation, emphasizing the importance of data processing, editing, coding, and cleaning. It details the steps necessary for accurate data handling, including editing for completeness, consistency, accuracy, and legibility, as well as the methods for coding and classifying data. Additionally, it discusses the significance of proper data entry and classification to facilitate analysis and ensure reliable research outcomes.

Uploaded by

Dagim Taye
Copyright
© © All Rights Reserved

Chapter Three

Data organization, classification and presentation

BY: MOHAMMED BESHIR
DATA PROCESSING
 Once the researcher has collected data from the respondents, or even while data collection is under way, he or she has to reduce the mass of data to a form suitable for analysis.
 This data reduction is usually referred to as data processing.
 Data processing can be done either manually or by using a computer.
 It involves editing and cleaning, coding, classification, and tabulation of the collected data so that they are amenable to analysis.
 Accordingly, we have the following four major phases:
 data processing (editing, coding)
 classification of data
 tabulation of data (statistical table): data presentation
 charting of data (statistical chart): data presentation
Data preparation: entry, coding, editing and cleaning
3.1.1. Editing
 Once the data have been collected, the next task is editing.
 Editing is the process of examining the collected data for errors and omissions and making the necessary corrections.
 Editing should be done with care by experienced persons to ensure that the data are accurate, consistent with the other facts gathered, uniformly entered, as complete as possible, and well arranged to facilitate coding and tabulation.
 The editing can be made either by the respondent or by the interviewer.
 Where collateral corrections are to be made, the editor's changes to the edited questionnaires should be kept distinct from the changes made by the respondent or by the interviewer.
 The editor can ensure this by using a different colored pencil for editing the raw data.
 Editing involves a careful scrutiny of completed questionnaires or schedules.
 This is needed because, in questionnaires and schedules, answers may not be ticked in the proper place, some questions may be left unanswered, or answers may be given in a form that needs to be reconstructed into the categories designed for analysis.
 Example: converting daily or monthly income into annual income.
 One should edit the following major aspects of the questionnaire (data):
A. Editing for completeness
The editor should see that each schedule or questionnaire is complete in all respects, that is, that an answer to each and every question has been furnished.
B. Editing for consistency
This refers to checking whether the answers to related questions contradict one another.
If there are mutually contradictory answers, the editor should try to obtain the correct answer either by referring back to the questionnaire or, whenever possible, by contacting the informant in person.
C. Editing for accuracy
The editor should see that the information is correct in all respects, as problems of accuracy adversely affect the reliability of inference.
D. Editing for homogeneity
 By homogeneity we mean that all the questions have been understood in the same sense by different respondents.
 The editor must check the various questions carefully.
For example, if some respondents have given monthly income, others annual income and still others weekly income, no comparisons can be made between them unless they are converted to one common unit.
E. Editing for legibility
The recorded data must be legible so that they can be coded later.
An illegible response may be corrected by getting in touch with the person who recorded it, or alternatively it may be inferred from the other parts of the questionnaire.
 There are two types of editing: field editing and central editing.
A. Field editing consists of the review of the reporting forms by the investigator in order to complete (translate or rewrite) what he or she wrote in abbreviated and/or illegible form at the time of recording the respondent's responses.
 This type of editing is necessary because an individual's writing style can often be difficult for others to decipher.
 This type of editing should be done as soon as possible after the interview, as it may sometimes be necessary to rely on memory.
 While doing so, care should be taken that the investigator does not correct errors of omission by simply guessing what the respondent would have answered if the question had been put to him or her.
B. Central editing should be done when all forms or schedules have been completed and returned to the office.
 At this stage, editing should be done by a single editor (for a single study) and/or by a group of editors.
 While editing, the editor may correct obvious errors, such as an entry in the wrong place or an entry recorded in months when it should have been recorded in weeks.
 Sometimes inappropriate or missing replies can also be determined by the editor by reviewing the other information recorded in the schedules.
 If necessary, the respondent may be contacted for clarification.
 All replies that are obviously incorrect must be deleted from the schedules.
 While editing, the editor should be familiar with the instructions and the codes given to the interviewers.
 Any new entry made by the editor should be in some distinctive form and should be initialed.
Data Cleaning
 Real-world application data can be incomplete, noisy, and inconsistent
 No recorded values for some attributes
 Not considered at time of entry
 Random errors
 Irrelevant records or fields
 Data cleaning attempts to:
 Fill in missing values
 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data
Dealing with Missing Values
 Data is not always available (missing attribute values in records)
 equipment malfunction
 deleted due to inconsistency or misunderstanding
 not considered important at time of data gathering
 Handling missing data
 Ignore the record with missing values;
 Fill in the missing values manually;
 Use a global constant to fill in missing values (NULL, unknown, etc.);
 Use the attribute mean to fill in the missing values of that attribute;
 Use the attribute mean of all samples belonging to the same class to fill in the missing values;
 Infer the most probable value to fill in the missing value.
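Three of the strategies listed above (a global constant, the attribute mean, and the mean within the same class) can be sketched in Python as follows. The records, the field names `group` and `income`, and the helper function names are hypothetical and serve only as an illustration; this is a minimal sketch, not a prescribed procedure.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing income value.
records = [
    {"group": "urban", "income": 400},
    {"group": "urban", "income": None},
    {"group": "urban", "income": 600},
    {"group": "rural", "income": 200},
    {"group": "rural", "income": None},
    {"group": "rural", "income": 300},
]

def fill_with_constant(rows, key, constant="unknown"):
    """Replace missing values of `key` with a global constant."""
    return [{**r, key: constant if r[key] is None else r[key]} for r in rows]

def fill_with_mean(rows, key):
    """Replace missing values of `key` with the mean of the observed values."""
    overall = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: overall if r[key] is None else r[key]} for r in rows]

def fill_with_class_mean(rows, key, class_key):
    """Replace missing values of `key` with the mean of the same class."""
    class_means = {}
    for label in {r[class_key] for r in rows}:
        observed = [r[key] for r in rows
                    if r[class_key] == label and r[key] is not None]
        class_means[label] = mean(observed)
    return [{**r, key: class_means[r[class_key]] if r[key] is None else r[key]}
            for r in rows]

by_constant = fill_with_constant(records, "income")
by_mean = fill_with_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "group")
```

With these records the overall mean is 375, while the class means are 500 (urban) and 250 (rural), so the two mean-based strategies fill the gaps differently. A class with no observed values at all would make `mean` raise an error, so such classes need separate handling.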
3.1.2. Coding:
 Coding refers to the process of assigning numerals, letters or other symbols (or a combination of them) to answers so that responses can be put into a limited number of categories or classes.
 It permits the transfer of data from the survey to the computer.
 The classes should be appropriate to the research problems being studied.
 These classes must possess the characteristics of exhaustiveness (there must
be a class for every data item) and also that of mutual exclusivity which
means that a specific answer can be placed in one and only one cell in a
given category set.
 Further, every class must be defined in terms of only one concept.
 Coding is necessary for efficient analysis of data: through it, several replies may be reduced to a small number of classes which contain the critical information required for analysis.
 Coding decisions should usually be taken at the designing stage of the questionnaire so that likely responses to questions are pre-coded. This simplifies computer tabulation of the data for further analysis.
 Coding open-ended questions is more tedious than coding closed questions.
 For closed-ended or structured questions, the coding scheme is very simple and is designed prior to the field work.
 In the case of hand coding, it is possible to code in the margin of the questionnaire with a colored pencil or to transcribe the data from the questionnaire to a coding sheet.
 In coding, categories are the partitions of a data set of a given variable. For example, if
the variable is gender, the partitions are male and female.
 Categorization is the process of using rules to partition a body of data.
 Both closed and free-response questions must be coded.
E.g., closed-ended question:
1 [ ] Yes
2 [ ] No
Or
Less than 200 [ ] 001
201-699 [ ] 002
...
1500 and more [ ] 006
 Most software programs work more efficiently in the numeric mode.
 Instead of entering the word male or female in response to a question that asks for the identification of one's gender, we would use numeric codes, e.g., 0 for male and 1 for female.
 Numeric coding simplifies the researcher's task in converting a nominal variable, like gender, to a “dummy variable”.
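As a rough illustration of numeric coding, the sketch below maps raw answers to the codes mentioned above (0 for male, 1 for female; 1 for Yes, 2 for No). The codebook dictionaries and the function `code_response` are hypothetical names introduced for this example.

```python
# Hypothetical codebooks: each answer maps to exactly one numeric code
# (mutual exclusivity), and an answer without a code is reported as an error
# (a check on exhaustiveness).
GENDER_CODES = {"male": 0, "female": 1}
YES_NO_CODES = {"yes": 1, "no": 2}

def code_response(answer, codebook):
    """Return the numeric code for a raw answer; raise if no class exists."""
    key = answer.strip().lower()
    if key not in codebook:
        raise ValueError(f"no code defined for answer {answer!r}")
    return codebook[key]

raw_gender = ["Male", "female", "Female", "male"]
coded_gender = [code_response(a, GENDER_CODES) for a in raw_gender]
```

Because gender is coded as 0 or 1, the coded column can be used directly as a dummy variable in later analysis.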
3.1.3. Data Entry
 Data entry converts information gathered by secondary or primary methods to a medium for reviewing and
manipulation.
 Keyboarding remains a mainstay for researchers who need to create a data file immediately and store it in a
minimal space on a variety of media.
 Keyboarding: A full screen editor, where an entire data file can be edited or browsed, is a viable means of
data entry for statistical packages like SPSS or SAS.
 SPSS offers several data entry products, including Data Entry Builder, which enables the development of forms and surveys, and Data Entry Station, which gives centralized entry staff, such as telephone interviewers or online participants, access to the survey.
 Both SAS and SPSS offer software that accesses data from databases, spreadsheets, data warehouses, or data marts.
Example for SPSS software
 You may create a data file using one of your favorite text editors, or word
processing packages (e.g., MS-Word).
 You may enter your data into a spreadsheet or database program (e.g., Excel, dBASE) and read it directly into SPSS for Windows.
 Finally, you may enter the data directly into the spreadsheet-like Data Editor of SPSS for Windows.
 In this document we are going to examine one data entry method: using the Data Editor of SPSS for Windows.
Data Classification:
 Most research studies result in a large volume of raw data, which must be reduced into homogeneous groups. This means classifying the raw data, that is, arranging the data in groups or classes on the basis of common characteristics.
 Data having common characteristics are placed in one class, and in this way the entire data set gets divided into a number of groups or classes.
3.2. Classification of Data
 Classification is the process of arranging things in groups or classes according to their resemblance (on the basis of common characteristics), especially for studies with a large volume of raw data.
 Classification can be done according to common attributes that describe the items (such as literacy, sex, honesty, etc.) or according to other characteristics.
 Data having a common characteristic are placed in one class, and in this way the entire data set gets divided into a number of groups or classes.
 Based on an attribute, a researcher can classify items into a class of items that possess the given attribute and a class of items that do not.
 Classification can be one of the following two types, depending upon
the nature of the phenomenon involved:
i. Classification according to attributes: Data are classified on the basis of common characteristics, which can be either descriptive (such as literacy, sex, honesty, etc.) or numerical (such as weight, age, height, income, expenditure, etc.).
 Descriptive characteristics refer to qualitative phenomena, which cannot be measured quantitatively: only their presence or absence in an individual item can be noticed.
ii. Classification according to class interval:
 Unlike descriptive characteristics, numerical characteristics refer to quantitative phenomena, which can be measured through some statistical unit.
 Data relating to income, production, age, weight, etc. come under this category. Such data are known as statistics of variables and are classified on the basis of class intervals.
 For instance, persons whose incomes are within Br. 201-400 can form one group, those whose incomes are within Br. 401-600 can form another group, and so on.
 Similarly, individuals whose incomes are within 500-1000 Birr can form one group, those whose incomes are within 1001-1500 Birr can form another group, and so on.
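Classification by class interval can be sketched as a simple lookup. The intervals Br. 201-400 and Br. 401-600 come from the text; the income values and the function name `classify_by_interval` are hypothetical.

```python
def classify_by_interval(value, intervals):
    """Return the (lower, upper) inclusive interval containing value, or None."""
    for lower, upper in intervals:
        if lower <= value <= upper:
            return (lower, upper)
    return None

income_classes = [(201, 400), (401, 600)]   # class intervals from the text
incomes = [250, 590, 401, 320]              # hypothetical incomes in Birr

# Group every income under the class interval it falls into.
groups = {}
for income in incomes:
    groups.setdefault(classify_by_interval(income, income_classes), []).append(income)
```

An income outside every listed interval is grouped under `None`, which signals that the set of classes is not exhaustive for the data.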
 The main objectives of classification are:
 To bring out the unity of attributes out of the diversities persistent in the collected data
 To enable one to make comparisons easily and draw inferences
 To bring out points of similarity and difference
 To enable one to form a mental picture of the objects of measurement
 To make proper use of the collected data
 To give prominence to the important information gathered while dropping out the unnecessary
3.3.1 Characteristics of Ideal Classification
 There are no hard and fast rules for the classification of data. Technically, the classification of data should take into consideration the nature, scope and purpose of the enquiry. But an ideal classification should possess the following characteristics.
 It should be unambiguous. Classification is meant to remove ambiguity, so the various classes should be defined in such a way that there is no room for doubt.
 It should be stable. If a classification is not stable and has to be changed each time an enquiry is conducted, the data will not be fit for comparison. Stability of classification does not mean rigidity of classes; the term is used in a relative sense. With the passage of time, socio-economic conditions change, so some classes become obsolete and have to be dropped, while fresh classes have to be added.
 It should be flexible. A good classification should have the capacity to adjust to new situations and circumstances. An ideal classification is one that can adjust itself to different changes and yet retain its stability.
 3.3.2 Bases of Classification
 Statistical data are classified on the basis of the characteristics possessed by different groups or units of a universe. These characteristics give expression to the unity of attributes which may be traced in the diversity of individual units.
 Broadly speaking, data can be classified on the following four bases.
A. Geographical: place or area wise
 Geographical classifications are arrangements according to places such as continents, regions, and countries. Example:
Region    Common name
1         Tigray
2         Afar
3         Amhara
4         Oromia
B. Chronological: time as a basis
 Data are classified based on time (day, month, year, etc.). For instance, the amount of sugar produced and sold by Wonji sugar factory from 2001-2007, in tonnes:
Year     2000   2001   2003   2004   2005   2006   2007
Sales    200    300    400    500    600    700    900
C. Qualitative: according to some attribute
 Classification of data is based on some attribute or quality such as sex, color of hair, religion, etc.
 The attributes cannot be measured: one can only find out the presence or absence of the attribute in question.
 For instance, if the attribute under study is blindness, two classes can be formed from the qualitative (categorical) observations: blind and non-blind.
 That is, one can see whether a person considered for the study is blind or not. Furthermore, we can find how many persons are blind in a given population.
 Similarly, the employees in a factory can be categorized as educated and uneducated.
D. Quantitative classification
 This refers to the classification of data according to some characteristic that can be measured, such as height, weight, income, sales, etc. For example, the workers of a factory may be classified according to wages as follows:
Monthly wages    No. of workers
4000-4500        50
4500-5000        200
5000-5500        260
 One can also classify families by number of children:
No. of children    No. of families
0                  10
1                  400
2                  800
3                  700
4                  250
5                  150
6                  50
Total              2,360
 3.3.3 The data array and frequency distribution
 Collected data include many observations, too many to summarize or to draw inferences from directly.
 When the raw data have been collected, they should be put into an ordered array, in ascending or descending order, so that they can be looked at more objectively.
 Then it becomes necessary to organize the mass of data so that they are reduced to meaningful proportions. The usual mechanism for organizing data is the frequency distribution.
 Frequency is the number of individuals having a particular characteristic, or the number of values in a specific class of a distribution.
 A frequency distribution is the arrangement of the variable of interest along with the corresponding frequencies in tabular form. Stated differently, it is the organization of raw data in a table using classes and corresponding frequencies for grouped data. Frequency distributions differ depending on the type of variable and the type of data.
 Based on the type of data, we can have two types of frequency distribution:
 I. Frequency distribution for discrete series
 II. Frequency distribution for continuous series
 We will see how a frequency distribution can be prepared for each type of data.
 3.3.3.1. Frequency distributions for discrete series
(Ungrouped Frequency Distribution)
 Discrete variables are variables whose value can take single number not
interval.
 For example height takes one specific value. Height of Ababa is 1.75
meter. The number of student in class take definite number such one,
100, 10000
 Given the raw data one can prepare frequency distribution for such data. These are by
preparing data array. An order Array: is a listing of values of collection in order of
magnitude from the smallest values to the largest value. Steps in forming frequency
distribution is
 arranging data in ascending or descending order
 Then counting those with same value and preparing a column of tally
 the frequency table
 Example one:
 The following data are the ages of 20 women who attended health education in a certain hospital. Form a frequency distribution for these data.
 30, 25, 23, 41, 39, 27, 41, 24, 32, 29, 35, 31, 36, 33, 36, 42, 35, 37, 41, and 29
 i. Ordered array in ascending order
 23, 24, 25, 27, 29, 29, 30, 31, 32, 33, 35, 35, 36, 36, 37, 39, 41, 41, 41, 42
ii. Counting those with the same value
Age = x    Tally    Frequency
23         I        1
24         I        1
25         I        1
27         I        1
29         II       2
30         I        1
31         I        1
32         I        1
33         I        1
35         II       2
36         II       2
37         I        1
39         I        1
41         III      3
42         I        1
Total               20
iii. Prepare frequency table
Age Frequency
23 1
24 1
25 1
27 1
29 2
30 1
31 1
32 1
33 1
35 2
36 2
37 1
39 1
41 3
42 1
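The three steps of Example one (ordered array, counting, frequency table) can be reproduced with a short Python sketch. The data are the 20 ages from the example; using `collections.Counter` for the counting step is an implementation choice made here, not part of the original method.

```python
from collections import Counter

ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        35, 31, 36, 33, 36, 42, 35, 37, 41, 29]

ordered_array = sorted(ages)          # step i: ascending order
frequency = Counter(ordered_array)    # step ii: count observations with equal values

# step iii: print the frequency table with a simple tally column
for age in sorted(frequency):
    print(f"{age:>3}  {'I' * frequency[age]:<4} {frequency[age]}")
print("Total", sum(frequency.values()))
```

The resulting counts agree with the table above, for example three women aged 41 and a total of 20 observations.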
 Example Two
 Suppose there is a class of 30 economics students. Each student in the class is asked to toss a coin five times and record each time whether he/she gets a head or not. The number of heads out of five tosses for each of the 30 students is presented as follows.
 3,2,0,4,1,2,3,5,3,3,1,1,3,5,4,2,2,1,0,4,3,2,2,4,2,3,3,1,5,3
 Prepare a frequency distribution for the coin-tossing experiment.
 Solution
 i. Arrange the data in an orderly manner
 Such data need a better display. One way of doing this is to show the number of heads in a certain order. For instance, we may show the same data in ascending order as follows.
 0,0,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,5
 The same data can also be arranged in descending order.
 ii. Counting and tally
 The data show that each figure from 0 to 5 has occurred a certain number of times. We can condense the data by pairing each of those values with its corresponding frequency.
 Table 3.1 Frequency distribution for the number of heads obtained by 30 students
Observation (X)    Tally        Frequency
0                  //           2
1                  /////        5
2                  ////////     8
3                  /////////    9
4                  ////         4
5                  //           2
Total                           30
 iii. Frequency table

Observation (X) Frequency


0 2
1 5
2 8
3 9
4 4
5 2
 Exercise
i. Prepare a frequency distribution for the following ungrouped observations:
 5, 6, 7,4,5,7,8,9,8,6,5,3,4,5,6,7,8,7,6,5,4,3,2,1, . . .
ii. Show the frequency distribution for the data on the number of children in 15 families:
1 0 3 2 0
2 4 1 3 1
4 1 2 2 3
 3.3.3.2. Frequency distribution and continuous case (variable)
 We have seen frequency distribution for discrete case. If the mass of data
is very large say 200 or 300 it is quite difficult to apply discrete case above.
 Furthermore, sometimes variables do not assume specific values rather
somewhat continuous values over certain intervals. For example, the
temperatures of a given city take value between 30 c0 to 40c0. Furthermore,
the age of employee can be presents in internal form
Age ( year) No of employees
 20-25……………………………10
 25-30…………………………….15
 30-35…………………………….40
 35-40…………………………….45
 40-45……………………………26
 45-50…………………………….4
 140
 For mass of data in either case frequency table can be prepared.
 Hence, it is necessary to condense the data in to appropriate number of
classes or groups of value of the variable and indicate the number of
observed values that fall into each class. We now turn to the formation
of a frequency table when data are continuous by grouping value of
variables.
Advantages of Grouping
 It provides information about the range of the data.
 It gives an impression of which values are frequent and which are infrequent.
 It provides data that can be easily used for graphical representation.
Disadvantages of Grouping
 Information may be lost since individual values are not displayed.
 Some things that can be determined from the original data cannot be determined from the grouped data.
3.3.3.2.1. Common Terminology in a Grouped Frequency Distribution (GFD)
 Before studying frequency distributions, we should have a clear idea of certain terms which we shall come across frequently.
 1. Class: a group of values of a variable between two specified numbers. The data in the table below are grouped into four classes: the first class is 1-25, the second 26-50, the third 51-75, and the fourth 76-100.
Class No.    Classes    Frequency
1            1-25       3
2            26-50      10
3            51-75      18
4            76-100     6
 2. Range (R): the difference between the largest (L) and smallest (S) values in the data.
$R = L - S$
 3. Class frequency
 The number of observations belonging to a particular class is the class frequency. In the table above, the frequencies of the 1st, 2nd, 3rd and 4th classes are 3, 10, 18, and 6 respectively.
 Suppose 20 students have obtained marks ranging from 30-40 and 44 students have obtained marks ranging from 50-60; then the frequency of each class is:
Range    Frequency
30-40    20
50-60    44
 4. Frequency distribution
 A frequency distribution is the arrangement of the values of a variable into groups along with the corresponding number of observations in each group (the frequency). It is usually formed using class intervals.
 5. Class limit
 The class limits are the lowest and the highest values of a class. For example, take the class 26-50: here the lowest class limit is 26 and the highest class limit is 50. They are denoted as the lower class limit (LCL) and the upper class limit (UCL) respectively.
 For the first class LCL=1, UCL=25
 For the second class LCL=26, UCL=50
 For the third class LCL=51, UCL=75
 For the fourth class LCL=76, UCL=100
 6. Class boundary: class boundaries are obtained by subtracting half of the unit of measurement from the lower class limit and adding half of it to the upper class limit of a class. The unit of measurement (u) is the gap between any two successive classes, i.e. the difference between the lower limit of a class and the upper limit of the preceding class. Each class thus has two boundaries, the lower class boundary (LCB) and the upper class boundary (UCB):
$LCB_i = LCL_i - \frac{1}{2}u$
$UCB_i = UCL_i + \frac{1}{2}u$
 As an example, consider the 2nd class, 26-50. Here $u = 26 - 25 = 1$, so:
$LCL_2 = 26$, $UCL_2 = 50$
$LCB_2 = 26 - \frac{1}{2}(1) = 25.5$
$UCB_2 = 50 + \frac{1}{2}(1) = 50.5$
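The boundary calculation can be checked for all four classes of the table with a few lines of Python. This is a minimal sketch that assumes inclusive classes with a constant gap u between successive classes; the function name `class_boundaries` is introduced here for illustration.

```python
def class_boundaries(limits):
    """Return (LCB, UCB) pairs for inclusive classes given (LCL, UCL) pairs."""
    # u is the gap between the upper limit of one class
    # and the lower limit of the next class
    u = limits[1][0] - limits[0][1]
    return [(lcl - u / 2, ucl + u / 2) for lcl, ucl in limits]

classes = [(1, 25), (26, 50), (51, 75), (76, 100)]
boundaries = class_boundaries(classes)
```

For the second class this gives the boundaries 25.5 and 50.5, and the upper boundary of each class coincides with the lower boundary of the next one.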
7. Exclusive method (class interval)
 In the exclusive method, the class intervals are fixed so that the upper limit of one class is the lower limit of the next class.
 This is best explained using an example. Let us take some hypothetical data, given below.
 Profits earned by companies
Profit (million birr)    No. of companies
10-20                    12
20-30                    17
30-40                    30
40-50                    25
50-60                    16
Total                    100
 In the above case, the upper limit of one class is the lower limit of the next class. For instance, 20 is the upper limit of the 1st class as well as the lower limit of the 2nd class. The same logic holds true for the remaining classes.
 Note that for the class intervals shown in the table, the upper limit is presumed to be exclusive, so an item with that value is included in the next class interval. For example, 20 is the upper limit of the 1st class and is excluded from that class, but it is included in the 2nd class, which is 20-30.
8. Inclusive method
 In the inclusive method, the upper limit of a class is included in that class itself. Suppose we have the following frequency distribution.
Profit (Birr)    No. of companies
10-19            12
20-29            17
30-39            30
40-49            25
50-59            16
Total            100
 Values that fall between two classes with a decimal part greater than 0.5 should be placed in the upper class and the others in the lower class.
Note that
 To adjust the class limits, we take the difference between the lower limit of a class and the upper limit of the preceding class. In our case, 20 - 19 = 1 is the gap between the two limits.
 Dividing it by two gives 0.5, which is termed the correction factor. One can convert the inclusive classes into exclusive classes by deducting 0.5 from the lower limits of all classes and adding 0.5 to the upper limits. The adjusted classes would then be 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5 and 49.5-59.5.
9. Class interval (width) or size of class
 The width of a class is the difference between its upper and lower limits or, for inclusive classes, between its upper and lower class boundaries. For the class 26-50 above, the width is 25, i.e. 50.5 - 25.5.
$W = UCL_i - LCL_i$ or $W = UCB_i - LCB_i$
 For grouped data we form classes of the same width (interval), which can be approximated using the range and the desired number of classes:
Class width = Range / number of classes, i.e. $W = \frac{R}{k} = \frac{L - S}{k}$
 Sturges' formula gives a mechanism for determining the approximate number of classes:
$k = 1 + 3.322 \log_{10} N$
 Remark
 If both the LCL and the UCL are included in a class, it is called an inclusive class. For inclusive classes, the class width is $W = UCB_i - LCB_i$.
 If the LCL is included and the UCL is not included in a class, it is called an exclusive class. For exclusive classes, $W = UCL_i - LCL_i$.
 10. Class midpoint (class mark)
 The class midpoint is the value lying halfway between the lower and upper class limits of a class interval. It can be obtained as follows:
$\text{Midpoint} = \frac{\text{lower limit of the class} + \text{upper limit of the class}}{2}$
 The midpoint of each class interval is taken to represent the class for the purpose of statistical calculation. For example, the midpoints of the class intervals 30-40 and 50-60 above are calculated as follows:
Class interval    Class midpoint
30-40             (30 + 40)/2 = 70/2 = 35
50-60             (50 + 60)/2 = 110/2 = 55
 Example one:
 The weekly incomes (in birr) of 30 workers are given below.
50 23 75 42 55 67
61 71 25 40 25 54
70 31 51 81 45 63
31 68 45 38 59 75
84 50 88 56 63 32
 Then
A. Construct a GFD with 7 classes.
B. Complete the FD with class boundaries and class marks.
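One possible way to work this example is sketched below in Python. The original gives no solution, so the choices made here are assumptions: the class width is the range divided by 7 and rounded up, the first lower class limit is the smallest observation, and the classes are inclusive with a unit of measurement of 1.

```python
import math

incomes = [50, 23, 75, 42, 55, 67,
           61, 71, 25, 40, 25, 54,
           70, 31, 51, 81, 45, 63,
           31, 68, 45, 38, 59, 75,
           84, 50, 88, 56, 63, 32]

k = 7                                                   # number of classes asked for
width = math.ceil((max(incomes) - min(incomes)) / k)    # assumption: round R/k up
start = min(incomes)                                    # assumption: first LCL = smallest value

table = []
for i in range(k):
    lcl = start + i * width
    ucl = lcl + width - 1                               # inclusive classes, unit of measurement 1
    freq = sum(lcl <= x <= ucl for x in incomes)
    table.append({"class": (lcl, ucl),
                  "boundaries": (lcl - 0.5, ucl + 0.5),  # LCB and UCB
                  "mark": (lcl + ucl) / 2,               # class midpoint
                  "frequency": freq})

for row in table:
    print(row["class"], row["boundaries"], row["mark"], row["frequency"])
```

Under these assumptions the range is 65, the width is 10, and the classes run from 23-32 to 83-92. A different starting point or rounding rule would give a different but equally valid table.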
3.3.3.2.2. Rules for forming a grouped frequency distribution
To construct a GFD, the following points should be considered:
A. The classes should be clearly defined. That is, each observation should fall into one and only one class.
B. The number of classes should be neither too many nor too few.
 The first and foremost question one confronts is how many classes should be formed. It is difficult to lay down any hard and fast rules for classifying the data.
 The choice is rather subjective and depends on the interest of the individual. The number of class intervals depends mainly on the number of observations as well as their range.
 However, the following general considerations may be borne in mind when classifying data.
 If the number of observations is large but the number of classes is too small, the original data will be compressed so much that only limited information will be available, resulting in a loss of information.
 If the number of observations is small, the classes will obviously be few, as we cannot classify a small data set into 12 to 15 classes.
 Hence, too few intervals are undesirable. On the other hand, if too many intervals are used, the objective of summarization will not be met.
 The recommended number of classes is between 5 and 20. Generally, it is desirable to have between 5 and 15 class intervals, i.e. $5 \le N.C. \le 15$, where N.C. is the number of classes.
C. Choosing a suitable size or unit of class interval (class width)
 All the classes should be of the same width, because unequal class intervals create problems in graphing and in computing some statistical measures.
 As a principle, non-overlapping intervals or classes should be developed, such that each value in a set of observations can be placed in one and only one interval.
 An approximate suitable class width can be obtained as
Class width = Range / number of classes, i.e. $W = \frac{R}{k} = \frac{L - S}{k}$
where L = largest value in the data set and S = smallest value in the data set. For example, a computed value of $R/k = 6.8263$ would be rounded to the nearest integer, 7.
D. The suitable number of classes can be obtained using Sturges' formula as follows:
$k = 1 + 3.322 \log_{10} N$, where N is the number of observations
 Depending on personal preference, a suitable class size can be obtained by combining the two formulas:
$i = \frac{\text{Range}}{1 + 3.322 \log_{10} N}$
 For example, if the total number of observation is 100, then the
number of classes would be
1+3.322 log10100 = 1+3.22(2)1og 1010
= 1+3.22(2)
= 7.644 or 8
 If your result is with decimal, it should be round up or down depending
on the relevance and ordinary rule.
For example, if we have a sample of 275 observations, we will have
k = 1 + 3.322 log10 275
= 1 + 3.322(2.4393)
= 9.10, or about 9 classes
One can read log10 N from a logarithmic table.
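The two calculations above can also be checked without a logarithm table. The following is a minimal sketch; the function name `sturges_classes` is an illustrative assumption, not part of the original slides.

```python
import math

def sturges_classes(n: int) -> float:
    """Unrounded number of classes, k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

# Reproduce the two worked cases: N = 100 and N = 275.
for n in (100, 275):
    k = sturges_classes(n)
    print(f"N = {n}: k = {k:.3f}, rounded to {round(k)} classes")
```

The rounded value is only a starting point; the slides leave the final choice to the judgment of the researcher.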
Note that:
 Approximate “w” to the nearest integer.
 It is preferable to have an odd “w”, since the class midpoint is then an integer and so has the same form as the data values.
Example one: consider the age data given previously
 Example one: Illustration
The profits (in birr) of 30 companies for the year 1999-2000 are given below
20 22 35 42 37 42 48 53 49 65 39 48 67
18 16 23 37 35 49 63 65 55 45 58 57 69
25 29 58 65
 Classify the above data using a suitable class interval.
 Solution
 Let us determine a suitable class interval with the help of the following formula:
 i = Range / (1 + 3.322 log10 N)
 Range = 69 − 16 = 53, N = 30
 i = 53 / (1 + 3.322 log10 30) = 53 / (1 + 4.91) = 53 / 5.91 = 8.97 or 9
 Since values like 3, 7, 9, etc. should be avoided, we will take 10 as the class interval and 15-25 as the first class.
 Frequency distribution of the profits (exclusive method)
Profit (birr)    Tally      No. of companies
15-25            /////      5
25-35            //         2
35-45            ///////    7
45-55            //////     6
55-65            /////      5
65-75            /////      5
Total                       30
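The classification above can be reproduced in a few lines. This is a minimal sketch under the exclusive method (the upper limit of each class belongs to the next class); the helper name `frequency_distribution` and its parameters are illustrative assumptions.

```python
import math

profits = [20, 22, 35, 42, 37, 42, 48, 53, 49, 65, 39, 48, 67,
           18, 16, 23, 37, 35, 49, 63, 65, 55, 45, 58, 57, 69,
           25, 29, 58, 65]

# Range and the suggested class size i = Range / (1 + 3.322 log10 N).
data_range = max(profits) - min(profits)
suggested_width = data_range / (1 + 3.322 * math.log10(len(profits)))

def frequency_distribution(values, start, width, n_classes):
    """Count values in classes [start, start + width), [start + width, ...), etc."""
    counts = [0] * n_classes
    for v in values:
        idx = (v - start) // width      # exclusive method: upper limit goes to the next class
        if 0 <= idx < n_classes:
            counts[idx] += 1
    return counts

# Width 10 and first class 15-25, as chosen in the solution.
counts = frequency_distribution(profits, start=15, width=10, n_classes=6)
for i, c in enumerate(counts):
    print(f"{15 + 10 * i}-{25 + 10 * i}: {c}")
```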
3.4. Relative frequency and percentage distribution
 Our discussion so far was confined to absolute frequencies, which are a useful way of comparing class frequencies within a series.
 Sometimes, it is necessary to compute relative frequencies.
 Whenever two or more sets of data contain different numbers of observations, a comparison based on absolute frequencies will be misleading.
 In such cases, it is necessary to use the relative frequency.
 Relative frequency: the frequency of a class divided by the total number of observations (the total of the frequencies), i.e. the proportion of individuals in a given class.
Relative frequency = frequency / total number of observations = fi / n
Percentage relative frequency = (fi / n) × 100
For example
Students' grade   Frequency   Relative frequency   Percentage
20-40             10          10/100 = 0.1         0.1 × 100 = 10
40-60             30          30/100 = 0.3         0.3 × 100 = 30
60-80             50          50/100 = 0.5         0.5 × 100 = 50
80-100            10          10/100 = 0.1         0.1 × 100 = 10
Table 3.2 Relative frequency distribution for number of needs obtained by 30 students
Observation   Frequency   Relative frequency   Percentage (relative frequency × 100)
0             2           0.07                 7
1             5           0.17                 17
2             8           0.27                 27
3             9           0.30                 30
4             4           0.13                 13
5             2           0.07                 7
Total         30          1.00                 100
(Rounded values may not add exactly to the totals.)
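Relative and percentage frequencies follow directly from the frequency column. The sketch below uses the frequencies of Table 3.2 (2, 5, 8, 9, 4, 2 for observations 0 to 5); the variable names are illustrative assumptions.

```python
# Frequencies from Table 3.2: observation value -> frequency.
frequencies = {0: 2, 1: 5, 2: 8, 3: 9, 4: 4, 5: 2}

n = sum(frequencies.values())                      # total number of observations

# Relative frequency = f_i / n; percentage = (f_i / n) * 100, rounded to whole percent.
relative = {x: f / n for x, f in frequencies.items()}
percentage = {x: round(100 * r) for x, r in relative.items()}

for x in frequencies:
    print(x, frequencies[x], f"{relative[x]:.2f}", percentage[x])
```

Because each percentage is rounded separately, the rounded column need not sum to exactly 100 even though the unrounded relative frequencies sum to 1.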
3.5. Cumulative frequency distribution
 Sometimes, we may be interested not only in the number of observations in each class but also in the number below or above a certain specified limit.
 For example, we may be interested in the number of persons whose income from all sources is less than a particular amount, say 5000 birr, or in the number of students who got 75 marks or above out of 100.
 In such cases we need the cumulative frequency.
 Table xx: cumulative frequency distribution
Mark obtained   No. of students   Mark obtained       Cumulative frequency
1               10                not more than 1     10
2               30                not more than 2     40
3               35                not more than 3     75
4               28                not more than 4     103
5               39                not more than 5     142
6               20                not more than 6     162
 A cumulative frequency is the count of the values of a variable lying above or below a specified value in a distribution. Cumulative frequency distributions are usually divided into two types, namely the ‘less than’ and the ‘more than’ cumulative distribution.
 ‘Less Than’ Cumulative Frequency Distribution (<CFD): shows the number of cases lying below the upper class boundary of each class.
 ‘More Than’ Cumulative Frequency Distribution (>CFD): shows the number of cases lying above the lower class boundary of each class.
 Remark: An ordinary frequency distribution does not tell us directly the number of units above or below specified values of the classes; this can be determined from a cumulative frequency distribution.
 Example 11 Consider the frequency distribution in Example
Class (xi)   Frequency (fi)   Less than Cumulative Frequency (<cfi)   More than Cumulative Frequency (>cfi)
3 – 6        4                4                                       30
7 – 10       7                11                                      26
11 – 14      10               21                                      19
15 – 18      6                27                                      9
19 – 22      3                30                                      3
 This means that, from the ‘less than’ cumulative frequency distribution, there are 4 observations below 6.5, 11 observations below 10.5, etc., and, from the ‘more than’ cumulative frequency distribution, 30 observations are above 2.5, 26 above 6.5, etc.
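Both cumulative columns are running totals of the class frequencies, taken from the top for ‘less than’ and from the bottom for ‘more than’. A minimal sketch using the frequencies 4, 7, 10, 6, 3 from the table above (variable names are illustrative assumptions):

```python
from itertools import accumulate

classes = ["3-6", "7-10", "11-14", "15-18", "19-22"]
freq = [4, 7, 10, 6, 3]

# 'Less than' cumulative frequency: running total from the first class downward.
less_than = list(accumulate(freq))

# 'More than' cumulative frequency: running total from the last class upward.
more_than = list(accumulate(reversed(freq)))[::-1]

for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(c, f, lt, mt)
```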
Example: Consider the age of human data
Class limit   Frequency   Rel. freq.   Less than or equal to   <cf   Rel. <cf   Greater than or equal to   >cf   Rel. >cf
23 - 26       3           3/20         26                      3     3/20       23                         20    20/20
27 - 30       4           4/20         30                      7     7/20       27                         17    17/20
31 - 34       3           3/20         34                      10    10/20      31                         13    13/20
35 - 38       5           5/20         38                      15    15/20      35                         10    10/20
39 - 42       5           5/20         42                      20    20/20      39                         5     5/20
                                                                                43                         0     0/20
Total         20
3.6. Tabulation of data
 When a mass of data has been assembled, it becomes necessary for the researcher to arrange it in some kind of concise and logical order.
 Data collected through a statistical investigation is first classified according to some characteristics.
 The classified data should then be presented in a concise, clear, and definite form. This procedure is referred to as tabulation.
 Tabulation is the process of summarizing (condensing) raw or classified data in the form of a table and displaying it in compact form for further analysis.
 One of the simplest and most revealing devices for summarizing data and presenting them in a meaningful fashion is the statistical table.
 A table is a systematic arrangement of statistical data in columns and rows. The parts of a table are described in Section 3.6.2.
 Tabulation refers to the orderly arrangement of data in a table or other summary format.
 It presents responses or observations on a question-by-question or item-by-item basis and provides the most basic form of information.
 It tells the researcher how frequently each response occurs.
 This starting point of analysis requires counting the responses or observations for each of the categories, e.g., in frequency tables.

Need for tabulation
 It conserves space and reduces explanatory and descriptive statements to a minimum.
 It facilitates the process of comparison.
 It facilitates the summation of items and the detection of errors and omissions.
 It provides a basis for various statistical computations.
3.6.1. Objectives of tabulation
 The basic objective of tabulation is to summarize the mass of numerical data and present it in the simplest possible form. Some of the important objectives of tabulation are explained below:
 Simplification of data: tabulation presents the statistical data in the simplest and most intelligible form, which avoids all unnecessary details and repetitions. Since the data is arranged systematically in rows and columns, one can understand the contents of a table without confusion.
 Comparison of data: the purpose of a table is to simplify the presentation and to facilitate comparisons. Comparison is facilitated by bringing related items or information close together (side by side).
 Identification of data: since data is arranged in a table in suitable columns and rows with an appropriate title and number, it is easy to identify tabulated data at any future date. It can be used as a source of reference in the interpretation of a problem.
 Facilitates statistical processing: statistical data in raw form is not convenient for statistical processing. Once data is arranged in a systematic and orderly manner, it is convenient to compute statistical measures such as averages, dispersion, skewness, correlation, regression, etc. It also helps in the detection of errors or omissions.
 It reveals patterns: tabulation reveals the trends and behavior of the variables, which is not possible with a descriptive form of data presentation.
 Provides more information: tables usually contain numerical facts relating to different variables, which is more informative than any other kind of data presentation.
3.6.2. Parts of a table
 Parts of a table may vary from case to case depending upon the given data, but a good table must contain at least the following parts.
 A. Table number
 B. Title of the table
 C. Captions
 D. Stub
 E. Body of the table
 F. Head note
 G. Foot note
 H. Source note
A. Table number
 Every table should be numbered so that it can be identified. The number is normally indicated at the top of the table.
 Sometimes the number is given in the center at the top, above the title, or at the bottom of the table on the left-hand side.
B. Title of the table
 Every table must have a suitable title. The title is a description of the contents of the table and should be stated in a clear, concise, and self-explanatory manner. A complete title answers the questions what, where, and when, in that sequence:
 what precisely the data in the table are about (i.e. what categories of statistical data are shown)
 where the data occurred or were collected (i.e. the precise geographical, political, or physical area covered)
 when the data occurred (i.e. the specific time or period covered by the statistical material in the table)
 Note that
 the title should be clear, brief, and self-explanatory
 the title should be so worded that it permits one and only one interpretation
 it should be in the form of a series of phrases rather than complete sentences.
C. Captions
 Captions are the column headings. A caption explains what the column represents and may consist of one or more column headings. The caption should be clearly defined and placed at the middle of the column. Compared with the main part of the table, the captions should be shown in small letters.
D. Stub
 Stubs are the row headings, i.e. the headings or sub-headings given to the rows. The stubs are usually wider than the column headings but should be kept as narrow as possible without sacrificing precision and clarity of statement.
E. Body
 The body is the most vital part of the table. It contains the numerical information. Data should be arranged in the body according to the descriptions and classifications given in the captions and stubs.
 Such an arrangement of data is generally read from left to right in the rows and from top to bottom in the columns. The size and shape of the body should be suitable to accommodate the data. Data can be arranged in the body of a table in any of the following orders:
 alphabetical order
 geographical order
 chronological order
 conventional order
F. Head note
 A head note is given just below the title of a table and indicates the unit of measurement applicable to the data displayed. It should be shown within brackets,
 e.g. “in thousands” or “in million tonnes”, etc.
G. Foot note
 Footnotes are intended to provide further information about the data contained in the stubs, captions, and/or main body of the table.
 Sometimes it becomes necessary to give some explanation of the data used or to explain the meaning of an abbreviation.
 Footnotes are placed directly below the body of the table, just below the horizontal line.
 Generally, footnotes are used when:
 the data is inconsistent and one needs to clarify or point out an exception to the basis on which the data were arrived at. For example, sales may be recorded at “ex-factory price”. Any heterogeneity in the recorded data must be disclosed to avoid wrong conclusions.
 the data contain an ambiguity or something in the table needs to be clarified
 the data is affected by a special circumstance, for example a strike, lockout, fire, etc.
 the source is acknowledged in the case of secondary data
 There are various systems for identifying footnotes. One is to number them consecutively with small numbers 1, 2, 3 or letters a, b, c, d. Another is to mark the first footnote with one star (*), the second with two stars (**), the third with three stars (***), and so on. It is, however, convenient to use small numbers like 1, 2, 3, etc.
H. Source note
 Whenever secondary data are used, it is necessary to show the source from which the data are taken.
 The purpose of the source note is to enable the reader to refer to the source if s/he wants to consult it.
 The reference to the source should be complete.
 If the data are obtained from a periodical, its name, date of publication, page number, table number, etc. should be mentioned.
 3.6.3. Structure of a table
 Sample table-1
 Table No _____________
 Title _________________
 Head note ____________
Stub heading     Captions (column headings)
                 Caption entries (headings)
Stub entries     BODY
 Sample table-2
 Table No _____________
 Title _________________
 Head note ____________
Stub heading     Caption                 Caption
                 Caption    Caption      Caption    Caption
Stub entries     Main body
Total
 Foot note
 Source:
 3.6.4. Types of tables
 Tables may broadly be classified into two categories:
 simple and complex tables
 general purpose and special purpose (or summary) tables
 1. Simple and complex tables
 A. One-way table: In a simple table only one characteristic is shown. It is usually termed a one-way table. Such a table supplies answers to questions about one characteristic of the data only. The following table illustrates the point:
 Table 3.7
 Marks obtained by 100 students
Marks    Number of students
30-40    14
40-50    16
50-60    20
60-70    25
70-80    25
Total    100
 B. Two-way tables: A two-way table shows information with respect to two characteristics. That is, a two-way table gives information about two interrelated characteristics of a particular phenomenon.
 A two-way table can be prepared by dividing either the stubs or the captions into subdivisions. The following is a specimen of a two-way table: if the number of students given in Table 3.7 is further divided by sex, the table becomes a two-way table.
 This table gives information about two characteristics, namely the marks obtained by the students in economics and the distribution of students by sex in the various class intervals of marks.
 Table 3.8
 Marks obtained by 100 students, by sex
Marks    Number of students    Total
         Males    Females
30-40    8        6            14
40-50    6        10           16
50-60    14       6            20
60-70    13       12           25
70-80    12       13           25
Total    53       47           100
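The row and column totals of a two-way table are a useful check against entry errors, one of the stated purposes of tabulation. The sketch below recomputes the margins of Table 3.8 from its cell counts; the variable names are illustrative assumptions.

```python
marks = ["30-40", "40-50", "50-60", "60-70", "70-80"]
males = [8, 6, 14, 13, 12]
females = [6, 10, 6, 12, 13]

# Row totals: students of both sexes in each class of marks.
row_totals = [m + f for m, f in zip(males, females)]

# Column totals: all males, all females; grand total: all students.
col_totals = (sum(males), sum(females))
grand_total = sum(row_totals)

for label, m, f, t in zip(marks, males, females, row_totals):
    print(label, m, f, t)
print("Total", *col_totals, grand_total)
```

The row totals reproduce the one-way distribution of Table 3.7, which confirms that the two tables agree.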
 C. Higher-order table
 When more than two variables are used for classification, the table formed is called a higher-order table.
 Example: the following table shows smoking status classified by health center and gender.
Health center   Gender   Smoking status     Total
                         Y        N
1               M        10       32        42
                F        23       98        121
2               M        33       65        98
                F        12       21        33
3               M        11       32        43
                F        21       21        42
Total                    110      269       379
 A further layout is shown in the table below for the number of workers in Adaki factory according to income, age, and sex.
Income     Age 20-30    30-40    40-50    50-60    Total
           M    F       M   F    M   F    M   F    M   F
200-300
300-400
400-500
500-600
600-700
Total
 Important features of a table
 Tables should be simple.
 Each column or row should be labeled concisely and clearly, giving units of measurement for all quantitative data.
 The title should describe the content of the table, and the scale should be understandable without reference to the text.
 A good title will answer the questions of what, where, and when.
 Any necessary explanatory footnote should be included at the bottom of the table.
Frequency Table
Generally, the first approach to examining your data.
Identifies distribution of variables overall
Identifies potential outliers
Investigate outliers as possible data entry errors
Investigate a sample of others for data entry errors
Frequency Table
 A research study has been conducted examining the number of children in the families living in a community.
 The following data has been collected based on a random sample of n = 30 families from the community.
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6
Organize this data in a Frequency Table!
X = No. of children   Count (Frequency)   Relative Freq.
0                     2                   2/30 = 0.067
1                     3                   3/30 = 0.100
2                     5                   5/30 = 0.167
3                     5                   5/30 = 0.167
4                     6                   6/30 = 0.200
5                     4                   4/30 = 0.133
6                     2                   2/30 = 0.067
7                     2                   2/30 = 0.067
8                     1                   1/30 = 0.033
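The counting behind this frequency table can be sketched with the standard library. The data are the 30 family sizes listed above; the names `children` and `table` are illustrative assumptions.

```python
from collections import Counter

children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7,
            3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

counts = Counter(children)          # value -> number of families with that many children
n = len(children)

# One row per distinct value: (value, frequency, relative frequency).
table = [(x, counts[x], round(counts[x] / n, 3)) for x in sorted(counts)]

for x, f, rel in table:
    print(f"{x:>2} {f:>3} {rel:.3f}")
```

Values that appear with an unexpectedly low or high count are the ones worth checking as possible data entry errors.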
Frequency Table
Now, construct a similar frequency table for the age of patients with heart-related problems in a clinic.
The following data has been collected based on a random sample of n = 30 patients who went to the emergency room of the clinic for heart-related problems.
The measurements are: 42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67,
53, 59, 47, 63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69.
Age Groups   Frequency   Relative Frequency
32-36 yr     2           2/30 = 0.067
37-41 yr     3           3/30 = 0.100
42-46 yr     3           3/30 = 0.100
47-51 yr     3           3/30 = 0.100
52-56 yr     8           8/30 = 0.267
57-61 yr     3           3/30 = 0.100
62-66 yr     4           4/30 = 0.133
67-72 yr     4           4/30 = 0.133
Total        n = 30
Organizing Data and
Presentation
• Frequency Table
• Frequency Histogram
• Relative Frequency Histogram
• Frequency polygon
• Relative Frequency polygon
• Bar chart
• Pie chart
• Box plot
Frequency Polygon
 Use to identify the distribution of your data
[Figure: frequency polygon with separate lines for Female and Male. Vertical axis: Frequency (0 to 9). Horizontal axis: Age in years (20-, 30-, 40-, 50-, 60-69).]
Organizing data in tables and charts: Criteria for effective presentation
Why does order of variables matter?
 The arrangement of items in a table or chart should coordinate with the order in which they are mentioned in the prose description.
 Avoid zigzagging back and forth across a chart or among rows and columns of a table.
 Usually describe a pattern based on observed numeric values, e.g., most to least common.
 Often a hypothesis includes some theoretical basis of how items relate to one another.
Ordinal and continuous variables
 Values of ordinal, interval, and ratio variables have an inherent numeric order.
 E.g., age groups, dates, blood pressure.
 Numeric or chronological order of values is the principle for organizing those values in a table or chart.
Nominal variables
 Values of nominal variables have no inherent numeric order.
 E.g., categories of race, gender, or region.
 Need an organizing principle to determine the sequence of items.
 Same issue if you have more than one variable to present.
Several different causes of death.
Prevalence of more than one symptom, attitude, etc.
Strengths (+) and weaknesses (-) of different tools
Tool    Strengths                                            Weaknesses
Prose   Easiest way to explain patterns                      Hard to organize a lot of numbers
Table   Holds lots of numbers; good for detail;              Harder to "see" patterns
        predictable structure
Chart   Holds lots of numbers; easy to see general           Difficult to see specific values
        patterns; predictable structure
Complementary use of
prose, tables & charts
 Use tables and charts to present full set of numeric
values.
 Use prose to describe the pattern or address the
hypothesis.
 Use same ordering principle in table or chart and its
accompanying prose.
Improves clarity of narrative line.
Prose description of a pattern
 Objectives:
Describe size and shape of the pattern.
Explain whether it matches hypothesis.
 Specify direction and magnitude of association.
Direction: “Which is higher?”
Magnitude: “How much higher?”
Direction for different types of
variables
Direction for ordinal, interval or ratio variable:
Is the relationship positive, negative, or level?
E.g., as income rises, do death rates increase,
decrease or remain constant?
For nominal variables:
Which category has the highest value?
E.g., which gender has the higher death rate?
Principles for organizing data
 Alphabetical order
 Order of items on original data collection instrument
 Empirical order
 Theoretical groupings
 Arbitrary order – NEVER a good idea!
Think about how the data will be used, and choose
one of the above principles!
For tables and charts
accompanied by prose
PATTERN DESCRIPTION
OR HYPOTHESIS TESTING
Example: Attitudes about legal abortion
“Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion”
Circumstance                                                        % of respondents who agree
If the woman wants it for any reason                                43.7
If there is a strong chance of defect in the baby                   79.8
If the woman's own health is seriously endangered by the pregnancy  88.2
If she is not married and does not want to marry the man            42.5
If she becomes pregnant as a result of rape                         80.8
If she is married and does not want any more children               44.4
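Two of the organizing principles listed earlier, alphabetical order and empirical order, can be illustrated with the six percentages in this table. This is a minimal sketch; the shortened item labels are assumptions chosen for readability.

```python
# Short labels (assumed) mapped to the % of respondents who agree.
agreement = {
    "Any reason": 43.7,
    "Defect in baby": 79.8,
    "Mother's health": 88.2,
    "Not married": 42.5,
    "Rape": 80.8,
    "Wants no more children": 44.4,
}

# Alphabetical order: convenient for lookup, unrelated to the values.
alphabetical = sorted(agreement)

# Empirical order (descending): most to least agreement.
empirical = sorted(agreement, key=agreement.get, reverse=True)

print("Alphabetical:", alphabetical)
print("Empirical:   ", empirical)
```

In empirical order the three highest values (health, rape, defect) sit together and well above the other three, a grouping that alphabetical order scatters.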
Order of items from questionnaire
[Bar chart: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Vertical axis: % of respondents (0 to 100). Bars, left to right: Any reason, Defect in baby, Wants no more kids, Mother's health, Pregnant due to rape, Not married.]
Alphabetical order
[Bar chart: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Vertical axis: % of respondents (0 to 100). Bars, left to right: Any reason, Defect in baby, Mother's health, Not married, Rape, Wants no more.]
Empirical order (descending)
[Bar chart: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Vertical axis: % of respondents (0 to 100). Bars, left to right: Mother's health, Rape, Defect in baby, Wants no more, Any reason, Not married.]
Theoretical grouping
[Bar chart: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Vertical axis: % of respondents (0 to 100). Health reasons: Mother's health*, Pregnant due to rape*, Defect in baby*. Social reasons: Wants no more kids, Any reason, Not married.]
Combining theoretical & empirical criteria
[Bar chart: Descending dollar value of expenditures for necessities and non-necessities, 2002 U.S. Consumer Expenditure Survey. Vertical axis: $0 to $15,000. Groups: Necessities, Non-necessities.]
Pattern with a third variable
[Clustered bar chart: Agreement with legal abortion, by gender of respondent and circumstances of abortion, 2000 U.S. General Social Survey. Organized by topic of abortion question. Vertical axis: % of respondents (0 to 100). Series: Men, Women. Health reasons: Mother's health*, Pregnant due to rape*, Defect in baby*. Social reasons: Wants no more kids, Any reason, Not married.]
* difference between men and women is statistically significant at p<.05
Identifying theoretical criteria
 Consult the published literature on your topic to learn about theoretical criteria for organizing your variables.
 In new research areas, empirical sorting may yield clusters with similar response patterns that can then be explored for conceptual overlap.
For self-guided data lookup
Why is it important? When is it used?
 Researchers look up data for their own research questions, then organize the data using empirical or theoretical criteria.
How to organize data for such tasks?
Alphabetical order
Order of items from data collection instrument
Standard ordering used in periodic reports
Alphabetical order
 Widely familiar principle, e.g., used in
Phone book
Daily stock market report
 Learned at an early age
 Facilitates self-guided lookup
Ordering for a public data source
 Order of items on original data collection instrument
Users can refer to codebook
Easy to find the variables they need
 Ordering used in periodic reports
Standardized from year to year for a given topic
Summary
 There is no one principle for organizing numeric data that fits all possible tasks.
 Determine your main objective
 Hypothesis testing or pattern description
 Data reporting for others’ use
 Choose the organizing principle accordingly.
Table 1 in a paper
Describe your study population in a frequency table
Table Title
Name of variable (units of variable)   Frequency (n)   %   Mean (SD)
- Categories
-
-
Total
Data Presentation
 There are two types of statistical presentation of data: graphical and numerical.
 Graphical presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.
 Bar diagrams and pie charts are used for categorical variables.
 Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Histogram
 A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges.
Taller bars show that more data falls in that range.
A histogram displays the shape and spread of continuous
sample data
Box Plotting
 Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data.
 They also show how far the extreme values are from most of the data.
 A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value.
 A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you whether your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
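The five values behind a box plot can be computed directly. The sketch below reuses the 30 patient ages from the frequency table example and assumes Python's `statistics.quantiles` with its default method; other quartile conventions give slightly different Q1 and Q3. The function name `five_number_summary` is an illustrative assumption.

```python
import statistics

ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)   # three quartile cut points
    return min(values), q1, median, q3, max(values)

minimum, q1, median, q3, maximum = five_number_summary(ages)
print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
print("interquartile range:", q3 - q1)
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.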
Statistical concepts of
classification of Data
 Classification is the process of arranging data into
homogeneous (similar) groups according to their common
characteristics.
 Raw data cannot be easily understood, and it is not fit for
further analysis and interpretation. Arrangement of data helps
users in comparison and analysis. It is also important for
statistical sampling.
Classification of Data
There are four types of classification. They are:
 Geographical classification
When data are classified on the basis of location or areas, it is called geographical
classification
 Chronological classification
Chronological classification means classification on the basis of time, like months,
years etc.
 Qualitative classification
In Qualitative classification, data are classified on the basis of some attributes or
quality such as gender, colour of hair, literacy and religion. In this type of
classification, the attribute under study cannot be measured. It can only be found
out whether it is present or absent in the units of study.
 Quantitative classification
Quantitative classification refers to the classification of data according to some
characteristics, which can be measured such as height, weight, income, profits
etc.
Quantitative classification
 There are two types of quantitative classification of data: discrete frequency distribution and continuous frequency distribution.
 In this type of classification there are two elements:
 Variable
A variable is a characteristic that varies in magnitude or quantity, e.g. the weight of students. A variable may be discrete or continuous.
 Frequency
Frequency is the number of times each value of the variable occurs. For example, if there are 50 students with a weight of 60 kg, then 50 is the frequency of that weight.
Frequency distribution
 Frequency distribution refers to data classified on the basis of
some variable that can be measured such as prices, weight,
height, wages etc.
Frequency distribution
The following technical terms are important when a continuous frequency distribution is formed:
Class limits: Class limits are the lowest and highest values that can be included in a class. For example, take the class 51-55. The lowest value of the class is 51 and the highest value is 55. In this class there can be no value less than 51 or more than 55. 51 is the lower class limit and 55 is the upper class limit.
Class interval: The difference between the upper and lower limits of a class is known as the class interval of that class.
Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class.
Measures of Central Tendency
 In statistics, a measure of central tendency is a descriptive summary of a data set.
 It uses a single value to reflect the centre of the data distribution.
 It does not provide information about the individual values in the dataset; it only gives a summary of the dataset. Generally, the central tendency of a dataset can be described using the measures below.
Mean
 The mean represents the average value of the dataset.
 It can be calculated as the sum of all the values in the dataset
divided by the number of values. In general, it is considered as the
arithmetic mean.
 Some other measures of mean used to find the central tendency
are as follows:
 Geometric Mean (nth root of the product of n numbers)
 Harmonic Mean (the reciprocal of the average of the reciprocals)
 Weighted Mean (where some values contribute more than others)
 If all the values in the dataset are the same, then the geometric, arithmetic, and harmonic means are all equal. If there is variability in the data, then the three means differ.
Arithmetic Mean
The arithmetic mean is the number obtained by dividing the sum of the elements of a set by the number of values in the set. In everyday language it is simply called the average. If a data set consists of the values b1, b2, b3, …, bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations) = (b1 + b2 + … + bn) / n
For example, the arithmetic mean of Virat Kohli’s batting scores, also called his batting average, is
Sum of runs scored / Number of innings = 661/10
so the arithmetic mean of his scores in the last 10 innings is 66.1.
Harmonic Mean
A sequence is a Harmonic Progression if the reciprocals of its terms are in
Arithmetic Progression. The harmonic mean (HM for short) is calculated by
dividing the number of terms by the sum of the reciprocals of the terms:
HM = n / (1/x1 + 1/x2 + … + 1/xn)
In particular cases, especially those involving rates and ratios, the harmonic
mean gives the correct average. For example, if a vehicle travels a given
distance at speed x (e.g. 60 km/h) and then travels the same distance again at
speed y (e.g. 40 km/h), its average speed over the whole trip is the harmonic
mean of x and y (i.e. 2 × 60 × 40 / (60 + 40) = 48 km/h).
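The speed example can be checked with a short Python sketch. It assumes both legs cover the same distance (120 km is an arbitrary illustrative figure) and compares total distance divided by total time with the harmonic mean of the two speeds. The function names are hypothetical.

```python
def average_speed(leg_km, speeds_kmh):
    """Average speed over legs of equal length, one leg per speed."""
    total_distance = leg_km * len(speeds_kmh)
    total_time = sum(leg_km / s for s in speeds_kmh)
    return total_distance / total_time

def harmonic_mean(values):
    # Number of terms divided by the sum of their reciprocals.
    return len(values) / sum(1 / v for v in values)

speeds = [60, 40]  # km/h, from the example above

trip = average_speed(120, speeds)       # total distance / total time
hm = harmonic_mean(speeds)              # matches the trip average, about 48 km/h
am = sum(speeds) / len(speeds)          # 50 km/h, which overstates the average

print(trip, hm, am)
```

The arithmetic mean is too high here because the vehicle spends more time travelling at the slower speed.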
Geometric Mean
 The Geometric Mean (GM) is the average value or mean which
signifies the central tendency of a set of numbers by using
the product of their values.
 Basically, we multiply the numbers together and take the
nth root of the product, where n is the total number
of values.
 For example: for a given set of two numbers such as 3 and 1, the
geometric mean is equal to √(3×1) = √3 ≈ 1.732.
Use of Geometric Mean
 For example, suppose you have an investment which earns 10% the first
year, 60% the second year, and 20% the third year. What is its average
rate of return?
 It is not the arithmetic mean, because what these numbers mean is that
in the first year your investment was multiplied (not added to) by 1.10,
in the second year it was multiplied by 1.60, and in the third year it was
multiplied by 1.20. The relevant quantity is the geometric mean of these
three numbers.
 The question about finding the average rate of return can be rephrased
as: "by what constant factor would your investment need to be multiplied
each year in order to achieve the same effect as multiplying by 1.10
one year, 1.60 the next, and 1.20 the third?"
 If you calculate this geometric mean, (1.10 × 1.60 × 1.20)^(1/3) =
(2.112)^(1/3), you get approximately 1.283, so the average rate of return
is about 28% (not 30%, which is what the arithmetic mean of 10%, 60%,
and 20% would give you).
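The calculation can be reproduced with a few lines of Python. This is a sketch that uses the yearly growth factors 1.10, 1.60 and 1.20 from the example; the variable names are illustrative.

```python
factors = [1.10, 1.60, 1.20]  # yearly growth factors from the example

# Total growth over the three years is the product of the factors.
total_growth = 1.0
for f in factors:
    total_growth *= f

# Geometric mean: the constant yearly factor with the same total effect.
gm = total_growth ** (1 / len(factors))

# Arithmetic mean of the factors, shown for comparison only.
am = sum(factors) / len(factors)

print(f"total growth factor: {total_growth:.3f}")
print(f"geometric mean:      {gm:.3f}  (about {100 * (gm - 1):.0f}% per year)")
print(f"arithmetic mean:     {am:.3f}  (overstates the yearly return)")
```

Multiplying by the geometric mean three times gives back the total growth factor, which is not true of the arithmetic mean.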
Median
 The median is the middle value of the dataset when the
values are arranged in ascending or descending order.
 When the dataset contains an even number of values, the
median is found by taking the mean of the middle two
values.
 If you have a skewed distribution, the median is usually
the best measure of central tendency.
 The median is less sensitive to outliers (extreme scores)
than the mean and is thus a better measure than the mean
for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270. The median of these four
observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40, so the mean of 270 fails to give a
realistic picture of the major part of the data. It is
heavily influenced by the extreme value 990.
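The same comparison, sketched with Python's standard statistics module and the four observations from the example:

```python
import statistics

observations = [20, 30, 40, 990]  # values from the example above

mean_value = statistics.mean(observations)      # pulled up by the outlier 990
median_value = statistics.median(observations)  # mean of the two middle values

print("mean:  ", mean_value)
print("median:", median_value)

# Dropping the extreme value moves the mean a lot and the median only a little.
without_outlier = observations[:-1]
print("mean without 990:  ", statistics.mean(without_outlier))
print("median without 990:", statistics.median(without_outlier))
```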
Mode
 The mode represents the most frequently occurring value in
the dataset.
 Sometimes the dataset may contain multiple modes and
in some cases, it does not contain any mode at all.
 If you have categorical data, the mode is the best choice
to find the central tendency.
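A small Python sketch of the three situations described above: one mode, several modes, and no mode. The helper name modes and the sample data are illustrative, and the sketch assumes the convention that a dataset in which every distinct value occurs only once has no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values, or [] if there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if several distinct values each occur once, no mode.
    if top == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == top]

blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]  # categorical data
print(modes(blood_groups))        # a single mode
print(modes([1, 1, 2, 2, 3]))     # two modes (bimodal)
print(modes([4, 5, 6]))           # no mode
```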
Measures of Dispersion
Dispersion is the state of being dispersed or spread out. Statistical dispersion
means the extent to which numerical data are likely to vary about an
average value. In other words, dispersion helps us understand the
distribution of the data.
Objectives of computing dispersion
Comparative study
Measures of dispersion give a single value indicating the degree of consistency or
uniformity of distribution. This single value helps us in making comparisons of
various distributions.
Reliability of an average
 A small value of dispersion means low variation between the observations and the
average, so the average is a good representative of the observations and
very reliable. A higher value of dispersion means greater deviation among the
observations, so the average is less representative.
Control the variability
 Different measures of dispersion provide us with information about variability
from different angles, and this knowledge can prove helpful in controlling the variation.
Basis for further statistical analysis
 Measures of dispersion provide the basis for further statistical analysis like
computing correlation, regression, test of hypothesis, sampling etc.
Types of Measures of Dispersion
There are two main types of dispersion methods in statistics which
are:
 Absolute Measure of Dispersion
 Relative Measure of Dispersion
Absolute Measure of Dispersion
An absolute measure of dispersion is expressed in the same unit as the original data set.
Absolute dispersion methods express the variation in terms of the average of
deviations of observations, like the standard or mean deviation. They include range,
standard deviation, quartile deviation, etc. The types of absolute measures of dispersion are:
 Range: It is simply the difference between the maximum value and the minimum
value given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 – 1 = 6
 Variance: Subtract the mean from each value in the set, square each difference,
add the squares, and finally divide the total by the number of values in the data
set. Variance (σ²) = ∑(X − μ)² / N
 Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. (σ) = √(σ²).
 Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
 Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of
central tendency is known as the mean deviation (also called mean absolute
deviation).
Range
 It is the simplest method of measurement of dispersion.
 It is defined as the difference between the largest and the
smallest item in a given distribution.
 Range = Largest item (L) – Smallest item (S)
Interquartile Range
 It is defined as the difference between the Upper Quartile and
Lower Quartile of a given distribution.
 Interquartile Range = Upper Quartile (Q3) – Lower Quartile (Q1)
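A minimal Python sketch of both measures, using the data set 1, 3, 5, 6, 7 from the range example. Quartiles can be computed under several conventions; this sketch assumes the default ("exclusive") method of statistics.quantiles, so other textbooks or software may give slightly different Q1 and Q3 values.

```python
import statistics

data = [1, 3, 5, 6, 7]

# Range = Largest item (L) - Smallest item (S)
data_range = max(data) - min(data)

# Quartiles under the default "exclusive" convention (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile Range = Upper Quartile (Q3) - Lower Quartile (Q1)
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("interquartile range:", iqr)
```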
Variance
 Variance is a measure of how data points differ from the mean.
 A variance is a measure of how far a set of data (numbers) are spread
out from their mean (average) value.
 The larger the variance, the more scattered the data are around the
mean; if the variance is small, the data are less scattered around
the mean. Therefore, it is called a measure of the spread of data
around the mean.
 The formula for variance is
Var(X) = E[(X – μ)²]
 The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Variance
Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
Given: 3, 8, 6, 10, 12, 9, 11, 10, 12, 7
Step 1: Compute the mean of the 10 values given.
Mean (μ) = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8
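The remaining steps apply the variance formula: subtract the mean from each value, square the differences, add them, and divide. The Python sketch below assumes the population form (divide by N); the sample form (divide by N − 1) is computed as well for comparison, since both conventions are in common use.

```python
data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]
n = len(data)

# Step 1: mean of the values (8.8, as computed above).
mean = sum(data) / n

# Step 2: squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Step 3: sum of the squared deviations (about 73.6).
sum_of_squares = sum(squared_deviations)

# Step 4: divide by N (population) or by N - 1 (sample).
population_variance = sum_of_squares / n        # about 7.36
sample_variance = sum_of_squares / (n - 1)      # about 8.18

# Standard deviation is the square root of the variance.
population_sd = population_variance ** 0.5

print(f"mean = {mean:.1f}")
print(f"population variance = {population_variance:.2f}")
print(f"sample variance     = {sample_variance:.2f}")
print(f"population standard deviation = {population_sd:.2f}")
```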
Coefficient of variation
 The coefficient of variation (CV) is a relative measure of variability
that indicates the size of a standard deviation in relation to its mean.
 It is a standardized, unitless measure that allows you to compare
variability between disparate groups and characteristics.
 It is also known as the relative standard deviation (RSD).
 The coefficient of variation facilitates meaningful comparisons in
scenarios where absolute measures cannot.
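A short Python sketch of the idea, assuming the common definition CV = (standard deviation / mean) × 100% with the population standard deviation. The height and weight figures are made-up illustrative data, and the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: population standard deviation relative to the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Made-up data measured in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_height:.1f}%")
print(f"CV of weights: {cv_weight:.1f}%")

# The CV is unitless: expressing heights in metres leaves it unchanged.
heights_m = [h / 100 for h in heights_cm]
print(f"CV of heights in metres: {coefficient_of_variation(heights_m):.1f}%")
```

The standard deviations (in cm and in kg) cannot be compared directly, but the two CV values can: in this made-up data the weights vary more, relative to their mean, than the heights do.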
Quartile Deviation
 The Quartile Deviation (QD) is half of the
difference between the upper and lower quartiles.
 Mathematically we can define it as: Quartile Deviation = (Q3 – Q1) / 2
 Quartile Deviation is an absolute measure of dispersion.
The relative measure corresponding to QD is known as
the coefficient of QD, which is obtained from the
formula: Coefficient of Quartile Deviation = (Q3 – Q1) /
(Q3 + Q1)
 A Coefficient of QD is used to study & compare the degree of
variation in different situations.
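Both formulas can be sketched in Python as follows. The data are made up, the function names are hypothetical, and the quartiles again assume the default ("exclusive") convention of statistics.quantiles.

```python
import statistics

def quartile_deviation(values):
    """Quartile Deviation = (Q3 - Q1) / 2."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2

def coefficient_of_qd(values):
    """Coefficient of Quartile Deviation = (Q3 - Q1) / (Q3 + Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / (q3 + q1)

data = [10, 20, 30, 40, 50, 60, 70]  # made-up observations

qd = quartile_deviation(data)
coef = coefficient_of_qd(data)

print("quartile deviation:", qd)
print("coefficient of quartile deviation:", coef)
```

Because the coefficient is a ratio, it has no unit and can be compared across data sets measured on different scales.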
Skewness
 Skewness is a measure of the degree of asymmetry of a
distribution.
 If the left tail (tail at the small end of the distribution) is more
pronounced than the right tail (tail at the large end of the
distribution), the distribution is said to have negative skewness.
 If the reverse is true, it has positive skewness. If the two are
equal, it has zero skewness.
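A minimal Python sketch of one common skewness measure, the moment coefficient (third central moment divided by the 1.5 power of the second central moment). This particular formula is an assumption chosen for illustration, since several skewness measures exist, and the three data sets are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]       # tails balanced
right_tailed = [1, 2, 2, 3, 10]   # long tail at the large end
left_tailed = [1, 8, 9, 9, 10]    # long tail at the small end

print("symmetric:   ", skewness(symmetric))     # zero skewness
print("right-tailed:", skewness(right_tailed))  # positive skewness
print("left-tailed: ", skewness(left_tailed))   # negative skewness
```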
Kurtosis
 Kurtosis is a measure of whether the data are heavy-tailed or
light-tailed relative to a normal distribution.
 That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or
lack of outliers.
 Significant skewness and kurtosis indicate that the data are
not normally distributed.
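A minimal Python sketch of excess kurtosis, assuming the moment-based definition (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0). The function name and both data sets are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

light_tailed = [1, 2, 3, 4, 5]          # evenly spread, no outliers
heavy_tailed = [0] * 18 + [-10, 10]     # mostly identical, two outliers

print("light-tailed:", excess_kurtosis(light_tailed))  # below 0
print("heavy-tailed:", excess_kurtosis(heavy_tailed))  # above 0
```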
Types of Distributions
Normal Distribution
 In probability theory and statistics, the Normal Distribution, also
called the Gaussian Distribution, is the most significant
continuous probability distribution.
 A large number of random variables are either nearly or exactly
represented by the normal distribution, in the physical sciences
and in economics.
 In a normal distribution, the mean, median and mode are equal
(i.e., Mean = Median = Mode). The normal curve is symmetric
about the centre.
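These properties can be checked with NormalDist from Python's standard statistics module. The mean of 100 and standard deviation of 15 are arbitrary illustrative values.

```python
from statistics import NormalDist

# Arbitrary illustrative parameters.
dist = NormalDist(mu=100, sigma=15)

# Mean = Median = Mode for a normal distribution.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the density is the same at equal distances either side of the mean.
for d in (5, 15, 30):
    left = dist.pdf(dist.mean - d)
    right = dist.pdf(dist.mean + d)
    print(f"distance {d}: {left:.6f} vs {right:.6f}")

# Half of the probability lies below the centre.
print(dist.cdf(dist.mean))
```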
SAS Exam papers
Paper No. | Name of paper | Sincere preparation | Normal preparation
PC 1 | Language Skill | 10 | 6
PC 2 | Logical, Analytical and Quantitative Abilities | 9 | 3
PC 3 | Information Technology (Theory) | 7-8 | 2
PC 4 | Information Technology (Practical) | 10 | 10
PC 5 | Constitution of India, Statutes and Service Regulations | 7 | 2-3
PC 8 | Financial Rules and Principles of Government Accounts | 6-7 | 0
PC 14 | Financial Accounting with Elementary Costing | 6-7 | 0
PC 16 | Public Works Accounts | 4-5 | 0
PC 22 | Government Audit | 6-7 | 0
Thank you for giving me this opportunity to interact with you, and
please feel free to contact me in case of any doubt regarding the lecture.
Gaurav Kr. Prajapat
Mobile 9461588507
