Chapter Three: Data Organization, Classification and Presentation
BY: MOHAMMED BESHIR
DATA PROCESSING
Once the researcher has collected data from the respondents, or while the data are
still being collected, he or she has to reduce the mass of data to a form suitable for analysis.
This data reduction is usually referred to as data processing.
Data processing can be done either manually or by using a computer.
It involves editing and cleaning, coding, classification, and tabulation of
collected data so that they are amenable to analysis.
Accordingly, we have the following four major phases:
data processing (editing, coding)
classification of data
tabulation of data (statistical table): data presentation
charting of data (statistical chart): data presentation
Data preparation: entry, coding, editing and cleaning
3.1.1. Editing
Once the data have been collected, the next task is editing.
Editing is the process of examining the collected data for errors and omissions
and making the necessary corrections.
Editing should be done with care by experienced persons to ensure that the data
are accurate, consistent with other facts gathered, uniformly entered, as complete as
possible, and well arranged to facilitate coding and tabulation.
Changes may already have been made either by the respondent or by the interviewer.
The editor can keep his or her own corrections distinguishable by using a different
colored pencil for editing the raw data.
Where corrections are to be made, it is necessary that the editor's changes (on the edited
questionnaires) be kept distinct from the changes made either by the
respondent or by the interviewer.
1 Tigray
2 Afar
3 Amhara
4 Oromia
3. Chronological: time as a basis
Data are classified on the basis of time (day, month, year, etc.). For instance, the amount
of sugar produced and sold by Wonji sugar factory from 2001 to 2007, in tonnes.
Year 2000 2001 2003 2004 2005 2006 2007
iii. Prepare frequency table
Age Frequency
23 1
24 1
25 1
27 1
29 2
30 1
31 1
32 1
33 1
35 2
36 2
37 1
39 1
41 3
42 1
Example Two
Suppose there is a class of 30 economics students. Each student in the class is asked
to toss a coin five times and record each time whether he/she gets a head or not. The
number of heads each of the 30 students obtained out of five
tosses is presented as follows.
3,2,0,4,1,2,3,5,3,3,1,1,3,5,4,2,2,1,0,4,3,2,2,4,2,3,3,1,5,3
Prepare a frequency distribution for the coin tossing experiment.
Solution
i. arrange data in orderly manner
Such data need some better display. One way of doing this is to show
the occurrence of head in a certain order. For instance, we may
show the same data in an ascending order as follows.
0,0,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,5
The same data can be arranged in descending order.
ii. Counting and tally
The data on the tossing of a coin five times by 30 students show
that each figure from 0 to 5 has occurred a certain number of
times. We can condense the data by pairing each of those values
with its corresponding frequency.
Table 3.1 Frequency distribution for number of heads obtained by 30
students
Observation (X)   Tally        Frequency
0                 //           2
1                 /////        5
2                 ////////     8
3                 /////////    9
4                 ////         4
5                 //           2
Total                          30
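The counting step can also be done by machine. The following is a minimal Python sketch, assuming the ordered list of heads from step i is used as input; the variable names are illustrative only.

```python
from collections import Counter

# Ordered coin-toss data from step i (number of heads per student).
heads = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5]

# Counter pairs each distinct value with the number of times it occurs.
frequency = Counter(heads)

# Print one row per observed value: value, tally marks, frequency.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value}  {'/' * count:<10} {count}")
print("Total", sum(frequency.values()))
```

The frequencies should add up to the number of students (30), which is a quick check that no observation was missed.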
iii. Frequency table
1 0 3 2 0
2 4 1 3 1
4 1 2 2 3
3.3.3.2. Frequency distribution and continuous case (variable)
We have seen the frequency distribution for the discrete case. If the mass of data
is very large, say 200 or 300 observations, it is quite difficult to apply the discrete case above.
Furthermore, some variables do not assume specific values but rather
continuous values over certain intervals. For example, the
temperature of a given city may take any value between 30 °C and 40 °C. Similarly,
the age of employees can be presented in interval form:
Age (years)    No. of employees
20-25          10
25-30          15
30-35          40
35-40          45
40-45          26
45-50          4
Total          140
For a mass of data of either kind, a frequency table can be prepared.
Hence, it is necessary to condense the data into an appropriate number of
classes or groups of values of the variable and indicate the number of
observed values that fall into each class. We now turn to the formation
of a frequency table when data are continuous, by grouping values of
the variable.
Advantages of Grouping
Provides information about the range of the data.
Gives an impression of the values that are frequent and infrequent.
Provides data that can be easily used for graphical representation.
Disadvantages of Grouping
Information may be lost, since individual values are not displayed.
Something that can be determined from original data which cannot be
3.3.3.2.1. Common Terminology in a Grouped Frequency
Distribution (GFD)
Before studying frequency distributions, we should have a clear idea of
certain terms which we shall come across frequently.
1. Class: a group of values of a variable between two specified numbers.
The data in the table below are grouped into four classes: the first class
is 1-25, the 2nd is 26-50, the 3rd is 51-75, and the 4th is 76-100.
Class No.   Classes   Frequency
1           1-25      3
2           26-50     10
3           51-75     18
4           76-100    6
2. Range (R): the difference between the largest (L) and smallest (S)
values in the data.
R = L - S
3. Class frequency
The number of observations belonging to a particular class is the class
frequency. In the above table, the frequencies of the 1st, 2nd, 3rd and 4th classes
are 3, 10, 18, and 6, respectively.
Suppose 20 students have obtained marks ranging from
30-40 and 44 students have obtained marks ranging from 50-60; then
the frequency of each class is
Range   Frequency
30-40   20
50-60   44
4. Frequency Distribution
A frequency distribution is the arrangement of the values of a variable
into groups, along with the corresponding number of observations in each
group (frequency). This is usually formed as class interval type of
5. Class limit
The lowest and the highest values of a class. For example, take the
class 26-50. Here the lowest class limit is 26 and the highest class limit is
50. They are denoted as the lower class limit (LCL) and the upper class
limit (UCL) respectively.
For the first class LCL=1, UCL=25
For the second class LCL=26, UCL=50
For the third class LCL=51, UCL=75
For the fourth class LCL=76, UCL=100
6. Class boundary: the boundaries obtained by subtracting half of the
unit of measurement (1/2 u) from the lower class limit and by adding the same half to the
upper class limit, where u is the gap between two successive
classes. These are the two boundaries of the class: the upper class
boundary (UCB) and the lower class boundary (LCB), i.e.
UCBi = UCLi + 1/2 u
LCBi = LCLi - 1/2 u
For example, consider the 2nd class, 26-50, where u = 26 - 25 = 1.
LCL2 = 26, UCL2 = 50
LCB2 = 26 - 1/2(1) = 25.5
UCB2 = 50 + 1/2(1) = 50.5
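The boundary rule can be sketched in a few lines of Python. This is only an illustration, assuming inclusive classes given as (LCL, UCL) pairs with the same gap u between every pair of successive classes; the function name class_boundaries is hypothetical.

```python
def class_boundaries(limits):
    """Return (LCB, UCB) pairs for inclusive classes given as (LCL, UCL) pairs."""
    # u is the gap between the UCL of one class and the LCL of the next class.
    u = limits[1][0] - limits[0][1]
    # Subtract u/2 from each lower limit and add u/2 to each upper limit.
    return [(lcl - u / 2, ucl + u / 2) for lcl, ucl in limits]


# The four classes used in the text.
classes = [(1, 25), (26, 50), (51, 75), (76, 100)]
boundaries = class_boundaries(classes)
print(boundaries[1])  # boundaries of the 2nd class, 26-50
```

Note that the upper boundary of one class equals the lower boundary of the next, so the boundaries leave no gaps between classes.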
7. Exclusive method (class interval)
The class intervals are fixed so that the upper limit of one class is
the lower limit of the next class.
This is best explained using an example. Let us take some
hypothetical data, given below.
Profits earned by companies
Profit (million birr)   No. of companies
10-20                   12
20-30                   17
30-40                   30
40-50                   25
50-60                   16
Total                   100
In the above case, the upper limit of one class is shown as the lower limit of the next class. For instance, 20 is the upper
limit of the 1st class as well as the lower limit of the 2nd class. Similar logic holds true for the rest of the classes.
Note that
in the class intervals shown in the table it is presumed that the upper limit is exclusive, so
an item of that value is included in the next class interval. For example, 20 is the upper limit of the 1st
class and is excluded from that class, but it is included in the 2nd class, which is 20-30.
8. Inclusive method
The upper limit of a class is included in that class itself. Suppose we
have the following frequency distribution.
Profit (Birr)   No. of companies
10-19           12
20-29           17
30-39           30
40-49           25
50-59           16
Total           100
Values with a decimal part greater than 0.5 should be placed in the upper class
and the others in the lower class.
Note that
To adjust the class limits, we take the difference between the lower limit of one class
and the upper limit of the preceding class. In our case 20 - 19 = 1 is the gap between the
two limits.
Dividing it by two gives 0.5, which is termed the correction factor.
One can convert the inclusive classes to exclusive classes by
deducting 0.5 from the lower limits of all classes and adding 0.5 to the upper
limits. The adjusted classes would then be 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5 and 49.5-59.5.
9. Class interval (width) or size of class
The difference between the upper and lower boundaries (or limits) of a class is the width
of the class. For the class 26-50 above, the width is 25, which is 50.5 - 25.5.
(The difference between its limits, 50 - 26 = 24, understates the width of an inclusive class.)
W = UCLi - LCLi or W = UCBi - LCBi
For grouped data we form classes with the same width (interval), which can be
approximated using the range and the desired number of classes.
Class width = Range / number of classes,
i.e. W = R/k = (L - S)/k
Sturges' formula suggests a mechanism for determining the approximate
number of classes. The formula is as follows
k = 1 + 3.322 log10 N, where N is the number of observations
Remark
If both the LCL and the UCL are included in a class, it is called an inclusive class. For
inclusive classes the class width is W = UCBi - LCBi
If the LCL is included and the UCL is not included in a class, it is an exclusive class. For
exclusive classes W = UCLi - LCLi
10. Class midpoint (class mark)
The value lying halfway between the lower and upper class limits of a class interval.
The midpoint of a class can be ascertained as follows:
Midpoint of a class = (lower limit of the class + upper limit of the class) / 2
The midpoint of each class interval is taken to represent it for the purpose of
statistical calculation. For example, the midpoints of the class intervals 30-40 and 50-60
are calculated as follows
Class interval   Class midpoint
30-40            (30 + 40)/2 = 70/2 = 35
50-60            (50 + 60)/2 = 110/2 = 55
Example one:
The weekly incomes (in birr) of 30 workers are given below.
50 23 75 42 55 67
61 71 25 40 25 54
70 31 51 81 45 63
31 68 45 38 59 75
84 50 88 56 63 32
Then
A. Construct a GFD with 7 classes.
B. Complete the FD with class boundaries and class marks.
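One possible way to work this example is sketched below in Python. It is a sketch under stated assumptions, not the only valid answer: the class width is taken as range/7 rounded up to a whole number, the first class starts at the smallest observation, classes are inclusive, and the unit of measurement is 1 birr. All variable names are illustrative.

```python
import math

incomes = [50, 23, 75, 42, 55, 67, 61, 71, 25, 40, 25, 54, 70, 31, 51,
           81, 45, 63, 31, 68, 45, 38, 59, 75, 84, 50, 88, 56, 63, 32]

k = 7                                          # desired number of classes
smallest, largest = min(incomes), max(incomes)
width = math.ceil((largest - smallest) / k)    # assumption: round the width up

rows = []
for i in range(k):
    lcl = smallest + i * width                 # lower class limit
    ucl = lcl + width - 1                      # inclusive upper class limit (unit = 1)
    freq = sum(lcl <= x <= ucl for x in incomes)
    # Boundaries are limits -/+ half a unit; the class mark is the midpoint.
    rows.append((lcl, ucl, lcl - 0.5, ucl + 0.5, (lcl + ucl) / 2, freq))

print("Class   Boundaries   Mark   Frequency")
for lcl, ucl, lcb, ucb, mark, freq in rows:
    print(f"{lcl}-{ucl}   {lcb}-{ucb}   {mark}   {freq}")
print("Total", sum(row[5] for row in rows))
```

A different starting point or rounding choice gives a different but equally acceptable table, as long as every observation falls into exactly one class.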
3.3.3.2.2. Rules for forming a grouped frequency
distribution
To construct a GFD, the following points should be considered.
A. The classes should be clearly defined. That is, each observation should
fall into one and only one class.
B. The number of classes should be neither too many nor too few.
The first and foremost question one confronts is how many classes should
be formed. It is difficult to lay down any hard and fast rules for classifying
the data.
The choice is somewhat subjective and depends on the interest of the individual. The number of
class intervals depends mainly on the number of observations as well as
their range.
However, the following general considerations may be borne in mind
when classifying data.
If the number of observations is large but the desired number of
classes is too small, the original data will be compressed so much that only
limited information will be available, resulting in a loss of information.
If the number of observations is small, the classes will obviously be
few, as we cannot classify a small data set into 12 to 15 classes.
Hence, too few intervals are undesirable. On the other hand, if too
many intervals are used, the objective of summarization will not be
met.
The recommended number of classes is between 5 and 20.
Generally, it is desirable to have 5 to 15 class intervals.
C. Choosing a suitable size or unit of a class interval (class width)
All the classes should be of the same width, because unequal class
intervals create problems in graphing and in computing some statistical
measures.
As a principle, non-overlapping intervals or classes should be developed
such that each value in a set of observations can be placed in one and
only one interval.
An approximate suitable class width can be obtained as
Class width = Range / number of classes, i.e. W = R/k = (L - S)/k
Let R/n =6.8263
where L= largest values in the data set
S=Smallest value in the data set
D. The suitable number of classes can be obtained using Sturges'
formula as follows
k = 1 + 3.322 log10 N, where N is the number of observations
Depending on personal preference, a suitable class size can then be obtained
using the formula I = Range / (1 + 3.322 log10 N)
For example, if the total number of observations is 100, then the
number of classes would be
1 + 3.322 log10 100 = 1 + 3.322(2) log10 10
= 1 + 3.322(2)
= 7.644 or 8
If the result has a decimal part, it should be rounded up or down according to
the ordinary rounding rule and its relevance to the data.
For example, if we have a sample of 275 observations, we will
have
k = 1 + 3.322 log10 275
= 1 + 3.322(2.4393)
= 9.10, or about 9 classes
One can read log10 N from a logarithmic table.
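Instead of a logarithmic table, the formula can be evaluated directly. The short Python sketch below assumes the base-10 logarithm and ordinary rounding to the nearest whole number; the function name sturges_classes is illustrative.

```python
import math


def sturges_classes(n):
    """Approximate number of classes, k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


# The two sample sizes used in the text.
for n in (100, 275):
    k = sturges_classes(n)
    print(n, round(k, 3), round(k))
```

The result is only a starting point; the final number of classes may still be adjusted to suit the data.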
Note that:
Approximate the width "w" to the nearest integer.
It is preferable to have an odd "w", since the class midpoint is then an
integer, of the same kind as the data values.
Example one: consider the age data given previously
Example one: Illustration
The profits (in birr) of 30 companies for the year 1999-2000 are given below.
20 22 35 42 37 42 48 53 49 65 39 48 67
18 16 23 37 35 49 63 65 55 45 58 57 69
25 29 58 65
Profit   Number of companies
15-25    5
25-35    2
35-45    7
45-55    6
55-65    5
65-75    5
Total    30
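The grouping of the profit data can be checked with a short Python sketch. It assumes the exclusive method (lower limit included, upper limit excluded) with classes 15-25 up to 65-75; the variable names are illustrative.

```python
profits = [20, 22, 35, 42, 37, 42, 48, 53, 49, 65, 39, 48, 67,
           18, 16, 23, 37, 35, 49, 63, 65, 55, 45, 58, 57, 69,
           25, 29, 58, 65]

# Class limits 15, 25, ..., 75 give six classes of width 10.
edges = list(range(15, 76, 10))

table = {}
for lower, upper in zip(edges, edges[1:]):
    # Exclusive method: a value equal to the upper limit goes to the next class.
    table[(lower, upper)] = sum(lower <= p < upper for p in profits)

for (lower, upper), freq in table.items():
    print(f"{lower}-{upper}  {freq}")
print("Total", sum(table.values()))
```

For instance, the three profits of exactly 65 are counted in the class 65-75 and not in 55-65.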
3.4. Relative frequency and percentage distribution
Relative frequency is the proportion of individuals in a given class:
$\text{Relative frequency} = \dfrac{\text{Frequency}}{\text{total number of observations}} = \dfrac{f_i}{n}$
$\text{Percentage relative frequency} = \dfrac{f_i}{n} \times 100$
For example
Students' grade   Frequency   Relative frequency   Percentage
20-40             10          10/100 = 0.1         0.1 x 100 = 10
40-60             30          30/100 = 0.3         0.3 x 100 = 30
60-80             50          50/100 = 0.5         0.5 x 100 = 50
80-100            10          10/100 = 0.1         0.1 x 100 = 10
Table 3.2 Relative frequency distribution for number of heads
obtained by 30 students
Observation   Frequency   Relative frequency   Percentage (relative frequency x 100)
0             2           0.07                 7
1             5           0.17                 17
2             8           0.27                 27
3             9           0.30                 30
4             4           0.13                 13
5             2           0.07                 7
Total         30          1.00                 100
(Rounded values may not add up exactly to the totals.)
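Relative and percentage frequencies follow directly from the class frequencies. The Python sketch below uses the coin-tossing frequencies from Table 3.1 and rounds only when printing; the variable names are illustrative.

```python
# Frequencies of the number of heads obtained by 30 students (Table 3.1).
frequencies = {0: 2, 1: 5, 2: 8, 3: 9, 4: 4, 5: 2}

n = sum(frequencies.values())                       # total number of observations
relative = {x: f / n for x, f in frequencies.items()}
percentage = {x: 100 * r for x, r in relative.items()}

for x in frequencies:
    print(f"{x}  {frequencies[x]}  {relative[x]:.2f}  {percentage[x]:.0f}")
```

Before rounding, the relative frequencies always sum to 1 and the percentages to 100.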
3.5. Cumulative frequency
distribution
Sometimes we may be interested not only in the number of
observations in each class but also in the number below or above a
certain specified limit.
For example, we may want to find the number of persons
whose income from all sources is less than a particular amount, say 5000
birr, or the number of students who got 75 or more marks out of 100.
In such cases we need the cumulative frequency.
Table xx: cumulative frequency distribution
Mark obtained   No. of students   Mark obtained     Cumulative frequency
1               10                not more than 1   10
2               30                not more than 2   40
3               35                not more than 3   75
4               28                not more than 4   103
5               39                not more than 5   142
6               20                not more than 6   162
A cumulative frequency distribution is the collection of values of a variable above or below specified
values in a distribution. It
is usually divided into two types, namely the 'less than' and 'more than' cumulative
distributions.
'Less Than' Cumulative Frequency Distribution (<CFD):
shows the collection of cases lying below the upper class boundaries
of each class.
'More Than' Cumulative Frequency Distribution (>CFD):
shows the collection of cases lying above the lower class boundaries
of each class.
Remark: The frequency distribution does not tell us directly the
number of units above or below specified values of the classes; this
can be determined from a cumulative frequency distribution.
Example 11 Consider the frequency distribution in Example
Class (xi)   Frequency (fi)   Less than Cumulative Frequency (<cfi)   More than Cumulative Frequency (>cfi)
3 - 6        4                4                                       30
7 - 10       7                11                                      26
11 - 14      10               21                                      19
15 - 18      6                27                                      9
19 - 22      3                30                                      3
This means that, from the 'less than' cumulative frequency distribution, there
are 4 observations less than 6.5, 11 observations below 10.5, etc., and,
from the 'more than' cumulative frequency distribution, 30 observations are
above 2.5, 26 above 6.5, etc.
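Both cumulative columns are running totals of the class frequencies, taken from the top for 'less than' and from the bottom for 'more than'. A minimal Python sketch using the five classes of this example (variable names are illustrative):

```python
from itertools import accumulate

classes = ["3-6", "7-10", "11-14", "15-18", "19-22"]
freqs = [4, 7, 10, 6, 3]

# 'Less than': running total from the first class downwards.
less_than = list(accumulate(freqs))
# 'More than': running total from the last class upwards, put back in class order.
more_than = list(accumulate(reversed(freqs)))[::-1]

for row in zip(classes, freqs, less_than, more_than):
    print(*row)
```

The last 'less than' value and the first 'more than' value both equal the total number of observations.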
Example: Consider the age data
Class limit   Frequency   Rel. freq.   Upper limit   Less than cum. freq.   Rel.    Lower limit   More than cum. freq.   Rel.
23 - 26       3           3/20         26            3                      3/20    23            20                     20/20
27 - 30       4           4/20         30            7                      7/20    27            17                     17/20
31 - 34       3           3/20         34            10                     10/20   31            13                     13/20
35 - 38       5           5/20         38            15                     15/20   35            10                     10/20
39 - 42       5           5/20         42            20                     20/20   39            5                      5/20
                                                                                    above 42      0                      0/20
Total         20
3.6. Tabulation of data
When a mass of data has been assembled, it becomes necessary for
the researcher to arrange them in some kind of concise and logical order.
Data collected through a statistical investigation are first
classified according to some characteristics.
The classified data should then be presented in a concise, clear, definite
form. The procedure is referred to as tabulation.
It is the process of summarizing (condensing) raw data or classified
data in the form of a table and displaying them in compact form for
further analysis.
One of the simplest and most revealing devices for summarizing data and
presenting them in a meaningful fashion is the statistical table.
A table is a systematic arrangement of statistical data in columns and
rows. To understand tabular presentation in brief, let us look at the parts of a table.
Tabulation refers to the orderly arrangement of data in a table or other summary format.
It presents responses or observations on a question-by-question or item-by-item
basis and provides the most basic form of information.
It tells the researcher how frequently each response occurs.
This starting point of analysis requires counting the responses or observations for
each of the categories, e.g. in frequency tables.
It facilitates the summation of items and the detection of errors and omissions.
BODY
Stub entries
Sample table-2
Table No _____________
Title _________________
Head notes____________
Stub entries
Main body
Total
Foot note
Source:
3.6.4. Types of tables
Tables may broadly be classified into two categories:
Simple and complex tables
General purpose and special purpose (or summary) tables
1. Simple and complex tables
A. One-way table: In a simple table only one characteristic is shown. It is usually termed
a one-way table. Such a table supplies answers to questions about one characteristic of
the data only. The following table illustrates the point:
Table 3.7
Marks obtained by 100 students
Marks   Number of students
30-40 14
40-50 16
50-60 20
60-70 25
70-80 25
Total 100
B. Two-way tables: A table that shows information with respect to two characteristics.
That is, two-way tables give information about two interrelated
characteristics of a particular phenomenon.
A two-way table can be prepared by dividing either the stubs or the captions into
subdivisions. The following is an example of a two-way table: if the number of students
given in Table 3.7 is further divided by sex, the table becomes a two-way table.
This table gives information about two characteristics, namely the marks obtained
by the students in economics and the distribution of students by sex in the various
class intervals of marks.
Table 3.8
Marks obtained by 100 students, by sex
Marks    Number of students    Total
         Males    Females
30-40    8        6            14
40-50    6        10           16
50-60    14       6            20
60-70    13       12           25
70-80    12       13           25
Total    53       47           100
C. Higher-order table
When more than two variables are used for classification, the table formed is
called a higher-order table.
Example: the following table shows smoking status classified by health center and
gender.
Health Center   Gender   Smoking status      Total
                         Y        N
1               M        10       32         42
1               F        23       98         121
2               M        33       65         98
2               F        12       21         33
3               M        11       32         43
3               F        21       21         42
M F M F M F M F M F
200-300
300-400
400-500
500-600
600-700
Total
Important features of a table
Tables should be simple.
Each column or row should be labeled concisely and clearly, giving
units of measurement for all quantitative data.
The title should describe the content of the table, and the table should
be understandable without reference to the text.
A good title will answer the questions of what, where and when.
Any necessary explanatory footnote should be included at the
bottom of the table.
Frequency Table
Generally, the first approach to examining your
data.
Identifies distribution of variables overall
Identifies potential outliers
Investigate outliers as possible data entry errors
Investigate a sample of others for data entry errors
A research study has been conducted examining the number of
children in the families living in a community.
The following data has been collected based on a random
sample of n = 30 families from the community.
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4 , 2, 4, 4,
7, 6
Organize this data in a Frequency Table!
X = No. of children   Count (frequency)   Relative freq.
0 2 2/30=0.067
1 3 3/30=0.100
2 5 5/30=0.167
3 5 5/30=0.167
4 6 6/30=0.200
5 4 4/30=0.133
6 2 2/30=0.067
7 2 2/30=0.067
8 1 1/30=0.033
Now, construct a similar frequency table for the age of patients with
Heart related problems in a clinic.
The measurements are: 42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67,
53, 59, 47, 63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69.
Age groups   Frequency   Relative frequency
32-36 yr     2           2/30 = 0.067
37-41 yr     3           3/30 = 0.100
42-46 yr     3           3/30 = 0.100
47-51 yr     3           3/30 = 0.100
52-56 yr     8           8/30 = 0.267
57-61 yr     3           3/30 = 0.100
62-66 yr     4           4/30 = 0.133
67-72 yr     4           4/30 = 0.133
Total        n = 30
Organizing Data and
Presentation
• Frequency Table
• Frequency Histogram
• Relative Frequency Histogram
• Frequency polygon
• Relative Frequency polygon
• Bar chart
• Pie chart
• Box plot
Frequency Polygon
Use to identify the distribution of your data
[Figure: frequency polygon. Y-axis: Frequency (0 to 9). X-axis: Age in years (20-, 30-, 40-, 50-, 60-69). Series: Female, Male.]
Organizing data in tables and charts:
Criteria for effective presentation
Why does order of variables matter?
The arrangement of items in a table or chart should
coordinate with the order in which they are mentioned in the
prose description.
Avoid zigzagging back and forth across a chart or among
rows and columns of a table.
Usually, describe a pattern based on observed
numeric values, e.g., most to least common.
If she is not married and does not want to marry the man 42.5
If she becomes pregnant as a result of rape 80.8
If she is married and does not want any more children 44.4
Order of items from questionnaire
[Figure: bar chart. Title: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Y-axis: % of respondents (0 to 100). Bars, left to right: Any reason, Defect in baby, Wants no more kids, Mother's health, Pregnant due to rape, Not married.]
Alphabetical order
[Figure: bar chart. Title: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Y-axis: % of respondents (0 to 100). Bars, left to right: Any reason, Defect in baby, Mother's health, Not married, Rape, Wants no more.]
Empirical order (descending)
[Figure: bar chart. Title: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Y-axis: % of respondents (0 to 100). Bars, left to right: Mother's health, Rape, Defect in baby, Wants no more, Any reason, Not married.]
Theoretical grouping
[Figure: bar chart. Title: Agreement with legal abortion under specified circumstances, 2000 U.S. General Social Survey. Y-axis: % of respondents (0 to 100). Bars, left to right: Health reasons (Mother's health*, Pregnant due to rape*, Defect in baby*); Social reasons (Wants no more kids, Any reason, Not married).]
Combining theoretical & empirical criteria
[Figure: bar chart. Title: Descending dollar value of expenditures for necessities and non-necessities, 2002 U.S. Consumer Expenditure Survey. Y-axis: $0 to $15,000. Groups: Necessities, Non-necessities.]
Pattern with a third variable
Organized by topic of abortion question
[Figure: grouped bar chart. Title: Agreement with legal abortion, by gender of respondent and circumstances of abortion, 2000 U.S. General Social Survey. Y-axis: % of respondents (0 to 100). Series: Men, Women. Bars, left to right: Health reasons (Mother's health*, Pregnant due to rape*, Defect in baby*); Social reasons (Wants no more kids, Any reason, Not married).]
Table 1 in a paper
Describe your study population in a frequency table
Table Title
Name of variable (units of variable)   Frequency (n)   %   Mean (SD)
- Categories
Total
Data Presentation
There are two types of statistical presentation of data: graphical
and numerical.
Graphical Presentation: We look for the overall pattern and
for striking deviations from that pattern. The overall pattern is
usually described by the shape, center, and spread of the data.
An individual value that falls outside the overall pattern is
called an outlier.
Bar diagrams and pie charts are used for categorical
variables.
Histograms, stem-and-leaf plots and box plots are used for
numerical variables.
Histogram
A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges.
Taller bars show that more data falls in that range.
A histogram displays the shape and spread of continuous
sample data
Box Plotting
The arithmetic mean of Virat Kohli's batting scores, also called his batting
average, is:
Sum of runs scored / Number of innings = 661/10 = 66.1
The arithmetic mean of his scores in the last 10 innings is 66.1.
Harmonic Mean
A sequence is a Harmonic Progression if the reciprocals of its terms are in
Arithmetic Progression. The harmonic mean (HM for short) is
calculated by dividing the number of terms by the sum of the reciprocals of the terms.
In particular cases, especially those involving rates and ratios, the harmonic
mean gives the most correct value of the mean. For example, if a vehicle travels
a specified distance at speed x (e.g. 60 km/h) and then travels the same distance again at
speed y (e.g. 40 km/h), the average speed is the harmonic mean of x and y (i.e.
48 km/h).
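The speed example can be verified numerically. The Python sketch below computes the harmonic mean both from the definition and with the standard library function statistics.harmonic_mean; the variable names are illustrative.

```python
from statistics import harmonic_mean

speeds = [60, 40]  # km/h, same distance travelled at each speed

# Definition: number of terms divided by the sum of the reciprocals.
manual = len(speeds) / sum(1 / s for s in speeds)

print(manual, harmonic_mean(speeds))
```

The arithmetic mean of the two speeds would be 50 km/h, which overstates the average speed because more time is spent travelling at the slower speed.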
Geometric Mean
The Geometric Mean (GM) is the average value or mean which
signifies the central tendency of a set of numbers by taking
the root of the product of their values.
Basically, we multiply the numbers together and take the
nth root of the product, where n is the total number
of values.
For example, for a given set of two numbers such as 3 and 1, the
geometric mean is equal to √(3×1) = √3 ≈ 1.732.
Use of Geometric Mean
For example, suppose you have an investment which earns 10% the first
year, 60% the second year, and 20% the third year. What is its average
rate of return?
It is not the arithmetic mean, because what these numbers mean is that
in the first year your investment was multiplied (not added to) by 1.10,
in the second year it was multiplied by 1.60, and in the third year it was
multiplied by 1.20. The relevant quantity is the geometric mean of these
three numbers.
The question about finding the average rate of return can be rephrased
as: "by what constant factor would your investment need to be multiplied
each year in order to achieve the same effect as multiplying by 1.10
one year, 1.60 the next, and 1.20 the third?"
If you calculate this geometric mean, (1.10 × 1.60 × 1.20)^(1/3),
you get approximately 1.283, so the average rate of return is about 28%
(not 30%, which is what the arithmetic mean of 10%, 60%, and 20% would
give you).
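The calculation can be reproduced with a few lines of Python. This sketch uses the yearly growth factors 1.10, 1.60 and 1.20 stated above; the variable names are illustrative.

```python
import math

factors = [1.10, 1.60, 1.20]  # yearly growth factors from the text

# Geometric mean: nth root of the product of the n factors.
gm = math.prod(factors) ** (1 / len(factors))

# The average rate of return is the constant factor minus 1.
average_return = gm - 1

print(round(gm, 3), f"{average_return:.0%}")
```

Multiplying the investment by this constant factor three times gives the same final value as the three actual yearly factors.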
Median
The median is the middle value of a dataset when the
dataset is arranged in ascending or descending
order.
When the dataset contains an even number of values,
the median is found by taking the
mean of the middle two values.
If you have a skewed distribution, the best measure of
central tendency is the median.
The median is less sensitive to outliers (extreme scores)
than the mean and thus a better measure than the mean
for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270. The median of these four
observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40, so the mean of 270 fails to give a
realistic picture of the major part of the data. It is
influenced by the extreme value 990.
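The effect of the extreme value can be checked with Python's standard statistics module. The sketch below uses the four observations from the example; the variable names are illustrative.

```python
from statistics import mean, median

values = [20, 30, 40, 990]

# The mean is pulled towards the extreme value 990.
print("mean:", mean(values))
# With an even number of values, the median is the mean of the middle two.
print("median:", median(values))

# Dropping the outlier changes the mean a lot but the median only slightly.
trimmed = values[:-1]
print("without 990:", mean(trimmed), median(trimmed))
```

Without the value 990 both measures equal 30, which shows how little the median moved compared with the mean.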
Mode
Range: It is simply the difference between the maximum value and the minimum
value in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
Variance: Subtract the mean from each value in the data set, square each difference,
add the squares, and finally divide the sum by the total number of values in the data set.
Variance (σ²) = ∑(X − μ)²/N
Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. = √(σ²) = σ.
Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
Mean and Mean Deviation: The average of numbers is known as the mean, and the
arithmetic mean of the absolute deviations of the observations from a measure of
central tendency is known as the mean deviation (also called mean absolute
deviation).
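These measures can be illustrated on the small data set 1, 3, 5, 6, 7 used for the range. The Python sketch below assumes the population formulas (division by N) and takes the mean deviation about the mean; the variable names are illustrative.

```python
from statistics import mean, pvariance, pstdev

data = [1, 3, 5, 6, 7]

data_range = max(data) - min(data)      # largest minus smallest value
mu = mean(data)                         # arithmetic mean
variance = pvariance(data, mu)          # sum of squared deviations divided by N
std_dev = pstdev(data, mu)              # square root of the variance
mean_deviation = sum(abs(x - mu) for x in data) / len(data)

print(data_range, mu, round(variance, 2), round(std_dev, 3), round(mean_deviation, 2))
```

For a sample rather than a whole population, statistics.variance and statistics.stdev divide by N - 1 instead of N.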
Range
It is the simplest method of measurement of dispersion.
It is defined as the difference between the largest and the
smallest item in a given distribution.
Range = Largest item (L) – Smallest item (S)
Interquartile Range
It is defined as the difference between the Upper Quartile and
Lower Quartile of a given distribution.
Interquartile Range = Upper Quartile (Q3)–Lower
Quartile(Q1)
Variance
Variance is a measure of how far a set of data (numbers) is spread
out from its mean (average) value.
The larger the variance, the more scattered the data are around the
mean; the smaller the variance, the less
scattered they are. Therefore, it is called a measure of the spread of data
about the mean.
The formula for variance is
Var(X) = E[(X − μ)²]
The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Variance