
Lecture 2

The document discusses descriptive statistics, focusing on methods for organizing and presenting data, including frequency distributions, graphical representations, and measures of central tendency. It outlines the importance of understanding data characteristics for effective analysis and details various techniques for summarizing data, such as calculating the mean, median, and mode. Additionally, it emphasizes the significance of diagrammatic representation and provides guidelines for constructing tables and graphs.

Descriptive Statistics:
Methods of data organization and presentation
The data collected in a survey are called raw data.
Precise methods of analysis can be decided on only when the characteristics of the data are understood.
To this end, different techniques of data organization and presentation, such as ordered arrays, tables, and diagrams, are used.
Summarizing and organizing data can be achieved through:

1. Frequency Distributions

2. Graphical Representations

3. Measures of Central Tendency

4. Measures of variability
1. Frequency distribution:
• The actual summarization and
organization of data starts from
frequency distribution.
• Frequency distribution: A table
which has a list of each of the
possible values that the data can
assume along with the number of
times each value occurs.
• For nominal and ordinal data, frequency distributions are often used as a summary.
• The percentage of times that each value occurs, or the relative frequency, is often listed.
• Tables make it easier to see how the data are distributed.
• For both discrete and continuous data, the values are grouped into non-overlapping intervals, usually of equal width.
a) Qualitative variable: Count the number of cases in each category.
- Example 1: The intensive care unit (ICU) type of 25 patients entering the ICU at a given hospital:
- Medical, Surgical, Cardiac, Other
b) Quantitative variable:
- Select a set of contiguous, non-overlapping intervals such that each value can be placed in one, and only one, of the intervals.
- The first consideration is how many intervals to include.
• For a continuous variable (e.g. age), the frequency distribution of the individual ages is not very informative.
• We “see more” in frequencies of age values in “groupings”. Here, 10-year groupings make sense.
• Grouped data frequency distribution
To determine the number of class intervals and the corresponding width, we may use Sturges' rule:

$K = 1 + 3.322 \log_{10} n$

$W = \frac{L - S}{K}$

where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
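A minimal sketch of Sturges' rule in Python may help make the calculation concrete. The function names are illustrative, and the largest and smallest values (39 and 10) are assumed for the example rather than taken from a real data set. The rule gives a non-integer K, so how to round it is left to the analyst.

```python
import math

def sturges_k(n: int) -> float:
    """Number of class intervals suggested by Sturges' rule: K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest: float, smallest: float, k: float) -> float:
    """Class width W = (L - S) / K."""
    return (largest - smallest) / k

n = 40                      # number of observations
k_raw = sturges_k(n)        # about 6.32
k = round(k_raw)            # rounding to the nearest integer is an assumption
w = class_width(39, 10, k)  # L = 39 and S = 10 are assumed example values

print(f"K = {k_raw:.2f} -> {k} classes, W = {w:.2f}")
```

In practice the width is usually rounded to a convenient number (here, 5).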
• Cumulative frequency: The sum of the frequencies of a class and of all the classes below it.
• Cumulative relative frequency: The percentage of the total number of observations that have a value either in that interval or below it.
• Mid-point: The value of the interval which lies midway between the lower and the upper limits of a class.
• True limits: Those limits that make an interval of a continuous variable continuous in both directions.
• Used for smoothing the class intervals.
• For data recorded to the nearest whole unit, subtract 0.5 from the lower limit and add 0.5 to the upper limit.
Time (Hours)    True limits     Mid-point    Frequency
10-14           9.5 – 14.5      12           5
15-19           14.5 – 19.5     17           11
20-24           19.5 – 24.5     22           12
25-29           24.5 – 29.5     27           7
30-34           29.5 – 34.5     32           3
35-39           34.5 – 39.5     37           2
Total                                        40
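The derived columns of such a table can be computed mechanically. The sketch below uses the class limits and frequencies from the table above; the variable names are illustrative, and the 0.5 adjustment assumes the times were recorded to the nearest whole hour.

```python
from itertools import accumulate

# Class limits and frequencies from the table above
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [5, 11, 12, 7, 3, 2]

n = sum(freqs)
# True limits: subtract 0.5 from the lower limit, add 0.5 to the upper limit
true_limits = [(lo - 0.5, hi + 0.5) for lo, hi in classes]
# Mid-point: midway between the lower and upper limits
midpoints = [(lo + hi) / 2 for lo, hi in classes]
# Cumulative frequency and cumulative relative frequency (%)
cum_freqs = list(accumulate(freqs))
cum_rel = [100 * c / n for c in cum_freqs]

for (lo, hi), tl, m, f, cf, cr in zip(classes, true_limits, midpoints,
                                      freqs, cum_freqs, cum_rel):
    print(f"{lo}-{hi}  {tl[0]}-{tl[1]}  {m}  {f}  {cf}  {cr:.1f}%")
```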
Guidelines for constructing tables
• Keep them simple,
• Limit the number of variables to three or fewer,
• All tables should be self-explanatory,
• Include a clear title telling what, when and where,
• Clearly label the rows and columns,
• State clearly the unit of measurement used,
• Explain codes and abbreviations in a footnote,
• Show totals,
• If the data are not original, indicate the source in a footnote.
Diagrammatic Representation
Importance of diagrammatic representation:
1. Diagrams are more attractive than mere figures.
2. They give a quick overall impression of the data.
3. They are easier to remember than mere figures.
4. They facilitate comparison.
5. They are used to understand patterns and trends.
Specific types of graphs include:
• The choice of the particular form among the different possibilities will depend on the type of the data.
Qualitative data:
• Bar graph
• Pie chart
Quantitative data:
• Histogram
• Frequency polygon
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
MEASURES OF CENTRAL TENDENCY (MCT)
• A frequency distribution gives a general picture of the distribution of a variable.
• However, it cannot indicate the average value or the spread of the values.
• The tendency of statistical data to concentrate around a certain value is called “central tendency”.
• The various methods of determining the point about which the observations tend to concentrate are called MCT.
Measures of Central Tendency (MCT)
• The objective of calculating an MCT is to determine a single figure which may be used to represent the whole data set.
• In that sense it is an even more compact description of the statistical data than the frequency distribution.
• Since an MCT represents the entire data set, it facilitates comparison within one group or between groups of data.
CHARACTERISTICS OF A GOOD MCT
An MCT is good or satisfactory if it possesses the following characteristics.
1. It should be based on all the observations.
2. It should not be affected by extreme values.
3. It should be as close to the maximum number of values as possible.
4. It should have a definite value.
5. It should not require complicated and tedious calculations.
6. It should be capable of further algebraic treatment.
7. It should be stable under sampling, i.e. vary little from one sample to another.
• The most common measures of central tendency include:
• Arithmetic Mean
• Median
• Mode
• Others
1. ARITHMETIC MEAN
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location. It is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total number of observations: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
B. Grouped Data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$

where,
k = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
EXAMPLE. COMPUTE THE MEAN AGE OF 169 SUBJECTS FROM THE GROUPED DATA.

MEAN = 5810.5/169 = 34.38 YEARS

Class interval    Mid-point (mi)    Frequency (fi)    mifi
10-19             14.5              4                 58.0
20-29             24.5              66                1617.0
30-39             34.5              47                1621.5
40-49             44.5              36                1602.0
50-59             54.5              12                654.0
60-69             64.5              4                 258.0
Total             __                169               5810.5
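The grouped-mean calculation can be checked with a few lines of Python. This is a sketch using the mid-points and frequencies from the table above; the function name is illustrative.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    total = sum(m * f for m, f in zip(midpoints, freqs))
    return total / sum(freqs)

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

sum_mf = sum(m * f for m, f in zip(midpoints, freqs))  # 5810.5
mean_age = grouped_mean(midpoints, freqs)

print(sum_mf, round(mean_age, 2))  # 5810.5 34.38
```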


WHEN THE DATA ARE SKEWED, THE MEAN IS “DRAGGED” IN THE DIRECTION OF THE SKEWNESS.

• It is possible in extreme cases for all but one of the sample points to be on one side of the arithmetic mean. In this case, the mean is a poor measure of central location and does not reflect the center of the sample.
PROPERTIES OF THE ARITHMETIC MEAN.

• For a given set of data there is one and only one arithmetic
mean (uniqueness).
• Easy to calculate and understand (simple).
• Influenced by every value in a data set.
• Greatly affected by extreme values.
• In the case of grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
2. MEDIAN
a) Ungrouped data
• The median is the value which divides the data set into two
equal parts.
• If the number of values is odd, the median will be the middle
value when all values are arranged in order of magnitude.
• When the number of observations is even, there is no single
middle value but two middle observations.
• In this case the median is the mean of these two middle
observations, when all observations have been arranged in the
order of their magnitude.
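The odd and even cases can be sketched as follows. The function name and the two small data sets are illustrative and are not taken from the lecture.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd number of values: single middle value
        return ordered[mid]
    # even number of values: mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([23, 5, 16, 9, 12]))      # 12
print(median([23, 5, 16, 9, 12, 34]))  # 14.0
```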
Median

• The median is a better description (than the mean) of the majority when the distribution is skewed.
B) GROUPED DATA

• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.
• The first step is to locate the class interval in which the median is located, using the following procedure.
• Find n/2 and locate the first class interval whose cumulative frequency is greater than or equal to n/2.
• Then, use the following formula.

$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$

where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the interval immediately preceding the median class interval
$f_m$ = frequency of the interval containing the median
W = class interval width
n = total number of observations
EXAMPLE. COMPUTE THE MEDIAN AGE OF 169 SUBJECTS FROM THE GROUPED DATA.

N/2 = 169/2 = 84.5

Class interval    Mid-point (mi)    Frequency (fi)    Cum. freq
10-19             14.5              4                 4
20-29             24.5              66                70
30-39             34.5              47                117
40-49             44.5              36                153
50-59             54.5              12                165
60-69             64.5              4                 169
Total                               169

• n/2 = 84.5 falls in the 3rd class interval (30-39), the first whose cumulative frequency (117) reaches 84.5
• Lower true limit = 29.5, Upper true limit = 39.5
• Frequency of the class = 47
• (n/2 – Fc) = 84.5 - 70 = 14.5
• Median = 29.5 + (14.5/47)10 = 32.59 ≈ 33
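The same steps can be sketched in Python. The function name is illustrative; the code assumes equal-width classes and takes the lower true limits and frequencies from the example above.

```python
from itertools import accumulate

def grouped_median(lower_true_limits, freqs, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(freqs)
    half = n / 2
    cum = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency reaches n/2
    idx = next(i for i, c in enumerate(cum) if c >= half)
    f_c = cum[idx - 1] if idx > 0 else 0   # cumulative frequency before the median class
    f_m = freqs[idx]                       # frequency of the median class
    return lower_true_limits[idx] + (half - f_c) / f_m * width

lower_true_limits = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_true_limits, freqs, width=10)
print(round(median_age, 2))  # 32.59
```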


PROPERTIES OF THE MEDIAN
• There is only one median for a given set of data (uniqueness).
• The median is easy to calculate.
• The median is a positional average and hence it is insensitive to very large or very small values.
• The median can be calculated even in the case of open-end intervals.
• It is determined mainly by the middle points and is less sensitive to the remaining data points (weakness).
QUARTILES

• Just as the median is the value above and below which half of the data lie, one can define measures above or below which other fractional parts of the data lie.
• The median divides the data into two equal parts.
• If the data are divided into four equal parts, we speak of quartiles.
a) The first quartile (Q1): 25% of all the ranked observations are less than Q1.
b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. The second quartile is the median.
c) The third quartile (Q3): 75% of all the ranked observations are less than Q3.
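As a rough illustration, Python's standard `statistics.quantiles` function returns the three quartile cut points. The eight data values are illustrative. Several interpolation conventions exist for Q1 and Q3, so the values obtained here (with the function's default "exclusive" method) may differ slightly from those given by other software or textbooks; Q2 always equals the median.

```python
import statistics

data = [5, 9, 12, 16, 23, 34, 37, 42]   # illustrative values

# n=4 splits the ranked data into four equal parts and returns Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                            # spread of the middle 50%

print(q1, q2, q3, iqr)
```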
3. MODE
• The mode is the most frequently occurring value among all
the observations in a set of data.
• It is not influenced by extreme values.
• It is possible to have more than one mode or no mode.
• It is not a good summary of the majority of the data.
A) UNGROUPED DATA

• It is the value that occurs most frequently in a set of values.
• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4

• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”

• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
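The three examples above can be reproduced with a short helper. The function name is illustrative; it follows the convention stated above that a data set in which all values differ has no mode, so it returns an empty list in that case.

```python
from collections import Counter

def modes(values):
    """Return all modes in ascending order, or [] if every value occurs once."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                  # all values different: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))         # [4]
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))   # [2, 5]
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))   # []
```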
B) GROUPED DATA

• To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
• Alternatively, the mode can be estimated by interpolation within the modal class:

$\hat{x} = L_m + w \left( \frac{f_2}{f_0 + f_2} \right)$

WHERE

$L_m$ - LOWER BOUNDARY OF THE MODAL CLASS
$f_0$ – THE FREQUENCY OF THE CLASS NEXT BELOW THE MODAL CLASS IN VALUE
$f_2$ – THE FREQUENCY OF THE CLASS NEXT ABOVE THE MODAL CLASS IN VALUE
w – LENGTH OF THE INTERVAL OF THE MODAL CLASS
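A sketch of the interpolation, assuming the formula reads mode = L + w × f2 / (f0 + f2), with f0 and f2 the frequencies of the classes just below and just above the modal class. The function name is illustrative, and the data are the age classes used in the mean and median examples. Treating a missing neighbouring class as frequency 0, and falling back to the mid-point when both neighbours are empty, are assumptions of this sketch.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Estimate the mode of grouped data as L + w * f2 / (f0 + f2)."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])   # first modal class
    f0 = freqs[i - 1] if i > 0 else 0                    # class next below
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0       # class next above
    if f0 + f2 == 0:
        # No neighbouring observations: fall back to the mid-point
        return lower_boundaries[i] + width / 2
    return lower_boundaries[i] + width * f2 / (f0 + f2)

lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]

mode_age = grouped_mode(lower_boundaries, freqs, width=10)
print(round(mode_age, 2))  # 28.72
```

The modal class here is 20-29 (frequency 66). Because the class above it is much fuller than the class below, the estimate is pulled toward the upper boundary instead of sitting at the mid-point 24.5.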
PROPERTIES OF MODE

• It is not affected by extreme values.
• It can be calculated for distributions with open-end classes.
• Often its value is not unique.
• The main drawback of the mode is that often it does not exist.
WHICH MEASURE OF CENTRAL TENDENCY IS BEST WITH A GIVEN SET OF DATA?

• Two factors are important in making this decision:
• The scale of measurement (type of data)
• The shape of the distribution of the observations
• The mean can be used for discrete and continuous data.
• The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data.
• The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements.
• For discrete or continuous data, the “modal class” can be used.
(a) Symmetric and unimodal distribution: Mean, median, and mode should all be approximately the same.

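[Figure: symmetric, unimodal distribution in which the mean, median, and mode coincide]
(b) Bimodal: Mean and median should be about the same, but may take a value that is unlikely to occur; reporting the two modes might be best.
(c) Skewed to the right (positively skewed): Mean is sensitive to extreme values, so the median might be more appropriate.
[Figure: positively skewed distribution with the mode, median, and mean marked]
(d) Skewed to the left (negatively skewed): As in (c), the mean is sensitive to extreme values, so the median might be more appropriate.
[Figure: negatively skewed distribution with the mode, median, and mean marked]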
QUIZ 5%

1. List and explain three data collection techniques.
2. In what cases are focus group discussions (FGD) more appropriate?
3. Which type of interview would you prefer if your study participants are not educated?
4. Suppose that an investigator is interested in summarizing the change in malaria cases over time in Dire Dawa. What type of pictorial representation would you suggest for this researcher?
MEASURES OF DISPERSION

Consider the following two sets of data:

A: 177 193 195 209 226    Mean = 200
B: 192 197 200 202 209    Mean = 200

Two or more sets may have the same mean and/or median but they may be quite different.
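A quick check in Python shows how differently the two sets are spread despite the identical means. This sketch uses only the standard library; `statistics.stdev` computes the sample standard deviation (divisor n - 1).

```python
import statistics

a = [177, 193, 195, 209, 226]
b = [192, 197, 200, 202, 209]

for name, data in (("A", a), ("B", b)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)   # largest minus smallest value
    sd = statistics.stdev(data)          # sample standard deviation
    print(f"{name}: mean={mean}, range={data_range}, SD={sd:.2f}")
```

Set A has a range of 49 against 17 for set B, so A is far more dispersed around the common mean of 200.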
THESE TWO DISTRIBUTIONS HAVE THE SAME MEAN,
MEDIAN, AND MODE
MEASURES OF DISPERSION

• MCT are not enough to give a clear understanding of the distribution of the data.
• Measures of dispersion quantify the variation or dispersion of a set of data from its central location.
• Dispersion refers to the variety exhibited by the values of the data.
• The amount of dispersion is small when the values are close together.
MEASURES OF DISPERSION

Other synonymous terms:
• “Measure of Variation”
• “Measure of Spread”
• “Measures of Scatter”
• Measures of dispersion include:
• Range
• Inter-quartile range
• Variance
• Standard deviation
• Standard error
1. RANGE (R)

• The difference between the largest and smallest observations in a sample.
• Range = Maximum value – Minimum value
• Example –
• Data values: 5, 9, 12, 16, 23, 34, 37, 42
• Range = 42 - 5 = 37
• A data set with a higher range exhibits more variability.
PROPERTIES OF RANGE

• It is the simplest, crude measure and can be easily understood.
• It considers only two values, which causes it to be a poor measure of dispersion.
• Very sensitive to extreme observations.
• The larger the sample size, the larger the range tends to be.
2. INTERQUARTILE RANGE (IQR)

• Indicates the spread of the middle 50% of the observations, and is used with the median.

IQR = Q3 - Q1

• Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.
Properties of IQR:
• It is a simple and versatile measure.
• It encloses the central 50% of the observations.
• It is not based on all observations but only on two specific values.
• It is important in selecting cut-off points in the formulation of clinical standards.
• Since it excludes the lowest and highest 25% of values, it is not affected by extreme values.
• Less sensitive to the size of the sample.
3. VARIANCE (σ², S²)
• Variance is used to measure the dispersion of values relative to the mean.
• The variance is the average of the squares of the deviations taken from the mean.
• When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.
• Population variance = σ²
• Sample variance = S²
A) UNGROUPED DATA
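For ungrouped data, the standard definitions can be written as below. The symbols N (population size), μ (population mean), n (sample size) and $\bar{x}$ (sample mean) are assumed here, following common usage; the sample version divides by n - 1, in line with the grouped-data formula.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad
S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```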
B) GROUPED DATA

$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$

where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
k = the number of class intervals
$\bar{x}$ = the sample mean
Properties of Variance:
• The main disadvantage of variance is that its unit is the square of the unit of the original measurement values.
• The variance gives more weight to the extreme values as compared to those which are near to the mean value, because the difference is squared in variance.
• The drawbacks of variance are overcome by the standard deviation.
4. STANDARD DEVIATION (σ, S)
• It is the square root of the variance.
• This produces a measure having the same scale as that of the individual values.

$\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
EXAMPLE. COMPUTE THE VARIANCE AND SD OF THE AGE OF 169 SUBJECTS FROM THE GROUPED DATA.
MEAN = 5810.5/169 = 34.38 YEARS
S² = 20197.6336/(169 - 1) = 120.22
SD = √S² = √120.22 = 10.96

Class interval    (mi)    (fi)    (mi-Mean)    (mi-Mean)²    (mi-Mean)² fi
10-19             14.5    4       -19.88       395.2144      1580.8576
20-29             24.5    66      -9.88        97.6144       6442.5504
30-39             34.5    47      0.12         0.0144        0.6768
40-49             44.5    36      10.12        102.4144      3686.9184
50-59             54.5    12      20.12        404.8144      4857.7728
60-69             64.5    4       30.12        907.2144      3628.8576
Total                     169                                20197.6336
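The grouped variance and SD can be recomputed in Python as a check. This sketch uses the mid-points and frequencies of the age example; the variable names are illustrative. Because the mean is kept at full precision rather than rounded to two decimals, the last digits can differ slightly from a hand-worked table.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sum of squared deviations from the mean, weighted by class frequency
ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))

variance = ss / (n - 1)       # sample variance S^2
sd = math.sqrt(variance)      # sample standard deviation S

print(round(mean, 2), round(variance, 2), round(sd, 2))  # 34.38 120.22 10.96
```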


PROPERTIES OF SD

• The SD has the advantage of being expressed in the same units of measurement as the mean.
• SD is considered to be the best measure of dispersion and is used widely because of the properties of the theoretical normal curve.
• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of SD.
SD VS STANDARD ERROR (SE)
• SD describes the variability among individual values in a given data set.
• SE is used to describe the variability among separate sample means obtained from one sample to another.
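As a rough illustration, the standard error of the mean is commonly estimated from a single sample as SD divided by the square root of the sample size. The formula is a standard result rather than one stated in the lecture, and the function name is illustrative. The example reuses the approximate SD (10.96) and sample size (169) of the age data.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Estimated standard error of the mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)

se = standard_error(10.96, 169)
print(round(se, 3))  # 0.843
```

The SE is much smaller than the SD because sample means vary far less from sample to sample than individual ages vary within a sample.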
