Chapter 2 Reference Summarizing Data
2 Summarizing data
What is a statistic?
─ In this chapter we shall see how data can be
summarized to help reveal the information they contain.
We do this by calculating numbers from the data which
extract the important material. These numbers are
called statistics.
─ A statistic is anything calculated from the data alone.
Frequency distribution
Schizophrenia 474
Subnormality 58
Alcoholism 57
Total 1467
─ Table 2.3 shows the frequency distribution of a quantitative
variable, parity. This shows the number of previous pregnancies
for a sample of women booking for delivery at St. George’s
Hospital. Only certain values are possible, as the number of
pregnancies must be an integer, so this variable is discrete. The
frequency of each separate value is given.
Table 2.3 Parity of 125 women attending antenatal clinics at St. George’s Hospital
FEV1 (litres) of 57 male medical students:
2.85 3.19 3.50 3.69 3.90 4.14 4.32 4.50 4.80 5.20
2.85 3.20 3.54 3.70 3.96 4.16 4.44 4.56 4.80 5.30
2.98 3.30 3.54 3.70 4.05 4.20 4.47 4.68 4.90 5.43
3.04 3.39 3.57 3.75 4.08 4.20 4.47 4.70 5.00
3.10 3.42 3.60 3.78 4.10 4.30 4.47 4.71 5.10
3.10 3.48 3.60 3.83 4.14 4.30 4.50 4.78 5.10
─ As most of the values occur only once, to get a useful
frequency distribution we need to divide the FEV1 scale into
class intervals, e.g. from 3.0 to 3.5, from 3.5 to 4.0, and so on,
and count the number of individuals with FEV1s in each class
interval. The class intervals should not overlap, so we must
decide which interval contains the boundary point to avoid it
being counted twice. It is usual to put the lower boundary of an
interval into that interval and the higher boundary into the next
interval. Thus the interval starting at 3.0 and ending at 3.5
contains 3.0 but not 3.5. We can write this as ‘3.0 -’ or ‘3.0 -
3.5’ or ‘3.0 – 3.499’. Including the lower boundary in the class
interval has this advantage: most distributions of
measurements have a zero point below which we cannot
go, whereas few have an exact upper limit.
─ If we take a starting point of 2.5 and an interval of 0.5 we get
the frequency distribution shown in Table 2.5.
─ Note that this is not unique: a different starting point or
interval width would give a different frequency distribution.
─ The frequency distribution can be calculated easily and
accurately using a computer software package such as SPSS.
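As a rough illustration of the counting rule described above, the following Python sketch bins the 57 FEV1 values listed earlier, using a starting point of 2.5 and an interval of 0.5, with each lower boundary included in its own interval. The helper name `class_index` and the trick of working in hundredths of a litre are assumptions made for this example, not part of the original material.

```python
from collections import Counter

# FEV1 values (litres) as listed in the text, 57 observations.
fev1 = [
    2.85, 3.19, 3.50, 3.69, 3.90, 4.14, 4.32, 4.50, 4.80, 5.20,
    2.85, 3.20, 3.54, 3.70, 3.96, 4.16, 4.44, 4.56, 4.80, 5.30,
    2.98, 3.30, 3.54, 3.70, 4.05, 4.20, 4.47, 4.68, 4.90, 5.43,
    3.04, 3.39, 3.57, 3.75, 4.08, 4.20, 4.47, 4.70, 5.00,
    3.10, 3.42, 3.60, 3.78, 4.10, 4.30, 4.47, 4.71, 5.10,
    3.10, 3.48, 3.60, 3.83, 4.14, 4.30, 4.50, 4.78, 5.10,
]

START, WIDTH = 2.50, 0.50  # starting point and interval width from the text


def class_index(x):
    """Return 0 for the interval 2.5-, 1 for 3.0-, 2 for 3.5-, and so on.

    Values are converted to whole hundredths of a litre so that boundary
    values such as 3.50 are placed by exact integer arithmetic.
    """
    return (round(x * 100) - round(START * 100)) // round(WIDTH * 100)


counts = Counter(class_index(x) for x in fev1)

# Print one line per class interval, lower boundary included in the interval.
for k in range(max(counts) + 1):
    lower = START + k * WIDTH
    print(f"{lower:.1f} -   {counts.get(k, 0)}")
```

Changing `START` or `WIDTH` gives a different set of counts for the same data.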
Table 2.5 Frequency distribution of FEV1 in 57 male medical students
[Figure: frequency plotted against FEV1 (litre)]
[Figure: frequency plotted against parity]
─ Geometric mean
─ Median
$$\mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \text{ years}$$
$$G = \log^{-1}\left(\frac{\sum \log X}{n}\right)$$
For example, the HBsAg titres of seven
patients with hepatitis are as follows:
1:16, 1:32, 1:32, 1:64, 1:64, 1:128, 1:512.
─ Find the geometric mean.
Solution:
─ The characteristic of this data set is that the values are
related by multiplication: each titre is a multiple of another.
─ Formula:
$$G = \log^{-1}\left(\frac{\sum \log X}{n}\right)$$
$$G = \log^{-1}\left(\frac{\lg 16 + \lg 32 + \lg 32 + \lg 64 + \lg 64 + \lg 128 + \lg 512}{7}\right) = \log^{-1}(1.8062) = 64$$
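The calculation above can be checked with a short Python sketch. It works on the reciprocals of the seven titres and uses base-10 logarithms, as in the worked solution. The function name `geometric_mean` is chosen for this example only.

```python
import math

# Reciprocals of the seven HBsAg titres 1:16, 1:32, ... from the example.
titres = [16, 32, 32, 64, 64, 128, 512]


def geometric_mean(values):
    """Antilog of the mean of the base-10 logarithms of the values."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


mean_log = sum(math.log10(v) for v in titres) / len(titres)
g = geometric_mean(titres)

print(f"mean of logs   = {mean_log:.4f}")
print(f"geometric mean = 1:{g:.0f}")
```

Any logarithm base gives the same result, provided the antilog is taken in the same base.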
Definition of Median
─ The median is the value of the middle term in a data set that has
been ranked in increasing order. It is often used for a variable with
a skewed distribution.
─ As is obvious from the definition of the median, it divides a ranked
data set into two equal parts.
─ The calculation of the median consists of the following two steps:
Rank the data set in increasing order.
Find the middle term. The value of this term is the median.
Solution:
─ First, we rank the given data in increasing order as follows:
3 5 8 10 19
─ There are five observations in the data set. Consequently, n = 5
and
Position of the middle term = $\frac{n+1}{2} = \frac{5+1}{2} = 3$
─ Therefore, the median is the value of the third term in the ranked
data, which is 8.
3 5 8 10 19
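The two steps (rank the data, then take the middle term) can be sketched in Python as follows. The order in which the five values are passed in is arbitrary here. The even-sized branch, which averages the two middle terms, is the usual convention and is added as an assumption, since the worked example only covers an odd number of observations.

```python
def median(values):
    """Median by the two-step rule: rank the data, then take the middle term."""
    ranked = sorted(values)           # step 1: rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2       # step 2: 1-based position of middle term
        return ranked[position - 1]
    # Even n (assumed convention): average the two middle terms.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


data = [10, 5, 19, 8, 3]              # the five values, in arbitrary order
print(sorted(data))                   # ranked data
print(median(data))                   # value of the third term
```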
Definition of Mode
─ The mode is the value that occurs with the highest frequency in
a data set.
─ For example: the ages of 10 randomly selected students from a
class are 21, 19, 27, 22, 29, 19, 25, 18, 19, and 30. The value 19
occurs three times, more often than any other value, so the mode is 19.
─ For a variable with a normal distribution, the mode, median
and mean have the same value.
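A minimal Python sketch of finding the mode for the ten student ages is given here. The function `modes` is a name chosen for this illustration. It returns a list because a data set can have more than one value tied for the highest frequency.

```python
from collections import Counter

# Ages of the 10 randomly selected students from the example.
ages = [21, 19, 27, 22, 29, 19, 25, 18, 19, 30]


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


result = modes(ages)
print(result)  # the age 19 occurs three times in this data set
```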
For example, consider the following two data sets on the
ages of all workers in each of two small companies:
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the
same, 40 years. But, as we can observe, the variation in the
workers’ ages for each of these two companies is very
different. As illustrated in the diagram, the ages of the
workers in the second company have a much larger variation
than the ages of the workers in the first company.
Company 1
35 36 38 39 40 45 47
Company 2
18 27 33 52 70
Therefore, to reveal the shape of the distribution of a
data set, it is necessary to measure not only the central
tendency but also the dispersion of a variable.
The dispersion of a set of observations refers to the variety
that they exhibit.
A measure of dispersion conveys information regarding the
amount of variability present in a set of data.
Measures of dispersion
─ Range
─ Coefficient of variation
─ Box-and-whisker plots
Range
─ The range is the simplest measure of dispersion to calculate.
─ It is obtained by taking the difference between the largest and the
smallest values in a data set.
─ If the number of observations in a data set is odd, then the
median is given by the value of the middle term in ranked data.
─ For example,
In company 1, the range = largest value (47) – smallest value
(35) = 12
In company 2, the range = largest value (70) – smallest value
(18) = 52
─ The advantage of using the range as a measure of dispersion is
that it is very simple to compute.
─ The disadvantage of using the range as a measure of dispersion is
that its calculation is based on two values: the largest and the
smallest. All other values in a data set are ignored when
calculating the range.
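The two-company example can be reproduced with a few lines of Python. This sketch computes the mean and the range for each company, to show that equal means can hide very different spreads. The helper names `mean` and `value_range` are chosen for this illustration.

```python
# Ages of all workers in the two companies from the example.
company1 = [47, 38, 35, 40, 36, 45, 39]
company2 = [70, 33, 18, 52, 27]


def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)


def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


for name, ages in (("Company 1", company1), ("Company 2", company2)):
    print(f"{name}: mean = {mean(ages):.0f}, range = {value_range(ages)}")
```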
The deviation of the x value from the mean
─ $x - \mu$ or $x - \bar{x}$ is called the deviation of the x value from the
mean.
─ The sum of the deviations of the x values from the mean is
always zero, because the positive deviations of the values above
the mean exactly cancel the negative deviations of the values
below it. That is,
$\sum (x - \mu) = 0$ and $\sum (x - \bar{x}) = 0$
─ For this reason we square the deviations to calculate the
variance and standard deviation.
─ For example, suppose the scores of four students in
Statistics are 82, 95, 67, and 92. The mean score for these
four students is
$\bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84$
─ The deviations of the four scores from the mean are calculated
in Table 4.1.
Table 4.1
x     $x - \bar{x}$
82    82 - 84 = -2
95    95 - 84 = +11
67    67 - 84 = -17
92    92 - 84 = +8
      $\sum (x - \bar{x}) = 0$
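The entries of Table 4.1 can be reproduced with the short Python sketch here, which computes each deviation from the mean and confirms that the deviations sum to zero. The variable names are chosen for this illustration.

```python
# Scores of the four students from the example.
scores = [82, 95, 67, 92]

mean_score = sum(scores) / len(scores)

# Deviation of each score from the mean: x minus the mean.
deviations = [x - mean_score for x in scores]

for x, d in zip(scores, deviations):
    print(f"{x}  {d:+.0f}")
print("sum of deviations =", sum(deviations))
```

Because the positive and negative deviations cancel, their plain sum says nothing about spread, which is why the squared deviations are used next.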
Variance and standard deviation
─ Variance for population data is denoted by $\sigma^2$ (read as sigma
squared).
The formula is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
N = population size
─ Variance for sample data is denoted by $s^2$.
The formula is
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
n = sample size,
n - 1 is called the degrees of freedom.
─ The standard deviation is obtained by taking the positive
square root of the variance.
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample standard deviation: $s = \sqrt{s^2}$
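To make the difference between the two divisors concrete, the following Python sketch applies both formulas to the four Statistics scores used earlier (82, 95, 67, 92). Treating the same four scores first as a whole population and then as a sample is an assumption made purely for illustration. The results are cross-checked against the standard library's `statistics` module.

```python
import math
import statistics

scores = [82, 95, 67, 92]


def sum_squared_deviations(values):
    """Sum of squared deviations of the values from their mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def population_variance(values):
    """Sum of squared deviations divided by N (data are the whole population)."""
    return sum_squared_deviations(values) / len(values)


def sample_variance(values):
    """Sum of squared deviations divided by n - 1 (data are a sample)."""
    return sum_squared_deviations(values) / (len(values) - 1)


pop_var = population_variance(scores)
samp_var = sample_variance(scores)
pop_sd = math.sqrt(pop_var)    # positive square root of the variance
samp_sd = math.sqrt(samp_var)

print(f"population: variance = {pop_var:.2f}, sd = {pop_sd:.2f}")
print(f"sample:     variance = {samp_var:.2f}, sd = {samp_sd:.2f}")

# Cross-check against the standard library implementations.
assert math.isclose(pop_var, statistics.pvariance(scores))
assert math.isclose(samp_var, statistics.variance(scores))
```

Dividing by n - 1 rather than N always gives the larger value, and the gap shrinks as the number of observations grows.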
The coefficient of variation
─ The standard deviation is useful as a measure of variation within a
given set of data; it measures absolute variation.
─ However, when one desires to compare the dispersion in two sets
of data, comparing the two standard deviations may lead to
fallacious results.
It may be that the two variables involved are measured in
different units. For example, weight (gram) and
height (centimeter).
Even when the same unit of measurement is used, the two
means may be quite different. For example, adult height
and children's height.
─ The coefficient of variation expresses the standard deviation as a
percentage of the mean; it measures relative variation.
─ The formula is given by
$$\text{C.V.} = \frac{s}{\bar{x}}(100)$$
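Suppose two samples of human males yield the following results:
                     Sample 1   Sample 2
Mean                 145        80
Standard deviation   10         10
Sample 1: $\text{C.V.} = \frac{10}{145}(100) = 6.9$
Sample 2: $\text{C.V.} = \frac{10}{80}(100) = 12.5$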
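The two calculations above can be sketched in Python as follows, using the standard deviation of 10 and the means of 145 and 80 that appear in them. The function name `coefficient_of_variation` is chosen for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return sd / mean * 100


cv1 = coefficient_of_variation(10, 145)   # sample 1
cv2 = coefficient_of_variation(10, 80)    # sample 2

print(f"Sample 1: C.V. = {cv1:.1f}")
print(f"Sample 2: C.V. = {cv2:.1f}")
```

Although both samples have the same standard deviation, the relative variation in sample 2 is nearly twice that in sample 1.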
Interquartile range: IQR = Q3 - Q1
─ Calculating the statistics and parameters described above by hand
is laborious; with software such as SPSS or R it is very simple.
Box-and-whisker plots (boxplots)
A useful visual device for communicating the information contained
in a data set is the box-and-whisker plot. The construction of a
box-and-whisker plot (sometimes called, simply, a boxplot) makes
use of the quartiles of a data set and may be accomplished by
following these five steps:
─ Represent the variable of interest on the horizontal axis.
─ Draw a box in the space above the horizontal axis in such a way
that the left end of the box aligns with the first quartile Q1 and the right
end of the box aligns with the third quartile Q3.
─ Divide the box into two parts by a vertical line that aligns with the
median Q2.
─ Draw a horizontal line, called a whisker, from the left end of the box
to a point that aligns with the smallest measurement in the data
set.
─ Draw another horizontal line, or whisker, from the right end of the
box to a point that aligns with the largest measurement in the data
set.
Table 4.2 Diameters (cm) of pure Sarcomas Removed from the
Breasts of 20 Women
0.5 1.2 2.1 2.5 2.5 3.0 3.8 4.0 4.2 4.5
5.0 5.0 5.0 5.0 6.0 6.5 7.0 8.0 9.5 13.0
[Figure: box-and-whisker plot for the data in Table 4.2, with Minimum, Q1, Median (Q2), Q3 and Maximum marked]
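The five quantities needed to draw a box-and-whisker plot for the data in Table 4.2 can be obtained with the Python sketch here. The text does not say which quartile convention it uses, so this example assumes the rule that places the k-th quartile at position k(n + 1)/4 in the ranked data, which corresponds to `method="exclusive"` in the standard library's `statistics.quantiles`. Other conventions give slightly different quartiles.

```python
import statistics

# Diameters (cm) of the 20 sarcomas in Table 4.2.
diameters = [
    0.5, 1.2, 2.1, 2.5, 2.5, 3.0, 3.8, 4.0, 4.2, 4.5,
    5.0, 5.0, 5.0, 5.0, 6.0, 6.5, 7.0, 8.0, 9.5, 13.0,
]

# Quartiles at positions k(n + 1)/4 in the ranked data (assumed convention).
q1, q2, q3 = statistics.quantiles(diameters, n=4, method="exclusive")

iqr = q3 - q1  # interquartile range, IQR = Q3 - Q1

five_number_summary = {
    "minimum": min(diameters),   # end of the left whisker
    "Q1": q1,                    # left end of the box
    "median (Q2)": q2,           # line dividing the box
    "Q3": q3,                    # right end of the box
    "maximum": max(diameters),   # end of the right whisker
}

for label, value in five_number_summary.items():
    print(f"{label:12s} {value}")
print(f"{'IQR':12s} {iqr}")
```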