Descriptive Statistics
BIOSTATISTICS
IMPORTANT CHARACTERISTICS OF
BIOSTATISTICS:
Definitions
Population is the set of measurements of
interest to the sample collector.
Sample is any subset of
measurements selected from the
population.
Element/Unit is an entity on which
measurements are obtained.
Variation is important!!!!
VARIABLE
E.g., height, weight, uric acid level, X-ray
findings, parity, social class, etc.
Types of variables
A QUALITATIVE variable is one which does
not take a numerical value. It describes
a characteristic, e.g., gender,
survival or death, place of birth, colour of eyes,
etc.
A QUANTITATIVE variable takes
a numerical value, e.g., height, blood pressure,
lung capacity, exact age, parity, number of
cases in a study, completed family size, age at last
birthday, etc.
TYPES OF VARIABLES
Variable
    Qualitative or categorical
        Nominal (not ordered), e.g. ethnic group
        Ordinal (ordered), e.g. response to treatment
    Quantitative measurement
        Discrete (count data), e.g. number of admissions
        Continuous (real-valued), e.g. height
CATEGORICAL VARIABLES
CATEGORICAL NOMINAL
VARIABLES
Named categories
No implied order among categories
Examples:
Gender: Male/Female
Blood Groups: O, A, B, AB
Ethnic Group: Chinese, Malay, Indian,
Jordanian
Eye color: brown/black/blue/green/mixed
QUANTITATIVE VARIABLES
Can be measured numerically
Examples:
weight
# of admissions to the hospital
concentration of chlorine
CONTINUOUS DATA
Variable Type     Data Presentation
Quantitative      Graphs, Tables
Categorical       Charts, Tables
TYPES of DATA
Qualitative data = Categorical data
Quantitative data = Numerical data
Qualitative/Categorical Data
There are two types of categorical
data:
nominal
ordinal
NOMINAL DATA
Example:
NOMINAL DATA CATEGORIES
Sex/ Gender:
male, female
Marital status: single, married, widowed,
separated, divorced
ORDINAL DATA
Example:
ORDINAL DATA CATEGORIES
Level of knowledge: good, average, poor
Opinion on a statement: fully agree, agree,
disagree, totally disagree
Numerical Data
We speak of NUMERICAL DATA if the
VARIABLES are expressed in numbers. They
can be examined through:
Frequency Distribution
Percentages, Proportions, Ratios and Rates
Figures, etc.
Numerical Data
May be:
Discrete or Continuous
Discrete numerical data considers counts
which can be expressed only as whole
numbers e.g., number of people, parity,
number of males/females in a family etc.
Continuous numerical data considers
measures which can take any value
between two whole numbers e.g., weight,
height, uric acid levels etc.
SCALES OF MEASUREMENT
There are four scales (or levels) at which we measure:
__________________________________________________________
Level      Scale      Characteristic
__________________________________________________________
Lowest     Nominal    naming
           Ordinal    ordering
           Interval   equal interval without absolute zero
Highest    Ratio      equal interval with absolute zero
__________________________________________________________
DATA SUMMARIZATION
Central Location
[Figure: distribution curve marking central location and spread; x-axis: Age, y-axis: Number of people]
Common measures
Arithmetic mean
Median
Mode
Age: 27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
30, 30, 27, 30
Obs   Age
 1    27
 2    27
 3    28
 4    28
 5    28
 6    29
 7    29
 8    29
 9    29
10    30
11    30
12    30
13    30
14    30
15    31
16    31
17    32
18    34
19    36
20    37
MODE
Definition: Mode is the value that occurs
most frequently
Method for identification
1. Arrange data into frequency
distribution or histogram,
showing the values of the
variable and the frequency with
which each value occurs
2. Identify the value that occurs
most often
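The two steps above can be sketched in a few lines of Python. This is only an illustration: the `ages` list is the 20-observation age data used throughout these slides, and the variable names are chosen here for the example.

```python
from collections import Counter

# Ages of the 20 observations used on these slides.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

# Step 1: arrange the data into a frequency distribution.
frequency = Counter(ages)

# Step 2: identify the value(s) that occur most often.
highest = max(frequency.values())
modes = sorted(value for value, count in frequency.items() if count == highest)

for value in sorted(frequency):
    print(value, frequency[value])
print("Mode:", modes)
```

Returning a list rather than a single number keeps the sketch usable for bimodal data, where two values share the highest frequency.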
Mode
Age   Frequency
27    2
28    3
29    4
30    5
31    2
32    1
33    0
34    1
35    0
36    1
37    1
Mode
The most frequent value of the variable
Mode = 30
[Figure: histogram of the age data; x-axis: Age (years), y-axis: Frequency]
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Mode = 10
[Figure: Unimodal Distribution; y-axis: Population]
[Figure: Bimodal Distribution; y-axis: Population]
MEDIAN
Definition: Median is the middle
value; also, the value that splits the
distribution into two equal parts
Median: Odd Number of Values
n = 19
Ages in order:
27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
30, 30, 30, 30, 31, 31, 32, 34, 36
Median observation = (n + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
The median is the age of the 10th observation: 30 years
Median: Even Number of Values
n = 20
Ages in order:
27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
30, 30, 30, 30, 31, 31, 32, 34, 36, 37
Median observation = (n + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
The median is the average of the 10th and 11th observations: 30 years
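The position rule (n + 1) / 2 for odd and even n can be sketched as follows. The function name `median_by_position` is hypothetical and introduced only for this illustration; the data are the sorted ages from these slides.

```python
def median_by_position(sorted_values):
    """Return (position, median) using the (n + 1) / 2 rule on sorted data."""
    n = len(sorted_values)
    position = (n + 1) / 2                      # 1-based position of the median
    lower = sorted_values[int(position) - 1]    # observation at or just below the position
    if n % 2 == 1:
        # Odd n: the position is a whole number, so that observation is the median.
        return position, lower
    # Even n: the position ends in .5, so average the two neighbouring observations.
    upper = sorted_values[int(position)]
    return position, (lower + upper) / 2


ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

print(median_by_position(ages[:19]))  # odd case, n = 19
print(median_by_position(ages))       # even case, n = 20
```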
Median at 50% = 10
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Changing the largest value from 49 to 149 leaves the median at 10:
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 149
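A short check of how the median and the mean react when the largest value changes from 49 to 149. This is a sketch using Python's standard `statistics` module; the list names `stay` and `stay_outlier` are made up for the example.

```python
import statistics

# Length of stay data (nights), as listed on the slide.
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Same data with the largest value replaced by 149.
stay_outlier = stay[:-1] + [149]

for label, data in (("original", stay), ("with 149", stay_outlier)):
    print(label,
          "median =", statistics.median(data),
          "mean =", round(statistics.mean(data), 1))
```

The median stays at 10 in both cases, while the mean moves from 12 to about 15.3, which is why the median is usually preferred for skewed data.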
Quartiles
Definition: Quartiles are the values that split
the distribution into four equal parts
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
Quartiles
Q1 observation = round((n + 1) / 4) = round((20 + 1) / 4) = round(21 / 4) = round(5.25) = 5th obs
Q1 age = 28
Q2 observation = 10.5 (median)
Q2 age = 30
Q3 observation = round(3(n + 1) / 4) = round(3(20 + 1) / 4) = round(3(21) / 4) = round(15.75) = 16th obs
Q3 age = 31
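The rounding rule for the quartile observations can be reproduced as below. This is a sketch of the slide's method only (other textbooks and software interpolate between observations instead of rounding); `quartile_position` is a hypothetical helper name.

```python
import statistics

ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]


def quartile_position(n, k):
    """1-based position of the k-th quartile, k(n + 1) / 4, before rounding."""
    return k * (n + 1) / 4


n = len(ages)
q1_position = quartile_position(n, 1)    # 5.25
q3_position = quartile_position(n, 3)    # 15.75

# Round to the nearest observation, then convert to a 0-based index.
q1 = ages[round(q1_position) - 1]
q3 = ages[round(q3_position) - 1]
# Q2 is the median (position 10.5, the average of the 10th and 11th observations).
q2 = statistics.median(ages)

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

Note that Python's `round` sends exact halves to the nearest even integer, so the median is taken from `statistics.median` rather than by rounding 10.5.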
Percentiles
Values of the variable that split the
distribution into 100 equal parts
35% of observations are below the 35th percentile
65% of observations are above the 35th percentile
Percentiles
Values (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27             2      10%                    10%
28             3      15%                    25%
29             4      20%                    45%
30             5      25%                    70%
31             2      10%                    80%
32             1      5%                     85%
34             1      5%                     90%
36             1      5%                     95%
37             1      5%                     100%
Total          20     100%
25th Percentile: age 28 (cumulative percent 25%)
90th Percentile: age 34 (cumulative percent 90%)
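The cumulative percent column can be built directly from the raw ages. The sketch below assumes that a percentile is read off as the first age whose cumulative percent reaches the requested level, which matches the two percentiles marked on the slide; `percentile_value` is a hypothetical helper.

```python
from collections import Counter

ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

total = len(ages)
rows = []          # (age, frequency, percent, cumulative percent)
cumulative = 0
for age, freq in sorted(Counter(ages).items()):
    cumulative += freq
    rows.append((age, freq, 100 * freq / total, 100 * cumulative / total))


def percentile_value(table, p):
    """First age whose cumulative percent is at least p."""
    for age, _freq, _percent, cumulative_percent in table:
        if cumulative_percent >= p:
            return age
    raise ValueError("p must be between 0 and 100")


for row in rows:
    print(row)
print("25th percentile:", percentile_value(rows, 25))
print("90th percentile:", percentile_value(rows, 90))
```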
ARITHMETIC MEAN
Arithmetic mean = average value
Arithmetic Mean
\bar{x} = \frac{\sum x_i}{n}
n = 20
\sum x_i = 605
\bar{x} = \frac{605}{20} = 30.25
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Sum = 360
n = 30
Mean = 360 / 30 = 12
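Both worked means (ages and length of stay) can be checked with a short sketch. The function name `arithmetic_mean` is chosen for this example.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

print(sum(ages), len(ages), arithmetic_mean(ages))   # 605 20 30.25
print(sum(stay), len(stay), arithmetic_mean(stay))   # 360 30 12.0
```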
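CENTERING PROPERTY OF
MEAN
Value - mean = deviation (length of stay data, mean = 12)
 0 - 12 = -12      9 - 12 = -3     12 - 12 =  0
 2 - 12 = -10      9 - 12 = -3     13 - 12 =  1
 3 - 12 =  -9     10 - 12 = -2     14 - 12 =  2
 4 - 12 =  -8     10 - 12 = -2     16 - 12 =  4
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 6 - 12 =  -6     10 - 12 = -2     19 - 12 =  7
 7 - 12 =  -5     11 - 12 = -1     22 - 12 = 10
 8 - 12 =  -4     12 - 12 =  0     27 - 12 = 15
 9 - 12 =  -3     12 - 12 =  0     49 - 12 = 37
Column sums: -71, -17, 88
Sum of all deviations: -71 + (-17) + 88 = 0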
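The centering property (deviations from the mean sum to zero) can be verified numerically. A minimal sketch with the length of stay data, split into the same three columns of ten as the slide:

```python
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mean = sum(stay) / len(stay)                 # 12.0
deviations = [x - mean for x in stay]

# Column sums, ten observations per column.
column_sums = [sum(deviations[i:i + 10]) for i in range(0, 30, 10)]

print(column_sums)        # [-71.0, -17.0, 88.0]
print(sum(deviations))    # 0.0
```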
[Figure: length of stay data, marking Mean = 12.0 and Mean = 15.3; x-axis: Nights of stay]
Var A   Var B   Var C
0       0       0
0       4       1
1       4       2
1       4       3
1       5       4
5       5       5
9       5       6
9       6       7
9       6       8
10      6       9
10      10      10
For each variable,
find the:
Sum
Mean
Median
Mode
Minimum value
Maximum value

          Var A    Var B      Var C
Sum:      55       55         55
Mean:     5        5          5
Median:   5        5          5
Mode:     1, 9     4, 5, 6    none
Min:      0        0          0
Max:      10       10         10
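The answers for the three variables can be reproduced with a small sketch. It assumes the convention used in the answer table that a variable in which every value occurs once has no mode; the helper names `modes` and `summarise` are invented for this example.

```python
import statistics
from collections import Counter

variables = {
    "A": [0, 0, 1, 1, 1, 5, 9, 9, 9, 10, 10],
    "B": [0, 4, 4, 4, 5, 5, 5, 6, 6, 6, 10],
    "C": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}


def modes(values):
    """Most frequent value(s); an empty list when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def summarise(values):
    return {
        "sum": sum(values),
        "mean": sum(values) / len(values),
        "median": statistics.median(values),
        "mode": modes(values),
        "min": min(values),
        "max": max(values),
    }


summary = {name: summarise(values) for name, values in variables.items()}
for name, stats in summary.items():
    print(name, stats)
```

All three variables share the same sum, mean, median, minimum and maximum, so only the mode (and the spread) tells them apart.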
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
Same center but different dispersions
MEASURES OF SPREAD
Definition: Measures that quantify
the variation or dispersion of a set
of data from its central location
Also known as:
Measure of dispersion
Measure of variation
Common measures
Range
Interquartile range
Variance / standard deviation
Standard error
95% confidence interval
RANGE
Definition: difference between largest and
smallest values
Properties / Uses
Greatly affected by outliers
Usually used with median
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Range = 49 - 0 = 49
[Figure: length of stay data; x-axis: Nights of stay]
INTERQUARTILE RANGE
Definition: the central 50% of a distribution, from Q1 to Q3
Properties / Uses
Used with median
Five-number summary for box-and-whiskers diagram: minimum, Q1, median, Q3, maximum
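INTERQUARTILE RANGE
LENGTH OF STAY DATA
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Q1 = 25th percentile @ (30 + 1) / 4 = 7.75, between the 7th and 8th observations (6 and 7)
Median = 50th percentile @ 15.5, between the 15th and 16th observations (10 and 10)
Q3 = 75th percentile @ 3 (30 + 1) / 4 = 23.25, between the 23rd and 24th observations (14 and 16)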
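One way to turn these positions into values is to interpolate between the neighbouring observations. The sketch below uses `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions k(n + 1) / 4 and interpolates linearly; this is an assumed convention for illustration, and rounding to the nearest observation would give slightly different quartiles.

```python
import statistics

stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Quartiles at positions 7.75, 15.5 and 23.25, with linear interpolation.
q1, median, q3 = statistics.quantiles(stay, n=4, method="exclusive")

five_number_summary = (min(stay), q1, median, q3, max(stay))
interquartile_range = q3 - q1
value_range = max(stay) - min(stay)

print("Five-number summary:", five_number_summary)
print("Interquartile range:", interquartile_range)
print("Range:", value_range)
```

Under this convention Q1 falls three quarters of the way from 6 to 7 and Q3 a quarter of the way from 14 to 16.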
[Figure: BOX-AND-WHISKERS DIAGRAM, LENGTH OF STAY DATA]
[Figure: BOX-AND-WHISKERS DIAGRAMS, VARIABLES A, B, C]
Variance
= average of squared deviations
from mean
= Sum (x - mean)^2 / (n - 1)
Standard deviation
= square root of variance
\bar{x} : mean
x_i : value
n : number
s^2 : variance
SD : standard deviation
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}
SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}
1. Calculate the mean, \bar{x}.
2. Subtract the mean from each observation: x_i - \bar{x}
3. Square the difference: (x_i - \bar{x})^2
4. Sum the squared differences: \sum (x_i - \bar{x})^2
5. Divide the sum by n - 1 = s^2
6. Take the square root = SD
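The calculation can be followed step by step on the length of stay data. This sketch uses the n - 1 divisor from the formula and cross-checks the result against `statistics.variance`, which uses the same divisor.

```python
import math
import statistics

stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stay)
mean = sum(stay) / n                              # calculate the mean
deviations = [x - mean for x in stay]             # subtract the mean from each observation
squared = [d ** 2 for d in deviations]            # square each difference
sum_of_squares = sum(squared)                     # sum the squared differences
variance = sum_of_squares / (n - 1)               # divide by n - 1
standard_deviation = math.sqrt(variance)          # take the square root

print("Sum of squared deviations:", sum_of_squares)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(standard_deviation, 2))
print("Check against statistics.variance:", round(statistics.variance(stay), 2))
```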
CENTERING PROPERTY OF
MEAN
Squared deviations (length of stay data, mean = 12)
 0 - 12 = -12, squared = 144
 2 - 12 = -10, squared = 100
Each of the 30 deviations from the mean is squared in the same way.
NORMAL DISTRIBUTION
[Figure: normal curve centred on the mean, with standard deviations marked; 68% and 95% central areas, 2.5% in each tail]
Mode
Standard deviation
Median
Arithmetic mean
Range
Interquartile range
Distribution
Properties of
Measures of Central Location & Spread
For quantitative / continuous variables
Mode: simple, descriptive, not always useful
Median: best for skewed data
Arithmetic mean: best for normally distributed data
Range: use with median
Standard deviation: use with mean
Standard error: used to construct confidence intervals
[Figure: distribution of age (x-axis: Age, y-axis: Population), marking the minimum, 1st quartile, median, mode, 3rd quartile, maximum, interquartile interval and range]
Measures of Shapes
MEASURES OF VARIATION
Range is defined as the difference in value
between the highest (maximum) and the lowest
(minimum) observations.
Variance is defined as the sum of the squared
deviations about the sample mean divided by
one less than the total number of items.
Standard deviation is the square root of the
variance.
[Figure: histogram; x-axis: Var, y-axis: Fraction]
An important characteristic of
a normally distributed
variable is that 95% of the
measurements have values
that lie approximately
within 2 standard deviations
(SD) of the mean.
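The "approximately 2 SD" rule can be checked against the standard normal distribution. A sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2):
    print(f"within {k} SD: {share_within(k):.4f}")
```

Exactly 95% corresponds to about 1.96 SD; 2 SD covers slightly more (about 95.4%), which is why the statement is phrased as an approximation.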
ESTIMATIONS
The population
parameters do not change
and remain constant
whereas the sample
estimates can change and
take any random value.
Population parameters vs. sample estimates:
Mean
Standard deviation (sample estimate: SD)
Proportion
Correlation coefficient
Confidence Intervals.