1. Introduction and Descriptive Statistics
Course code
• COMH513 for Adult Health Nursing
• Pubh 614 for Maternity and Neonatal Nursing
• COMH513 for Paediatrics and Child Health Nursing
• PuHe5022 for MSc in Midwifery
• Credit Hours: 3
COURSE CONTENTS
• Introduction
• Methods of data collection
• Descriptive statistics
• Basic probability concepts
• Sampling techniques and sampling distribution
• Inference
• Sample size determination
• Statistical software (EpiData, Epi Info, Stata)
• Correlation and Linear regression
• Logistic regression
Introduction
Objectives of the section
At the end of this section, students will be able to:
Understand the rationale for studying statistics
Be familiar with some frequently used concepts/terms in
biostatistics
Identify the different types of variables
Understand the different types of data collection
methods
Be familiar with methods of data organization,
presentation and summarization
Basic Concepts of statistics and
Biostatistics
Statistical data: numerical descriptions
of things.
These may take the form of counts or
measurements.
• Statistics is always about numerical description
Statistical methods: a body of methods
used for collecting, organizing, analyzing and
interpreting numerical data in order to understand a
phenomenon or make wise decisions
Some basic concepts ….
Statistics is the science of gaining information from
data in the presence of variation
• Statistics is a scientific discipline, or the science of
gathering and describing data and subsequently
drawing conclusions (inferences) from the data.
Some basic concepts ….
The tools of statistics are employed in many fields
(e.g. health, agriculture, business, economics,
education, psychology etc.)
When the data being analyzed are derived from
the biological sciences, medicine or public
health, we use the term biostatistics to
distinguish this particular application of statistical
tools and concepts from the others
Rationale for studying Statistics
• Statistics provides a way of organizing information
• There is a great deal of inherent variation in most biological
processes
• The planning, conduct, and interpretation of much
medical research depend on statistical methods
• Quantitative nature of public health and medicine
• For the projection of health trends
• For the analysis of health and health-related data
Some basic concepts ….
Data: numerical fact or figure which is raw or
unprocessed
Data → information → knowledge →
wisdom
Statistics is a tool to convert data into useful
information
2. Qualitative (categorical) variable:
Information is measured by assigning names to items
(events) according to a set of rules, which results in
different types of data.
E.g. gender, blood group, marital status
Levels of Measurement
The level of measurement determines which statistical
calculations are meaningful.
The four levels of measurement are: nominal, ordinal,
interval, and ratio.
Levels of measurement, from lowest to highest:
Nominal
Ordinal
Interval
Ratio
Identify level of measurement
• Age
• FBS
• creatinine
• Viral load
• Educational status
• Level of anemia
• Temperature
• Marital status
Methods of Data Collection
Data and measurement
• Data is the raw material for statistics
• If the data exist in secondary form, use them to the extent
you can, keeping their limitations in mind.
• If they do not, and you are able to fund primary collection,
then primary data collection is the method of choice.
Data collection methods…
Structured observation:
Uses a predetermined checklist or observation
guide to record specific behaviors or events.
Unstructured observation:
Allows for more flexibility in recording
observations without a strict framework, capturing
a wider range of information
Advantages:
Gives relatively more accurate data on behavior and
activities.
Disadvantages:
The investigator's or observer's own biases, desires, etc. may influence the observations
Needs more resources and skilled manpower when high-level
machines are used.
2. Interviews and self-administered questionnaires
Closed-ended questions
Open-ended questions
Sensitive issues
Closed-ended questions
Offer a list of possible options that are exhaustive and
mutually exclusive; they are also called fixed-choice
questions
Closed questions are useful if the range of possible
responses is known.
E.g. Do you have children now? 1. Yes 2. No
What is your marital status? A. Single B. Married C.
Separated/divorced D. Widowed
Designing a Questionnaire
Designing a good questionnaire always takes several drafts.
In the first draft we should concentrate on the content.
In the second, we should look critically at the formulation and
sequencing of the questions.
Then we should scrutinize the format of the questionnaire.
Finally, we should do a test-run to check whether the questionnaire
gives us the information we require and whether both the
respondents and we feel at ease with it.
Usually the questionnaire will need some further adaptation before
we can use it for actual data collection
Steps in Questionnaire design
Step 1: CONTENT
Take your objectives and variables as your starting
point.
Decide on what questions will be needed to measure
or to define your variables and reach your objectives.
When developing the questionnaire, you should
reconsider the variables you have chosen, and, if
necessary, add, drop or change some.
Step 2: FORMULATING QUESTIONS
Formulate one or more questions that will provide the
information needed for each variable.
Take care that questions are specific and precise
enough that different respondents do not interpret
them differently.
Avoid leading questions.
Step 3: SEQUENCING OF QUESTIONS
Design your interview schedule or questionnaire to be
“consumer friendly.”
The sequence of questions must be logical for the
respondent
At the beginning of the interview, keep questions
concerning “background variables” (e.g., age,
religion, education, marital status, or occupation) to a
minimum.
If possible, pose most or all of these questions later in
the interview.
Start with an interesting but non-controversial
question (preferably open) that is directly related to
the subject of the study.
This type of beginning should help to raise the
informants’ interest and lessen suspicions concerning
the purpose of the interview (e.g., that it will be used
to provide information to use in levying taxes).
Pose more sensitive questions as late as possible in
the interview (e.g., questions pertaining to income,
sexual behavior, or diseases with stigma attached
to them).
Step 4: FORMATTING THE QUESTIONNAIRE
When you finalize your questionnaire, be sure that:
Each questionnaire has a heading and space to insert
the number, date and location of the interview
Questions belonging together appear together
visually.
If the questionnaire is long, you may use subheadings
for groups of questions.
Sufficient space is provided for answers to open-
ended questions
Step 5: TRANSLATION
If interviews will be conducted in one or more local
languages, the questionnaire has to be translated to
standardize the way questions will be asked.
After having it translated you should have it
retranslated into the original language.
You can then compare the two versions for differences
Problems in gathering data
Common problems might include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion (mistrust)
Cultural norms
Pre-test vs. pilot study
Descriptive Statistics        Inferential Statistics
collection                    making inferences
organizing                    hypothesis testing
summarizing                   determining relationships
presenting of data            making predictions
Descriptive Statistics
• In descriptive analysis there are three major characteristics
of a single variable that we tend to look at:
• the distribution: The simplest distribution would list
every value of a variable and the number of persons
who had each value. E.g. frequency distribution
• the central tendency: an estimate of the "center" of
a distribution of values.
• the dispersion: the spread of the values around the
central tendency.
• Descriptive statistics do not involve generalizing beyond the
data at hand.
Methods of data organization, presentation and
summarization
-Tables
-Graphs
-Numerical summary measures (NSM)
Methods of data organization and presentation
Numbers that have not been summarized and organized are called raw
data. i.e. The data collected in a survey/study
In most cases, useful information is not immediately evident from the
mass of unsorted data/raw data.
• Statistics is used to organize and interpret research observations and
findings
• Before interpretation & communication of the findings, the raw data
must be organized and presented in a clear and understandable way
1. Frequency Distribution
The arrangement of data set in a table using values and their corresponding frequency
of occurrence within a data set.
Frequency: the number of times a value occurs within a data set.
A frequency distribution consists of the set of classes or categories along with the
numerical count (i.e. the frequency) in each
The actual summarization and organization of data starts from frequency distribution
The distribution condenses the raw data into a more useful form and allows for a quick
visual interpretation of the data
Cumulative Frequency:
When frequencies of two or more classes are added up, such total frequencies are
called Cumulative Frequencies.
These frequencies help us to find the total number of items whose values are less than
or greater than some value.
Relative frequencies
Express the frequency of each value or class as a proportion or percentage of the total frequency.
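The three kinds of frequency described above can be computed together. The sketch below is only an illustration: the data (number of children reported by 10 hypothetical respondents) and the helper name `frequency_table` are assumptions, not part of the course material.

```python
from collections import Counter

# Hypothetical raw data: number of children reported by 10 respondents.
children = [0, 1, 2, 1, 0, 3, 2, 1, 1, 2]

def frequency_table(values):
    """Return rows of (value, frequency, cumulative frequency, relative frequency in %)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]          # "less than or equal to" cumulative frequency
        relative = 100 * counts[value] / total
        rows.append((value, counts[value], cumulative, relative))
    return rows

table = frequency_table(children)
print("Value  Freq  CumFreq  RelFreq(%)")
for value, freq, cum_freq, rel_freq in table:
    print(f"{value:>5}  {freq:>4}  {cum_freq:>7}  {rel_freq:>10.1f}")
```

The last cumulative frequency always equals the total number of observations, and the relative frequencies add up to 100%.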
A. Frequency distributions for categorical variables
Bar charts
Display the frequency distribution for nominal or ordinal data.
In a bar chart the various categories into which the observations fall are
represented along the horizontal axis, and a vertical bar is drawn above
each category such that the height of the bar represents either the
frequency or the relative frequency of observations within the class.
The bars should be of equal width and should be separated from one
another so as not to imply continuity
In order to compare the distribution of a variable for
two or more groups, bars are often drawn alongside
each other for the groups being compared in a single bar
chart
[Figure: chart of birth weight (BWT) frequency by category (Very low, Low, Normal, Big); frequencies shown in the figure: 43, 793, 268 and 8870]
Histogram
[Figure: histogram of the number of women by age group; class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]
Box and Whisker Plot
• It is another way to display information when the
objective is to illustrate certain locations (skewness) in
the distribution.
• Can be used to display a set of discrete or continuous
observations using a single vertical axis – only certain
summaries of the data are shown
• First the percentiles (or quartiles) of the data set
must be defined
• A box is drawn with the top of the box at the third
quartile (75%) and the bottom at the first quartile
(25%).
• The location of the mid-point (50%) of the distribution is
indicated with a horizontal line in the box.
• Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation
and from the centre of the bottom of the box to the
smallest observation.
Percentiles and Quartiles
– P0: the minimum
– P25 (25th percentile, 1st quartile): 25% of the sample values are less than or equal
to this value
– P50 (50th percentile, 2nd quartile): 50% of the sample values are less than or equal
to this value
– P75 (75th percentile, 3rd quartile): 75% of the sample values are less than or equal
to this value
– P100: the maximum
• Arrange the numbers in ascending order
• Position of the required percentile = p(n+1), where p is the required percentile
expressed as a proportion (e.g. p = 0.25 for P25) and n is the number of observations
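The position rule p(n+1) can be sketched as follows. The data are hypothetical, the function name `percentile` is illustrative, and the linear interpolation used when the position is not a whole number is an assumption (the slides give only the position rule); other software may use slightly different conventions.

```python
def percentile(values, p):
    """Percentile by the p(n+1) position rule; p is a proportion such as 0.25."""
    ordered = sorted(values)                 # arrange in ascending order first
    n = len(ordered)
    position = p * (n + 1)                   # 1-based position in the ordered data
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = int(position)                    # whole part of the position
    fraction = position - lower              # fractional part (assumed linear interpolation)
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Hypothetical sample of 7 observations.
sample = [9, 2, 6, 11, 4, 8, 5]
q1 = percentile(sample, 0.25)   # position 0.25 * 8 = 2, the 2nd ordered value
q2 = percentile(sample, 0.50)   # position 4, the 4th ordered value
q3 = percentile(sample, 0.75)   # position 6, the 6th ordered value
print(q1, q2, q3)
```

These three values, together with the minimum and the maximum, are exactly the summaries drawn in a box and whisker plot.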
[Figure: scatter plot of saturation of bile against age in women]
Line graph
• Useful for assessing the trend of a particular situation
over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along
the horizontal axis, and
• Values of the quantity being studied are marked on the
vertical axis.
• Values for each category are connected by a continuous
line.
Numerical Summary Measures
Numerical summary measures (NSM) are single numbers that quantify the
characteristics of a distribution of values
Measures of central tendency (location)
Measures of dispersion (spread)
A frequency distribution gives a general picture of the
distribution of a variable,
but it cannot indicate the average value or the spread of
the values
Measures of Central tendency
Definitions
- On the scale of values of a variable there is a certain
point around which the largest number of items tends to
cluster.
- Since this point is usually in the center of the
distribution, the tendency of the statistical data to
concentrate around this value is called "central tendency".
Definitions…
Example:
• The following are the lengths (in cm) of a sample of
six garment blanks chosen at random from a large
batch of similar blanks: 54.5, 53.0, 55.7, 51.8, 54.2,
52.4
What is the mean length of the sample of garments?
Solution:
$$\bar{X} = \frac{1}{6}\sum_{i=1}^{6} X_i = \frac{1}{6}(54.5 + 53.0 + 55.7 + 51.8 + 54.2 + 52.4) = 53.6\ \text{cm}$$
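As a quick check of the arithmetic, the mean can be computed from the six values used in the worked solution. This is a minimal sketch; the variable names are illustrative.

```python
# The six lengths (in cm) that appear in the worked solution.
lengths_cm = [54.5, 53.0, 55.7, 51.8, 54.2, 52.4]

# Mean = sum of the observations divided by the number of observations.
mean_length = sum(lengths_cm) / len(lengths_cm)
print(round(mean_length, 1))  # 53.6
```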
The Mean…
Grouped data
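$$\bar{X} = \frac{1}{n}\sum_{i=1}^{k} f_i X_i = \frac{1}{n}\left(f_1 X_1 + f_2 X_2 + \dots + f_k X_k\right)$$
where $f_i$ is the frequency and $X_i$ the class mark of the i-th class, and $n = \sum f_i$.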
Example
• The table below shows the age groups of 50 MOHA
permanent workers. Calculate the mean.
Class limits   Frequency (fi)   Class mark (Xi)   fi Xi
42-48          8                45                360
49-55          8                52                416
56-62          13               59                767
63-69          7                66                462
70-76          6                73                438
77-83          5                80                400
84-90          3                87                261
Total          50                                 3104
Solution:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{k} f_i X_i = \frac{1}{50}(360 + 416 + \dots + 261) = \frac{3104}{50} = 62.08$$
B. The Mode
• Mode is a value which occurs most frequently in a set
of values.
• The mode may not exist and even if it does exist, it may
not be unique.
• The value having the maximum frequency is the modal
value.
Examples:
1. Find the mode of 5, 3, 5, 8, 9. Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, 5. The data are bimodal:
the modes are 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode for these
data.
Mode for Grouped data
In grouped data, we usually refer to the modal class,
the class with the highest frequency.
If a single value for the mode of grouped data must be
specified, it is taken as:
$$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, w$$
where $\Delta_1 = f_{mod} - f_1$ and $\Delta_2 = f_{mod} - f_2$,
$L$ = the lower class boundary of the modal class;
$w$ = the size (width) of the modal class;
$f_1$ = frequency of the class preceding the modal class;
$f_2$ = frequency of the class succeeding the modal class;
$f_{mod}$ = frequency of the modal class.
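As an illustration, the formula can be applied to the age distribution of the 50 workers used earlier for the grouped mean. There the modal class is 56-62 (frequency 13), with class boundaries 55.5 and 62.5, preceded by a class of frequency 8 and followed by one of frequency 7. The function name `grouped_mode` is an assumption made for this sketch.

```python
def grouped_mode(lower_boundary, width, f_mod, f_prev, f_next):
    """Mode = L + delta1 / (delta1 + delta2) * w for the modal class."""
    delta1 = f_mod - f_prev      # excess over the preceding class
    delta2 = f_mod - f_next      # excess over the succeeding class
    return lower_boundary + delta1 / (delta1 + delta2) * width

# Modal class 56-62: L = 55.5, w = 7, f_mod = 13, f1 = 8, f2 = 7.
modal_age = grouped_mode(55.5, 7, 13, 8, 7)
print(round(modal_age, 2))  # 55.5 + (5 / 11) * 7 = 58.68
```

The result always lies inside the modal class, nearer to the neighbouring class with the higher frequency.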
The Mode…
Example 2.40: Calculate the modal age for the age
distribution of 228 patients below.
C. The Median
• The median is the middle value of a set of data that
has been put into rank order.
• In a distribution, the median is the value of the variable
that divides it into two equal halves.
• Useful measure of central tendency when data are
skewed.
The Median…
A) Ungrouped data
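Let $x_1, x_2, \dots, x_n$ be n ordered observations.
Then, the median value is:
a) The $\left(\frac{n+1}{2}\right)^{th}$ value if n is odd, and
b) The average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2}+1\right)^{th}$ values if n is even.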
Example: Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4.
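The rule for ungrouped data can be sketched as below and applied to the numbers in part (a). The function name `median` is illustrative; Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n + 1) / 2)-th value
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of the (n/2)-th and (n/2 + 1)-th values

# Part (a): ordered data are 2, 4, 5, 6, 8, 9, so n = 6 is even.
median_a = median([6, 5, 2, 8, 9, 4])
print(median_a)  # (5 + 6) / 2 = 5.5
```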
The Median…
B) Grouped Data
Steps
Locate the class interval in which the median is
located, using the following procedure:
– Find n/2 and identify the first class interval whose
cumulative frequency is greater than or equal to n/2; this is
the median class.
To find a unique median value, use the following
interpolation formula:
Median for grouped data
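$$\tilde{x} = \text{Median} = L + \frac{w}{f_{med}}\left(\frac{n}{2} - CF\right)$$
where L = the lower class boundary of the median class, w = the width of the median class,
f_med = the frequency of the median class, CF = the cumulative frequency of the class
preceding the median class, and n = the total number of observations.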
The Median…
Example 2.26: Compute the median for the
following distribution.
Grade    Frequency   CF
40-49    5           5
50-59    18          23
60-69    27          50
70-79    15          65
80-89    6           71
The Median…
Solution: Construct the less-than
cumulative frequency distribution as
follows:
Class boundaries   Frequency   CF
39.5-49.5          5           5
49.5-59.5          18          23
59.5-69.5          27          50
69.5-79.5          15          65
79.5-89.5          6           71
The Median…
Since n = 71, n/2 = 71/2 = 35.5, and the smallest less-than CF
greater than or equal to 35.5 is 50; thus, the median class is
the third class. For this class,
L = 59.5, w = 10, f_med = 27 and CF = 23. Applying the
formula, we get:
$$\tilde{x} = 59.5 + \frac{10}{27}(35.5 - 23) = 64.13$$
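The two steps (locating the median class, then interpolating) can be sketched as follows, using the class boundaries and frequencies of the grade distribution. The function name `grouped_median` is an assumption made for this sketch.

```python
# Class boundaries and frequencies from the grade distribution.
class_boundaries = [(39.5, 49.5), (49.5, 59.5), (59.5, 69.5), (69.5, 79.5), (79.5, 89.5)]
frequencies = [5, 18, 27, 15, 6]

def grouped_median(boundaries, freqs):
    """Median = L + (w / f_med) * (n/2 - CF), with CF the cumulative frequency before the median class."""
    half = sum(freqs) / 2
    cumulative = 0                                   # CF of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= half:                   # first class whose CF reaches n/2
            return lower + (upper - lower) / f * (half - cumulative)
        cumulative += f
    raise ValueError("frequencies must not be empty")

median_grade = grouped_median(class_boundaries, frequencies)
print(round(median_grade, 2))  # 59.5 + (10 / 27) * (35.5 - 23) = 64.13
```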
Exercise
Values      Frequency
140-150     17
150-160     29
160-170     42
170-180     72
180-190     84
190-200     107
200-210     49
210-220     34
220-230     31
230-240     16
240-250     12
Compute the mean, mode and median.
Measures of Variation/Dispersion
Introduction
• Consider the following two sets of scores:
• Set 1: 40, 50, 60, 60, 40, 50
Set 2: 0, 100, 25, 75, 80, 20
• Both these sets have the same mean (50), but
the second set is a lot more widely dispersed
("scattered") than the first.
[Figure: scores of Set 1 and Set 2 compared]
• In other words the degree to which numerical data tend to
spread about an average value is called dispersion or
variation of the data.
Range
Variance
Standard deviation
Coefficient of variation
The Range
- The difference between the largest (maximum)
and smallest (minimum) values.
Range = Maximum-Minimum
For frequency distributed data, the range is:
The difference between the upper class
boundary of the last class and the lower class
boundary of the first class.
Example 4.1
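Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \quad i = 1, 2, \dots, N$$
For the case of a frequency distribution:
$$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \mu)^2}{N}, \quad i = 1, 2, \dots, k, \quad X_i \text{ are class marks}$$
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \quad i = 1, 2, \dots, n$$
For the case of a frequency distribution:
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n - 1}$$
Sample standard deviation:
$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$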
Example : 4.6
24, 25, 29,29,30,31 Mean= 28. Find variance and
standard deviation
Solution:
Value minus Mean   Difference   Difference Squared
24-28              -4           16
25-28              -3           9
29-28              1            1
29-28              1            1
30-28              2            4
31-28              3            9
Total              0            40
Variance and Standard
Deviation…
Variance = Sum of squared differences / (n - 1)
= 40/5 = 8
Standard deviation = √8 = 2.83
Variance is the mean of the squared differences of the
observations from the mean (for a sample, the sum is
divided by n - 1 rather than n).
The standard deviation is the square root of the
variance.
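The table calculation can be reproduced in a few lines. This is a minimal sketch using the six values of the example; the function name `sample_variance` is illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    return sum(squared_differences) / (n - 1)

data = [24, 25, 29, 29, 30, 31]
variance = sample_variance(data)        # 40 / 5 = 8.0
standard_deviation = math.sqrt(variance)
print(variance, round(standard_deviation, 2))  # 8.0 2.83
```

Python's standard `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a cross-check.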
Individual Assignment
Find the variance, quartiles (Q1, Q2, Q3), quartile deviation,
mean deviation about the mean and about the median, deciles (D1, ..., D9)
and standard deviation of the following sample data. Also
draw a box and whisker plot and a histogram for both data sets.
1. 5, 17, 12, 10, 20, 15, 16, 8
2. The data is given in the form of frequency distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Special properties of standard deviation /variance
$$CV = \frac{S}{\bar{X}} \times 100\% = \frac{1.46}{53.6} \times 100\% = 2.72\%$$
Example
Suppose that the mean weight of a group of
students is 165 pounds with a S.D of 8
pounds. If the height of the same group of
students has a mean of 60 inches with a S.D
of 3 inches, compare the variability in weight
and height measurements.
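Solution: For weight,
$$C.V = \frac{8\ \text{lb}}{165\ \text{lb}} \times 100\% = 4.85\%$$
and for height,
$$C.V = \frac{3\ \text{in}}{60\ \text{in}} \times 100\% = 5\%$$
Relative to their means, the height measurements are therefore slightly more variable than the weight measurements.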
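Because the coefficient of variation is unit-free, the same small function serves for pounds, inches and centimetres. The sketch below recomputes the values of this example and of the garment-length example; the function name is illustrative.

```python
def coefficient_of_variation(standard_deviation, mean):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return 100 * standard_deviation / mean

cv_weight = coefficient_of_variation(8, 165)      # weight in pounds
cv_height = coefficient_of_variation(3, 60)       # height in inches
cv_length = coefficient_of_variation(1.46, 53.6)  # garment lengths in cm

print(round(cv_weight, 2), round(cv_height, 2), round(cv_length, 2))  # 4.85 5.0 2.72
```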
Remarks:
In a positively skewed distribution, smaller
observations are more frequent than larger
observations, i.e. the majority of the observations
have a value below the average.
Measures of Skewness
Karl Pearson's Coefficient of
Skewness (Sk):
$$Sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \quad \text{or} \quad Sk = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
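Both forms of the coefficient can be sketched with Python's standard `statistics` module. The data reused here (24, 25, 29, 29, 30, 31 from the variance example) and the function names are choices made only for illustration.

```python
import statistics

def pearson_skewness_mode(mean, mode, standard_deviation):
    """Sk = (Mean - Mode) / Standard deviation."""
    return (mean - mode) / standard_deviation

def pearson_skewness_median(mean, median, standard_deviation):
    """Sk = 3 * (Mean - Median) / Standard deviation."""
    return 3 * (mean - median) / standard_deviation

data = [24, 25, 29, 29, 30, 31]
mean = statistics.mean(data)        # 28
median = statistics.median(data)    # 29.0
mode = statistics.mode(data)        # 29
sd = statistics.stdev(data)         # sample standard deviation, about 2.83

sk_mode = pearson_skewness_mode(mean, mode, sd)
sk_median = pearson_skewness_median(mean, median, sd)
print(round(sk_mode, 2), round(sk_median, 2))  # -0.35 -1.06
```

Both values are negative because the mean lies below the mode and the median, which indicates a negatively skewed set of values; a positive coefficient would indicate positive skew.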