Basic Biostatistics Part I
November 6, 2024
What is Statistics?
...What is Statistics?
Data Collection
Organization and Presentation of data
Data Analysis
Interpretation of the results
Health statistics are very useful for improving the health situation of the
population of a given country. For example, the following questions could
not be answered correctly unless the health statistics of a given area are
consolidated and given due emphasis.
What is the leading cause of death in the area?
Is it malaria, tuberculosis, etc.?
At what age is the mortality highest, and from what disease?
Are certain diseases affecting specified groups of the population more
than others? (This might apply, for example, to women or children, or
to individuals following a particular occupation.)
Classification of Statistics
Descriptive Statistics
It helps to describe a given set of data without going beyond that data
It consists of collection, organization, summarization, and analysis of
data
Inferential Statistics
It helps to make inference/conclusion about a population based on the
selected sample
It consists of predicting and forecasting values of population parameters,
testing hypotheses about values of population parameters, and making decisions
Designing Surveys
Do people who go to the gym lead a healthier, happier life? How safe
is the city of Addis Ababa? How effective is your HIV-awareness
programme? Questions like these cannot be answered without
the help of statistics.
Surveys require careful design and implementation, considerations
about the survey format, accounting for bias and fatigue, etc.
Data collected from surveys have to be carefully studied by statistical
analysis experts who also use their own discretion and experience to
derive the most meaningful information from a survey.
Through surveys, governments can determine the effectiveness of an
initiative, businesses can understand the response to a particular
product, and social scientists can perform quantitative research.
Epidemiological Studies
Statistical Modeling
Limitation of Statistics
Sources of Data
In the age of information, data has become the driving force behind
decision-making and innovation.
Whether in business, science, healthcare, or government, data serves
as the foundation for insights and progress.
As a researcher, you need to understand the various sources of data as
they are essential for conducting comprehensive and impactful studies.
...Sources of Data
As you delve into the world of data collection, it is important to know
the emerging sources that have gained prominence in recent years.
These newer data sources provide valuable insights and opportunities
for research across various domains. Below are some of these
emerging data sources:
Examples of Emerging Data Sources
Internet of Things (IoT): The Internet of Things (IoT) has changed
data collection in the 21st century through the everyday connection
of devices and objects to the Internet. Smart devices like sensors,
wearables, and home appliances generate vast amounts of data in
real-time.
Social media and web data: Social media platforms and websites
host a wealth of information generated by users worldwide.
Sensor data: Sensor data is becoming increasingly relevant in various
fields, including environmental monitoring, urban planning, and
healthcare.
Zeytu Gashaw Asfaw (PhD), Department of Epidemiology and Biostatistics, School of Public Health, Addis Ababa University
Basics for Bio-statistics Part I, November 6, 2024
Types of variables
Quantitative Data
Discrete
Continuous
Scales of measurement
Levels of Measurements
There are four different scales of measurement. The data can be defined
as being one of the four scales. The four types of scales are:
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Nominal Scale
Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the
ordering and ranking of data without establishing the degree of
variation between them.
Ordinal represents the "order."
Ordinal data is known as qualitative data or categorical data.
It can be grouped, named and also ranked.
Characteristics of the Ordinal Scale
Interval Scale
Ratio Scale
weight
pulse rate
respiratory rate
body temperature (K)
body length in infants or height in adults.
enzyme activity
dose amount
reaction rate
flow rate
concentration
To conclude
Data Quality
The following are some of the key characteristics of high quality data:
1 Data accuracy
2 Data completeness
3 Data consistency
4 Data coherence
5 Data timeliness
6 Clear and accessible data definitions
7 Data relevance
8 Data reliability
Tabulation
Frequency Distribution
...Frequency Distribution
Types of Tabulation
1 Simple Tabulation
2 Complex Tabulation
Simple Tabulation
Simple tabulation is when the information/data are tabulated according to one
characteristic.
For example, a survey that determines the frequency or number of
employees of a firm owning different brands of mobile phones.
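A minimal sketch of simple tabulation, assuming a made-up list of survey responses (the brand labels A to D and the counts are hypothetical, not data from the lecture). It counts how often each category of a single characteristic occurs.

```python
from collections import Counter

# Hypothetical survey responses: the phone brand owned by each employee.
responses = ["A", "B", "A", "C", "B", "A", "D", "A", "C", "B"]


def simple_tabulation(values):
    """Return (category, frequency) pairs for one characteristic, sorted by category."""
    counts = Counter(values)
    return sorted(counts.items())


table = simple_tabulation(responses)
for brand, frequency in table:
    print(f"Brand {brand}: {frequency}")
```

The frequencies always add up to the number of respondents, which is a quick check on any one-way table.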
Histogram
A histogram is a graphical display of data using bars of various
heights.
In a histogram, each bar groups numbers into ranges. Taller bars
show that more data falls in this range.
A histogram displays the form/shape and spread of continuous sample
data.
Frequency Polygon
A frequency polygon is a graph constructed by using lines to join the
midpoints of every interval or bin.
The heights of the points depict the frequencies.
A frequency polygon is usually created from the histogram or by
calculating the midpoints of the bins from the frequency distribution
table.
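The binning behind a histogram and the midpoints used by a frequency polygon can be sketched as follows. The weights and the class limits are hypothetical, and the half-open bins [lower, upper) are an assumption of this sketch, not a rule stated in the slides.

```python
def bin_counts(data, edges):
    """Count observations falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def bin_midpoints(edges):
    """Midpoint of every bin; these are the x-coordinates of the frequency polygon."""
    return [(lower + upper) / 2 for lower, upper in zip(edges, edges[1:])]


# Hypothetical body weights in kg and hypothetical class limits.
weights = [52, 55, 58, 61, 63, 64, 67, 68, 71, 75, 78, 83]
edges = [50, 60, 70, 80, 90]

counts = bin_counts(weights, edges)                      # bar heights of the histogram
polygon_points = list(zip(bin_midpoints(edges), counts)) # points joined by lines
print(counts)
print(polygon_points)
```

Joining the points in `polygon_points` with straight lines gives the frequency polygon for the same table that the histogram displays as bars.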
Frequency Curve
A frequency curve is a smooth curve for which the entire area is taken
to be unity.
It is the limiting form of a histogram or frequency polygon.
The frequency curve for a distribution is obtained by drawing a smooth
freehand curve through the midpoints of the upper
sides of the rectangles forming the histogram.
Line Chart
A line chart is a graphical representation of an asset's historical price
action that connects a series of data points with a continuous line.
This is often the most basic type of chart used in finance and
typically only depicts a security's closing prices over time.
Scatter Diagram
A graph in which the values of two variables are plotted along
two axes, the pattern of the resulting points revealing any correlation
present.
Bar Chart
A bar chart or bar graph is a chart or graph that presents categorical
data with rectangular bars with heights or lengths proportional to the
values that they represent.
The bars can be plotted vertically or horizontally.
A vertical bar chart is usually called a column chart.
Pictogram
A pictogram is a chart that uses pictures to represent data.
Pictograms are set out in the same way as bar charts, but rather
than bars they use columns of pictures to show the numbers
involved.
Pie Chart
A pie chart is a type of graph in which a circle is split into sectors that
each represent a proportion of the whole.
Pie charts are a useful way to organize data in order to see the size of
components relative to the whole and are particularly good at
showing percentage or proportional data.
Map Diagram
A map diagram represents the distribution of an event by
means of diagrams placed on a map, inside the boundaries of each
territorial division, where each diagram expresses the summarized value
of the event within that division.
$\mathrm{WAM} = \dfrac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}$
This formula will be used to calculate the mean and variance for
grouped data
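A minimal sketch of the weighted arithmetic mean applied to grouped data, where the class midpoints play the role of the values and the class frequencies play the role of the weights. The midpoints and frequencies below are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * X_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical grouped data: class midpoints and class frequencies.
midpoints = [55, 65, 75, 85]
frequencies = [3, 5, 3, 1]

grouped_mean = weighted_mean(midpoints, frequencies)
print(round(grouped_mean, 2))
```

With equal weights the function reduces to the ordinary arithmetic mean.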
If $x_1, \ldots, x_n \ge 0$, then
$\mathrm{AM} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} \ge \sqrt[n]{x_1 x_2 \cdots x_n} = \mathrm{GM}$
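A small numerical check of the inequality between the arithmetic mean and the geometric mean, assuming non-negative inputs. The two values used are hypothetical.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by their number."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of n non-negative values."""
    if any(v < 0 for v in values):
        raise ValueError("the geometric mean needs non-negative values")
    return math.prod(values) ** (1 / len(values))


values = [2, 8]  # hypothetical data
am = arithmetic_mean(values)
gm = geometric_mean(values)
print(am, gm, am >= gm)
```

The two means coincide only when all values are equal.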
Median
The Median is the midpoint of the values after they have been
ordered from the smallest to the largest
Equivalently, the Median is a number which divides the data set into
two equal parts, each item in one part is no more than this number,
and each item in another part is no less than this number.
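A minimal sketch of the median as the midpoint of the ordered values. For an even number of observations it averages the two middle values, which is the usual convention and is assumed here. The sample lists are hypothetical.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([7, 3, 9])      # ordered: 3, 7, 9
even_median = median([7, 3, 9, 5])  # ordered: 3, 5, 7, 9
print(odd_median, even_median)
```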
Properties of Median
Mode
The mode is the number that occurs most often in a data set.
The number that has the highest frequency.
It is the value which occurs the maximum number of times in a given
data set.
Example - Mode
The exam scores for ten students are: 81, 93, 84, 75, 68, 87, 81, 75,
81, 87
The score of 81 occurs the most often. It is the Mode!
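The exam-score example can be checked with a short sketch. It returns every value that attains the highest frequency, since a data set may have more than one mode; that design choice is an assumption of this sketch.

```python
from collections import Counter

# Exam scores of the ten students from the example.
scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]


def modes(values):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


score_modes = modes(scores)
print(score_modes)
```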
Properties of Mode
Measures of dispersion
Measure of dispersion:
Absolute: Measure the dispersion in the original unit of the data.
Variability in 2 or more distributions can be compared provided they
are given in the same unit and have the same average.
Relative: Measure of dispersion is free from the unit of measurement of
the data.
It is the ratio of a measure of absolute dispersion to the average
from which absolute deviations are measured.
It is called the coefficient of dispersion.
Range
Characteristics of Range
$Q_1 = \left(\dfrac{n+1}{4}\right)^{\text{th}}$ ordered observation
$Q_2 = \left(\dfrac{2[n+1]}{4}\right)^{\text{th}}$ ordered observation
$Q_3 = \left(\dfrac{3[n+1]}{4}\right)^{\text{th}}$ ordered observation
Interquartile Range (IQR): The difference between the 3rd and 1st
quartile. $\mathrm{IQR} = Q_3 - Q_1$
Semi Interquartile Range: $\dfrac{Q_3 - Q_1}{2}$
Coefficient of quartile deviation: $\dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
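A sketch of the quartile positions k(n+1)/4 and the measures derived from them. When a position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered observations; that interpolation rule is an assumption, since the slides do not say how fractional positions are handled. The data are hypothetical.

```python
def ordered_observation(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in the ordered data."""
    n = len(sorted_values)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)
    fraction = position - lower
    if lower >= n:
        return float(sorted_values[-1])
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions k(n+1)/4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(ordered_observation(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


data = [2, 4, 6, 8, 10, 12, 14]  # hypothetical data, n = 7
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
semi_iqr = iqr / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q2, q3, iqr, semi_iqr, coefficient_qd)
```

Statistical packages use several slightly different quartile rules, so results from software may differ from this hand method for small samples.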
Interquartile Range
Merits:
It is superior to range as a measure of dispersion.
It has a special utility in measuring variation in the case of open-end
distributions or ones in which the data may be ranked but not measured quantitatively.
Useful in erratic or badly skewed distribution.
The Quartile deviation is not affected by the presence of extreme
values.
...Interquartile Range
Limitations:
As the value of quartile deviation does not depend upon every item of
the series, it can't be regarded as a good method of measuring
dispersion.
It is not capable of mathematical manipulation.
Its value is very much affected by sampling fluctuation.
Z-score
For continuous grouped data: $m_1, \ldots, m_n$ are the class midpoints with
corresponding class frequencies $f_1, \ldots, f_n$
$\mathrm{MAD} = \dfrac{\sum f_i \, |m_i - \bar{X}|}{\sum f_i}$
$\bar{X}$: Mean of the data series.
Coeff. of MAD $= \dfrac{\mathrm{MAD}}{\text{Average}}$
Average: The average from which the deviations are calculated.
It is a relative measure of dispersion and is comparable to similar
measure of other series.
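A minimal sketch of the mean absolute deviation for grouped data and its coefficient, taking the mean as the average from which deviations are measured. The class midpoints and frequencies are hypothetical.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(f_i * m_i) / sum(f_i)."""
    return sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)


def grouped_mad(midpoints, frequencies):
    """Mean absolute deviation about the mean: sum(f_i * |m_i - mean|) / sum(f_i)."""
    mean = grouped_mean(midpoints, frequencies)
    weighted_deviations = sum(f * abs(m - mean) for f, m in zip(frequencies, midpoints))
    return weighted_deviations / sum(frequencies)


# Hypothetical class midpoints and frequencies.
class_midpoints = [10, 20, 30]
class_frequencies = [1, 2, 1]

mean_value = grouped_mean(class_midpoints, class_frequencies)
mad = grouped_mad(class_midpoints, class_frequencies)
coefficient_of_mad = mad / mean_value  # relative measure, free of units
print(mean_value, mad, coefficient_of_mad)
```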
Measure of Shape
Skewness
Kurtosis
What is Probability?
Random Experiments
Sample space
Events
1 Mutually exclusive events (Disjoint events)
2 Equally likely events - equal chance to occur.
3 Favourable events - the number of outcomes favourable to an event in
an experiment is the number of outcomes which entail the happening
of the event
4 Exhaustive events - outcomes are said to be exhaustive when they
include all possible outcomes.
5 Independent events - if the occurrence or non-occurrence of an event
does not affect the occurrence or non-occurrence of the other.
Counting rules
Addition Rule
Multiplication Principle
Permutation
Combination
Addition Rule
Multiplication Principle
Permutation
...Permutation
Combination
...Combination
Limitation
If it is not possible to enumerate all the possible outcomes for an
experiment.
If the sample points (outcomes) are not mutually independent.
If the total number of outcomes is infinite.
If each and every outcome is not equally likely.
                 D    R    I   Row Totals
Executive (E)    5   34    9       48
Worker (W)      63   21   27      111
Column Totals   68   55   36      159
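Probabilities can be read off a two-way table like the one above as relative frequencies. The sketch below recomputes the totals and a few probabilities; the column labels D, R and I are used exactly as they appear in the table, and the choice of which probabilities to compute is only an illustration.

```python
# Cell counts from the table: rows are job categories, columns are D, R, I.
counts = {
    "Executive": {"D": 5, "R": 34, "I": 9},
    "Worker": {"D": 63, "R": 21, "I": 27},
}
columns = ["D", "R", "I"]

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
column_totals = {col: sum(cells[col] for cells in counts.values()) for col in columns}
grand_total = sum(row_totals.values())


def probability(count):
    """Relative frequency with the grand total as the denominator."""
    return count / grand_total


p_executive = probability(row_totals["Executive"])           # marginal probability
p_d = probability(column_totals["D"])                        # marginal probability
p_executive_and_d = probability(counts["Executive"]["D"])    # joint probability
print(round(p_executive, 3), round(p_d, 3), round(p_executive_and_d, 3))
```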
Example
Suppose that of 158 people who attended a dinner party, 99 were ill
due to food poisoning.
The probability of illness for a person selected at random is
$\Pr(\text{illness}) = \dfrac{99}{158} \approx 0.63$, or 63%
Axiomatic Approach
Subjective Approach
Conditional probability
Conditional Events: If the occurrence of one event affects the
probability of occurrence of another event, then the two events are
conditional or dependent events.
The formula for calculating a sample conditional probability is easy to
use:
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
Independence
Two events, A and B, are independent if the occurrence or
non-occurrence of either one does not affect the probability of the
occurrence of the other.
Two events A and B are independent if and only if
$P(A \cap B) = P(A) \times P(B)$
Example 1: Given that $P(A) = 0.4$, $P(B) = 0.2$ and $P(A \cap B) = 0.08$, are A and B
independent?
Solution:
$P(A) \times P(B) = 0.4 \times 0.2 = 0.08 = P(A \cap B)$
Hence, A and B are independent
Independence
Example 2: $P(C) = 0.5$, $P(D) = 0.3$, $P(C \cap D) = 0.1$. Are C and D
independent?
Solution: $P(C) \times P(D) = 0.5 \times 0.3 = 0.15$
$P(C \cap D) = 0.1 \ne 0.15$
Hence, C and D are dependent
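The independence check compares the product of the two marginal probabilities with the joint probability. The sketch below uses a small tolerance because floating-point products are rarely exact; the function name and the tolerance are assumptions of this sketch. The first call uses a joint probability of 0.08 for A and B, which is assumed here for illustration.

```python
import math


def are_independent(p_a, p_b, p_a_and_b, tolerance=1e-9):
    """True when P(A and B) equals P(A) * P(B) up to a small tolerance."""
    return math.isclose(p_a * p_b, p_a_and_b, rel_tol=0.0, abs_tol=tolerance)


# Assumed joint probability of 0.08 for events A and B.
a_b_independent = are_independent(0.4, 0.2, 0.08)

# Events C and D: the product 0.15 differs from the joint probability 0.1.
c_d_independent = are_independent(0.5, 0.3, 0.1)

print(a_b_independent, c_d_independent)
```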
$P(\text{only } X) = P(A \cap B')$
$= P(A) - P(A \cap B) = 0.06$
$P(\text{only } Y) = P(B \cap A')$
$= P(B) - P(A \cap B) = 0.03$
Unconditional Probability
When the size of the total group or grand total (n) serves as the
denominator to calculate a probability, the probability is termed as
unconditional probability.
Example: A study was conducted to investigate the effect of
prolonged exposure to bright light on retinal damage in premature
infants. Eighteen out of 21 premature infants exposed to bright light
developed retinopathy, while 21 of 39 premature infants exposed to
reduced light level developed retinopathy. What is the probability of
developing retinopathy?
Unconditional Probability
$P(\text{retinopathy}) = \dfrac{\text{No. of infants with retinopathy}}{\text{Total number of infants}} = \dfrac{18 + 21}{21 + 39} = \dfrac{39}{60} = 0.65$
Conditional probabilities are probabilities that are based on the
knowledge that some other event has occurred. In this case, a subset of
the total group is taken as the denominator.
We want to compare the probability of retinopathy given that the
infant was exposed to bright light with the probability given that the
infant was exposed to reduced light.
Exposure to bright-light and exposure to reduced-light are
conditioning events (i.e. events we want to take into account when
calculating conditional probabilities)
Conditional probabilities are denoted by $P(A \mid B)$ (read as "probability
of A given B") or P(Event | Conditioning event).
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$, if $P(B) > 0$
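Applied to the retinopathy study, the conditional probabilities use the size of each exposure group as the denominator, while the unconditional probability uses the grand total. A minimal sketch with the counts from the example (the function and variable names are illustrative):

```python
def conditional_probability(n_event_and_condition, n_condition):
    """P(event | condition) estimated as a count ratio within the conditioning group."""
    if n_condition <= 0:
        raise ValueError("the conditioning group must not be empty")
    return n_event_and_condition / n_condition


# Counts from the example: 18 of 21 infants under bright light and
# 21 of 39 infants under reduced light developed retinopathy.
p_given_bright = conditional_probability(18, 21)
p_given_reduced = conditional_probability(21, 39)
p_unconditional = (18 + 21) / (21 + 39)

print(round(p_given_bright, 3), round(p_given_reduced, 3), round(p_unconditional, 2))
```

In this sample the estimated probability of retinopathy is higher in the bright-light group (about 0.86) than in the reduced-light group (about 0.54).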
Example
Questions
Answer
$\Pr(c > 100) = \dfrac{25}{75} = 0.33$.
...Answer
$\Pr(f \le 100) = \dfrac{7 + 20}{36} = \dfrac{27}{36} = 0.75$
References
Thank You!!!