Introduction To Biostatistics
Biostatistics
Part One
1. Basic Concepts
2. Data & Their Presentation
1. Basic Concepts
• Statistics
• Biostatistics
• Populations and samples
• Statistics and parameters
• Statistical inferences
• Variables
• Random Variables
• Simple random sample
Statistics and Biostatistics
• The field of statistics: The study and use of theory and methods for the analysis of data
arising from random processes or phenomena. The study of how we make sense of
data.
• The field of statistics provides some of the most fundamental tools and techniques of
the scientific method
• forming hypotheses
• gathering data
• summarizing data
• A statistic (as distinct from the field of “statistics”) is a numerical quantity
computed from sample data (e.g., the mean, the median, the maximum)
We will study what to do and how to do it, but it is equally important to
understand why the methods are appropriate and which concepts justify
those methods
Populations and Samples
• A population is the collection or set of all of the values
that a variable may have.
• A sample is a part of a population.
• We use the data from the sample to make inferences about
the population of interest.
• The sample mean is not the true mean but might be very
close.
• Closeness depends on sample size.
[Figure: a sample drawn from the population of interest]
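The point about closeness and sample size can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of birth weights is made up, and `mean_abs_error` is a hypothetical helper name, not something from the course material.

```python
import random
import statistics

def mean_abs_error(population, sample_size, repeats, rng):
    """Average |sample mean - population mean| over repeated random samples."""
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)  # drawn without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 10,000 made-up birth weights in grams
population = [rng.gauss(3000, 500) for _ in range(10_000)]

# Larger samples should, on average, land closer to the true mean
for n in (10, 100, 1000):
    print(n, round(mean_abs_error(population, n, 200, rng), 1))
```

The printed error typically shrinks as the sample size grows, which is the behaviour the slide describes.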
Sampling is defined as the process of selecting certain members or a subset of the
population to make statistical inferences from them and to estimate characteristics
of the whole population.
Probability Sampling: Probability sampling is a sampling method that selects
members of a population at random according to a few pre-set selection criteria.
These criteria give every member a known chance (in the simplest designs, an
equal chance) of being part of a sample.
Non-probability Sampling: Non-probability sampling relies on the researcher’s
judgment or convenience rather than on random selection. There is no fixed or
pre-defined selection process, which makes it difficult for all elements of a
population to have an equal opportunity to be included in a sample.
Sampling Approaches-1
• Convenience Sampling: select the most accessible and available subjects in
target population. Inexpensive, less time consuming, but sample is nearly
always non-representative of target population.
• Highly vulnerable to selection bias and influences beyond the control of the
researcher
• Studies that use convenience sampling have little credibility for the reasons
above
Simple Random Sampling: One of the best probability sampling techniques, and one that helps save time and resources, is
the simple random sampling method. It is a trustworthy method of obtaining information in which every member of the
sample is chosen randomly, merely by chance, and each individual in the population has exactly the same probability of
being chosen to be a part of a sample.
For example, in an organization of 500 employees, if the HR team decides on conducting team building activities, it is
highly likely that they would prefer picking chits out of a bowl. In this case, each of the 500 employees has an equal
opportunity of being selected.
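Drawing chits from a bowl can be sketched in a few lines. This is a minimal illustration, assuming employee IDs 1 to 500 and a sample size of 20; the slide does not state a sample size, so that number is made up.

```python
import random

employee_ids = list(range(1, 501))  # hypothetical IDs for the 500 employees
rng = random.Random(42)             # fixed seed so the draw is repeatable

# random.sample draws without replacement, so every employee has the same
# chance of selection and nobody is picked twice
chosen = rng.sample(employee_ids, 20)  # 20 is an assumed sample size
print(sorted(chosen))
```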
Cluster Sampling: Cluster sampling is a method where the researchers divide the entire population into sections or
clusters that represent a population. Clusters are identified and included in a sample on the basis of defining demographic
parameters such as age, location, sex etc. which makes it extremely easy for a survey creator to derive effective inference
from the feedback.
For example, if the government of the United States wishes to evaluate the number of immigrants living in the
US, they can divide it into clusters on the basis of states such as California, Texas, Florida, Massachusetts, Colorado,
Hawaii etc. This way of conducting a survey will be more effective as the results will be organized into states and provide
insightful immigration data.
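In the usual textbook form of cluster sampling, a random subset of clusters is selected and everyone in the chosen clusters is surveyed. The sketch below shows that one-stage version. The respondent labels and the helper `cluster_sample` are hypothetical and only illustrate the idea.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """One-stage cluster sample: pick whole clusters at random, keep every member."""
    chosen_names = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen_names for m in clusters[name]]
    return chosen_names, members

# Hypothetical respondents grouped by state (made-up labels)
clusters = {
    "California": ["CA-1", "CA-2", "CA-3"],
    "Texas": ["TX-1", "TX-2"],
    "Florida": ["FL-1", "FL-2", "FL-3", "FL-4"],
    "Colorado": ["CO-1"],
}

rng = random.Random(7)
states, people = cluster_sample(clusters, 2, rng)
print(states, people)
```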
• Systematic Sampling: Using the systematic sampling method, members of a sample are chosen at regular intervals of a
population. It requires selecting a starting point and a fixed interval, derived from the sample size, that is repeated
across the population. Because the interval is predefined, this is among the least time-consuming sampling techniques.
• For example, a researcher intends to collect a systematic sample of 500 people in a population of 5000. Each element of
the population will be numbered from 1-5000 and every 10th individual will be chosen to be a part of the sample (total
population / sample size = 5000 / 500 = 10).
• Stratified Random Sampling: Stratified random sampling is a method where the population is divided into smaller
groups that don’t overlap but together represent the entire population. While sampling, these groups can be organized and
a sample drawn from each group separately.
• For example, a researcher looking to analyze the characteristics of people belonging to different annual income divisions
will create strata (groups) according to annual family income, such as less than $20,000, $21,000 to $30,000, $31,000 to
$40,000, $41,000 to $50,000 etc. People belonging to different income groups can then be observed to draw conclusions
about which income strata have which characteristics. Marketers can analyze which income groups to target and which ones
to eliminate in order to create a roadmap that is more likely to bear fruitful results.
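Systematic and stratified sampling can be sketched as follows. This is an illustration only: the function names are hypothetical, the random start within the first interval is a common convention rather than something stated on the slide, and the stratum sizes and 10% sampling fraction are made up.

```python
import random

def systematic_sample(units, sample_size, rng):
    """Pick a random start within the first interval, then every k-th unit."""
    k = len(units) // sample_size   # sampling interval
    start = rng.randrange(k)        # random starting index in 0..k-1
    return units[start::k][:sample_size]

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    return {
        name: rng.sample(members, max(1, round(len(members) * fraction)))
        for name, members in strata.items()
    }

rng = random.Random(1)

# Systematic: 500 people out of 5000, so every 10th individual is chosen
population = list(range(1, 5001))
sys_sample = systematic_sample(population, 500, rng)
print(sys_sample[:5])

# Stratified: hypothetical income strata with made-up member IDs
strata = {
    "less than $20,000": list(range(0, 300)),
    "$21,000 to $30,000": list(range(300, 500)),
    "$31,000 to $40,000": list(range(500, 600)),
}
strat_sample = stratified_sample(strata, 0.10, rng)
print({name: len(members) for name, members in strat_sample.items()})
```

Sampling the same fraction from each stratum keeps the strata in the sample in the same proportions as in the population; other allocations are possible.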
Sampling Error
• The discrepancy between the true population parameter and the
sample statistic
• We may define data as figures. Figures result from the process of counting
or from taking a measurement.
• Example:
Data Sources: Records, Reports and Other Sources
Look for data to serve as the raw material for our investigation.
Types of Data
Categorical Variables
• Any variable that is not numerical (values have no numerical meaning) (e.g. gender, race,
drug, disease status)
• Nominal variables
• The data are unordered (e.g. RACE: 1=Caucasian, 2=Asian American, 3=African
American, 4=others)
• A subset of these variables are binary or dichotomous variables: they have only
two categories (e.g. GENDER: 1=male, 2=female)
• Ordinal variables
• The data are ordered (e.g. AGE: 1=10-19 years, 2=20-29 years, 3=30-39 years; likelihood
of participating in a vaccine trial; INCOME: low, medium, high)
Frequency Tables
• Categorical variables are summarized by
• Frequency counts – how many are in each category
• Relative frequency or percent (a number from 0 to 100)
• Or proportion (a number from 0 to 1)
[Table: Gender of new HIV clinic patients, 2006-2007, Mbarara, Uganda]
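The three summaries in a frequency table (count, percent, proportion) can be computed as sketched below. The gender values are made up for illustration and are not the Mbarara clinic data; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Return count, proportion (0 to 1) and percent (0 to 100) per category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: {"count": n, "proportion": n / total, "percent": 100 * n / total}
        for category, n in counts.items()
    }

# Made-up data: 6 female and 4 male patients
genders = ["female"] * 6 + ["male"] * 4
table = frequency_table(genders)

for category, row in table.items():
    print(f"{category:<8}{row['count']:>4}{row['percent']:>7.1f}%{row['proportion']:>7.2f}")
```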
Numerical Variables (Quantitative)
• Continuous variables: can take any value within a given range (e.g. weight:
2974.5 g, 3012.6 g)
Manipulation of Variables
• Continuous variables can be discretized
• E.g., age can be rounded to whole numbers
• Continuous or discrete variables can be categorized
• E.g., age categories
• Categorical variables can be re-categorized
• E.g., lumping from 5 categories down to 2
Categorization
• Continuous variables can be categorized in meaningful ways
• Choice of cut-off points
• Even intervals (5 year age intervals)
• Meaningful cut-points related to a health outcome or
decision
• Meaningful CD4 count (below 200, 200-350, 350-500,
500+)
• Equal percentage of the data falling into each category
(quartiles, centiles,..)
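Both kinds of cut-off choice can be sketched with the standard library. The CD4 values below are made up, `categorize` is a hypothetical helper, and assigning a value that sits exactly on a cut-point to the upper category is an assumption; the slide does not say how boundaries are handled.

```python
import bisect
import statistics

def categorize(value, cut_points, labels):
    """Return the label of the interval that contains value.

    cut_points must be sorted; a value equal to a cut-point is assigned
    to the upper interval (an assumed convention).
    """
    return labels[bisect.bisect_right(cut_points, value)]

# Clinically meaningful CD4 cut-points from the slide
cd4_cuts = [200, 350, 500]
cd4_labels = ["<200", "200-349", "350-499", "500+"]

cd4_counts = [50, 180, 200, 349, 350, 420, 500, 730]  # made-up values
categories = [categorize(x, cd4_cuts, cd4_labels) for x in cd4_counts]
print(categories)

# Data-driven alternative: quartile cut-points, so that roughly equal
# percentages of the data fall into each category
quartile_cuts = statistics.quantiles(cd4_counts, n=4)
print(quartile_cuts)
```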
Organizing Data and Presentation
Some common methods:
• Frequency Table
• Frequency Histogram
• Relative Frequency Histogram
• Frequency polygon
• Relative Frequency polygon
• Bar chart
• Pie chart
• Box plot
• Scatter plots.
Frequency Tables
Histograms
• Bar chart for numerical data – The number of bins and
the bin width will make a difference in the appearance
of this plot and may affect interpretation
[Figure: histogram of frequency by age in years (bins 20-, 30-, 40-, 50-, 60-69), with separate bars for female and male]
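The effect of bin width on a histogram can be seen by counting the same data with two different widths. The ages below are made up, and `histogram_counts` is a hypothetical helper that assumes left-closed bins.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Count values per bin [start + i*width, start + (i+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    # Key each bin by its lower edge; empty bins are omitted
    return {start + i * width: counts[i] for i in sorted(counts)}

ages = [21, 24, 28, 31, 33, 35, 38, 42, 47, 51, 55, 63]  # made-up ages in years

print(histogram_counts(ages, 20, 10))  # ten-year bins
print(histogram_counts(ages, 20, 25))  # much wider bins hide the shape
```

With ten-year bins the peak in the thirties is visible; with 25-year bins the same data collapse into two bars.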
Bar Charts
• General graph for categorical variables
• Graphical equivalent of a frequency table
• The x-axis does not have to be numerical
[Figure: bar chart of alcohol consumption in Mulago Hospital patients enrolling in VCT study, n=929; y-axis: proportion (0 to 0.5); categories: Never, >1 year ago, Within the past year]
• What Does This Graph Tell Us?
[Figure: days drank alcohol among current drinkers; x-axis: days (0 to 30); y-axis: 0 to .25]
Box Plots
• Middle line = median (50th percentile)
• Middle box = 25th to 75th percentiles (interquartile range, IQR)
• Bottom whisker: smallest data point at or above 25th percentile – 1.5*IQR
• Top whisker: largest data point at or below 75th percentile + 1.5*IQR
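The quantities drawn in a box plot can be computed directly. This is a sketch under assumptions: the data are made up, `box_plot_stats` is a hypothetical helper, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`; statistical packages differ in how they interpolate percentiles, so the box edges may vary slightly between tools.

```python
import statistics

def box_plot_stats(values):
    """Median, quartiles, whisker ends and outliers using the 1.5*IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "bottom_whisker": min(inside),  # smallest point at or above the low fence
        "top_whisker": max(inside),     # largest point at or below the high fence
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

days = [0, 1, 2, 2, 3, 4, 5, 6, 8, 30]  # made-up days of alcohol use
stats = box_plot_stats(days)
print(stats)
```

Points beyond the whiskers, such as the value 30 in this made-up data, are usually plotted individually as outliers.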
Box Plots
[Figure: box plot of CD4 count among new HIV positives at Mulago; y-axis: 0 to 1,500]
Box Plots By Another Variable
• We can divide up our graphs by another variable
• What type of variable is gender?
[Figure: days consumed alcohol of prior 30, shown in separate panels for male and female; y-axis: relative frequency (0 to .3); x-axis: 0 to 30]
Scatter Plots
[Figure: scatter plot of CD4 cell count (0 to 1500) versus age (10 to 60); x-axis label: "a4. how old are you?"]
Part Two