Introduction To Biostatistics

The document provides an introduction to biostatistics, covering basic concepts such as statistics, populations, samples, and various sampling methods. It emphasizes the importance of biostatistics in health sciences, differentiating it from broader biological applications, and discusses data types and their presentation. Key topics include descriptive and inferential statistics, variables, and methods for organizing and presenting data.

Uploaded by

tapanmanna11111
Copyright
© All Rights Reserved

Introduction to Biostatistics
Part One
1. Basic Concepts
2. Data & Their Presentation
1. Basic Concepts

• Statistics
• Biostatistics
• Populations and samples
• Statistics and parameters
• Statistical inferences
• Variables
• Random Variables
• Simple random sample
Statistics and Biostatistics

• The field of statistics: the study and use of theory and methods for the analysis of data
arising from random processes or phenomena; in short, the study of how we make sense of
data.

• The field of statistics provides some of the most fundamental tools and techniques of
the scientific method:
• forming hypotheses
• designing experiments and observational studies
• gathering data
• summarizing data
• drawing inferences from data (e.g., testing hypotheses)


Statistic

• The word "statistic" (as opposed to the field of "statistics") also refers to a numerical
quantity computed from sample data (e.g., the mean, the median, the maximum)

• Roughly speaking, the field of statistics can be divided into
• Mathematical Statistics: the study and development of statistical theory and methods in
the abstract; and
• Applied Statistics: the application of statistical methods to solve real problems involving
randomly generated data, and the development of new statistical methodology motivated
by real problems
Biostatistics
• Biostatistics is the branch of applied statistics directed toward
applications in the health sciences and biology

• Biostatistics is sometimes distinguished from the field of biometry
based upon whether applications are in the health sciences (biostatistics)
or in broader biology (biometry; e.g., agriculture, ecology, wildlife biology)

• Other branches of (applied) statistics: psychometrics, econometrics,
chemometrics, astrostatistics, environmetrics, etc.
Why biostatistics? What's the difference?
• Because some statistical methods are more heavily used in health
applications than elsewhere (e.g., survival analysis, longitudinal data
analysis)

• Because examples are drawn from the health sciences
• Makes the subject more appealing to those interested in health
• Illustrates how to apply methodology to similar problems encountered in real life
We will emphasize the methods of data analysis, but some basic theory
will also be necessary to enhance understanding of the methods and to
allow further coursework

We will study what to do and how to do it, but it is equally important to
understand why the methods are appropriate and which concepts justify
them
Populations and Samples
• A population is the collection or set of all of the values
that a variable may have.
• A sample is a part of a population.
• We use the data from the sample to make inferences about
the population.
• The sample mean is not the true population mean, but it might be very
close.
• Closeness depends on sample size.
[Figure: a sample drawn from the population of interest]
Sampling is the process of selecting a subset of the population in order to make
statistical inferences from it and to estimate characteristics of the whole
population.
Probability Sampling: Probability sampling selects members of a population at
random according to a predefined procedure, so that every member has a known,
nonzero chance of being included in the sample (an equal chance, in the case of
simple random sampling).
Non-probability Sampling: Non-probability sampling relies on the researcher's
judgment or convenience rather than on random selection. Because there is no
fixed or predefined selection process, the elements of the population do not all
have an equal opportunity to be included in the sample.
Sampling Approaches-1
• Convenience Sampling: select the most accessible and available subjects in the
target population. Inexpensive and less time-consuming, but the sample is nearly
always non-representative of the target population.

• Advantages of Convenience Sampling
• Simplicity of sampling and ease of research
• Helpful for pilot studies and for hypothesis generation
• Data can be collected in a short period of time
• Cheaper to implement than alternative sampling methods

• Disadvantages of Convenience Sampling
• Highly vulnerable to selection bias and influences beyond the control of the
researcher
• High level of sampling error
• Studies that use convenience sampling have little credibility for the reasons
above
Simple Random Sampling: Simple random sampling is one of the most basic probability sampling techniques and can save time
and resources. Every member of the sample is chosen purely by chance, and each individual in the population has exactly the
same probability of being selected.
For example, in an organization of 500 employees, if the HR team decides to conduct team-building activities, it could
select participants by picking chits out of a bowl. In this case, each of the 500 employees has an equal
opportunity of being selected.
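The chit-drawing example can be sketched in a few lines of Python. This is only an illustration: the employee IDs, the sample size of 20, and the function name `simple_random_sample` are assumptions, not part of the original slides.

```python
import random

# Hypothetical roster: employee IDs 1..500 (the 500 employees in the example).
employees = list(range(1, 501))

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; each member is equally likely to be chosen."""
    rng = random.Random(seed)  # the seed only makes the draw reproducible
    return rng.sample(population, k)

# A sample size of 20 is an assumption made for illustration.
team = simple_random_sample(employees, k=20, seed=1)
print(sorted(team))
```

Sampling is done without replacement here, which matches drawing chits from a bowl without putting them back.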
Cluster Sampling: Cluster sampling is a method where the researchers divide the entire population into sections, or
clusters, each of which represents the population. Clusters are identified and included in a sample on the basis of defining
parameters such as age, location, or sex, which makes it easier for a survey creator to derive inferences
from the feedback.
For example, if the government of the United States wishes to evaluate the number of immigrants living in the
US, it can divide the country into clusters on the basis of states such as California, Texas, Florida, Massachusetts, Colorado,
Hawaii, etc. Conducting the survey this way can be more effective because the results are organized by state and provide
state-level immigration data.
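As a rough sketch of the idea, the snippet below groups units by state and then selects whole clusters at random. It assumes one-stage cluster sampling (every member of a chosen cluster is kept), which the slides do not specify; the household IDs and the function name `cluster_sample` are made up for illustration.

```python
import random

# Hypothetical clusters: state name -> household IDs (made-up data).
clusters = {
    "California": ["CA-1", "CA-2", "CA-3"],
    "Texas": ["TX-1", "TX-2"],
    "Florida": ["FL-1", "FL-2", "FL-3"],
    "Colorado": ["CO-1", "CO-2"],
}

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sampling: pick whole clusters at random, keep all members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

sampled = cluster_sample(clusters, n_clusters=2, seed=0)
for state, households in sampled.items():
    print(state, households)
```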
• Systematic Sampling: In systematic sampling, members of a sample are chosen at regular intervals from the
population. It requires a starting point and a fixed sampling interval, which is determined by the sample size.
Because the interval is predefined, this is among the least time-consuming sampling techniques.

• For example, a researcher intends to collect a systematic sample of 500 people from a population of 5000. Each element of
the population is numbered from 1 to 5000, and every 10th individual is chosen to be part of the sample (total
population / sample size = 5000/500 = 10).
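The 5000/500 example can be written out as a short Python sketch. The random choice of starting point is an assumption (the slides only say a starting point must be selected), and `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Take every k-th element after a random start, where k = N // n."""
    interval = len(population) // sample_size        # 5000 // 500 = 10
    start = random.Random(seed).randrange(interval)  # random start within the first interval
    return population[start::interval][:sample_size]

population = list(range(1, 5001))  # elements numbered 1-5000
sample = systematic_sample(population, 500, seed=42)
print(len(sample), sample[:3])
```

Whatever the starting point, consecutive sampled elements are always exactly 10 apart.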

• Stratified Random Sampling: Stratified random sampling is a method where the population is divided into smaller
groups (strata) that do not overlap but together represent the entire population. A sample is then drawn from each
group separately.

• For example, a researcher looking to analyze the characteristics of people belonging to different annual income divisions
will create strata (groups) according to annual family income, such as less than $20,000, $20,000 to $29,999, $30,000 to
$39,999, $40,000 to $49,999, etc. People belonging to different income groups can then be observed to draw conclusions about
which income strata have which characteristics. Marketers can analyze which income groups to target and which ones to
drop when planning a campaign.
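A minimal sketch of stratified sampling by income band follows. The incomes are invented, the band boundaries and the top band "50k and over" are assumptions, and drawing the same number from each stratum is only one of several possible allocation choices.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=None):
    """Group records into non-overlapping strata, then sample each stratum separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    return {name: rng.sample(members, min(per_stratum, len(members)))
            for name, members in strata.items()}

def income_band(income):
    """Hypothetical $10,000-wide bands; the bottom and top bands are open-ended."""
    if income < 20_000:
        return "under 20k"
    if income >= 50_000:
        return "50k and over"
    low = (income // 10_000) * 10
    return f"{low}k-{low + 10}k"

# Made-up annual family incomes, for illustration only.
incomes = [12_000, 18_500, 23_000, 27_500, 29_000, 34_000, 38_000, 41_000, 47_000, 65_000]
sample = stratified_sample(incomes, income_band, per_stratum=2, seed=7)
for band, members in sample.items():
    print(band, members)
```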
Sampling Error
• The discrepancy between the true population parameter and the
sample statistic

• Sampling error likely exists in most studies, but can be reduced
by using larger sample sizes

• Sampling error shrinks roughly in proportion to 1/√n, where n is the sample
size (the standard error of a sample mean is σ/√n, with σ the population
standard deviation)

• Note that larger sample sizes also require time and expense to
obtain, and that large sample sizes do not eliminate sampling error
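To make the 1/√n relationship concrete, here is a small sketch that evaluates the standard error of a sample mean, σ/√n, for a few sample sizes. The value σ = 15 is a hypothetical population standard deviation chosen for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 15.0  # hypothetical population standard deviation
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
```

Each fourfold increase in n halves the standard error, which is why precision gets progressively more expensive to buy with sample size.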
Parameters vs. Statistics

• A parameter is a population characteristic
• A statistic is a sample characteristic
• Example: we compute the sample mean to estimate the true population mean
• the sample mean is a ‘statistic’
• the population mean is a ‘parameter’
Descriptive & Inferential Statistics

Descriptive Statistics deal with the enumeration, organization and graphical
representation of data from a sample

Inferential Statistics deal with reaching conclusions from incomplete
information, that is, generalizing from the specific sample

Inferential statistics use available information in a sample to draw inferences
about the population from which the sample was selected
Variables
• A variable is an object, characteristic or property that can have
different values in different places, persons, or things.

• A quantitative variable can be measured in some way.
• Examples: heart rate, height, weight, age, size of tumor, volume of a dose.

• A qualitative (categorical) variable is characterized by its inability to
be measured, but it can be sorted into categories.
• Examples: gender, race, drug name, disease status.
Types of Variables
Random Variables
• A random variable is one whose value cannot be predicted in advance
because it arises by chance. Observations or measurements are used
to obtain the value of a random variable.

• A discrete random variable has gaps or interruptions in the values that
it can have.
• The values may be whole numbers or have spaces between them.

• A continuous random variable does not have gaps in the values it
can assume.
• Its properties are like those of the real numbers.
2. Data and Their Presentation
• Data
• Data sources
• Records
• Surveys
• Experiments
• Types of data
• Categorical variables
• Frequency tables
• Numerical variables
• Categorization
• Bar charts
• Histograms
• Box plots
• Bar charts by another variable
• Histogram by another variable
• Box plots by another variable
• Scatter plots
Data
• Data are observations of random variables made on the elements of a population
or sample
• Data are the quantities (numbers) or qualities (attributes) measured or
observed that are to be collected and/or analyzed
• The word “data” is plural; “datum” is singular
• A collection of data is often called a data set (singular)
Data
• The raw material of Statistics is data.

• We may define data as figures. Figures result from the process of counting
or from taking a measurement.

• Examples:
• When a hospital administrator counts the number of patients (counting)
• When a nurse weighs a patient (measurement)
Sources of Data

Data are obtained from
• Records
• Surveys
• Experiments
Data Sources: Records, Reports and Other Sources
Look for data to serve as the raw material for our investigation.

1- Routinely kept records.
- Hospital medical records contain immense amounts of information on
patients.
- Hospital accounting records contain a wealth of data on the facility’s business
activities.
2- External sources.
The data needed to answer a question may already exist in the form of
published reports, commercially available data banks, or the research
literature, i.e. someone else has already asked the same question.
Data Sources: Surveys
A survey may be necessary if the data needed to answer certain questions
do not already exist.
Example:
If the administrator of a clinic wishes to obtain information regarding the
mode of transportation used by patients to visit the clinic, then a survey
may be conducted among patients to obtain this information.
Types of Data

• Data are made up of a set of variables:
• Categorical variables
• Numerical variables
Categorical Variables
• Any variable that is not numerical (values have no numerical meaning) (e.g. gender, race,
drug, disease status)
• Nominal variables
• The data are unordered (e.g. RACE: 1=Caucasian, 2=Asian American, 3=African
American, 4=others)
• A subset of these variables are binary, or dichotomous, variables: they have only two
categories (e.g. GENDER: 1=male, 2=female)
• Ordinal variables
• The data are ordered (e.g. AGE: 1=10-19 years, 2=20-29 years, 3=30-39 years; likelihood
of participating in a vaccine trial; INCOME: low, medium, high)
Frequency Tables
• Categorical variables are summarized by
• Frequency counts: how many are in each category
• Relative frequency or percent (a number from 0 to 100)
• Or proportion (a number from 0 to 1)
[Table: Gender of new HIV clinic patients, 2006-2007, Mbarara, Uganda]
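The three summaries (count, proportion, percent) can be computed together, as in the sketch below. The ten gender values are invented for illustration and are not the data from the slide's table.

```python
from collections import Counter

# Made-up gender values for ten patients (not the data from the slide's table).
gender = ["female", "male", "female", "female", "male",
          "female", "male", "female", "female", "male"]

counts = Counter(gender)   # frequency counts per category
n = sum(counts.values())   # total number of observations

# Frequency table: count, proportion (0 to 1) and percent (0 to 100).
table = {
    category: {"count": c, "proportion": c / n, "percent": 100 * c / n}
    for category, c in counts.items()
}

for category, row in table.items():
    print(category, row["count"], row["proportion"], row["percent"])
```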
Numerical Variables (Quantitative)

• Naturally measured as numbers for which meaningful arithmetic operations make
sense (e.g. height, weight, age, salary, viral load, CD4 cell counts)

• Discrete variables: can be counted (e.g. number of children in household: 0, 1, 2,
3, etc.)

• Continuous variables: can take any value within a given range (e.g. weight:
2974.5 g, 3012.6 g)
Manipulation of Variables
• Continuous variables can be discretized
• E.g., age can be rounded to whole numbers
• Continuous or discrete variables can be categorized
• E.g., age categories
• Categorical variables can be re-categorized
• E.g., lumping from 5 categories down to 2

Categorization
• Continuous variables can be categorized in meaningful ways
• Choice of cut-off points
• Even intervals (5-year age intervals)
• Meaningful cut-points related to a health outcome or
decision
• Meaningful CD4 count categories (below 200, 200-350, 350-500,
500+)
• Equal percentage of the data falling into each category
(quartiles, centiles, ...)
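The three ways of choosing cut-points can be sketched as follows. The CD4 values are made up, the function names are hypothetical, and the treatment of boundary values (for example, whether exactly 350 falls in the lower or upper category) is an assumption, since the slides do not say.

```python
from bisect import bisect_right
from statistics import quantiles

CD4_CUTS = [200, 350, 500]                           # cut-points named on the slide
CD4_LABELS = ["<200", "200-349", "350-499", "500+"]  # boundary handling is an assumption

def cd4_category(count):
    """Meaningful cut-points: map a CD4 count to a clinical category."""
    return CD4_LABELS[bisect_right(CD4_CUTS, count)]

def age_band(age, width=5):
    """Even intervals: with 5-year bands, age 23 falls in '20-24'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def quartile_group(value, cuts):
    """Equal-percentage groups: return 1-4 depending on the quartile cut-points."""
    return bisect_right(cuts, value) + 1

cd4_counts = [85, 150, 240, 310, 365, 420, 510, 640]  # made-up values
cuts = quantiles(cd4_counts, n=4)                     # three quartile cut-points
groups = [quartile_group(v, cuts) for v in cd4_counts]

print(cd4_category(199), cd4_category(350), age_band(23), groups)
```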
Organizing Data and Presentation
Some common methods:

• Frequency Table
• Frequency Histogram
• Relative Frequency Histogram
• Frequency polygon
• Relative Frequency polygon
• Bar chart
• Pie chart
• Box plot
• Scatter plots.

Frequency Tables

Histograms
• Bar chart for numerical data: the number of bins and
the bin width will make a difference in the appearance
of this plot and may affect interpretation

[Figure: histogram titled "CD4 among new HIV positives at Mulago"; x-axis: CD4 cell count]
Histograms
• This histogram has less detail but gives us the % of
persons with CD4 <350 cells/mm³

[Figure: histogram titled "CD4 among new HIV positives at Mulago", with wider bins; x-axis: CD4 cell count]
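The effect of bin width can be seen by counting the same values into narrow and wide bins. This is a plain-Python sketch with made-up CD4 values; the bin widths of 100 and 500 and the helper name `histogram_counts` are assumptions, not the settings used for the figures.

```python
def histogram_counts(values, bin_width, upper):
    """Count values in bins [0, w), [w, 2w), ... covering 0 up to `upper`."""
    n_bins = upper // bin_width
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v // bin_width), n_bins - 1)  # clamp the top edge into the last bin
        counts[idx] += 1
    return counts

# Made-up CD4 cell counts, for illustration only.
cd4 = [45, 120, 180, 210, 260, 310, 340, 420, 480, 650, 820, 1100]

narrow = histogram_counts(cd4, bin_width=100, upper=1500)  # 15 bins, more detail
wide = histogram_counts(cd4, bin_width=500, upper=1500)    # 3 bins, less detail

# Percent of persons with CD4 below 350 cells/mm3.
pct_below_350 = 100 * sum(v < 350 for v in cd4) / len(cd4)

print(narrow)
print(wide)
print(round(pct_below_350, 1))
```

Both lists describe the same twelve values; only the grouping changes, which is what alters the look of the plot.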
Frequency Polygon
• Use to identify the distribution of your data

[Figure: frequency polygons for female and male; x-axis: age in years (groups 20- through 60-69); y-axis: frequency]
Bar Charts
• General graph for categorical variables
• Graphical equivalent of a frequency table
• The x-axis does not have to be numerical
[Figure: bar chart titled "Alcohol consumption in Mulago Hospital patients enrolling in VCT study, n=929"; y-axis: proportion; categories: Never, >1 year ago, Within the past year]
• What Does This Graph Tell Us?
[Figure: histogram titled "Days drank alcohol among current drinkers"; x-axis: days (0 to 30)]
Box Plots
• Middle line = median (50th percentile)
• Middle box = 25th to 75th percentiles (interquartile range, IQR)
• Bottom whisker: lowest data point at or above the 25th percentile − 1.5 × IQR
• Top whisker: highest data point at or below the 75th percentile + 1.5 × IQR
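The quantities drawn in a box plot can be computed directly, as in this sketch. The data are made up, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; statistical packages differ slightly in how they interpolate quartiles, so values may not match a given program exactly.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Median, quartiles, IQR, whisker ends and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "lower_whisker": inside[0],    # lowest point at or above Q1 - 1.5 * IQR
        "upper_whisker": inside[-1],   # highest point at or below Q3 + 1.5 * IQR
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Made-up "days drank alcohol" values, for illustration only.
days = [0, 1, 2, 2, 3, 4, 5, 6, 7, 30]
stats = box_plot_stats(days)
print(stats)
```

In this example the value 30 lies beyond the upper fence, so it would be drawn as a separate point rather than reached by the whisker.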
Box Plots
[Figure: box plot titled "CD4 count among new HIV positives at Mulago"; y-axis: CD4 count (0 to 1,500)]
Box Plots By Another Variable
• We can divide up our graphs by another variable
• What type of variable is gender?

[Figure: side-by-side box plots for male and female; graphs by sex (survey item a1)]
Histograms By Another Variable
[Figure: relative frequency histograms for male and female; x-axis: days consumed alcohol of prior 30; y-axis: relative frequency]
Scatter Plots
[Figure: scatter plot of CD4 cell count versus age; x-axis: age in years (survey item a4, "how old are you?"); y-axis: CD4 cell count]
Part Two

Numerical Variable Summaries and Measures
