
1 Introduction To Biostatistics

This document provides an overview of biostatistics including objectives, course content, definitions and key concepts. It discusses data collection techniques, characteristics of statistical data, and dependent and independent variables. The document is intended to introduce students to the field of biostatistics.

Uploaded by

creativejoburg
Copyright
© © All Rights Reserved

Addis Ababa University

School of Public Health

Biostatistics
Mengistu Y. (BSC, MPH-HI, PhD fellow, Assi. Prof. PH)

2022
Learning Objectives

General Objective
♦ To provide statistical methods and numerical descriptions that are useful for generating
information about a given situation, and to present them in such a way that valid interpretations
are possible
Specific Objectives
♦ design, organize, present and summarize data
♦ understand the process involved in data collection and processing
♦ distinguish between categorical and numeric data
♦ understand probabilities and their applications
♦ interpret summary statistics, graphical displays and contingency tables commonly presented in
the health literature
♦ carry out exploratory data analysis
♦ understand the process involved in estimations and hypothesis testing
♦ interpret the functions of confidence intervals and p-values
♦ give an interpretation or reach a conclusion about a population on the basis of information
contained in a sample drawn from that population.
Course content
♦ Introduction to the course
♦ Data and Scales of measurement
♦ Methods of data organization and presentation
♦ Frequency distribution
♦ Measures of central tendency and dispersion
♦ Basic principles of probability
♦ Rules of probability and applications (additive,
multiplicative, Bayes')
References: (available in the Library)

1. Gordis, L. Epidemiology, 4th ed. Elsevier Inc., USA, 2009.
2. Koepsell & Weiss. Epidemiologic Methods. Oxford University Press, 2003.
3. Last (ed.). Dictionary of Epidemiology, 1995.
4. Rothman, Kenneth J.; Greenland, Sander; Lash, Timothy L. Modern Epidemiology, 3rd ed. Lippincott Williams & Wilkins, 2008.
5. Martin Bland. An Introduction to Medical Statistics.
6. Colton T. Statistics in Medicine.
7. Daniel W. Biostatistics: A Foundation for Analysis in the Health Sciences.
8. Kirkwood BR. Essentials of Medical Statistics.
9. Knapp RG, Miller MC. Clinical Epidemiology and Biostatistics. Williams and Wilkins, Baltimore, 1992.
10. Pagano & Gauvreau. Principles of Biostatistics.
11. Schlesselman, J.J. Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, New York, 1982.
12. Breslow, N.E. Statistical Methods in Cancer Research, Volume I: The Analysis of Case-Control Studies.
Introduction
Statistics is the science of gaining information from data through:
• collecting data
• organizing data
• summarizing data
• presenting data
• analysing data and drawing conclusions (inferences) from it.

It is helpful to think of the process of data analysis as consisting of three stages: management, descriptive and inferential.
Definitions
• Statistics: the term is used to mean either statistical data or statistical methods.

• Statistical data:
• In this sense, statistics refers to numerical descriptions of things.
• These descriptions may take the form of counts or measurements.
NB: Although statistical data always denote figures (numerical
descriptions), not all numerical descriptions are statistical data.
• Statistical methods:
• A body of methods used for collecting, organising,
summarizing, analysing and interpreting numerical data in order to
understand a phenomenon or make wise decisions.
04/13/2024 6
Definitions…
• Biostatistics is the application of statistical methods to
biological, medical and public health data.
• A population is any specific collection of objects of interest.
• A sample is any subset or sub-collection of the population.
• A census is the special case in which the sample consists of the whole population.
Definitions ...
• A measurement is a number or attribute computed for each member
of a population or of a sample.
• A parameter is a characteristic of the population as a whole.
• A statistic is a characteristic of the sample data.
• Descriptive statistics is the study of the data themselves: it involves organizing,
displaying, and describing properties of the data.
• Inferential statistics is the drawing of conclusions about a population of
interest based on information contained in a sample taken from that
population.
Definitions …
• The distinction between a population together with its parameters
and a sample together with its statistics is a fundamental concept in
inferential statistics.
[Diagram: a sample is taken from the population; statistics computed from the sample are used for inference about the population parameters.]
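The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is simulated and purely hypothetical: the blood-pressure values, the population size of 10,000 and the sample size of 100 are assumptions made for illustration, not data from this course.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: systolic blood pressure (mmHg) of 10,000 adults.
population = [random.gauss(120, 15) for _ in range(10_000)]

# Parameter: a characteristic of the whole population.
population_mean = statistics.mean(population)

# Sample: a subset of the population (simple random sample of 100 people).
sample = random.sample(population, 100)

# Statistic: the same characteristic computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

In practice the parameter is usually unknown, and the statistic is used to draw an inference about it.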
Definition …

• A variable is a characteristic that takes different values in different persons, places, or things.

• Variables are the things that we measure, control, or manipulate in research.
♦ Data are the measurements or observations (values) recorded for each
element. For example, data include records of weight, length, breaking
strength, age, sex, religion, marital status, income, etc.

Based on the nature of the variables, we can have qualitative and quantitative variables.


Dependent vs. Independent
Independent variable:
♦ A variable that you believe might influence your outcome measure.
♦ This might be a variable that you control, like a treatment, or a variable not under
your control, like an exposure. It might also represent a demographic factor like age
or gender.
♦ An independent variable is a hypothesized cause of the dependent variable.
• Any variable that you use to predict the outcome is an independent
variable.
• Example: a study of the relationship between dietary fat consumption and the development of
ischemic stroke.
In this study, the independent variables included the percentage of total fat in the diet.
Dependent variable

• In research, the dependent variable is the variable that you believe might be influenced or modified by some treatment or exposure.

• It may also represent the variable you are trying to predict.

• The dependent variable is also called the outcome variable. Which variable is dependent depends on the context of the study.

• Example: A study examined the relationship between dietary fat consumption and the development of ischemic stroke.

• In this study, the dependent variable was the incidence of ischemic stroke.


Characteristics of statistical data
i) They must be aggregates, that is, 'numbers of facts'. A single fact,
even though numerically stated, cannot be called statistics.
ii) They must be affected to a marked extent by a multiplicity of causes.
This means that statistics are aggregates of only those facts that grow out
of a 'variety of circumstances'. Thus the occurrence of an outbreak is
attributable to a number of factors, e.g. human factors, parasite
factors and environmental factors.
iii) They must be enumerated or estimated according to a reasonable
standard of accuracy. If statistical data are incorrect, the results are bound
to be misleading.
Characteristics…
iv) They must have been collected in a systematic manner for a
predetermined purpose. Numerical data can be called statistics only if
they have been compiled in a properly planned manner and for a
purpose about which the enumerator had a definite idea.
v) They must be placed in relation to each other. That is, they must be
comparable. Numerical facts may be placed in relation to each other
with respect to time, space or condition.
Source of data
• Routine data collection
• Routine health unit and community data
• Activity data about patients seen and programmes run, routine
services and epidemiological surveillance;
• Semi-permanent data about the population served, the facility
itself and staff that run it
• Vital registration
• Non-routine data collection
• Surveys
• Population census (headcounts proportion/facility catchment’s area)
• Quantitative or qualitative rapid assessment methods

Techniques of data collection

Data collection is a crucial stage in the planning and implementation of a study.

If data collection has been superficial, biased or incomplete, data analysis becomes difficult, and the research report will be of poor quality.

Therefore, we should concentrate all possible efforts on developing appropriate tools, and should test them several times.
Techniques of collecting data cont'd
Observation: a technique that involves systematically
selecting, watching and recording the behavior and
characteristics of living things, objects or phenomena.
• Observation of human behavior is a much-used data
collection technique. It can be undertaken in different
ways:
• Participant observation: The observer takes part in the
situation he or she observes.
• Non-participant observation: The observer watches the
situation, openly or concealed, but does not participate.
Data collection techniques cont'd
• Observations can give additional, more accurate information
on the behavior of people than interviews or questionnaires.

• Observations can also be made on objects.

• For example, the presence or absence of a latrine and its state of cleanliness may be observed.
• Here observation would be the major research technique.
Data collection techniques cont'd
• Interview (face-to-face): a data collection technique that
involves oral questioning of respondents, either individually or as
a group.

• Answers to the questions posed during an interview can be recorded by writing them down (either during the interview itself
or immediately after it), by tape-recording the
responses, or by a combination of both.
Data collection techniques cont'd

• Administering written questionnaires: a data collection tool in which written questions are presented that are to be answered by
the respondents in written form.
• A written questionnaire can be administered in different ways, such
as by:
• Sending questionnaires by mail with clear instructions on how to answer
the questions and asking for mailed responses;
• Gathering all or part of the respondents in one place at one time, giving
oral or written instructions, and letting the respondents fill out the
questionnaires;
• Hand-delivering questionnaires to respondents and collecting them later.
Types of questions
• Depending on how questions are asked and recorded,
we can distinguish two major possibilities.
1. Open-ended questions (allowing for completely open
as well as partially categorized answers):
They permit free responses, which should be recorded in
the respondents' own words.
Types of questions
Such questions are useful for obtaining in-depth information
on:

• facts with which the researcher is not very familiar,

• opinions, attitudes and suggestions of informants, or

• sensitive issues.

Types of questions
• Examples:

1. 'What is your opinion on the services provided in the ANC?' (Explain why.)

2. 'What do you think are the reasons some adolescents in this area start
using drugs?'

3. 'What would you do if you noticed that your daughter (a schoolgirl) had
a relationship with someone?'
Types of questions
• Advantages of open-ended questions
• They allow you to probe more deeply into the issues of interest being
raised.
• Information provided in the respondents' own words might be
useful.
• Risks of completely open-ended questions
• A big risk is incomplete recording of all the relevant issues covered in
the discussion.
• Analysis is time-consuming and requires experience; otherwise
important data may be lost.
Types of questions
2. Closed questions: have a list of possible options or answers
from which the respondents must choose.

Closed questions are most commonly used for background variables such as age, marital status or education, although in
the case of age and education you may also record the exact
values and categorise them during data analysis.
Types of questions

1. 'Women who have induced abortion should be severely punished.'
Types of questions
2. 'Did you eat any of the following foods yesterday?' (Circle yes if at least one
item in each set of items is eaten.)
Types of questions
• Advantages of closed questions
• They save time.
• Comparing the responses of different groups, or of the same group over time,
becomes easier.
• Risks of closed questions:
• In the case of illiterate respondents, bias will be introduced.
Steps in designing questionnaire
1. Content: Take your objectives and variables as a starting point.

2. Formulating questions: Formulate one or more questions that will provide the information needed for each variable.
• Check whether each question measures one thing at a time.
• Avoid leading questions.
• Ask sensitive questions in a socially acceptable way.
Steps in designing questionnaire
3. Sequencing the questions: Design your interview
schedule or questionnaire to be 'informant friendly'.

4. Formatting the questionnaire: When you finalize your questionnaire, be sure that:
• A separate introductory page is attached to each questionnaire, explaining the purpose of the study, requesting the informant's consent to be interviewed, and assuring confidentiality of the data obtained.
• Each questionnaire has a heading and space to insert the number, date and location of the interview.

You may also add the name of the interviewer, to facilitate quality control.
Steps…
5. Translation

6. Pre-test:

For qualitative studies
• Focus group discussion: allows a group of 8-12 informants to freely discuss a
certain subject with the guidance of a facilitator or reporter
• In-depth interview
• Key informant interview
Rationale of studying statistics
• Why do we need to use statistics?
• The reason is the presence of variability.

• Statistics provides a way of organizing information on a wider and more formal basis.
• More and more things are now measured quantitatively in medicine
and public health.
• There is a great deal of intrinsic (inherent) variation in most
biological processes.
• Public health and medicine are becoming increasingly quantitative.
As technology progresses, the physician encounters more and more
quantitative rather than descriptive information.
Rationale….
• The planning, conduct, and interpretation of much of medical
research are becoming increasingly reliant on statistical methodology. Is
this new drug or procedure better than the one commonly in use?
How much better? What, if any, are the risks of side effects associated
with its use?
• Statistics pervades the medical literature.
Limitations of statistics
1. It deals with aggregates of facts, and no importance is attached to
individual items; it is suited only when group characteristics are to
be studied.
2. Statistical data are only approximately and not mathematically
correct.
Data and types of data
• Qualitative (or categorical) data consist of values that can be
separated into different categories distinguished by some
nonnumeric characteristic.
• They cannot be measured in quantitative form and can only be identified by name
or category.
• Quantitative data consist of values representing counts or
measurements. They are expressed numerically and can be of two types
(discrete or continuous).
Types of Quantitative Data
• Continuous data can take on any value in a given interval. Continuous
data values result from a continuous scale that covers a range of
values without gaps, interruptions, or jumps.
• Discrete data can take on only particular distinct values and no
values in between. The possible values of discrete data are either finite
or countably infinite in number.
Scale of measurement
• Nominal
• Ordinal
• Interval
• Ratio
• Nominal and ordinal are qualitative (categorical) levels of
measurement.
• Interval and ratio are quantitative levels of measurement.

Types of Variables
• Variable types can be distinguished based on their scale. Typically,
different statistical methods are appropriate for variables of different
scales.

Scale      Characteristic question                 Examples
Nominal    Is A different than B?                  Marital status, eye color, gender, religious affiliation, race
Ordinal    Is A bigger than B?                     Stage of disease, severity of pain, level of satisfaction
Interval   By how many units do A and B differ?    Temperature
Ratio      How many times bigger than B is A?      Distance, length, time until death, weight
Operations that make sense for variables of
different scales
Operations that make sense, by scale:

Scale      Counting   Ranking   Addition/subtraction   Multiplication/division
Nominal    yes        no        no                     no
Ordinal    yes        yes       no                     no
Interval   yes        yes       yes                    no
Ratio      yes        yes       yes                    yes
TYPES OF QUALITATIVE
MEASUREMENTS
• Nominal level of measurement: classifies data into names, labels or
categories on which no order or ranking can be imposed.
Examples:
Sex (M, F)
Exam result (P, F)
Blood group (A, B, O or AB)
Color of eyes (blue, green, brown, black)
• Ordinal level of measurement: classifies data into categories that can be
ordered or ranked, but precise differences between the ranks do not exist.
Generally it does not make sense to do calculations with data at the
ordinal level.
Examples:
Response to treatment (poor, fair, good)
Severity of disease (mild, moderate, severe)
Income status (low, middle, high)
TYPES OF QUANTITATIVE
MEASUREMENTS
• Interval level of measurement: ranks data, and precise differences
between units of measure exist, but there is no meaningful zero. If a
zero exists, it is an arbitrary point. Example: IQ scores. It makes sense
to talk about someone having an IQ 20 points higher than another
person, but an IQ of zero has no meaning.

• Ratio level of measurement: has all the characteristics of the interval
level, but a true zero exists. Also, true ratios exist when the same
variable is measured on two different members of the population.
Example: the weight of an individual. It makes sense to say that a 150 lb
adult weighs twice as much as a 75 lb child.
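The difference between the interval and ratio levels can be checked numerically. The sketch below uses temperature (listed earlier as an interval-scale example) and the 150 lb versus 75 lb weights from the slide; the specific temperature readings of 10 °C and 20 °C are assumed values chosen only for illustration.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

LB_PER_KG = 2.20462  # approximate number of pounds in one kilogram

# Interval scale: temperature. The zero point is arbitrary, so the ratio
# of two readings changes when the unit of measurement changes.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: weight. Zero means "no weight", so the ratio of two weights
# is the same whichever unit is used.
ratio_pounds = 150 / 75
ratio_kilograms = (150 / LB_PER_KG) / (75 / LB_PER_KG)

print("temperature ratio:", ratio_celsius, "in Celsius,",
      round(ratio_fahrenheit, 2), "in Fahrenheit")
print("weight ratio:", ratio_pounds, "in pounds,",
      round(ratio_kilograms, 2), "in kilograms")
```

The temperature ratio is 2 in Celsius but 1.36 in Fahrenheit, so "twice as hot" is not a meaningful statement, whereas the weight ratio is 2 in both units.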
Figure 1 summarizes the possible data types and levels of measurement.

Figure 1. Data types and levels of measurement.
Copyright © 2009 Pearson Education, Inc.
Data organization and presentation
• Statistics is used to organize and interpret research
observations and findings.
• Before the findings are interpreted and communicated,
the raw data must be organized and
presented in a clear and understandable way.
• The techniques used to organize and summarize a set of
data in a concise way include:
• Organization of data
• Summarization of data
• Presentation of data
Cont...
• Numbers that have not been summarized and
organized are called raw data.

• Descriptive statistics includes tables, graphical or chart displays, and the calculation of summary
measures such as means, proportions and averages.

• The methods of describing variables differ depending on the type of data (numerical or
categorical).
Organizing data
Categorical data
• Table of frequency distributions
  • Frequency
  • Relative frequency
  • Cumulative frequencies
• Graphs
  • Bar charts
  • Pie charts

Continuous or discrete data
• Frequency distribution
• Summary measures
• Graphs
  • Histograms
  • Frequency polygons
  • Cumulative frequency polygons
  • Stem-and-leaf plots
  • Box-and-whisker plots
  • Scatter plots
Frequency distributions
• A frequency distribution is a presentation of the number
of times (the frequency) that each value (or group of
values) occurs in the study population.

• Ordered array: a simple arrangement of individual observations in order of magnitude.
• A simple and effective way of summarizing categorical
data is to construct a frequency distribution table.

• This is done by counting the number of observations falling into each of the categories, or levels, of the
variable.

• Consider, for example, the variable birth weight with levels 'Very low', 'Low', 'Normal' and 'Big'.
Relative Frequency
• Sometimes it is useful to compute the proportion, or
percentage, of observations in each category.

• The distribution of proportions is called the relative frequency distribution of the variable.

• Given the total number of observations, the relative frequency distribution is easily derived from the
frequency distribution.
Cumulative frequency
• Two other distributions, the cumulative frequency and the cumulative
relative frequency, are particularly useful for describing ordinal data.
• They are not meaningful for nominal data.
E.g. you would never say that 70% are below the color blue.
• The cumulative frequency of a category is the number of
observations in that category plus the observations in all
categories below it.
• The cumulative relative frequency is the proportion of
observations in that category plus all
categories below it, and is obtained by dividing
the cumulative frequency by the total number of
observations.
Table 2. Distribution of birth weight of newborns
between 1976-1996 at TAH.

BWT        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
Very low      43        0.4              43              0.4
Low          793        8.0             836              8.4
Normal      8870       88.9            9706             97.3
Big          268        2.7            9974            100
Total       9974      100
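The relative, cumulative and cumulative relative frequencies in Table 2 can be derived from the raw counts alone. The following is a minimal Python sketch using only the four frequencies given in the table; the variable names are illustrative.

```python
from itertools import accumulate

categories = ["Very low", "Low", "Normal", "Big"]
frequencies = [43, 793, 8870, 268]  # counts from Table 2

total = sum(frequencies)

# Relative frequency: percentage of observations in each category.
relative = [100 * f / total for f in frequencies]

# Cumulative frequency: running total of the counts, lowest category first.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: cumulative count as a percentage of the total.
cumulative_relative = [100 * c / total for c in cumulative]

print(f"{'BWT':<10}{'Freq':>6}{'Rel %':>8}{'Cum':>7}{'Cum %':>8}")
for name, f, r, c, cr in zip(categories, frequencies, relative,
                             cumulative, cumulative_relative):
    print(f"{name:<10}{f:>6}{r:>8.1f}{c:>7}{cr:>8.1f}")
```

Cumulating is only meaningful here because the birth weight categories are ordered from lowest to highest.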
Frequency distribution for numerical data
• Beyond the ordered array, further useful summarization may be
achieved by grouping the data.
• To group a set of observations we select a set of
contiguous, non-overlapping intervals such that each
value in the set of observations can be placed in one,
and only one, of the intervals.
• These intervals are usually referred to as class
intervals.
• One of the first considerations when data are to
be grouped is how many intervals to include.
• The question is how best to organize such
data, especially when a data set is too large
to be grasped by eye.
Table 3. Frequencies of serum cholesterol levels for 1067 US males of
ages 25-34 (1976-1980).

Cholesterol level
(mg/100 ml)    Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
80-119           13        1.2              13              1.2
120-159         150       14.1             163             15.3
160-199         442       41.4             605             56.7
200-239         299       28.0             904             84.7
240-279         115       10.8            1019             95.5
280-319          34        3.2            1053             98.7
320-359           9        0.8            1062             99.5
360-399           5        0.5            1067            100
Total          1067      100
For both discrete and continuous data the values are
grouped into non-overlapping intervals, usually of
equal width.

Example of raw data of age….

Example of categorized data of age
How to calculate class interval?
To determine the number of class intervals and the
corresponding width, we use:

Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \dfrac{L - S}{K}$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example
• Construct a grouped frequency distribution of the
following data on the amount of time (in hours) that
80 college students devoted to leisure activities during
a typical school week:

[Data table not reproduced: the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week.]

• Using the above formula,
$K = 1 + 3.322 \times \log_{10}(80) = 7.32 \approx 7$ classes
• Maximum value = 38 and minimum value = 10
• $W = \text{Range}/K = (38 - 10)/7 = 28/7 = 4$
• Using a width of 5 (a common rule of thumb), we can construct a grouped
frequency distribution for the above data.
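The Sturges' rule calculation for the leisure-time example can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the inputs (n = 80, largest value 38, smallest value 10) are the ones stated in the example.

```python
import math

def sturges_classes(n: int) -> float:
    """Number of class intervals suggested by Sturges' rule, before rounding."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest: float, smallest: float, k: int) -> float:
    """Class width: the range of the data divided by the number of classes."""
    return (largest - smallest) / k

n, largest, smallest = 80, 38, 10  # values from the leisure-time example

k_exact = sturges_classes(n)   # unrounded number of classes
k = round(k_exact)             # number of classes actually used
w = class_width(largest, smallest, k)

print(f"K = {k_exact:.2f}, rounded to {k} classes; W = {w}")
```

The rule gives only a starting point. As in the example, the computed width of 4 is often replaced by a more convenient value such as 5.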
Mid-point and True-limits
Mid-point (class mark): the value that lies midway between the lower and the upper
limits of a class.
True limits (class boundaries): the limits that
make an interval of a continuous variable continuous
in both directions.

• Used for smoothing the class intervals.

• Subtract 0.5 from the lower limit and add 0.5 to the upper limit.
Contd…
• Note: In constructing a cumulative frequency distribution, if we
cumulate from the lowest value of the variable to the
highest, the result is called a 'less than'
cumulative frequency distribution; if we cumulate from the
highest value to the lowest, the result is
called a 'more than' cumulative frequency distribution. The
'less than' form is the more commonly used.
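Both directions of cumulation can be sketched in Python. The class frequencies used here (8, 28, 27, 12, 4, 1) are those of the leisure-time example in this section; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39"]
frequencies = [8, 28, 27, 12, 4, 1]

# 'Less than' cumulative frequency: cumulate from the lowest class upward.
less_than = list(accumulate(frequencies))

# 'More than' cumulative frequency: cumulate from the highest class downward,
# then reverse so the values line up with the original class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(f"{'Class':<8}{'Freq':>5}{'< cum':>7}{'> cum':>7}")
for label, f, lt, mt in zip(classes, frequencies, less_than, more_than):
    print(f"{label:<8}{f:>5}{lt:>7}{mt:>7}")
```

The last 'less than' value and the first 'more than' value both equal the total number of observations, which is a quick check on the arithmetic.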
Example
Time (hours)   True limits     Mid-point   Frequency
10-14           9.5 - 14.5        12            8
15-19          14.5 - 19.5        17           28
20-24          19.5 - 24.5        22           27
25-29          24.5 - 29.5        27           12
30-34          29.5 - 34.5        32            4
35-39          34.5 - 39.5        37            1
Total                                          80
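The true limits and mid-points in this table follow mechanically from the class limits. The sketch below applies the 0.5 adjustment described earlier, which assumes the values are recorded to the nearest whole hour; the variable names are illustrative.

```python
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
frequencies = [8, 28, 27, 12, 4, 1]

rows = []
for (lower, upper), freq in zip(class_limits, frequencies):
    true_lower = lower - 0.5          # lower class boundary
    true_upper = upper + 0.5          # upper class boundary
    midpoint = (lower + upper) / 2    # class mark
    width = true_upper - true_lower   # class width from the boundaries
    rows.append((true_lower, true_upper, midpoint, width, freq))

for true_lower, true_upper, midpoint, width, freq in rows:
    print(f"{true_lower:>5} - {true_upper:<5} mid-point {midpoint:>5} "
          f"width {width} frequency {freq}")
```

Each upper boundary equals the next lower boundary, so the classes leave no gaps, and every class has a width of 5.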
• Class interval (class width): the length of a class, given by the difference
between its class boundaries. For the first class, the width is 14.5 - 9.5 = 5.
• Note: As the sample size increases and the class interval is reduced, the sample
distribution increasingly resembles the population distribution.
• Class intervals should be contiguous, non-overlapping,
mutually exclusive and exhaustive.

• Too few intervals result in a loss of information.

• Too many intervals mean that the objective of summarization will not be met.

• Class intervals should generally be of the same width (sometimes this is impossible).

• Open-ended class intervals should be avoided.
Exercise
• Construct a grouped frequency distribution and complete the following table for
the age of patients (years) in a diabetic clinic in Addis Ababa, 2010.
Age of patients (years) in a diabetic clinic in Addis
Ababa, 2010
Columns to complete: Age group (years) / class limit; Class boundary; Class mid-point; Tally; Frequency (fi); Relative frequency (%); Cumulative frequency (< method, > method); Relative cumulative frequency (< method, > method). The last row is for totals.
METHOD OF DATA PRESENTATION

Data table
Guidelines for constructing tables
• Keep them simple
• Limit the number of variables
• All tables should be self-explanatory
• Include a clear title telling what, where and
when
• Clearly label the rows and columns
Cont'd…
• State clearly the unit of measurement used
• Explain codes and abbreviations in a footnote
• Show totals
• If the data are not original, indicate the source in a footnote
Graphical presentation of data

• A variety of graph styles can be used to present data.
• The most commonly used types of graph are pie charts, bar
diagrams, histograms, frequency polygons and scatter diagrams.
• The purpose of using a graph is to tell others about a set of data
quickly, allowing them to grasp the important characteristics of
the data.
• In other words, graphs are visual aids to rapid understanding.
Importance of graphs
• Diagrams have greater attraction than mere figures.
• They please the eye, add a spark of interest and so
catch the attention.
• They help in deriving the required information in less time and
without mental strain.
• They have greater memorizing value than mere figures.
• They facilitate comparison.
Bar charts

• Bar chart: displays the frequency distribution for nominal or ordinal data.
• In a bar chart, the various categories into which the observations
fall are represented along the horizontal axis.
• A vertical bar is drawn above each category such that the height
of the bar represents either the frequency or the relative
frequency of observations within the class.
• The vertical axis should always start from 0, but the horizontal
axis can start from anywhere.
• The bars should be of equal width and should be separated from
one another so as not to imply continuity.
Figure 1. Bar charts showing frequency distribution of
the variable 'BWT'.
[Figure: two bar charts of BWT (Very low, Low, Normal, Big), one with frequency (Freq.) and one with relative frequency (Rel. Freq.) on the vertical axis.]
Bar charts for comparison
• Multiple bar chart: To compare the
distribution of a variable for two or more groups, bars
for the groups being compared are often drawn alongside
each other in a single bar chart.
• Subdivided bar chart: If different
quantities form the subdivisions of the totals,
simple bars may be subdivided in the ratio of the
various subdivisions to exhibit the relationship of the
parts to the whole.
Fig 2. Bar chart indicating categories of birth weight of
9975 newborns grouped by antenatal follow-up of the
mothers
[Figure: bar charts of BWT (Low, Normal, Big) by antenatal care (Yes/No), with frequency (Freq.) and percent on the vertical axes. Percent labels shown on the chart: Low 9 and 7.9; Normal 88.9 and 89; Big 2.1 and 3.1.]
Example: Plasmodium species distribution for confirmed malaria cases,
Zeway, 2003

Pie chart
Pie chart: displays the frequency distribution for
nominal or ordinal data.
• In a pie chart, the various categories into which the
observations fall are represented by sectors of a
circle.

• Each sector represents either the frequency or the relative frequency of observations within the class,
and the angle of each sector is proportional to that frequency or
relative frequency.
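Because the angle of each sector is proportional to the frequency, the angles for a pie chart can be computed directly from the counts. A minimal Python sketch, using the birth weight frequencies from Table 2 (the variable names are illustrative):

```python
frequencies = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

total = sum(frequencies.values())

# Each category receives a share of the full circle (360 degrees)
# equal to its relative frequency.
angles = {category: 360 * f / total for category, f in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:>8.1f} degrees")
```

The angles always sum to 360 degrees. Very small categories such as 'Very low' produce sectors that are hard to read, which is one practical limitation of pie charts.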
Figure 3. Pie charts showing frequency distribution of
the variable 'BWT'
Fig 3(a) Pie chart indicating frequency of categories of birth weight (Very low 43, Low 793, Normal 8870, Big 268)
Fig 3(b) Pie chart indicating relative frequency of categories of birth weight (Very low 0.4, Low 8, Normal 88.9, Big 2.7)
Histogram
• A histogram is a frequency distribution with continuous
class intervals that has been turned into a graph.

• Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a
histogram.

• A histogram is constructed by choosing a set of non-overlapping intervals (class intervals) and counting the
number of observations that fall in each class.
Histograms cont…
• The number of observations in each class
is called the frequency. Hence histograms
are also called frequency distributions.

• It is necessary that the class intervals be non-overlapping so that each observation
falls in one and only one interval.
Histograms cont…
• Except possibly for the two extreme classes, class intervals are
usually chosen to be of equal width. If this is not the
case, the histogram could give a misleading
impression of the shape of the data.

• In drawing a histogram, smoothing of the class intervals is an important point: we subtract 0.5
from the lower limit and add 0.5 to the upper limit of
each interval.
Example
Distribution of the age of women at the time of
marriage
Age group No. of women
15-19 11
20-24 36
25-29 28
30-34 13
35-39 7
40-44 3
45-49 2
Age of women at the time of marriage
[Histogram: horizontal axis, age group (true limits 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5); vertical axis, number of women (0 to 40).]
Fig 5. A histogram displaying frequency distribution of birth weight of newborns at
Tikur Anbessa Hospital
[Histogram: horizontal axis, birth weight; vertical axis, frequency. Std. Dev = 502.34, Mean = 3126, N = 9975.]
Frequency polygons
• Instead of drawing bars for each class interval,
sometimes a single point is drawn at the midpoint of
each class interval and consecutive points are joined by
straight lines.

• Graphs drawn in this way are called frequency
polygons.

• Frequency polygons are superior to histograms for
comparing two or more sets of data.

04/13/2024 89
Fig. 6. Frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males
and females
(Frequency polygon. x-axis: birth weight, 500 to 5000; y-axis: %, 0 to 50; lines: males, females)

04/13/2024 90
Box and Whisker Plot
It is another way to display information when
the objective is to illustrate certain locations
(skewness) in the distribution.

It can be used to display a set of discrete or
continuous observations using a single vertical
axis; only certain summaries of the data are
shown.

04/13/2024 91
Box plot cont...
A box is drawn with the top of the box at the third
quartile (75%) and the bottom at the first quartile
(25%).
The location of the mid-point (50%) of the
distribution is indicated with a horizontal line in the
box.
Finally, straight lines, or whiskers, are drawn from
the centre of the top of the box to the largest
observation and from the centre of the bottom of the
box to the smallest observation.

04/13/2024 92
Box cont....
The box plot is then completed:
• Draw a vertical bar from the upper quartile to
the largest non-outlying value in the sample
• Draw a vertical bar from the lower quartile to
the smallest non-outlying value in the sample
• Any values that are outside the IQR but are not
outliers are marked by the whiskers on the plot
(IQR = P75 – P25)

04/13/2024 93
Box plots are useful for comparing two or
more groups of observations

04/13/2024 94
Drawing Box-and -whiskers plot

Raw data
35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58
Order the data
29 34 35 39 41 44 50 54 58 64 72 104
Median = (44 + 50)/2 = 47 = Q2
Q1 = (35 + 39)/2 = 37 (median of the lower six values)
Q3 = (58 + 64)/2 = 61 (median of the upper six values)
Min = 29, Max = 104
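The five-number summary above can be reproduced with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, which matches the values on this slide; other quartile conventions can give slightly different results.

```python
from statistics import median

data = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]

ordered = sorted(data)
n = len(ordered)
half = n // 2

# Split the ordered data into lower and upper halves
# (the middle value is left out of both halves when n is odd).
lower = ordered[:half]
upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]

q1, q2, q3 = median(lower), median(ordered), median(upper)
five_number = (ordered[0], q1, q2, q3, ordered[-1])
print(five_number)
```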

04/13/2024 95
Box plot Example
Min = 29, Q1 = 37, Q2 = 47, Q3 = 61, Max = 104
(Box plot drawn along a horizontal axis running from 0 to 110)
04/13/2024 96
Scatter plot
Most studies in medicine involve measuring more than one
characteristic, and graphs displaying the relationship between
two characteristics are common in literature.

When both variables are qualitative, we can use a
multiple bar graph.

When one of the characteristics is qualitative and the other is
quantitative, the data can be displayed in box and whisker
plots.

04/13/2024 97
Scatter plot ….
For two quantitative variables we use bivariate plots (also called
scatter plots or scatter diagrams).
They are used to see whether a relationship exists between the two
measures.
A scatter diagram is constructed by drawing X- and Y-axes.
Each point (dot) represents a pair of values measured for a single
study subject.
(Illustration on the slide: positive relation)

04/13/2024 98
Scatter plot
• Scatter plot helps us to understand the association between two
variables using:
1. The trend
2. The shape and
3. The strength
Measure of association
• Identifying very strong and very weak associations is easy by observing
the graph, but how can we classify everything in between?

04/13/2024 99
Scatter plot
• The linear correlation coefficient (R) measures the strength of the linear
association between two variables.
• R values always range from -1 to 1
• R close to 1 shows a strong positive linear association
• R close to -1 shows a strong negative linear association
• R close to 0 shows a weak or no linear association
• Note: the interpretation of values in between is somewhat subjective
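The slides do not show how R is computed. A minimal sketch, assuming the standard Pearson formula R = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²), is given below. The function name and the data are hypothetical and were invented only to illustrate a perfectly negative linear association.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, not taken from the slides.
hours_of_training = [1, 2, 3, 4, 5]
accidents = [10, 8, 6, 4, 2]

r = pearson_r(hours_of_training, accidents)
print(r)
```

Because the hypothetical points lie exactly on a falling straight line, the result sits at the lower end of the range described above.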

04/13/2024 100
Scatter Plots and Types of Correlation
(Scatter plot. x = hours of training, 0 to 20; y = number of accidents, 0 to 60)
Negative correlation: as x increases, y decreases

Scatter Plots and Types of Correlation
(Scatter plot. x = Math SAT score, 300 to 800; y = GPA, 1.50 to 4.00)
Positive correlation: as x increases, y increases

Scatter Plots and Types of Correlation
(Scatter plot. x = height, 60 to 80; y = IQ, 80 to 160)
No linear correlation

Scatter Diagram…
1. Direction of relationship: positive or negative
2. Form of relationship: linear or curvilinear
3. Degree of relationship: strong or weak
(Each case is illustrated on the slides by a small scatter plot of Y against X.)

04/13/2024 106
Line graph
Useful for assessing the trend of a particular situation
over time, e.g. monitoring the trend of epidemics.
The time, in weeks, months or years, is marked along
the horizontal axis.
Values of the quantity being studied are marked on the
vertical axis.
Values for each category are connected by a continuous
line.
Sometimes two or more lines are drawn on the same
graph using the same scale so that the plotted lines
are comparable.

04/13/2024 107
No. of microscopically confirmed malaria cases by species
and month at Zeway malaria control unit, 2003
(Line graph. y-axis: no. of confirmed malaria cases, 0 to 2100; x-axis: months, Jan to Dec; lines: Positive, P. falciparum, P. vivax)
04/13/2024 108
Line graph cont..
• A line graph can also be used to depict the
relationship between two continuous
variables, like a scatter diagram.
The following graph shows the level of
zidovudine (AZT) in the blood of
HIV/AIDS patients at several times after
administration of the drug, for patients with normal
fat absorption and with fat malabsorption.
04/13/2024 109
Line graph cont…..
Response to administration of zidovudine in two groups of AIDS
patients in hospital X, 1999
(Line graph. y-axis: blood zidovudine concentration, 0 to 8; x-axis: time since administration (min.), 10 to 360; lines: fat malabsorption, normal fat absorption)

04/13/2024 110
Choosing graphs
Type of data/purpose     Appropriate graphs
Metric/Numerical         - Histogram (one continuous variable)
                         - Frequency polygon (one or more continuous variables)
                         - Cumulative frequency polygon (ogive curve)
                         - Box and whisker (one continuous and one categorical variable)
                         - Stem and leaf (one continuous variable)
                         - Scatter (two continuous variables)
Categorical              - Bar, simple or multiple (one or more categorical variables)
                         - Pie (one categorical variable)
Trend                    - Line (one continuous and one categorical variable, or two continuous variables)
04/13/2024 111
SUMMARIZING DATA

04/13/2024
Summary Measures
Describing Data Numerically

Central Tendency Variation Shape

Arithmetic Mean Range Skewness

Median Interquartile Range

Mode Variance

Geometric Mean Standard Deviation

Quartiles Coefficient of Variation

04/13/2024
MEASURES OF CENTRAL TENDENCY

• The tendency of statistical data to get concentrated at
certain values is called the “Central Tendency or
average”
• Mean
• Median
• Mode

04/13/2024
The Arithmetic Mean or simple Mean
• The mean is the average of the numbers: add up
all the numbers, then divide by
how many numbers there are
• It is written in statistical terms as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

04/13/2024
• Example 1: What is the Mean of these numbers? 6, 11, 7
• Add the numbers: 6 + 11 + 7 = 24
• Divide by how many numbers (there are 3 numbers): 24 / 3 = 8
• The Mean is 8
Why Does This Work?
• It is because 6, 11 and 7 added together is the same as 3 lots of 8:
• It is like you are "flattening out" the numbers.

04/13/2024
Example 2
Birth weights (g) of all live-born infants born at a private
hospital in a city during a 1-week period.
What is the arithmetic mean for the sample birth weights?

04/13/2024
Weighted Mean
• When averaging quantities, it is often necessary to
account for the fact that not all of them are equally
important in the phenomenon being described.

• In order to give the quantities being averaged their
proper degree of importance, it is necessary to
assign them relative importance called weights,
and then calculate a weighted mean.
• The weighted mean of a set
of numbers X1, X2, … and Xn,
whose relative importance is
expressed numerically by a
corresponding set of
numbers w1, w2, … and wn, is
given by
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$

04/13/2024
• Example: In a given drug shop, four different drugs were sold at unit
prices of 60, 85, 95 and 50 birr, and the total numbers of units sold
were 10, 10, 5 and 20 respectively. What is the average price of the
four drugs in this drug shop?
• Solution: for this example we have to use the weighted mean, using the
number of units sold as the respective weight for each drug's price.
Therefore, the average price will be
(60×10 + 85×10 + 95×5 + 50×20) / (10 + 10 + 5 + 20) = 2925 / 45 = 65 birr
• If we don't consider the weights, the average price will be
(60 + 85 + 95 + 50) / 4 = 72.5 birr
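The drug-shop calculation can be checked with a small Python sketch. The function and variable names are illustrative assumptions; the prices and quantities are the ones given in the example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [60, 85, 95, 50]      # unit price of each drug, in birr
units_sold = [10, 10, 5, 20]   # number of units sold, used as weights

weighted = weighted_mean(prices, units_sold)
unweighted = sum(prices) / len(prices)
print(weighted, unweighted)
```

The weighted result is lower than the simple average because the cheapest drug accounts for the largest share of the units sold.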

04/13/2024
Weighted Mean
• We can also calculate a weighted mean using some weighting
factor:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
e.g. What is the average income of all
people in cities A, B, and C?

City   Avg. Income   Population
A      $23,000       100,000
B      $20,000       50,000
C      $25,000       150,000

Here, population is the weighting factor and the average
income is the variable of interest

04/13/2024
Geometric Mean
• The Geometric Mean is a special type of average where we multiply
the numbers together and then take a square root (for two numbers),
cube root (for three numbers) etc.
Example: What is the Geometric Mean of 2 and 18?
• First we multiply them: 2 × 18 = 36
• Then (as there are two numbers) take the square root: √36 = 6

• Geometric Mean of 2 and 18 = √(2 × 18) = 6


• It is like the area is the same!

04/13/2024
Example: What is the Geometric Mean of 10, 51.2 and 8?
• First we multiply them: 10 × 51.2 × 8 = 4096
• Then (as there are three numbers) take the cube root: $\sqrt[3]{4096} = 16$
• For n numbers: multiply them all together and then take the nth root
(written $\sqrt[n]{\ }$)
• Geometric Mean = $\sqrt[3]{10 \times 51.2 \times 8} = 16$
• It is like the volume is the same: a cube with side 16 has the same
volume as a 10 × 51.2 × 8 box.
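Both geometric mean examples can be reproduced with a short Python sketch. The helper name is an assumption, and `math.prod` requires Python 3.8 or later. Because the nth root is computed in floating point, the results are rounded before printing.

```python
from math import prod

def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return prod(values) ** (1 / len(values))

gm_two = geometric_mean([2, 18])
gm_three = geometric_mean([10, 51.2, 8])
print(round(gm_two, 6), round(gm_three, 6))
```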

04/13/2024
Estimating the Mean from Grouped Data
Someone timed 21 people in the race, to the
nearest second:
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4

•The groups (51-55, 56-60, etc), also called class
intervals, are of width 5

•The midpoints are in the middle of each class: 53, 58,
63 and 68
04/13/2024
Cntd…

We can estimate the Mean by using the midpoints


So, how does this work?
Think about the 7 runners in the group 56 - 60: all we know is that they ran
somewhere between 56 and 60 seconds:
•Maybe all seven of them did 56 seconds,
•Maybe all seven of them did 60 seconds,
•But it is more likely that there is a spread of numbers: some at 56, some at 57,
etc
So we take an average and assume that all seven of them took 58 seconds.
04/13/2024
Cntd…
• Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8
people took 63 sec and 4 took 68 sec". In other words
we imagine the data looks like this:
• 53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63,
68, 68, 68, 68
• Then we add them all up and divide by 21. The quick way to do
it is to multiply each midpoint by its frequency and add the
products: 2×53 + 7×58 + 8×63 + 4×68 = 1288
• And then our estimate of the mean time to complete the race
is:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} = \frac{1288}{21} = 61.333\ldots$$
where $m_i$ is the midpoint and $f_i$ the frequency of the i-th class, and k is the number of classes.
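The midpoint method for the race data can be sketched in Python as follows. The variable names are assumptions; the classes and frequencies come from the table of 21 runners.

```python
# Class limits (seconds) and frequencies from the race example.
classes = [(51, 55), (56, 60), (61, 65), (66, 70)]
freqs = [2, 7, 8, 4]

# Midpoint of each class interval.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Sum of midpoint * frequency, divided by the total frequency.
weighted_sum = sum(m * f for m, f in zip(midpoints, freqs))
n = sum(freqs)
estimated_mean = weighted_sum / n
print(weighted_sum, n, round(estimated_mean, 3))
```

This is the same weighted-mean calculation used earlier, with the class midpoints as values and the frequencies as weights.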

04/13/2024
Correct mean
• If a wrong figure has been used when calculating the mean, the correct
mean can be obtained without repeating the whole process using:
$$\bar{x}_{\text{correct}} = \bar{x}_{\text{wrong}} + \frac{\text{correct value} - \text{wrong value}}{n}$$

• Example: The average weight of 10 patients was calculated to be
65 kg. Later it was discovered that one weight was misread as 40 kg instead
of 80 kg. Calculate the correct average weight.
• Solution: correct mean = 65 + (80 − 40)/10 = 69 kg
The effect of transforming original series on the mean.

• If a constant k is added to (or subtracted from) every
observation, then the new mean will be the old mean + k
(or the old mean − k), respectively.

• If every observation is multiplied by a constant k, then the
new mean will be k × old mean
Characteristics of mean
• The value of the arithmetic mean is determined by every
item in the series.
• It is greatly affected by extreme values.
Advantages
• It is based on all values given in the distribution.
• It is most easily understood.
• It is most amenable to algebraic treatment.

04/13/2024
Disadvantages
• It may be greatly affected by extreme items and its
usefulness as a “Summary of the whole” may be
considerably reduced.
• When the distribution has open-ended classes, its
computation would be based on an assumption, and therefore may
not be valid.

04/13/2024
Median
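• Suppose there are n observations in a sample. If
these observations are ordered from smallest to
largest, then the median is defined as follows:
• The sample median is
(1) the ((n + 1)/2)th ordered observation if n is odd;
(2) the average of the (n/2)th and (n/2 + 1)th ordered observations if n is even.
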
04/13/2024
Example 2
2.1. Compute the sample median for the birth
weight data in example 1.
2.2. Consider the following data, which consist of white
blood counts taken on admission of all patients
entering a small hospital on a given day. Compute the
median white blood count (×10³).
7, 35, 5, 9, 8, 3, 10, 12, 8
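One way to check the answer to question 2.2 is the Python sketch below. The function name is an assumption; it applies the usual rule of taking the middle ordered value when n is odd and averaging the two middle values when n is even.

```python
def sample_median(values):
    """Median of a sample: middle ordered value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# White blood counts (x 10^3) from question 2.2.
wbc = [7, 35, 5, 9, 8, 3, 10, 12, 8]
wbc_median = sample_median(wbc)
wbc_mean = sum(wbc) / len(wbc)
print(sorted(wbc), wbc_median, round(wbc_mean, 2))
```

Printing the mean next to the median shows how strongly the single large count of 35 pulls the mean upward while leaving the median unchanged.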

04/13/2024
Estimating the Median from Grouped Data
• Let's look at our data again:

Seconds   Frequency   Cumulative frequency
51 - 55   2           2
56 - 60   7           9
61 - 65   8           17
66 - 70   4           21

The median is the middle value, which in our case is
the 11th one, which is in the 61 - 65 group.
We can say "the median group is 61 - 65"

04/13/2024
Cntd…
• We call it "61 - 65", but it really includes values from 60.5 up to (but
not including) 65.5.
• Why? The values are in whole seconds, so a real time of 60.5 is
measured as 61. Likewise 65.4 is measured as 65.
• At 60.5 we already have 9 runners, and by the next boundary at 65.5
we have 17 runners. By drawing a straight line in between we can pick
out where the median frequency of n/2 runners is:

04/13/2024
Cntd..
• Estimated Median = $L + \frac{(n/2) - B}{G} \times w$, where:
• L is the lower class boundary of the group containing the median
• n is the total number of values
• B is the cumulative frequency of the groups before the median group
• G is the frequency of the median group
• w is the group width
• For our example:
• L = 60.5
• n = 21
• B = 2 + 7 = 9
• G = 8
• w = 5
• Estimated Median = $60.5 + \frac{(21/2) - 9}{8} \times 5 = 61.4375$
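The grouped-median formula can be written as a small Python function. The function name is an assumption, and the parameter names follow the symbols defined on the slide.

```python
def grouped_median(L, n, B, G, w):
    """Estimated median from grouped data.

    L: lower class boundary of the median group
    n: total number of values
    B: cumulative frequency of the groups before the median group
    G: frequency of the median group
    w: group width
    """
    return L + ((n / 2) - B) / G * w

# Values from the race example.
estimate = grouped_median(L=60.5, n=21, B=9, G=8, w=5)
print(estimate)
```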
04/13/2024
i) Characteristics of Median
• It is an average of position/location.
• It is affected more by the number of items than by extreme values.

ii) Advantages
• It is easily calculated and is not much disturbed by extreme
values
• It is more typical of the series
• The median may be located even when the data are incomplete,
e.g, when the class intervals are irregular and the final classes
have open ends.
04/13/2024
iii) Disadvantages
• It is determined mainly by the middle points in a
sample and is less sensitive to the actual numerical
values of the remaining data points.
• It is not as generally familiar as the arithmetic mean

04/13/2024
Mode
• It is the value of the observation that occurs with the greatest
frequency.
• A particular disadvantage is that, with a small number of
observations, there may be no mode.
• In addition, sometimes, there may be more than one mode such
as when dealing with a bimodal (two-peak) distribution.
• Find the modal values for the following data
a) 22, 66, 69, 70, 73. (No modal value)
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)

04/13/2024
Estimating the Mode from Grouped Data
• We can easily find the modal group (the group with the highest
frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• Estimated Mode = $L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$

04/13/2024
Cntd…
• where:
• L is the lower class boundary of the modal group
• fm-1 is the frequency of the group before the modal group
• fm is the frequency of the modal group
• fm+1 is the frequency of the group after the modal group
• w is the group width
• In this example:
• L = 60.5
• fm-1 = 7
• fm = 8
• fm+1 = 4
• w = 5
• Estimated Mode = $60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 = 60.5 + (1/5) \times 5 = 61.5$
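The grouped-mode formula can be sketched in the same way. The function and argument names are assumptions that mirror the symbols fm-1, fm and fm+1 used on the slide.

```python
def grouped_mode(L, f_prev, f_modal, f_next, w):
    """Estimated mode from grouped data.

    L: lower class boundary of the modal group
    f_prev, f_modal, f_next: frequencies of the group before,
        the modal group, and the group after
    w: group width
    """
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    return L + d1 / (d1 + d2) * w

# Values from the race example.
mode_estimate = grouped_mode(L=60.5, f_prev=7, f_modal=8, f_next=4, w=5)
print(mode_estimate)
```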
04/13/2024
Mode
Characteristics
• It is an average of position
• It is not affected by extreme values
• It is the most typical value of the distribution
Advantages
• Since it is the most typical value it is the most descriptive
average
• Since the mode is usually an “actual value”, it indicates the
precise value of an important part of the series.
04/13/2024
Disadvantages:-
• Unless the number of items is fairly large and the distribution
reveals a distinct central tendency, the mode has no
significance
• It is not capable of mathematical treatment
• In a small number of items the mode may not exist.

04/13/2024
Skewness:
• If extremely low or extremely high observations are present in a
distribution, then the mean tends to shift towards those scores. Based
on the type of skewness, distributions can be:
• Negatively skewed distribution: occurs when the majority of scores are at
the right end of the curve and a few extremely small scores are scattered at the
left end.
• Positively skewed distribution: occurs when the majority of scores are
at the left end of the curve and a few extremely large scores are scattered
at the right end.
• Symmetrical distribution: it is neither positively nor negatively skewed.
A curve is symmetrical if one half of the curve is the mirror image of the
other half.
04/13/2024
Skewness…
• Data can be "skewed", meaning it tends to have a long tail on one
side or the other:

• Negative Skew?
• Why is it called negative skew? Because the long "tail" is on the
negative side of the peak.
• The mean is also on the left of the peak.
04/13/2024
Skewness…
The Normal Distribution has No Skew
A Normal Distribution is not skewed.
It is perfectly symmetrical.
And the Mean is exactly at the peak.

04/13/2024
Skewness…
Positive Skew
And positive skew is when the long tail is on the
positive side of the peak, and some people say it
is "skewed to the right".
The mean is on the right of the peak value.

04/13/2024
Skewness…

04/13/2024
Measures of Dispersion
• Which of the distributions of scores has the larger dispersion?
(Two bar charts, one above the other. x-axis: 1 to 10; y-axis: 0 to 125)
• The upper distribution has more dispersion because the scores are
more spread out

04/13/2024
Measures of Dispersion

• How “spread out” the numbers are about the centre?


• Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50 (mean = 50)
Set 2: 50 49 49 51 48 50 53 50 (mean = 50)

• The two data sets given above both have a mean of 50, but obviously set 1 is more
“spread out” than set 2. How do we express this numerically?
• Some of the commonly used measures of dispersion (variation) are: range,
interquartile range, quartiles, percentiles, variance, standard deviation and
coefficient of variation.
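As a rough numerical illustration of the difference between the two sets, the Python sketch below computes the mean, the range and the sample standard deviation of each. The variable names are assumptions, and `statistics.stdev` uses the n − 1 divisor.

```python
from statistics import mean, stdev

set1 = [60, 40, 30, 50, 60, 40, 70, 50]
set2 = [50, 49, 49, 51, 48, 50, 53, 50]

summary = {}
for name, data in (("Set 1", set1), ("Set 2", set2)):
    summary[name] = {
        "mean": mean(data),
        "range": max(data) - min(data),   # largest minus smallest observation
        "sd": stdev(data),                # sample standard deviation
    }
    print(name, summary[name]["mean"], summary[name]["range"], round(summary[name]["sd"], 2))
```

Both sets share the same mean, so only the range and the standard deviation reveal that Set 1 is more spread out.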
04/13/2024
Range and Interquartile Range
• Range
• Simplest and the crudest measure of variation
• Difference between the largest and the smallest observations: Range =
Xlargest – Xsmallest
• Ignores the way in which data are distributed
• It wastes information for it takes no account of the entire data.
• Sensitive to outliers
• Interquartile Range
• Eliminate some high- and low-valued observations and calculate the range
from the remaining values
• Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
Quartiles and Percentiles

• The quartiles divide the distribution into four equal parts.


• Deciles: If data is ordered and divided into 10 parts, then cut points
are called Deciles

• Percentiles: If data is ordered and divided into 100 parts, then cut
points are called Percentiles

04/13/2024
Quartiles
• The 25th percentile is often referred to as the first
quartile and denoted Q1.
• The 50th percentile (the median) is referred to as
the second or middle quartile and written Q2, and
• the 75th percentile is referred to as the third
quartile, Q3.
When we wish to find the quartiles for a set of data, the
following formulas are used

04/13/2024
Using the Five-Number Summary to Explore
the Shape
• Box-and-Whisker Plot: A Graphical display of data using 5-number
summary:

Minimum, Q1, Median, Q3, Maximum

• The box and central line are centered between the endpoints if the data
are symmetric around the median
(Diagram: box-and-whisker plot marked Min, Q1, Median, Q3, Max)

Distribution Shape and Box-and-Whisker Plot
(Diagram: three box plots, each marked Q1, Q2, Q3, for left-skewed, symmetric and right-skewed distributions)
Standard Deviation and Variance
• show the scatter of the individual measurements around the mean of
all the measurements in a given distribution.
• The variance represents squared units and, therefore, is not an
appropriate measure of dispersion when we wish to express this
concept in terms of the original units.
• To obtain a measure of dispersion in original units, we merely take the
square root of the variance. The result is called the standard
deviation.
• Variance is the average of the squared differences from the mean
• Standard deviation is the square root of the variance
04/13/2024
Variance and Standard Deviation
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
SD = $\sqrt{\text{variance}}$
04/13/2024
To calculate standard deviation
1. Calculate the mean $\bar{x}$
2. Calculate the residual for each x: $x - \bar{x}$
3. Square the residuals: $(x - \bar{x})^2$
4. Calculate the sum of the squares: $\sum (x - \bar{x})^2$
5. Divide the sum in Step 4 by (n − 1): $\frac{\sum (x - \bar{x})^2}{n - 1}$
6. Take the square root of the quantity in Step 5: $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$


04/13/2024
Example: Find the Standard Deviation of Ungrouped
Data

Family No. 1 2 3 4 5 6 7 8 9 10

Size (xi) 3 3 4 4 5 5 6 6 7 7

04/13/2024
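Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$

Family No.            1   2   3   4   5   6   7   8   9   10   Total
$x_i$                 3   3   4   4   5   5   6   6   7   7    50
$x_i - \bar{x}$      -2  -2  -1  -1   0   0   1   1   2   2    0
$(x_i - \bar{x})^2$   4   4   1   1   0   0   1   1   4   4    20

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.22$, $s = \sqrt{2.22} \approx 1.49$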
04/13/2024
Example
• The lengths of five newborn babies are: 600 mm, 470 mm, 170 mm, 430 mm
and 300 mm.
• Find the Mean, the Variance, and the Standard Deviation.
• Your first step is to find the Mean:
• Answer:
• Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
• so the mean (average) length is 394 mm.
• Let's plot this on the chart:

04/13/2024
Cntd…

04/13/2024
To calculate the Variance, take each difference from the mean,
square it, and then average the result:
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704

Standard Deviation

σ = √21,704
= 147.32...
= 147 (to the nearest mm)
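The two worked examples use different divisors: the newborn-length example averages the squared differences over all 5 values (the population formula), while the family-size example divides by n − 1 (the sample formula). A minimal Python sketch using the standard `statistics` module shows both; the variable names are assumptions.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

lengths_mm = [600, 470, 170, 430, 300]          # newborn lengths
family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]   # family-size example

# Population formulas (divide by N), as in the newborn-length example.
print(mean(lengths_mm), pvariance(lengths_mm), round(pstdev(lengths_mm), 2))

# Sample formulas (divide by n - 1), as in the family-size example.
print(mean(family_sizes), round(variance(family_sizes), 2), round(stdev(family_sizes), 2))
```

Which divisor is appropriate depends on whether the data are treated as the whole population or as a sample from it.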
04/13/2024
Cntd…

04/13/2024
• And the good thing about the Standard Deviation is that it is useful.
Now we can show which lengths are within one Standard Deviation
(147mm) of the Mean:
• So, using the Standard Deviation we have a "standard" way of
knowing what is normal, and what is extra long or extra short.

04/13/2024
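Why square the differences?
• If we just add up the differences from the mean ... the negatives
cancel the positives:
$\frac{4 + 4 - 4 - 4}{4} = 0$

So that won't work. How about we use absolute values?
$\frac{|7| + |1| + |-6| + |-2|}{4} = 4$
This is the same value that differences of 4, 4, −4 and −4 would give,
even though 7, 1, −6 and −2 are more spread out. But if we square each
difference, average, and take the square root:
$\sqrt{\frac{7^2 + 1^2 + 6^2 + 2^2}{4}} = \sqrt{\frac{90}{4}} = 4.74\ldots$
which is larger than the 4 obtained for differences of 4, 4, −4 and −4,
so the extra spread is reflected.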
04/13/2024
Coefficient of Variation

• Measures relative variation


• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of data
measured in different units
$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$
04/13/2024
Basic principles of probability, rules and
its applications

04/13/2024 167
Probability
• Probability is the language of chance.
• The deliberate use of chance is the central idea of statistical designs
for producing data.
• Probability is so important for data – leaders of the distribution as
maps for a journey
• Probabilities are used in everyday communication
• Probability theory was developed out of attempting to solve
problems related to games of chance such as tossing a coin, rolling a
die etc.
i.e. trying to quantify personal beliefs regarding degrees of
uncertainty.
04/13/2024
Question from Simple Probabilities
1. What is the probability that a card drawn at random
from a deck of cards will be an ace ?
2. A book contains 32 pages numbered 1, 2, ..., 32. If a
student randomly opens the book, what is the
probability that the page number contains digit 1?
3. A mother is in the delivery room, and the health worker
has informed her that she will deliver at 9:30 pm. She is
eager to give birth to a daughter. What is the probability
that she will get what she wants?

04/13/2024
Chance

• When a meteorologist states that the chance of rain is


50%, the meteorologist is saying that it is equally likely to
rain or not to rain. If the chance of rain rises to 80%, it is
more likely to rain. If the chance drops to 20%, then it
may rain, but it probably will not rain.

• These examples suggest the chance of an occurrence of


some event of a random variable.

04/13/2024
Basic terms
• Experiment: Is any activity from which result can
be obtained.
• Example: 1. flipping a coin
2. rolling a die
3. drawing 30 individuals from the population
• Sample space: set of possible outcome from the
experiment
Example: 1. coin toss {H, T}
2. Rolling a die {1, 2, 3, 4, 5, 6}
• Event: a collection of outcomes
04/13/2024
• The Sample Space is all
possible outcomes.
• A Sample Point is just
one possible outcome.
• And an Event can be
one or more of the
possible outcomes.

04/13/2024
Properties of probability
1. The probability of any event lies between 0 and 1
2. In general, the probability that at least one of two events
happens is given by
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
3. If two events are mutually exclusive, then
P(A ∪ B) = P(A) + P(B)
4. If two events are independent, then
P(A ∩ B) = P(A) · P(B)

04/13/2024
Unions and Intersections
Unions of Two Events
• “If A and B are events, then the union of A and B, denoted by
A ∪ B, represents the event composed of all basic outcomes in A
or B.”
Intersections of Two Events
• “If A and B are events, then the intersection of A and B,
denoted by A ∩ B, represents the event composed of all basic
outcomes in A and B.”
(Venn diagram of events A and B)

04/13/2024
Addition rules
• Rule 1: If 2 events, B & C, are mutually exclusive (i.e., no overlap) then
the probability that one or both occur is P(B or C) = P(B ∪ C) = P(B) +
P(C)
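• Rule 2: If two events are mutually exclusive and together cover the whole
sample space (that is, they are complementary), then the sum of their
probabilities is equal to one. The converse does not hold: two events whose
probabilities sum to one need not be mutually exclusive.
• Rule 3: For any 2 events, A & B, not mutually exclusive, the probability
that one or both occur is P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)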

04/13/2024
• Example 1: One die is rolled. Sample space = S = (1, 2, 3, 4, 5, 6)
Let A = the event an odd number turns up, A = (1, 3, 5)
Let B = the event a 1, 2 or 3 turns up; B = (1, 2, 3)
Let C = the event a 2 turns up, C= (2)
I) Find Pr (A); Pr (B) and Pr (C)
• Pr (A) = Pr (1) + Pr (3) + Pr (5) = 1/6+1/6+ 1/6 = 3/6 = 1/2
• Pr (B) = Pr (1) + pr (2) + Pr (3) = 1/6+1/6+1/6 = 3/6 = ½
• Pr (C) = Pr (2) = 1/6
II) Are A and B; A and C; B and C mutually exclusive?
• A and B are not mutually exclusive. Because they have the
elements 1 and 3 in common
• Similarly, B and C are not mutually exclusive. They have the
element 2 in common
• A and C are mutually exclusive. They don’t have any element in
common
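The die example can be checked with Python sets. This is only a sketch: the helper names are assumptions, and the six outcomes are taken to be equally likely.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
A = {1, 3, 5}            # an odd number turns up
B = {1, 2, 3}            # a 1, 2 or 3 turns up
C = {2}                  # a 2 turns up

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

def mutually_exclusive(e1, e2):
    """Two events are mutually exclusive when they share no outcome."""
    return e1.isdisjoint(e2)

print(prob(A), prob(B), prob(C))
print(mutually_exclusive(A, B), mutually_exclusive(B, C), mutually_exclusive(A, C))

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)
print(p_a_or_b)
```

Note that P(A) + P(B) = 1 here even though A and B overlap, so probabilities summing to one do not by themselves imply that events are mutually exclusive.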
04/13/2024
The Addition . . .
If two events A and B are not mutually exclusive, then, P (A
U B) = P (A) + P (B) – P (A and B)
Example
1. There are 80 nurses and 40 physicians in a hospital. Of
these, 70 nurses and 15 physicians are females. If a staff
person is selected at random, find the probability that the
subject is a nurse or male.

             Male   Female   Total
Nurse        10     70       80
Physician    25     15       40
Total        35     85       120

P(N ∪ M) = P(N) + P(M) – P(N ∩ M)
= 80/120 + 35/120 – 10/120 = 105/120
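The same result can be obtained from the table of counts with a short Python sketch. The dictionary layout and names are assumptions; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Staff counts from the table: (profession, sex) -> number of people.
staff = {
    ("nurse", "male"): 10,
    ("nurse", "female"): 70,
    ("physician", "male"): 25,
    ("physician", "female"): 15,
}
total = sum(staff.values())

p_nurse = Fraction(sum(v for (job, _), v in staff.items() if job == "nurse"), total)
p_male = Fraction(sum(v for (_, sex), v in staff.items() if sex == "male"), total)
p_nurse_and_male = Fraction(staff[("nurse", "male")], total)

# Addition rule for events that are not mutually exclusive.
p_nurse_or_male = p_nurse + p_male - p_nurse_and_male
print(p_nurse_or_male)
```

`Fraction` reduces the answer to lowest terms, so the printed value is the reduced form of 105/120.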
Summary of the Additive Rule

04/13/2024
Conditional probabilities and the multiplicative law
• Let's assume two questions on a test: the
first question is a true/false question and the second is
a multiple-choice question with five possible
answers (a, b, c, d, e)
• True or False: The heart is an organ which pumps blood in our
body.
• MCQ: Which of the following human organs is used for
breathing?
a. Brain b. Liver c. Lung d. Kidney e. Heart
• If the answers are random guesses, the 2 × 5 = 10
possible outcomes are equally likely, so each outcome has
probability 1/10

04/13/2024
• A tree diagram is a picture of the possible outcomes
of a procedure

04/13/2024

04/13/2024
Multiplicative Rule
• When two events are said to be independent of each
other, what this means is that the probability that
one event occurs in no way affects the probability of
the other event occurring.

• If two events A and B with non-zero probability
are independent events, each of the following must
be true:

• P(A/B) = P(A), and P(B/A) = P(B); and so, P(A and B) =
P(A) P(B)

04/13/2024
• Eg. 1) A classic example is n tosses of a coin and the
chances that on each toss it lands heads. These are
independent events. The chance of heads on any one
toss is independent of the number of previous heads. No
matter how many heads have already been observed,
the chance of heads on the next toss is ½.

• Eg 2) a similar situation prevails with the sex of offspring.


The chance of a male is approximately ½. Regardless of
the sexes of previous offspring, the chance the next child
is a male is still ½.

04/13/2024
• Sometimes the chance a particular event happens depends on
the outcome of some other event. This applies obviously with
many events that are spread out in time

• Eg. The chance a patient with some disease survives the next
year depends on his having survived to the present time. Such
probabilities are called conditional.

• The notation is Pr (B/A), which is read as “the probability event


B occurs given that event A has already occurred.”

• Let A and B be two events of a sample space S. The conditional
probability of an event A, given B, is denoted by Pr(A/B) and defined as
Pr(A/B) = P(A ∩ B) / P(B), P(B) ≠ 0.
• Similarly, P(B/A) = P(A ∩ B) / P(A), P(A) ≠ 0. This can
be taken as an alternative form of the multiplicative
law.
• For non-independent events A and B:
• P(A and B) = P(A/B) P(B) or P(A and B) = P(B/A) P(A)

• Eg. Suppose in country X the chance that an infant


lives to age 25 is .95, whereas the chance that he lives
to age 65 is .65. For the latter, it is understood that to
survive to age 65 means to survive both from birth to
age 25 and from age 25 to 65. What is the chance that
a person 25 years of age survives to age 65?
04/13/2024
Notation   Event                                            Probability
A          Survive birth to age 25                          .95
A and B    Survive both birth to age 25 and age 25 to 65    .65
B/A        Survive age 25 to 65 given survival to age 25    ?

Then, Pr(B/A) = Pr(A ∩ B) / Pr(A) = .65/.95 = .684.
That is, a person aged 25 has a 68.4 percent chance of
living to age 65.
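The survival calculation is a direct application of the definition of conditional probability, sketched below in Python. The function name is an assumption.

```python
def conditional_probability(p_a_and_b, p_a):
    """Pr(B/A) = Pr(A and B) / Pr(A), defined only when Pr(A) is not zero."""
    if p_a == 0:
        raise ValueError("Pr(A) must be non-zero")
    return p_a_and_b / p_a

p_survive_to_25 = 0.95   # Pr(A)
p_survive_to_65 = 0.65   # Pr(A and B): survive to 25 and then from 25 to 65

p_65_given_25 = conditional_probability(p_survive_to_65, p_survive_to_25)
print(round(p_65_given_25, 3))
```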
04/13/2024
Example
1)Consider selecting a child at random from a
kindergarten; let A = event a child is infected with
ascariasis, G = event a child has giardiasis. Suppose
P(A) = .30, P(G) = .25, P(A n G) = .13.
a) What’s the probability that a child randomly selected
from the KG has giardiasis, given that we know s/he
has ascariasis?
b) What is the probability that a child randomly selected
from the KG will test negative for these intestinal
parasites?
2. Of 200 senior students at a certain college, 98 are women,
34 are majoring in Biology, and 20 Biology majors are
women. If one student is chosen at random from the senior
class, what is the probability that the choice will be either a
Biology major or a woman?
Exercise: Calculating probability of an
event
Table 1: Frequency of cocaine use by gender
among adult cocaine users

Lifetime frequency of cocaine use   Male   Female   Total
1-19 times                          32     7        39
20-99 times                         18     20       38
more than 100 times                 25     9        34
Total                               75     36       111

04/13/2024
Questions
1. What is the probability that a randomly picked
person is male?
2. What is the probability that a randomly picked
person has used cocaine more than 100 times?
3. Given that the selected person is male, what is the
probability that he has used cocaine more than
100 times?
4. Given that the person has used cocaine fewer than
100 times, what is the probability of being female?
5. What is the probability that a randomly picked
person is male and has used cocaine more than 100
times?
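One way to check answers to these questions is to compute the probabilities directly from the counts in Table 1. The Python sketch below is only an illustration: the dictionary keys and variable names are assumptions, and exact fractions are used throughout.

```python
from fractions import Fraction

# Counts from Table 1 (rows: lifetime frequency of use; columns: gender).
table = {
    "1-19": {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    "100+": {"male": 25, "female": 9},
}

total = sum(sum(row.values()) for row in table.values())
males = sum(row["male"] for row in table.values())
heavy = sum(table["100+"].values())
under_100 = sum(table["1-19"].values()) + sum(table["20-99"].values())
females_under_100 = table["1-19"]["female"] + table["20-99"]["female"]

p_male = Fraction(males, total)                                     # question 1
p_heavy = Fraction(heavy, total)                                    # question 2
p_heavy_given_male = Fraction(table["100+"]["male"], males)         # question 3
p_female_given_under_100 = Fraction(females_under_100, under_100)   # question 4
p_male_and_heavy = Fraction(table["100+"]["male"], total)           # question 5

print(p_male, p_heavy, p_heavy_given_male, p_female_given_under_100, p_male_and_heavy)
```

The answer to question 5 equals the product of the answers to questions 3 and 1, which is the multiplicative law P(A and B) = P(B/A) P(A).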
04/13/2024
Summary for the Multiplicative Rule

04/13/2024
Probability as a Numerical Measure of the Likelihood of
Occurrence
(Diagram: probability scale from 0 through .5 to 1, with increasing likelihood of occurrence from 0 to 1)
At probability .5, the occurrence of the event is
just as likely as it is unlikely.

04/13/2024
Permutations
The number of possible permutations is the number of
different orders in which particular events can occur. The
number of possible permutations is

$$N_p = \frac{n!}{(n-r)!}$$

where r is the number of events in the series, n is the
number of possible events, and n! denotes the factorial
of n, i.e. the product of all the positive integers from 1 to n.
Combinations
When the order in which the events occurred is of no
interest, we are dealing with combinations. The number
of possible combinations is

$$N_c = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

where r is the number of events in the series, n is the
number of possible events, and n! denotes the factorial
of n, i.e. the product of all the positive integers from 1 to n.
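The two counting formulas can be compared side by side in a short sketch. The values n = 5 and r = 3 are hypothetical and chosen only for illustration; `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math

n, r = 5, 3  # hypothetical: 3 events in the series out of 5 possible events

# Ordered selections (permutations): n! / (n - r)!
permutations = math.factorial(n) // math.factorial(n - r)

# Unordered selections (combinations): n! / (r! (n - r)!)
combinations = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

# The standard library provides both counts directly.
assert permutations == math.perm(n, r)
assert combinations == math.comb(n, r)

print(permutations, combinations)  # 60 10
```

Each combination of 3 events corresponds to 3! = 6 different orderings, which is why the permutation count is six times the combination count.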
Bayes' Theorem

• Bayes' Theorem shows the relationship between a conditional probability and its inverse,

i.e. it allows us to make an inference from the probability of a hypothesis given the evidence to
the probability of that evidence given the hypothesis,
and vice versa.
Bayes' Theorem

• P(A|B) = P(B|A) P(A) / P(B)

• P(A) – the PRIOR PROBABILITY – represents your knowledge about A before you have gathered data.
• e.g. if 0.01 of a population has schizophrenia, then the
probability that a person drawn at random would have
schizophrenia is 0.01.
Bayes' Theorem

• P(A|B) = P(B|A) P(A) / P(B)

• P(B|A) – the CONDITIONAL PROBABILITY – the probability of B, given A.
• e.g. you are trying to roll a total of 8 on two dice. What
is the probability that you achieve this, given that the
first die rolled a 6?
Bayes' Theorem

• P(A|B) = P(B|A) P(A) / P(B)

• So the theorem says:
• The probability of A given B is equal to the probability
of B given A, times the prior probability of A, divided by
the overall (marginal) probability of B.
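As a rough check of the theorem, the two-dice example from the slides can be enumerated exhaustively. The sketch below assumes two fair six-sided dice; the helper `prob` and the event names are introduced here purely for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def total_is_8(o):
    return o[0] + o[1] == 8

def first_is_6(o):
    return o[0] == 6

p_total_8 = prob(total_is_8)
p_first_6 = prob(first_is_6)
p_both = prob(lambda o: total_is_8(o) and first_is_6(o))

# Inverse conditional probability: P(first is 6 | total is 8)
p_first_6_given_total_8 = p_both / p_total_8

# Bayes' theorem: P(total 8 | first 6) = P(first 6 | total 8) P(total 8) / P(first 6)
p_total_8_given_first_6 = p_first_6_given_total_8 * p_total_8 / p_first_6

# Direct calculation from the definition of conditional probability.
assert p_total_8_given_first_6 == p_both / p_first_6

print(p_total_8_given_first_6)  # 1/6
```

The answer is 1/6 because, once the first die shows 6, only a 2 on the second die gives a total of 8.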
Probability distribution
• Every random variable has a corresponding probability distribution.
• A probability distribution applies the theory of probability to describe the
behavior of the random variable.

• The term probability distribution, or just distribution, refers to the way the values of a variable are
distributed; it is used to draw conclusions about a set of data.

• A probability distribution of a random variable can be displayed by a table, a graph or a mathematical formula.

• With categorical variables, we obtain the frequency distribution of each variable.

• With numeric variables, the aim is to determine whether or not normality may
be assumed.
I. Probability distribution of a categorical variable
• The probability distribution of a categorical variable tells us with what
probability the variable will take on the different possible values.
• That is, it specifies all possible outcomes of the categorical variable along with
the probability that each will occur.
E.g. Consider the value on the face showing up from tossing a die. The probability
distribution of this variable is
Value on Face 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6
• Notice that the total probability is 1.

Bernoulli Distribution
• A random experiment consisting of a single trial with only two possible
outcomes, with probabilities p and q where p + q = 1, is called a Bernoulli trial.
• The outcome of the experiment is either success (i.e., 1) or
failure (i.e., 0).
• Pr(X=1) = p, Pr(X=0) = 1-p, or equivalently Pr(X=x) = p^x (1-p)^(1-x) for x = 0, 1

• E[X] = p, Var(X) = p(1-p)
Binomial distribution
• In general, the binomial distribution involves three assumptions:
• There is a fixed number n of trials, each of which results in one of two mutually exclusive
outcomes.
• The outcomes of the n trials are independent.
• The probability of “success” is constant for each trial:
• Pr (X=success) = Pr (X=1) = p
• Pr (X=failure) = Pr (X=0) = 1-p

$$P(k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
The binomial distribution
A process that has only two possible outcomes is called
a binomial process. In statistics, the two outcomes are
frequently denoted as success and failure. The binomial
distribution describes the sum of n independent and identically
distributed Bernoulli trials. It gives the probability of
exactly k successes in n trials:

$$P(k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
Binomial distribution….
• In addition to the probabilities of individual outcomes, we can also compute the
numerical summary measures associated with a probability distribution.
• The mean (the average number of successes in repeated samples of size n) and
the variance of a binomial distribution are

$$\mu = np, \qquad V = npq \quad (q = 1 - p)$$

• Example 1: In a sample of 1000 people from the US population there are 290 smokers,
so p = 0.29. The mean and standard deviation of the number of smokers in samples
of this size are:
• Mean = n × p = 1000 × 0.29 = 290
• SD = √(1000 × 0.29 × 0.71) = √205.9 ≈ 14.3
Binomial distribution….
Example 2: Suppose that in a certain population 52% of all recorded births are
males. If we randomly select 10 birth records, what is the probability that
• exactly 5 will be males? Given n = 10, x = 5, p = 0.52:

$$\Pr(X = x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$$

So

$$\Pr(X = 5) = \frac{10!}{5!\,(10-5)!} \times 0.52^{5} \times (1-0.52)^{10-5} = 0.24$$

• 3 or more will be females? Let X now be the number of females, so p = 1 − 0.52 = 0.48:
• Pr(X≥3) = 1 − Pr(X<3) = 1 − [Pr(X=0) + Pr(X=1) + Pr(X=2)]
= 1 − [0.001 + 0.013 + 0.055] = 1 − 0.069 = 0.931
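The binomial examples can be reproduced with a few lines of code. This is a sketch rather than a reference implementation: `binomial_pmf` is a helper name introduced here, built on `math.comb` from the standard library (Python 3.8+).

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Example 1: number of smokers in a sample of 1000 with p = 0.29.
n, p = 1000, 0.29
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Example 2: 10 birth records, P(male) = 0.52, P(female) = 0.48.
p_five_males = binomial_pmf(5, 10, 0.52)
p_three_or_more_females = 1 - sum(binomial_pmf(k, 10, 0.48) for k in range(3))

print(round(mean), round(sd, 2))
print(round(p_five_males, 2), round(p_three_or_more_females, 2))
```

Without intermediate rounding, the probability of three or more females comes out at about 0.930; the 0.931 on the slide arises from rounding each term to three decimals before summing.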

Random variable and Probability distributions

• A random variable is a variable that has a single numerical value, determined
by chance, for each outcome of a procedure.

• A discrete random variable has either a finite number of values or a
countable number of values. E.g. the number of eggs that a hen lays in a
day (possible values are 0, 1 or 2).

• A continuous random variable has infinitely many values, and those values
can be associated with measurements on a continuous scale in such a way
that there are no gaps or interruptions.
E.g. voltage of electricity.
Every probability distribution must satisfy each of the following two requirements:

• Since the values of a probability distribution are probabilities, they
must be numbers in the interval from 0 to 1.
• Since a random variable has to take on one of its
values, the sum of all the values of a probability
distribution must be equal to 1.
Random Variable
• A Random Variable is a set of possible values from a
random experiment
• Example: Tossing a coin: we could get Heads or Tails.
• Let's give them the values Heads=0 and Tails=1 and
we have a Random Variable "X":
Random variable    Possible values    Random events
X =                0                  H
                   1                  T
• So:
• We have an experiment (like tossing a coin)
• We give values to each event
• The set of values is a Random Variable

• E.g. Toss a coin 3 times. Let X be the number of heads obtained. Find the
probability distribution of X: f(x) = Pr(X = xi), i = 0, 1, 2, 3.
• Pr(X = 0) = 1/8: TTT
• Pr(X = 1) = 3/8: HTT, THT, TTH
• Pr(X = 2) = 3/8: HHT, THH, HTH
• Pr(X = 3) = 1/8: HHH
• The required conditions are also satisfied: i) f(x) ≥ 0, ii) Σ f(xi) = 1
• Probability distribution of X:

X = xi        0     1     2     3
Pr(X=xi)     1/8   3/8   3/8   1/8
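The coin-toss distribution can also be obtained by listing all eight equally likely sequences and counting heads. The sketch below assumes a fair coin; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 equally likely sequences of three tosses, e.g. ('H', 'T', 'T').
tosses = list(product("HT", repeat=3))

# Count how many sequences give each number of heads.
head_counts = Counter(seq.count("H") for seq in tosses)

# Probability distribution of X = number of heads.
distribution = {x: Fraction(c, len(tosses)) for x, c in sorted(head_counts.items())}

for x, p in distribution.items():
    print(x, p)

# Both requirements of a probability distribution hold.
assert all(0 <= p <= 1 for p in distribution.values())
assert sum(distribution.values()) == 1
```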

The birth of a son and the birth of a daughter
are mutually exclusive events
because the two events cannot
happen at the same time.

The birth of a daughter and the
birth of a carrier of the sickle-cell
anemia allele are not mutually
exclusive because the two events
can happen at the same time (they
are independent events).
Example: Sex Ratio in a Family of 3
• Assume that the probability of a boy = 1/2 and the probability of a girl = 1/2.

i. How many possible sex distributions are there for a family of 3 children?

ii. What is the probability of occurrence of each event?

iii. What is the chance of 2 boys AND 1 girl?

Child #1   Child #2   Child #3
   B          B          B
   B          B          G
   B          G          B
   B          G          G
   G          B          B
   G          B          G
   G          G          B
   G          G          G
• Solution:
i. 8 possibilities
ii. The probability of each event is 1/8
(1/2 x 1/2 x 1/2).
iii. There are 3 ways to have 2 boys AND 1 girl:
BBG, BGB, and GBB.
• Thus, the chance is 1/8 + 1/8 + 1/8 =
3/8.
The expected value of a discrete random variable
The expected value, denoted by E(X) or μ, represents the “average” value of the random variable. It is
obtained by multiplying each possible value by its respective probability and summing over all the values
that have positive probability.

Definition: The expected value of a discrete random variable is defined as

$$E(X) = \mu = \sum_{i=1}^{n} x_i \, P(X = x_i)$$

where the x_i’s are the values the random variable assumes with positive probability.

Example: Consider the random variable representing the number of episodes of diarrhea in the first 2
years of life. Suppose this random variable has a probability mass function as below:

r           0      1      2      3      4      5      6
P(X = r)   .129   .264   .271   .185   .095   .039   .017

What is the expected number of episodes of diarrhea in the first 2 years of life?
E(X) = 0(.129) + 1(.264) + 2(.271) + 3(.185) + 4(.095) + 5(.039) + 6(.017) = 2.038

Thus, on average a child would be expected to have about 2 episodes of diarrhea in the first 2 years of life.
The variance of a discrete random variable
The variance represents the spread of all values that have positive probability relative to the expected
value. In particular, the variance is obtained by multiplying the squared distance of each possible value
from the expected value by its respective probability and summing over all the values that have positive
probability.

Definition: The variance of a discrete random variable X, denoted by V(X), is defined by

$$V(X) = \sigma^{2} = \sum_{i=1}^{k} (x_i - \mu)^{2} P(X = x_i) = \sum_{i=1}^{k} x_i^{2} P(X = x_i) - \mu^{2}$$

where the x_i’s are the values for which the random variable takes on positive probability. The SD of a
random variable X, denoted by SD(X) or σ, is defined as the square root of its variance.
Example: Compute the variance and SD for the random variable representing the number of episodes
of diarrhea in the first 2 years of life.

E(X) = μ = 2.038

$$\sum_{i=1}^{n} x_i^{2} P(X = x_i) = 0^{2}(.129) + 1^{2}(.264) + 2^{2}(.271) + 3^{2}(.185) + 4^{2}(.095) + 5^{2}(.039) + 6^{2}(.017) = 6.12$$

Thus, V(X) = 6.12 − (2.038)² = 1.967 and the SD of X is σ = √1.967 = 1.402.
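The expected value, variance and SD of the diarrhea example can be recomputed from the probability mass function. A minimal sketch, assuming the table is stored as a dictionary (the name `pmf` is illustrative):

```python
import math

# Probability mass function from the example: episodes -> probability.
pmf = {0: 0.129, 1: 0.264, 2: 0.271, 3: 0.185, 4: 0.095, 5: 0.039, 6: 0.017}

# E(X): each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Shortcut form of the variance: E(X^2) - mu^2.
second_moment = sum(x**2 * p for x, p in pmf.items())
variance = second_moment - mean**2

# Definition form: probability-weighted squared distance from the mean.
variance_by_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

sd = math.sqrt(variance)

print(round(mean, 3), round(variance, 3), round(sd, 3))  # 2.038 1.967 1.402
```

Both forms of the variance formula agree up to floating-point rounding.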

Binomial distribution, generally

Note the general pattern emerging: if you have only two
possible outcomes (call them 1/0 or yes/no or success/failure) in n
independent trials, then the probability of exactly X “successes” is

$$\binom{n}{X} p^{X} (1-p)^{n-X}$$

where
n = number of trials
X = number of successes out of n trials
p = probability of success
1 − p = probability of failure
Exercise
1. Each child born to a particular set of
parents has a probability of 0.25 of having
blood type O. If these parents have 5
children, what is the probability that
a. Exactly 2 of them have blood type O?
b. At most 2 have blood type O?
c. At least 4 have blood type O?
d. Exactly 2 do not have blood type O?
Exercise….
2. Suppose past experience in a certain malarious area
indicates that the probability that a person with a high
fever will be positive for malaria is 0.7. Consider 3
randomly selected patients (with high fever) in that same
area.
a) What is the probability that no patient will be positive
for malaria?
b) What is the probability that exactly one patient will be
positive for malaria?
c) What is the probability that exactly two of the patients
will be positive for malaria?
d) What is the probability that all patients will be positive
for malaria?
The Poisson distribution
When the probability of “success” is very small, e.g., the
probability of a mutation, then p^k and (1 − p)^(n − k) become too
small to calculate exactly by the binomial distribution. In
such cases, the Poisson distribution becomes useful. Let λ
be the expected number of successes in a process
consisting of n trials, i.e., λ = np. The probability of
observing k successes is

$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

The mean and variance of a Poisson distributed variable are
given by μ = λ and V = λ, respectively.
Plots of Poisson Distribution

The Poisson distribution…
• Example 3. Suppose X is a random variable representing
the number of individuals involved in a road accident
each year (in the US, 2.4 per 10,000 population are involved
each year).
• I.e. λ = 2.4 per 10,000
• Pr(X=0) = e^(−2.4) × 2.4^0 / 0! = 0.091
• Pr(X=1) = e^(−2.4) × 2.4^1 / 1! = 0.218
• Pr(X=2) = e^(−2.4) × 2.4^2 / 2! = 0.261
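The Poisson probabilities in Example 3 follow directly from the formula. The sketch below uses a helper named `poisson_pmf`, introduced here for illustration, with λ = 2.4 as in the example.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected count is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 2.4  # expected number of individuals per 10,000 population per year

probabilities = [poisson_pmf(k, lam) for k in range(3)]
for k, prob in enumerate(probabilities):
    print(k, round(prob, 3))
```

Summing `poisson_pmf(k, lam)` over a long enough range of k gives a total close to 1, as required of any probability distribution.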
II. Probability distribution of Numeric variables

1. Probability distribution of a discrete variable

• Let X be a discrete random variable, such as the
number of new AIDS cases reported during a
one-year period or the number of children in a
family.

• To construct the probability distribution for
X, we list each of the values x the variable
assumes and its associated probability
(relative frequency).
Characteristics of a distribution
• Features commonly used to describe a distribution are
location, dispersion, modality and skewness.
• Location tells us something about the average
value of the variable.
• Dispersion tells us something about how
spread out the values of the variable are.
• Modality refers to the number of peaks in the
distribution.
• Skewness refers to whether or not the
distribution is symmetric.
• A distribution is said to be symmetric if its left and
right sides are mirror images about its centre.
2. Probability distribution of continuous variables
• Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
• E.g. suppose X represents the continuous
variable ‘Height’; rarely is an individual
exactly 170 cm tall.
• X can assume an infinite number of
intermediate values: 170.1, 170.2, 170.3, etc.
• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is equal
to zero.
Continuous Random Variables
• A smooth curve describes the probability distribution of a
continuous random variable.

• The depth or density of the probability, which varies with x,
may be described by a mathematical formula f(x), called
the probability distribution or probability density function for
the random variable x.
Properties of Continuous Probability Distributions
• The area under the curve is equal to 1.
• P(a ≤ x ≤ b) = area under the curve between a and b.
• There is no probability attached to any
single value of x. That is, P(x = a) = 0.
Continuous Probability Distributions

• There are many different types of continuous
random variables.
• We try to pick a model that
• Fits the data well
• Allows us to make the best possible
inferences using the data.
• One important continuous random variable is the
normal random variable.
The Normal (Gaussian) Distribution
• The normal distribution is used extensively in the analysis of
continuous variables and has an especially important role in
statistics.
• It has been found to be a good approximation for many
distributions that arise in practice.
• The normal distribution is unimodal and symmetric.
• The normal distribution is completely described by two
parameters, referred to as the mean μ (read as ‘mu’) and the standard
deviation σ (read as ‘sigma’).
• The mean μ can be any number (negative, positive or zero).
• The standard deviation σ must be a positive number.
• The mean μ defines the location of the distribution and the SD
(standard deviation) σ defines the dispersion of the distribution
about the mean.
The Normal Distribution

• The formula that generates the
normal probability distribution is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \quad \text{for } -\infty < x < \infty$$

e ≈ 2.7183, π ≈ 3.1416
μ and σ are the population mean and standard deviation.

• The shape and location of the normal curve change
as the mean and standard deviation change.
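The density formula can be evaluated directly to see how the curve responds to its parameters. In this sketch `normal_pdf` is a helper written from the formula above, checked against `statistics.NormalDist` from the Python standard library (3.8+); the parameter pairs are arbitrary illustrations.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z**2) / (sigma * math.sqrt(2 * math.pi))

# Height of the curve at its centre for a few (mu, sigma) pairs.
peaks = {}
for mu, sigma in [(0, 1), (0, 2), (1, 1)]:
    peak = normal_pdf(mu, mu, sigma)
    # Agrees with the standard-library implementation.
    assert math.isclose(peak, NormalDist(mu, sigma).pdf(mu))
    peaks[(mu, sigma)] = peak
    print(mu, sigma, round(peak, 4))
```

Changing μ moves the peak without altering its height, while a larger σ lowers and widens the curve.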

How the Normal curve shifts when parameters change

[Figure: the same normal curve drawn against three horizontal scales: X, X − μ and (X − μ)/σ]

[Figure: same location (μ) but different σ (SD): curves for σ = 1, σ = 2 and σ = 3]

[Figure: same σ but different location (mean): curves for μ = 0, μ = 1 and μ = 2]

Biostatistics course by Girma Taye (PhD), AAU
The standard normal distribution
• Since a normal distribution can have any of an infinite number of possible values for its
mean and SD, it is impossible to tabulate the areas for each and every
normal curve.

• Instead, only a single curve, for which μ = 0 and σ = 1, is tabulated.

• This curve is called the standard normal distribution (SND).
The Standard Normal Distribution

• To find P(a < x < b), we need to find the area under the
appropriate normal curve.
• To simplify the tabulation of these areas, we
standardize each value of x by expressing it as a z-
score, the number of standard deviations σ it lies from
the mean μ.

$$z = \frac{x - \mu}{\sigma}$$
The Standard Normal (z) Distribution

• Mean = 0; Standard deviation = 1
• When x = μ, z = 0
• Symmetric about z = 0
• Values of z to the left of center are negative
• Values of z to the right of center are positive
• Total area under the curve is 1.
Using normal table
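The four-digit probability in a particular row and column
of Table 1 gives the area under the z curve to the left
of that particular value of z. (Table 1 as reproduced at the end
of these slides lists the area between 0 and z; add 0.5 for a
positive z, or subtract the tabulated area from 0.5 for a
negative z, to obtain the area to the left.)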
Example
Use Table 1 to calculate these probabilities:

[Figure: area under the z curve to the left of z = 1.36]

P(z ≤ 1.36) = .9131

P(z > 1.36) = 1 − .9131 = .0869

P(−1.20 ≤ z ≤ 1.36) = .9131 − .1151 = .7980
Example

The weights of packages of ground beef are
normally distributed with mean 1 pound and
standard deviation .10 pound. What is the probability
that a randomly selected package weighs between
0.80 and 0.85 pounds?

P(.80 ≤ x ≤ .85) = P(−2 ≤ z ≤ −1.5) = .0668 − .0228 = .0440
Example
What is the weight of a package
such that only 1% of all packages
exceed this weight?

P(x > ?) = .01
P(z > (? − 1)/.1) = .01
From Table 1, (? − 1)/.1 = 2.33
? = 2.33(.1) + 1 = 1.233
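The table look-ups in the preceding examples can be cross-checked with `statistics.NormalDist` from the Python standard library (3.8+). This is only a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# z-table examples
p_below = z.cdf(1.36)                   # P(z <= 1.36)
p_above = 1 - z.cdf(1.36)               # P(z > 1.36)
p_between = z.cdf(1.36) - z.cdf(-1.20)  # P(-1.20 <= z <= 1.36)

# Ground beef packages: mean 1 pound, SD 0.10 pound
beef = NormalDist(mu=1.0, sigma=0.10)
p_light = beef.cdf(0.85) - beef.cdf(0.80)  # P(.80 <= x <= .85)
cutoff = beef.inv_cdf(0.99)                # weight exceeded by only 1% of packages

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
print(round(p_light, 4), round(cutoff, 3))
```

Small differences from the slide values (for example a cutoff near 1.233) come from the table rounding z to two decimals.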
Approximating the Binomial
Make sure to include the entire rectangle for the
values of x in the interval of interest. This is called
the continuity correction.
Standardize the values of x using

$$z = \frac{x - np}{\sqrt{npq}}$$

Make sure that np and nq are both greater than
5 to avoid inaccurate approximations!
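To see the continuity correction at work, the sketch below compares an exact binomial probability with its normal approximation. The values n = 30, p = 0.4 and k = 10 are hypothetical and chosen only so that np and nq both exceed 5.

```python
import math
from statistics import NormalDist

n, p = 30, 0.4   # hypothetical values: np = 12 and nq = 18
q = 1 - p
k = 10           # we want P(X <= 10)

# Exact binomial probability: sum of P(X = i) for i = 0..k.
exact = sum(math.comb(n, i) * p**i * q ** (n - i) for i in range(k + 1))

# Normal approximation: include the whole rectangle for x = 10,
# i.e. go up to 10.5 before standardizing.
z = (k + 0.5 - n * p) / math.sqrt(n * p * q)
approx = NormalDist().cdf(z)

print(round(exact, 4), round(approx, 4))
```

Dropping the 0.5 in the numerator would leave out half of the rectangle for x = 10 and typically gives a noticeably worse approximation.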
Exercise
Data collected on systolic blood pressure in normal
healthy individuals are normally distributed with μ = 120
and σ = 10 mm Hg.
1) What proportion of normal healthy individuals have a
systolic blood pressure above 130 mm Hg?
2) What proportion of normal healthy individuals have a
systolic blood pressure between 100 and 140 mm Hg?
3) What level of systolic blood pressure cuts off the lower
95% of normal healthy individuals?

[Figure: normal curve with the horizontal axis marked at μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ, μ+3σ]

Fig.3. Percentage of area under a normal distribution with mean μ and
standard deviation σ

For any normal distribution,
• about 68% (most) of the probability is contained within one SD of
the mean;
• about 95% (the great majority) of the probability is contained within two SDs;
• about 99.7% (almost all) is contained within three SDs of the mean.
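These percentages can be confirmed numerically. The short sketch below uses `statistics.NormalDist` (Python 3.8+); because the rule holds for any normal distribution, the standard normal is enough.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

# Probability within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(k, round(prob, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```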

Exercises

• Find the probability of the following under the SND:
• Above 1.96?
• Below –1.96?
• Between –1.28 and 1.28?
• Between –1.65 and 1.08? (Answer: 0.8104)
• What level cuts the upper 25%?
• What level cuts the middle 99%?
Table 1: Normal distribution

Area between 0 and z

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857

2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
Table 2: Student’s t-distribution

t table with right tail probabilities

df\p 0.40 0.25 0.10 0.05 0.025 0.01 0.005 0.0005

1 0.324920 1.000000 3.077684 6.313752 12.70620 31.82052 63.65674 636.6192

2 0.288675 0.816497 1.885618 2.919986 4.30265 6.96456 9.92484 31.5991

3 0.276671 0.764892 1.637744 2.353363 3.18245 4.54070 5.84091 12.9240

4 0.270722 0.740697 1.533206 2.131847 2.77645 3.74695 4.60409 8.6103

5 0.267181 0.726687 1.475884 2.015048 2.57058 3.36493 4.03214 6.8688

6 0.264835 0.717558 1.439756 1.943180 2.44691 3.14267 3.70743 5.9588

7 0.263167 0.711142 1.414924 1.894579 2.36462 2.99795 3.49948 5.4079

8 0.261921 0.706387 1.396815 1.859548 2.30600 2.89646 3.35539 5.0413

9 0.260955 0.702722 1.383029 1.833113 2.26216 2.82144 3.24984 4.7809

10 0.260185 0.699812 1.372184 1.812461 2.22814 2.76377 3.16927 4.5869

Thank you!
