
Biostastics

This document provides an overview of biostatistics and key concepts. It defines biostatistics as applying statistical methods to biology and health sciences. It discusses types of variables like categorical and quantitative, and scales of measurement like nominal and ordinal. Population and sample are also summarized, noting that statistics are used to make inferences about populations based on representative samples.

Uploaded by

Daniel kebede
Copyright
© © All Rights Reserved

Biostatistics Course

By: Akalewold Alemayehu


(Assistant Professor of Epidemiology)

School of Public Health, Hawassa University

1
What is Biostatistics?
• Statistics: a field of study concerned with methods and procedures for:
  – the collection, organization, analysis, summarization and interpretation of numerical data, and
  – making scientific inferences about a body of data when only a small part of the data is observed.

• It helps us use numbers to communicate ideas.

2
• The tools of statistics are employed in many fields, such as business, education, psychology, agriculture and economics.

• When statistical methods are applied to biology and the health sciences, we use the term biostatistics.

3
• Provides methods of organizing information

• Assessment of health status

• Resource allocation (planning)

• Magnitude of association between exposure and outcome
  – Strong vs. weak association between exposure and outcome

4
• Assessment of risk factors
  – Cause-and-effect relationships

• Evaluation of a new vaccine or drug
  – How effective is the vaccine (drug)?
  – Is the effect due to chance or some bias?

• Drawing of inferences
  – Information from sample to population

• Essential for understanding, appraising and critiquing the scientific literature
5
Branches of Statistics

1. Descriptive statistics: concerned with the organization, presentation and summarization of data.
• Helps to identify the general features and trends in a set of data and to extract useful information
• Also very important in communicating the final results of a study

6
• Ways of data presentation include:
  – Tables
  – Graphs (bar chart, pie chart, histogram, scatter plot, etc.)

• Ways of data summarization include numerical summary measures:
  – Measures of central tendency (location)
  – Measures of variability (dispersion)

7
2. Inferential statistics: deals with techniques for drawing conclusions about a population based on information obtained from a sample drawn from that population.

• Inferences are drawn from particular properties of the sample to particular properties of the population.
• Inferential statistics builds upon descriptive statistics.
• Includes estimation, hypothesis testing, determining relationships, making predictions, etc.

8
Data, Population, Sample, Parameter, Statistic, Variables
Data: numbers obtained by taking measurements, by counting or by observation.
• Numerical descriptions of things
• The raw material for statistics

9
Population and sample
Population: any well-defined group of subjects or objects that share common characteristics.
• A group of people, institutions or items that have something in common and about which we wish to draw conclusions at a particular time.
• E.g., all TB patients in Ethiopia; all hospitals in Hawassa
• A population is generally large, so it is difficult to study every member.

10
Population and sample…
Sample:
• A small group or subset of a population about which information is actually obtained.
• Samples are used to describe and make inferences about the populations from which they arise.
• Statistical methods are based on these samples.
• A sample should be selected by a suitable method (e.g. random sampling) so that it is representative of the population.

11
12
Parameter and statistic
Parameter:
• A numerical descriptive measure derived from the data of a population.
• Parameters exist, but their specific values are usually unknown.

Statistic: a descriptive measure computed from the data of a sample.

13
Parameter and statistic….

• Since the population is usually large and is not actually observed, the parameters are considered unknown constants.
• Statistical inferential methods are used to make inferences (statements) about the unknown parameters, based on sample data.

14
Variable
• Variable: a characteristic that takes different values in different persons, places or things.

• Any aspect of an individual or object that is measured (e.g. BP) or recorded (e.g. age, sex) and can take different values.

• There may be one variable in a study or many.

• E.g. a study of treatment outcome of TB:
  – sex, weight (kg), smear result (positive, negative or uncertain), culture result (negative, positive), cured after 6 months (yes/no).

15
Variables can be broadly classified into:

– Categorical (or Qualitative) and

– Numerical variables(or Quantitative).

16
1. Categorical variable: A variable which can not be measured in
quantitative form but can only be sorted by name or categories

• Not able to be measured as we measure height or weight

• The notion of magnitude is absent or implicit.

17
Categorical variable is divided into two:

1. Nominal:

• The simplest type of variable, in which the values fall into unordered categories or classes

• Uses names, labels or symbols to classify each measurement.

– Examples: Blood type, sex, race, marital status

18
2. Ordinal:

• Assigns each measurement to one of a limited number of categories that are ranked in order.

• Although non-numerical, the categories can be considered to have a natural ordering

– Examples: Patient status, cancer stages

19
2. Quantitative variable: A variable that can be measured or
counted and expressed numerically.

• Height, weight, # of children, etc.

• Has the notion of magnitude.

20
Quantitative variable is divided into two:

1. Discrete: can take only a limited number of distinct values (whole numbers).

– E.g. the number of episodes of diarrhoea a child has had in a year. You can't have 12.5 episodes of diarrhoea.

• Characterized by gaps or interruptions in the values.

• Both the order and magnitude of the values matter.

• The values are not just labels, but are actual measurable quantities.

21
2. Continuous variable:

• Can take an infinite number of possible values in any given interval.

• Both the magnitude and the order of the values matter.

• Does not possess the gaps or interruptions

• E.g. Weight, Height, etc.

22
SUMMARY

Types of variables:
• Qualitative (categorical)
  – Nominal, e.g. ethnic group
  – Ordinal, e.g. response to treatment
• Quantitative (numerical)
  – Discrete, e.g. number of admissions
  – Continuous, e.g. height

Measurement scales
23
Scales of measurement

• Not all measurements are the same.

• Measuring weight, e.g. 40 kg

• Measuring the status of a patient on a scale, e.g. “improved”, “stable”, “not improved”.

• There are four types of scales of measurement.

24
1. Nominal scale:

• The simplest type of scale of measurement, in which the values fall into unordered categories or classes

• Uses names, labels or symbols to classify each measurement.

– Examples: Blood type, sex, race, marital status

25
Example of nominal scale:
Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only

26
• If nominal data take only two possible values, they are
called dichotomous or binary.

• E.g. sex is dichotomous (male or female).

• Yes/no questions

– E.g., Is the patient cured from TB at 6 months of Rx?

27
2. Ordinal scale:

• Assigns each measurement to one of a limited number of categories that are ranked in order.

• Although non-numerical, the categories can be considered to have a natural ordering

– Examples: Patient status, cancer stages

28
Example of ordinal scale:
• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning: 4 > 3 > 2 > 1 is all we know, apart from their utility as labels

29
3. Interval scale:
- Measured on a continuum, and differences between any two numbers on the scale are of known size.
Example: temperature in °F on 4 consecutive days
Days:      A   B   C   D
Temp. °F:  50  55  60  65
For these data, not only is day A with 50 °F cooler than day D with 65 °F, but it is 15 °F cooler.
- It has no true zero point. “0” is arbitrarily chosen and doesn't reflect the absence of the attribute.

30
4. Ratio scale:

- It is the highest scale of measurement.

- Measurement begins at a true zero point and the scale has equal intervals.

- Examples: Height, weight, BP, etc.

31
• A measurement on a higher scale can be transformed into one on a lower scale, but not vice versa.

E.g., weight of a child = 3000 g (ratio scale)

Weight of a child = underweight, normal, overweight (ordinal scale)

Weight of a child = normal, not normal (nominal scale)

32
33
[Figure: degree of precision in measuring increases from the nominal scale through the ordinal and interval scales to the ratio scale.]
Dependent vs. Independent Variable

Dependent: the variable(s) we measure as the outcome of interest, or response variable.
Independent: the variable(s) that explain the dependent variable, or explanatory/predictor variable.
E.g. parasitic infections (independent) → anemia (dependent)

34
Class Exercise
I. Classify the variables below as quantitative or qualitative, and write in brackets whether each is nominal, ordinal, discrete or continuous.

A. Number of female students in your class
B. Marital status: 1=married, 2=single, 3=widowed, 4=divorced
C. Prognosis: 1=very good, 2=good, 3=fair, 4=bad, 5=very bad
D. First temperature following admission (°F)
E. Received PO medications: 1=yes, 2=no
F. Service delivered: 1=medication, 2=surgery
G. Weight of infant at birth (g)
H. Type of disease: 1=chronic, 2=acute

35
Source/ Type of data

1. Primary data:
– Collected by the investigator or under his/her close
supervision for the purpose of specific inquiry or
study.
– Original in character and are mostly generated by
individual or research institutes.
– the investigator is aware of any limitations the data
may contain since he/she knows under what
conditions the data are collected
– Considered to be more reliable and relatively
accurate.

36
Source/Type of data…
2. Secondary data:

– Use of data already collected for other purposes.

– Less reliable and less valid. Therefore, these data should be


used carefully.

– Example: -Routinely collected and kept data at health


facilities.

37
Secondary data…

Advantage:
• Data collection is inexpensive.
• Less time consuming

Disadvantages:
• It is sometimes difficult to gain access to the records or
reports required,
• The data may not always be complete and precise enough, or
too disorganized.

38
Reading assignment
1. Data Collection Techniques
2. Types of questions
 Close ended questions
 Open ended questions
3. Questionnaire forms
 Structured
 Semi-structured
 Unstructured

4. Questionnaire designing
39
Descriptive Statistics

• Involves techniques used to organize, summarize and present a set of data.

• Numbers that have not been summarized and organized are called raw data.

• Before interpretation and communication of the findings, the raw data must be organized, summarized and presented in a clear and understandable way.

40
Methods of Data Organization and
Presentation

41
A. Describing categorical variables
• Table of frequency distributions

– Frequency

– Relative frequency

– Cumulative frequencies

– Relative cumulative frequency

• Charts

– Bar charts

– Pie charts
42
Frequency distributions
• Simple and effective way of summarizing categorical data
• The actual summarization and organization of data starts from
frequency distribution
• Done by counting the number of observations falling into each of
the categories or levels of the variables.

E.g. birth weight with levels ‘Very low’, ‘Low’, ‘Normal’ and ‘Big’.

• The frequency distribution for newborns is obtained simply by counting the number of newborns in each birth weight category.

43
Relative Frequency
• It is the proportion or percentage of observations in each category of a variable.

• The distribution of proportions is called the relative frequency distribution of the variable.

• Given the total number of observations, the relative frequency distribution is easily derived from the frequency distribution.

• Conversion in the opposite direction is also possible, but it is often inaccurate because of rounding.

44
Cumulative frequency
• It is the number of observations in a category of a variable plus the observations in all lower-ranked categories.

Cumulative relative frequency

• It is the proportion of observations in a category plus the proportions in all lower-ranked categories.

• It is obtained by dividing the cumulative frequency by the total number of observations.

45
Table 1. Distribution of birth weight of newborns between Sept-Oct, 2020 at ‘X’ Hospital.

BWT        Freq.   Cum. freq.   Rel. freq.   Cum. rel. freq.
Very low     25        25          0.1           0.1
Low          50        75          0.2           0.3
Normal      150       225          0.6           0.9
Big          25       250          0.1           1.0
Total       250                    1.0
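The columns of Table 1 can be reproduced from the raw counts alone. The following is a minimal sketch in Python (the variable names are illustrative, not part of the course material); it assumes the categories are listed in their ranked order.

```python
from itertools import accumulate

# Counts per birth-weight category, copied from Table 1 (in ranked order).
counts = {"Very low": 25, "Low": 50, "Normal": 150, "Big": 25}

total = sum(counts.values())
cum_freq = list(accumulate(counts.values()))      # running totals
rel_freq = [f / total for f in counts.values()]   # proportions
cum_rel_freq = [c / total for c in cum_freq]      # cumulative proportions

for (category, f), cf, rf, crf in zip(counts.items(), cum_freq, rel_freq, cum_rel_freq):
    print(f"{category:<10}{f:>6}{cf:>6}{rf:>8.2f}{crf:>8.2f}")
```

The last cumulative relative frequency should always equal 1, which is a quick check that no category was missed.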

46
B) Describing Quantitative variable:

• Table of frequency distributions

– Frequency

– Relative frequency

– Cumulative frequencies

- Select a set of contiguous, non-overlapping intervals such that each value can be placed in one and only one of the intervals.

- The first consideration is how many intervals to include

47
To determine the number of class intervals and the corresponding
width, we may use:

Sturges' rule:

$$K = 1 + 3.322\,\log_{10} n, \qquad W = \frac{L - S}{K}$$

where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value

48
Example:
Leisure time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10
19 27 29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27
15 21 25 16

K = 1 + 3.322 (log40) = 6.32 ≈ 6

Maximum value = 38, Minimum value = 10

Width = (38-10)/6 = 4.67 ≈ 5
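The same calculation can be sketched in a few lines of Python. The function name `sturges` is hypothetical; it assumes a base-10 logarithm and rounds K to the nearest whole number before computing the width, as the example does.

```python
import math

def sturges(n, largest, smallest):
    """Return (K unrounded, K rounded, class width) using Sturges' rule."""
    k = 1 + 3.322 * math.log10(n)
    k_rounded = round(k)
    width = (largest - smallest) / k_rounded
    return k, k_rounded, width

k, k_rounded, width = sturges(n=40, largest=38, smallest=10)
# The width is rounded up so that the classes cover the whole range.
print(round(k, 2), k_rounded, round(width, 2), math.ceil(width))
```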


• Ordered array: is a simple arrangement of individual
observations in the order of magnitude.

49
Time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14              5            0.125                  0.125
15-19             11            0.275                  0.400
20-24             12            0.300                  0.700
25-29              7            0.175                  0.875
30-34              3            0.075                  0.950
35-39              2            0.050                  1.000
Total             40            1.000
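As a check, the grouped table can be rebuilt from the 40 raw leisure-time values. This is a minimal sketch, assuming classes of width 5 starting at 10 (the class limits used above); the variable names are illustrative.

```python
from collections import Counter

hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 10,
         19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27,
         15, 21, 25, 16]

start, width = 10, 5  # first lower class limit and class width

# Map each value to the lower limit of its class, then count per class.
freq = Counter(start + ((h - start) // width) * width for h in hours)

n = len(hours)
cumulative = 0
table = []
for lower in sorted(freq):
    cumulative += freq[lower]
    label = f"{lower}-{lower + width - 1}"
    table.append((label, freq[lower], freq[lower] / n, cumulative / n))

for row in table:
    print("{:<8}{:>4}{:>8.3f}{:>8.3f}".format(*row))
```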

50
• Class Limit: The range for each class
– Upper class limit
– Lower class limit

• Mid-point (class mark): the value that lies midway between the lower and the upper limits of a class.
• Class boundary (true limits): the limits that make an interval of a continuous variable continuous in both directions

– Upper class boundary

– Lower class boundary

• Subtract 0.5 from the lower and add it to the upper class limit

51
Time (hours)   True limits (class boundaries)   Mid-point   Frequency
10-14               9.5 – 14.5                     12           5
15-19              14.5 – 19.5                     17          11
20-24              19.5 – 24.5                     22          12
25-29              24.5 – 29.5                     27           7
30-34              29.5 – 34.5                     32           3
35-39              34.5 – 39.5                     37           2
Total                                                          40
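The boundaries and mid-points above follow mechanically from the class limits. A small sketch, assuming whole-number class limits so that the gap between adjacent classes is 1 (hence the 0.5 adjustment); `class_boundaries` is an illustrative name.

```python
def class_boundaries(lower_limit, upper_limit, gap=1):
    """Return (lower boundary, upper boundary, mid-point) for one class."""
    half = gap / 2  # half the gap between adjacent class limits
    return lower_limit - half, upper_limit + half, (lower_limit + upper_limit) / 2

limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
rows = [class_boundaries(lo, hi) for lo, hi in limits]

for (lo, hi), (lb, ub, mid) in zip(limits, rows):
    print(f"{lo}-{hi}: boundaries {lb} to {ub}, mid-point {mid}")
```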

52
Types of tables

53
Types of table cont.…..

54
55
Guidelines for constructing tables
• Keep them simple

• Limit the number of variables to be included.

• All tables should be self-explanatory

• Include clear title telling what, when and where

• Clearly label the rows and columns

• State clearly the unit of measurement used

• Explain codes and abbreviations in the foot-note

• Show totals

• If data is not original, indicate the source in foot-note.


56
Diagrammatic (Pictorial) representations of Statistical data

Importance of diagrammatic representation

1. Diagrams have greater attraction than mere figures.

2. They give quick overall impression of the data.

3. They are easier to remember than mere figures.

4. They facilitate comparison

5. Used to understand patterns and trends

57
Specific types of graphs include:
• For nominal, ordinal and discrete data:
  – Bar graph
  – Pie chart
• For quantitative continuous data:
  – Stem-and-leaf plot
  – Histogram
  – Frequency polygon
  – Cumulative frequency polygon (ogive curve)
  – Line graph
  – Box plot
  – Scatter plot

58
1. Bar charts (Graphs)
• Categories are listed on the horizontal axis (X-axis)

• Frequencies or relative frequencies are represented on the Y-axis (ordinate)

• The height of each bar is proportional to the frequency or relative frequency of observations in that category

• There are different types of bar graphs, the most important ones
are:

59
A. Simple bar chart: a one-dimensional chart in which each bar represents the whole of the magnitude (only one variable).

[Figure: bar chart. X-axis: immunization status (not immunized, partially immunized, fully immunized); Y-axis: number of children.]

Fig. 1. Immunization status of children in Adami Tulu Woreda, Feb. 2020.
60
B. Multiple bar chart: the component figures are shown as separate bars bordering each other. It depicts the distributional pattern of more than one variable.

[Figure: multiple bar chart. X-axis: marital status (married, single, divorced, widowed); Y-axis: number of women; bars: immunized, not immunized.]

Fig. 2 TT immunization status by marital status of women 15-49 years, Asendabo town, 2020
61
C. Sub-divided bar chart: bars are sub-divided into the component parts of the figure. These graphs are constructed when each total is built up from two or more component figures.

[Figure: sub-divided bar chart. X-axis: marital status (married, single, divorced, widowed); Y-axis: number of women; segments: immunized, not immunized.]

Fig. 3 TT immunization status by marital status of women 15-49 years, Asendabo town, 1996
62
Subdivided bar chart cont.…..

63
Method of constructing bar chart

• All the bars must have equal width

• The bars are not joined together

• The different bars should be separated by equal distances

• All the bars should rest on the same line called the base

• Label both axes clearly


64
2. Pie chart

• Shows the relative frequency for each category by dividing a circle into sectors

• The angles are proportional to the relative frequency.

• Used for a single categorical variable

• Use percentage distributions

65
Steps to construct a pie-chart
• Construct a frequency table

• Change the frequency into percentage (P)

• Change the percentages into degrees, where: degrees = (percentage / 100) × 360°

• Draw a circle and divide it accordingly

66
Example: Distribution of deaths for females, in England and Wales, 1989.

Cause of death          No. of deaths
Circulatory system        100 000
Neoplasm                   70 000
Respiratory system         30 000
Injury and poisoning        6 000
Digestive system           10 000
Others                     20 000
Total                     236 000
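The steps for constructing a pie chart can be sketched as a short calculation on this table: each count is converted to a percentage of the total, and each percentage to an angle out of 360°. The variable names below are illustrative only.

```python
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
sectors = {}
for cause, count in deaths.items():
    percent = 100 * count / total      # share of all deaths
    degrees = percent / 100 * 360      # angle of the sector
    sectors[cause] = (percent, degrees)
    print(f"{cause:<22}{percent:6.1f}%{degrees:8.1f} deg")
```

The sector angles should add up to 360°, which is a useful check before drawing the circle.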

67
Distribution of cause of death for females, in England and Wales, 1989

[Pie chart sectors: Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%, Digestive system 4%, Injury and poisoning 3%]

68
3. Histogram
• Histograms are frequency distributions with continuous class intervals that have been turned into graphs.

• Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram.

• Constructed by choosing a set of non-overlapping class intervals and counting the number of observations that fall in each class.

69
• It is necessary that the class intervals be non-overlapping so that each observation falls in one and only one interval.

• Bars are drawn over the intervals

• The area of each bar is proportional to the frequency of observations in the interval

70
Example: Distribution of the age of women at the time of marriage

Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number        11      36      28      13       7       3       2

[Figure: histogram "Age of women at the time of marriage". X-axis: age group, with class boundaries from 14.5-19.5 to 44.5-49.5; Y-axis: number of women.]
71
4. Frequency polygon

• Instead of drawing bars for each class interval, sometimes a single point is drawn at the mid-point of each class interval and consecutive points are joined by straight lines.

• Graphs drawn in this way are called frequency polygons

• Frequency polygons are superior to histograms for comparing two or more sets of data.

72
[Figure: frequency polygon "Age of women at the time of marriage". X-axis: age (class mid-points from 12 to 47); Y-axis: number of women.]

73

74
Frequency polygon of birth weight of 9975 newborns for males and females

[Figure: two frequency polygons, one per sex (males, females). X-axis: birth weight, 500 to 5000; Y-axis: percentage.]

75
5. Ogive Curve (Cumulative Frequency Polygon)
• Used to know the number of items whose values are more or less than a
certain amount.

• E.g. to know the no. of patients whose weight is <50 or >60 Kg.

• To get this information it is necessary to change the form of the frequency distribution from a ‘simple’ to a ‘cumulative’ distribution.

• An ogive curve turns a cumulative frequency distribution into a graph.

• Are much more common than frequency polygons


76
Example: time spent on leisure activities

[Figure: ogive. X-axis: upper class boundary (4.5 to 39.5); Y-axis: cumulative frequency.]

Fig 4: Cumulative frequency curve for amount of time college students devoted to leisure activities
77
6. Line graph

• Useful for assessing the trend of a particular situation over time.

• Helps in monitoring the trend of epidemics.

• Time, in weeks, months or years, is marked along the horizontal axis, and values of the quantity being studied are marked on the vertical axis.

• Values for each category are connected by a continuous line.

• Sometimes two or more lines are drawn on the same graph, using the same scale, so that the plotted lines are comparable.
78
Example:

[Figure: line graph. X-axis: year (1967 to 1979); Y-axis: rate (%).]

Fig 5: Malaria Parasite Prevalence Rates in Ethiopia, 1967 – 1979 Eth. C.

79
80
Stem-and-Leaf Plot
• A quick way to organize data that gives a visual impression similar to a histogram while retaining much more detail on the data.
• Serves the same purpose as a histogram and reveals the presence or absence of symmetry
• Most effective with relatively small data sets
• Not suitable for reports and other communications, but helps researchers to understand the nature of their data

81
Example

• 43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41

2 | 2 8 9
3 | 4 6
4 | 1 3 7 9
5 | 1
6 | 1 6
7 | 2 7
8 | 2

82
Steps to construct Stem-and-Leaf Plots

1. Separate each data point into stem and leaf components
• Stem = one or more of the initial digits of the measurement
• Leaf = the rightmost digit
For example, the stem of the number 483 is 48 and the leaf is 3.
2. Write the smallest stem in the data set in the upper left-hand corner of the plot

83
Steps to construct Stem-and-Leaf Plots

3. Write the second stem (first stem +1) below the first
stem
4. Continue with the remaining stems until you reach
the largest stem in the data set
5. Draw a vertical bar to the right of the column of
stems
6. For each number in the data set, find the appropriate
stem and write the leaf to the right of the vertical
bar
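The construction steps can be sketched in Python as follows. This is a minimal illustration, assuming non-negative whole numbers so that the leaf is the last digit and the stem is everything before it; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # List every stem from the smallest to the largest so that gaps stay visible.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

data = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]
plot = stem_and_leaf(data)
for stem, leaf_list in plot.items():
    print(f"{stem} | {' '.join(map(str, leaf_list))}")
```

Run on the 15 values of the earlier example, this should print the same seven rows, from stem 2 to stem 8.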

84
Scatter Plots
• The most useful graphical tool for displaying the relationship between two quantitative variables is a two-way scatter plot.
• Scatter plots present data on the x- and y-axes and are used to investigate an association between two variables.
• A point represents each individual or object, and an association between two variables can be studied by analyzing patterns across multiple points.
• A regression line may be added to the graph to help judge whether, and how well, a straight line describes the association between the two variables.

85
[Figure: two-way scatter plot of annual salary vs. years of education.]

86
Box-and-Whisker Plots
• A useful visual device for communicating the information contained in a data set.
• The construction of a box-and-whisker plot makes use of the quartiles.
• Examination of a box-and-whisker plot for a set of data reveals information regarding the amount of spread, location of concentration, and symmetry of the data.

87
Box plots

88
Any question?

89
Numerical summary measures

• A single number that quantifies a characteristic of a distribution of values.

Measures of central tendency (location)

Measures of dispersion (variability)

90
A. Measures of Central location
• The objective of calculating a measure of central tendency (MCT) is to determine a single value that may be used to represent the whole data set.

• These measures summarize, in a single number, the point around which the data tend to cluster. Such statistics are called measures of location or measures of central tendency.

• The common ones are the mean, median and mode.

Mean

• The sum of the observations divided by the number of observations.
91
Example
19 21 20 20 34 22 24 27 27 27
• Then, mean = (19 + 21 + … + 27)/10 = 24.1
• General formula
a) Ungrouped data

If $x_1, x_2, \ldots, x_n$ are n observed values, then

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
92
b) Grouped data
• We assume that all values falling into a particular class interval are located at the mid-point of the interval. The mean is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

• where,

k = the number of class intervals

m_i = the mid-point of the ith class interval

f_i = the frequency of the ith class interval


93
Example. Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (m_i)   Frequency (f_i)   m_i f_i
[10-19]              14.5               4              58.0
[20-29]              24.5              66            1617.0
[30-39]              34.5              47            1621.5
[40-49]              44.5              36            1602.0
[50-59]              54.5              12             654.0
[60-69]              64.5               4             258.0
Total                 __              169            5810.5

Mean = 5810.5/169 = 34.38 years
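Both mean calculations can be checked with a few lines of Python. This is a sketch under the slide's own assumption that every value in a class sits at the class mid-point; the variable names are illustrative.

```python
from statistics import mean

# Ungrouped data: the ten values from the earlier example.
ages = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
ungrouped_mean = mean(ages)

# Grouped data: ((lower limit, upper limit), frequency) per class interval.
classes = [((10, 19), 4), ((20, 29), 66), ((30, 39), 47),
           ((40, 49), 36), ((50, 59), 12), ((60, 69), 4)]

sum_mf = sum((lo + hi) / 2 * f for (lo, hi), f in classes)  # sum of m_i * f_i
n = sum(f for _, f in classes)                              # sum of f_i
grouped_mean = sum_mf / n

print(ungrouped_mean, sum_mf, n, round(grouped_mean, 2))
```

Evaluating 5810.5/169 gives approximately 34.38 years.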

94
95
Properties of the arithmetic mean
• For a given set of data there is one and only one arithmetic mean (uniqueness).

• It is easy to calculate and understand (simple).

• It is a poor measure of central location if the underlying distribution is not normal (not Gaussian), for example when it is skewed.

• It is influenced by each and every value in the data set and is hence affected by extreme values.

• In grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.

96
Median
• With the observations arranged in increasing or decreasing order, the median is defined as the middle observation.

a) Ungrouped data

• If the number of observations is odd, the median is the [(n+1)/2]th observation.

• If the number of observations is even, the median is the average of the two middle values, i.e. the (n/2)th and [(n/2)+1]th values.
Example: 19 20 20 21 22 24 27 27 27 34
• Then, the median = (22 + 24)/2 = 23
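The odd and even cases can be written as one small function. This is an illustrative sketch (Python's standard library also provides `statistics.median`, which follows the same rule).

```python
def median(values):
    """Middle observation of the ordered data (mean of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the [(n+1)/2]th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n/2)th and [(n/2)+1]th observations

print(median([19, 21, 20, 20, 34, 22, 24, 27, 27, 27]))  # → 23.0
```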

97
The median is a better measure of central tendency (than the mean)
when the distribution is skewed

98
b) Grouped data

• We assume that the values within a class interval are evenly distributed through the interval.

– The first step is to locate the class interval in which the median lies.

– Find n/2 and identify the first class interval whose cumulative frequency reaches or exceeds n/2.

99
Median for Grouped data…..
To find a unique median value, use the following formula:

$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

• where,
• L_m = lower true class boundary of the interval containing the median

• F_c = cumulative frequency of the interval immediately preceding the median class interval

• f_m = frequency of the interval containing the median

• W = class interval width

• n = total number of observations

100
Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (m_i)   Frequency (f_i)   Cum. freq.
[10-19]              14.5               4                4
[20-29]              24.5              66               70
[30-39]              34.5              47              117
[40-49]              44.5              36              153
[50-59]              54.5              12              165
[60-69]              64.5               4              169
Total                                 169

101
• n/2 = 84.5 falls in the 3rd class interval

• Lower class boundary = 29.5, upper class boundary = 39.5

• Frequency of the class = 47

• Fc = 70

• (n/2 – Fc) = 84.5 – 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33
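The grouped-median calculation can be sketched as a function that walks through the classes until the cumulative frequency reaches n/2 and then interpolates inside that class. The function name and argument layout are assumptions made for this illustration.

```python
def grouped_median(boundaries, freqs):
    """boundaries: (lower, upper) true class boundaries; freqs: class frequencies."""
    n = sum(freqs)
    cum = 0  # cumulative frequency of the classes before the current one (F_c)
    for (lower, upper), f in zip(boundaries, freqs):
        if cum + f >= n / 2:
            # L_m + ((n/2 - F_c) / f_m) * W
            return lower + (n / 2 - cum) / f * (upper - lower)
        cum += f

boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5),
              (39.5, 49.5), (49.5, 59.5), (59.5, 69.5)]
freqs = [4, 66, 47, 36, 12, 4]
print(round(grouped_median(boundaries, freqs), 2))
```

For the age data this evaluates 29.5 + (14.5/47) × 10, about 32.59.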

102
Properties of median

• There is only one median for a given set of data (uniqueness)

• The median is easy to calculate

• The median is a positional average and hence is not sensitive to very large or very small values.

• The median is a better measure of central tendency (than the mean) when the distribution is skewed (not normal)

• Can be calculated even in the case of open end intervals

103
Quartiles
• If the data are divided into four equal parts, we speak of quartiles.

• The median divides the data into two equal parts

a) The first quartile (Q1): 25% of all the ranked observations are less than Q1. [25th percentile]

b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. [50th percentile] The second quartile is the median.

c) The third quartile (Q3): 75% of all the ranked observations are less than Q3. [75th percentile]
104
Percentiles

• Percentiles divide the ordered data into 100 equal parts.

• Commonly used percentiles:
  – 10, 20, …, 90% (deciles)
  – 20, 40, …, 80% (quintiles)
  – 25, 50, 75% (quartiles)
  – 33.3, 66.7% (tertiles)

105
– P0: The minimum

– P25: 25% of the sample values are less than or equal to this value.
P25 means 1st Quartile or 25th percentile and given by:-
0.25(n+1)th observation

– P50: 50% of the sample are less than or equal to this value. 2nd
Quartile or 50th percentile and given by:-

0.5(n+1)th observation

– P75: 75% of the sample values are less than or equal to this value.
3rd Quartile or 75th percentile and given by:-

0.75(n+1)th observation
– P100: The maximum
106
Example: Birth weight in grams

2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248,
3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

• Find the 10th and 90th percentiles of the data set.

• 10th percentile = 0.1(20+1) = 2.1th value
  – the average of the 2nd and 3rd values = (2581+2759)/2 = 2670 g

• 90th percentile = 0.9(20+1) = 18.9th value
  – the average of the 18th and 19th values = (3609+3649)/2 = 3629 g
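The position rule used here, p(n+1), with the average of the two neighbouring observations when the position is not a whole number, can be sketched as below. The function name is hypothetical, and this follows the convention of this example only; statistical packages typically interpolate between the neighbours instead, so their results may differ slightly.

```python
import math

def percentile_by_position(ordered, p):
    """Percentile at position p*(n+1) (1-based) in already-sorted data, 0 < p < 1."""
    pos = p * (len(ordered) + 1)
    if pos < 1 or pos > len(ordered):
        raise ValueError("position falls outside the data")
    lower = math.floor(pos)
    if pos == lower:
        return ordered[lower - 1]                       # position is a whole number
    return (ordered[lower - 1] + ordered[lower]) / 2    # average the two neighbours

weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248,
           3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
p10 = percentile_by_position(weights, 0.10)
p90 = percentile_by_position(weights, 0.90)
print(p10, p90)
```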

107
Mode
• It is the value that occurs most often.

• Most distributions have one peak and are described as unimodal.

• E.g. 19 21 20 20 34 22 24 27 27 27
• The mode is 27, because the value 27 occurs three times (the most frequent).

• Some distributions have more than one mode

  – Unimodal: a distribution with one mode.

  – Bimodal: a distribution with two modes.

  – Trimodal: a distribution with three modes.
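Because a data set can have more than one mode, it helps to use a function that returns all of them. A short sketch using `statistics.multimode` from the Python standard library (available from Python 3.8); the second data set is made up purely to show the bimodal case.

```python
from statistics import multimode

ages = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
modes = multimode(ages)          # every value that shares the highest count
print(modes)                     # → [27]

# Hypothetical data with two equally frequent values (bimodal).
bimodal = multimode([1, 1, 2, 3, 3])
print(bimodal)                   # → [1, 3]
```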


108
Mode….

• The mode of grouped data usually refers to the modal class, the class with the highest frequency.

• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
109
Properties of mode

• It is not affected by extreme values

• Often its value is not unique (more than one mode is possible)

• Its main drawback is that it often does not exist (no value repeats), so it is not always a good summary of the data.

110
111
Descriptive statistics
Measures of dispersion

112
Measures of Dispersion……

Consider the following two sets of data:

A: 177, 193, 195, 209, 226   Mean = 200
B: 192, 197, 200, 202, 209   Mean = 200
• Two or more data sets may have the same mean and/or median and yet be quite different.
• Measures of central tendency (MCT) do not describe the variability or spread of the values.

113
Measures of Dispersion

• Measures that quantify the variation or dispersion of a set of data from its central location.

• Dispersion refers to the variety exhibited by the values of the data.

• The amount of dispersion is small when the values are close together.

• If all the values are the same, there is no dispersion.

114
1. Range (R)
• The difference between the largest and smallest observations in a
data set.

• Range = Maximum value – Minimum value

• Example –

– Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42-5 = 37

115
Properties of range

• It is the simplest, crude measure and can be easily understood

• It takes into account only two values, which makes it a poor
measure of dispersion

• Very sensitive to extreme observations

116
2. Inter-quartile range (IQR)
• Indicates the spread of the middle 50% of the observations, and is
used with the median

IQR = Q3 - Q1

Example: Suppose the first and third quartiles for the weights of girls
12 months of age are 8.8 kg and 10.2 kg, respectively.

IQR = 10.2 kg – 8.8 kg = 1.4 kg

i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.

117
Example 2
• Given the following data set (age of patients):-

18, 59, 24, 42, 21, 23, 24, 32

• Find the inter-quartile range

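• Solution (sorted data): 18 21 23 24 24 32 42 59

• 1st quartile = {(n+1)/4}th = (2.25)th value, taken as the average of the 2nd and 3rd values = (21 + 23)/2 = 22

• 3rd quartile = {3/4 (n+1)}th = (6.75)th value, taken as the average of the 6th and 7th values = (32 + 42)/2 = 37

• Hence, IQR = 37 - 22 = 15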
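The range and the inter-quartile range for this data set can be checked with a small Python sketch. The helper names are made up for this illustration, and fractional quartile positions are resolved by averaging the two neighbouring observations, as in the solution above.

```python
def value_at_position(sorted_values, position):
    """Observation at a 1-based position; fractional positions are averaged."""
    lower = int(position)
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    if position == lower:
        return sorted_values[lower - 1]
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2


def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the (n + 1)/4 and 3(n + 1)/4 positions."""
    data = sorted(values)
    n = len(data)
    q1 = value_at_position(data, (n + 1) / 4)
    q3 = value_at_position(data, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1, q3, iqr = quartiles_and_iqr(ages)
data_range = max(ages) - min(ages)   # range = maximum - minimum
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr, "Range =", data_range)
```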

118
Properties of IQR:

• It encloses the central 50% of the observations

• It is not based on all observations but only on two specific values

• It is important in selecting cut-off points in the formulation of
clinical standards.

• Since it excludes the lowest and highest 25% of values, it is not
affected by extreme values

• Less sensitive to the size of the sample

119
Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

125
Example. Compute the variance and SD of the age of 169 subjects from the
grouped data (mi = class midpoint, fi = class frequency).
Mean = Σ(mi·fi)/Σfi = 5810.5/169 = 34.38 years
S² = 20197.63/(169 − 1) = 120.22
SD = √S² = √120.22 = 10.96 years

Class      Midpoint  Frequency
interval   (mi)      (fi)       (mi − Mean)  (mi − Mean)²  (mi − Mean)²·fi
10-19      14.5        4        −19.88       395.21         1580.86
20-29      24.5       66         −9.88        97.61         6442.55
30-39      34.5       47          0.12         0.01            0.68
40-49      44.5       36         10.12       102.41         3686.92
50-59      54.5       12         20.12       404.81         4857.77
60-69      64.5        4         30.12       907.21         3628.86
Total                169                                   20197.63
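The grouped-data calculation can be reproduced with a few lines of Python. This is a minimal sketch: it takes each class midpoint as the average of the class limits and uses the n − 1 divisor of the sample variance; the variable names are chosen for this example only.

```python
import math

# (lower limit, upper limit, frequency) for each age class
age_classes = [(10, 19, 4), (20, 29, 66), (30, 39, 47),
               (40, 49, 36), (50, 59, 12), (60, 69, 4)]

midpoints = [(low + high) / 2 for low, high, _ in age_classes]
frequencies = [freq for _, _, freq in age_classes]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of squared deviations of the midpoints from the mean, weighted by frequency
sum_sq = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, frequencies))
grouped_variance = sum_sq / (n - 1)
grouped_sd = math.sqrt(grouped_variance)

print(f"n = {n}, mean = {grouped_mean:.2f}, "
      f"variance = {grouped_variance:.2f}, SD = {grouped_sd:.2f}")
```

Because the code keeps the mean unrounded, its intermediate values can differ in the last decimal place from a hand calculation that rounds the mean first.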

135
Properties of SD
• Has the advantage of being expressed in the same units of
measurement as the mean

• The best measure of dispersion and is used widely because of the
properties of the theoretical normal curve.

• However, if the units of measurement of the variables in two data sets
are not the same, their variability can't be compared by
comparing the values of the SD.
136
Coefficient of variation (CV)
• When two data sets have different units of measurement, the CV
should be used as a measure of dispersion.

• It is the best measure to compare the variability of two series or
sets of observations.

• Data with a smaller coefficient of variation are considered more
consistent.

137
CV is the ratio of the SD to the mean multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0

“Cholesterol is more variable than systolic blood pressure”
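The comparison can be verified with a tiny Python sketch. The function name is made up for this illustration; the SD and mean values are those given for systolic blood pressure and cholesterol.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; a unit-free measure of relative spread."""
    return sd / mean * 100


cv_sbp = coefficient_of_variation(15, 130)           # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)   # cholesterol

print(f"CV(SBP) = {cv_sbp:.1f}%, CV(cholesterol) = {cv_cholesterol:.1f}%")
```

Since the units cancel in the ratio, the two CVs can be compared directly even though the variables are measured on different scales.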

138
Skewed distributions

• Skewness: If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift towards
those scores.

• Based on the type of skewness, distributions can be:

A. Positively skewed distribution: Occurs when the majority of
scores are at the left end of the curve and a few extreme large
scores are scattered at the right end.

139
B. Negatively skewed distribution: Occurs when the majority of
scores are at the right end of the curve and a few extremely small
scores are scattered at the left end.

C. Symmetrical distribution: It is neither positively nor
negatively skewed.

A curve is symmetrical if one half of the curve is the mirror image
of the other half.

140
Mean, Median & Mode

141
Which measures to use?
• When the distribution is symmetric, summarize the data using means and
standard deviations.

• When the data are skewed, it is preferable to use the median and IQR as
summary statistics.

• Median and IQR are not easily influenced by extreme values in a skewed
distribution unlike means and standard deviations.

• Remark:
– The mean and median of a symmetric distribution coincide.

– When a distribution is skewed to the right, its mean is larger than its median.

– When a distribution is skewed to the left, its mean is smaller than its median (see fig. a-c).
142
Any question?

143
Probability and
Probability Distributions

144
Brainstorming
For a certain major operation, the probability of death is 1 in 20
individuals (0.05). If 19 consecutive individuals undergo the
procedure and all of them survive, and you are the 20th
individual, what will you decide?
Would you undergo the operation or not? Why?

145
Types of Events

161
The Variance of a Discrete Random Variable

• It quantifies the dispersion of the possible outcomes of the
random variable (X) around the expected value.
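For a discrete random variable X with probabilities P(x), the expected value is E(X) = Σ x·P(x) and the variance is Var(X) = Σ (x − E(X))²·P(x). The sketch below applies these two formulas to a fair six-sided die, which is an assumed illustration and not an example from the slides; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def expected_value(distribution):
    """E(X) = sum of x * P(x) over all possible outcomes."""
    return sum(x * p for x, p in distribution.items())


def variance(distribution):
    """Var(X) = sum of (x - E(X))^2 * P(x) over all possible outcomes."""
    mu = expected_value(distribution)
    return sum((x - mu) ** 2 * p for x, p in distribution.items())


# Assumed example: each face of a fair die has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}

die_mean = expected_value(die)
die_variance = variance(die)
print("E(X) =", die_mean, " Var(X) =", die_variance)
```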

197
1. Binomial Distribution
• It is one of the most widely encountered discrete
probability distributions.
• Considers dichotomous/binary random variables
• Is based on a process known as a Bernoulli trial, named after James
Bernoulli (1654 – 1705).
– A Bernoulli trial is a single trial of an experiment that can result in only one of two
mutually exclusive outcomes (dead or alive, sick or well,
male or female, +ve or -ve, Yes/No, etc.)
• The binomial distribution is used to make inferences
about population proportions.
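As an illustration, the probability of exactly k "successes" in n independent Bernoulli trials with success probability p is C(n, k)·p^k·(1 − p)^(n − k). The sketch below assumes independent trials with a constant p; the n = 10 patients are hypothetical, and p = 0.05 is borrowed from the brainstorming question purely for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n_patients = 10      # hypothetical number of operated patients
p_death = 0.05       # probability of death per operation (illustrative)

prob_no_death = binomial_pmf(0, n_patients, p_death)
prob_at_least_one = 1 - prob_no_death
total_probability = sum(binomial_pmf(k, n_patients, p_death)
                        for k in range(n_patients + 1))

print(f"P(no deaths) = {prob_no_death:.4f}")
print(f"P(at least one death) = {prob_at_least_one:.4f}")
```

Summing the probabilities over k = 0, 1, …, n gives 1, which is a quick check that the function describes a valid probability distribution.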

200
Finding normal curve areas
1. The table gives the area between a value Z0 and +∞ (the upper-tail area).

2. Find the Z value to the tenths place in the column
at the left margin and locate its row. Find the
hundredths place in the appropriate column.

3. Read the value of the area (P) from the body of the
table where the row and column intersect. Values of P
are given as decimals to four places.
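When a printed table is not at hand, the same upper-tail area can be computed. A minimal sketch using Python's standard `statistics.NormalDist`; the helper name `upper_tail_area` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def upper_tail_area(z0):
    """Area under the standard normal curve between z0 and +infinity."""
    return 1 - standard_normal.cdf(z0)


for z in (0.00, 1.00, 1.96):
    # Rounded to four decimal places, like the entries of a printed table
    print(f"Z0 = {z:.2f}  area = {upper_tail_area(z):.4f}")
```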

244
Exercises

246
Types and Techniques of Sampling

264
• Sampling is a procedure by which some members of a given
population are selected as representative of the entire population for
observation/study purposes.

• A census is an enumeration of the entire population, which is very
expensive, takes a long time and is difficult to handle.

• Instead, we select a sample of individuals, hoping that the sample
is representative of the population.

265
• Sampling would be easy if all populations were
homogeneous.

• Sampling is very critical in populations where
variation among individuals, as well as in their
environment, is high.

266
Definitions of terms
• Reference population/Target population: the population of interest
to which the findings of the study are going to be generalized.
• Source population: the population from which the study subjects are
obtained.
• Study/Sample population: the population included in the sample.
• Sampling unit: the unit of selection in the sampling process.
For example, in a sample of districts, the sampling unit is a district; in a
sample of persons, the sampling unit is a person, etc.
• Study unit: the unit on which information is collected.
For example, in a study of the prevalence of a disease the study unit is
an individual/person. In a study of family size the study unit is a
household.
267
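• Sampling frame: the list of sampling units in the source
population from which the sample will be selected.

• Sampling fraction: the ratio of the number of units in the sample
to the number of units in the source population (n/N).

• Sampling interval: the reciprocal of the sampling fraction, i.e. the
ratio of the number of units in the source population to the number
of units in the sample (N/n).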

Example:
Researchers are interested in whether there is an association
between sexual debut and severe menstrual pain among
female Hawassa University students of reproductive age.

1. What is the target population?
2. What is the source population?
3. What is the study/sample population?
4. What is the study unit?
268
Advantages of sampling
• Feasibility: it may be the only feasible method of collecting data

• Reduced cost: sampling reduces demands on resources such as
finance, personnel and material

• Greater accuracy: sampling may yield more accurate data than
a census, which is cumbersome and therefore often inaccurately done.
– A sample allows better trained personnel, more careful supervision and processing

• Greater speed: data can be collected and summarized more
quickly
269
Sampling Methods
Two broad divisions:

I. Probability sampling methods

II. Non-probability sampling methods

270
Types of sampling
I. Probability sampling
• A probability sampling method is any method of sampling that utilizes
some form of random selection.
• Compared with non‐probability sampling, it is:
– more complex,
– more time‐consuming and
– usually more costly.

• Every individual of the target population has a known and non-zero
chance of being included in the sample.
• Generalization is possible (from sample to population)

• A sampling frame exists or can be compiled.


271
Most common probability
sampling techniques
A. Simple random sampling
B. Systematic random sampling
C. Stratified random sampling
D. Cluster sampling
E. Multi-stage sampling
F. Sampling with probability proportional to size

272
A) Simple random sampling (SRS)

• This is the most basic scheme of random sampling.

• The required number of individuals is selected at random from
the sampling frame, a list or a database of all individuals in the
source population.

• Each unit in the sampling frame has an equal chance of being
selected

• Randomness of a sample is ensured by:
– Lottery method
– Table of random numbers
– Computer programs
273
Procedure:

• Make a numbered list of all the units in the population from
which you want to draw a sample.

• Each unit on the list should be numbered in sequence from 1 to N
(where N is the size of the population)

• Decide on the size of the sample

• Select the required number of study units, using a “lottery”
method, a table of random numbers or a computer program.
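The "computer program" option in this procedure can be sketched with Python's standard `random.sample`, which draws without replacement so that every unit has the same chance of selection. The frame size (N = 500), sample size (n = 60), function name and seed below are all assumptions made for illustration.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)                    # seed only for reproducibility
    frame = range(1, population_size + 1)        # numbered list of units, 1 to N
    return sorted(rng.sample(frame, sample_size))


selected_units = simple_random_sample(population_size=500, sample_size=60, seed=2024)
print(selected_units[:10], "...")
```

Fixing the seed makes the draw repeatable for documentation purposes; omitting it gives a fresh random selection each time.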

274
57172 42088 70098 11333 26902 29959 43909 49607
33883 87680 28923 15659 09839 45817 89405 70743
77950 67344 10609 87119 15859 74577 42791 75889
11607 11596 01796 24498 17009 67119 00614 49529
56149 55678 38169 47228 49931 94303 67448 31286
80719 65101 77729 83949 83358 75230 56624 27549
93809 19505 82000 79068 45552 86776 48980 56684
40950 86216 48161 17646 24164 35513 94057 51834
12182 59744 65695 83710 41125 14291 74773 66391
13382 48076 73151 48724 35670 38453 63154 58116
38629 94576 48859 75654 17152 66516 78796 73099
60728 32063 12431 23898 23683 10853 04038 75246
01881 99056 46747 08846 01331 88163 74462 14551
23094 29831 95387 23917 07421 97869 88092 72201
15243 21100 48125 05243 16181 39641 36970 99522
53501 58431 68149 25405 23463 49168 02048 31522
07698 24181 01161 01527 17046 31460 91507 16050
22921 25930 79579 43488 13211 71120 91715 49881
68127 00501 37484 99278 28751 80855 02035 10910
55309 10713 36439 65660 72554 77021 46279 22705
92034 90892 69853 06175 61221 76825 18239 47687
50612 84077 41387 54107 09190 74305 68196 75634
81415 98504 32168 17822 49946 37545 47201 85224
38461 44528 30953 08633 08049 68698 08759 45611
07556 24587 88753 71626 64864 54986 38964 83534
60557 50031 75829 05622 30237 77795 41870 26300

275
SRS has certain limitations:

• Requires a sampling frame.
• Difficult if the reference population is dispersed.
• Minority subgroups of interest may not be selected.

276
B) Systematic Random Sampling

• Individuals are chosen at regular intervals (for example, every kth)
from the sampling frame; the interval is called the sampling interval.

• Sampling interval (k) = N/n

• The first unit to be selected is taken at random from among the
first k units.

• For example, a systematic sample is to be selected from 100
students of a school. The sample size is decided to be 20.

• The sampling interval is: 100/20 = 5.

277
• The number of the first student to be included in the sample is
chosen randomly, for example by blindly picking one out of five
pieces of paper, numbered 1 to 5.

• If number 4 is picked, every fifth student will be included in the
sample, starting with student number 4, until 20 students are
selected.

• The numbers selected would be 4, 9, 14, 19, 24, …, 99.
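The school example can be reproduced with a short Python sketch. The function name and the optional `start` argument are made up for this illustration; when no start is supplied, one is drawn at random from the first k units.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th unit, where k = N // n, from a random start in 1..k."""
    k = population_size // sample_size           # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]


# 100 students, sample of 20, and the random start happened to be 4
student_numbers = systematic_sample(population_size=100, sample_size=20, start=4)
print(student_numbers)
```

Integer division is used for k, so this sketch assumes N is (close to) a multiple of n, as in the example.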

278
• Important if the source population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students' registration lists
Merits

• Systematic sampling is usually less time consuming and easier to
perform than simple random sampling.

• It provides a good approximation to SRS.

• Systematic sampling can be conducted without a sampling frame

• E.g. among patients attending a health center, where it is not possible to
predict in advance who will be attending

280
Demerits

• If there is any sort of cyclic pattern in the ordering of the
subjects which coincides with the sampling interval, the sample
will not be representative of the population.

Examples

- A list of married couples arranged with the men's names alternating
with the women's names: selecting every 2nd, 4th, etc. name will result in a
sample of all men or all women.

281
C) Stratified Sampling

• It is appropriate when the distribution of the characteristic to be
studied is strongly affected by a certain variable and the
population is known to be heterogeneous with regard to that
variable.

• The population is first divided into groups (strata) according to a
characteristic of interest (e.g., sex, geographic area, prevalence of
disease, etc.).

• A separate sample is then taken independently from each
stratum.
282
Two types of sample size allocation

283
Example: Equal allocation:

– Allocate equal sample size to each stratum


Village             A     B     C     D     Total
Households (HHs)    100   150   120   130   500
Sample size         ?     ?     ?     ?     60

284
Example: Proportionate Allocation

Village             A     B     C     D     Total
Households (HHs)    100   150   120   130   500
Sample size         ?     ?     ?     ?     60
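One way to work through the two allocation exercises is sketched below in Python. Equal allocation divides the total sample evenly among the strata; proportionate allocation gives each stratum n × (stratum size / population size). The function names are made up for this illustration, and the largest-remainder rounding used to make the shares add up to 60 is an assumption, since the slides do not specify a rounding rule.

```python
def equal_allocation(strata_sizes, total_sample):
    """Give every stratum the same sample size (leftover units go to the first strata)."""
    share, leftover = divmod(total_sample, len(strata_sizes))
    return {name: share + (1 if i < leftover else 0)
            for i, name in enumerate(strata_sizes)}


def proportionate_allocation(strata_sizes, total_sample):
    """Allocate in proportion to stratum size, rounding by largest remainder."""
    population = sum(strata_sizes.values())
    allocation, remainders = {}, {}
    for name, size in strata_sizes.items():
        # Integer share and remainder of total_sample * size / population
        allocation[name], remainders[name] = divmod(total_sample * size, population)
    shortfall = total_sample - sum(allocation.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        allocation[name] += 1
    return allocation


households = {"A": 100, "B": 150, "C": 120, "D": 130}

equal = equal_allocation(households, 60)
proportionate = proportionate_allocation(households, 60)
print("Equal:", equal)
print("Proportionate:", proportionate)
```

For village C the unrounded proportionate share is 60 × 120/500 = 14.4 and for village D it is 15.6, which is why a rounding rule is needed.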

285
Merit

• The representativeness of the sample is improved.

– Adequate representation of minority subgroups of interest
can be ensured by stratification.

Demerit

• A sampling frame has to be prepared separately for each stratum.

286
D) Cluster sampling
• Sometimes it is too expensive to carry out SRS:
– The population may be large and scattered.
– A complete list of the study population may be unavailable
• The population consists of many natural groups
(clusters)
• Travel costs can become high if interviewers
have to survey people from one end of the area to the
other.
• The clusters should be homogeneous (similar to one another), unlike
stratified sampling, where the strata are
heterogeneous (differ from one another)
287
D) Cluster sampling…..

• In this sampling scheme, selection of the required sample is done on
groups of study units (clusters) instead of each study unit individually.

• The sampling unit is a cluster, and the sampling frame is a list of these
clusters.

Procedure

• The reference population (homogeneous) is divided into clusters.

• These clusters are often geographic units (e.g., districts, villages, etc.)

• A sample of such clusters is selected

• All the units in the selected clusters are studied

288
Example: Cluster sampling
[Figure: a population divided into five clusters, labelled Cluster 1 to Cluster 5]

289
Merit
• A list of all the individual study units in the reference
population is not required.
• It is sufficient to have a list of clusters.
Demerit
• It is based on the assumption that the characteristic to be studied
is uniformly distributed throughout the reference population,
which may not always be the case.
– Hence, sampling error is usually higher than for a simple
random sample of the same size.

290
E) Multi-stage sampling

• Similar to cluster sampling, but sampling is done in stages

• This method is appropriate when the reference population is large and
widely scattered.

• Selection is done in stages until the final sampling units (e.g. households or
persons) are reached.

• The primary sampling unit (PSU) is the sampling unit (usually large in size)
in the first sampling stage.

• The secondary sampling unit (SSU) is the sampling unit in the second
sampling stage, etc.

• Example - The PSUs could be kebeles and the SSUs could be households.
291
Merit
– No need to have a list of all units in the population.
– Saves a great amount of time and effort

Demerit
– Sampling error accumulates across the stages
– Provides less precise estimates

293
F) Sampling with probability
proportional to size (PPS)

294
Steps in PPS
• List all kebeles/clusters with their population
size/number of HHs
• Calculate the cumulative frequency of the
population/HHs
• Calculate the sampling interval (say k) by dividing
the total population/HHs by the number of
kebeles/clusters to be selected
• Randomly choose a number between 1 and k, say j
• Kebeles/clusters whose cumulative frequency range
contains the jth, (j+k)th, (j+2k)th, … unit will be included in the
sample
296
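The PPS steps can be sketched in Python as follows. Everything specific here is hypothetical: the kebele names (K1 to K6), their household counts, the number of clusters to select and the random start are invented for illustration, and the function name is not from any library.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def pps_select(cluster_sizes, clusters_to_select, start=None, seed=None):
    """Systematic PPS selection of clusters from their cumulative sizes."""
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    k = cumulative[-1] // clusters_to_select      # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random number j in 1..k
    targets = [start + i * k for i in range(clusters_to_select)]
    # A cluster is chosen when its cumulative range contains the target unit
    return [names[bisect_left(cumulative, target)] for target in targets]


# Hypothetical kebeles with their number of households
kebele_households = {"K1": 100, "K2": 150, "K3": 120,
                     "K4": 130, "K5": 200, "K6": 300}

chosen = pps_select(kebele_households, clusters_to_select=4, start=120)
print(chosen)
```

With a total of 1000 households and 4 clusters to select, k = 250, so the start of 120 leads to the 120th, 370th, 620th and 870th households. A very large cluster can contain more than one target and would then be selected more than once.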
II. Non-probability sampling
• No random selection (the sample may be unrepresentative of the given population)

• Used when a sampling frame does not exist

• Not appropriate if the aim is to measure variables and generalize findings
obtained from a sample to the population.

• It is useful when descriptive comments about the sample itself are desired

• These methods are cheaper, easier and quicker.

• There are also other circumstances, such as in some research settings, when it is
not feasible or practical to conduct probability sampling.


300
The most common types of non-
probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling

301
2. Volunteer sampling
• Occurs when people volunteer to be involved
in the study.
• In experiments or pharmaceutical trials (drug
testing), for example, it would be difficult and
unethical to enlist random participants from
the general public.
• In these instances, the sample is taken from a
group of volunteers.

305
Errors in sampling
• When we take a sample, our results will not exactly equal the
correct results for the whole population. The sample value
deviates from the population value.
• Two types of errors

– Sampling error (random error)

– Non-sampling error (bias)

1. Sampling error: The deviation of the sample statistic from
the population parameter.

• It arises from the sampling process itself

• Sampling error can be minimized by increasing the size of the
sample.
311
2. Non-sampling error (Bias):
• Systematic error in the design or conduct of a sampling
procedure
• Results in distortion of the sample and study results.
• More serious type of error
• Multi-factorial causes
– Selection bias
– Non-response bias
– Observational error
– Respondent error
– Lack of precision in definitions/measurement errors
– Errors in editing and tabulation of data

312
Non-sampling error …

• Cannot be eliminated by increasing the sample size.

E.g., taking only male students of HU to determine the proportion of
smokers results in an overestimate, since females are less likely to smoke.
Increasing the number of male students would not remove the bias.

• Actions to minimize non-sampling error:
– Careful design of the sampling procedure
– Minimizing non-response
– Use of standard checklists for observation
– Use of precise case definitions and calibrated measuring instruments
– Careful data handling

313
Sampling Distribution

315
Sampling Distribution
• A sampling distribution is a distribution of all possible values of a
statistic computed from samples of the same size randomly selected from
the same population.

• When sampling a discrete, finite population, a sampling distribution can
be constructed.

• However, this construction is difficult with a large population and
impossible with an infinite population (in practice, a reasonable number of samples of a given
size is used).

• Serves to answer probability questions about sample statistics.

316
Sampling Distribution….

• We consider sample statistics as random variables.

Example:

• Age of individuals is a random variable

• Similarly, mean age is a random variable

• Take a sample (n) from the population (N) and calculate the statistic, e.g., the
mean.

• Take another sample (same size) and calculate the mean.

• Repeat & repeat & repeat & repeat & . . . . . . . .

317
Sampling Distribution….
• Do you expect all the sample means to be the same?

• They will vary (random variation)

• Put all these sample statistics together to get a distribution of sample
statistics (frequency distribution)

Main types of sampling distributions

A. Distribution of the sample mean

B. Distribution of the difference between two means

C. Distribution of a sample proportion

D. Distribution of the difference between two proportions.

318
A. Sampling distribution of mean/Distribution of
sample mean
• Suppose we have a population of size N=4, constituting the ages
of four outpatients.

x, Age (years): 18, 20, 22, 24

$$\mu = \frac{\sum x_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$$

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = 2.236$$
319
Sample mean (x̄) Freq P(x̄)
18 1 0.0625
19 2 0.1250
20 3 0.1875
21 4 0.2500
22 3 0.1875
23 2 0.1250
24 1 0.0625

321
Sampling distribution of all sample means

16 sample means (rows: 1st observation; columns: 2nd observation)

1st Obs    18   20   22   24
18         18   19   20   21
20         19   20   21   22
22         20   21   22   23
24         21   22   23   24

[Figure: Sample Means Distribution. P(x̄), from 0 to .3, plotted against x̄ = 18, 19, …, 24]
322
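Summary measures of this sampling distribution:
Add the 16 sample means & divide by 16.
Also calculate the SD of the sample means.

$$\mu_{\bar{x}} = \frac{\sum \bar{x}_i}{N} = \frac{18 + 19 + 20 + \cdots + 24}{16} = 21$$

$$\sigma_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \mu_{\bar{x}})^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58$$

(Here N = 16 is the number of sample means.)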

324
Compare the population distribution with its sampling
distribution

325
• We note that the mean of the sampling distribution of the mean has the same
value as the mean of the population.

• However, its variance is different from the population variance: it is
equal to the population variance divided by the size of the samples used to obtain the sampling
distribution (σ²/n).

• The square root of the sampling distribution variance is called the standard error of
the mean or, simply, the standard error.

• More generally, the standard deviation of any sample statistic is called its standard error.
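These relationships can be checked by brute force for the four-outpatient population. The sketch below assumes samples of size n = 2 drawn with replacement, which is what yields the 16 possible samples; the variable names are chosen for this example only.

```python
import math
from itertools import product
from statistics import mean, pstdev

population = [18, 20, 22, 24]   # ages of the four outpatients
n = 2                           # sample size (assumed: sampling with replacement)

# Every possible ordered sample of size n, and the mean of each one
all_samples = list(product(population, repeat=n))
sample_means = [mean(sample) for sample in all_samples]

mu = mean(population)                 # population mean
sigma = pstdev(population)            # population SD (divisor N)
mu_xbar = mean(sample_means)          # mean of the sampling distribution
se = pstdev(sample_means)             # SD of the sample means = standard error

print(f"{len(all_samples)} samples; mu = {mu}, sigma = {sigma:.3f}")
print(f"mean of sample means = {mu_xbar}, standard error = {se:.2f}")
print(f"sigma / sqrt(n) = {sigma / math.sqrt(n):.2f}")
```

The last line compares the directly computed standard error with σ/√n; the two agree because the variance of the sample means equals σ²/n.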

326
• Standard error is determined by both the sample size and the degree of
variability among the individual observations.

• Standard deviation quantifies the amount of variability among
individuals in a population, while

• Standard error quantifies the variability among means of repeated
samples drawn from that population.

• The standard error is always smaller than the standard deviation
(except when n=1)

327
Properties of sampling distribution of mean

1. Sampling from normally distributed population

331
Properties of sampling distribution of mean

2. Sampling from non-normally distributed population

332
C. Sampling Distribution of proportion/
Distribution of Sample Proportion

352
Properties of Sampling Distribution of
sample proportion
• Construction of the sampling distribution of the
sample proportion is done in a manner similar
to that of the sampling distribution of the sample
mean.

• Applying the central limit theorem, the shape
of the sampling distribution is approximately
normal provided that n is large enough.

354
Inferential Statistics
Statistical Estimation

371
• Two methods of estimation:
– Point estimation
– Interval estimation

• Point estimation involves the calculation
of a single value to estimate the population
parameter

• Interval estimation specifies a range of
values assumed to include the population
parameter
378
Interval Estimate
• Two questions help us put bounds around our point
estimate to reflect our level of confidence.
– How wide does the bracket have to be?
– What is our tolerance of error?
• Scientists usually accept a 5% chance
that the range will not include the true
population value.
– This range or interval is called the 95%
confidence interval.
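As a hedged illustration of such an interval, the sketch below computes a 95% confidence interval for a mean as point estimate ± z × standard error, assuming the population SD is known and the sampling distribution is approximately normal. The numbers (mean 34.4, SD 11, n = 169) and the function name are hypothetical and are not taken from the slides.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval: sample_mean +/- z * sigma / sqrt(n), with sigma assumed known."""
    # z leaves (1 - confidence) / 2 in each tail, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical values for illustration only
lower, upper = mean_confidence_interval(sample_mean=34.4, sigma=11.0, n=169)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

A wider tolerance of error (for example 90% confidence) gives a narrower bracket, and a stricter one (99%) gives a wider bracket.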

386
Degrees of Freedom (df)
Idea: Number of observations that are free to vary
after sample mean has been calculated

• Example: Suppose the mean of 3 numbers is 8.0

Let X1 = 7
Let X2 = 8
What is X3?

If the mean of these three values is 8.0,
then X3 must be 9
(i.e., X3 is not free to vary)
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2
(2 values can be any numbers, but the third is not free to vary
for a given mean)
401