Southwestern University

School of Medicine

Summarizing data and Sampling
(Variables, Frequency Distribution, Sampling in
Public Health)

Anacleto Clent L. Banaay, Jr., MD, MPM

Lecturer, Preventive and Community Medicine I
General objectives

1. To define and determine the different types of variables
2. To identify and interpret the measures of central location
(mode, median, mean)
3. To be able to construct a frequency distribution
4. To identify the different methods of sampling

Southwestern University PHINMA

• In order to answer research questions, it is rarely feasible for a researcher
to collect data from all cases. Thus, there is a need to select a
sample. The entire set of cases from which the researcher's sample is drawn is
called the population. Since researchers have neither the time nor the
resources to analyze the entire population, they apply sampling
techniques to reduce the number of cases.
• In this module, we will be able to identify which members of the population are
exposed or vulnerable to an exposure. We will also learn how to characterize
the representatives of the population by assigning them according to their
• When you want to measure something in the natural world you usually
have to take several measurements. This is because things are variable, so
you need several results to get an idea of the situation. Once you have
these measurements you need to summarize them in some way because
sets of raw numbers are not easily interpreted by most people.


Review of Terms
Descriptive statistics
• The use of statistical tools to summarize and describe a set of data values
• Human beings usually find it difficult to create meaning from long lists of numbers or words
• Summarizing the numbers or counting the occurrences of words and expressing that
summary with single values makes much more sense to us
• In descriptive statistics, no attempt is made to compare any data sets or groups

Inferential statistics
• The investigation of specified elements which allow us to make inferences about a larger
population (i.e., beyond the sample size)
• Here we compare groups of subjects or individuals
• It is normally not possible to include each subject or individual in a population in a study;
therefore, we use statistics and infer that the results we get apply to the larger population.


Population
• A group of individuals that share at least one characteristic in common
• On a macro level, this might refer to all of humanity
• At the level of clinical research, this might refer to every individual with a certain disease or
risk factor, which might still be an enormous number of individuals
• It is also possible to have a quite small population, e.g., in the case of a very rare condition
• The findings of a study are inferred to a larger population; we make use of the findings to
manage the population to which those findings apply.

Sample
• A sample is a selection of members within the population (I'll discuss different ways of
selecting a sample a bit later in this course)
• Research is conducted using that sample set of members, and any results can be inferred to the
population from which the sample was taken
• This use of statistical analysis makes clinical research possible, as it is usually nearly impossible to
include the complete population


Parameter
• A statistical value that is calculated from all the values in a whole population is termed a
parameter
• If we knew the age of every individual on earth and calculated the mean or average age,
that age would be a parameter

Variable
• There are many ways to define a variable, but for use in this course I will refer to a variable
as a group name for any data values that are collected for a study
• Examples would include age, presence of risk factor, admission temperature, infective
organism, systolic blood pressure
• Variables usually become the column names in a data spreadsheet, with each row
representing the findings for an individual in a study


Organizing Data

A variable can be any characteristic that differs from person to person,
such as height, sex, smallpox vaccination status, or physical activity.
Types of Variables
• Categorical (including nominal
and ordinal data) refers to
categories or things, not
mathematical values
• Numerical (further defined as
being either interval or ratio data)
refers to data about
measurement and counting


• A nominal-scale variable is one whose
values are categories without any numerical ranking, such as county of residence.
In epidemiology, nominal variables with only two categories are very common: alive or
dead, ill or well, vaccinated or unvaccinated, or did or did not eat the potato salad. A
nominal variable with two mutually exclusive categories is sometimes called a
dichotomous variable.
• An ordinal-scale variable has values that can be ranked but are not necessarily evenly
spaced, such as stage of cancer.
• An interval-scale variable is measured on a scale of equally spaced units, but without
a true zero point, such as date of birth.
• A ratio-scale variable is an interval variable with a true zero point, such as height in
centimeters or duration of illness.


• Nominal- and ordinal-scale variables are
considered qualitative or categorical variables,
whereas interval- and ratio-scale variables are
considered quantitative or continuous variables.
• Sometimes the same variable can be measured
using both a nominal scale and a ratio scale. For
example, the tuberculin skin tests of a group of
persons potentially exposed to a co-worker with
tuberculosis can be measured as “positive” or
“negative” (nominal scale) or in millimeters of
induration (ratio scale).


Summarizing variables
Frequency Distributions
• A frequency distribution displays the values a variable can take and the number of persons
or records with each value
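
As a rough illustration, a frequency distribution can be tabulated with a simple tally. The sketch below uses only Python's standard library and the DPT dose counts that appear in the mode example of this lecture; the variable names are assumptions made for illustration.

```python
from collections import Counter

# Number of DPT vaccine doses received by each of 17 two-year-old children
doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]

# Tally how many children fall at each value, ordered by value
frequency = dict(sorted(Counter(doses).items()))

for value, count in frequency.items():
    print(f"{value} doses: {count} children")
```

Each row of the printed table pairs a value of the variable with the number of records holding that value, which is all a frequency distribution is.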


Properties of Frequency Distributions
• The data in a frequency distribution can be graphed.
• The graph reveals three features:
1. Where the distribution has its peak (central location),
2. How widely dispersed it is on both sides of
the peak (spread), and
3. Whether it is more or less symmetrically
distributed on the two sides of the peak (shape)


1. Central location
• This type of symmetric distribution is the classic bell-shaped curve, also known as a normal
distribution.
• The clustering at a particular value is known as the central location or central tendency of a
frequency distribution.
• The central location of a distribution is one of its most important properties. Sometimes it is cited
as a single value that summarizes the entire distribution.

• Three measures of central location are commonly used in epidemiology:
arithmetic mean, median, and mode.
• Two other measures that are used less often are the
midrange and geometric mean.


2. Spread (Variation or dispersion)

• Spread is the second property of a frequency distribution.
• Spread refers to the distribution out from a central value.
• Two measures of spread commonly used in epidemiology are range and standard deviation.
• The spread of a frequency distribution is independent of its central location.
• Frequency distributions may have the same central location but different amounts of
spread.

3. Shape
• A third property of a frequency distribution.
• Frequency distributions of some
characteristics of human populations tend
to be symmetrical.
• A distribution that is asymmetrical is more
commonly referred to as skewed.
• Skewness refers to the tail, not the hump.
So a distribution that is skewed to the left
has a long left tail.


• A distribution that has a central location to
the left and a tail off to the right is said to
be positively skewed or skewed to the
right.
• A distribution that has a central location to
the right and a tail to the left is said to be
negatively skewed or skewed to the left.
• One distribution deserves special mention:
the Normal or Gaussian distribution.
This is the classic symmetrical bell-shaped
curve. It is defined by a mathematical
equation and is very important in statistics.
• Because the normal distribution (or bell-shaped curve) is perfectly symmetrical, the mean,
median, and mode all have the same value.
• Mean = Median = Mode
• Standard deviation is the measure of spread most commonly used with the mean.
• The mean is the most commonly used measure of central location, and is the measure upon
which the majority of statistical tests and analytic techniques are based.
• The advantage of the median is that it is not affected by a few extremely high or low
values.
• When a set of data is skewed, the median is more representative of the data than is the
mean.
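
To see why the median holds up better than the mean in skewed data, the following minimal sketch compares both measures before and after one extreme value is added. The lengths of hospital stay are hypothetical values chosen only for illustration; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical lengths of hospital stay in days (illustrative values only)
stays = [2, 3, 3, 4, 5]
stays_with_outlier = stays + [40]  # add one extremely high value

mean_before, median_before = mean(stays), median(stays)
mean_after, median_after = mean(stays_with_outlier), median(stays_with_outlier)

print("Without outlier:", mean_before, median_before)
print("With outlier:   ", mean_after, median_after)
```

In this made-up data set the single long stay pulls the mean up by several days, while the median moves by only half a day.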
Measures of Central Location
• Provides a single value that summarizes an entire distribution of data.
• It is the value that best represents an entire distribution of data.
• Measures of central location include the mode, median, arithmetic mean,
midrange, and geometric mean.
• Selecting the best measure to use for a given distribution depends largely on
two factors:
• The shape or skewness of the distribution,
• The intended use of the measure.


Mode
• Is the value that occurs most often in a set of data.
• It can be determined simply by tallying the number of times each value occurs.
• A distribution with two modes is called bimodal.

Example: the number of doses of diphtheria-pertussis-tetanus (DPT) vaccine each of seventeen 2-year-old
children in a particular village received:
0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4
Two children received no doses; two children received 1 dose; three received 2 doses; six received 3 doses;
and four received all 4 doses. Therefore, the mode is 3 doses, because more children
received 3 doses than any other number of doses.

Method for identifying the mode
• Step 1. Arrange the observations into a frequency
distribution, indicating the values of the variable and
the frequency with which each value occurs.
(Alternatively, for a data set with only a few values,
arrange the actual values in ascending order, as was
done with the DPT vaccine doses above.)
• Step 2. Identify the value that occurs most often.
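
The two steps can be sketched in a few lines of Python. This is an illustration rather than a prescribed method: it relies on `statistics.multimode` from the standard library (Python 3.8 or later), which returns every value tied for most frequent, and the second data set is hypothetical.

```python
from statistics import multimode

# DPT vaccine doses received by the 17 children in the example
doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
modes = multimode(doses)  # list of the most frequent value(s)
print("Mode(s):", modes)

# Hypothetical data with a tie, giving two modes (bimodal)
tied = [1, 1, 2, 2, 3]
tied_modes = multimode(tied)
print("Mode(s):", tied_modes)
```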


Properties and uses of the mode
• The mode is the easiest measure of central location to understand and explain. It is also the
easiest to identify, and requires no calculations.
• The mode is the preferred measure of central location for addressing which value is the
most popular or the most common, that is, the “typical” number.
• A distribution has more than one mode if two or more values tie as the most frequent values.
It has no mode if no value appears more than once.
• The mode is used almost exclusively as a “descriptive” measure.
• It is almost never used in statistical manipulations or analyses.
• The mode is not typically affected by one or two extreme values (outliers).


Median
• The median is the middle value of a set of data that has been put into rank order.
• The statistical median is the value that divides the data into two halves, with one half of the
observations being smaller than the median value and the other half being larger.
• The median is also the 50th percentile of the distribution.

Method for identifying the median
Step 1. Arrange the observations into increasing or decreasing order.
Step 2. Find the middle position of the distribution by using the following formula:
Middle position = (n + 1) / 2
a. If the number of observations (n) is odd, the middle position falls on a single observation.
b. If the number of observations is even, the middle position falls between two observations.
Step 3. Identify the value at the middle position.
a. If the number of observations (n) is odd and the middle position falls on a single observation, the
median equals the value of that observation.
b. If the number of observations is even and the middle position falls between two observations, the
median equals the average of the two values.
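
The steps above can be sketched as a small Python function. The function name `find_median` and the four-value data set are assumptions made for illustration; the DPT doses are taken from the mode example.

```python
def find_median(observations):
    """Identify the median by following Steps 1 to 3."""
    if not observations:
        raise ValueError("at least one observation is required")
    ordered = sorted(observations)       # Step 1: rank order
    n = len(ordered)
    middle_position = (n + 1) / 2        # Step 2: position counted from 1
    if n % 2 == 1:
        # Step 3a: odd n, the middle position is a single observation
        return ordered[int(middle_position) - 1]
    # Step 3b: even n, average the two observations around the middle
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
median_doses = find_median(doses)         # n = 17, middle position = 9
median_even = find_median([4, 1, 3, 2])   # n = 4, middle position = 2.5
print(median_doses, median_even)
```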


Properties and uses of the median
• The median is a good descriptive measure, particularly for data that are skewed, because it
is the central point of the distribution.
• The median, like the mode, is not generally affected by one or two extreme values (outliers).
• The median is relatively easy to identify. It is equal to either a single observed value (if odd
number of observations) or the average of two observed values (if even number of
observations).
• The median has less-than-ideal statistical properties. Therefore, it is not often used in
statistical manipulations and analyses.


Arithmetic Mean

• The arithmetic mean is a more technical name for what is more commonly called the mean
or average.
• The arithmetic mean is the value that is closest to all the other values in a distribution.
• Method for calculating the mean
Step 1. Add all of the observed values in the distribution.
Step 2. Divide the sum by the number of observations
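
As a minimal sketch, the two steps translate directly into Python. The function name `arithmetic_mean` is an assumption; the data are the DPT doses from the mode example.

```python
def arithmetic_mean(observations):
    total = sum(observations)         # Step 1: add all observed values
    return total / len(observations)  # Step 2: divide by the number of observations

doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
mean_doses = arithmetic_mean(doses)   # 42 doses shared among 17 children
print(round(mean_doses, 2))
```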


Properties and uses of the arithmetic mean
• The mean has excellent statistical properties and is commonly used in additional statistical
manipulations and analyses.
• One such property is called the centering property of the mean.
• When the mean is subtracted from each observation in the data set, the sum of these
differences is zero (i.e., the negative sum is equal to the positive sum).

• Because of this centering property, the mean is sometimes called the center of gravity of a
frequency distribution.
• The arithmetic mean is the best descriptive measure for data that are normally distributed.
• The mean is not the measure of choice for data that are severely skewed or have extreme
values in one direction or another.
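
The centering property can be checked numerically. The sketch below uses a small hypothetical data set whose mean is a whole number, so the differences sum to exactly zero; with other data, floating-point rounding may leave a tiny remainder.

```python
# Hypothetical observations (illustrative values only)
values = [2, 4, 6, 8]
mean = sum(values) / len(values)

# Subtract the mean from each observation
differences = [x - mean for x in values]

negative_sum = sum(d for d in differences if d < 0)
positive_sum = sum(d for d in differences if d > 0)

print(differences)       # differences below and above the mean
print(sum(differences))  # the differences cancel out
```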


The midrange (midpoint of an interval)
• The midrange is the half-way point or the midpoint of a set of observations. The midrange
is usually calculated as an intermediate step in determining other measures.
• Method for identifying the midrange
Step 1. Identify the smallest (minimum) observation and the largest (maximum) observation.
Step 2. Add the minimum plus the maximum, then divide by two.
• Exception: Age differs from most other variables because age does not follow the usual
rules for rounding to the nearest integer.
• Someone who is 17 years and 360 days old cannot claim to be 18 years old for at least 5 more days. Thus, to
identify the midrange for age (in years) data, you must add the smallest (minimum) observation plus the
largest (maximum) observation plus 1, then divide by two.
Midrange (most types of data) = (minimum + maximum) / 2
Midrange (age data) = (minimum + maximum + 1) / 2
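
Both formulas can be sketched in one small function. The function name `midrange`, the `is_age` flag, and the ages used are assumptions introduced for illustration.

```python
def midrange(observations, is_age=False):
    """Midpoint between the smallest and largest observation.

    For age in completed years, 1 is added before halving,
    following the age formula above.
    """
    total = min(observations) + max(observations)
    if is_age:
        total += 1
    return total / 2

doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
ages = [17, 19, 22, 25]  # hypothetical ages in completed years

midrange_doses = midrange(doses)
midrange_ages = midrange(ages, is_age=True)
print(midrange_doses, midrange_ages)
```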
Properties and uses of the midrange
• The midrange is not commonly reported as a measure of central location.
• The midrange is more commonly used as an intermediate step in other calculations, or for plotting graphs of data
collected in intervals.


Standard deviation
• Is the measure of spread used most commonly with the arithmetic mean
• Subtracting the mean from each observation and then summing the differences adds to
0. This concept of subtracting the mean from each observation is the basis for the
standard deviation.
• However, the difference between the mean and each observation is squared to
eliminate negative numbers. Then the average is calculated and the square root is taken
to get back to the original units.
Method for calculating the standard deviation
• Step 1. Calculate the arithmetic mean.
• Step 2. Subtract the mean from each observation. Square the difference.
• Step 3. Sum the squared differences.
• Step 4. Divide the sum of the squared differences by n – 1.
• Step 5. Take the square root of the value obtained in Step 4. The result is the standard deviation.
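
The five steps can be followed line by line in code. This is a sketch under stated assumptions: the function name and the eight observations are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import math

def standard_deviation(observations):
    n = len(observations)
    mean = sum(observations) / n                                   # Step 1
    squared_differences = [(x - mean) ** 2 for x in observations]  # Step 2
    sum_of_squares = sum(squared_differences)                      # Step 3
    variance = sum_of_squares / (n - 1)                            # Step 4
    return math.sqrt(variance)                                     # Step 5

# Hypothetical observations: mean = 5, sum of squared differences = 32
values = [2, 4, 4, 4, 5, 5, 7, 9]
sd = standard_deviation(values)
print(round(sd, 2))
```

Dividing by n – 1 rather than n gives the sample standard deviation, which is also what `statistics.stdev` in Python's standard library computes.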


Properties and uses of the standard deviation
• The numeric value of the
standard deviation does not have
an easy, non-statistical
interpretation, but similar to
other measures of spread, the
standard deviation conveys how
widely or tightly the observations
are distributed from the center.
• Standard deviation is usually
calculated only when the data are
more-or-less “normally
distributed,” i.e., the data fall into
a typical bell-shaped curve.


Area Under Normal Curve within 1, 2 and 3 Standard Deviations

• The mean is at the center, and data are equally
distributed on either side of this mean.
• The points that show ±1, 2, and 3 standard
deviations are marked on the x axis.
• For normally distributed data, approximately
two-thirds (68.3%, to be exact) of the data fall
within one standard deviation of either side of
the mean;
• 95.5% of the data fall within two standard
deviations of the mean;
• 99.7% of the data fall within three standard
deviations of the mean.
• Exactly 95.0% of the data fall within 1.96
standard deviations of the mean.
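
These areas can be reproduced from the normal curve itself. The sketch below assumes the standard identity that the area within k standard deviations of the mean equals erf(k / √2), and uses `math.erf` from Python's standard library; the function name `area_within` is an assumption.

```python
import math

def area_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.96):
    print(f"within ±{k} SD: {area_within(k) * 100:.2f}%")
```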
Standard error of the mean
• The standard error of the mean refers to
variability we might expect in the arithmetic
means of repeated samples taken from the same
population.
• The standard error assumes that the data you
have is actually a sample from a larger population.
• According to the assumption, your sample is just
one of an infinite number of possible samples that
could be taken from the source population. Thus,
the mean for your sample is just one of an infinite
number of other sample means.
• The standard error quantifies the variation in
those sample means.

Properties and uses of the standard error of the mean
• The primary practical use of the standard error of
the mean is in calculating confidence intervals
around the arithmetic mean.
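
The slides do not give a formula, so the sketch below assumes the usual one: the standard error of the mean is the sample standard deviation divided by the square root of the number of observations. The function name and the data are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(observations):
    # Assumed formula: sample standard deviation / sqrt(n)
    sd = statistics.stdev(observations)  # divides by n - 1
    return sd / math.sqrt(len(observations))

# Hypothetical observations (illustrative values only)
values = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error_of_mean(values)
print(round(se, 3))
```

Because n sits in the denominator, larger samples give smaller standard errors, and so more stable sample means.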


Confidence limits (confidence interval)
• Inference is a scientific generalization that the sample being studied
represents the entire population. Usually, the inference includes some
consideration about the precision of the measurement.
• A common way to indicate a measurement’s precision is by providing a
confidence interval.
• A narrow confidence interval indicates high precision; a wide confidence
interval indicates low precision.
• The confidence interval for a mean is based on the mean itself and some
multiple of the standard error of the mean


Method for calculating a 95% confidence interval for
a mean
• Step 1. Calculate the mean and its standard error.
• Step 2. Multiply the standard error by 1.96.
• Step 3. Lower limit of the 95% confidence interval =
mean minus 1.96 x standard error.
Upper limit of the 95% confidence interval =
mean plus 1.96 x standard error

Rounding to one decimal, the 95% confidence interval is 200.1 to 211.9. In other words, this study’s best estimate of the true
population mean is 206, but is consistent with values ranging from as low as 200.1 to as high as 211.9. Thus, the confidence
interval indicates how precise the estimate is. (This confidence interval is narrow, indicating that the sample mean of 206 is fairly
precise.) It also indicates how confident the researchers should be in drawing inferences from the sample to the entire population.
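
The three steps can be sketched as follows. The mean of 206 comes from the example above; the standard error of 3 is an assumption, chosen because it reproduces the interval quoted above, and the function name is hypothetical.

```python
def confidence_interval_95(mean, standard_error):
    margin = 1.96 * standard_error  # Step 2
    lower = mean - margin           # Step 3: lower limit
    upper = mean + margin           # Step 3: upper limit
    return lower, upper

# Mean from the example; standard error of 3 is assumed for illustration
lower, upper = confidence_interval_95(206, 3)
print(round(lower, 1), round(upper, 1))
```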


Properties and uses of confidence intervals
• The mean is not the only measure for which a confidence interval can or should be calculated. Confidence intervals are
also commonly calculated for proportions, rates, risk ratios, odds ratios, and other epidemiologic measures when the
purpose is to draw inferences from a sample survey or study to the larger population.
• Most epidemiologic studies are not performed under the ideal conditions required by the theory behind a confidence
interval.
• Confidence intervals for means, proportions, risk ratios, odds ratios, and other measures all are calculated using different
formulas.
• Regardless of the measure, the interpretation of a confidence interval is the same: the narrower the interval, the more precise the estimate; and the
range of values in the interval is the range of population values most consistent with the data from the study.


Why is sampling necessary in Research?
• Let's use this knowledge in a quick example: Say we want to test the
effectiveness of levothyroxine as treatment for hypothyroidism in South
African subjects between the ages of 18 and 24.
• It would be physically impossible to find every individual in the country with
hypothyroidism and collect their data.
• However, we can collect a representative sample of the population. For
example, we can calculate the average (sample statistic) of the thyroid
stimulating hormone level before and after treatment.
• We can use this average to infer results about the population parameter.
• In this example, thyroid stimulating hormone level would be a variable and
the actual numerical value for each patient would represent individual data
points.
• A sample is a subset of individuals from a population.
• Sampling means selecting the group that you will
actually collect data from.
• Sampling allows you to test a hypothesis about the
characteristics of a population.
• The population is the entire group that you want to
draw conclusions about.
• The sample is the specific group of individuals that you
will collect data from.
• The population can be defined in terms of geographical
location, age, income, and many other characteristics.


Two types of sampling methods
• Probability sampling involves random
selection, allowing you to make
statistical inferences about the whole
population.
• Non-probability sampling involves
non-random selection based on
convenience or other criteria,
allowing you to easily collect initial
data.

Probability sampling methods


Non-probability sampling methods


Sampling Bias
• Sampling bias occurs when some members
of a population are systematically more
likely to be selected in a sample than
others.


Sample size
• In order to generalize from a random sample and avoid sampling errors or biases, a random sample
needs to be of adequate size.
• What matters is the absolute size of the sample selected relative to the complexity of the population.
• While it holds that the larger the sample, the lesser the likelihood that findings will be biased, diminishing
returns can quickly set in once samples exceed a certain size, and this needs to be balanced against the
researcher’s resources.
• To put it bluntly, larger sample sizes reduce sampling error but at a decreasing rate. Several statistical
formulas are available for determining sample size.
• There are numerous approaches, incorporating a number of different formulas, for calculating the
sample size for categorical data.
• n = p (100 - p) z² / E²
• n is the required sample size
• p is the percentage occurrence of a state or condition
• E is the percentage maximum error required
• z is the value corresponding to the level of confidence required
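
The formula above can be sketched as a short function. The function name, the example inputs (50% occurrence, 5% maximum error), and the default z of 1.96 for 95% confidence are assumptions chosen for illustration; the result is rounded up because a sample cannot include a fraction of a person.

```python
import math

def required_sample_size(p, e, z=1.96):
    """n = p(100 - p)z² / E², with p and e given as percentages."""
    n = p * (100 - p) * z ** 2 / e ** 2
    return math.ceil(n)  # round up to a whole number of subjects

n_wide = required_sample_size(p=50, e=5)    # 5% maximum error
n_narrow = required_sample_size(p=50, e=2)  # 2% maximum error
print(n_wide, n_narrow)
```

Tightening the maximum error from 5% to 2% raises the required sample from 385 to 2401, which illustrates the diminishing returns described above.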


Summary
• Measures of central location are single values that summarize the observed values of a distribution.
• The mode provides the most common value, the median provides the central value, the arithmetic mean provides the
average value, and the midrange provides the midpoint value.
• The mode and median are useful as descriptive measures. However, they are not often used for further statistical
manipulations.
• In contrast, the mean is not only a good descriptive measure, but it also has good statistical properties. The mean is used
most often in additional statistical manipulations.
• The midrange, which is based on the minimum and maximum values, is more sensitive to outliers than any other
measure.
• A sample is a subset of individuals from a larger population.
• Sampling means selecting the group that you will actually collect data from in your research.
• In statistics, sampling allows you to test a hypothesis about the characteristics of a population.
• Samples are used to make inferences about populations. Samples are easier to collect data from because they are
practical, cost-effective, convenient and manageable.
• Probability sampling means that every member of the target population has a known chance of being included in the
sample.
• In non-probability sampling, the sample is selected based on non-random criteria, and not every member of the
population has a chance of being included.

