Lecture-2 Descriptive Biostatistics
Course Outline
• Introduction to Biostatistics
• Descriptive Biostatistics
• Probability
• Discrete Probability Distributions
• Continuous Probability Distributions
Lecture Outline
• Descriptive Measures
• Measures of Central Tendency
The Mean
The Median
The Mode
Data Distribution (symmetric and skewed distribution)
• Measures of Dispersion
The Range
The Variance
The Standard Deviation
The Coefficient of Variation
The Percentiles
The Interquartile Range
Outliers
Kurtosis
• Grouped Data: The Frequency Distribution
• Graphic Methods
Descriptive Biostatistics
• The best way to work with data is to summarize and organize them.
• Data can often be summarized by means of a single number called a descriptive
measure.
• Descriptive measures may be computed from the data of a sample or the data of a
population.
Several types of descriptive measures can be computed from a set of data. However,
the two most important types are measures of central tendency and measures of dispersion.
The measures of central tendency are:
1. Mean
2. Median
3. Mode
The Mean
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
• Since geometric and harmonic means are not covered in this lecture, the arithmetic
mean is simply referred to as the mean.
The Arithmetic Mean
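• The arithmetic mean is the sum of all the observations divided by the number of
observations: x̄ = (x1 + x2 + … + xn) / n.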
The Arithmetic Mean
Example: What is the arithmetic mean for the sample of birthweights in the table?
The Arithmetic Mean
Limitations:
• The arithmetic mean is very sensitive to extreme values.
• When extreme values are present, it may not be representative of the location of the
great majority of sample points.
The Arithmetic Mean
• If the first infant in the Table happened to be a premature infant weighing 500 g
rather than 3265 g, then the arithmetic mean of the sample would fall to 3028.7 g.
• In this instance, 7 of the birth-weights would be lower than the arithmetic mean, and
13 would be higher than the arithmetic mean.
• It is possible in extreme cases for all but one of the sample points to be on one side
of the arithmetic mean.
• In these types of samples, the arithmetic mean is a poor measure of central location
because it does not reflect the center of the sample.
• Nevertheless, the arithmetic mean is by far the most widely used measure of central
location.
The Arithmetic Mean
• Suppose the five physicians who practice in an area are surveyed to determine their
charges for a certain procedure.
• Assume that they report these charges: $75, $75, $80, $80, and $280.
• The mean charge for the five physicians is found to be $118, a value that is not very
representative of the set of data as a whole.
• The single atypical value had the effect of inflating the mean.
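The effect of the atypical charge can be checked directly. The following is a minimal Python sketch using the five charges quoted above; the variable names are illustrative only.

```python
from statistics import mean

# Charges reported by the five physicians, in dollars (from the example above).
charges = [75, 75, 80, 80, 280]

mean_all = mean(charges)           # includes the atypical $280 value
mean_without = mean(charges[:-1])  # the four typical charges only

print(f"Mean of all five charges: ${mean_all}")
print(f"Mean without the $280 charge: ${mean_without}")
```

Dropping the single atypical value moves the mean from 118 to 77.5, which is much closer to the four typical charges.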
Properties of the Mean
1. Uniqueness
For a given set of data there is one and only one mean.
2. Simplicity
The mean is easily understood and easy to compute.
3. Sensitivity to extreme values
The mean is influenced by each value. Therefore, extreme values can distort the mean.
The Median
• An alternative measure of location is the median or, more precisely, the sample
median.
• The median of a finite set of values is that value which divides the set into two equal
parts.
• Samples with an odd sample size have a unique central point, when all values have
been arranged in order of magnitude.
• Example: For samples of size 7, the fourth largest point is the central point in the
sense that 3 points are smaller than it and 3 points are larger.
• Samples with an even sample size have no unique central point, and the middle two
values must be averaged, when all values have been arranged in the order of their
magnitudes.
• Example: For samples of size 8 the fourth and fifth largest points would be averaged
to obtain the median, because neither is the central point.
The Median
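Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the median is defined as follows:
1. The ((n + 1)/2)th largest observation if n is odd.
2. The average of the (n/2)th and (n/2 + 1)th largest observations if n is even.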
Example-1: Compute the sample median for the sample in the given table.
Example-2: The data set in the table consists of white-blood counts taken on
admission of all patients entering a small hospital in Allentown, Pennsylvania, on a
given day. Compute the median white-blood count.
Because n = 9 is odd, the sample median is given by the fifth largest point, which
equals 8, or 8000 on the original scale.
The Median
Strength:
• The main strength of the sample median is that it is insensitive to very large or very
small values.
• In particular, if the second patient in Table of Example-2 had a white count of 65,000
rather than 35,000, the sample median would remain unchanged, because the fifth
largest value is still 8000.
• Conversely, the arithmetic mean would increase dramatically from 10,778 in the
original sample to 14,111 in the new sample.
Weakness:
• The main weakness of the sample median is that it is determined mainly by the
middle points in a sample and is less sensitive to the actual numeric values of the
remaining data points.
The Median
• Arraying the 10 ages in order of magnitude from smallest to largest gives 38, 43, 50,
57, 57, 59, 61, 64, 65, 66.
• Since we have an even number of ages, there is no middle value. The two middle
values, however, are 57 and 59.
• The sample median is (57 + 59)/2 = 58.
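The odd and even cases can be sketched in a few lines of Python. This is an illustrative helper (the name `sample_median` is an assumption, not part of the lecture), applied to the 10 ordered ages above.

```python
def sample_median(values):
    """Return the sample median: the middle value if n is odd,
    the average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1)/2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle values

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
print(sample_median(ages))

# Replacing the oldest age by an extreme value leaves the median unchanged.
print(sample_median(ages[:-1] + [660]))
```

Both calls return 58, which illustrates why the median is insensitive to very large or very small values.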
Data Distributions
• Data distributions may be classified on the basis of whether they are symmetric or
asymmetric.
• If a distribution is symmetric, the left half of its graph will be a mirror image of its right
half.
• When the left half and right half of the graph of a distribution are not mirror images of each
other, the distribution is asymmetric.
• In a symmetric distribution, the relative position of the points on each side of the sample
median is the same.
Positively Skewed Distribution
• If a distribution is not symmetric because its graph extends further to the right than to
the left, that is, if it has a long tail to the right, then the distribution is skewed to the
right or is positively skewed.
• In a positively skewed distribution, the points above the median tend to be farther
from the median in absolute value than points below the median.
Example: The number of years of oral contraceptive (OC) use among a group of women
ages 20 to 29 years.
Negatively Skewed Distribution
• If a distribution is not symmetric because its graph extends further to the left than to
the right, that is, if it has a long tail to the left, then the distribution is skewed to the
left or is negatively skewed.
• In a negatively skewed distribution, the points below the median tend to be farther
from the median in absolute value than points above the median.
Example: Relative humidities observed in a humid climate at the same time of day over
a number of days. In this case, most humidities are at or close to 100%, with a few very
low humidities on dry days.
Skewness
• In many samples, the relationship between the arithmetic mean and the sample
median can be used to assess the symmetry of a distribution.
• For symmetric distributions, the arithmetic mean is approximately the same as the
median.
• For positively skewed distributions, the arithmetic mean tends to be larger than the
median.
• For negatively skewed distributions, the arithmetic mean tends to be smaller than the
median.
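The comparison of mean and median can be sketched as a rough check. The helper name `skew_hint` is hypothetical, and the result is only a hint about the direction of skew, not a formal test; the data are the physician charges and the ordered ages used earlier in this lecture.

```python
from statistics import mean, median

def skew_hint(values):
    """Compare the mean with the median as a rough indicator of skew."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # mean above median: suggests positive skew
    if m < med:
        return "left"    # mean below median: suggests negative skew
    return "none"        # mean equals median: consistent with symmetry

charges = [75, 75, 80, 80, 280]                  # mean 118, median 80
ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]  # mean 56, median 58

print(skew_hint(charges))
print(skew_hint(ages))
```

In practice the mean and median of a symmetric sample are rarely exactly equal, so a small difference should be read as approximate symmetry.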
Properties of the Median
1. Uniqueness
As is true with the mean, there is only one median for a given set of data.
2. Simplicity
The median is easy to calculate.
The Mode
• The mode is the most frequently occurring value among all the observations in a
sample.
• The mode is another widely used measure of location.
• If all the values are different then there is no mode.
• Some distributions have more than one mode.
• In fact, one useful method of classifying distributions is by the number of modes
present.
• A distribution with one mode is called unimodal; two modes, bimodal; three modes,
trimodal; and so forth.
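The rules above (no mode when all values differ, possibly several modes otherwise) can be sketched as follows. The function name `modes` is illustrative, and the ages are the 10 ordered ages from the median example.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or an empty list if
    every value occurs exactly once (no mode)."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
print(modes(ages))             # unimodal sample
print(modes([1, 1, 2, 2, 3]))  # bimodal sample
print(modes([1, 2, 3]))        # no mode
```

The length of the returned list gives the number of modes, which is the basis for calling a sample unimodal, bimodal, and so on.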
The Mode
The mode is 8×1000 = 8000 because it occurs more frequently than any other white
blood count.
The Mode
Example-2: Find the modal age of the subjects whose ages are given in the table.
Table: Ordered Array of Ages of 189 Subjects Who Participated in a Study on Smoking Cessation
• A count of the ages in the table reveals that the age 53 occurs most frequently (17
times).
• The mode for this population of ages is 53.
The Mode
There is no mode of the distribution in the table, because all the values occur exactly
once.
The Mode
• A distribution will be skewed to the right, or positively skewed, if its mean is greater
than its mode.
• A distribution will be skewed to the left, or negatively skewed, if its mean is less than
its mode.
Histograms Illustrating Skewness
Consider the three distributions shown in the figure. Given that the histograms represent
frequency counts, the data can be easily re-created and entered into a statistical
package.
Example: observation of the “No Skew” distribution would yield the following data:
The descriptive statistics for these three distributions are given in the following table.
Statistical Analysis Software Packages
• SPSS
• MINITAB
• SAS
• NCSS
Measures of Spread or Dispersion
• Other terms used synonymously with dispersion include variation, spread, and scatter.
• The dispersion of a set of observations refers to the variety that they exhibit.
• If the values are not all the same, then dispersion is present in the data.
• The amount of dispersion may be small when the values, though different, are close
together.
Measures of Spread or Dispersion
• The figure shows the frequency polygons for two populations that have equal means but
different amounts of variability.
• Population B, which is more variable than population A, is more spread out.
• If the values are widely scattered, the dispersion is greater.
Figure: Two frequency distributions with equal means but different amounts of dispersion.
Measures of Spread or Dispersion
• The figure represents two samples of cholesterol measurements, each on the same
person, but using different measurement techniques.
• The arithmetic means for both samples are the same, i.e., 200 mg/dL.
• Visually, however, the two samples appear radically different.
• This difference lies in the greater variability, or spread, of the Autoanalyzer method
relative to the Microenzymatic method.
The Range
“The range is the difference between the largest and smallest observations/values in a
sample”.
• If we denote the range by R, the largest value by xL, and the smallest value by xS,
then we can compute the range as follows:
R = xL – xS
The Range
Example-1: Find the range in the sample of birthweights given in the table.
Solution:
R=?
xL = 4146
xS = 2069
R = xL – xS
R = 4146 − 2069
R = 2077 g
The Range
Example-3: Compute the range of the ages of the sample subjects in the table.
Solution:
• The youngest subject in the sample is 38 years old and the oldest is 66 years old.
• The range is therefore R = 66 − 38 = 28 years.
The Range
• Here xS and xL are the smallest and largest values in the data set, respectively.
• One disadvantage of the range is that it is very sensitive to extreme observations.
• Another disadvantage of the range is that it depends on the sample size (n); that is,
the larger n is, the larger the range tends to be.
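The Variance (s²)
• The sample variance is the sum of the squared deviations of the observations from the
sample mean, divided by n − 1: s² = Σ(xi − x̄)² / (n − 1).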
Degrees of Freedom
• In computing the variance, the reason for dividing by n − 1 rather than n is the
theoretical consideration referred to as degrees of freedom.
• The reason for choosing n − 1 is that the sum of the deviations of the individual
observations of a sample about the sample mean is always zero, so once n − 1 of the
deviations are known, the last one is determined.
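Both points can be verified numerically. The sketch below uses the 10 ordered ages from the median example and the usual sample variance with divisor n − 1; the variable names are illustrative.

```python
from math import sqrt

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]

n = len(ages)
x_bar = sum(ages) / n                           # sample mean
deviations = [x - x_bar for x in ages]          # deviations about the mean
sum_of_deviations = sum(deviations)             # always zero (up to rounding)
s2 = sum(d ** 2 for d in deviations) / (n - 1)  # sample variance, divisor n - 1
s = sqrt(s2)                                    # sample standard deviation

print(f"mean = {x_bar}, sum of deviations = {sum_of_deviations}")
print(f"variance = {s2}, standard deviation = {s:.2f}")
```

Because the deviations sum to zero, only n − 1 = 9 of them are free to vary, which is why 9 appears in the divisor.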
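The Variance of a Finite Population (σ²)
• When the data are all N values of a finite population, the variance is computed about
the population mean μ and the divisor is N: σ² = Σ(xi − μ)² / N.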
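The Standard Deviation (s)
• The sample standard deviation is the positive square root of the sample variance:
s = √s² = √[Σ(xi − x̄)² / (n − 1)].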
The Standard Deviation (s)
• The standard deviation measures the dispersion or spread about the mean.
• A larger value of s indicates that more variability is present in the data.
• The units of standard deviation are the same as the units of the data.
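The Standard Deviation of a Finite Population (σ)
• The standard deviation of a finite population is the positive square root of the
population variance: σ = √σ².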
The Standard Deviation (s)
Example-1: Compute the variance and standard deviation for the Autoanalyzer and
Microenzymatic method data in the figure.
Solution:
Autoanalyzer Method
Microenzymatic Method
Thus, the Autoanalyzer method has a standard deviation roughly three times as large as that
of the Microenzymatic method.
The Standard Deviation (s)
Example-2: Compute the variance and standard deviation of the birthweight data in
the table in both grams and ounces.
• Thus, if the sample points change in scale by a factor of c, the variance changes by a
factor of c² and the standard deviation changes by a factor of c.
• This relationship is the main reason why the standard deviation is more often used
than the variance as a measure of spread.
• The standard deviation and the arithmetic mean are in the same units, whereas the
variance and the arithmetic mean are not.
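The scaling rule can be checked with any change of units. As the birthweight table is not reproduced here, the sketch below instead converts the 10 ages from the median example from years to months (c = 12); this choice of data and factor is an assumption made purely for illustration.

```python
from statistics import variance, stdev

ages_years = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]

c = 12                                     # scale factor: years to months
ages_months = [c * x for x in ages_years]

var_ratio = variance(ages_months) / variance(ages_years)  # expected c ** 2
sd_ratio = stdev(ages_months) / stdev(ages_years)          # expected c

print(f"variance ratio = {var_ratio}, standard deviation ratio = {sd_ratio}")
```

The variance is multiplied by 144 and the standard deviation by 12, so only the standard deviation stays in the same units as the rescaled data.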
The Coefficient of Variation (CV)
• The standard deviation is useful as a measure of variation within a given set of data.
• When one desires to compare the dispersion in two sets of data, however, comparing the
two standard deviations may lead to fallacious results.
• It may be that the two variables involved are measured in different units.
• For example: we may wish to know, for a certain population, whether serum cholesterol
levels, measured in milligrams per 100 ml, are more variable than body weight, measured
in pounds.
• Even when the same unit of measurement is used, the two means may be quite different.
• For example: If we compare the standard deviation of weights of first-grade children with
the standard deviation of weights of high school freshmen, we may find that the latter
standard deviation is numerically larger than the former, because the weights themselves
are larger, not because the dispersion is greater.
• Also, it is useful to relate the arithmetic mean and the standard deviation to each other.
• For example: a standard deviation of 10 means something different conceptually if the
arithmetic mean is 10 than if it is 1000.
The Coefficient of Variation (CV)
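• The coefficient of variation expresses the standard deviation as a percentage of the
mean: CV = (s / x̄) × 100%.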
The Coefficient of Variation (CV)
Example-1: Suppose two samples of human males yield the results shown in the
table. Determine which is more variable, the weights of the 25-year-olds or the weights
of the 11-year-olds.
• A comparison of the standard deviations might suggest that the two samples possess equal
variability.
• However, the coefficients of variation show that relative variation is much higher in the
sample of 11-year-olds than in the sample of 25-year-olds.
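The comparison can be sketched in Python. The table for this example is not reproduced here, so the means and standard deviations below are hypothetical values chosen only so that the two standard deviations are equal, as in the example; the function name is also illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

# Hypothetical summary statistics (weights in pounds), NOT taken from the table.
mean_25, sd_25 = 145, 10   # 25-year-olds
mean_11, sd_11 = 80, 10    # 11-year-olds

cv_25 = coefficient_of_variation(sd_25, mean_25)
cv_11 = coefficient_of_variation(sd_11, mean_11)

print(f"CV, 25-year-olds: {cv_25:.1f}%")
print(f"CV, 11-year-olds: {cv_11:.1f}%")
```

With equal standard deviations, the group with the smaller mean has the larger coefficient of variation. Because the CV is a ratio of two quantities in the same units, it is unchanged by a change of units such as grams to ounces.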
The Coefficient of Variation (CV)
Example-2: Compute the coefficient of variation for the data in the table when the
birthweights are expressed in either grams or ounces.
Percentiles
• Percentiles are values that divide a set of observations into 100 equal parts, so there are
99 percentiles in total.
• Percentiles have the advantage over the range of being less sensitive to outliers and
of not being greatly affected by the sample size (n).
“Given a set of n observations, x1, x2, x3, … , xn, the pth percentile P is the value of X
such that p percent or less of the observations are less than P and (100 - p) percent or
less of the observations are greater than P”.
Percentiles
• The 70th percentile (P70) is the value below which about 70% of the values lie, while
about 30% of the values lie above it.
• The median (P50) separates the lower 50% of the values from the upper 50% in a data set.
Percentiles
• The 25th percentile is often referred to as the first quartile or lower quartile and
denoted as Q1. One-quarter of the data lie at or below it.
• The 50th percentile (the median) is referred to as the second or middle quartile and
written as Q2. Half of the data lie at or below it.
• The 75th percentile is referred to as the third quartile or upper quartile and denoted as
Q3. Three-quarters of the data lie at or below it.
• The quartiles for a set of data are calculated using the following formulas:
Percentiles
Example-1: Compute the 10th and 90th percentiles for the birthweight data in the table.
Solution: The first step is to arrange data from the smallest value to the largest value.
• Here np/100 = 20 × 0.1 = 2 and np/100 = 20 × 0.9 = 18 are both integers.
• Therefore, the 10th and 90th percentiles are defined by:
• 10th percentile: average of the second and third largest values = (2581 + 2759)/2 = 2670 g
• 90th percentile: average of the 18th and 19th largest values = (3609 + 3649)/2 = 3629 g
• We would estimate that 80% of birthweights will fall between 2670 g and 3629 g, which
gives an overall impression of the spread of the distribution.
Percentiles
Example-2: Compute the 20th percentile for the white-blood-count data in the table.
Solution:
The first step is to arrange data from the smallest value to the largest value.
• Here np/100 = 9 × 0.2 = 1.8 is not an integer.
• Therefore, the 20th percentile is defined by the (k + 1)th largest value, where k = 1 is
the largest integer less than 1.8.
• Hence, the 20th percentile is the (1 + 1)th = second largest value, which is 5000.
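The rule used in the two examples (average two neighbouring values when np/100 is an integer, otherwise take the next value up) can be sketched as below. The function name and the restriction to whole-number p are assumptions for this illustration, and the data are the 10 ordered ages from the median example rather than the tables referenced above.

```python
def percentile(values, p):
    """pth percentile using the rule from the examples.

    Let k = n*p/100. If k is an integer, average the kth and (k+1)th
    ordered values; otherwise take the value at position floor(k) + 1.
    p is assumed to be a whole number with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)   # integer arithmetic avoids rounding issues
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]                    # 0-based index k is position k + 1

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
print(percentile(ages, 10))  # 10 * 10 / 100 = 1 is an integer
print(percentile(ages, 25))  # 10 * 25 / 100 = 2.5 is not an integer
print(percentile(ages, 50))  # agrees with the sample median
```

Statistical packages use several slightly different percentile conventions, so results from software may differ a little from this rule for small samples.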
Interquartile Range (IQR)
• The range provides a crude measure of the variability present in a set of data.
• A disadvantage of the range is the fact that it is computed from only two values, the
largest and the smallest.
• A similar measure that reflects the variability among the middle 50 percent of the
observations in a data set is the interquartile range.
• The interquartile range (IQR) is the difference between the third and first quartiles.
Interquartile Range
• A large IQR indicates a large amount of variability among the middle 50 percent of the
relevant observations.
• A small IQR indicates a small amount of variability among the relevant observations.
• It is more informative to compare the interquartile range with the range for the entire data
set.
• A comparison may be made by forming the ratio of the IQR to the range (R) and
multiplying by 100; that is, 100(IQR/R) tells us what percentage of the overall range the
IQR covers.
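The IQR and the ratio 100(IQR/R) can be sketched as follows. The quartile positions (n + 1)/4 and 3(n + 1)/4 with linear interpolation are one common convention and are an assumption here, since the lecture's quartile formulas are not reproduced in the text; the helper name `value_at` is illustrative, and the data are the 10 ordered ages used earlier.

```python
def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in a sorted list,
    using linear interpolation between neighbouring values."""
    position = min(max(position, 1), len(ordered))  # keep position in range
    lower = int(position)
    frac = position - lower
    if lower == len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

ages = sorted([38, 43, 50, 57, 57, 59, 61, 64, 65, 66])
n = len(ages)

q1 = value_at(ages, (n + 1) / 4)       # assumed position of the first quartile
q3 = value_at(ages, 3 * (n + 1) / 4)   # assumed position of the third quartile
iqr = q3 - q1
data_range = ages[-1] - ages[0]
percent_of_range = 100 * iqr / data_range

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"IQR is {percent_of_range:.1f}% of the range")
```

For these ages the middle half of the observations spans a little over half of the overall range.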
Outliers or Outlying Values
Outliers are unusually large or unusually small values of x in a data set. A value x is
commonly treated as an outlier if either
1) x > upper quartile (Q3) + 1.5 × (upper quartile (Q3) − lower quartile (Q1)) or
2) x < lower quartile (Q1) − 1.5 × (upper quartile (Q3) − lower quartile (Q1))
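The two conditions amount to a lower and an upper cut-off. The sketch below applies them to the approximate summary values quoted in the ground reaction force example later in this lecture (smallest 14.6, Q1 about 27, median about 31, Q3 about 33, largest 44); the function names are illustrative, and because the quartiles are approximate the result is indicative only.

```python
def outlier_limits(q1, q3, k=1.5):
    """Lower and upper cut-offs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values lying strictly below the lower or above the upper cut-off."""
    low, high = outlier_limits(q1, q3)
    return [x for x in values if x < low or x > high]

# Approximate five-number summary from the GRF example in this lecture.
q1, q3 = 27, 33
summary_values = [14.6, 27, 31, 33, 44]

limits = outlier_limits(q1, q3)
flagged = find_outliers(summary_values, q1, q3)
print(f"limits = {limits}, flagged = {flagged}")
```

With these approximate quartiles the cut-offs are 18 and 42, so both the smallest and the largest measurements would be flagged.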
Outliers or Outlying Values
•
Outliers or Outlying Values
•
Kurtosis
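• Kurtosis describes how peaked or flat a distribution is in comparison to a normal
distribution.
• A normal, or bell-shaped, distribution is said to be mesokurtic, and a distribution whose
graph is flatter than the normal is said to be platykurtic.
• A distribution may possess a smaller proportion of observations in its tails, so that its
graph exhibits a more peaked appearance. Such a distribution is said to be leptokurtic.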
Kurtosis
• Most computer algorithms subtract 3 from the measure, as is done in the equation, so that the
kurtosis measure of a mesokurtic distribution will be equal to 0.
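The subtraction of 3 can be illustrated with a moment-based kurtosis measure. The lecture's own equation is not reproduced in the text, so the formula below (fourth central moment divided by the squared second central moment, minus 3) is an assumed, commonly used definition, and the two small data sets are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [0, 0, 0, 0, 10, -10, 0, 0, 0, 0]   # mostly central, two extreme values

print(excess_kurtosis(flat))
print(excess_kurtosis(peaked))
```

A negative result corresponds to a platykurtic shape and a positive result to a leptokurtic one. Statistical packages often apply small-sample corrections, so their output may differ slightly from this simple version.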
Graphs of distributions representing the three types of kurtosis are shown in the figure.
The descriptive statistics of these three distributions are shown in the following table.
The Ordered Array
• An ordered array enables one to determine quickly the value of the smallest
measurement, the value of the largest measurement, and other facts about the
arrayed data that might be needed in a hurry.
The Ordered Array
• The table below presents the data in the form of an ordered array.
• Using this ordered array table, we are able to determine quickly the age of the youngest
subject (30) and the age of the oldest subject (82).
• We also readily note that about one-third of the subjects are 50 years of age or younger.
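Grouped Data: The Frequency Distribution
• Grouping the data provides a better overall picture of the unknown population.
• Important characteristics of a large data set can be easily assessed by first grouping
the data into different class intervals and then determining the number of
observations that fall in each of the class intervals.
• To group a set of observations, we select a set of contiguous, nonoverlapping intervals
such that each value can be placed in one, and only one, of them.
• These intervals are usually referred to as class intervals (also known as bins).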
Frequency Distribution
• Class intervals are units of equal width that include all the data from the lowest value
to the highest value.
• The number of entries in each class interval is counted to give the frequency.
• Data from a frequency distribution can be used to make graphs including the
histogram and the frequency polygon.
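Building a frequency distribution can be sketched as below. The function name, the starting value of 30, and the class width of 10 are assumptions for illustration; the data are the 10 ordered ages from the median example rather than the 189 ages in the table, and integer-valued data are assumed when the upper class limits are formed.

```python
def frequency_distribution(values, start, width, n_classes):
    """Count how many values fall in each of n_classes equal-width
    class intervals beginning at start. Returns (lower, upper, count)
    tuples; upper limits assume integer-valued data."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the class intervals")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width - 1, c)
            for i, c in enumerate(counts)]

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
table = frequency_distribution(ages, start=30, width=10, n_classes=4)

for lower, upper, count in table:
    print(f"{lower}-{upper}: {'*' * count} ({count})")
```

The printed rows of asterisks give a rough text version of the histogram that would be drawn from the same frequencies.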
Frequency Distribution (Example-1)
TABLE: Ordered Array of Ages of Subjects
Graphic Methods
• In this section, certain commonly used graphic methods for displaying data are
presented.
• The purpose of using graphic displays is to give a quick overall impression of data,
which is sometimes difficult to obtain with numeric measures.
i. Histogram
ii. Frequency Polygon
iii. Stem-and-Leaf Displays
iv. Box-and-Whisker Plots
Histogram
• In a histogram, the values of the variable under consideration are represented by the
horizontal axis, while the vertical axis has as its scale the frequency of occurrence.
• Above each class interval on the horizontal axis a rectangular bar, or cell, is erected.
• In a histogram, the bars must touch to indicate that there are no data in the data set
that are missing from the histogram.
Histogram (Example-1)
TABLE: Ordered Array of Ages of Subjects
Figure: Frequency polygon for the ages of 189 subjects.
Figure: Histogram and frequency polygon for the ages of 189 subjects.
Stem-and-Leaf Displays
• A quick way to obtain an informative visual representation of the data set is to construct
a stem-and-leaf display.
• An advantage of the stem-and-leaf display over the histogram is the fact that it preserves
the information contained in the individual measurements. Such information is lost when
measurements are assigned to the class intervals of a histogram.
• Another advantage of stem-and-leaf displays is the fact that they can be constructed
during the tallying process, so the intermediate step of preparing an ordered array is
eliminated.
Stem-and-Leaf Displays
• In some stem-and-leaf plots the leaf can consist of more than one digit.
• For four-digit values, for example, the leaf would consist of the rightmost two digits
and the stem of the leftmost two digits.
• The pairs of digits to the right of the vertical bar would be underlined to
distinguish between two different leaves.
• The stem-and-leaf display for the data in the table is shown in the figure.
Stem-and-Leaf Displays (Example-3)
• The point 5|8 represents 58, 11|8 represents 118, and so forth.
• Notice how this plot gives an overall feel for the distribution
without losing the individual values.
• Also, the cumulative frequency count from either the lowest or the
highest value is given in the first column.
• For the 11 stem, the absolute count (17) is given in parentheses instead of the
cumulative total, because the cumulative count from either the lowest or the
highest value would exceed 50% of the observations (50).
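A basic stem-and-leaf display with single-digit leaves can be sketched as follows. The function name is illustrative, non-negative whole numbers are assumed, the cumulative count column described above is omitted, and the data are the 10 ordered ages from the median example rather than the data in the figure.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display in which the last
    digit is the leaf and the remaining digits form the stem.
    Assumes non-negative whole numbers."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    for stem in range(min(groups), max(groups) + 1):  # keep empty stems
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
display = stem_and_leaf(ages)
print("\n".join(display))
```

Each individual age can be read back from the display (for example, stem 5 with leaves 0779 gives 50, 57, 57, 59), which is the information a histogram loses.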
Box-and-Whisker Plots
• The construction of a Box-and-Whisker plot (or boxplot) makes use of the quartiles
of a data set and may be accomplished by following these five steps:
1. Represent the variable of interest on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way that the left end of
the box aligns with the first quartile Q1 and the right end of the box aligns with the
third quartile Q3.
3. Divide the box into two parts by a vertical line that aligns with the median.
4. Draw a horizontal line called a whisker from the left end of the box to a point that
aligns with the smallest measurement in the data set.
5. Draw another horizontal line, or whisker, from the right end of the box to a point that
aligns with the largest measurement in the data set.
Evans et al. examined the effect of velocity on ground reaction forces (GRF) in dogs with
lameness (a condition in which the animal fails to travel in a regular and sound manner
on all four feet) from a torn cranial cruciate ligament (disease). The dogs were walked
and trotted (run) over a force platform, and the GRF was recorded during a certain phase
of their performance. The table below contains 20 measurements of force where
each value shown is the mean of five force measurements per dog when trotting.
Box-and-Whisker Plots (Example-1)
• The smallest and largest measurements are 14.6 and 44, respectively.
• Examination of the figure reveals that 50 percent of the measurements are between about
27 and 33, the approximate values of the first and third quartiles, respectively.
• The vertical bar inside the box shows that the median is about 31.
Figure: Box-and-whisker plot showing the smallest value, Q1, Q2 (median), Q3, and the largest value, with the IQR and the range marked.
Box-and-Whisker Plots (Example-2)
Nicotine content was measured in a random sample of 40 cigarettes. The data are
displayed in the table.
Box-and-Whisker Plots (Example-2)