Lecture-2 Descriptive Biostatistics


Probability & Biostatistics

Dr. Oyebayo Ridwan Olaniran, PhD
Olaniran.or@unilorin.edu.ng

Course Outline

• Introduction to Biostatistics
• Descriptive Biostatistics
• Probability
• Discrete Probability Distributions
• Continuous Probability Distributions
Lecture Outline

• Descriptive Measures
• Measures of Central Tendency
  - The Mean
  - The Median
  - The Mode
  - Data Distribution (symmetric and skewed distribution)
• Measures of Dispersion
  - The Range
  - The Variance
  - The Standard Deviation
  - The Coefficient of Variation
  - The Percentiles
  - The Interquartile Range
  - Outliers
  - Kurtosis
• Grouped Data: The Frequency Distribution
• Graphic Methods
Descriptive Biostatistics

• The best way to work with data is to summarize and organize them.

• Measurements that have not been organized, summarized, or otherwise manipulated
are called raw data.
Descriptive Measures

• Data can be summarized by means of a single number called a descriptive
measure.

• Descriptive measures may be computed from the data of a sample or the data of a
population.

• A descriptive measure computed from the data of a sample is called a statistic.

• A descriptive measure computed from the data of a population is called a parameter.


Types of Descriptive Measures

Several types of descriptive measures can be computed from a set of data. However,
the two most important types are:

1. Measures of Central Tendency


2. Measures of Dispersion
Measures of Central Tendency

• A measure of central tendency, or measure of location, is a type of measure useful
for summarizing data that defines the center, or middle, of the sample.

• Measures of central tendency convey information regarding the average value of a
set of values.

• The three most commonly used measures of central tendency are;

1. Mean
2. Median
3. Mode
The Mean

The three types of mean are;

1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean

• The most familiar measure of central tendency is the arithmetic mean.

• Since geometric and harmonic means are not covered in this lecture, the arithmetic
mean is simply referred to as the mean.
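The Arithmetic Mean

The arithmetic mean is the sum of all the observations divided by the number of
observations.

The Sample Mean

For a sample of n observations x1, x2, … , xn:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The Finite Population Mean

For a finite population of N values x1, x2, … , xN:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
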
The Arithmetic Mean

Example: What is the arithmetic mean for the sample of birth-weights in the table?
The Arithmetic Mean

Limitations:

• The arithmetic mean is, in general, a very natural measure of location.

• One of its main limitations, however, is that it is oversensitive to extreme values.

• In this instance, it may not be representative of the location of the great majority of
sample points.
The Arithmetic Mean

Example-1: The Arithmetic Mean Limitation

• If the first infant in the Table happened to be a premature infant weighing 500 g
rather than 3265 g, then the arithmetic mean of the sample would fall to 3028.7 g.

• In this instance, 7 of the birth-weights would be lower than the arithmetic mean, and
13 would be higher than the arithmetic mean.

• It is possible in extreme cases for all but one of the sample points to be on one side
of the arithmetic mean.

• In these types of samples, the arithmetic mean is a poor measure of central location
because it does not reflect the center of the sample.

• Nevertheless, the arithmetic mean is by far the most widely used measure of central
location.
The Arithmetic Mean

Example-2: The Arithmetic Mean Limitation

• Suppose the five physicians who practice in an area are surveyed to determine their
charges for a certain procedure.
• Assume that they report these charges: $75, $75, $80, $80, and $280.
• The mean charge for the five physicians is found to be $118, a value that is not very
representative of the set of data as a whole.
• The single atypical value had the effect of inflating the mean.
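
The two limitation examples above can be checked with a short sketch. It assumes the 20 birth-weights are the ones listed in the median example of this lecture, and that the "first infant" is the one weighing 3265 g; the variable names are illustrative only.

```python
from statistics import mean

# Birth-weights (g), taken from the ordered list in the median example.
birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

# Replace the 3265 g infant with a premature infant weighing 500 g.
with_premature = [500 if w == 3265 else w for w in birthweights]

# Physician charges (dollars) from Example-2.
charges = [75, 75, 80, 80, 280]

mean_original = mean(birthweights)
mean_premature = mean(with_premature)
mean_charges = mean(charges)

# Count how many points fall on each side of the distorted mean.
below = sum(w < mean_premature for w in with_premature)
above = sum(w > mean_premature for w in with_premature)

print(mean_original, mean_premature, mean_charges, below, above)
```

A single changed value moves the mean by more than 100 g, and the split of 7 points below and 13 above matches the counts quoted in Example-1.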
Properties of the Mean

1. Uniqueness

For a given set of data there is one and only one mean.

2. Simplicity

The mean is easy to calculate.

3. Affected by Extreme Values

The mean is influenced by each value. Therefore, extreme values can distort the mean.
The Median

• An alternative measure of location is the median or, more precisely, the sample
median.

• The median of a finite set of values is that value which divides the set into two equal
parts.

• The median is defined differently depending on whether n is even or odd.

• Samples with an odd sample size have a unique central point, when all values have
been arranged in order of magnitude.
• Example: For samples of size 7, the fourth largest point is the central point in the
sense that 3 points are smaller than it and 3 points are larger.

• Samples with an even sample size have no unique central point, and the middle two
values must be averaged, when all values have been arranged in the order of their
magnitudes.
• Example: For samples of size 8 the fourth and fifth largest points would be averaged
to obtain the median, because neither is the central point.
The Median

Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the median is defined as follows:

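The sample median is:

1. The ((n + 1)/2)th largest observation if n is odd.
2. The average of the (n/2)th and (n/2 + 1)th largest observations if n is even.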

The Median

Example-1: Compute the sample median for the sample in the given table.

First, arrange the sample in ascending order:


2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314,
3323, 3484, 3541, 3609, 3649, 4146
Because n is even,
Sample median = average of the 10th and 11th largest observations
Sample median = (3245 + 3248)/2 = 3246.5 g
The Median

Example-2: The data set in the table consists of white-blood counts taken on
admission of all patients entering a small hospital in Allentown, Pennsylvania, on a
given day. Compute the median white-blood count.

First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35.

Because n is odd,

The sample median is given by the fifth largest point, which equals 8 or 8000 on the
original scale.
The Median

Strength:

• The main strength of the sample median is that it is insensitive to very large or very
small values.

• In particular, if the second patient in the table of Example-2 had a white count of 65,000
rather than 35,000, the sample median would remain unchanged, because the fifth
largest value is still 8000.

• Conversely, the arithmetic mean would increase dramatically from 10,778 in the
original sample to 14,111 in the new sample.

Weakness:

• The main weakness of the sample median is that it is determined mainly by the
middle points in a sample and is less sensitive to the actual numeric values of the
remaining data points.
The Median

Example-3: A simple random sample of 10 subjects from the population of subjects
is shown in the table. Find the median age of the subjects.

Table: Sample of 10 Ages Drawn from the Ages of a population

• Arraying the 10 ages in order of magnitude from smallest to largest gives 38, 43, 50,
57, 57, 59, 61, 64, 65, 66.
• Since we have an even number of ages, there is no middle value. The two middle
values, however, are 57 and 59.
• The sample median is (57 + 59)/2 = 58.
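
As a minimal sketch, the odd/even rule for the median can be written as one small function and applied to the three examples above. The function name `sample_median` is an illustrative choice, and white-blood counts are kept in units of 1000 as in Example-2.

```python
def sample_median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]          # white-blood counts (x1000)
ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]

# Robustness check: change the extreme count from 35 to 65 (x1000).
wbc_extreme = [65 if c == 35 else c for c in wbc]

median_bw = sample_median(birthweights)
median_wbc = sample_median(wbc)
median_ages = sample_median(ages)
median_wbc_extreme = sample_median(wbc_extreme)
mean_wbc = sum(wbc) / len(wbc)
mean_wbc_extreme = sum(wbc_extreme) / len(wbc_extreme)

print(median_bw, median_wbc, median_ages)
print(median_wbc_extreme, round(mean_wbc, 3), round(mean_wbc_extreme, 3))
```

The median of the white-blood counts is the same before and after the change, while the mean rises from about 10.8 to about 14.1 (in thousands), as stated under "Strength".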
Data Distributions

• Data distributions may be classified on the basis of whether they are symmetric or
asymmetric.

• If a distribution is symmetric, the left half of its graph will be a mirror image of its right
half.

• When the left half and right half of the graph of a distribution are not mirror images of each
other, the distribution is asymmetric.

• If the graph of a distribution is asymmetric, then the distribution is said to be skewed.


Symmetric Distribution

In a symmetric distribution, the relative position of the points on each side of the sample
median is the same.

Example: A distribution that is expected to be roughly symmetric is the distribution of
systolic blood-pressure measurements taken on all 30- to 39-year-old factory workers in
a given workplace.
Positively Skewed Distribution

• If a distribution is not symmetric because its graph extends further to the right than to
the left, that is, if it has a long tail to the right, then the distribution is skewed to the
right or is positively skewed.

• In a positively skewed distribution, the points above the median tend to be farther
from the median in absolute value than points below the median.

Example: The number of years of oral contraceptive (OC) use among a group of women
ages 20 to 29 years.
Negatively Skewed Distribution
• If a distribution is not symmetric because its graph extends further to the left than to
the right, that is, if it has a long tail to the left, then the distribution is skewed to the
left or is negatively skewed.

• In a negatively skewed distribution, the points below the median tend to be farther
from the median in absolute value than points above the median.

Example: Relative humidities observed in a humid climate at the same time of day over
a number of days. In this case, most humidities are at or close to 100%, with a few very
low humidities on dry days.
Skewness

Skewness can be expressed as follows:

Where s is the standard deviation of a sample.

• A value of skewness > 0 indicates positive skewness.


• A value of skewness < 0 indicates negative skewness.
Relationship between the Arithmetic Mean and the Median

• In many samples, the relationship between the arithmetic mean and the sample
median can be used to assess the symmetry of a distribution.

• For symmetric distributions, the arithmetic mean is approximately the same as the
median.

• For positively skewed distributions, the arithmetic mean tends to be larger than the
median.

• For negatively skewed distributions, the arithmetic mean tends to be smaller than the
median.
Properties of the Median

Properties of the median include the following:

1. Uniqueness
As is true with the mean, there is only one median for a given set of data.

2. Simplicity
The median is easy to calculate.

3. Less Affected by Extreme Values
It is not as drastically affected by extreme values as is the mean.
The Mode

• The mode is the most frequently occurring value among all the observations in a
sample.
• The mode is another widely used measure of location.
• If all the values are different then there is no mode.
• Some distributions have more than one mode.
• In fact, one useful method of classifying distributions is by the number of modes
present.
• A distribution with one mode is called unimodal; two modes, bimodal; three modes,
trimodal; and so forth.
The Mode

Example-1: Compute the mode of the distribution in the table.

The mode is 8×1000 = 8000 because it occurs more frequently than any other white
blood count.
The Mode

Example-2: Find the modal age of the subjects whose ages are given in the table.

Table: Ordered Array of Ages of 189 Subjects Who Participated in a Study on Smoking Cessation

• A count of the ages in the table reveals that the age 53 occurs most frequently (17
times).
• The mode for this population of ages is 53.
The Mode

Example-3: Compute the mode of the distribution in the table.

There is no mode of the distribution in the table, because all the values occur exactly
once.
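
The mode rules above (most frequent value, possibly several modes, no mode when every value occurs once) can be sketched as follows. The helper name `modes` and the two short lists `distinct_values` and `bimodal_values` are hypothetical illustrations, not data from the tables.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s) in ascending order.

    An empty list means there is no mode, i.e. every value occurs exactly
    once (the convention used in these slides).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]      # white-blood counts (x1000)
distinct_values = [38, 43, 50, 59, 61]    # hypothetical: all values differ
bimodal_values = [1, 1, 2, 2, 3]          # hypothetical: two modes

mode_wbc = modes(wbc)
mode_distinct = modes(distinct_values)
mode_bimodal = modes(bimodal_values)

print(mode_wbc, mode_distinct, mode_bimodal)
```

The length of the returned list gives the classification directly: one entry is unimodal, two is bimodal, and so on.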
The Mode

• A distribution will be skewed to the right, or positively skewed, if its mean is greater
than its mode.

• A distribution will be skewed to the left, or negatively skewed, if its mean is less than
its mode.
Histograms Illustrating Skewness

Consider the three distributions shown in the figure. Given that the histograms represent
frequency counts, the data can be easily re-created and entered into a statistical
package.

Example: Observation of the “No Skew” distribution would yield the following data:

5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11.

Values can be obtained from the skewed distributions in a similar fashion.
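
The “No Skew” data above can be entered directly into code. The sketch below assumes one common moment-based definition of skewness, sqrt(n) Σ(x − x̄)³ / (Σ(x − x̄)²)^(3/2); the skewness equation on the slide is not reproduced in this text, so this formula is an assumption rather than the lecture's own. The physician charges from the mean example serve as a right-skewed comparison.

```python
from math import sqrt
from statistics import mean, median

def skewness(values):
    """Moment-based skewness (assumed definition, see lead-in)."""
    n = len(values)
    m = mean(values)
    sum_sq = sum((x - m) ** 2 for x in values)     # sum of squared deviations
    sum_cube = sum((x - m) ** 3 for x in values)   # sum of cubed deviations
    return sqrt(n) * sum_cube / sum_sq ** 1.5

no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]
charges = [75, 75, 80, 80, 280]   # one large value creates a long right tail

skew_symmetric = skewness(no_skew)
skew_charges = skewness(charges)

print(mean(no_skew), median(no_skew), skew_symmetric)
print(mean(charges), median(charges), skew_charges > 0)
```

For the symmetric data the mean and median coincide and the cubed deviations cancel; for the charges the mean exceeds the median and the skewness is positive, in line with the relationship between the mean and the median described earlier.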


Histograms Illustrating Skewness

The descriptive statistics for these three distributions are given in the following table.
Statistical Analysis Software Packages

Widely used statistical analysis software packages include:

• SPSS

• MINITAB

• SAS

• NCSS
Measures of Spread or Dispersion

• Other terms used synonymously with dispersion include variation, spread, and scatter.

• The dispersion of a set of observations refers to the variety that they exhibit.

• A measure of dispersion conveys information regarding the amount of variability
present in a set of data.

• If all the values are the same, then there is no dispersion.

• If the values are not all the same, then dispersion is present in the data.

• The amount of dispersion may be small when the values, though different, are close
together.
Measures of Spread or Dispersion

• The figure shows the frequency polygons for two populations that have equal means but
different amounts of variability.
• Population B, which is more variable than population A, is more spread out.
• If the values are widely scattered, the dispersion is greater.

Figure: Two frequency distributions with equal means but different amounts of dispersion.
Measures of Spread or Dispersion

• The figure represents two samples of cholesterol measurements, each on the same
person, but using different measurement techniques.
• The arithmetic means for both samples are the same, i.e., 200 mg/dL.
• Visually, however, the two samples appear radically different.
• This difference lies in the greater variability, or spread, of the Autoanalyzer method
relative to the Microenzymatic method.
The Range

• Several different measures can be used to describe the variability of a sample.


• Perhaps the simplest measure is the range.

“The range is the difference between the largest and smallest observations/values in a
sample”.

• If we denote the range by R, the largest value by xL, and the smallest value by xS,
then we can compute the range as follows:

R = xL – xS
The Range

Example-1: Find the range in the sample of birthweights given in the table.

Solution:
R=?
xL = 4146
xS = 2069

R = xL – xS
R = 4146 − 2069
R = 2077 g
The Range

Example-2: Compute the ranges for the Autoanalyzer- and Microenzymatic-method
data in the figure and compare the variability of the two methods.

Solution:

• The range for the Autoanalyzer method = 226 − 177 = 49 mg/dL.


• The range for the Microenzymatic method = 209 - 192 = 17 mg/dL.
• The Autoanalyzer method clearly seems more variable.
The Range

Example-3: Compute the range of the ages of the sample subjects in the table.

Table: Sample of 10 Ages Drawn from the Ages population

Solution:
• The youngest subject in the sample is 38 years old and the oldest is 66 years old.
• The range is R = 66 - 38 = 28 years.
The Range

• The usefulness of the range is limited.


• The fact that it takes into account only two values causes it to be a poor measure of
dispersion.
• The main advantage in using the range is the simplicity of its computation.
• The range, expressed as a single measure, imparts minimal information about a
data set and is therefore of limited use.
• It is often preferable to express the range as a number pair, [xS, xL], in which xS and
xL are the smallest and largest values in the data set, respectively.
• One disadvantage of the range is that it is very sensitive to extreme observations.
• Another disadvantage of the range is that it depends on the sample size (n); that is,
the larger n is, the larger the range tends to be.
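
A small sketch of the range, returned both as the single number R and as the pair [xS, xL], using the birth-weight and age samples from the examples in this lecture. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Return (R, (xS, xL)): the range and the smallest/largest pair."""
    x_s, x_l = min(values), max(values)
    return x_l - x_s, (x_s, x_l)

birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]

range_bw, pair_bw = value_range(birthweights)
range_ages, pair_ages = value_range(ages)

print(range_bw, pair_bw)
print(range_ages, pair_ages)
```

Only the two extreme values enter the calculation, which is why one unusual observation changes R immediately.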
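The Variance (s²)

The sample variance is the sum of the squared deviations of the observations from the
sample mean, divided by n − 1:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
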
Degrees of Freedom

• In computing the variance, the reason for dividing by n − 1 rather than n is the
theoretical consideration referred to as degrees of freedom.

• n − 1 is called the degrees of freedom of the variance or SD.

• The reason for choosing n − 1 is that the sum of the deviations of the individual
observations of a sample about the sample mean is always zero.

• Hence, if n − 1 of the deviations are known, the nth one is determined automatically.


The Variance (s²)

Example: A simple random sample of 10 subjects is drawn from the population of subjects
represented in the table. Compute the variance of the ages of the subjects in the
sample.
Table: Sample of 10 Ages Drawn from the Ages of a population

Solution:
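
The worked solution is not reproduced in this text. A minimal sketch of the calculation, assuming the sample is the same 10 ages listed in the median and range examples (38, 43, … , 66), is:

```python
from math import sqrt

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]
n = len(ages)

x_bar = sum(ages) / n                      # sample mean
deviations = [x - x_bar for x in ages]     # x_i minus the mean

# The deviations always sum to zero, so only n - 1 of them are free to vary.
sum_of_deviations = sum(deviations)

# Sample variance: squared deviations divided by the degrees of freedom.
variance = sum(d ** 2 for d in deviations) / (n - 1)
sd = sqrt(variance)

print(x_bar, sum_of_deviations, variance, round(sd, 2))
```

Under that assumption the mean is 56 years, the squared deviations sum to 810, and dividing by n − 1 = 9 gives a variance of 90 (years squared).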
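The Variance of a Finite Population (σ²)

When the values of all N members of a finite population are known, the deviations are
taken about the population mean μ and the divisor is N:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
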
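The Standard Deviation (s)

The sample standard deviation is the positive square root of the sample variance:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
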
The Standard Deviation (s)

• The standard deviation is often abbreviated as SD or sd.

• The standard deviation measures the dispersion or spread about the mean.

• A larger value of s indicates that more variability is present in the data.

• The standard deviation equals zero if there is no spread.

• The units of standard deviation are the same as the units of the data.
The Standard Deviation of a Finite Population (σ)

$$\sigma = \sqrt{\sigma^2}$$

The Standard Deviation (s)
Example-1: Compute the variance and standard deviation for the Autoanalyzer and
Microenzymatic method data in the figure.

Solution:
Autoanalyzer Method

Microenzymatic Method

Thus, the Autoanalyzer method has a standard deviation roughly three times as large as that
of the Microenzymatic method.
The Standard Deviation (s)
Example-2: Compute the variance and standard deviation of the birthweight data in
the table in both grams and ounces.

The Variance and Standard Deviation in Grams


The Standard Deviation (s)
Example-2 (continued)

The Variance and Standard Deviation in Ounces

• Thus, if the sample points change in scale by a factor of c, the variance changes by a
factor of c² and the standard deviation changes by a factor of c.

• This relationship is the main reason why the standard deviation is more often used
than the variance as a measure of spread.

• The standard deviation and the arithmetic mean are in the same units, whereas the
variance and the arithmetic mean are not.
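
The scaling rule can be checked numerically. This sketch assumes the birth-weights listed in the median example and a conversion factor of 28.35 g per ounce; the factor and the names below are assumptions for illustration, since the slide's own calculation is not reproduced in this text.

```python
from math import isclose
from statistics import stdev, variance

G_PER_OZ = 28.35   # assumed conversion factor (grams per ounce)

grams = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
         3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
ounces = [w / G_PER_OZ for w in grams]

var_g, sd_g = variance(grams), stdev(grams)      # sample variance and SD (n - 1)
var_oz, sd_oz = variance(ounces), stdev(ounces)

c = 1 / G_PER_OZ   # scale factor applied to every sample point

# Variance scales by c squared, standard deviation by c.
variance_scales_by_c2 = isclose(var_oz, c ** 2 * var_g, rel_tol=1e-9)
sd_scales_by_c = isclose(sd_oz, c * sd_g, rel_tol=1e-9)

print(round(sd_g, 1), round(sd_oz, 1), variance_scales_by_c2, sd_scales_by_c)
```

Because the standard deviation rescales in the same way as the data, it can be reported alongside the mean in either unit.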
The Coefficient of Variation (CV)

• The standard deviation is useful as a measure of variation within a given set of data.

• When one desires to compare the dispersion in two sets of data, however, comparing the
two standard deviations may lead to fallacious results.

• It may be that the two variables involved are measured in different units.
• For example: we may wish to know, for a certain population, whether serum cholesterol
levels, measured in milligrams per 100 ml, are more variable than body weight, measured
in pounds.

• Even when the same unit of measurement is used, the two means may be quite different.
• For example: If we compare the standard deviation of weights of first-grade children with
the standard deviation of weights of high school freshmen, we may find that the latter
standard deviation is numerically larger than the former, because the weights themselves
are larger, not because the dispersion is greater.

• Also, it is useful to relate the arithmetic mean and the standard deviation to each other.
• For example: a standard deviation of 10 means something different conceptually if the
arithmetic mean is 10 than if it is 1000.
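The Coefficient of Variation (CV)

The coefficient of variation expresses the standard deviation as a percentage of the
mean, so it is independent of the unit of measurement:

$$CV = 100\% \times \frac{s}{\bar{x}}$$
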
The Coefficient of Variation (CV)
Example-1: Suppose two samples of human males yield the results shown in the
table. Determine which is more variable, the weights of the 25-year-olds or the weights
of the 11-year-olds.

CV for the 25-year-olds

CV for the 11-year-olds

• A comparison of the standard deviations might lead one to conclude that the two samples
possess equal variability.
• However, the coefficients of variation make it clear that variation is much higher in the
sample of 11-year-olds than in the sample of 25-year-olds.
The Coefficient of Variation (CV)

Example-2: Compute the coefficient of variation for the data in the table when the
birthweights are expressed in either grams or ounces.

When the Data is Expressed in Grams

When the Data is Expressed in Ounces
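
The computations for this example are not reproduced in this text. A hedged sketch, assuming the birth-weights listed in the median example and an assumed conversion of 28.35 g per ounce, shows that the coefficient of variation is the same in both units:

```python
from math import isclose
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = 100 * s / x_bar, expressed as a percentage."""
    return 100 * stdev(values) / mean(values)

G_PER_OZ = 28.35   # assumed conversion factor (grams per ounce)

grams = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
         3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
ounces = [w / G_PER_OZ for w in grams]

cv_grams = coefficient_of_variation(grams)
cv_ounces = coefficient_of_variation(ounces)

print(round(cv_grams, 1), round(cv_ounces, 1))
```

The scale factor multiplies both the standard deviation and the mean, so it cancels in the ratio; this is what makes the CV suitable for comparing variables measured in different units.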


Percentiles

• Percentiles are also sometimes called Quantiles.

• Percentiles are values that divide a set of observations into 100 equal parts, so there are
99 percentiles in total.

• Percentiles are used for location of data on the horizontal axis.

• Percentiles have the advantage over the range of being less sensitive to outliers and
of not being greatly affected by the sample size (n).

• A percentile is defined as follows:

“Given a set of n observations, x1, x2, x3, … , xn, the pth percentile P is the value of X
such that p percent or less of the observations are less than P and (100 - p) percent or
less of the observations are greater than P”.
Percentiles

• Subscripts on P serve to distinguish one percentile from another.


• For example: The 10th percentile is designated P10, the 70th is designated P70, and so on.

• The 70th percentile means that 70% of the values lie below the value at P70, while 30% of
the values lie above it.

• The 50th percentile is the median and is designated P50.

• The median separates the lower 50% of the values from the upper 50% in a data set.
Percentiles

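The pth percentile can be computed as follows, where n is the sample size and the
observations are ordered from smallest to largest:

1. If np/100 is not an integer, the pth percentile is the (k + 1)th largest observation,
where k is the largest integer less than np/100.
2. If np/100 is an integer, the pth percentile is the average of the (np/100)th and
(np/100 + 1)th largest observations.
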
Percentiles

Frequently used percentiles are;

• Quartiles divide the data set into four equal parts.


• Example: 25th, 50th, and 75th percentiles.

• Quintiles divide the data set into five equal parts.


• Example: 20th, 40th, 60th, and 80th percentiles.

• Deciles divide the data set into 10 equal parts.


• Example: 10th, 20th, . . . , 90th percentiles.
Quartiles
• Quartiles divide the data set into four equal parts.

• The 25th percentile is often referred to as the first quartile or lower quartile and
denoted as Q1. One-quarter of the data lie at or below it.

• The 50th percentile (the median) is referred to as the second or middle quartile and
written as Q2. Half of the data lie at or below it.

• The 75th percentile is referred to as the third quartile or upper quartile and denoted as
Q3. Three-quarters of the data lie at or below it.

• The quartiles for a set of data are calculated using the following formulas;
Percentiles
Example-1: Compute the 10th and 90th percentiles for the birthweight data in the table.

Solution: The first step is to arrange data from the smallest value to the largest value.
• Both np/100 = 20 × 0.1 = 2 and np/100 = 20 × 0.9 = 18 are integers.
• Therefore, the 10th and 90th percentiles are defined by:
• 10th percentile: average of the second and third largest values = (2581 + 2759)/2 = 2670 g
• 90th percentile: average of the 18th and 19th largest values = (3609 + 3649)/2 = 3629 g
• We would estimate that 80% of birthweights will fall between 2670 g and 3629 g, which
gives an overall impression of the spread of the distribution.
Percentiles

Example-2: Compute the 20th percentile for the white-blood-count data in the table.

Solution:
The first step is to arrange data from the smallest value to the largest value.
• np/100 = 9 × 0.2 = 1.8 is not an integer.
• Therefore, the 20th percentile is defined by the (1 + 1)th largest value.
• Hence, the 20th percentile is the second largest value = 5000.
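
The rule applied in the two examples above can be sketched as a function. In these slides the "kth largest" value means the kth value counted from the smallest in the ordered sample. The function name `percentile` is illustrative and is not a library call; statistical packages often use other interpolation rules and may give slightly different answers. The sketch assumes p is a whole number between 1 and 99.

```python
def percentile(values, p):
    """pth percentile by the rule used in the examples (integer p, 1..99)."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point error when testing np/100.
    k, remainder = divmod(n * p, 100)
    if remainder == 0:
        # np/100 is an integer: average the kth and (k + 1)th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise take the (k + 1)th ordered value, where k = floor(np/100).
    return ordered[k]

birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]   # white-blood counts (x1000)

p10 = percentile(birthweights, 10)
p90 = percentile(birthweights, 90)
p20_wbc = percentile(wbc, 20)
p50 = percentile(birthweights, 50)     # should agree with the sample median

print(p10, p90, p20_wbc, p50)
```

With p = 50 the same rule reproduces the sample median of the birth-weights computed earlier.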
Interquartile Range (IQR)

• The range provides a crude measure of the variability present in a set of data.

• A disadvantage of the range is the fact that it is computed from only two values, the
largest and the smallest.

• A similar measure that reflects the variability among the middle 50 percent of the
observations in a data set is the interquartile range.

• The interquartile range (IQR) is the difference between the third and first quartiles.
Interquartile Range

• A large IQR indicates a large amount of variability among the middle 50 percent of the
relevant observations.

• A small IQR indicates a small amount of variability among the relevant observations.

• It is more informative to compare the interquartile range with the range for the entire data
set.

• A comparison may be made by forming the ratio of the IQR to the range (R) and
multiplying by 100; that is, 100(IQR/R) tells us what percent the IQR is of the overall range.
Outliers or Outlying Values

An outlier or outlying value is a value x such that either

1) x > upper quartile (Q3) + 1.5 × (upper quartile (Q3) − lower quartile (Q1)) or

2) x < lower quartile (Q1) − 1.5 × (upper quartile (Q3) − lower quartile (Q1))

Outliers are unusually large and unusually small values of x in a data set.
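
A minimal sketch of the outlier rule, applied to the white-blood-count sample. The quartile formulas from the slides are not reproduced in this text, so Q1 and Q3 are taken here as the 25th and 75th percentiles computed with the percentile rule from the earlier examples; that choice is an assumption, and other quartile conventions can give slightly different fences.

```python
def percentile(values, p):
    """pth percentile by the rule used in the percentile examples."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]

wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]   # white-blood counts (x1000)

q1 = percentile(wbc, 25)
q3 = percentile(wbc, 75)
iqr = q3 - q1

# Fences: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are outlying values.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in wbc if x < lower_fence or x > upper_fence]

# Percent of the overall range covered by the IQR: 100(IQR/R).
iqr_percent_of_range = 100 * iqr / (max(wbc) - min(wbc))

print(q1, q3, iqr, lower_fence, upper_fence, outliers, iqr_percent_of_range)
```

Under these assumptions the count of 35 (35,000 on the original scale) is the only value flagged, and the middle half of the data spans less than a tenth of the overall range.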
Kurtosis

• Just as we may describe a distribution in terms of skewness, we may describe a
distribution in terms of kurtosis.

• Kurtosis is a measure of the degree to which a distribution is “peaked” or flat in
comparison to a normal distribution whose graph is characterized by a bell-shaped
appearance.

• A normal, or bell-shaped distribution, is said to be mesokurtic.

• A distribution may possess an excessive proportion of observations in its tails, so that
its graph exhibits a flattened appearance. Such a distribution is said to be platykurtic.

• A distribution may possess a smaller proportion of observations in its tails, so that its
graph exhibits a more peaked appearance. Such a distribution is said to be leptokurtic.
Kurtosis

Kurtosis can be expressed as;

• A perfectly mesokurtic distribution has a kurtosis measure of 3 based on the equation.

• Most computer algorithms reduce the measure by 3, as is done in the equation for
kurtosis, so that the kurtosis measure of a mesokurtic distribution will be equal to 0.

• A leptokurtic distribution will have a kurtosis measure > 0.

• A platykurtic distribution will have a kurtosis measure < 0.
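
The kurtosis equation is not reproduced in this text. The sketch below assumes a common moment-based form with the reduction by 3 described above, n Σ(x − x̄)⁴ / (Σ(x − x̄)²)² − 3, and applies it to the “No Skew” data listed in the histogram example; the formula and the function name are assumptions for illustration.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis reduced by 3 (assumed definition, see lead-in)."""
    n = len(values)
    m = sum(values) / n
    sum_sq = sum((x - m) ** 2 for x in values)      # squared deviations
    sum_fourth = sum((x - m) ** 4 for x in values)  # fourth-power deviations
    return n * sum_fourth / sum_sq ** 2 - 3

no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]

kurtosis_no_skew = excess_kurtosis(no_skew)

# Classify by sign, following the bullets above.
if kurtosis_no_skew > 0:
    shape = "leptokurtic"
elif kurtosis_no_skew < 0:
    shape = "platykurtic"
else:
    shape = "mesokurtic"

print(round(kurtosis_no_skew, 3), shape)
```

Under this definition the value is negative, so this small symmetric sample would be classed as somewhat flatter than a normal distribution.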


Kurtosis

Graphs of distributions representing the three types of kurtosis are shown in the figure.

The descriptive statistics of these three distributions are shown in the following table.
The Ordered Array

• A first step in organizing data is the preparation of an ordered array.

• An ordered array is a listing of the values of a collection (either population or
sample) in order of magnitude from the smallest value to the largest value.

• If the number of measurements to be ordered is of any appreciable size, the use of a
computer to prepare the ordered array is highly desirable.

• An ordered array enables one to determine quickly the value of the smallest
measurement, the value of the largest measurement, and other facts about the
arrayed data that might be needed in a hurry.
The Ordered Array
• The table below presents the data in the form of an ordered array.
• Using this ordered array table, we are able to determine quickly the age of the youngest
subject (30) and the age of the oldest subject (82).
• We also readily note that about one-third of the subjects are 50 years of age or younger.

TABLE: Ordered Array of Ages of Subjects


Grouped Data

• The main purpose in grouping data is summarization.

• Data contain information, and summarization is a way of making it easier to
determine the nature of this information.

• Grouping the data provides a better overall picture of the unknown population.

• Data can be grouped into a set of non-overlapping, contiguous intervals.

• These intervals are usually referred to as class intervals (also known as bins).

• Class intervals are used to sort the data.

• The choice of class intervals usually depends on the range of the data.


Grouped Data

Frequency Distribution

• The frequency of a particular observation is the number of times the observation occurs
in a data set.

• A frequency distribution is an ordered display of each value in a data set together
with its frequency, that is, the number of times that value occurs in the data set.

• Important characteristics of a large data set can be easily assessed by first grouping
the data into different class intervals and then determining the number of
observations that fall in each of the class intervals.
Frequency Distribution

• Class intervals are units of equal width that include all the data from the lowest value
to the highest value.

• Each data element is placed into its correct class interval.

• The number of entries in each class interval is counted to give the frequency.

• In tabular form, this is called a frequency distribution.

• Data from a frequency distribution can be used to make graphs including the
histogram and the frequency polygon.
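
The steps above can be sketched in a few lines. The tables for the 189 ages are not reproduced in this text, so the example groups the 20 birth-weights from the median example instead; the class width of 500 g and the function name are illustrative choices, not values from the lecture.

```python
from collections import Counter

WIDTH = 500   # class width in grams (illustrative choice)

def frequency_distribution(values, width):
    """Return (lower, upper, frequency) for equal-width, contiguous classes."""
    counts = Counter((x // width) * width for x in values)
    lowest, highest = min(counts), max(counts)
    # Empty classes are kept so the intervals stay contiguous.
    return [(start, start + width - 1, counts.get(start, 0))
            for start in range(lowest, highest + width, width)]

birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

table = frequency_distribution(birthweights, WIDTH)

# A row of asterisks per class gives a rough text histogram.
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq:2d} {'*' * freq}")
```

Each observation falls into exactly one class, so the frequencies add up to the sample size.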
Frequency Distribution (Example-1)
TABLE: Ordered Array of Ages of Subjects

TABLE: Frequency Distribution of Ages of 189 Subjects Shown in the above tables.
Frequency Distribution (Example-2)

Table: Grouped frequency distribution of birthweight (oz) from 100 consecutive deliveries.
Graphic Methods

• In this section, certain commonly used graphic methods for displaying data are
presented.

• The purpose of using graphic displays is to give a quick overall impression of data,
which is sometimes difficult to obtain with numeric measures.

• The commonly used graphic methods are;

i. Histogram
ii. Frequency Polygon
iii. Stem-and-Leaf Displays
iv. Box-and-Whisker Plots
Histogram

• The Histogram is a special type of bar graph.

• In a histogram, the values of the variable under consideration are represented by the
horizontal axis, while the vertical axis has as its scale the frequency of occurrence.

• Above each class interval on the horizontal axis a rectangular bar, or cell, is erected.

• The height of the bar indicates the frequency.

• In a histogram, the bars must touch to indicate that there are no data in the data set
that are missing from the histogram.
Histogram (Example-1)
TABLE: Ordered Array of Ages of Subjects

TABLE: Frequency Distribution of Ages of 189 Subjects Shown in the above tables.

Figure: Histogram of ages of 189 subjects.
Frequency Polygon

• The frequency polygon is a special kind of line graph.


• To draw a frequency polygon we first place a dot above the midpoint of each class
interval represented on the horizontal axis of a graph.
• The height of a given dot above the horizontal axis corresponds to the frequency of
the relevant class interval.
• Connecting the dots by straight lines produces the frequency polygon.
• Both ends of the frequency polygon are attached on the x-axis.
• To accomplish this, we must utilize a class interval before the first one, and a class
interval after the last one, each containing no data points.
• This allows for the total area to be enclosed.
• The total area under the frequency polygon is equal to the area under the histogram.
Frequency Polygon

Figure: Frequency polygon for the ages of 189 subjects.

Figure: Histogram and frequency polygon for the ages of 189 subjects.
Stem-and-Leaf Displays

• The stem-and-leaf display is useful for representing quantitative data sets.

• A quick way to obtain an informative visual representation of the data set is to construct
a stem-and-leaf display.

• A stem-and-leaf display bears a strong resemblance to a histogram.

• A properly constructed stem-and-leaf display, like a histogram, provides information
regarding the range of the data set, shows the location of the highest concentration of
measurements, and reveals the presence or absence of symmetry.

• An advantage of the stem-and-leaf display over the histogram is the fact that it preserves
the information contained in the individual measurements. Such information is lost when
measurements are assigned to the class intervals of a histogram.

• Another advantage of stem-and-leaf displays is the fact that they can be constructed
during the tallying process, so the intermediate step of preparing an ordered array is
eliminated.
Stem-and-Leaf Displays

• To construct a stem-and-leaf display we partition each measurement into two parts.


• The first part is called the stem, and the second part is called the leaf.
• The stem consists of one or more of the initial digits of the measurement.
• The leaf is composed of one or more of the remaining digits.
• Thus the stem of the number 483 is 48, and the leaf is 3.
• All partitioned numbers are shown together in a single display; the stems form an
ordered column with the smallest stem at the top and the largest at the bottom.
• The rows of the display contain the leaves, ordered and listed to the right of their
respective stems.
• When leaves consist of more than one digit, all digits after the first may be deleted.
• Decimals when present in the original data are omitted in the stem-and-leaf display.
• The stems are separated from their leaves by a vertical line.
• A stem-and-leaf display is also an ordered array of the data.
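
The construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `stem_and_leaf` is hypothetical, it assumes non-negative whole-number measurements with one-digit leaves, and printing empty stems is a design choice made here so that gaps in the data stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display as text rows, smallest stem at the top.

    Assumes non-negative whole numbers; the last digit is the leaf and
    the remaining leading digits form the stem.
    """
    rows = defaultdict(list)
    for value in sorted(values):          # sorting keeps the leaves ordered
        stem, leaf = divmod(value, 10)    # e.g. 483 -> stem 48, leaf 3
        rows[stem].append(leaf)

    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        # A vertical bar separates each stem from its leaves.
        lines.append(f"{stem:>3} | {leaves}".rstrip())
    return lines


# Hypothetical two-digit measurements (not the ages of the 189 subjects).
for row in stem_and_leaf([30, 34, 41, 67]):
    print(row)
```

Reading the rows from top to bottom and the leaves from left to right recovers the ordered array of the data.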
Stem-and-Leaf Displays (Example-1)
Since the measurements are all two-digit numbers, we will have one-digit stems and one-
digit leaves.
For Example: The measurement 30 has a stem of 3 and a leaf of 0.
TABLE: Ordered Array of Ages of 189 Subjects
Stem-and-Leaf Displays (Example-2)

• In some stem-and-leaf plots the leaf can consist of more than one digit.
• In this case, the leaf would consist of the rightmost two digits.
• The stem would consist of the leftmost two digits.
• The pairs of digits to the right of the vertical bar would be underlined to
distinguish between two different leaves.
• The stem-and-leaf display for the data in the table is shown in the figure.
Stem-and-Leaf Displays (Example-3)

• The point 5|8 represents 58, 11|8 represents 118, and so forth.
• Notice how this plot gives an overall feel for the distribution
without losing the individual values.
• Also, the cumulative frequency count from either the lowest or the
highest value is given in the first column.
• For the 11 stem, the absolute count is given in parentheses (17)
instead of the cumulative total because the cumulative count from either
the lowest or the highest value would exceed 50% (50).
Box-and-Whisker Plots
• The construction of a Box-and-Whisker plot (or boxplot) makes use of the quartiles
of a data set and may be accomplished by following these five steps:

1. Represent the variable of interest on the horizontal axis.

2. Draw a box in the space above the horizontal axis in such a way that the left end of
the box aligns with the first quartile Q1 and the right end of the box aligns with the
third quartile Q3.

3. Divide the box into two parts by a vertical line that aligns with the median.

4. Draw a horizontal line called a whisker from the left end of the box to a point that
aligns with the smallest measurement in the data set.

5. Draw another horizontal line, or whisker, from the right end of the box to a point that
aligns with the largest measurement in the data set.

• Examination of a box-and-whisker plot for a set of data reveals information regarding
the amount of spread, location of concentration, and symmetry of the data.
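
The five values that the steps above place on the axis (smallest measurement, Q1, median, Q3, largest measurement) can be computed as in the sketch below. It assumes the quartile position rule k(n + 1)/4 with linear interpolation between neighbouring measurements; other software may use a different quartile rule. The function names and the sample values are hypothetical.

```python
def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) at position k(n + 1)/4, counted from 1."""
    n = len(sorted_values)
    if n < 3:
        raise ValueError("this sketch needs at least three measurements")
    position = k * (n + 1) / 4
    whole = int(position)          # rank of the measurement just below the position
    fraction = position - whole    # how far to move toward the next measurement
    lower = sorted_values[whole - 1]
    upper = sorted_values[min(whole, n - 1)]
    return lower + fraction * (upper - lower)


def five_number_summary(values):
    """Smallest value, Q1, median, Q3, and largest value of a data set."""
    ordered = sorted(values)
    return (
        ordered[0],
        quartile(ordered, 1),
        quartile(ordered, 2),
        quartile(ordered, 3),
        ordered[-1],
    )


# Hypothetical measurements, used only to exercise the functions.
smallest, q1, median, q3, largest = five_number_summary([22, 12, 31, 15, 25, 17, 20])
print(smallest, q1, median, q3, largest)
```

The box runs from `q1` to `q3`, the dividing line sits at `median`, and the whiskers extend to `smallest` and `largest`.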
Box-and-Whisker Plots (Example-1)

Evans et al. examined the effect of velocity on ground reaction forces (GRF) in dogs with
lameness (a condition in which the animal fails to travel in a regular and sound manner
on all four feet) from a torn cranial cruciate ligament (disease). The dogs were walked
and trotted (run) over a force platform, and the GRF was recorded during a certain phase
of their performance. The Table given below contains 20 measurements of force where
each value shown is the mean of five force measurements per dog when trotting.
Box-and-Whisker Plots (Example-1)

• The smallest and largest measurements are 14.6 and 44, respectively.

• The first quartile is the Q1 = (20 + 1)/4 = 5.25th measurement.

• The 5.25th measurement is equal to 27.2 + (0.25)(27.4 – 27.2) = 27.25.

• The second quartile or median is the Q2 = (20 + 1)/2 = 10.5th measurement.


• The 10.5th measurement is equal to 30.7 + (0.5)(31.5 – 30.7) = 31.1.

• The third quartile is the Q3 = 3(20 + 1)/4 = 15.75th measurement.


• The 15.75th measurement is equal to 33.3 + (0.75)(33.6 – 33.3) = 33.525.

• The interquartile range is IQR = 33.525 – 27.25 = 6.275.


• The range is R = 44 – 14.6 = 29.4.
• The IQR is 100(6.275/29.4) ≈ 21 percent of the range.
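
The arithmetic on this slide can be reproduced with a short script. The sketch uses only the numbers quoted above (the full table of 20 measurements is not reproduced here), and the helper name `interpolate` is hypothetical.

```python
def interpolate(lower, upper, fraction):
    """Move `fraction` of the way from one measurement to the next."""
    return lower + fraction * (upper - lower)


n = 20

# Quartile positions k(n + 1)/4 for k = 1, 2, 3: 5.25, 10.5, 15.75.
positions = {k: k * (n + 1) / 4 for k in (1, 2, 3)}

# Neighbouring ordered measurements as quoted on the slide.
q1 = interpolate(27.2, 27.4, 0.25)   # between the 5th and 6th measurements
q2 = interpolate(30.7, 31.5, 0.5)    # between the 10th and 11th measurements
q3 = interpolate(33.3, 33.6, 0.75)   # between the 15th and 16th measurements

iqr = q3 - q1
data_range = 44 - 14.6
percent_of_range = 100 * iqr / data_range

# Rounded for display because of floating-point representation.
print(round(q1, 3), round(q2, 3), round(q3, 3))
print(round(iqr, 3), round(data_range, 1), round(percent_of_range))
```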
Box-and-Whisker Plots (Example-1)

• The resulting box-and-whisker plot is shown in the figure.

• Examination of the figure reveals that 50 percent of the measurements are between about
27 and 33, the approximate values of the first and third quartiles, respectively.

• The vertical bar inside the box shows that the median is about 31.

Figure: Box-and-whisker plot annotated with the smallest value, Q1, Q2 (median), Q3, the largest value, the range, and the IQR.
Box-and-Whisker Plots (Example-2)

Nicotine content was measured in a random sample of 40 cigarettes. The data are
displayed in the table.
Box-and-Whisker Plots (Example-2)

Stem-and-leaf plot for the nicotine data.


Key Formulas
Key Symbols
Thank You !!!
