Lecture No.2
Resource Person: Dr. Absar Ul Haq Department: Mechanical Engineering
Statistical modeling is the process of using mathematical and statistical techniques to analyze and understand
a dataset. This can involve creating models to represent the data, testing hypotheses, and making predictions
about future events.
Scientific inspection is the process of carefully examining data and results to identify patterns and trends.
This can involve visualizing the data using graphs and charts, and using statistical tests to determine the
significance of any patterns or trends.
Graphical diagnostics is the use of graphical methods to identify and diagnose problems in statistical models.
This can involve creating diagnostic plots to identify outliers, checking for patterns in the residuals, and using
other graphical methods to identify potential issues with the model. Together, these three approaches allow
analysts to gain a deeper understanding of the data, identify potential issues, and make better-informed
decisions.
In this lecture we will be concerned with stem-and-leaf displays, box plots, graphs for simple sets of discrete
data, grouped frequency distributions, and histograms and cumulative distribution diagrams.
A stem-and-leaf display is a way to organize and display data graphically. It resembles a histogram, but it
separates the digits of each number into two parts: the stem and the leaf. The stem consists of the leftmost
digit(s) of a number, and the leaf consists of the rightmost digit(s).
For example, if you have a set of data that includes the following numbers: 12, 15, 18, 20, 21, 24, 25, 26,
27, 28, you could create a stem-and-leaf display as follows:
Stem | Leaf
   1 | 2, 5, 8
   2 | 0, 1, 4, 5, 6, 7, 8
In this example, the stem represents the tens place of the number, and the leaf represents the ones place.
The numbers 12, 15, and 18 all have a stem of 1, with leaves of 2, 5, and 8 respectively. The numbers 20,
21, 24, 25, 26, 27, and 28 all have a stem of 2, with leaves of 0, 1, 4, 5, 6, 7, and 8 respectively.
As another example, if you have a set of data that includes the following numbers: 123, 124, 125, 126, 127,
128, 129, 130, you could create a stem-and-leaf display as follows:
Stem | Leaf
  12 | 3, 4, 5, 6, 7, 8, 9
  13 | 0
In this example, the stem represents the hundreds and tens places of the number, and the leaf represents the
ones place. The numbers 123, 124, 125, 126, 127, 128, and 129 all have a stem of 12, with leaves of 3, 4, 5,
6, 7, 8, and 9 respectively. The number 130 has a stem of 13 and a leaf of 0.
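The splitting of each number into a stem and a leaf can be sketched in a few lines of Python. This is only an illustration of the idea, not a standard library routine: the function names stem_and_leaf and format_display and the leaf_unit parameter are assumptions introduced here. The leaf is taken to be the final digit, and the stem is everything to its left.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Group values into stems and leaves.

    The leaf is the last digit, measured in units of leaf_unit
    (an illustrative parameter); the stem is everything to its left.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)      # value expressed in leaf units
        stem, leaf = divmod(scaled, 10)    # split off the final digit
        stems[stem].append(leaf)
    return dict(stems)


def format_display(stems):
    """Render the stems and their leaves as plain text rows."""
    lines = ["Stem | Leaf"]
    for stem in sorted(stems):
        leaves = ", ".join(str(leaf) for leaf in stems[stem])
        lines.append(f"{stem:>4} | {leaves}")
    return "\n".join(lines)


# The two data sets used in the examples above.
first = stem_and_leaf([12, 15, 18, 20, 21, 24, 25, 26, 27, 28])
second = stem_and_leaf([123, 124, 125, 126, 127, 128, 129, 130])

print(format_display(first))
print(format_display(second))
```

For data recorded to one decimal place, calling stem_and_leaf(values, leaf_unit=0.1) would put the digit before the decimal point on the stem and the digit after it on the leaf.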
These simple displays are particularly suitable for exploratory analysis of fairly small sets of data. The basic
ideas will be developed with an example. Data have been obtained on the lives of batteries of a particular
type in an industrial application. Table 2.1 shows the lives of 36 batteries recorded to the nearest tenth of
a year. For these data we choose ”stems” which are the main magnitudes. In this case the digit before the
decimal point is a reasonable choice: 1, 2, 3, 4, 5, 6. Now we go through the data and put each ”leaf,” in this
case the digit after the decimal point, on its corresponding stem. The decimal point is not usually shown.
The result can be seen in Table 2.2. The number of leaves on each stem can be counted and shown under the
heading of Frequency. From the list of leaves on each stem we have an immediate visual indication of the
relative numbers. We can see whether or not the distribution is approximately symmetrical, and we may
get a preliminary indication of whether any particular theoretical distribution may fit the data. We will see
some theoretical distributions later in this book, and we will find that some of the distributions we encounter
in this chapter can be represented well by theoretical distributions.
We may want to sort the leaves on each stem in order of magnitude to give more detail and facilitate finding
parameters which depend on the order. The result of sorting by magnitude is shown in Table 2.3. Another
possibility is to double the number of stems (or multiply them further), especially if the number of data is
large in relation to the initial number of stems. Stem ”a” might have leaves from 0 to 4, and stem ”b” might
have leaves from 5 to 9. The result without sorting is shown in Table 2.4. Of course, we might both double
the number of stems and sort the leaves on each stem. In other cases it might be more appropriate to show
two significant figures on each leaf, with appropriate separation between leaves. There are many possible
variations.
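Doubling the stems and sorting the leaves can be sketched as follows. Because the battery data of Table 2.1 are not reproduced in these notes, the sketch reuses the small integer data set from the first stem-and-leaf example. The function name split_stem_and_leaf and the row labels such as "1a" are assumptions made for illustration, and the code assumes non-negative integer data.

```python
from collections import defaultdict


def split_stem_and_leaf(values):
    """Stem-and-leaf rows with each stem split in two:
    'a' holds leaves 0 to 4 and 'b' holds leaves 5 to 9."""
    stems = defaultdict(list)
    for v in sorted(values):               # sorting first leaves each row in order
        stem, leaf = divmod(v, 10)
        half = "a" if leaf <= 4 else "b"
        stems[f"{stem}{half}"].append(leaf)
    return dict(stems)


data = [12, 15, 18, 20, 21, 24, 25, 26, 27, 28]
rows = split_stem_and_leaf(data)

for label, leaves in rows.items():
    # The frequency of a row is just the number of leaves on it.
    print(label, leaves, "frequency:", len(leaves))
```

With only ten observations the split is finer than the data warrant; it becomes useful when many leaves crowd onto each stem.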
A box plot, or box-and-whisker plot, is a graphical device for displaying certain characteristics of a frequency
distribution. A narrow box extends from the lower quartile to the upper quartile. Thus the length of the
box represents the interquartile range, a measure of variability. The median is marked by a line extending
across the box. The smallest value in the distribution and the largest value are marked, and each is joined
to the box by a straight line, the whisker. Thus, the whiskers represent the full range of the data.
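The five quantities that a box plot displays can be computed with a short sketch such as the one below. Several conventions exist for locating quartiles in a finite sample, and these notes do not fix one at this point, so the linear interpolation used here (position (n - 1)p among the ordered values) is an assumption; other conventions give slightly different quartiles. The function names are illustrative, and the data are the ten numbers from the first stem-and-leaf example, not the battery data.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation at position (n - 1) * p.

    This is one common convention, assumed here for illustration.
    """
    position = (len(sorted_values) - 1) * p
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(sorted_values):
        return float(sorted_values[-1])
    step = sorted_values[lower + 1] - sorted_values[lower]
    return sorted_values[lower] + fraction * step


def five_number_summary(values):
    """Return the quantities marked on a box plot."""
    ordered = sorted(values)
    q1, median, q3 = (quantile(ordered, p) for p in (0.25, 0.5, 0.75))
    return {
        "smallest": ordered[0],            # end of the lower whisker
        "lower_quartile": q1,              # lower edge of the box
        "median": median,                  # line across the box
        "upper_quartile": q3,              # upper edge of the box
        "largest": ordered[-1],            # end of the upper whisker
        "interquartile_range": q3 - q1,    # length of the box
    }


summary = five_number_summary([12, 15, 18, 20, 21, 24, 25, 26, 27, 28])
print(summary)
```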
Figure 2.5 is a box plot for the data of Table 2.1 on the life of batteries under industrial conditions. The
labels, ”smallest”, ”largest”, ”median”, and ”quartiles”, are usually omitted. Box plots are particularly
suitable for comparing sets of data, such as before and after modifications were made in the production
process. Figure 2.6 shows a comparison of the box plot of Figure 2.5 with a box plot for similar data under
modified production conditions, both for the same sample size. Although the median has not changed very
much, we can see that the sample range and the interquartile range for modified conditions are considerably
smaller.
Example 2.1 To start a program to improve the quality of production in a factory, all the products coming
off a production line, under what we have reason to believe are normal operating conditions, are examined
and classified as ”good” products or ”defective” products. The number of defective products in each successive
group of six is counted. The results for 60 groups, so for 360 products, are shown in Table 2.7. Find the mean,
median, mode, first quartile, third quartile, eighth decile, ninth decile, proportion defective in the sample,
first estimate of probability that an item will be defective, sample variance, sample standard deviation, and
coefficient of variation.
These data can be shown graphically in a very simple form because they involve discrete data, as opposed to
continuous data, and only a few different values. The variate is discrete in the sense that only certain values
are possible: in this case the number of defective items in a group of six must be an integer rather than a
fraction. The number of defective items in each group of this example is only 0, 1, or 2. The frequencies
of these numbers are shown above. The corresponding frequency graph is shown in Figure 2.9. The isolated
spikes correspond to the discrete character of the variate. If the number of different values is very large, it
may be desirable to use the grouped frequency approach, as discussed below for continuous data.
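When discrete data are summarized as a frequency table, most of the quantities requested in Example 2.1 follow directly from the frequencies. The sketch below shows the arithmetic. Table 2.7 is not reproduced in these notes, so the frequencies used here are HYPOTHETICAL placeholders chosen only so that they total 60 groups; they are not the data of the example, and the printed results are not its answers. The sample variance is assumed to use the n - 1 divisor.

```python
from math import sqrt

# HYPOTHETICAL frequencies, NOT the values of Table 2.7:
# number of defectives in a group of six -> number of groups with that count.
frequencies = {0: 40, 1: 15, 2: 5}
GROUP_SIZE = 6

n_groups = sum(frequencies.values())
total_defective = sum(x * f for x, f in frequencies.items())

mean = total_defective / n_groups
mode = max(frequencies, key=frequencies.get)    # value with the largest frequency

# Proportion defective in the sample, which also serves as a first
# estimate of the probability that an item will be defective.
proportion_defective = total_defective / (n_groups * GROUP_SIZE)

# Sample variance with the n - 1 divisor (assumed convention).
sum_sq_dev = sum(f * (x - mean) ** 2 for x, f in frequencies.items())
variance = sum_sq_dev / (n_groups - 1)
std_dev = sqrt(variance)
coefficient_of_variation = std_dev / mean

print(mean, mode, proportion_defective, variance, std_dev, coefficient_of_variation)
```

Replacing the placeholder dictionary with the actual frequencies from Table 2.7 would give the values asked for in the example.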
If the variate is continuous, any value at all in an appropriate range is possible. Between any two possible
values, there are an infinite number of other possible values, although measuring devices are not able to
distinguish some of them from one another. Measurements will be recorded to only a certain number
of significant figures. Even to this number of figures, there will usually be a large number of possible
values. If the number of possible values of the variate is large, too many occur on a table or graph for easy
comprehension. We can make the data easier to comprehend by dividing the variate into intervals or classes
and counting the frequency of occurrence for each class. This is called the grouped frequency approach.
Thus, frequency grouping is used to make the distribution more easily understood. The width of each class
(the difference between its lower boundary and its upper boundary) should be constant from one class to
another (there are exceptions to this statement, but we will omit them from this book). The number of
classes should be from seven to twenty, depending chiefly on the size of the population or sample being
represented. If the number of classes is too large, the result is too detailed and it is hard to see an underlying
pattern. If the number of classes is too small, there is appreciable loss of information, and the pattern may
be obscured. An empirical relation which gives an approximate value of the appropriate number of classes
is Sturges’ Rule:
number of class intervals ≈ 1 + 3.3 log₁₀ n        (2.1)
where n is the total number of observations in the sample. The procedure is to start with the range, the
difference between the largest and the smallest items in the set of observations. Then the constant class width
is given approximately by dividing the range by the approximate number of class intervals from equation 2.1.
Round off the class width to a convenient number (remember that there is nothing sacred or exact about
Sturges’ Rule!). The class boundaries must be clear with no gaps and no overlaps. For problems in this book
choose the class boundaries halfway between possible magnitudes. This gives a definite and fair boundary.
For example, if the observations are recorded to one decimal place, the boundaries should end in five in the
second decimal place. If 2.4 and 2.5 are possible observations, a class boundary might be chosen as 2.45.
The smallest class boundary should be chosen at a convenient value a little smaller than the smallest item
in the set of observations.
Each class midpoint is halfway between the corresponding class boundaries.
Then the number of items in each class should be tallied and shown as class frequency in a table called a
grouped frequency table. The relative frequency is the class frequency divided by the total of all the class
frequencies, which should agree with the total number of items in the set of observations. The cumulative
frequency is the total of all class frequencies smaller than a class boundary. The class boundary rather
than class midpoint must be used for finding cumulative frequency because we can see from the table how
many items are smaller than a class boundary, but we cannot know how many items are smaller than a
class midpoint unless we go back to the original data. The relative cumulative frequency is the fraction
(or percentage) of the total number of items smaller than the corresponding upper class boundary. Let us
consider an example.
Example 2.2 The thickness of a particular metal part of an optical instrument was measured on 121
successive items as they came off a production line under what was believed to be normal conditions. The
results are shown in Table 2.10.
Thickness is a continuous variable, since any number at all in the appropriate range is a possible value. The
data in Table 2.10 are given to two decimal places, but it would be possible to measure to greater or lesser
precision. The number of possible results is infinite. The mass of numbers in Table 2.10 is very difficult to
comprehend. Let us apply the methods of this section to this set of data.
Applying the formula for the arithmetic mean to the numbers in Table 2.10 gives a mean of
407.59/121 = 3.3685, or 3.369 mm. (We
will see later that the mean of a large group of numbers is considerably more precise than the individual
numbers, so quoting the mean to more significant figures is justified.) Since the data constitute a sample of
all the thicknesses of parts coming off the production line under the same conditions, this is a sample mean.
The median of the 121 numbers in Table 2.10 is the 61st number in order of magnitude. This is 3.37 mm. The
fifth percentile is between the 6th and 7th items in order of magnitude, so (3.26 + 3.27)/2 = 3.265 mm. The
ninth decile is between the 108th and 109th numbers in increasing order of magnitude, so (3.44 + 3.45)/2 =
3.445 mm.
Now let us apply the grouped frequency approach to the numbers in Table 2.10. The largest item in the
table is 3.57, and the smallest is 3.21, so the range is 0.36. The number of class intervals according to
Sturges’ Rule should be approximately 1 + 3.3 log₁₀ 121 = 7.87. Then the class width should be approximately
0.36/7.87 = 0.0457. Let us choose a convenient class width of 0.05. The thicknesses are stated to two decimal
places, so the class boundaries should end in five in the third decimal. Let us choose the smallest class
boundary, then, as 3.195. The resulting grouped frequency table is shown in Table 2.11.
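The arithmetic of this paragraph can be checked with a short sketch. The figures (121 observations, smallest 3.21, largest 3.57, class width 0.05, first boundary 3.195) come from the text; the function name sturges_classes and the choice to work in integer thousandths of a millimetre, which avoids floating-point drift when stepping along the boundaries, are assumptions made for illustration.

```python
from math import ceil, log10


def sturges_classes(n):
    """Approximate number of class intervals from Sturges' Rule (equation 2.1)."""
    return 1 + 3.3 * log10(n)


n = 121
smallest, largest = 3.21, 3.57

k = sturges_classes(n)                    # about 7.87
raw_width = (largest - smallest) / k      # about 0.0457

class_width = 0.05       # rounded to a convenient value, as in the text
first_boundary = 3.195   # ends in five in the third decimal place

# Work in integer thousandths of a millimetre.
width_t = round(class_width * 1000)
start_t = round(first_boundary * 1000)
n_classes = ceil((round(largest * 1000) - start_t) / width_t)

boundaries = [(start_t + i * width_t) / 1000 for i in range(n_classes + 1)]
midpoints = [(start_t + i * width_t + width_t / 2) / 1000 for i in range(n_classes)]

print(round(k, 2), round(raw_width, 4))
print(boundaries)
print(midpoints)
```

With these choices the boundaries run from 3.195 to 3.595 in eight classes, close to the 7.87 suggested by the rule.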
In this table the class frequency is obtained by counting the tally marks for each class. This becomes easier
if we divide the tally marks into groups of five as shown in Table 2.11. The relative frequency is simply the
class frequency divided by the total number of items in the table, i.e. the total frequency, which is 121 in this
case. The cumulative frequency is obtained by adding together all the class frequencies for classes with values
smaller than the current upper class boundary. Thus, in the third line of Table 2.11, the cumulative frequency
of 40 is the sum of the class frequencies 2, 14 and 24. The corresponding relative cumulative frequency would
be 40/121 = 0.331, or 33.1%. The cumulative frequency in the last line must be equal to the total frequency.
From Table 2.11 the mode is given by the class midpoint of the class with the largest class frequency, 3.370 mm.
The mean, median and mode, 3.369, 3.37 and 3.370 mm, are in close agreement. This indicates that the
distribution is approximately symmetrical.
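The tallying behind a grouped frequency table such as Table 2.11 can be sketched as follows. Since the 121 thicknesses of Table 2.10 are not reproduced in these notes, the sketch tallies the ten numbers from the first stem-and-leaf example instead, with class boundaries of width 5 placed halfway between possible integer values; both the boundaries and the function name grouped_frequency_table are assumptions made for illustration. The last lines check the cumulative figures quoted above for the third line of Table 2.11.

```python
from bisect import bisect_left
from itertools import accumulate


def grouped_frequency_table(values, boundaries):
    """Tally values into classes defined by increasing class boundaries.

    Assumes no value coincides with a boundary, which holds when the
    boundaries lie halfway between possible magnitudes.
    """
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        index = bisect_left(boundaries, v) - 1    # class that encloses v
        if not 0 <= index < len(counts):
            raise ValueError(f"{v} lies outside the class boundaries")
        counts[index] += 1

    total = sum(counts)
    cumulative = list(accumulate(counts))
    rows = []
    for i, freq in enumerate(counts):
        rows.append({
            "lower": boundaries[i],
            "upper": boundaries[i + 1],
            "midpoint": (boundaries[i] + boundaries[i + 1]) / 2,
            "frequency": freq,
            "relative": freq / total,
            "cumulative": cumulative[i],
            "relative_cumulative": cumulative[i] / total,
        })
    return rows


data = [12, 15, 18, 20, 21, 24, 25, 26, 27, 28]
table = grouped_frequency_table(data, [9.5, 14.5, 19.5, 24.5, 29.5])
for row in table:
    print(row)

# Figures quoted in the text for Table 2.11: 2 + 14 + 24 = 40, and 40/121.
first_three = list(accumulate([2, 14, 24]))
print(first_three, round(first_three[-1] / 121, 3))
```

The cumulative frequency is attached to the upper class boundary of each row, which is why the function reports it alongside "upper" rather than "midpoint".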
Graphical representations of grouped frequency distributions are usually more readily understood than the
corresponding tables. Some of the main characteristics of the data can be seen in histograms and cumulative
frequency diagrams. A histogram is a bar graph in which the class frequency or relative class frequency is
plotted against values of the quantity being studied, so the height of the bar indicates the class frequency or
relative class frequency. Class midpoints are plotted along the horizontal axis.
In principle, a histogram for continuous data should have the bars touching one another, and that should be
done for problems in this book. However, the bars are often shown separated, and some computer software
does not allow the bars to touch one another.
The histogram for the data of Table 2.10 is shown in Figure 2.12 for a class width of 0.05 mm as already
calculated. Relative class frequency is shown on the righthand scale. Histograms for class widths of 0.03 mm
and 0.10 mm are shown in Figures 2.13 and 2.14 for comparison. Of these three, the class width of 0.05 mm
in Figure 2.12 seems most satisfactory (in agreement with Sturges’ Rule).
Cumulative frequencies are shown in the last column of Table 2.11. A cumulative frequency diagram is a
plot of cumulative frequency vs. the upper class boundary, with successive points joined by straight lines. A
cumulative frequency diagram for the thicknesses of Table 2.10 is shown in Figure 2.2.
The cumulative frequency diagram of Figure 2.2 could be changed into a relative cumulative frequency diagram
by a change of scale for the ordinate.
2.2 Exercise