Probability and Statistics (For Engineering) 235 M: Summer Session 2019/2020
Chapter 1
Descriptive Statistics
Introduction
Definitions:
1. Descriptive Statistics: The collection, organization,
summarization, and presentation of data. Descriptive
statistics are useful in data screening.
2. Population: A population is the entire collection of
subjects from which we may collect data. It is the entire
group we are interested in.
3. Sample: A sample is a group of subjects selected from a
population.
4. Variable: A variable is any characteristic of an
individual that can take different values.
5. Data: Data are a collection of numbers or facts used
as a basis for drawing conclusions.
6. Raw Data: Raw data are data collected in their original
form that have not been organized.
7. Parameter: A parameter is a measure computed from
population data.
8. Statistic: A statistic is a measure computed from
sample data.
9. Outlier: An outlier is an observation that is
distant from most of the other data values.
10. Inferential Statistics: Generalizing (inferring) from a sample to
a population in the form of estimation, hypothesis testing,
and determining relationships between variables.
Types of Data
Qualitative (categorical) data
Data that represent characteristics, such as a person's gender or
blood type.
Quantitative (numerical) data
Data that assume numerical values, such as weight or blood
pressure.
Quantitative (Numerical) variables
Descriptive Statistics
Part 1: Describing Data with Tables and Graphs
Figure 2: Bar chart of blood type
Describing Numerical Data with Tables and Graphs
If a discrete variable takes only a few different values, the
data can be summarized in the same way as qualitative data.
Example: Consider the following data: quiz scores out of 4
for 30 students.
0, 1, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0, 0,
4, 4, 4, 4, 4
Different values are: 0, 1, 2, 3, 4
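The frequency table behind the charts for these quiz scores can be sketched in a few lines of Python. This is only an illustration; the variable names (`scores`, `freq_table`) are assumptions and not part of the course material.

```python
from collections import Counter

# Quiz scores out of 4 for the 30 students in the example.
scores = [0, 1, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
          0, 0, 0, 0, 4, 4, 4, 4, 4]

counts = Counter(scores)   # frequency of each distinct score
n = len(scores)

# One row per distinct score: (score, frequency, percentage of all students).
freq_table = [(value, counts[value], round(100 * counts[value] / n, 1))
              for value in sorted(counts)]

for value, freq, pct in freq_table:
    print(f"{value}  {freq:2d}  {pct:5.1f}%")
```

The frequencies give the bar heights of a bar chart, and the percentages give the slice sizes of a pie chart.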
Figure: Bar chart of Score (horizontal axis: Score, 0 to 4; vertical axis: Count)
Figure: Pie chart of Score (categories 0 to 4; slices of 16.7%, 20.0%, 30.0%, 16.7%, and 16.7%)
Dot Plot
Figure: Dotplot of Score (horizontal axis: Score, 0 to 4)
Frequency Distribution for Grouped Data
If the discrete data have a lot of different values or the
variable is continuous, then the data must be grouped into
classes before the table of frequencies can be formed.
Classes Frequency
100-104 2
105-109 8
110-114 18
115-119 13
120-124 7
125-129 1
130-134 1
Total 50
Frequency distribution terms:
*Class limits: the smallest and largest data values that can
belong to each class.
*Lower class limit: the smallest value that can belong to a
given class.
*Upper class limit: the largest value that can belong to the
class.
*Class width: the difference between the lower limits of two
consecutive classes (equivalently, the upper class boundary
minus the lower class boundary). For the classes 100-104 and
105-109 above, the width is 105 - 100 = 5.
*Class boundaries: the values that separate one class in a
grouped frequency distribution from the next. The boundaries
have one more decimal place than the raw data and therefore do
not appear in the data. There are no gaps between the classes.
If the observations are given to the nearest integer, subtract
0.5 from the lower class limit to get the lower class boundary
and add 0.5 to the upper class limit to get the upper class
boundary.
*Class midpoint: the number in the middle of the class. It is
found by adding the lower and upper limits and dividing by
two.
Guidelines for classes
1. The number of classes is usually between 5 and 20.
2. The class width should be an odd number. For integer data this
guarantees that the class midpoints are integers instead of
decimals.
3. The classes must be mutually exclusive: no data value can
fall into two different classes.
4. The classes must be all-inclusive: every data value must
belong to some class.
5. The classes must be continuous: there are no gaps in a
frequency distribution.
6. The classes must be equal in width. The exceptions are
the first and last classes, which may be open-ended ("below
…" or "… and above"); this is often used with ages.
STEPS
1. Find the largest and smallest values.
2. Compute the range = maximum value - minimum value.
3. Select the number of classes desired, usually between 5
and 20.
4. Find the class width by dividing the range by the number
of classes and rounding up.
5. Select a starting point for the lowest class limit (the
smallest value or any convenient number).
6. Add the class width to the lower class limit of the first
class to get the lower class limit of the second class; keep
adding the class width until you have all the classes.
7. For integer data, subtract one from the lower limit of the
second class to get the upper limit of the first class, then
keep adding the class width to get all the upper limits.
8. Tally the data.
9. Find the numerical frequencies from the tallies.
Example: The following data represent the record
high temperatures in °F for each of the 50 states.
Construct a grouped frequency distribution for the data
using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 110 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
The procedure for constructing a grouped frequency
distribution for the above numerical data is as follows:
Step 1: Take the number of classes = 7.
Step 2: Largest value = 134, smallest value = 100,
range = 134 – 100 = 34.
Step 3: Class width = range / number of classes
= 34/7 ≈ 4.9, rounded up to 5.
Step 4: Start with 100 as the lower class limit of the first class.
Step 5: Add the class width 5 to get the lower class limit of
the second class (105), and keep adding the class
width until there are 7 classes.
Step 6: Subtract one from the lower limit of the second class
(105) to get the upper limit of the first class (104),
then keep adding the class width 5 to get all the
upper limits.
Step 7: Tally the data.
Step 8: Find the numerical frequencies from the tallies.
[See table 3]
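The class-construction steps above can be sketched in Python. This is a minimal sketch assuming integer data; the function names `build_classes` and `tally` are hypothetical helpers introduced for illustration, and the tally is shown only for the first ten temperatures of the data set.

```python
import math

def build_classes(smallest, largest, num_classes):
    """Return (width, classes); classes is a list of (lower, upper) limits.

    Assumes integer data, so each upper limit is one less than the next
    lower limit. If the range is an exact multiple of num_classes, the
    largest value may not fit and a wider class width would be needed.
    """
    data_range = largest - smallest
    width = math.ceil(data_range / num_classes)   # divide and round up
    classes = []
    lower = smallest
    for _ in range(num_classes):
        classes.append((lower, lower + width - 1))
        lower += width
    return width, classes

def tally(data, classes):
    """Count how many data values fall into each class."""
    freqs = [0] * len(classes)
    for x in data:
        for i, (lo, hi) in enumerate(classes):
            if lo <= x <= hi:
                freqs[i] += 1
                break
    return freqs

width, classes = build_classes(100, 134, 7)
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in classes]   # class boundaries
midpoints = [(lo + hi) / 2 for lo, hi in classes]           # class midpoints

first_ten = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112]
first_ten_freqs = tally(first_ten, classes)
```

Passing all 50 values to `tally` in the same way yields the full frequency column.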
Graphical Representation
The frequency distribution of grouped data can be displayed
like a bar graph, using vertical bars whose heights represent
the frequencies. If the class limits are used, there are
gaps between the rectangles on the horizontal axis.
Using the class boundaries instead leaves no gaps between the
rectangles on the horizontal axis; this graph is called a
histogram.
A histogram graphically summarizes center, spread, skewness,
and outliers.
Skewness (lack of symmetry) of data: how concentrated the data
are at the low or high end of the scale.
Kurtosis (peakedness) of data: how concentrated the data are
around a single value.
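Example: The following table represents the frequency
distribution of the temperature data using class boundaries:
Class boundaries Frequency
99.5-104.5 2
104.5-109.5 8
109.5-114.5 18
114.5-119.5 13
119.5-124.5 7
124.5-129.5 1
129.5-134.5 1
Total 50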
Figure: Histogram of Temperature (horizontal axis: Temperature, with bars centered at the class midpoints 102 to 132; vertical axis: Frequency)
Frequency Polygon: A line graph. The frequency is placed
along the vertical axis and the class midpoints are placed
along the horizontal axis; the plotted points are connected
with line segments.
Example: Consider the following frequency table:
Cumulative Frequency Polygon:
*The cumulative frequency of a class is the number of data
values up to and including those in that class.
*It is drawn as a line graph (rather than a bar graph): plot a
point at each upper real class limit, with height equal to
the corresponding cumulative frequency.
Real Limits   Freq.   Upper Real Class Limit   Cum. Freq.   CF%
9.5-19.5      0       19.5                     0            0%
19.5-29.5     23      29.5                     23           32.4%
29.5-39.5     21      39.5                     44           62.0%
39.5-49.5     21      49.5                     65           91.5%
49.5-59.5     4       59.5                     69           97.2%
59.5-69.5     1       69.5                     70           98.6%
69.5-79.5     1       79.5                     71           100%
79.5-89.5     0       89.5                     71           100%
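The cumulative columns of the table can be reproduced with a short Python sketch. The variable names are assumptions chosen for illustration.

```python
from itertools import accumulate

# Upper real class limits and class frequencies from the table.
upper_real_limits = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
freqs = [0, 23, 21, 21, 4, 1, 1, 0]

cum_freqs = list(accumulate(freqs))     # running total of the frequencies
total = cum_freqs[-1]                   # number of observations
cum_pcts = [round(100 * cf / total, 1) for cf in cum_freqs]

# Points of the cumulative frequency polygon: (upper real limit, cum. freq.)
polygon_points = list(zip(upper_real_limits, cum_freqs))
```

Plotting `polygon_points` and joining them with line segments gives the cumulative frequency polygon.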
Frequency Curve
A smooth curve which corresponds to the limiting case of a
histogram computed for a frequency distribution of a
continuous distribution as the number of data points becomes
very large.
A stem-and-leaf plot is a data plot that uses part of a data value
as the stem and the rest of the data value as the leaf to form groups
or classes. It is very useful for sorting data quickly. It shows
the same shape as a histogram but preserves the original data points.
Stem Leaf
1 23
2 17
3 3457
4 001
*The "stem" is the left-hand column, which contains the tens
digits.
*The "leaves" are the lists in the right-hand column, showing
all the ones digits for each of the tens, twenties, thirties, and
forties.
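A stem-and-leaf plot like the one above can be built with a small Python sketch. The data list is read back from the plot (12, 13, 21, ...), and the function name `stem_and_leaf` is a hypothetical helper; the sketch assumes two-digit non-negative integers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by tens digit (stem), keeping ones digits as leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # tens digit, ones digit
        groups[stem].append(leaf)
    return dict(groups)

data = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    print(stem, "|", "".join(str(leaf) for leaf in plot[stem]))
```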
Shapes of Data Distributions:
Symmetric bell-shaped
Example: Men’s Heights
Figure: Histogram of a symmetric, bell-shaped distribution
Uniform – All data values are equally
represented.
Figure: Histogram of a uniform distribution
Descriptive Statistics
Part 2
Describing Data with
Numerical Measures
*Measures of Central Tendency
-Mean
-Median
-Mode
*Measures of Dispersion
-Range
-Variance
-Standard deviation
-Coefficient of variation (C.V)
*Measures of Position
-Percentiles
-Quartiles
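Measures of Central Tendency
1- Mean:
Sample data set
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n = number of items in the sample.
Population data set
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N = number of items in the population.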
Example: Consider the following sample data:
46, 54, 42, 46, 32
$$\bar{x} = \frac{46 + 54 + 42 + 46 + 32}{5} = \frac{220}{5} = 44$$
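Weighted Mean:
In this case, we attach a numerical weight (w)
to each value and calculate the mean as
follows:
$$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$$
Trimmed mean: the mean computed after removing
(trimming) a fixed number or percentage of the smallest
and largest values, which reduces the influence of
outliers.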
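The three kinds of mean can be compared with a short Python sketch. The data are the five sample values from the mean example; the weights and the choice of trimming one value from each end are hypothetical and used only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of x*w divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [46, 54, 42, 46, 32]
weights = [1, 2, 1, 2, 1]      # hypothetical weights, not from the notes

plain = mean(data)                       # (46+54+42+46+32)/5
weighted = weighted_mean(data, weights)  # 320/7
trimmed = trimmed_mean(data, 1)          # drops 32 and 54
```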
2- Median:
Median: the value that separates the
largest 50% of data values from the lowest
50%.
To calculate the median, place the data values in
numerical order. If n is odd, the middle value is
the median. If n is even, the mean of the two
middle values is the median.
In general:
Arrange the data from smallest to largest.
If n is odd, the median is the observation in
position (n+1)/2.
If n is even, the median is the average of the
observations in positions n/2 and (n/2)+1.
Example 1:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15,
15, 16, 18, 20
Determine the median.
Arrange the 16 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18,
20, 26
Since n = 16 is even, the median is the average of the
observations in positions n/2 = 16/2 = 8
(which is 14) and (n/2)+1 = (16/2)+1 = 9 (which
is 15):
$$\text{Median} = \frac{14 + 15}{2} = 14.5$$
Example 2:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18
Determine the median.
Arrange the 15 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18, 26
Since n = 15 is odd, the median is the
observation in position (n+1)/2 = (15+1)/2 = 8,
which is 14.
Median = 14
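The position rule for the median can be checked with a short Python sketch; the function name `median` and the list names are assumptions for illustration.

```python
def median(values):
    """Median by the position rule: (n+1)/2 for odd n,
    average of positions n/2 and n/2 + 1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # 1-based position (n+1)/2 is index (n+1)//2 - 1
        return ordered[(n + 1) // 2 - 1]
    # 1-based positions n/2 and n/2 + 1 are indices n//2 - 1 and n//2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

example1 = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18, 20]
example2 = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18]

median1 = median(example1)   # n = 16, even
median2 = median(example2)   # n = 15, odd
```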
3. Mode:
The mode is the most frequent data value.
There may be no mode if no one value appears
more than any other. There may also be two
modes (bimodal), or more than two modes
(multi-modal).
Useful for qualitative variables.
Example 1: 3, 7, 8, 8, 1
Mode = 8
Example 2: 1, 1, 9, 8, 6, 5, 10, 8
Modes = 1 and 8 (bimodal)
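A sketch of finding the mode (or modes) in Python follows. The helper name `modes` is hypothetical, and returning an empty list when every value is equally frequent is one way to encode "no mode".

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency.

    Returns [] when there are several distinct values and none of them
    appears more often than the others (no mode).
    """
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

modes_example1 = modes([3, 7, 8, 8, 1])
modes_example2 = modes([1, 1, 9, 8, 6, 5, 10, 8])
no_mode = modes([1, 2, 3])
blood_type_mode = modes(["A", "O", "A", "B"])   # works for qualitative data
```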
Which measures to use?
*The mean is the most commonly used measure of
central tendency.
*One drawback of the mean is that it is
heavily influenced by a few very high or
very low data values (outliers or
skewness). In these cases it is more
common to use the median.
*The mode has the advantage that it can be
used even when the data are qualitative.
A disadvantage is that a
data set may not have a mode.
*For unimodal distributions that are
almost symmetric, the best measure of
location is the mean. Moreover, the
mean, median, and mode are almost the
same.
*For unimodal distributions that have
moderate skewness, the median is the most
appropriate measure of location.
Mean for Grouped Data
Example: Frequency Distribution for Temperature Data
Classes    Frequency (f)   Midpoint (x)   f·x
100-104    2               102            204
105-109    8               107            856
110-114    18              112            2016
115-119    13              117            1521
120-124    7               122            854
125-129    1               127            127
130-134    1               132            132
Total      n=50                           5710
$$\text{mean} = \frac{\sum f \cdot x}{n} = \frac{5710}{50} = 114.2$$
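The grouped-data mean can be recomputed in Python from the class limits and frequencies of the table. The variable names are assumptions for illustration.

```python
# Class limits and frequencies from the temperature frequency table.
classes = [(100, 104), (105, 109), (110, 114), (115, 119),
           (120, 124), (125, 129), (130, 134)]
freqs = [2, 8, 18, 13, 7, 1, 1]

midpoints = [(lo + hi) / 2 for lo, hi in classes]      # x for each class
n = sum(freqs)                                         # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # sum of f*x
grouped_mean = sum_fx / n
```

Because every value in a class is replaced by the class midpoint, the result approximates the mean of the raw data rather than reproducing it exactly.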
Measures of Dispersion or Variability
How spread out are the data?
Example: Quiz Scores: 3, 3, 4, 4, 4, 4, 4,
4, 5, 5, 5
Example: Quiz Scores: 1, 3, 4, 5, 6, 6, 7,
8, 8, 9, 10
Figure: Dotplot of score vs group (horizontal axis: score, 2 to 10; one row of dots per group)
Range:
The range for a set of data is the difference
between the largest and smallest values.
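Variance:
The variance is a measure of the amount that a
set of data varies about its mean.
The sum of the deviations from the mean for
any given sample is always zero:
$$\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n}\bar{x} = n\,\frac{\sum_{i=1}^{n} x_i}{n} - n\bar{x} = n\bar{x} - n\bar{x} = 0$$
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
Dividing by (n-1) makes the estimate unbiased.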
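Sample Standard Deviation: $s = \sqrt{s^2}$
Computational formula for the sum of squared deviations:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$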
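Example: Consider a sample of size n=5 with
data values as follows:
10, 20, 12, 17, 16
Compute the sample variance $s^2$.
Note: $\sum_{i=1}^{5} x_i = 75$, $\bar{x} = 15$
X          X - X̄         (X - X̄)^2
10         10-15=-5       25
20         20-15=5        25
12         12-15=-3       9
17         17-15=2        4
16         16-15=1        1
Total=75   0              64
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{64}{5-1} = 16$$
Alternatively, with $\sum_{i=1}^{5} x_i = 75$, $\bar{x} = 15$, and $\sum_{i=1}^{5} x_i^2 = 1189$:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{1189 - \frac{(75)^2}{5}}{5-1} = \frac{64}{4} = 16$$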
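Both routes to the sample variance in this example can be checked with a short Python sketch. The function names are hypothetical; the first follows the definition, the second the computational formula.

```python
import math

def sample_variance(values):
    """Definition: sum of squared deviations from the mean over (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computational formula: (sum x^2 - (sum x)^2 / n) over (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

data = [10, 20, 12, 17, 16]
s2 = sample_variance(data)
s2_shortcut = sample_variance_shortcut(data)
s = math.sqrt(s2)                 # sample standard deviation
deviation_sum = sum(x - sum(data) / len(data) for x in data)  # should be 0
```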
COEFFICIENT OF VARIATION (CV)
The coefficient of variation expresses the
standard deviation as a percentage of the mean:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\%$$
Remark: It is useful for comparing the variability of
data sets measured in different units.
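Example:
                     Sample 1      Sample 2
Age                  25 years      11 years
Mean weight          145 pounds    80 pounds
Standard deviation   10 pounds     10 pounds
Solution:
For the 25-year-olds:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\% = \frac{10}{145} \times 100\% = 6.9\%$$
For the 11-year-olds:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\% = \frac{10}{80} \times 100\% = 12.5\%$$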
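The two coefficients of variation in this example can be reproduced with a one-line function in Python; the name `coefficient_of_variation` is a hypothetical helper.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_25_year_olds = coefficient_of_variation(10, 145)
cv_11_year_olds = coefficient_of_variation(10, 80)

print(f"25-year-olds: {cv_25_year_olds:.1f}%")
print(f"11-year-olds: {cv_11_year_olds:.1f}%")
```

Although both samples have the same standard deviation, the weights of the 11-year-olds vary more relative to their mean.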
Theorem Involving Standard Deviation:
Chebychev's Theorem
Applies to any data set.
Given a number k greater than 1 and a set of n
observations, the proportion of data values that
must be within k standard deviations of the mean is
at least
$$1 - \frac{1}{k^2}$$
For example:
$$k = 2: \quad 1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75$$
At least 75% of the data lie in the interval
$(\bar{x} - 2s, \bar{x} + 2s)$.
$$k = 3: \quad 1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} \approx 0.89$$
At least 88.9% (about 89%) of the data lie in the interval
$(\bar{x} - 3s, \bar{x} + 3s)$.
Example: Given $\bar{x} = 70$, $s = 10$, the percentage of the data in the interval
$$(\bar{x} - 2s, \bar{x} + 2s) = (70 - 2(10), 70 + 2(10)) = (50, 90)$$
is at least 75%.
The standard deviation of a data set is an
important quantity because it limits the
proportion of data values that can be very far
(high or low) from the average.
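The bound and the interval in the example can be computed with a short Python sketch. The function names `chebyshev_bound` and `chebyshev_interval` are hypothetical helpers.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, std_dev, k):
    """Interval (mean - k*s, mean + k*s) to which the bound applies."""
    return (mean - k * std_dev, mean + k * std_dev)

bound_k2 = chebyshev_bound(2)
bound_k3 = chebyshev_bound(3)
interval = chebyshev_interval(70, 10, 2)
```

The bound depends only on k, so it holds for any data set with that mean and standard deviation, whatever its shape.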
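The Empirical Rule:
Applies only to bell-shaped distributions.
Approximately 68% of the data lie within one standard
deviation of the mean, approximately 95% within two, and
approximately 99.7% within three.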
Measures of Position
Percentiles
Percentiles – divide a data set into 100 parts.
For example, the 36th percentile is the value
which separates the lowest 36% of data values
from the highest 64% of data values and is
denoted by P36
A percentile is a numerical measure that
locates values of interest in a data set.
The p-th percentile of a data set is a value such
that at least p% of the observations are at or below this
value, and at least (100-p)% are at or above
this value.
The position of the p-th percentile in the ordered data is
$$q = \left(\frac{p}{100}\right)(n+1)$$
where p is the percentile of interest and n is
the number of data values.
Example:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15,
15, 16, 18, 20
Determine the 90th percentile.
Step 1: Arrange the 16 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18,
20, 26.
Step 2: Compute q as follows:
$$q = \frac{90}{100} \times 17 = 15.3$$
Step 3: Interpolate between the neighbouring ordered values,
where L is the integer part of q and X(L) is the L-th
smallest value:
p-th percentile = X(q) = X(L) + (q-L)(X(L+1)-X(L))
90th percentile =
X(15.3) = X(15) + (15.3-15)(X(16)-X(15))
= 20 + (0.3)(26-20)
= 20 + 1.8 = 21.8
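The three steps can be sketched as a Python function. The name `percentile` is a hypothetical helper, and returning the smallest or largest value when the position falls outside the data is an assumption that the notes do not cover.

```python
def percentile(values, p):
    """p-th percentile using the position q = (p/100)(n+1) with
    linear interpolation between neighbouring ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    q = p * (n + 1) / 100
    lower = int(q)            # L: integer part of q (1-based position)
    frac = q - lower          # q - L
    if lower < 1:             # position before the first value (assumption)
        return ordered[0]
    if lower >= n:            # position at or past the last value (assumption)
        return ordered[-1]
    # X(L) + (q - L) * (X(L+1) - X(L)), shifted to 0-based indexing
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18, 20]
p90 = percentile(data, 90)
p50 = percentile(data, 50)    # agrees with the median of the same data
```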
Ex: If your doctor tells you your 3 year old is
in the 50th percentile for height and the 35th
percentile for weight, what does that mean?
Box-Whisker Plot
A graphical display of the five-number
summary:
Minimum Value
Q1: First Quartile
Median
Q3: Third Quartile
Maximum Value
Box plots are an excellent tool for
conveying location and variation
information in data sets, particularly for
detecting and illustrating location and
variation changes between different groups
of data.
*A box plot provides a quick visual summary that
easily shows center, spread, range,
skewness, and any outliers.
*Whisker lines: the lines that mark
the range of the data by connecting
the smallest and largest values
(excluding outliers) to the box.
Another measure of variability:
Interquartile Range: IQR = Q3 - Q1
*Note that the box (IQR) contains the middle
50% of the values.
Box plots and Skewness
Box plots and Outliers
Outliers: values above Q3 + 3×IQR or below
Q1 - 3×IQR are outliers.
Suspected outliers are slightly more central
versions of outliers: values above Q3 + 1.5×IQR or
below Q1 - 1.5×IQR are suspected outliers.
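The outlier rules can be sketched in Python. This sketch assumes the quartiles are computed as the 25th and 75th percentiles with the position formula q = (p/100)(n+1); the function names and the extra value 60 added in the second call are hypothetical and used only to show a flagged point.

```python
def percentile(values, p):
    """p-th percentile at position q = (p/100)(n+1), with interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    q = p * (n + 1) / 100
    lower = int(q)
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + (q - lower) * (ordered[lower] - ordered[lower - 1])

def classify_outliers(values):
    """Return (Q1, Q3, IQR, suspected outliers, outliers)."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    # Outliers: beyond 3 x IQR from the quartiles.
    outliers = [x for x in values if x > q3 + 3 * iqr or x < q1 - 3 * iqr]
    # Suspected outliers: beyond 1.5 x IQR but not beyond 3 x IQR.
    suspected = [x for x in values
                 if x not in outliers
                 and (x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr)]
    return q1, q3, iqr, suspected, outliers

data = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18, 20]
q1, q3, iqr, suspected, outliers = classify_outliers(data)

# Same data with one hypothetical extreme value appended.
_, _, _, suspected_60, outliers_60 = classify_outliers(data + [60])
```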
Example: Assume that someone gave you the
following data that give the strength of the left
hand (arm1) and the right hand (arm2) of a group
of persons.
Left Arm: 20, 23, 24, 30, 21, 22, 33, 44, 33, 22,
33, 43, 54, 34, 22, 11, 15, 23, 34, 22, 11, 12, 18,
19, 20, 42, 41
Right Arm: 12, 15, 23, 32, 43, 55, 54, 44, 48, 49,
50, 12, 55, 56, 57, 50, 49, 43, 48, 46, 30, 27, 33,
55, 59, 52, 51, 53, 54, 47
Figure: Side-by-side box plots of left and right arm strength data (horizontal axis: Left.Arm, Right.Arm; vertical axis: Data, 10 to 60)
Z-SCORES