Probability and Statistics For Engineers
HARAMAYA UNIVERSITY
COLLEGE OF COMPUTING AND INFORMATICS
DEPARTMENT OF STATISTICS
MILLION WESENU(ASSIST.PROF.)
A 5
B 6
AB 4
O 9
Total 24
Ungrouped FD (Frequency Array)
A FD of numerical data (quantitative) in which each
value of a variable represents a single class (i.e. the
values of the variable are not grouped) and the number
of times each value repeats represents the frequency of
that class.
Eg:-Number of children for 21 families.
2 3 5 4 3 3 2
3 1 0 4 3 2 2
1 1 1 4 2 2 2
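The frequency array for this example can be tallied as in the minimal Python sketch below. The variable names (`children`, `freq`) are illustrative and not part of the original slides.

```python
from collections import Counter

# Number of children in 21 families (data from the example above)
children = [2, 3, 5, 4, 3, 3, 2,
            3, 1, 0, 4, 3, 2, 2,
            1, 1, 1, 4, 2, 2, 2]

# Each distinct value is its own class; its count is the class frequency
freq = dict(sorted(Counter(children).items()))

print("Children  Frequency")
for value, count in freq.items():
    print(f"{value:>8}  {count:>9}")
print(f"{'Total':>8}  {sum(freq.values()):>9}")
```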
Grouped (Continuous) FD
A frequency distribution of numerical data in which several values of a
variable are grouped into one class.
The number of observations belonging to the class is the
frequency of the class.
It is used for continuous variables measured on a ratio or interval scale.
Basic Terms In Grouped FD
Class Limits(CL): the lowest and highest values that can be
included in a class are called class limits.
The lowest values are called lower class limits(LCL) and the
highest values are called upper class limits(UCL).
Class Boundaries: the class limits adjusted so that there is no gap
between the UCL of one class and the LCL of the next
class.
The lowest values are called lower class boundaries(LCB) and the
highest values are called upper class boundaries (UCB).
Class Width: the difference between UCB and LCB of a class.
It is also the difference between the lower limits of two
consecutive classes or it is the difference between upper limits of
two consecutive classes.
Class Mark: is the half way between the class limits or the class boundaries.
Absolute Frequency (AF) and Relative Frequency (RF)
A frequency distribution is a summary table in which the original data are
condensed into groups and their frequencies; the table of raw counts is called
the AF distribution.
If a researcher would like to know the proportion or percentage
of cases in each group, instead of simply the number of cases, s/he
can do so by constructing a relative frequency distribution table.
The RF distribution is formed by dividing the frequency in
each class of the frequency distribution by the total number of
observations.
Percentage frequency = RF × 100.
The RFs are particularly helpful when comparing two or more
frequency distributions in which the numbers of cases under
investigation are not equal.
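As a rough illustration, the sketch below converts the four-class table at the top of this section (classes A, B, AB, O with frequencies 5, 6, 4, 9) into relative and percentage frequencies. The dictionary names are assumptions made for the example.

```python
# Absolute frequencies from the four-class table at the start of this section
freq = {"A": 5, "B": 6, "AB": 4, "O": 9}
total = sum(freq.values())

# Relative frequency = class frequency / total number of observations
rel_freq = {cls: f / total for cls, f in freq.items()}
# Percentage frequency = relative frequency * 100
pct_freq = {cls: 100 * rf for cls, rf in rel_freq.items()}

for cls in freq:
    print(f"{cls:>3}  AF={freq[cls]:>2}  RF={rel_freq[cls]:.3f}  PF={pct_freq[cls]:.1f}%")
```

Because the relative frequencies always sum to 1, two tables built from different numbers of cases can be compared directly.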
Cumulative Frequency
The above RF/PF distributions do not tell us directly the total
number (percentage) of units that lie below or above the
specified values of the classes.
A cumulative frequency distribution displays the total
number of observations above (below) a certain value.
When the interest of the investigator focuses on the number of
items below a specified value, the specified value is taken as the
upper boundary of a class, and the resulting table is known as the
less than cumulative frequency distribution (LCF).
Similarly, when the interest lies in finding the number of cases
above a specified value, the value is taken as the lower
boundary of a class, and the resulting table is known as the
more than cumulative frequency distribution (MCF).
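A small sketch of both cumulative distributions, using the frequency array of the "number of children" example (values 0 to 5). Since that example is ungrouped, each value stands in for a class here: LCF counts families with at most that many children, and MCF counts families with at least that many. For grouped data the same running sums are attached to the upper and lower class boundaries.

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5]   # number of children
freq = [1, 4, 7, 5, 3, 1]     # frequency of each value (total 21)

# Less than cumulative frequency: running total from the first class
lcf = list(accumulate(freq))
# More than cumulative frequency: running total from the last class
mcf = list(accumulate(reversed(freq)))[::-1]

for v, f, lo, hi in zip(values, freq, lcf, mcf):
    print(f"value={v}  f={f}  LCF={lo:>2}  MCF={hi:>2}")
```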
Construction of Grouped FD
EXAMPLE
Consider the marks (out of 40) of 50 students:
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22
12 22 29 18 22 28 25 7 17 22 28 19 23 23 22 3 19 13 31 23
28 24 9 20 33 30 23 20 8 21 24
Construct grouped frequency distribution.
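One possible construction is sketched below. The slides do not show how the number of classes K was chosen, so the sketch assumes Sturges' rule (K = 1 + 3.322 log10 N, rounded up) and a class width W = range / K rounded up to a whole mark; both choices are assumptions, not necessarily the author's method.

```python
import math

marks = [16, 21, 26, 24, 11, 17, 25, 26, 13, 27, 24, 26, 3, 27, 23, 24, 15, 22, 22,
         12, 22, 29, 18, 22, 28, 25, 7, 17, 22, 28, 19, 23, 23, 22, 3, 19, 13, 31, 23,
         28, 24, 9, 20, 33, 30, 23, 20, 8, 21, 24]

n = len(marks)
# Assumption: Sturges' rule for the number of classes, rounded up
k = math.ceil(1 + 3.322 * math.log10(n))
data_range = max(marks) - min(marks)
# Class width, rounded up to the unit of measurement (1 mark)
w = math.ceil(data_range / k)

table = []
lcl = min(marks)              # lower class limit of the first class
for _ in range(k):
    ucl = lcl + w - 1         # upper class limit (data recorded in whole marks)
    table.append({
        "limits": (lcl, ucl),
        "boundaries": (lcl - 0.5, ucl + 0.5),   # close the 1-mark gap between classes
        "mark": (lcl + ucl) / 2,                # class mark (midpoint)
        "freq": sum(lcl <= x <= ucl for x in marks),
    })
    lcl += w

for row in table:
    print(row["limits"], row["boundaries"], row["mark"], row["freq"])
```

With these assumptions the classes are 3-7, 8-12, ..., 33-37, and the width equals UCB minus LCB (7.5 - 2.5 = 5) as defined earlier.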
Properties of Classes (Class Boundaries)
i. Complete and non-overlapping:
Complete- the classes should cover the whole data set.
Non-overlapping- no observation should belong to two classes.
ii. Clear and properly set: The class width (W) and the number of classes (K)
should be calculated properly, and W should be the same for all classes.
iii. Standardized: Classes should follow a logical (increasing) order.
The number of classes should be between 5 and 20, i.e. 5≤K≤20.
K depends on the number of observations N: the larger N is, the larger
K is.
iv. Continuous: Even if there are no values in a class, the class
must be included in the frequency distribution.
Advantages And Disadvantages Of Frequency Distributions
Advantages
It condenses a large mass of data into a comparatively small table.
It attracts the attention of even a layperson and gives an insight into
the nature of the distribution.
It supports further statistical analysis of the data, such as measures of
central tendency, scatter and symmetry.
Disadvantages
The identity of the observations is lost. We know only the
number of observations in a class and do not know what the
values are.
Because the selection of the class width and the lower class limit
of the first class are to a certain extent arbitrary, different
frequency distributions may be constructed for the same data
and hence may give contradictory impressions.
Data Presentation:-Graphic Display of Data
Bar Chart:
o It is the simplest and most commonly used diagrammatic
representation of a frequency distribution.
o It is the most common presentation for nominal, categorical or
discrete data.
o It uses a series of separated and equally spaced bars.
o The heights of the bars represent the frequency or relative
frequency of the classes.
o The width of the bars has no meaning; however, all the bars
should have the same width to avoid distortion, and the bars
are separated by a constant distance.
The Three Types of Bar Chart
i. Simple Bar Chart: is a diagram in which categories of a variable are
marked on the X axis and the frequencies of the categories are
marked on the Y axis.
It is applicable for discrete variables, that is, for data given
according to some period, places and timings.
These periods and timings are represented on the base line (X
axis) at regular interval and the corresponding frequencies are
represented on the Y-axis.
The width of the bar represents nothing (it is meaningless), but it
should be equal for all bars.
Each bar is separated by an equal space.
It can also represent some magnitude (on the Y axis) over time,
space, groups, etc (on the X axis).
ii. Component Bar Chart: is used to show how a total or
aggregate is divided into its component parts.
The bars represent total value of a variable with each total
broken into its component parts and different colors are used
for identification.
In such type of diagrams, a bar is subdivided into parts in
proportion to the size of the subdivision.
These subdivided rectangles are shaded differently by lines, dots
and colors so that the components are easy to compare.
For making meaningful comparisons, the components of the
attributes are reduced to percentages.
In that case each attribute will have 100 as its maximum value.
This sort of component bar chart is known as a percentage bar chart.
iii. Multiple Bar Chart: is used to display data on more than one
variable. In a multiple bar diagram, two or more sets of
interrelated data are represented.
Pie-chart
A pie chart is widely used in practice to show the percentage
breakdown of data.
It is a circle representing a set of data by dividing the circle
into sectors proportional to the number of items in the
categories; equivalently, it is a circle representing the total, cut into
slices in proportion to the size of the parts that make up the total.
It gives the proportional sizes of different data groups as slices
of a pie or a circle.
Histogram
Histogram is the most common graphical presentation of a frequency
distribution for numerical data.
It uses a series of adjacent bars in which the width of each bar represents
the class width and the heights represent the frequency or RF of the class.
It is used for grouped data in which the class boundaries are marked on the
X axis and the frequencies are marked along the Y axis.
Example: Construct A Histogram To The
Following Grouped Data.
Frequency Polygon and Ogive curve
Frequency Polygon is a graph that consists of line segments
connecting the intersection of the class marks and the frequencies
of a continuous frequency distribution.
It can also be constructed from a histogram by joining the
midpoints of the tops of the bars.
Cumulative Frequency Curves (Ogive)
As there are two cumulative frequency distributions, there are two ogive
(pronounced "oh-jive") curves.
The less than ogive is a line graph joining the points that pair each
upper class boundary with its less than cumulative frequency.
The more than ogive is a line graph joining the points that pair each
lower class boundary with its more than cumulative frequency.
Cumulative Frequency
CHAPTER -2
SUMMARIZING OF DATA
2.1. Measures of Central Tendency
A single value which can be considered as typical or representative
of a set of observations, and around which the observations can be
considered as centered, is called an ‘Average’ (or average value or
center of location).
Since such typical values tend to lie centrally within a set of
observations when arranged according to magnitude, averages are
called Measures of Central Tendency.
Objectives of Measures of Central Tendency (MCT)
1. To condense a mass of data in to one single value.
2. To facilitate comparison. Statistical devices like averages,
percentages and ratios are used for this purpose.
Types of Measure of Central Tendency
There are many types of measures of central tendency, each possessing particular
properties and each being typical in some unique way. The most frequently
encountered ones are
I. Computed averages: Mean (Arithmetic Mean, Geometric Mean and
Harmonic Mean)
II. Positional averages: Median and Quantiles (Quartiles, Deciles,
Percentiles)
III. Mode
Desirable properties of good Measures of central tendency
A measure of central tendency is good or satisfactory if it possesses
the following characteristics.
1. It should be calculated based on all observations.
2. It should not be affected by extreme values. It should be as close
to the maximum number of observed values as possible.
The uses of Summation Notation
The Mean and its properties
a. Arithmetic mean(AM)
Simple Arithmetic Mean: the sum of all observations divided by the total number
of observations, $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$.
Weighted Arithmetic Mean:
While calculating the simple arithmetic mean we had given
equal importance to all values.
But there are cases where the relative importance is not the
same for all items.
When this is the case, it is necessary to assign them weights (i.e.
relative importance) and then calculate a weighted arithmetic
mean.
Let X1,X2,…,Xn be the values and W1,W2,…,Wn be the
corresponding weights; then the weighted arithmetic mean,
denoted by $\bar{X}_w$, is given by $\bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$.
Properties Of Arithmetic Mean
The algebraic sum of the deviations of each value from the arithmetic mean
is zero. That is, $\sum_{i=1}^{n}(X_i - \bar{X}) = 0$.
The sum of the squares of the deviations from the mean is less
than the sum of the squares of the deviations about any other
value. That is, $\sum_{i=1}^{n}(X_i - \bar{X})^2 < \sum_{i=1}^{n}(X_i - A)^2$ for $A \neq \bar{X}$.
If a constant C is added to or subtracted from each value in
a distribution, then the new mean will be
$\bar{X}_{new} = \bar{X} \pm C$, respectively.
If each value of a distribution is multiplied by a constant C, the
new mean will be the original mean multiplied by C.
Arithmetic mean is affected by extreme values.
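These properties can be checked numerically. The sketch below uses the small data set 1, 2, 3, 4, 100 from the examples that follow; the comparison value `a = 3` and the constant `c = 10` are arbitrary choices made only for illustration.

```python
def mean(values):
    """Simple arithmetic mean: sum of observations / number of observations."""
    return sum(values) / len(values)

x = [1, 2, 3, 4, 100]
x_bar = mean(x)                                   # 22.0

# Property 1: deviations from the mean sum to zero
dev_sum = sum(xi - x_bar for xi in x)

# Property 2: squared deviations are smallest about the mean
a = 3                                             # any value other than the mean
ss_mean = sum((xi - x_bar) ** 2 for xi in x)
ss_other = sum((xi - a) ** 2 for xi in x)

# Properties 3 and 4: adding or multiplying by a constant
c = 10
shifted_mean = mean([xi + c for xi in x])         # x_bar + c
scaled_mean = mean([xi * c for xi in x])          # x_bar * c

print(dev_sum, ss_mean, ss_other, shifted_mean, scaled_mean)
```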
EXAMPLES
1. Find the arithmetic mean of A) 1, 2, 3, 4, 5. B) 1, 2,
3, 4, 100. Is there a great difference between the mean
of A and that of B?
2. A teacher attaches weights of 2 to the Quiz, 3 to the Mid-term and 5 to
the Final exam. If a student gets 90, 50 and 60 for Quiz,
Mid-term and Final exam respectively, what is his/her
average academic performance?
3. The mean weight of 50 women workers in a factory is
48 kg. The mean weight of 75 men working in the
same factory is 58 kg. Find the mean weight of all
workers in the factory.
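The three examples can be worked with a few lines of Python. The helper names `mean` and `weighted_mean` are illustrative; example 3 is treated as a weighted mean of the two group means, with the group sizes as weights.

```python
def mean(values):
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 1: the single extreme value 100 pulls the mean far from the bulk of the data
mean_a = mean([1, 2, 3, 4, 5])
mean_b = mean([1, 2, 3, 4, 100])

# Example 2: Quiz, Mid-term, Final with weights 2, 3, 5
avg_score = weighted_mean([90, 50, 60], [2, 3, 5])

# Example 3: combined mean of 50 women (48 kg) and 75 men (58 kg)
combined = weighted_mean([48, 58], [50, 75])

print(mean_a, mean_b, avg_score, combined)
```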
Geometric Mean
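The geometric mean of n positive values X1,X2,…,Xn is the nth root of
their product: $GM = \sqrt[n]{X_1 X_2 \cdots X_n}$.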
If the variable values are measured as ratios, proportions or
percentages and some values are large in magnitude while others are
small, then the geometric mean is a better representative of the
data than the simple average.
In a “geometric series”, the most meaningful average is the
geometric mean.
The disadvantage of GM is that it cannot be calculated if one or
more observations are zero or negative.
It is also affected by extreme values, but less so than the AM.
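A minimal sketch comparing the two averages on the data sets of the first example below. The function computes the geometric mean through logarithms, which is equivalent to taking the nth root of the product; the function name is an assumption of this sketch.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values, computed via logarithms."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in values) / len(values))

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]

gm_a, gm_b = geometric_mean(a), geometric_mean(b)
am_a, am_b = sum(a) / len(a), sum(b) / len(b)

# Replacing 5 by 100 moves the AM from 3 to 22, but the GM only from about 2.6 to about 4.7
print(f"AM: {am_a:.2f} -> {am_b:.2f}   GM: {gm_a:.2f} -> {gm_b:.2f}")
```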
EXAMPLES
1. Find the geometric mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is […]
2. […] 8% from 1990 to 1991 and by 77% from 1991 to 1992. Find the […]
04/27/2025
FOR GROUPED FREQUENCY DISTRIBUTION
EXAMPLES
1. Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515,
find
a) all the quartiles
b) The 1st and 7th deciles
c) The 40th and 75th percentiles
2. Calculate all quartiles, the 5th and 8th deciles, and the 30th and 90th percentiles for the
students' score data below.
SOLUTIONS
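For the first example, one common textbook convention places the i-th quantile at position i(n + 1)/parts in the ordered data (counting from 1) and interpolates linearly between neighbouring values. The slides' own formula did not survive extraction, so this convention is an assumption; other conventions give slightly different answers.

```python
def quantile(sorted_data, i, parts):
    """i-th of `parts` quantiles, using position i*(n+1)/parts with linear interpolation."""
    n = len(sorted_data)
    pos = i * (n + 1) / parts          # 1-based position in the ordered data
    lower = int(pos)
    frac = pos - lower
    if lower < 1:                      # position before the first value
        return sorted_data[0]
    if lower >= n:                     # position at or beyond the last value
        return sorted_data[-1]
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([420, 430, 435, 438, 441, 449, 490, 500, 510, 515])

q1, q2, q3 = (quantile(data, i, 4) for i in (1, 2, 3))
d1, d7 = quantile(data, 1, 10), quantile(data, 7, 10)
p40, p75 = quantile(data, 40, 100), quantile(data, 75, 100)

print(f"Q1={q1:.2f} Q2={q2:.2f} Q3={q3:.2f}")
print(f"D1={d1:.2f} D7={d7:.2f} P40={p40:.2f} P75={p75:.2f}")
```

Under this convention the 75th percentile coincides with the third quartile, as expected.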
MODE AND ITS PROPERTIES
The mode is the most frequently occurring
value in a set of observations
or it is the value with the highest frequency.
A data set may have one mode (uni-modal), two
modes (bi-modal), more than two modes (multimodal) or no mode.
It is a good measure for qualitative variables.
Ungrouped (individual series): Arrange the data in
ascending order and take the value appearing most
frequently (the most frequent value).
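A short sketch of finding the mode(s) of ungrouped data by counting. The rule used here for "no mode" (every distinct value occurs equally often) is one common convention and an assumption of this sketch; the first data set is the "number of children" example from earlier in the notes.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if all distinct values are equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # no value stands out: no mode
    return sorted(v for v, c in counts.items() if c == top)

children = [2, 3, 5, 4, 3, 3, 2, 3, 1, 0, 4, 3, 2, 2, 1, 1, 1, 4, 2, 2, 2]

print(modes(children))                 # uni-modal
print(modes([1, 1, 2, 2, 3]))          # bi-modal
print(modes([1, 2, 3]))                # no mode
print(modes(["A", "O", "O", "B"]))     # works for qualitative data too
```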
Grouped (continuous) series: In a frequency distribution,
the mode is located in the class with highest frequency
and that class is the modal class.
Properties of Mode
It is simple to calculate and easy to determine.
It is not based on all observations.
The mode can be used for both qualitative and
quantitative data types.
Mode is not affected by extreme values.
It can be calculated for distributions with open-ended classes.
Chapter -3
Measures Of Variation/Dispersion
From the measures of central tendency, you have seen that:
The median is a positional average and has nothing to do
with the variability of the observations in a data set.
The mode is the most frequently occurring value, independent of the
other values in the data set.
This leads us to conclude that a measure of central tendency (MCT) is not
enough to give a clear idea about the data unless all
observations are the same.
Moreover, two or more data sets may have the same
mean or median and yet be quite different. So an
MCT alone does not provide enough information about
the nature of the data.
For this reason, measures of variation are
used to describe the extent to which values
scatter around the measures of central tendency.
Thus a measure of dispersion tells us the extent to
which the values of a variable vary about the
measure of central tendency.
Therefore, measures of dispersion deal with the
variability of a data set when the observations
differ either in size or in units.
OBJECTIVES OF MEASURES OF VARIATION
To have an idea about the reliability of the measure of
central tendency.
To compare two or more sets of data with regard to their
variability.
To provide information about the structure of the data.
To pave the way for the use of other statistical measures.
Types of Measures of Variation
There are two types of measures of variation.
1. Absolute measures of variation: A measure is in absolute form
when it shows the actual amount of variation of the items from a
measure of central tendency and is expressed in the same concrete units
in which the data have been expressed.
2. Relative measure of variation: It is the quotient obtained by
dividing the absolute measure by a quantity in respect to which
absolute deviation has been computed.
Range And Relative Range
Range is the simplest and crudest measure of
dispersion. Range is defined as the difference between
the largest and the smallest values in the data.
Range hardly satisfies any property of good measure
of dispersion as it is based on two extreme values only,
ignoring the others.
It is not amenable to further algebraic treatment.
Range for raw (ungrouped) data: R = maximum − minimum, or R = L − S,
where L is the largest and S the smallest value.
Range for grouped data: R = UCL(last) − LCL(first), or CM(last) − CM(first),
or UCB(last) − LCB(first), or R = W × K.
Quartile Deviation and Coefficient of Quartile Deviation
Quartile deviation is sometimes known as Semi-Interquartile Range
(SIR). The interquartile range is Q3 − Q1.
Thus, $QD = \frac{Q_3 - Q_1}{2}$.
The corresponding relative measure of variation, the coefficient of
quartile deviation, is:
$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$.
Mean Deviation and Coefficient of Mean Deviation
Mean deviation is a better measure than range and
quartile deviation.
Mean deviation is the arithmetic mean of the absolute
values of the deviations from some measure of central
tendency, usually the mean or the median of a
distribution.
Hence we have mean deviation about the mean and mean
deviation about the median.
Because the algebraic sum of the deviations from the mean is always
zero (a property of the arithmetic mean), the deviations are taken in
absolute value; it is therefore more accurate to say mean absolute
deviation instead of mean deviation.
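The verbal definition can be written compactly as follows. The symbol $A$ for the chosen average and $\tilde{X}$ for the median are notational assumptions of this sketch, since the slide's own formula did not survive extraction.

```latex
\[
  MD(A) = \frac{1}{n}\sum_{i=1}^{n} \lvert X_i - A \rvert ,
  \qquad A = \bar{X} \ \text{(mean)} \quad \text{or} \quad A = \tilde{X} \ \text{(median)}
\]
```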
VARIANCE AND COEFFICIENT OF VARIATION
The Variance and Standard Deviation are the most superior and
widely used measures of dispersions and both measure the
average dispersion of the observations around the mean.
The Variance of a data set is the sum of the squares of the
deviation of each observation taken from the mean divided
by total number of observations in the data set.
The positive square root of variance is called standard
deviation.
SAMPLE VARIANCE
For a sample of n elements, the sample variance and
standard deviation, denoted by $S^2$ and $S$ respectively,
are calculated using the formulae:
$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$ and $S = \sqrt{S^2}$.
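The sketch below contrasts the two divisors: n for the variance as defined in words earlier (all observations in the data set), and n − 1 for the sample variance, which is the standard sample formula and is assumed here to be what the slide showed. The data are borrowed from the first worked example near the end of this chapter.

```python
import math

data = [20, 28, 40, 12, 30, 15, 50]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)

pop_var = ss / n               # divide by the number of observations
sample_var = ss / (n - 1)      # divide by n - 1 for a sample

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print(f"mean={mean:.3f}")
print(f"population: variance={pop_var:.3f}, SD={pop_sd:.3f}")
print(f"sample:     variance={sample_var:.3f}, SD={sample_sd:.3f}")
```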
Disadvantages Of Variance
The variation of the data is exaggerated because the deviation
(difference) of each value from the mean is squared.
Standard Deviation
Standard deviation is the positive square root of variance.
INTERPRETATION OF THE STANDARD DEVIATION
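If the data are a sample and the distribution is normal (bell-shaped)
or approximately normal, then
the following conclusions can be reached:
About 68% of the observations lie within one standard deviation of the mean ($\bar{X} \pm S$).
About 95% of the observations lie within two standard deviations of the mean ($\bar{X} \pm 2S$).
About 99.7% of the observations lie within three standard deviations of the mean ($\bar{X} \pm 3S$).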
EMPIRICAL RELATIONSHIP BETWEEN QD, MD AND SD
6QD=5MD=4SD
QD=5MD/6 or QD=2SD/3
MD=6QD/5 or MD=4SD/5
SD=3QD/2=1.5QD or SD=5MD/4=1.25MD
NOTE
If there are two or more distributions of different
variables (having different units of measurement), their
variability cannot be compared by comparing the values
of the standard deviation.
COEFFICIENT OF VARIATION (CV)
The coefficient of variation is used when:
The groups have different units of measurement.
The size of the data between the groups is not the same.
Method Of Calculation:
It is a relative measure of standard deviation.
The coefficient of variation is the ratio of the standard
deviation to the mean, expressed as a percent: $CV = \frac{S}{\bar{X}} \times 100\%$.
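To illustrate why the CV is useful, the sketch below compares two made-up data sets (heights in cm and weights in kg; these numbers are hypothetical and not from the slides). Both have the same standard deviation numerically, yet their relative variability differs.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical data in different units of measurement
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_h = coefficient_of_variation(heights_cm)
cv_w = coefficient_of_variation(weights_kg)

print(f"SD heights = {statistics.stdev(heights_cm):.2f} cm, CV = {cv_h:.1f}%")
print(f"SD weights = {statistics.stdev(weights_kg):.2f} kg, CV = {cv_w:.1f}%")
```

The group with the larger CV is the more variable one relative to its own mean, regardless of units.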
EXAMPLE
1. Calculate the R, CR, QD, CQD, MD(mean),
MD(median) and CMD for the following
data: 20, 28, 40, 12, 30, 15, 50.
2. Calculate the R, CR, QD, CQD, MD and CMD for the
following data.
SOLUTIONS FOR EXAMPLE 1
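The original worked solution was not recoverable, so the following is a sketch under stated assumptions: quartiles are located at position i(n + 1)/4 in the ordered data, the coefficient of range is (L − S)/(L + S), and each coefficient of mean deviation divides the MD by the average it was taken about. These are common textbook conventions, not necessarily the ones used on the slides.

```python
def quartile(sorted_data, i):
    """i-th quartile at position i*(n+1)/4 (1-based), linear interpolation; needs n >= 3."""
    n = len(sorted_data)
    pos = i * (n + 1) / 4
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([20, 28, 40, 12, 30, 15, 50])
n = len(data)
largest, smallest = data[-1], data[0]

R = largest - smallest                           # range
CR = (largest - smallest) / (largest + smallest) # coefficient of range (assumed formula)

Q1, median, Q3 = quartile(data, 1), quartile(data, 2), quartile(data, 3)
QD = (Q3 - Q1) / 2                               # quartile deviation
CQD = (Q3 - Q1) / (Q3 + Q1)                      # coefficient of quartile deviation

mean = sum(data) / n
md_mean = sum(abs(x - mean) for x in data) / n       # mean deviation about the mean
md_median = sum(abs(x - median) for x in data) / n   # mean deviation about the median
cmd_mean = md_mean / mean                            # assumed: MD divided by its average
cmd_median = md_median / median

print(f"R={R}, CR={CR:.4f}, QD={QD}, CQD={CQD:.4f}")
print(f"MD(mean)={md_mean:.4f}, MD(median)={md_median:.4f}")
print(f"CMD(mean)={cmd_mean:.4f}, CMD(median)={cmd_median:.4f}")
```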
SOLUTION FOR EXAMPLE 2
Mean=25.64
Q3 =31.07
Median =26.1