Engineering Data Analysis
Rolling an ordinary six-sided die is a familiar example of a random experiment, an action for
which all possible outcomes can be listed, but for which the actual outcome on any given trial
of the experiment cannot be predicted with certainty. In such a situation we wish to assign to
each outcome, such as rolling a two, a number, called the probability of the outcome, that
indicates how likely it is that the outcome will occur. Similarly, we would like to assign a
probability to any event, or collection of outcomes, such as rolling an even number, which
indicates how likely it is that the event will occur if the experiment is performed. This section
provides a framework for discussing probability problems, using the terms just mentioned.
Definition
A random experiment is a mechanism that produces a definite outcome that cannot be
predicted with certainty. The sample space associated with a random experiment is the set of
all possible outcomes. An event is a subset of the sample space.
An event E is said to occur on a particular trial of the experiment if the outcome observed is an
element of the set E.
Example 1
Construct a sample space for the experiment that consists of tossing a single coin.
Solution:
The outcomes could be labeled h for heads and t for tails. Then the sample space is the set
S= {h,t}.
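The definitions above map naturally onto set operations. The following is a minimal sketch in Python, assuming the six-sided die from the introduction and the coin from Example 1; the names sample_space, even, coin, and occurs are illustrative and do not come from the source.

```python
# Sample space for one roll of a six-sided die (the introductory example).
sample_space = {1, 2, 3, 4, 5, 6}

# An event is a subset of the sample space, e.g. "rolling an even number".
even = {2, 4, 6}
assert even <= sample_space  # subset check


def occurs(event, outcome):
    """An event occurs on a trial if the observed outcome is an element of it."""
    return outcome in event


print(occurs(even, 2))  # True: rolling a two is an even outcome
print(occurs(even, 5))  # False: five is not in the event

# Sample space for Example 1 (tossing a single coin).
coin = {"h", "t"}
```

Representing events as sets means that "the event occurs" is simply a membership test on the observed outcome.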
When selecting elements of a set, the number of possible outcomes depends on the conditions
under which the selection has taken place. There are at least 4 rules to count the number of
possible outcomes:
Multiplicative rule
Suppose you have j sets of elements, n1 in the first set, n2 in the second set, ... and nj in the jth
set. Suppose you wish to form a sample of j elements by taking one element from each of the j
sets. The number of possible samples is then given by
n1 × n2 × ... × nj.
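As a quick check of the multiplicative rule, the sketch below compares the product of the set sizes with a brute-force enumeration. The three sets (shirts, trousers, shoes) are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical sets: 3 shirts, 2 trousers, 2 pairs of shoes.
sets = [["s1", "s2", "s3"], ["t1", "t2"], ["p1", "p2"]]

# Multiplicative rule: n1 * n2 * ... * nj.
count_by_rule = prod(len(s) for s in sets)

# Brute force: list every way of taking one element from each set.
samples = list(product(*sets))

print(count_by_rule, len(samples))  # 12 12
```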
Permutation rule
An arrangement of elements in a definite order is called a permutation. Given a single set of n
distinct elements, you wish to select k elements from the n and arrange them
within k positions. The number of different permutations of the n elements taken k at a time is
denoted P_k^n and is equal to
P_k^n = n! / (n - k)! = n (n - 1) ... (n - k + 1).
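The standard permutation count, n! / (n - k)!, can be checked against an explicit enumeration. This is a small sketch with assumed values n = 5 and k = 3; math.perm requires Python 3.8 or later.

```python
from itertools import permutations
from math import factorial, perm

n, k = 5, 3
elements = "abcde"  # n distinct elements (illustrative)

by_formula = factorial(n) // factorial(n - k)           # n! / (n - k)!
by_enumeration = len(list(permutations(elements, k)))   # ordered selections of k

print(by_formula, by_enumeration, perm(n, k))  # 60 60 60
```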
Partitions rule
Suppose a single set of n distinct elements exists. You wish to partition them into
k sets, with the first set containing n1 elements, the second containing n2 elements, ..., and the
kth set containing nk elements, where n1 + n2 + ... + nk = n. The number of different partitions is
n! / (n1! n2! ... nk!).
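The partition count is the standard multinomial coefficient, n! / (n1! n2! ... nk!). A minimal sketch follows; the function name partition_count and the group sizes 2, 2, 1 are assumptions made for illustration.

```python
from math import factorial


def partition_count(sizes):
    """Ways to split sum(sizes) distinct elements into groups of the given sizes."""
    n = sum(sizes)
    count = factorial(n)
    for size in sizes:
        count //= factorial(size)  # each division is exact
    return count


# Five distinct elements split into groups of sizes 2, 2 and 1: 5! / (2! 2! 1!).
print(partition_count([2, 2, 1]))  # 30
```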
Combinations rule
A sample of k elements is to be chosen from a set of n elements, without regard to order. The number of different
samples of k elements that can be selected from n is equal to
C(n, k) = n! / (k! (n - k)!).
The combinations rule is a special case of the partitions rule with two sets: the first holds the
n1 = k selected elements and the second holds the n2 = n - k non-selected elements, since
n = n1 + n2. Because the order in which the selected elements are drawn does not matter, there
are fewer combinations than permutations: each combination of k elements corresponds to k!
permutations. The resulting count is the binomial coefficient that appears in the binomial
theorem.
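The relationship between combinations and permutations can be verified numerically. The sketch below assumes n = 5 and k = 3 and uses the standard formula n! / (k! (n - k)!); math.comb and math.perm require Python 3.8 or later.

```python
from itertools import combinations
from math import comb, factorial, perm

n, k = 5, 3

by_formula = factorial(n) // (factorial(k) * factorial(n - k))
by_enumeration = len(list(combinations(range(n), k)))  # unordered selections

print(by_formula, by_enumeration, comb(n, k))  # 10 10 10

# Each unordered sample of k elements can be arranged in k! orders,
# so permutations = combinations * k!.
print(perm(n, k) == comb(n, k) * factorial(k))  # True
```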
Rules of Probability
Our first rule simply reminds us of the basic property of probability that we’ve already learned.
The probability of an event, which informs us of the likelihood of it occurring, can range
anywhere from 0 (indicating that the event will never occur) to 1 (indicating that the event is
certain).
Motivating question for rule 2: A person in the United States is chosen at random. What is the
probability of the person having blood type A?
Answer: Our intuition tells us that since the four blood types O, A, B, and AB exhaust all the
possibilities, their probabilities together must sum to 1, which is the probability of a “certain”
event (a person has one of these 4 blood types for certain).
Since the probabilities of O, B, and AB together sum to 0.44 + 0.10 + 0.04 = 0.58, the probability
of type A must be the remaining 0.42 (1 – 0.58 = 0.42):
Blood type     O      A      B      AB
Probability    0.44   0.42   0.10   0.04
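The arithmetic above can be reproduced in a few lines. This sketch uses exact fractions to avoid floating-point rounding; the dictionary name known is illustrative, and the probabilities are the ones quoted in the text.

```python
from fractions import Fraction

# Probabilities quoted in the text for blood types O, B and AB.
known = {
    "O": Fraction(44, 100),
    "B": Fraction(10, 100),
    "AB": Fraction(4, 100),
}

# The four blood types exhaust all possibilities, so their probabilities sum to 1.
p_a = 1 - sum(known.values())

print(float(p_a))  # 0.42
```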
This is a good place to compare and contrast what we’re doing here with what we learned in
the Exploratory Data Analysis (EDA) section.
o Notice that in this problem we are essentially focusing on a single categorical variable:
blood type.
o We summarized this variable above, as we summarized single categorical variables in the
EDA section, by listing what values the variable takes and how often it takes them.
o In EDA we used percentages, and here we’re using probabilities, but the two convey the
same information.
o In the EDA section, we learned that a pie chart provides an appropriate display when a
single categorical variable is involved, and similarly we can use it here (using percentages
instead of probabilities):
Even though what we’re doing here is indeed similar to what we’ve done in the EDA section,
there is a subtle but important difference between the underlying situations:
o In EDA, we summarized data that were obtained from a sample of individuals for whom
values of the variable of interest were recorded.
o Here, when we present the probability of each blood type, we have in mind the
entire population of people in the United States, for which we are presuming to know the
overall frequency of values taken by the variable of interest.
An important point to understand here is that “event A does not occur” is a separate event that
consists of all the possible outcomes that are not in A and is called “the complement event of
A.”
Notation: we will write “not A” to denote the event that A does not occur. Here is a visual
representation of how event A and its complement event “not A” together represent all
possible outcomes.
Comment:
Such a visual display is called a “Venn diagram.” A Venn diagram is a simple way to visualize
events and the relationships between them using rectangles and circles.
Rule 3 deals with the relationship between the probability of an event and the probability of its
complement event.
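Given that event A and event “not A” together make up all possible outcomes, and since rule 2
tells us that the sum of the probabilities of all possible outcomes is 1, the following rule should
be quite intuitive:
Rule 3 (The Complement Rule): P(not A) = 1 – P(A).
For example, the probability that a randomly chosen person does not have blood type O is
P(not O) = 1 – P(O) = 1 – 0.44 = 0.56.
We could also find P(not O) directly by adding the probabilities of B, AB, and A:
0.10 + 0.04 + 0.42 = 0.56.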
Comment:
Note that the Complement Rule, P(not A) = 1 – P(A), can be re-formulated as P(A) = 1 – P(not A).
o This seemingly trivial algebraic manipulation has an important application, and actually
captures the strength of the complement rule.
o In some cases, when finding P(A) directly is very complicated, it might be much easier to
find P(not A) and then just subtract it from 1 to get the desired P(A).
o We will come back to this comment soon and provide additional examples.
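To illustrate the complement rule with the blood type probabilities quoted earlier (O 0.44, A 0.42, B 0.10, AB 0.04), the sketch below computes P(not O) both ways. The helper name p_not is an assumption made for this example.

```python
from fractions import Fraction

# Blood type probabilities from the text, as exact fractions.
p = {
    "O": Fraction(44, 100),
    "A": Fraction(42, 100),
    "B": Fraction(10, 100),
    "AB": Fraction(4, 100),
}


def p_not(event_prob):
    """Complement rule: P(not A) = 1 - P(A)."""
    return 1 - event_prob


via_complement = p_not(p["O"])        # one subtraction
direct = p["A"] + p["B"] + p["AB"]    # add every other outcome

print(float(via_complement), float(direct))  # 0.56 0.56
```

With only four outcomes the two routes are equally easy; the complement route pays off when the event itself covers many outcomes and its complement covers few.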
Comments:
The complement rule can be useful whenever it is easier to calculate the probability of the
complement of the event rather than the event itself.
Notice, we again used the phrase “at least one.”
Now we have seen that the complement of “at least one …” is “none … ” or “no ….” (as we
mentioned previously in terms of the events being “opposites”).
In the above activity we see that
o P(NONE of these two side effects) = 1 – P(at least one of these two side effects)
This is a common application of the complement rule which you can often recognize by the
phrase “at least one” in the problem.
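The "at least one" pattern can be checked by brute force on a small experiment. The sketch below is a hypothetical illustration, not the activity referred to above: it assumes three tosses of a fair coin, so that all eight outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# All outcomes of three coin tosses; assumed equally likely (fair coin).
outcomes = list(product("ht", repeat=3))


def prob(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event), len(outcomes))


at_least_one_head = [o for o in outcomes if "h" in o]
no_heads = [o for o in outcomes if "h" not in o]

print(prob(at_least_one_head))  # 7/8, counted directly
print(1 - prob(no_heads))       # 7/8, via the complement "none"
```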
Having said that, it should be noted that there are some cases where it is simply impossible for
the two events to both occur at the same time.
EXAMPLE:
Consider the following two events:
A — a randomly chosen person has blood type A
B — a randomly chosen person is a woman.
In this case, it is possible for events A and B to occur together.
Events A and B are NOT DISJOINT.
The Venn diagrams suggest that another way to think about disjoint versus not disjoint events
is that disjoint events do not overlap. They do not share any of the possible outcomes, and
therefore cannot happen together.
On the other hand, events that are not disjoint are overlapping in the sense that they share
some of the possible outcomes and therefore can occur at the same time.
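When events are written as sets of outcomes, "disjoint" is exactly the statement that the sets share no elements. A minimal sketch, assuming a six-sided die and three illustrative events (even, odd, high) that are not from the source:

```python
# Events for one roll of a six-sided die (illustrative choices).
even = {2, 4, 6}
odd = {1, 3, 5}
high = {5, 6}  # "rolling more than four"

# Disjoint events share no outcomes and cannot happen together.
print(even.isdisjoint(odd))   # True

# Events that are not disjoint overlap in at least one outcome.
print(even.isdisjoint(high))  # False
print(even & high)            # {6}
```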
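We now begin with a simple rule for finding P(A or B) for disjoint events.
Rule 4 (The Addition Rule for Disjoint Events): If A and B are disjoint events, then
P(A or B) = P(A) + P(B).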
Comment:
When dealing with probabilities, the word “or” will always be associated with the operation
of addition; hence the name of this rule, “The Addition Rule.”
What is the probability that a randomly chosen person is a potential donor for a person with
blood type A?
From the information given, we know that being a potential donor for a person with blood type
A means having blood type A or O.
We therefore need to find P(A or O). Since the events A and O are disjoint, we can use the
addition rule for disjoint events to get:
P(A or O) = P(A) + P(O) = 0.42 + 0.44 = 0.86.
It is easy to see why adding the probabilities actually makes sense.
If 42% of the population has blood type A and 44% of the population has blood type O,
then 42% + 44% = 86% of the population has either blood type A or O, and thus are
potential donors to a person with blood type A.
This reasoning about why the addition rule makes sense can be visualized using the pie chart
below:
Comment:
The Addition Rule for Disjoint Events can naturally be extended to more than two disjoint
events. Let’s take three, for example. If A, B and C are three disjoint events
then P(A or B or C) = P(A) + P(B) + P(C). The rule is the same for any number of disjoint events.
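The addition rule for disjoint events, for two or more events, can be sketched as a small helper that refuses to add probabilities when the events overlap. The function name p_or and the representation of events as sets of blood types are assumptions made for this example; the probabilities are those quoted in the text.

```python
from fractions import Fraction

# Blood type probabilities from the text, as exact fractions.
p = {
    "O": Fraction(44, 100),
    "A": Fraction(42, 100),
    "B": Fraction(10, 100),
    "AB": Fraction(4, 100),
}


def p_or(*events):
    """Addition rule for disjoint events; each event is a set of blood types."""
    seen = set()
    total = Fraction(0)
    for event in events:
        if not seen.isdisjoint(event):
            raise ValueError("events overlap; the disjoint addition rule does not apply")
        seen |= event
        total += sum(p[t] for t in event)
    return total


print(float(p_or({"A"}, {"O"})))          # 0.86, potential donors for type A
print(float(p_or({"A"}, {"B"}, {"AB"})))  # 0.56, three disjoint events
```

Raising an error on overlapping events is a design choice for this sketch: it makes explicit that this version of the rule is restricted to disjoint events.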
We are now finished with the first version of the Addition Rule (Rule 4), which is the version
restricted to disjoint events. Before covering the second version, we must first discuss P(A and
B).
We have seen this type of table before when we discussed analyzing data in case C → C. For the
purpose of this question, we will use this data as our “population” and consider randomly
selecting one person.