Statistics and Probability For CDW
A) Statistics
i) Introduction
a) What is statistics?
Statistics is a field of study concerned with collecting, organising, summarising,
analysing, presenting, and interpreting data, and with making decisions based on data. There are
two basic forms: descriptive statistics and inferential statistics.
Descriptive Statistics is primarily about summarizing a given data set through
numerical summaries and graphs and can be used for exploratory analysis to
visualize the information contained in the data and suggest hypotheses etc.
Inferential Statistics is concerned with methods for making conclusions about a
population using information from a sample and assessing the reliability of, and
uncertainty in, these conclusions. This allows us to make judgements in the
presence of uncertainty and variability, which is extremely important in
underpinning evidence-based decision-making in science, government, business
etc.
b) What is a Population?
A population is the collection of all individuals or items under consideration in a study,
that is, the entire set of possible observations in which we are interested. For example, consider
the following populations together with corresponding variables of interest:
All adults in Tanzania who are eligible to vote; the variable of interest is the
political party supported.
Car batteries of a particular type manufactured by a particular company; the
variable of interest is the lifetime of the battery before failure.
All adult males working full-time at Water Institute; the variable of interest is the
person’s gross income.
All possible outcomes of a planned laboratory experiment; the variable
of interest is the value of a particular measurement.
Gathering data on the whole population is not always possible due to barriers such as time,
accessibility, or cost. Instead, we often gather information from a smaller subset of the population,
known as a sample.
c) What is a Sample?
The sample is a subset of the population from which information is actually collected.
We then use the characteristics of the sample to estimate the characteristics of the
population. In order for this procedure to give a good estimate, the sample must be
representative of the population. Otherwise, if an unrepresentative or ‘biased’ sample is
used the conclusions will be systematically incorrect.
Sampling Techniques
There are two types of sampling technique: probability sampling and nonprobability
sampling.
Probability sampling is a technique in which every unit in the population has a chance
(greater than zero) of being selected in the sample, and this probability can be
accurately determined. The combination of these traits makes it possible to produce
unbiased estimates of population totals, by weighting sampled units according to their
probability of selection.
Nonprobability sampling is any sampling method where some elements of the
population have no chance of selection (these are sometimes referred to as 'out of
coverage'/'under covered'), or where the probability of selection can't be accurately
determined. It involves the selection of elements based on assumptions regarding the
population of interest, which forms the criteria for selection. Hence, because the
selection of elements is nonrandom, nonprobability sampling does not allow the
estimation of sampling errors. These conditions give rise to exclusion bias, placing
limits on how much information a sample can provide about the population.
Information about the relationship between the sample and the population is limited,
making it difficult to extrapolate from the sample to the population.
d) Variables
Variables are properties or characteristics of some event, object, or person that can take
on different values or amounts.
Variables may be:
Independent or Dependent: The experimenter manipulates the independent
variable and its effects on the dependent variable are measured.
Discrete or Continuous:
Discrete variables can take only certain values. For example, a household could
have three children or six children, but not 4.53 children.
Continuous variables can take any value within the range of the scale. For
example, “time to respond to a question” is a continuous variable since the
scale is continuous and not made up of discrete steps; the response time
could be, say, 1.64 seconds.
Qualitative or Quantitative:
Qualitative variables are those that express a qualitative attribute such as hair
colour, eye colour, religion, favourite movie, gender, and so on.
The values of a qualitative variable do not imply a numerical ordering.
Quantitative variables are those variables that are measured in terms of
numbers. Some examples of quantitative variables are height, weight, and shoe
size.
For example, the following plot counts page views over a period of six months. You can see
from this visualization that there was a small peak in June and July before returning to the
previous baseline.
Line Chart
The line chart is a simple, two-dimensional chart with an X-axis and a Y-axis, each point
representing a single value. A line joining the data points depicts a trend, usually over time.
The horizontal axis depicts a continuous progression, often that of time, while the vertical
axis reports values for a metric of interest across that progression.
In the experimental sciences, data collected from experiments are often visualized by a
graph. For example, if one collects data on the speed of an object at certain points in time,
one can visualize the data and represent them in a line chart as follows.
Multiple lines can also be plotted in a single-line chart to compare the trend between series.
A common use case for this is to observe the breakdown of the data across different
subgroups. The ability to plot multiple lines also provides the line chart a special use case
where it might not usually be selected. Normally, we would use a histogram to depict the
frequency distribution of a single numeric variable. However, since it’s tricky to plot two
histograms on the same set of axes, the line chart serves as a good mode of comparison as a
substitute. Line charts used to depict frequency distributions are often called frequency
polygons.
Pie Chart
The pie chart, also known as a circle chart, is a circular statistical graphic divided into
sectors to illustrate numerical proportions. Each sector denotes a proportionate
part of the whole. A pie chart works best when the aim is to show the composition of
something. In many such cases, a pie chart can be used in place of other graphs such as
bar graphs, line plots, and histograms.
Formula
The pie chart is an important type of data representation. It is divided into
sectors, each of which represents a specific portion (percentage) of the total.
The central angles of all the sectors sum to 360°, and the total value of the pie is always
100%.
To work out the angles and percentages for a pie chart, follow the steps given below:
Categorize the data
Calculate the total
Divide each category's value by the total
Finally, multiply by 360° to calculate the degrees
You may also convert the fractions into percentages for easy interpretation
Therefore, the pie chart formula is given as:
\[ \text{Central angle} = \frac{\text{Given data}}{\text{Total value of data}} \times 360^\circ \]
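As a rough illustration of the formula above, the following Python sketch converts category values into central angles and percentages. The crop names and values are hypothetical and chosen only for illustration; they are not taken from the example table in these notes.

```python
# Hypothetical category values (NOT from the notes' example table).
data = {"Maize": 40, "Rice": 30, "Beans": 20, "Cassava": 10}


def central_angles(values):
    """Return {category: central angle in degrees} = value / total * 360."""
    total = sum(values.values())
    return {name: value * 360 / total for name, value in values.items()}


def percentages(values):
    """Return {category: share of the total in percent}."""
    total = sum(values.values())
    return {name: value * 100 / total for name, value in values.items()}


angles = central_angles(data)
shares = percentages(data)
for name in data:
    print(f"{name}: {angles[name]:.1f} degrees, {shares[name]:.1f}%")
```

Whatever the data, the angles should add up to 360° and the percentages to 100%, which is a quick check on the arithmetic.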
Example
The percentages of various crops cultivated in a village of a particular district are given in the
following table.
Solution:
\[ \text{Central angle} = \frac{\text{Given data}}{\text{Total value of data}} \times 360^\circ \]
Steps to construct:
Step 3: Choose the largest central angle. Construct a sector of a central angle, whose one
radius coincides with the radius drawn in step 2, and the other radius is in the clockwise
direction to the vertical radius.
Step 4: Construct other sectors representing other values in the clockwise direction in
descending order of magnitudes of their central angles.
Step 5: Shade the sectors obtained with different colours and label them as shown in the figure
below.
transactions. Dividing this total by an attribute like user type, age bracket, or location might
provide insights as to where the business is most successful.
Class Size
In statistics, class size refers to the difference between a class’s upper and lower boundaries
in a frequency distribution.
Class size = Actual upper class boundary − Actual lower class boundary = Difference of class boundaries
= 20 − 10
= 10.
Class Interval
Class Interval: It is defined as the size of each class of numerical data in a large frequency
distribution following a specific width. For example, if the raw data has too many variations
in numbers, we make groups of intervals to organize the data such as 0-10, 10-20, 20-30, etc.
These are known as class intervals.
Upper boundary/Limit: It is the highest value of the class interval. There could be no item
greater than the upper boundary in that particular class. For example, the upper boundary of
30-40 is 40. It is known as the upper-class boundary.
Lower boundary/Limit: It is the lowest value of the class interval. No item could be less
than the lower boundary/limit in that class. For example, the lower boundary/limit of 30-40 is
30. It is known as the lower class boundary.
Class Boundaries
In statistics, class boundaries are endpoints used to separate the data into classes or groups.
The boundary with the lower value is called the lower boundary while the one with a higher
value is called the upper boundary. Class boundaries are typically applied to continuous
datasets.
Class Limit
Corresponding to a class interval, the class limits may be defined as the minimum value and
the maximum value the class interval may contain. The minimum value is known as the
lower-class limit (LCL) and the maximum value is known as the upper-class limit (UCL).
To understand class limits and class boundaries in statistics, let us consider the data recorded
in Table 1 and Table 2 below.
Table 1: Grouped Data Presented in Class Limit and Class Boundary
Table 2: Grouped Data Presented in Class Limit, Frequency and Class Boundary
Class Marks
The class mark in a frequency distribution is the midpoint or the middle value of a given
class. For example, the class mark of 10-20 is 15, as 15 is the mid-value that lies between 10
and 20. In statistics, the class mark is used at various places, for example, while calculating
mean, drawing line graphs, finding the average of each class in a frequency distribution, etc.
It is very easy to calculate class marks by using a formula that you will learn in the section
below.
The formula to calculate class mark in a frequency distribution is given as (upper limit +
lower limit)/2 or (Sum of class boundaries)/2. By using this class mark formula, you can
easily find the midpoint of any given class interval. Let us use the data presented in Table 2 to
determine the class marks.
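The class mark and class size formulas can be sketched in a few lines of Python. Since Table 2 is not reproduced here, the sketch uses the class intervals 0-10, 10-20 and 20-30 mentioned earlier as an assumed stand-in; the function names are illustrative only.

```python
# Class intervals from the 0-10, 10-20, 20-30 example (assumed stand-in for Table 2).
intervals = [(0, 10), (10, 20), (20, 30)]


def class_mark(lower, upper):
    """Midpoint of a class: (upper limit + lower limit) / 2."""
    return (lower + upper) / 2


def class_size(lower, upper):
    """Width of a class: upper boundary minus lower boundary."""
    return upper - lower


marks = [class_mark(lo, hi) for lo, hi in intervals]
sizes = [class_size(lo, hi) for lo, hi in intervals]
for (lo, hi), mark, size in zip(intervals, marks, sizes):
    print(f"{lo}-{hi}: class mark = {mark}, class size = {size}")
```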
Histogram
A histogram is a graph used to represent the frequency distribution of the data points of one
variable. Histograms classify data into various “bins” or “range groups” and count how
many data points belong to each of those bins. A histogram can also be defined as a set of
rectangles whose bases lie along the intervals between class boundaries. Each rectangle
represents one class, and all the rectangles are adjacent. The heights of the rectangles are
proportional to the frequencies of the corresponding classes.
It is the graphical representation of data where data is grouped into continuous number ranges
and each range corresponds to a vertical bar.
The horizontal axis displays the number range.
The vertical axis (frequency) represents the amount of data that is present in each
range.
The number ranges depend upon the data that is being used.
For example
Michael owns a garden with 30 mango trees. Each tree is of a different height. The height of
the trees (in feet): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5,
77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as follows in
a frequency distribution table by setting a range:
| Height range (feet) | Frequency |
| 60 - 65 | 3 |
| 66 - 70 | 3 |
| 71 - 75 | 8 |
| 76 - 80 | 10 |
| 81 - 85 | 5 |
| 86 - 90 | 1 |
This data can be now shown using a histogram. We need to make sure that while plotting a
histogram, there shouldn’t be any gaps between the bars.
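The grouping of the tree heights into classes can be reproduced with a short Python sketch. It assumes both class limits are inclusive, as in the table above; a height falling strictly between two classes (for example 65.5) would not be counted, although no such value occurs in this data. The text-based bars are only a stand-in for a real histogram plot.

```python
# Heights of the 30 mango trees (in feet), as listed in the example.
heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84,
           85, 87]

# Class limits from the frequency distribution table.
classes = [(60, 65), (66, 70), (71, 75), (76, 80), (81, 85), (86, 90)]


def frequency_table(values, class_limits):
    """Count how many values fall inside each (lower, upper) class, inclusive."""
    table = []
    for lower, upper in class_limits:
        count = sum(1 for v in values if lower <= v <= upper)
        table.append(((lower, upper), count))
    return table


table = frequency_table(heights, classes)
counts = [count for _, count in table]
for (lower, upper), count in table:
    # One '#' per tree gives a crude sideways histogram.
    print(f"{lower} - {upper}: {count:2d} {'#' * count}")
```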
Difference Between a Bar Chart and a Histogram
The fundamental difference between histograms and bar graphs from a visual aspect is that
bars in a bar graph are not adjacent to each other. A bar graph is the graphical representation
of categorical data using rectangular bars where the length of each bar is proportional to the
value they represent. A histogram is the graphical representation of data where data is
grouped into continuous number ranges and each range corresponds to a vertical bar.
B) Probability
i) Introduction
Every day, decisions are made that involve uncertainty about the
outcome. The ability to estimate and understand probability helps us
make good decisions. Examples of probability used in everyday life
include the probability that it will rain today and the probability of
winning the lottery. Many events cannot be predicted with total
certainty. Using probability, we can predict only the chance that an
event will occur, i.e., how likely it is to happen. Probability can range
from 0 to 1, where 0 means the event is impossible and 1
indicates a certain event. Probability is an important topic for
students, and this section explains its basic concepts.
a) Probability Terms
are possible (H, T). But when two coins are tossed then there will be
four possible outcomes, i.e. {(H, H), (H, T), (T, H), (T, T)}.
\[ P(A) = \frac{\text{number of outcomes in the event}}{\text{total number of outcomes in the sample space}} \]
Example
Suppose a coin is flipped two times.
Solution:
Previously, we found the sample space for this
experiment: S={HH , HT , TH , TT }
(a)The outcomes in the event “exactly one head” are HT and TH.
We see that there are 2 outcomes in the event out of the 4
possible outcomes in the sample space. So
\[ P(\text{exactly one head}) = \frac{2}{4} = 0.5 \]
(b)The outcomes in the event “at least one tail” are HT, TH,
and TT. We see that there are 3 outcomes in the event out of
the 4 possible outcomes in the sample space. So
\[ P(\text{at least one tail}) = \frac{3}{4} = 0.75 \]
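The counting argument in this example can be checked by listing the sample space and counting the outcomes in each event. The Python sketch below is one way to do this; the helper name `probability` is illustrative and assumes all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: HH, HT, TH, TT.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]


def probability(event, space):
    """P(event) = outcomes in the event / outcomes in the sample space."""
    favourable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favourable), len(space))


p_exactly_one_head = probability(lambda o: o.count("H") == 1, sample_space)
p_at_least_one_tail = probability(lambda o: "T" in o, sample_space)

print(sample_space)
print(p_exactly_one_head, float(p_exactly_one_head))
print(p_at_least_one_tail, float(p_at_least_one_tail))
```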
This important characteristic of probability experiments is known as
the law of large numbers which states that as the number of
repetitions of an experiment is increased, the relative frequency
obtained in the experiment tends to become closer and closer to the
theoretical probability. Even though the outcomes do not happen
according to any set pattern or order, overall, the long-term observed
relative frequency will approach the theoretical probability. (The word
empirical is often used instead of the word observed).
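A small simulation can make the law of large numbers concrete. The sketch below flips a simulated fair coin and reports the relative frequency of heads for increasing numbers of flips. The fixed seed and the chosen numbers of flips are arbitrary assumptions made so that the run is repeatable.

```python
import random


def relative_frequency_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin tosses and return the share that were heads."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability only
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


for n in (10, 100, 1_000, 10_000, 100_000):
    freq = relative_frequency_of_heads(n)
    print(f"{n:>7} flips: relative frequency of heads = {freq:.4f}")
```

For small numbers of flips the relative frequency can be far from 0.5, while for large numbers it should settle close to the theoretical probability.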
| Term | Definition | Example |
| Sample Space | The set of all the possible outcomes to occur in any trial | 1. Tossing a coin, Sample Space (S) = {H, T}. 2. Rolling a die, Sample Space (S) = {1, 2, 3, 4, 5, 6} |
| Experiment or Trial | A series of actions where the outcomes are always uncertain. | The tossing of a coin, selecting a card from a deck of cards, throwing a dice. |
| Impossible Event | The event cannot happen | In tossing a coin, it is impossible to get both head and tail at the same time |
A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8}
Notice that 4 and 5 are NOT listed twice.
For example, suppose we toss one fair, six-sided die. The sample
space S = {1, 2, 3, 4, 5, 6}. Let A = face is 2 or 3 and B = face is even (2, 4, 6).
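\[ P(A \mid B) = \frac{\dfrac{\text{number of outcomes that are 2 or 3 and even}}{\text{number of outcomes in the sample space}}}{\dfrac{\text{number of outcomes that are even (in } B\text{)}}{\text{number of outcomes in the sample space}}} = \frac{1/6}{3/6} = \frac{1}{3} \]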
Similarly:
\[ P(A \mid B) = \frac{\text{number of outcomes that are 2 or 3 and even}}{\text{number of outcomes in } B} = \frac{1}{3} \]
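Both routes to the conditional probability can be verified with Python sets. This is a minimal sketch for the die example above, assuming equally likely faces; the names `prob`, `p_ratio` and `p_count` are illustrative.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one fair die
A = {2, 3}               # face is 2 or 3
B = {2, 4, 6}            # face is even


def prob(event, space=S):
    """Classical probability: outcomes in the event / outcomes in the space."""
    return Fraction(len(event & space), len(space))


# Route 1: ratio of probabilities, P(A and B) / P(B).
p_ratio = prob(A & B) / prob(B)

# Route 2: restrict the sample space to B and count directly.
p_count = Fraction(len(A & B), len(B))

print(p_ratio, p_count)
```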
Odds
The odds of an event present the probability as a ratio of success to
failure. This is common in various gambling formats. Mathematically,
the odds of an event can be defined as:
\[ \text{Odds}(A) = \frac{P(A)}{1 - P(A)} \]
where P(A) is the probability of success and, of course, 1 − P(A) is the
probability of failure.
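The conversion from probability to odds can be sketched as a one-line function. The probabilities used below (3/4 and 1/2) are hypothetical inputs chosen for illustration, and the function name `odds` is not from the notes.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0 <= p < 1:
        # p = 1 would mean dividing by zero (a certain event has no finite odds).
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


odds_three_quarters = odds(Fraction(3, 4))  # hypothetical P(A) = 3/4
odds_one_half = odds(Fraction(1, 2))        # hypothetical P(A) = 1/2

print(odds_three_quarters)  # odds of 3 to 1 in favour
print(odds_one_half)        # even odds, 1 to 1
```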
b) Probability Axioms
Example
In a presidential election, there are four candidates. Call them A, B, C,
and D. Based on our polling analysis, we estimate that A has
a 20 percent chance of winning the election, while B has a 40 percent
chance of winning. What is the probability that A or B win the election?
Solution
Notice that the events that {A wins}, {B wins}, {C wins}, and {D wins} are
disjoint since more than one of them cannot occur at the same time. For
example, if A wins, then B cannot win. From the third axiom of
probability, the probability of the union of two disjoint events is the
sum of the individual probabilities. Therefore,
P(A wins or B wins) = P(A wins) + P(B wins) = 0.2 + 0.4 = 0.6.
c) Probability Theorems
If A and B are any two events that are mutually inclusive (joint
events, i.e., not mutually exclusive), then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
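The addition theorem can be checked numerically on the earlier die example, where A = face is 2 or 3 and B = face is even. The sketch below assumes a fair six-sided die and compares the two sides of the formula; the variable names are illustrative.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair six-sided die
A = {2, 3}               # face is 2 or 3
B = {2, 4, 6}            # face is even


def prob(event):
    """Classical probability on the sample space S."""
    return Fraction(len(event & S), len(S))


# Left-hand side: probability of the union, counted directly.
lhs = prob(A | B)

# Right-hand side: P(A) + P(B) - P(A and B); the overlap {2} is subtracted once.
rhs = prob(A) + prob(B) - prob(A & B)

print(lhs, rhs)
```

Without the subtraction, the outcome 2, which belongs to both events, would be counted twice.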
Example 1
Solution
= 0.71
c) P(A ∩ B) = P(A) P(B).
Example 2
Solutions:
Given P(A) = 0.2 and P(B) = 0.8 and events A and B are
independent of each other.
P(A ∩ B) = P(A) P(B) = 0.2 × 0.8 = 0.16.
Example 3
Solution
= P(A ∩ B') + P(A' ∩ B)
= {P(A) − P(A ∩ B)} + {P(B) − P(A ∩ B)}
= {0.12 − 0.07} + {0.25 − 0.07}
= 0.23.
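The arithmetic in this solution can be reproduced with exact fractions. The sketch below takes P(A) = 0.12, P(B) = 0.25 and P(A ∩ B) = 0.07 from the working above; the reading of the result as the probability that exactly one of A and B occurs is inferred from the first line of the working, and the variable names are illustrative.

```python
from fractions import Fraction

# Values taken from the working above.
p_a = Fraction(12, 100)
p_b = Fraction(25, 100)
p_a_and_b = Fraction(7, 100)

# P(A and not B) and P(not A and B), each obtained by removing the overlap.
p_a_only = p_a - p_a_and_b
p_b_only = p_b - p_a_and_b

# Probability that exactly one of A and B occurs.
p_exactly_one = p_a_only + p_b_only

print(p_a_only, p_b_only, float(p_exactly_one))
```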
Example 4
Solution
= 1 − 5/16
= 11/16
Also, the event of not getting a blue ball is the same as getting a
red or green ball.
\[ P(B') = P(R) + P(G) = \frac{3}{16} + \frac{8}{16} = \frac{11}{16}. \]
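Both routes to P(B') can be checked with exact fractions. The sketch below assumes a bag of 16 balls with 5 blue, 3 red and 8 green; these counts are inferred from the fractions 5/16, 3/16 and 8/16 in the solution, since the problem statement itself is not reproduced here.

```python
from fractions import Fraction

# Assumed ball counts, inferred from the fractions in the solution.
counts = {"blue": 5, "red": 3, "green": 8}
total = sum(counts.values())

p_blue = Fraction(counts["blue"], total)

# Route 1: complement rule, P(not blue) = 1 - P(blue).
p_not_blue_complement = 1 - p_blue

# Route 2: add the probabilities of the other (mutually exclusive) colours.
p_not_blue_sum = Fraction(counts["red"], total) + Fraction(counts["green"], total)

print(p_not_blue_complement, p_not_blue_sum)
```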
d) Probability Distributions
I. Random Variables
random variable is conventionally denoted by a capital letter such as X, Y, M,
etc. Lowercase letters such as x, y, z, m, etc. represent the values
of the random variable.
Here, we see that the number of heads obtained when a coin is tossed
20 times can be anything from zero to twenty. If we denote the
number of heads by X, then X takes values in {0, 1, 2, …, 20}. The probability of
getting a head on each toss is always ½.
Therefore, X is a random variable.
As the name suggests, this variable is not connected or
continuous. It is a variable which can only assume a countable
number of real values, i.e., the values of the discrete random
variable are discrete in nature. The value of the random variable
depends on chance. In other words, a real-valued function
defined on a discrete sample space is a discrete random
variable.
| Value of X, x | 0 | 1 | 2 |
| P(X = x) = p(x) | 1/4 | 2/4 | 1/4 |
| F(x) = P(X ≤ x) = Σ_i p(x_i) | 1/4 | 3/4 | 4/4 = 1 |
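The cumulative row of the table is a running total of the probability row. A minimal Python sketch of that calculation, using the probabilities from the table, is given below; the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

# Probability function p(x) from the table.
pmf = {0: Fraction(1, 4), 1: Fraction(2, 4), 2: Fraction(1, 4)}

# F(x) = P(X <= x): add up p(x_i) for all x_i up to and including x.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

for x in values:
    print(f"x = {x}: p(x) = {pmf[x]}, F(x) = {cdf[x]}")
```

The last cumulative value should always be 1, since the probabilities of all possible values add up to 1.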
c) Probability Distribution of a Continuous Random
Variable
Three fair coins are tossed. Let X = the number of heads, Y = the
number of head runs. (A ‘head run’ is a consecutive occurrence of
at least two heads.) Find the probability function of X and Y.
Solution:
TTT 0 0
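\[
\begin{aligned}
P(\text{no head}) &= p(0) = \frac{1}{8} \\
P(\text{one head}) &= p(1) = \frac{3}{8} \\
P(\text{two heads}) &= p(2) = \frac{3}{8} \\
P(\text{three heads}) &= p(3) = \frac{1}{8}
\end{aligned}
\]
| Value of X, x | 0 | 1 | 2 | 3 |
| p(x) | 1/8 | 3/8 | 3/8 | 1/8 |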
\[ P(Y = 0) = p(0) = \frac{5}{8}, \quad \text{and} \quad P(Y = 1) = p(1) = \frac{3}{8}. \]
| Value of Y, y | 0 | 1 |
| p(y) | 5/8 | 3/8 |
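Both probability functions can be obtained by listing the eight equally likely outcomes and counting. The Python sketch below does this; it interprets a head run as a maximal block of at least two consecutive heads, and the helper names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def count_head_runs(outcome):
    """Count maximal runs of at least two consecutive heads in a string like 'HHT'."""
    runs = 0
    run_length = 0
    for toss in outcome + "T":      # a trailing tail closes any final run
        if toss == "H":
            run_length += 1
        else:
            if run_length >= 2:
                runs += 1
            run_length = 0
    return runs


# All 8 outcomes of three fair coin tosses.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]


def pmf(statistic):
    """Probability function of statistic(outcome) over equally likely outcomes."""
    counts = Counter(statistic(o) for o in outcomes)
    return {value: Fraction(n, len(outcomes)) for value, n in sorted(counts.items())}


pmf_x = pmf(lambda o: o.count("H"))   # X = number of heads
pmf_y = pmf(count_head_runs)          # Y = number of head runs

print(pmf_x)
print(pmf_y)
```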
\[ E(X) = \mu = \sum_{i=1}^{n} x_i p_i \]
\[ E(X) = x_1 p_1 + x_2 p_2 + x_3 p_3 + \dots + x_n p_n. \]
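As a worked illustration of the expectation formula, the sketch below applies it to X = the number of heads in three fair coin tosses, whose probabilities are 1/8, 3/8, 3/8 and 1/8. The function name `expectation` is illustrative.

```python
from fractions import Fraction

# Probability function of X = number of heads in three fair coin tosses.
pmf_x = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def expectation(pmf):
    """E(X) = sum of x_i * p_i over all possible values x_i."""
    return sum(x * p for x, p in pmf.items())


mean_x = expectation(pmf_x)
print(mean_x, float(mean_x))
```

The result, 3/2, is not itself a possible value of X; the expectation is a long-run average rather than a value the variable must take.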
(ii) Variance of Random Variables
Suppose you calculated the mean or the average marks in five
mathematics tests. You can easily see how far the marks in
each of the tests differ from this average. These differences
show the variability of the possible values of the random variable,
the random variable being the marks scored in a test.
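\[ \sigma_x^2 = \operatorname{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i) = E\left[(X - \mu)^2\right], \]
or
\[ \operatorname{Var}(X) = E(X^2) - [E(X)]^2, \]
where
\[ E(X^2) = \sum_i x_i^2 \, p(x_i), \quad \text{and} \quad [E(X)]^2 = \Big[\sum_i x_i \, p(x_i)\Big]^2 = \mu^2. \]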
If the value of the variance is small, then the values of the random
variable are close to the mean.
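The two variance formulas can be compared on a small example. The sketch below uses X = the number of heads in three fair coin tosses (probabilities 1/8, 3/8, 3/8, 1/8) and computes the variance both from the definition and from E(X²) − [E(X)]²; the variable names are illustrative.

```python
from fractions import Fraction

# Probability function of X = number of heads in three fair coin tosses.
pmf_x = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

# Mean: mu = sum of x_i * p(x_i).
mu = sum(x * p for x, p in pmf_x.items())

# Definition: Var(X) = sum of (x_i - mu)^2 * p(x_i).
var_definition = sum((x - mu) ** 2 * p for x, p in pmf_x.items())

# Shortcut: Var(X) = E(X^2) - [E(X)]^2.
e_x_squared = sum(x ** 2 * p for x, p in pmf_x.items())
var_shortcut = e_x_squared - mu ** 2

print(mu, e_x_squared, var_definition, var_shortcut)
```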
For any pairwise independent random variables X1, X2, …, Xn and
for any constants a1, a2, …, an,
\[ V(a_1 X_1 + a_2 X_2 + \dots + a_n X_n) = a_1^2 V(X_1) + a_2^2 V(X_2) + \dots + a_n^2 V(X_n). \]
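This rule can be checked on a small case by building the distribution of the combination directly. In the sketch below, X1 (a fair coin indicator), X2 (a fair die) and the constants a1 = 2, a2 = 3 are hypothetical choices made only for illustration; independence is used when the joint probability is taken as the product p1 × p2.

```python
from fractions import Fraction
from itertools import product


def variance(pmf):
    """Var(X) = sum of (x - mu)^2 * p(x), with mu = sum of x * p(x)."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# Hypothetical independent variables: X1 is 1 for heads, 0 for tails; X2 is a die face.
pmf_x1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
pmf_x2 = {face: Fraction(1, 6) for face in range(1, 7)}
a1, a2 = 2, 3  # hypothetical constants

# Distribution of a1*X1 + a2*X2; independence gives joint probability p1 * p2.
pmf_combo = {}
for (x1, p1), (x2, p2) in product(pmf_x1.items(), pmf_x2.items()):
    value = a1 * x1 + a2 * x2
    pmf_combo[value] = pmf_combo.get(value, 0) + p1 * p2

lhs = variance(pmf_combo)
rhs = a1 ** 2 * variance(pmf_x1) + a2 ** 2 * variance(pmf_x2)

print(lhs, rhs)
```

Note that the constants enter squared, so a negative constant still adds to the variance rather than reducing it.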