Chapter 2 Principles of Statistics
University of Technology
Faculty of Geology & Petroleum Engineering
Department of Drilling - Production Engineering
Course
Geostatistics in Petroleum Engineering
Trần Nguyễn Thiện Tâm
Chapter 2
Principles of Statistics
9/10/2024 Geostatistics 2
References
Mohan Kelkar, Godofredo Perez. Applied Geostatistics for Reservoir Characterization.
Society of Petroleum Engineers, Texas, 2002.
Contents
❑ Introduction
❑ Descriptive statistics
❑ Inferential statistics
Introduction
Descriptive Statistics
• Organization, presentation, and summarization of data.
• Better understanding of the type of information currently available that allows us to
use it more productively.
Inferential statistics
• Deriving conclusions about a population on the basis of sample data
• Estimating values at unsampled locations
Descriptive Statistics
▪ Frequency Distribution
▪ Univariate Statistics
▪ Bivariate Statistics
Frequency Distribution
One of the simplest ways to analyze sample data
Summarizes the data in a more compact form than original sample observations
The range of the data is divided into intervals called class intervals
The number of measurements falling within a particular class, i, is called a class
frequency, fi
Relative frequency, f_Ri:
$$f_{Ri} = \frac{f_i}{n}$$
where n = total number of samples. The relative frequencies sum to one:
$$\sum_{i=1}^{N} f_{Ri} = 1$$
where N = total number of classes.
Cumulative relative class frequency:
$$F_j = \sum_{i=1}^{j} f_{Ri}$$
Example 2.1
The following porosity samples are measured in a wellbore: 0.141, 0.124, 0.152, 0.156,
0.113, 0.167, 0.194, 0.142, 0.133, 0.149, 0.106, 0.137, 0.147, 0.159, 0.174, 0.129, 0.153,
0.173, 0.189, 0.16, 0.193, 0.156, 0.149, 0.135, 0.145, 0.171, 0.101, 0.151, 0.176, 0.191,
0.121, 0.148, 0.153, 0.171, 0.183, 0.108, 0.123, 0.169, 0.185, 0.153, 0.117, 0.127, 0.145,
0.141, 0.165, 0.14, 0.143, 0.178, 0.179, 0.157. Analyze these porosity, ϕ, values using a
frequency-distribution analysis.
Example 2.1 - Solution
For these 50 values, we divide the data into five classes
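The frequency table for this example is not reproduced in the text. As a minimal sketch, the class frequencies, relative frequencies, and cumulative relative frequencies could be computed as follows. The class limits (five classes of width 0.02 from 0.10 to 0.20, lower limit inclusive) are an assumption made for illustration and may differ from the original table.

```python
# Frequency-distribution sketch for the porosity data of Example 2.1.
# Assumption: five classes of width 0.02 spanning 0.10 to 0.20, each class
# including its lower limit and excluding its upper limit.
porosity = [
    0.141, 0.124, 0.152, 0.156, 0.113, 0.167, 0.194, 0.142, 0.133, 0.149,
    0.106, 0.137, 0.147, 0.159, 0.174, 0.129, 0.153, 0.173, 0.189, 0.16,
    0.193, 0.156, 0.149, 0.135, 0.145, 0.171, 0.101, 0.151, 0.176, 0.191,
    0.121, 0.148, 0.153, 0.171, 0.183, 0.108, 0.123, 0.169, 0.185, 0.153,
    0.117, 0.127, 0.145, 0.141, 0.165, 0.14, 0.143, 0.178, 0.179, 0.157,
]

n = len(porosity)
num_classes = 5
lower_limit = 100   # 0.100 expressed in thousandths
class_width = 20    # 0.020 expressed in thousandths

# Work in integer thousandths so values on a class limit are binned exactly.
freq = [0] * num_classes
for value in porosity:
    index = (round(value * 1000) - lower_limit) // class_width
    freq[index] += 1

rel_freq = [f / n for f in freq]      # f_Ri = f_i / n

cum_freq = []                         # F_j = sum of f_Ri for i = 1..j
running = 0.0
for fr in rel_freq:
    running += fr
    cum_freq.append(running)

for i in range(num_classes):
    lo = (lower_limit + i * class_width) / 1000
    hi = lo + class_width / 1000
    print(f"{lo:.2f}-{hi:.2f}  f={freq[i]:2d}  fR={rel_freq[i]:.2f}  F={cum_freq[i]:.2f}")
```

Working in integer thousandths is a design choice that keeps values such as 0.14 and 0.16 from being misplaced by floating-point rounding.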
Univariate Statistics
• Mean
• Median
• Mode
• Percentiles
• Variance
• Coefficient of Variation
• Range
Mean
Arithmetic mean: represents the central tendency of the sample
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where
n = total number of samples
xi = the value of sample i
Median
Another measure of central tendency: the sample point that divides the sample into two equal halves.
Arrange all the samples in ascending order so that x1 ≤ x2 ≤ … ≤ xn.
When n is odd, the sample median, $\tilde{x}$, is calculated by
$$\tilde{x} = x_{(n+1)/2}$$
When n is even, the sample median, $\tilde{x}$, is calculated by
$$\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$$
Mode
Another measure of central tendency: the observation that occurs most frequently in the sample.
The value of the mode obviously depends on the precision of the data, especially for
naturally occurring variables.
If the data are very precise, each value is unique and none is repeated.
Mean, median, and mode coincide with each other if the distribution is symmetric.
If the distribution is skewed, these three tendencies exhibit different values.
If the distribution is skewed positively (to the right), mode < median < mean.
If the distribution is skewed negatively (to the left), mode > median > mean.
Percentiles
Percentile values represent sample values that are greater than a certain percentage
of the sample values.
The median is an example of the 50th percentile value because 50% of the values are
smaller than the median.
If the values are arranged in ascending order, xp represents a value where p percent
values are smaller than xp. For example, x10 represents a sample value that is greater
than 10% of the total sample points.
Certain types of percentiles are commonly used in describing sample data. For example,
the first quartile represents x25, where 25% of the sample values are less than x25. x75
represents a value where 25% of the sample values are greater than x75.
Variance
The sample variance represents the spread of the data. It is a quantitative measure of how widely the data are distributed.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
where
s2 = sample variance
𝑥ҧ = sample mean
n = total number of samples
Variance can also be calculated as
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$
The square root of variance, s, is called the standard deviation. It has the same units as
the variable being sampled.
Coefficient of Variation
The coefficient of variation, Cv, is defined as
$$C_v = \frac{s}{\bar{x}}$$
where
s = standard deviation and
𝑥ҧ = sample mean
Because s and 𝑥ҧ have the same units, Cv is a dimensionless quantity; therefore, it
provides a measure of the relative spread of a sample.
Range
Range is another quantitative measure of the spread. A simple definition of range, R, is
𝑅 = 𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛
where
xmax = the maximum value and
xmin = the minimum value
Other definitions of range have also been used. For example, interquartile range
represents the difference between two successive quartile values. We can define the
first quartile range as
R1 = x25 - xmin
where
x25 = the 25th percentile value and
xmin = the minimum value.
Similar definitions can be used for other quartile ranges.
Example 2.2
The following data for pay-zone thickness (in feet) are collected from all available wells
in a reservoir: 6, 10, 20, 12, 20, 10, 15, 32, 27, 10, 18, 29, 8, 17, 23, 36, 19, 13, 33, 10, 26.
Calculate mean, median, mode, quartile values, variance, coefficient of variation, and
range.
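A short sketch of how these statistics could be computed with Python's standard `statistics` module is given below. The quartile values depend on the percentile convention; the module's default ("exclusive") interpolation is used here as an assumption and may not match a hand calculation exactly.

```python
import statistics

# Pay-zone thickness data of Example 2.2, in feet.
thickness = [6, 10, 20, 12, 20, 10, 15, 32, 27, 10, 18, 29, 8, 17,
             23, 36, 19, 13, 33, 10, 26]

mean = statistics.mean(thickness)
median = statistics.median(thickness)
mode = statistics.mode(thickness)

# Assumption: default "exclusive" quantile method; other conventions
# can give slightly different x25 and x75.
x25, x50, x75 = statistics.quantiles(thickness, n=4)

variance = statistics.variance(thickness)   # sample variance, (n - 1) denominator
std_dev = statistics.stdev(thickness)
coeff_var = std_dev / mean                  # dimensionless
data_range = max(thickness) - min(thickness)

print(f"mean={mean:.2f} ft, median={median} ft, mode={mode} ft")
print(f"x25={x25} ft, x50={x50} ft, x75={x75} ft")
print(f"variance={variance:.2f} ft^2, Cv={coeff_var:.3f}, range={data_range} ft")
```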
Bivariate Statistics
Covariance is a measure of the relationship between two variables
$$c(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)$$
where
xi and yi = samples of the variables x and y, respectively, and
n = total number of sample pairs
Note that covariance reduces to variance if x = y.
If x and y are positively related (i.e., as x increases, y increases), the covariance has a
positive value.
If x and y are negatively related (i.e., as x increases, y decreases), covariance has a
negative value.
If x and y are not related, the covariance has a value close to zero.
Correlation coefficient
$$r(x, y) = \frac{c(x, y)}{s_x s_y}$$
where
r(x,y) = correlation coefficient,
c(x,y) = covariance between x and y,
sx = standard deviation of the x variable, and
sy = standard deviation of the y variable.
The value of the correlation coefficient always falls between the limits of +1 and -1.
If x and y are positively related, the correlation coefficient falls between 0 and +1. The
stronger the relationship, the closer the value will be to +1.
If x and y are negatively related, the correlation coefficient falls between 0 and -1. The
stronger the relationship, the closer the value will be to -1.
If x and y are not related, the correlation coefficient is close to zero.
In some instances, the square of the correlation coefficient, r2(x, y), is used instead of the
correlation coefficient to describe the relationship between the two variables.
One advantage of using this value (sometimes called the r2 statistic) is that it always falls
between zero and one, whether x and y are positively or negatively related.
This is the term most commonly used in describing the “goodness of fit” in a linear-
regression analysis between two variables.
In addition to the correlation coefficient, the rank correlation coefficient is another measure that indicates the relationship between two variables.
$$r(R_x, R_y) = \frac{c(R_x, R_y)}{s_{R_x} s_{R_y}}$$
where
r(Rx, Ry) = rank correlation coefficient,
c(Rx, Ry) = covariance between the rank values of the two variables, and
𝑠𝑅𝑥 and 𝑠𝑅𝑦 = standard deviations for the rank values for the two variables.
When each variable has the same number of data values, 𝑠𝑅𝑥 = 𝑠𝑅𝑦 .
Example 2.3
Table 2.5 provides core permeability vs. core porosity data from a well. Calculate the
covariance, the correlation coefficient and the rank correlation coefficient between log k
and ϕ data.
Example 2.3 - Solution
To calculate the correlation coefficient, we must calculate the standard deviation for both log k and ϕ. Use Eq. 2.7 with n instead of (n - 1) as the denominator.
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$
To estimate the rank correlation coefficient, all data values are first sorted in ascending
order. Then, each value is assigned a rank, depending on where it falls. The smallest
value receives the lowest rank, and the largest value has a rank of n, where n = total
number of samples.
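The covariance, correlation coefficient, and rank correlation coefficient can be sketched in a few lines. Table 2.5 is not reproduced here, so the data pairs below are hypothetical values chosen only for illustration, and the ranking helper assumes there are no tied values.

```python
# Hypothetical (log k, porosity) pairs for illustration only;
# these are NOT the values of Table 2.5.
log_k = [0.5, 1.1, 1.4, 2.0, 2.3, 2.9]
phi = [0.08, 0.11, 0.10, 0.15, 0.17, 0.19]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """c(x, y) = mean(x*y) - mean(x)*mean(y), with n in the denominator."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) / n - mean(x) * mean(y)


def std_dev(x):
    """Standard deviation with n in the denominator (covariance of x with itself)."""
    return covariance(x, x) ** 0.5


def correlation(x, y):
    return covariance(x, y) / (std_dev(x) * std_dev(y))


def ranks(x):
    """Rank 1 for the smallest value up to n for the largest (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


c_xy = covariance(log_k, phi)
r_xy = correlation(log_k, phi)
r_rank = correlation(ranks(log_k), ranks(phi))
print(f"c = {c_xy:.4f}, r = {r_xy:.4f}, rank r = {r_rank:.4f}")
```

Computing the rank correlation as the ordinary correlation of the rank values mirrors the definition given earlier in this section.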
Linear Regression
A linear relationship is useful in predicting a value of one variable when the value of
the other variable is known.
The simplest type of this relationship is
y = mx + b
where
y = the variable to be estimated;
x = the known variable,
m = the slope of the straight line, and
b = an intercept on the y axis.
To estimate the values of m and b, we first use the available sample pairs of x and y and obtain the “best” fit between the two variables.
We can show that the best fit can be obtained by defining the values of m and b as
$$m = \frac{c(x, y)}{s_x^2}$$
$$b = \bar{y} - m\bar{x}$$
where
c(x,y) = covariance between x and y,
$s_x^2$ = variance of x, and
$\bar{y}$ and $\bar{x}$ = arithmetic means of the y and x variables, respectively.
Example 2.4
Table 2.5 provides core permeability vs. core porosity data from a well. Find the best-fit
line between log k and ϕ values.
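A minimal sketch of the best-fit calculation, using m = c(x, y)/s_x² and b = ȳ − m x̄, is shown below. Table 2.5 is not reproduced here, so the data are hypothetical, and treating ϕ as the known variable x and log k as the estimated variable y is an assumption.

```python
# Hypothetical (porosity, log k) pairs for illustration only;
# these are NOT the values of Table 2.5.
phi = [0.08, 0.10, 0.11, 0.15, 0.17, 0.19]
log_k = [0.5, 1.4, 1.1, 2.0, 2.3, 2.9]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """c(x, y) = mean(x*y) - mean(x)*mean(y), with n in the denominator."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) / n - mean(x) * mean(y)


def fit_line(x, y):
    """Return slope m and intercept b of the best-fit line y = m*x + b."""
    m = covariance(x, y) / covariance(x, x)   # c(x, y) / s_x^2
    b = mean(y) - m * mean(x)
    return m, b


m, b = fit_line(phi, log_k)
print(f"log k = {m:.2f} * phi + {b:.2f}")
```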
Bivariate Relationships for Spatial Data
Covariance can be used as a statistical tool to quantify the relationship between two
variables.
An important distinction when establishing a bivariate relationship for spatial data is
that the same variable is examined but at different spatial locations. It also is possible
to develop a relationship between two different variables at different locations.
Example 2.5
Table 2.9 shows porosity data collected from a vertical well at uniform intervals of 1 ft.
Establish a relationship between porosity values at different locations as functions of
distance between those values.
Example 2.5 - Solution
Recall that the covariance relationship states that
$$c(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)$$
We use the same relationship, except that the x and y variables are the same variable at different locations. For example, if we denote x(u) as the value of x at Location u and x(u + L) as the value of x at Location u + L, we can rewrite this equation as
$$c[x(u), x(u+L)] = \frac{1}{n}\sum_{i=1}^{n} x(u_i)\,x(u_i+L) - \left[\frac{1}{n}\sum_{i=1}^{n} x(u_i)\right]\left[\frac{1}{n}\sum_{i=1}^{n} x(u_i+L)\right]$$
where
L = distance between the two variables, also called the lag distance
n = the number of pairs located a distance L apart
Fig. shows that, for a lag distance of 1 ft, we can gather six pairs; for a lag distance of 2 ft,
we can gather five pairs; and so forth.
We can calculate the covariance for a lag distance of 1 ft with n = 6 pairs:
ϕ(u) ϕ(u + 1)
8.25 9.00
9.00 6.25
6.25 5.00
5.00 5.30
5.30 4.75
4.75 5.00
$$c[\phi(u), \phi(u+1)] = \frac{1}{6}(8.25 \times 9.00 + 9.00 \times 6.25 + 6.25 \times 5.00 + 5.00 \times 5.30 + 5.30 \times 4.75 + 4.75 \times 5.00) - \left[\frac{1}{6}(8.25 + 9.00 + 6.25 + 5.00 + 5.30 + 4.75)\right]\left[\frac{1}{6}(9.00 + 6.25 + 5.00 + 5.30 + 4.75 + 5.00)\right] = 1.73$$
For covariance, we can simply write the left side of the equation as c(1) because it reflects the covariance for a lag distance of 1 ft.
ϕ(u) ϕ(u + 2)
8.25 6.25
9.00 5.00
6.25 5.30
5.00 4.75
5.30 5.00
Covariance for a lag distance of 2 ft is calculated in the same way. There are five pairs at that lag distance.
$$c(2) = \frac{1}{5}(8.25 \times 6.25 + 9.00 \times 5.00 + 6.25 \times 5.30 + 5.00 \times 4.75 + 5.30 \times 5.00) - \left[\frac{1}{5}(8.25 + 9.00 + 6.25 + 5.00 + 5.30)\right]\left[\frac{1}{5}(6.25 + 5.00 + 5.30 + 4.75 + 5.00)\right] = 0.43$$
ϕ(u) ϕ(u + 3)
8.25 5.00
9.00 5.30
6.25 4.75
5.00 5.00
There are four pairs for a lag distance of 3 ft, and the same equation is used to calculate
c(3) = 0.195.
The values of correlation coefficient at various lag distances can be calculated similarly.
$$r(x, y) = \frac{c(x, y)}{s_x s_y}$$
For spatial data sets,
$$r[x(u), x(u+L)] = \frac{c[x(u), x(u+L)]}{s_{x(u)}\, s_{x(u+L)}}$$
As in the case of covariance, the correlation coefficient can be written simply as a function of the lag distance.
$$r(L) = \frac{c(L)}{s_{x(u)}\, s_{x(u+L)}}$$
For a lag distance of 1 ft, we can calculate sx(u) by calculating the variance of all data
points used as a first data point in a given pair.
The mean is
$$\bar{x}(u) = \frac{1}{6}(8.25 + 9.00 + 6.25 + 5.00 + 5.30 + 4.75) = 6.425$$
The variance is
$$s_{x(u)}^2 = \frac{\sum_{i=1}^{6} x(u_i)^2 - 6\,\bar{x}(u)^2}{6} = \frac{8.25^2 + 9.00^2 + 6.25^2 + 5.00^2 + 5.30^2 + 4.75^2 - 6 \times 6.425^2}{6} = 2.6823$$
Therefore, $s_{x(u)}$ = 1.638
Similarly, for the second data point in each pair, the mean is
$$\bar{x}(u+1) = \frac{1}{6}(9.00 + 6.25 + 5.00 + 5.30 + 4.75 + 5.00) = 5.883$$
The variance is
$$s_{x(u+1)}^2 = \frac{\sum_{i=1}^{6} x(u_i+1)^2 - 6\,\bar{x}(u+1)^2}{6} = \frac{9.00^2 + 6.25^2 + 5.00^2 + 5.30^2 + 4.75^2 + 5.00^2 - 6 \times 5.883^2}{6} = 2.1722$$
Therefore, $s_{x(u+1)}$ = 1.474
The correlation coefficient for a lag distance of 1 ft is then
$$r(1) = \frac{c(1)}{s_{x(u)}\, s_{x(u+1)}} = \frac{1.73}{1.638 \times 1.474} = 0.7165$$
Similarly, for lag distances of 2 and 3 ft, respectively,
$$r(2) = \frac{c(2)}{s_{x(u)}\, s_{x(u+2)}} = \frac{0.43}{1.595 \times 0.525} = 0.5135$$
and
$$r(3) = \frac{c(3)}{s_{x(u)}\, s_{x(u+3)}} = \frac{0.195}{1.586 \times 0.195} = 0.631$$
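The lag calculations above can be checked with a short script. This is a sketch only: the seven porosity values are read off the pair tables in this example, the samples are assumed to be 1 ft apart, and the function names are illustrative.

```python
# Porosity values at 1-ft spacing, read from the pair tables of Example 2.5.
phi = [8.25, 9.00, 6.25, 5.00, 5.30, 4.75, 5.00]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """c(x, y) = mean(x*y) - mean(x)*mean(y), with n in the denominator."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) / n - mean(x) * mean(y)


def lag_pairs(values, lag):
    """Return the lists x(u) and x(u + lag) for an integer lag in samples."""
    if lag == 0:
        return values[:], values[:]
    return values[:-lag], values[lag:]


def lag_covariance(values, lag):
    head, tail = lag_pairs(values, lag)
    return covariance(head, tail)


def lag_correlation(values, lag):
    head, tail = lag_pairs(values, lag)
    s_head = covariance(head, head) ** 0.5   # s_x(u)
    s_tail = covariance(tail, tail) ** 0.5   # s_x(u+L)
    return covariance(head, tail) / (s_head * s_tail)


for lag in range(4):
    n_pairs = len(phi) - lag
    print(f"L={lag} ft  n={n_pairs}  c(L)={lag_covariance(phi, lag):.3f}  "
          f"r(L)={lag_correlation(phi, lag):.3f}")
```

Small differences from the hand calculation in the third or fourth decimal place are expected, because the script carries unrounded intermediate values.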
A special case exists when the covariance and correlation-coefficient values are estimated at a lag distance of zero. At L = 0, the equation for covariance reduces to the corresponding equation for variance.
$$c(0) = \frac{1}{n}\sum_{i=1}^{n} x(u_i)\,x(u_i) - \left[\frac{1}{n}\sum_{i=1}^{n} x(u_i)\right]\left[\frac{1}{n}\sum_{i=1}^{n} x(u_i)\right] = s_{x(u)}^2$$
In our example, n = 7 for L = 0, which gives $c(0) = s_{x(u)}^2 = 2.548$.
We can easily show that r(0) = 1 because a perfect relationship exists between x(u) and x(u): they are identical. Mathematically,
$$r(0) = \frac{c(0)}{s_{x(u)}\, s_{x(u)}} = \frac{s_{x(u)}^2}{s_{x(u)}\, s_{x(u)}} = 1$$
Inferential statistics
Inferential statistics is a logical extension of descriptive statistics.
Descriptive statistics most often deals with analyzing sample data sets. However, from
the characteristics of the sample, conclusions (inferences) can be drawn about the
population from which the sample was taken.
Inferential statistics
▪ Random Experiment
▪ Sample Space and Events
▪ Probability
▪ Random Variables
▪ Mathematical Expectation
▪ Important Distribution Functions
Random Experiment
An experiment whose outcome cannot be predicted with certainty in advance.
Obviously, a random experiment has to result in more than one possible outcome.
Example: tossing of a coin
• More than one outcome: heads or tails
• We cannot predict the outcome with certainty
All characteristics of a random experiment are satisfied
The concept of random experiment is extremely important. However, drilling a well as a
random experiment is different from tossing a coin as a random experiment. Tossing a
coin results in either of two outcomes; therefore, it is a random experiment. Drilling a
well results in only one outcome; however, our lack of knowledge does not allow us to
predict that outcome with certainty. Therefore, we treat it as a random experiment
with multiple possibilities of outcomes.
Sample Space and Events
A sample space, S, is a set of all possible outcomes. For a rolling-a-die experiment, we
can denote the sample space as
S = (1, 2, 3, 4, 5, 6)
because the experiment has six possible outcomes.
An event is defined as a set consisting of some of the possible outcomes. If Event A
consists of all the even-numbered outcomes of the rolling-a-die experiment, then Event
A is
A = (2, 4, 6)
If Event B consists of all the outcomes less than five for the rolling-a-die experiment,
then Event B is
B = (1, 2, 3, 4)
Using only two events, A and B, of a sample space, we can define the union of these two
events, A ∪ B, as consisting of all the outcomes present in either A or B. Therefore,
A ∪ B = (1, 2, 3, 4, 6)
Similarly, using Events A and B, we can define the intersection of these two events, A ∩
B, as consisting of all outcomes that are present in both A and B. Therefore,
A ∩ B = (2, 4)
If the intersection of two events results in a null set (containing no outcome), these two
events are mutually exclusive.
For example, if Event C contains all the odd-numbered outcomes from a rolling-a-die
experiment, then
C = (1, 3, 5)
With the definition of intersection of events,
A∩C=∅
where ∅ = a null set.
Because A ∩ C contains no outcomes, A and C are considered mutually exclusive.
The best way to illustrate some of these definitions is with Venn diagrams. Fig. 2.25
shows examples of Venn diagrams for union and intersection of events within a sample
space. The shaded region indicates the resulting event.
Probability
Probability is normally associated with a particular event of a random experiment. A
geologist’s statement that “there is 30% probability of finding oil at a location where a
new well is to be drilled” can have two meanings. Both meanings are correct and are a
result of the way we define the random experiment.
1. The first interpretation is that the geologist believes that, in reservoirs with a similar
depositional environment, 30% of the wells will produce oil. That is, if several wells are
drilled in very similar depositional environments, 30% will produce oil and, by
inference, 70% will be dry holes.
2. The second interpretation is that the 30% probability is a measure of the geologist’s
subjective belief that the well will produce oil.
These two interpretations can be directly related to the description of random
experiments.
The first interpretation is closely related to the random experiment of rolling a die. That is, if we repeat the experiment a large number of times under controlled conditions, a pattern emerges about the outcomes. For example, for a true die, if we roll the die a large number of times, we observe that each of the six outcomes is equally likely. Under this interpretation, probability can be defined as
$$p(A) = \lim_{n_C \to \infty} \frac{n_A}{n_C}$$
where
p(A) = probability of Event A,
nA = number of times the outcome A has occurred, and
nC = number of times the random experiment is conducted under controlled conditions.
nC has to be large to ensure that a correct pattern is captured.
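The limiting-frequency definition can be illustrated with a small simulation. This is a sketch, not part of the original example: the die is simulated with Python's pseudorandom generator, and the seed and trial counts are arbitrary choices.

```python
import random


def relative_frequency(event, trials, seed=0):
    """Estimate p(event) as n_A / n_C for a simulated fair die."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) in event)
    return hits / trials


even_outcomes = {2, 4, 6}
for n_c in (100, 10_000, 200_000):
    estimate = relative_frequency(even_outcomes, n_c)
    print(f"n_C = {n_c:7d}  n_A / n_C = {estimate:.4f}")
```

As the number of trials grows, the ratio should settle near 0.5 for the even-numbered outcomes and near 1/6 for any single face.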
For the rolling-a-die experiment, the probability of each of the six outcomes is 1/6, or 0.1667. Going back to the geologist’s statement, if we drill a large number of wells in a similar depositional environment, we observe that p(P) = 0.3, where P = producer; that is, 30% of the wells should produce oil.
Interpretation 2 is closely related to the random experiment of drilling a well. The geologist is simply using a subjective belief about the chance of success. Uncertainty exists because of a lack of complete knowledge about the reservoir. However, the geologist is using his/her partial knowledge to assign a value to the probability of success.
Both interpretations are correct, and the mathematics of probability does not change
with the interpretation applied. Deterministic events can be treated as random events if
we lack sufficient knowledge about those events; however, with partial knowledge about
the events, probabilities can be assigned to the likely outcomes (events) of that
experiment.
Probability
▪ Laws of Probability
▪ Conditional Probability
Laws of Probability
For Event A of a random experiment with Sample Space S,
0≤𝑝 𝐴 ≤1
That is, the probability value can never be less than zero or greater than one. Also,
𝑝 𝑆 =1
That is, the probability that the outcome will be part of the sample space is equal to one.
Secondly,
$$\sum_{i=1}^{n_e} p(A_i) = p\left(\bigcup_{i=1}^{n_e} A_i\right)$$
where
Ai = sequence of mutually exclusive events and
ne = number of mutually exclusive events
For ne = 2,
$$p(A_1) + p(A_2) = p(A_1 \cup A_2)$$
and for ne = 3,
$$p(A_1) + p(A_2) + p(A_3) = p(A_1 \cup A_2 \cup A_3)$$
That is, the probability of the union of mutually exclusive events is equal to the addition
of the probabilities of the individual events.
For two events that are not mutually exclusive,
$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$
Example 2.6
The following three events are defined for a rolling-a-die experiment.
1. Event A. All even-numbered outcomes.
2. Event B. All outcomes greater than 3.
3. Event C. All odd-numbered outcomes less than 4
Calculate p(A), p(B), p(C), p(A∪B), p(A∪C) and p(A∩B).
Example 2.6 - Solution
From the description of events, the events are A = (2, 4, 6), B = (4, 5, 6), and C = (1, 3).
Knowing that the probability of individual outcomes for rolling a die is 1/6, we can
calculate the probability of individual events
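$$p(A) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
$$p(B) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
$$p(C) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$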
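Because A ∩ C = ∅, Events A and C are mutually exclusive and p(A ∩ C) = 0, so
$$p(A \cup C) = p(A) + p(C) = \frac{1}{2} + \frac{1}{3} = \frac{5}{6}$$
which can be confirmed because
A ∪ C = (1, 2, 3, 4, 6)
Therefore,
$$p(A \cup C) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{5}{6}$$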
Also, A ∩ B = (4, 6), which results in
$$p(A \cap B) = \frac{1}{3}$$
$$p(A \cup B) = p(A) + p(B) - p(A \cap B) = \frac{1}{2} + \frac{1}{2} - \frac{1}{3} = \frac{2}{3}$$
which is confirmed because
A ∪ B = (2, 4, 5, 6)
Therefore,
$$p(A \cup B) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{2}{3}$$
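The results of Example 2.6 can be verified by enumerating the outcomes with Python sets and exact fractions. This is a sketch; the helper name `p` is illustrative, and note that `|` and `&` below are Python's set union and intersection operators, not conditional-probability notation.

```python
from fractions import Fraction

# Sample space of the rolling-a-die experiment; each outcome has probability 1/6.
S = {1, 2, 3, 4, 5, 6}


def p(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event & S), len(S))


A = {x for x in S if x % 2 == 0}             # even-numbered outcomes
B = {x for x in S if x > 3}                  # outcomes greater than 3
C = {x for x in S if x % 2 == 1 and x < 4}   # odd-numbered outcomes less than 4

print("p(A) =", p(A), " p(B) =", p(B), " p(C) =", p(C))
print("p(A u B) =", p(A | B), " p(A u C) =", p(A | C), " p(A n B) =", p(A & B))
```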
Conditional Probability
As the name indicates, conditional probability is the probability of an event that is
conditional on some information. This allows calculation of the probability of a given
event when partial information regarding the result of the random experiment is
available.
The most common notation used to describe conditional probability is p(A|B). This indicates the conditional probability of Event A occurring given that Event B has occurred. A general equation for calculating conditional probability is
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
Example 2.7
The probability of finding oil in an exploration well is estimated to be 0.2. One of the
uncertainties in finding oil in this well is the presence of source rock. The probability of
the presence of source rock is 0.7. After these preliminary calculations were made, a
well drilled in a nearby area confirmed the presence of source rock in the region. What
is the probability of finding oil in the first exploration well given that the presence of
source rock is confirmed?
Example 2.7 - Solution
Let A be the event that the oil is found in the exploration well, and B be the event that the source rock is present.
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
We know that p(B) = 0.7 and p(A ∩ B) = 0.2. The latter holds because A is a subset of B: oil can be found only if source rock is present, so A ∩ B = A.
$$p(A|B) = \frac{0.2}{0.7} = 0.286$$
The probability of finding oil improves to 0.286 because source rock is present.
We can rewrite the definition of conditional probability as
$$p(A \cap B) = p(A|B)\,p(B)$$
Eq. allows us to derive a few more conclusions.
First, for mutually exclusive events, (A ∩ B) is a null set; therefore, p(A ∩ B) = 0.
Because the left side of Eq. is zero, p(A|B) = 0, which is consistent with the idea that if B
has occurred, A cannot occur. Therefore, the probability of A occurring given that B
has occurred is zero. The same definition can be used to define independent events,
which are events that are independent of each other. The occurrence of an independent
event is not affected by whether any of the events from which it is independent has
occurred.
For example, in rolling a pair of dice, the outcome of one die does not affect the
outcome of the other die. The outcomes of the two dice are completely independent of
each other. In other words,
p(A|B) = p(A) and p(B) = p(B|A)
where the probability that Event A will occur is not affected by the fact that B has
occurred. The same thing can be said about Event B. The probability of Event B
occurring is not affected by the fact that Event A has occurred.
Substituting gives
$$p(A \cap B) = p(A)\,p(B)$$
for independent events. For more than two events, Eq. can be extended as
$$p(A_1 \cap A_2 \cap \cdots \cap A_n) = p(A_1)\,p(A_2) \cdots p(A_n)$$
where A1, A2, ... , An = independent events.
Example 2.8
The probability of success is estimated to be 0.2 for an exploration well in Basin 1. For
another exploration well in Basin 2, the probability of success is estimated to be 0.3. If
both wells are drilled, what is the probability that both will be successful?
Example 2.8 - Solution
Because these wells are drilled in different basins, they can be considered as
independent events; the outcome of one well does not affect the outcome of the other. If
Event A is a successful well in Basin 1 and Event B is a successful well in Basin 2,
p(A) = 0.2 and p(B) = 0.3.
With Eq. 2.41,
p(A ∩ B) = p(A)p(B) = 0.2 × 0.3 = 0.06.
The probability that both events will occur is 0.06 or 6%.
Another useful extension of Eq. 2.38 can be written. Recall that Eq. 2.38 states that
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
With a sample space consisting of ne mutually exclusive events, Ai, so that
$$\sum_{i=1}^{n_e} p(A_i) = 1$$
we can easily write
$$p(B) = p(A_1 \cap B) + p(A_2 \cap B) + \cdots + p(A_{n_e} \cap B)$$
In Fig. 2.26, which illustrates this, the sample space is divided into four mutually exclusive events, A1 through A4, and Event B is located in the sample space. As the figure shows, Event B can be written as
$$B = (A_1 \cap B) \cup (A_2 \cap B) \cup (A_3 \cap B) \cup (A_4 \cap B)$$
This can be generalized for ne mutually exclusive events.
Collecting the relationships so far:
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
$$p(B|A) = \frac{p(A \cap B)}{p(A)}$$
$$p(A \cap B) = p(B|A)\,p(A)$$
$$p(B) = p(A_1 \cap B) + p(A_2 \cap B) + \cdots + p(A_{n_e} \cap B)$$
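Combining these relationships leads to Bayes' theorem, which the solution of Example 2.9 applies. The step is sketched below for one event A_j out of the ne mutually exclusive events; the index j is introduced here only for illustration.

```latex
\begin{align*}
p(A_j \mid B) &= \frac{p(A_j \cap B)}{p(B)}
               = \frac{p(B \mid A_j)\,p(A_j)}{\sum_{i=1}^{n_e} p(A_i \cap B)} \\
              &= \frac{p(B \mid A_j)\,p(A_j)}{\sum_{i=1}^{n_e} p(B \mid A_i)\,p(A_i)}
\end{align*}
```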
Example 2.9 - Solution
Let Ei, where i = 1, 2, 3, be an event that oil is in Region i. Let F be an event that the search of Region 1 is unsuccessful. With Bayes’ theorem,
$$p(E_1|F) = \frac{p(E_1 \cap F)}{\sum_{i=1}^{3} p(F|E_i)\,p(E_i)}$$
and
$$p(E_1 \cap F) = p(F|E_1)\,p(E_1)$$
The denominator of Eq. can be written as
$$\sum_{i=1}^{3} p(F|E_i)\,p(E_i) = p(F|E_1)\,p(E_1) + p(F|E_2)\,p(E_2) + p(F|E_3)\,p(E_3)$$
Each region has an equal likelihood of success; therefore,
p(E1) = p(E2) = p(E3) = 0.333
Also, p(F|E1) = 0.4, p(F|E2) = 1, and p(F|E3) = 1 because the exploration well in Region 1 results in failure if oil is in either Region 2 or 3. Substituting all these values gives
$$p(E_1|F) = \frac{p(F|E_1)\,p(E_1)}{p(F|E_1)\,p(E_1) + p(F|E_2)\,p(E_2) + p(F|E_3)\,p(E_3)} = \frac{0.4 \times 0.333}{0.4 \times 0.333 + 1 \times 0.333 + 1 \times 0.333} = 0.167$$
Similarly,
$$p(E_2|F) = \frac{p(F|E_2)\,p(E_2)}{p(F|E_1)\,p(E_1) + p(F|E_2)\,p(E_2) + p(F|E_3)\,p(E_3)} = \frac{1 \times 0.333}{0.4 \times 0.333 + 1 \times 0.333 + 1 \times 0.333} = 0.417$$
The probability of finding oil in Regions 2 and 3 improved to 0.417 because the search in
Region 1 was unsuccessful.
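The Bayes update in this solution can be reproduced with a few lines of code. This is a sketch using the prior and conditional probabilities stated in the solution; the function name `posterior` is illustrative, and the priors are entered as exact thirds rather than 0.333.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: p(E_i|F) = p(F|E_i) p(E_i) / sum_k p(F|E_k) p(E_k)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(joint)                      # p(F)
    return [j / total for j in joint]


priors = [1 / 3, 1 / 3, 1 / 3]      # p(E1), p(E2), p(E3)
likelihoods = [0.4, 1.0, 1.0]       # p(F|E1), p(F|E2), p(F|E3)

updated = posterior(priors, likelihoods)
for i, value in enumerate(updated, start=1):
    print(f"p(E{i}|F) = {value:.3f}")
```

The three updated probabilities sum to one, which is a quick check that the denominator was formed correctly.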
Random Variables
A random variable is a variable whose values are generated by a random experiment
on the basis of some probabilistic function.
For example, the rolling-a-die experiment produces any one of the six possible outcomes randomly. If the random variable for this experiment is denoted by X, then for a true die,
$$p(X=1) = p(X=2) = p(X=3) = p(X=4) = p(X=5) = p(X=6) = \frac{1}{6}$$
That is, the probability that a random variable can take any one of the six values is 1/6.
It is important to maintain the distinction between a random variable and an actual
outcome of a random variable. To make this distinction, we use an uppercase letter to
denote a random variable (e.g., X) and a lowercase letter to denote the outcome (or
realization) of a random variable (e.g., x).
Conceptually, the difference between the random variable and its realizations can be explained with the same example of the rolling-a-die experiment. The random variable can take any of the six outcomes. If repeated rolls produce 1, 2, 5, 6, 3, 4, 3, 3, 4, 6, 1, 4, 2, …, these are the realizations of the random variable, which we can denote as x1 = 1, x2 = 2, x3 = 5, and x4 = 6, where the subscripts denote the number of a particular realization.
Random variables are defined as two types: discrete and continuous. Discrete random variables can take a finite number of values. An example is the rolling-a-die experiment, where a random variable can take only six possible values. Continuous random variables can take any value within an interval (for example, porosity data from a reservoir), so a very large number of outcomes are possible.
Probability Function. The probability function describes the probability that a random
variable will take a certain value. The probability function is closely related to the
relative-frequency-distribution function, which describes the chance that a value will
fall within a certain class. The probability function describes an essentially similar
behavior.
For discrete random variables, the probability mass function, P(a), of a random variable X is
P(a) = p(X = a)
For a discrete random variable, X can take a finite number of values xi, i = 1, …, n. Therefore,
$$\sum_{i=1}^{n} P(x_i) = 1$$
Example 2.10
Define the probability mass function for the rolling-a-die experiment. Show that $\sum_{i=1}^{n} P(x_i) = 1$ is satisfied for this function.
Example 2.10 - Solution
Knowing the random experiment, we can write, for example,
P(1) = p(X = 1) = 1/6
Similarly, we can write
P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Applying Eq. gives
$$\sum_{i=1}^{6} P(x_i) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 1$$
For a continuous random variable, the probability density function, f(x), describes the behavior of a random variable. Fig. 2.27 shows a representative probability density function. One requirement of the probability density function is that the area under the curve be equal to one. Mathematically,
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
The probability that the value of a random variable will fall within a certain interval can be calculated as
$$p(a < X \le b) = \int_{a}^{b} f(x)\,dx$$
Schematically, the probability that a value will fall within a certain interval is represented by the area under the curve within that interval (Fig. 2.28).
Example 2.11
Pay-zone thickness in a reservoir is described by the following probability density function.
$$f(x) = \begin{cases} 0 & \text{for } x \le 20 \text{ ft} \\ \dfrac{1}{50} & \text{for } 20 < x \le 70 \text{ ft} \\ 0 & \text{for } x > 70 \text{ ft} \end{cases}$$
Show that this density function satisfies $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Further, calculate the probability that the thickness at a particular location will fall between 30 and 50 ft.
Example 2.11 - Solution
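Applying Eq. to the probability density function gives
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{20} 0\,dx + \int_{20}^{70} \frac{1}{50}\,dx + \int_{70}^{\infty} 0\,dx = 0 + \left.\frac{x}{50}\right|_{20}^{70} + 0 = 1$$
$$p(30 \le X \le 50) = \int_{30}^{50} \frac{1}{50}\,dx = \left.\frac{x}{50}\right|_{30}^{50} = \frac{20}{50} = 0.4$$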
Therefore, there is a 40% probability that the pay zone will fall within the 30- to 50-ft
interval.
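Both integrals can also be checked numerically. The sketch below uses a simple midpoint rule; the integration limits of 0 to 100 ft for the total area are an assumption that is safe here because the density is zero outside 20 to 70 ft.

```python
def pdf(x):
    """Uniform density of Example 2.11: 1/50 for 20 < x <= 70 ft, zero elsewhere."""
    return 1.0 / 50.0 if 20.0 < x <= 70.0 else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h


total_area = integrate(pdf, 0.0, 100.0)   # should be close to 1
p_30_50 = integrate(pdf, 30.0, 50.0)      # should be close to 0.4

print(f"area under f(x) = {total_area:.4f}")
print(f"p(30 <= X <= 50) = {p_30_50:.4f}")
```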
Cumulative-Distribution Function
The cumulative-distribution function, F(x), is defined as
F(x) = p(X ≤ x)
It is the probability that a random variable X will be less than or equal to a particular value x.
Knowing the definition of the cumulative-distribution function, we can calculate the
probability that a random variable will fall within a certain interval.
p(X ≤ b) = p(X ≤ a) + p(a < X ≤ b)
Therefore,
p(a < X ≤ b) = p(X ≤ b) - p(X ≤ a) = F(b) - F(a)
For a discrete random variable, the cumulative-distribution function can be calculated as
$$F(a) = \sum_{x_i \le a} P(x_i)$$
where P(xi) = the probability mass function of a random variable.
For a continuous random variable, the cumulative-distribution function can be calculated as
$$F(a) = \int_{-\infty}^{a} f(x)\,dx$$
where f(x) = the probability density function of a random variable.
Example 2.12
Calculate the cumulative-distribution function for the rolling-a-die experiment. What is
the probability that the outcome will fall between two and five, p(2 < X ≤ 5)?
Example 2.12 - Solution
$$F(1) = \sum_{i=1}^{1} P(x_i) = \frac{1}{6}$$
$$F(2) = \sum_{i=1}^{2} P(x_i) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$
Fig. 2.29 shows the plot of the cumulative-distribution function. It starts with a value of zero and reaches a value of 1 at a value of the variable equal to six.
p(2 < X ≤ 5) = F(5) - F(2) = 5/6 - 1/3 = 1/2
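The same cumulative-distribution function can be built directly from the probability mass function. This is a minimal sketch with exact fractions; the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction

# Probability mass function of a fair die: P(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def cdf(a):
    """F(a) = sum of P(x_i) over all x_i <= a."""
    return sum((prob for x, prob in pmf.items() if x <= a), Fraction(0))


for a in range(0, 7):
    print(f"F({a}) = {cdf(a)}")

p_interval = cdf(5) - cdf(2)        # p(2 < X <= 5) = F(5) - F(2)
print("p(2 < X <= 5) =", p_interval)
```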
$$E[u(X)] = \int_{-\infty}^{\infty} u(x)\,f(x)\,dx$$
$$z_2 = \frac{x_2 - \mu}{\sigma} = \frac{0.22 - 0.2}{0.02} = 1$$
Once the values are standardized, we can look up the values of F(z1) and F(z2) in Table 2.11: F(z1) = 0.15866 and F(z2) = 0.84134.
p[0.18 ≤ ϕ ≤ 0.22] = F(z2) - F(z1) = 0.84134 - 0.15866 = 0.6827
That is, a 68% probability exists that a porosity value will fall between 0.18 and 0.22.
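Instead of a table lookup, the standard normal cumulative-distribution function can be evaluated with the error function. The sketch below assumes a mean of 0.20 and a standard deviation of 0.02, as implied by the standardization shown above.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative-distribution function F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu, sigma = 0.20, 0.02            # assumed from the standardization above
z1 = (0.18 - mu) / sigma          # about -1
z2 = (0.22 - mu) / sigma          # about +1

prob = normal_cdf(z2) - normal_cdf(z1)
print(f"F(z1) = {normal_cdf(z1):.5f}, F(z2) = {normal_cdf(z2):.5f}")
print(f"p(0.18 <= porosity <= 0.22) = {prob:.4f}")
```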
$$\sigma^2 = \mu^2 \left(e^{\beta^2} - 1\right)$$
where
μ = the mean of Variable X and
σ2 = the variance of Variable X.
$$\alpha = \ln\mu - \frac{\beta^2}{2} = \ln 20 - \frac{1.792}{2} = 2.1$$
Standardizing gives
$$z = \frac{\ln x - \alpha}{\beta} = \frac{\ln 200 - 2.1}{\sqrt{1.792}} = 2.39$$
From Table 2.11, F(2.39) = 0.992. That is, the probability that permeability will be less than 200 md is 99.2%. In other words, the probability that the permeability will be greater than 200 md is 1 - 0.992 = 0.008, or 0.8%.
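The log-normal calculation can be checked in the same way. The sketch below takes the mean of 20 md and β² = 1.792 from the worked numbers above; dividing by β = √1.792 is what reproduces z = 2.39.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative-distribution function F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean_k = 20.0        # mean permeability, md (from the worked numbers above)
beta_sq = 1.792      # variance of ln(k) (from the worked numbers above)

alpha = math.log(mean_k) - beta_sq / 2.0            # mean of ln(k)
z = (math.log(200.0) - alpha) / math.sqrt(beta_sq)  # standardized ln(200)

p_less = normal_cdf(z)
print(f"alpha = {alpha:.2f}, z = {z:.2f}")
print(f"p(k < 200 md) = {p_less:.3f}, p(k > 200 md) = {1.0 - p_less:.3f}")
```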