Article
An Approach for the Classification of Rock Types Using
Machine Learning of Core and Log Data
Yihan Xing 1 , Huiting Yang 2 and Wei Yu 3, *
1 School of Statistics, Capital University of Economics and Business, Beijing 100070, China;
[email protected]
2 School of Geosciences & Technology, Southwest Petroleum University, Chengdu 610500, China;
[email protected]
3 SimTech LLC, Houston, TX 77494, USA
* Correspondence: [email protected]
Abstract: Classifying rocks based on core data is the most common method used by geologists.
However, due to factors such as drilling costs, it is impossible to obtain core samples from all wells,
which poses challenges for the accurate identification of rocks. In this study, the authors demonstrated
the application of an explainable machine-learning workflow using core and log data to identify
rock types. The rock type is determined utilizing the flow zone index (FZI) method using core data
first, and then based on the collection, collation, and cleaning of well log data, four supervised
learning techniques were used to correlate well log data with rock types, and learning and prediction
models were constructed. The optimal machine learning algorithm for the classification of rocks
is selected based on a 10-fold cross-validation test and a comparison of AUC (area under the curve) values. The
accuracy rate of the results indicates that the proposed method can greatly improve the accuracy of
the classification of rocks. SHapley Additive exPlanations (SHAP) was used to rank the importance of
the various well logs used as input variables for the prediction of rock types and provides both local
and global sensitivities, enabling the interpretation of prediction models and solving the “black box”
problem with associated machine learning algorithms. The results of this study demonstrated that the
proposed method can reliably predict rock types based on well log data and can solve hard problems
in geological research. Furthermore, the method can provide consistent well log interpretation arising
from the lack of core data while providing a powerful tool for well trajectory optimization. Finally, the system can aid with the selection of intervals to be completed and/or perforated.

Keywords: rock type; flow zone index; supervised learning; SHAP value; AUC value

Citation: Xing, Y.; Yang, H.; Yu, W. An Approach for the Classification of Rock Types Using Machine Learning of Core and Log Data. Sustainability 2023, 15, 8868. https://doi.org/10.3390/su15118868
learning algorithm to classify volcanic rocks. Valentín et al. [4] identified rock types using
a deep residual network based on acoustic image logs and micro-resistivity image logs.
Unsupervised learning techniques use training samples of unknown categories (unlabeled
training samples) to solve various problems in pattern recognition. Commonly used unsu-
pervised learning algorithms include principal component analysis (PCA) and clustering
algorithms. Ding Ning [5] carried out lithology identification by means of cluster analysis
based on density attributes. Ju Wu et al. [6] identified coarse-grained sandstone, fine-
grained sandstone, and mudstone using a Bayes stepwise discriminant analysis method
with an accuracy of 82%. Duan Youxiang et al. [7] improved the accuracy of sandstone iden-
tification and classification to a level higher than that of methods based on single-machine
learning. Ma Longfei et al. [8] built a model based on a gradient-boosted decision tree
(GBDT) that can improve the accuracy of lithology identification. Most of these methods
use mathematical models for lithology identification based on manually determined rock
types and involve great uncertainties because experts may adopt different criteria for the
classification of rocks. Moreover, these methods mainly focus on sandstone reservoirs; they
only use a certain type of algorithm for lithology identification and do not consider the
optimization of models adequately. Therefore, it is difficult to interpret the final models of
these methods with geological knowledge. Tang et al. [9] used machine learning to find the
optimum profile in shale formations. Zhao et al. [10] used machine learning methods to
study the dynamic characteristics of fractures in different shale fabric facies, which showed
that machine learning can solve more complex problems, such as shale rock fabric and
fracture characteristics. In this paper, a method combining FZI and machine learning is
proposed for the first time to classify rock types in the study area. The rock type is first determined through the FZI method using core data; the accuracy levels of four machine learning algorithms are then compared, and the optimal algorithm is selected to identify rock types in uncored wells. This method can be used to identify rocks in various
hydrocarbon reservoirs and improve the efficiency and accuracy of well log interpretation
and other geological interpretations. It provides a new idea for lithology identification and
is of great significance for intelligent reservoir evaluation.
2. Geological Settings
The study area is located in the northeastern part of the Amu Darya basin in Turk-
menistan, near the juncture with Uzbekistan. The formation of interest is composed of the
Callovian–Oxfordian carbonate deposits, with an estimated thickness of 350 m, consisting
of the following units from top to bottom: XVac, XVp, XVm, XVhp, XVa1, Z, XVa2, and
XVI [11] (Figure 1).
The area under study in the Callovian period is a carbonate gentle slope sedimen-
tary system composed of an inner ramp, a mid-ramp, an outer ramp, and basin facies
belts. In the early Oxfordian period, under regional transgression, the outer zone of the
mid ramp and outer ramp in the Callovian period were gradually submerged, and the
inner ramp—mid-ramp gradually developed into an edged shelf-type carbonate platform.
The water body in the outer zone is highly energetic, and high-energy shoals or reef–shoal
complexes were developed. The top of the reservoir starts at a depth of about 2300 m.
The main production zones are XVac, XVp, and XVm. The main rock types are various
limestones, where the average matrix porosity is 11.1% and the geometric mean of perme-
ability is 53 mD. The reservoir space can be summarized into three types: pore, vug, and
fracture. The reservoir quality varies significantly vertically and laterally due to different
depositional settings and diagenesis.
Figure 1. Location of the study area and the column for target intervals of Callovian–Oxfordian.
3. Data and Methodology
The schematic of the workflow used in this work is shown in Figure 2.
Figure 2. Schematic of the workflow presented in this work.

3.1. Data
In this study, 270 m of coring data from 3 wells in the Callovian–Oxfordian formation were used, mainly including the routine core analysis data of 956 samples, core photos, thin sections, and scanning electron microscope data. In addition, petrophysical well-log data, including gamma-ray (GR), sonic (DT), resistivity (RT and RXO), and density (RHOB) logs, were available for rock-type classification, especially in the intervals with poor core data or without core data.
3.2. Methods
3.2.1. Rock Types
Rock typing has a wide variety of applications, such as the prediction of high mud-loss
intervals, potential production zones, and locating perforations. There are many methods
to classify rock types; in this study, we use Winland r35 [12], Pittman equations [13], and
the FZI [14] method. A detailed method of rock classification can be found in the related
literature. It can be seen from Figure 3 that the Callovian–Oxfordian formation in the study
area can be divided into 7 rock types (DRT 1–DRT 7). The corresponding rock types are
wackestone with microporosity, mud-dominated packstone, grainstone with some separate-vug pore space, grainstone, grain-dominated packstone, wackestone with microfractures, and mudstone with microfractures, respectively. The microscopic photos of different rock types are shown in Figure 4. Statistics of the porosity and permeability of different rock types are shown in Table 1.

Figure 3. Porosity and permeability cross-plots of different rock types identified by FZI.
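For reference, the FZI calculation underlying this rock typing can be sketched as follows. The RQI and FZI definitions are the standard ones from the FZI literature; the discrete-rock-type (DRT) binning shown is a common convention and not necessarily the authors' exact cutoffs for DRT 1–DRT 7.

```python
import numpy as np

def fzi(porosity, permeability_md):
    """Flow zone index from fractional porosity and permeability in mD:
    RQI = 0.0314*sqrt(k/phi), phi_z = phi/(1-phi), FZI = RQI/phi_z."""
    phi = np.asarray(porosity, dtype=float)
    k = np.asarray(permeability_md, dtype=float)
    rqi = 0.0314 * np.sqrt(k / phi)   # reservoir quality index, in microns
    phi_z = phi / (1.0 - phi)         # normalized porosity index
    return rqi / phi_z

def discrete_rock_type(fzi_value):
    """One common DRT binning, round(2*ln(FZI) + 10.6); the authors'
    exact cutoffs may differ."""
    return int(np.round(2.0 * np.log(fzi_value) + 10.6))

# Example with the field averages quoted in the text: 11.1% porosity, 53 mD
f = fzi(0.111, 53.0)
drt = discrete_rock_type(f)
```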
DT (us/ft) GR (gAPI) RHOB (g/cm³) RT (ohm·m) RXO (ohm·m)
Number of values 1093.00 1093.00 1093.00 1093.00 1093.00
Number of missing 2.00 2.00 2.00 2.00 2.00
Min value 48.86 5.29 1.57 4.51 3.53
Max value 81.70 42.94 2.67 72,207.00 618.60
Mode 55.08 8.76 2.41 26.86 10.24
Arithmetic mean 61.83 16.25 2.38 372.28 56.98
Geometric mean 61.41 14.74 2.38 52.46 22.87
Median 60.87 15.01 2.39 33.46 15.92
Average deviation 6.17 5.87 0.07 566.01 65.06
Standard deviation 7.35 7.04 0.10 3307.82 103.48
Variance 54.05 49.58 0.01 10,941,600.00 10,707.70
Skewness 0.39 0.59 −1.82 17.44 2.91
Kurtosis −0.73 −0.30 8.88 325.68 8.34
Q1 [10%] 52.79 7.64 2.27 15.81 7.31
Q2 [25%] 55.55 10.58 2.33 22.55 9.68
Q3 [50%] 60.87 15.01 2.39 33.46 15.92
Q4 [75%] 67.03 21.50 2.44 79.79 37.78
Q5 [90%] 72.31 26.29 2.48 488.06 176.45
It can be seen from Table 3 that the GR values of the different rock types are low and vary little, and the RHOB values also do not change much. The DT values of DRT 3 and DRT 4 are larger (greater than 60 us/ft) than those of the other rock types, reflecting their high porosity, while DRT 6 and DRT 7 have high resistivity (RT and RXO) values, which reflect the compact character of these two rock types.
It can be seen from the star-plot of average logging values of different rock types
(Figure 5) that it is difficult to use one or several logging values to classify rock types, which
further illustrates the necessity of building other models (such as machine learning) to
predict rock types.
Figure 5. Star-plots of log mean values for different rock types.
(2) Data cleaning and feature selection
Data cleaning is the process of detecting and removing noisy data (erroneous, inconsistent, and duplicate data) from datasets. Erroneous data mainly results from errors in well log data (especially density data) and is typically caused by borehole enlargement during the drilling process. In this study, erroneous data is mainly identified through statistical analysis methods (e.g., the box-plot method). Duplicate data mainly originates from different rock types or porosity and permeability values at the same depth. In addition, some columns in the initial dataset are empty, and the authors analyzed the “missingness” in the data set, which represents the percentage of the total number of entries for any variable that is missing. The missing values can either be predicted using the other variables or removed. The missingness of the well-logging variables used in this study is shown in Figure 6, in which the X-axis represents the well-logging variable and the Y-axis represents the missingness expressed as a percentage. Since the degree of missingness is very low (<0.4%) in this data set, the rows with missing values were removed.
Figure 6. Missingness in different variables used in this study.
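A minimal pandas sketch of this missingness check and row removal (the mini table below is hypothetical, standing in for the real log data set):

```python
import numpy as np
import pandas as pd

# Hypothetical mini table standing in for the well-log data set
logs = pd.DataFrame({
    "GR":   [12.1, 8.7, np.nan, 15.3],
    "DT":   [55.0, 61.2, 58.4, np.nan],
    "RHOB": [2.41, 2.38, 2.35, 2.44],
})

# Missingness per variable: percentage of rows in which the value is absent
missingness = logs.isna().mean() * 100

# With missingness this low, rows containing any missing value are dropped
clean = logs.dropna()
```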
Outliers were removed mainly through the histogram method, the box-plot method, and Rosner's test [15]. Histograms are useful to provide information on the distribution of values for each feature; they can be used to determine the distribution, center, and skewness of a dataset and detect outliers therein. From the frequency histograms of various parameters (Figure 7), it can be seen that the RT and RXO data follow a skewed distribution, and the RHOB data basically follow a normal distribution. A few outliers are shown as black circles in the figure.
Figure 7. Histograms of features (log parameters).
Box plots are widely used to describe the distribution of values along an axis based on the five-number summary: minimum, first quartile, median, third quartile, and maximum (Figure 8). This visual method allows the reviewer to better understand the distribution and locate the outliers. The median marks the midpoint of the data and is shown by the line that divides the narrow box into two. The median of the data is usually skewed towards the top or bottom of the narrow box, which means that the data are usually denser on the narrow side. Two of the more extreme examples are RT and RXO. In the samples that the authors took, half of the samples had values between 30 and 50 ohm·m, which is a relatively dense range. The box plot represents a left-skewed distribution. The values that are greater than the upper limit or less than the lower limit will be the outliers that should be looked into further, as they might carry extra information. Most features do not have outliers, and only the RHOB values of some sample points are less than 2.0 g/cm³. These values are outliers resulting from the distortion of density data caused by borehole collapse during the drilling process.
Figure 8. Box plot of features (log parameters).
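The box-plot (Tukey) outlier rule described above can be sketched as follows; the whisker factor of 1.5 is the usual convention, and the sample values are hypothetical:

```python
import numpy as np

def boxplot_outlier_bounds(values, whisker=1.5):
    """Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

# Hypothetical RHOB readings; 1.55 mimics a density value distorted by
# borehole collapse
rhob = np.array([2.35, 2.38, 2.40, 2.41, 2.39, 2.44, 2.37, 1.55])
lo, hi = boxplot_outlier_bounds(rhob)
outliers = rhob[(rhob < lo) | (rhob > hi)]
```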
Considering the fact that this study involves a large number of samples, the authors used the Rosner test function to detect the outliers [16]. The function performs the Rosner generalized extreme studentized deviate test to identify potential outliers in a data set, assuming the data without any outliers comes from a normal (Gaussian) distribution.
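A minimal sketch of this generalized ESD procedure (not necessarily the exact library routine the authors used; `max_outliers` and the sample values below are illustrative):

```python
import numpy as np
from scipy import stats

def generalized_esd(x, max_outliers, alpha=0.05):
    """Rosner's generalized extreme studentized deviate (ESD) test.

    Assumes the outlier-free data are approximately normal; returns the
    values flagged as outliers (up to max_outliers of them).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    work = x.copy()
    removed = []       # candidates, in the order they were stripped off
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        dev = np.abs(work - work.mean())
        j = int(np.argmax(dev))
        r_i = dev[j] / work.std(ddof=1)          # test statistic R_i
        removed.append(float(work[j]))
        work = np.delete(work, j)
        # critical value lambda_i from the t distribution
        p = 1.0 - alpha / (2.0 * (n - i + 1))
        t = stats.t.ppf(p, df=n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        if r_i > lam:
            n_outliers = i   # largest i with R_i > lambda_i
    return removed[:n_outliers]

# Hypothetical sample with one gross outlier (e.g., a distorted reading)
flagged = generalized_esd(
    [2.30, 2.40, 2.35, 2.41, 2.38, 2.39, 2.42, 2.36, 2.37, 25.0],
    max_outliers=3)
```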
(3) Correlation
By understanding the correlation between different parameters, appropriate features can be selected to build models. Ideally, the selected features should have a clear relationship to the output while avoiding too many similar features that would present duplicate information.
Figure 9. Correlation of features (log parameters).
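This kind of correlation check can be sketched with pandas (the sample values are hypothetical; in the study the computation runs over the cleaned log table visualized in Figure 9):

```python
import pandas as pd

# Hypothetical log samples standing in for the cleaned data set
logs = pd.DataFrame({
    "GR": [12.0, 18.5, 9.3, 25.1, 14.2],
    "DT": [55.0, 61.2, 52.7, 67.9, 58.4],
    "RT": [30.1, 22.4, 45.8, 15.2, 33.3],
})

# Pearson correlation matrix of the features
corr = logs.corr()
```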
(4) Normalization
To meet the needs of some machine learning algorithms (such as KNN), the data needs to be normalized to eliminate bias. There are several techniques to scale or normalize the data. The standard scaler expressed by Equation (2) was used for this study. For any given set of data x_i:

x_scaled_i = (x_i − mean(x)) / StdDev(x)    (2)
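Equation (2) can be implemented directly; this matches the behavior of scikit-learn's StandardScaler (population standard deviation):

```python
import numpy as np

def standard_scale(x):
    """Equation (2): subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

gr = np.array([5.3, 8.8, 16.2, 26.3, 42.9])  # hypothetical GR readings, gAPI
scaled = standard_scale(gr)
# After scaling, the feature has zero mean and unit standard deviation
```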
Figure 10. The plot of feature importance.
The random forest method can obtain the optimal result and avoid overfitting by adjusting the maximum tree depth, the percentage of features used in each tree, and the minimum sample size in a leaf node. Figure 11a shows the optimal number of parameters for splitting at any node, which should be 11.
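Tuning of this kind can be sketched with scikit-learn's grid search; the paper does not name its software stack, and the synthetic data and grid values below are illustrative stand-ins, not the study's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the log-variable / rock-type table
X, y = make_classification(n_samples=400, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# The knobs named in the text: tree depth, features per split, leaf size
grid = {
    "max_depth": [5, None],
    "max_features": [0.4, "sqrt"],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                      grid, cv=5, scoring="accuracy")
search.fit(X, y)
```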
(2) Gradient Boosting Machine (GBM)
Both GBM and the random forest method belong to the broad class of tree-based classification techniques. A series of weak learners is initially generated, each of which fits the negative gradient of the loss function of the previously superimposed model, so that the cumulative loss of the model after the addition of the weak learner decreases in the direction of the negative gradient. Then, all learners are linearly combined using different weights to enable the learners with excellent performance to be reused. The major advantage of the GBM algorithm is that it does not require standardization or normalization of features when different types of data are used; it is not sensitive to missing data; and it features high nonlinearity and good interpretability for the model.
Optimizable hyperparameters in the GBM algorithm include the number of trees, the
minimum number of data points in the leaf nodes, the interaction depth specified for the
maximum depth of each tree, and the number of variables (or predictors) for splitting at
each node [21]. The larger the number of trees, the larger the tree depth, and the higher the
accuracy. The smaller the number of observations at leaf nodes, the higher the accuracy.
When there are more than 800 trees and the maximum tree depth is 15, the complexity of
the model will increase greatly, but the improvement in accuracy is negligible. Therefore,
simpler models are preferred to avoid overfitting. The optimal hyperparameters selected
for this study are as follows: the number of trees (estimators) is 172 (Figure 11b), the maximum tree depth is 3, the minimum number of samples for a leaf node is 1, the fraction of features considered at each split is 0.2, and the random seed is 89.
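As a sketch, the reported hyperparameters map onto scikit-learn's GradientBoostingClassifier as follows; the mapping and the synthetic data are assumptions, since the authors' actual implementation is not specified:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five log variables and rock-type labels
X, y = make_classification(n_samples=500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The hyperparameters reported in the text, expressed in sklearn terms
gbm = GradientBoostingClassifier(
    n_estimators=172,       # number of trees
    max_depth=3,            # maximum tree depth
    min_samples_leaf=1,     # minimum samples per leaf node
    max_features=0.2,       # fraction of features considered at each split
    random_state=89,        # random seed
)
gbm.fit(X_tr, y_tr)
score = gbm.score(X_te, y_te)
```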
(3) K-Nearest Neighbor (KNN)
Figure 11. Hyperparameter tuning for different supervised learning techniques: (a) the estimators of RF is 11; (b) the estimators of GBM is 172; (c) the K of KNN is 40; (d) the number of neurons in the third hidden layer is 14.
3.3. K-Fold Cross-Validation
Classifiers for lithology identification were constructed using KNN, GBM, random
forest, and MLP based on well log data. The log parameters selected for predicting the
rock types were GR, RT, DT, RXO, and RHOB. A total of 75% of the data was used for
training, and the other 25% was used for testing. A 10-fold cross-validation was performed
on the training data to prevent overfitting. In 10-fold cross-validation, the training data
were randomly subdivided into 10 parts; the model was trained on 9 parts and then validated on the remaining part.
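The split-and-validate procedure above can be sketched as follows (synthetic data standing in for the five log variables; KNN with K = 40 from Figure 11c serves as the example classifier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the five log variables (GR, RT, DT, RXO, RHOB)
X, y = make_classification(n_samples=600, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# 75/25 train/test split, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 10-fold cross-validation on the training data only
knn = KNeighborsClassifier(n_neighbors=40)
cv_scores = cross_val_score(knn, X_tr, y_tr, cv=10)
```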
Table 6. Accuracy metrics on the test data set for the different supervised learning techniques.
Figure 12 shows the results of a comparison between the actual rock types (Actual Rock types) of core samples from Well A (which was not modeled during this study) and the rock types predicted by various supervised learning techniques (different colors represent different rock types). GBM_Rock represents rock types predicted by GBM using the log data. MLP_Rock, KNN_Rock, and Rand Forest_Rock represent the results predicted using MLP, KNN, and random forest, respectively. It is evident that the random forest technique does not predict as well as the other supervised learning techniques. The visual results in Figure 12 further corroborate the quantitative accuracy metrics shown in Table 6.
Figure 12. The plot of actual rock types and the types predicted by different machine learning techniques.
4.2. Importance of Predictors and Model Interpretation
Prediction models can be interpreted by quantitatively analyzing the importance of predictors (well-logging variables) to the models. This is helpful in decoding the “black box” predictions and makes the model interpretable. The main parameter is the SHapley Additive exPlanations (SHAP) values, which are calculated for each combination of predictor (log variables) and cluster (rock types). Mathematically, they represent the average of the marginal contributions across all permutations [27]. Typically, a higher SHAP value for a predictor/cluster combination suggests that the chosen log variable is important to identify the cluster. Because SHAP is model-agnostic, any machine-learning model can be analyzed to derive input/output relationships.
Figure 13a shows a variable-importance plot that lists the most significant variables in descending order, which provides a global interpretation of the classification and shows the average impact on model-output magnitude. In Figure 13a, the X-axis represents the average value of the SHAP absolute value, which reflects the average effect on the magnitude of the output, and the Y-axis represents the well-logging variables used to identify rock types. The plot shows that RT, RXO, and DT are the three most important variables to define rock types in this study.
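The "average of the marginal contributions across all permutations" definition can be made concrete with a small, dependency-free sketch. The value function here is a toy stand-in with illustrative numbers only, not the authors' models; in practice, libraries such as shap compute the same quantity efficiently for tree ensembles.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution to the
    coalition, averaged over all orderings (permutations) of the players."""
    players = list(players)
    perms = list(permutations(players))
    shap = {p: 0.0 for p in players}
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            shap[p] += value(frozenset(coalition)) - before
    return {p: s / len(perms) for p, s in shap.items()}

# Toy "model quality" over subsets of log variables (illustrative numbers
# only): RT carries most signal, DT some, GR a little, plus an RT-DT bonus
def v(subset):
    score = 0.0
    if "RT" in subset: score += 0.50
    if "DT" in subset: score += 0.30
    if "GR" in subset: score += 0.10
    if {"RT", "DT"} <= subset: score += 0.06  # interaction term
    return score

phi = shapley_values(["RT", "DT", "GR"], v)
# The interaction bonus is shared equally by RT and DT
```

By the efficiency property, the values sum to the worth of the full coalition, which is why SHAP values decompose a prediction exactly across the input variables.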
Sustainability 2023, 15, 8868

Figure 13. (a) Variable importance plot. (b) SHAP plot for Rock type 4.

Figure 13b shows the SHAP values for Cluster 3 (Rock type 4) and different log variables; the different points represent the different observations (i.e., depths in the data set). The color in the plot represents whether the log variable has a high or low value for that observation. The X-axis shows the Shapley values; the larger the Shapley value, the greater the impact on cluster prediction. For any variable, such as RHOB, the SHAP values corresponding to the different RHOB data points range from slightly negative to larger positive values. The points with larger positive SHAP values have a strong influence on Rock type 4, and these points are associated with low (colored blue) values of features, suggesting that low RHOB values are a key characteristic of Rock type 4. Similarly, it can be determined through analysis that low GR values and high DT values are also typical features of Rock type 4. In summary, Cluster 3 (Rock type 4) is characterized by low GR values, low RHOB values, high DT values, and medium-high RXO values, which is consistent with the rocks in Cluster 3 being grainstones with low GR values, low RHOB values, high DT values, and low RT values. This method is helpful in the local interpretation of classification models. Such analysis provides a way to interpret classification results without considering model selection, and the application of SHAP values in petroleum engineering provides a method for the global and local interpretation of classification models.

5. Conclusions

This paper presents a promising and interpretable machine learning approach that can identify various types of rocks based on well log data. The purpose of this study was
to improve geological insights and the accuracy of well log interpretation through accurate
identification of rock types. The proposed method also provides valuable references for the
optimization of well trajectory and the optimal selection of intervals to be perforated. The
conclusions drawn from this study are detailed below.
(1) Based on core data and the FZI method, the Callovian–Oxfordian formation in the
study area can be divided into seven rock types.
(2) The results of this study show that the rock types in uncored wells can be accurately
classified by core data using machine learning and well log data. Accurate classifi-
cation of rocks can greatly improve the accuracy of well log interpretation and the
reliability of research results with respect to sedimentary microfacies.
(3) Four machine learning algorithms were evaluated, including KNN, GBM, random
forest, and MLP. Based on the cross-validation and evaluation results, the GBM has
been selected for the identification of rock types in the study area. The accuracy of
this algorithm for lithology identification can reach 79%.
(4) In this study, SHAP values were used to interpret “black box” (machine learning)
models, which demonstrate high robustness and practicability and provide an effec-
tive means of global and local interpretation for rock classification models based on
machine learning.
(5) The results of this study suggest that Rock type 4 (grainstones) comprises the best reservoir rocks in the study area. These rocks are characterized by high porosity, high permeability, low GR values, low RHOB values, high DT values, low RT values, and low RXO values.
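The 10-fold cross-validation protocol referenced in conclusion (3) can be sketched with the standard library alone. The 1-nearest-neighbour stand-in and the synthetic 1-D data below are illustrative placeholders, not the paper's KNN/GBM/random forest/MLP models or its well-log data; in practice the `fit`/`predict` callables would wrap those estimators:

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds;
    yields (train_idx, test_idx) pairs for cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_val_accuracy(fit, predict, X, y, k=10):
    """Mean accuracy over k folds; `fit(X, y)` returns a model object
    and `predict(model, x)` returns a predicted label."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Toy 1-nearest-neighbour "classifier" on synthetic 1-D data with two
# well-separated clusters, standing in for the candidate algorithms.
def fit_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda p: abs(p[0] - x))[1]

X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1] * 5   # 30 samples
y = [0, 0, 0, 1, 1, 1] * 5
print(round(cross_val_accuracy(fit_1nn, predict_1nn, X, y), 2))  # → 1.0
```

Repeating this loop for each candidate model and comparing the mean fold scores (accuracy here, AUC in the paper) is exactly the selection procedure that led to GBM being chosen.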