1 s2.0 S0920410522004855 Main
ARTICLE INFO

Keywords: Machine learning; Fluvial-lacustrine lithofacies; XGBoost; Resampling; Well log; Sichuan Basin

ABSTRACT

The lithofacies identification is critical for forecasting sweet spots of hydrocarbon explorations. Well logs are widely used in lithofacies identifications because they are petrophysical measurements of subsurface stratigraphy which reflect lithological successions and depositional processes. The traditional lithofacies identification from well logs is a manual work that is time-consuming and bias-prone. An automated and bias-free method is in demand. To this end, we created a lithofacies dataset of eleven wells with well log records and lithofacies descriptions that were interpreted manually based on facies analysis of drilling cutting descriptions and well logs. Then we developed machine learning models that were trained using the lithofacies dataset of the fluvial-lacustrine Upper Triassic Xujiahe and Lower Jurassic Ziliujing formations in Yuanba Area, northern Sichuan Basin of southwestern China. By employing extreme gradient boosting and resampling algorithms, this machine learning model is efficient and outperforms support vector machine and multiple-layer perceptron, as indicated by its highest accuracy and F1-score of 0.90, the highest AUC of 0.94, as well as the shortest training time. Moreover, the result suggests that resampling is necessary for lithofacies identification with the imbalanced dataset. A combined method of oversampling and undersampling is better than a single resampling method. This study presents a successful application of machine learning in fluvial-lacustrine lithofacies identification from well logs and suggests the great potentiality of machine learning in subsurface hydrocarbon explorations.
DYZ, analysis, software, writing original draft & editing; MCH, conceptualization, supervision, fund acquisition; AQC, conceptualization, data preparation, analysis, writing & editing; HTZ, data preparation, conceptualization; ZQ, data preparation; QR, data preparation; JCY, data preparation; HYW, data preparation; CM, conceptualization, writing & editing, fund acquisition. All authors read and approved the manuscript.

Lithofacies is a combination of rocks that embody abundant information from different examples under the same depositional conditions. The knowledge of lithofacies is imperative in predicting lithology distributions and alignment of stratigraphic units when only limited data are available (Allen, 1975; Miall, 1995), and this knowledge is critical to reconstructing the palaeogeography of the ancient Earth and targeting sweet spots of hydrocarbon explorations (e.g., Horne et al., 1978; Catuneanu, 2006; Nielsen and Schovsbo, 2011; Laya and Tucker, 2012;
* Corresponding author. State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation, Institute of Sedimentary Geology, Chengdu University of
Technology, Chengdu, 610051, China.
** Corresponding author. State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation, Institute of Sedimentary Geology, Chengdu University of
Technology, Chengdu, 610051, China.
E-mail addresses: [email protected] (D. Zheng), [email protected] (M. Hou), [email protected] (A. Chen), [email protected] (H. Zhong),
[email protected] (Z. Qi), [email protected] (Q. Ren), [email protected] (J. You), [email protected] (H. Wang), [email protected]
(C. Ma).
https://doi.org/10.1016/j.petrol.2022.110610
Received 30 December 2021; Received in revised form 4 May 2022; Accepted 6 May 2022
Available online 10 May 2022
0920-4105/© 2022 Published by Elsevier B.V.
D. Zheng et al. Journal of Petroleum Science and Engineering 215 (2022) 110610
Zhu et al., 2014; Chen et al., 2020; Zheng and Yang, 2020; Zheng and Wu, 2021).

Well logs are ubiquitous in subsurface exploration and they are normally continuous and sampled in uninterrupted sections. As well logs directly measure the petrophysical characteristics of subsurface rocks, they can reflect lithological, textural and structural changes, as well as stacking patterns of lithology, which are critical to understanding lithofacies (Selley, 1976; Rider, 1990; Nazeer et al., 2016). Therefore, well logs facilitate the spatial and temporal correlations of subsurface stratigraphy and are widely used in oil and gas reservoir predictions (Delfiner et al., 1987; Lim et al., 1997; Asquith and Krygowski, 2004; Tan et al., 2015; Lai et al., 2018; Zheng et al., 2021).

Despite the common use of well logs in lithofacies identifications, there are two major limitations. First, lithofacies are largely interpreted based on gamma-ray well logs, whereas the remaining well logs play only a supporting role (Allen, 1975; Rider, 1990; Cant, 1992; Asquith and Krygowski, 2004). A comprehensive interpretation using multiple well logs simultaneously is required for detailed lithofacies interpretations. However, manual interpretation struggles to handle multiple well logs at once and may neglect abundant useful information (Rider, 1990; Radwan, 2021). Second, lithofacies identifications from well logs require huge efforts from experienced interpreters, which increases the cost and reduces the efficiency. To date, deep subsurface explorations require huge volumes of geological data to reconstruct detailed paleogeographic settings (Wang et al., 2021). A fast and efficient interpretation method is necessary, and machine learning is an optimal solution that can help researchers extract useful information and gain new insights from the rapidly growing datasets (Jordan and Mitchell, 2015).

Research on applications of machine learning algorithms in the lithofacies identification from well logs has been widely conducted in the past three decades. These methods include multi-dimensional analyses, support vector machine, k-nearest neighbors, and artificial neural networks and their variants (e.g., Baldwin et al., 1990; Rogers et al., 1992; Bhatt and Helle, 2002; Dubois et al., 2007; Hall, 2016; Bestagini et al., 2017; Al-Mudhafar, 2017; Bize-Forest et al., 2018). Recent attempts of facies recognition compared the capability of artificial neural networks, support vector machine, and random forest (Deng et al., 2019), and used the Brier score to estimate the performance of machine learning models (Feng, 2021). However, these applications achieved an overall limited accuracy because they are incapable of solving generalized problems, and they demand extensive hyper-parameter tuning and high computation costs (Jordan and Mitchell, 2015). Additionally, the imbalanced dataset of lithofacies reduces the accuracy of the machine learning algorithms as well (Chawla, 2009; Longadge et al., 2013). Although the imbalanced dataset exists in real exploration projects, the potential influences of the imbalanced dataset and the relevant solutions were not discussed in previous publications.

To offer a more accurate and efficient machine learning application that can work on projects with imbalanced lithofacies, we collected a dataset of eleven wells with a total thickness of 13,800 m from the Upper Triassic Xujiahe and Lower Jurassic Ziliujing formations in Sichuan Basin. Detailed lithofacies descriptions were interpreted manually from facies analysis of drilling cutting descriptions and electrofacies of well logs. Based on this dataset, we then compared support vector machine (SVM), multiple-layer perceptron (MLP), and extreme gradient boosting (XGBoost) classifiers with over- and under-sampling algorithms. The XGBoost classifier with the combined over- and under-sampling algorithms obtained the best performance. The machine learning model in this study overcomes the ubiquitous problems of previous methods for lithofacies identifications, such as low prediction accuracy and the inability to handle the imbalanced dataset from real subsurface projects. Our results indicate that machine learning algorithms can provide reliable, efficient, and bias-free lithofacies identifications, which have great potential to facilitate hydrocarbon explorations.

2. Data

2.1. Geological background of the areas of the studied wells

The selected intervals of well logs are from the Upper Triassic Xujiahe to Lower Jurassic Ziliujing formations in Yuanba Area of Sichuan Basin. The Sichuan Basin, with an area of 180,000 km², is one of the largest petroliferous basins in China. The Sichuan Basin is flanked by the Longmen Mountains to the west, the Qinling Orogenic Belt to the north, the Xuefeng Mountains to the east, and Kangdian High Land to the south (Fig. 1; Meng et al., 2005). Sichuan Basin experienced three major stages of tectonic evolution and was a foreland basin during Late Triassic to Late Cretaceous (Liu et al., 2018).

Yuanba Area is in the northern part of the Sichuan Basin and is a medium-giant gas and oil field, in which the Xujiahe and Ziliujing formations were deposited in a fluvial-lacustrine system and are target intervals of tight sand gas with more than 1000 × 10^8 m³ gas reserves (Fig. 1; Ma et al., 2010; Zheng et al., 2011; Guo et al., 2013). Five members (T3x1-T3x5 from bottom to top) are subdivided from the Xujiahe Formation. The T3x1, T3x3, and T3x5 are siltstone, siltstone with interbedded fine-grained sandstone, mudstone, and mudstone with sandy or silty interbeds; the T3x2 and T3x4 are lithic arkose and feldspathic litharenites (Zhang et al., 2016; Li and He, 2014). The Ziliujing Formation is subdivided into Zhenzhuchong, Dongyuemiao, Ma’anshan, and Da’anzhai members from bottom to top (Li and He, 2014). Coarse-grained rocks mainly occur in the Zhenzhuchong member; fine-grained rocks occur in the Dongyuemiao, Ma’anshan, and Da’anzhai members (Li and He, 2014).

2.2. Selected well log types and well log preprocessing

In this study, eleven wells with a total thickness of over 13,800 m and complete Xujiahe and Ziliujing formations from Yuanba Area were studied (Fig. 1). Eight types of well logs are selected in this study, including caliper well log (CAL), gamma-ray well log (GR), gamma-ray without uranium well log (KTH), deep-investigation double lateral resistivity log (RD), shallow-investigation double lateral resistivity log (RS), compensated neutron log (CNL), density log (DEN), and acoustic log (AC). Data preprocessing was performed before lithofacies identification to avoid influences of depth offsets, malfunction of well log detectors, and differences in value ranges of various well log types. The well logs were sampled every 0.125 m, and a total of 109,894 valid records of well log values were obtained after data preprocessing (Fig. 2; Appendix).

The data preprocessing procedures include:

(1) Depth calibration. The depth calibration of well logs is required to obtain accurate lithofacies interpretations because well logs and cores/cuttings usually have a depth offset. As mudstone/shale has higher GR values than sandstones/conglomerates, the GR log was used to calibrate the depth by moving the well logs to match the intervals of marker beds.
(2) Removal of invalid values. The raw well log data contain placeholder values, such as −999, −9999, or 0. These values cannot reflect the real conditions of the subsurface rock formations; instead, they are likely caused by the malfunction of well log detectors. Therefore, these invalid values were removed.
(3) Data standardization. To avoid the influences caused by tremendous differences in value ranges of well logs, the raw dataset was standardized before the training of machine learning models. The procedure follows:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1a)$$
Fig. 1. Geological maps. a) sketch map of China. Yellow highlighted area is Sichuan Basin; b) Tectonic divisions of Sichuan Basin (adapted from Ma, 2008). The red
square is the Yuanba Area that is shown in Fig. 1c; c) Location map of the studied wells in Yuanba Area. Blue dots are wells with interpreted lithofacies that were used
for the implementation of machine learning models. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of
this article.)
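The invalid-value removal and standardization steps of Section 2.2 (Eqs. 1a to 1c) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: rows are depth samples and columns are log types, a whole depth sample is dropped when any of its logs carries a placeholder value (the text does not say whether removal was per sample or per log), and the function names and toy values are hypothetical.

```python
import numpy as np

# Placeholder codes named in the text; treating 0 as invalid for every log is an assumption.
INVALID_VALUES = (-999.0, -9999.0, 0.0)

def remove_invalid(logs: np.ndarray) -> np.ndarray:
    """Drop every depth sample (row) that contains an invalid log value."""
    bad_rows = np.isin(logs, INVALID_VALUES).any(axis=1)
    return logs[~bad_rows]

def standardize(logs: np.ndarray) -> np.ndarray:
    """Column-wise z-score using the sample standard deviation (n - 1), as in Eq. (1b)."""
    mu = logs.mean(axis=0)
    sigma = logs.std(axis=0, ddof=1)
    return (logs - mu) / sigma

# Toy data: three hypothetical log columns and five depth samples.
raw = np.array([
    [85.0, 2.55, 62.0],
    [-999.0, 2.60, 60.0],   # failed reading in the first log
    [120.0, 2.48, 70.0],
    [60.0, 0.0, 58.0],      # zero value in the second log
    [95.0, 2.52, 66.0],
])
clean = remove_invalid(raw)
scaled = standardize(clean)
```

After standardization each log column has zero mean and unit sample standard deviation, so logs with very different value ranges contribute on a comparable scale.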
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mu\right)^2} \qquad (1b)$$

$$x_{i,\mathrm{scaled}} = \frac{x_i - \mu}{\sigma} \qquad (1c)$$

where n is the sample number (109,894 in this study), μ is the average, σ is the standard deviation, and x_{i,scaled} is the standardized well log value of the ith sample.

2.3. The interpreted lithofacies

Lithofacies of eleven wells were interpreted based on facies analysis of drilling cutting/core description against electrofacies of well logs. The Xujiahe and Ziliujing formations were deposited in braided and meandering river, fan and fluvial delta, and lacustrine depositional environments (Zheng et al., 2011). In this study, we further classified the Xujiahe and Ziliujing formations as nine major lithofacies based on cutting descriptions and well log interpretations. These lithofacies were interpreted based on lithologies, sedimentary textures and structures, and stacking patterns of the lithology from cutting descriptions, and also interpreted in terms of the shapes of log curves where no cuttings are available (Figs. 3 and 4; Table 1). In this study, delta plains, delta front, and prodelta were not differentiated. Instead, the interdistributary channels were incorporated into the channel subfacies due to their similar lithological and log curve features; moreover, the remaining parts of the deltas were merged into the mouth bar subfacies because they share the coarsening upward feature of lithology. The nine subfacies used in this study are:

Braided channel conglomerate and sandstone lithofacies (BCCS): consists of conglomerate, conglomeratic sandstone, sandstone, and trace amounts of mudstone. The log feature is the cylindrical shape (Cant, 1992; Fig. 4a), suggesting low mud content and no clear trend of grain size changes. Labels of braided streams account for 11% of the total lithofacies (Fig. 5).

Meandering channel sandstone lithofacies (MCS): consists of sandstone, mudstone, and trace amounts of conglomerate and conglomeratic sandstone. The log feature of MCS is similar to that of BCCS, which is represented by its cylindrical feature (Fig. 4a). To distinguish MCS from BCCS, intervals of fluvial channel lithofacies with more mud contents were classified as MCS, while intervals with less mud contents were classified as BCCS. Labels of MCS account for 18% of the total lithofacies.

Longitudinal/transverse bar sandstone lithofacies (LTBS): this type of lithofacies lies on the upper part of the braided stream channels, which is typified by its cross-bedded sandstones, conglomeratic sandstones, and sandy mudstones. The log curve is bell-shaped with an overall fining upward feature (Fig. 4b). Labels of these lithofacies account for 15% of the total lithofacies.

Point bar sandstone and mudstone lithofacies (PBSM): this type of lithofacies lies on the upper part of the meandering stream channels with typical cross-bedded sandstone and mudstones. The log curves of these lithofacies are also bell-shaped (Fig. 4b). Compared to LTBS, the particle sizes of point bars are finer and the gamma-ray values are accordingly higher. Labels of PBSM account for 13% of the total lithofacies.

Alluvial plain sandstone and mudstone lithofacies (APSM): to distinguish the river plain lithofacies of braided streams from those of meandering streams, the alluvial plain and floodplain lithofacies are further classified. The APSM consist of sandstones, sandy mudstones, mudstone, and coal beds, with irregular log curves of high gamma values (Fig. 4c), indicating high contents of muddy components with no clear changes in mud contents. Labels of APSM account for 8% of the total lithofacies.

Floodplain mudstone lithofacies (FPM): the FPM consist of sandy mudstones, mudstones, and coal beds and are characterized by irregular log curves. Compared with APSM, FPM contains more muddy components with higher values of the gamma-ray log. Labels of FPM account for 17% of the total lithofacies.

Crevasse splay sandstone and mudstone lithofacies (CSSM): the lithology of crevasse splay includes sandstone, siltstone, and mudstone, with funnel-shaped log curves (Fig. 4d), suggesting a coarsening upward trend. Labels of CSSM account for 1% of the total lithofacies.

Mouth bar sandstone and mudstone lithofacies (MBSM): mouth bar is developed where the mouth of a river meets the standing body of water. The main lithology includes sandstones, siltstones, and
mudstones. The log curve is funnel-shaped, indicating an overall coarsening upward trend (Fig. 4d). Labels of MBSM account for 8% of the total lithofacies.

Shallow lake sandstone and mudstone lithofacies (SLSM): includes lithofacies from littoral to sublittoral zones of the lacustrine environments. The main lithology includes sandstones, sandy mudstones, and mudstone. The log curve is irregular (Fig. 4e), indicating no clear grain size changes. Labels of SLSM account for 9% of the total lithofacies.

3. Methodology

3.1. The implementation of machine learning models

The machine learning in this study is a task of supervised learning that uses the interpreted dataset to make future predictions. In this study, we selected eight wells for training, one well for validation, and two wells for testing.

The functions of these three datasets are different. The training dataset is used to create the machine learning model, the validation dataset is used to tune the hyper-parameters, and the test dataset is used to evaluate the model’s accuracy (Goodfellow et al., 2016). This study adopted three classifiers, including support vector machine (SVM), multiple-layer perceptron (MLP), and extreme gradient boosting (XGBoost), to make lithofacies predictions. The grid search method was applied to find the best hyper-parameters for each model (see Table 2 for the tuned hyper-parameters). As the XGBoost outperforms SVM and MLP classifiers (see Results), the SVM and MLP classifiers are not explained further in this study. Details can be viewed from Vapnik (1998) and Hinton and Osindero,.
Fig. 2. Crossplot of well logs with labels of lithofacies. See text for abbreviations of well logs and lithofacies.
Fig. 3. Well log curves and interpreted lithofacies. (a) One selected well with eight well log curves, lithology, and interpreted lithofacies. Regarding the legends of
lithology, yellow is mudstone, green is sandstone, red is conglomerate. Regarding the lithofacies, 0-BCCS; 1-MCS; 2-LTBS; 3-PBSM; 4-APSM; 5-FPM; 6-CSSM; 7-
MBSM; 8-SLSM. (b) The interpreted lithofacies for the rest ten wells. (For interpretation of the references to color in this figure legend, the reader is referred to the
Web version of this article.)
The XGBoost is an ensemble machine learning algorithm with a parallel tree boosting (GBDT) that can deal with big datasets in an efficient way (Chen and Guestrin, 2016). XGBoost combines the adaptive boosting approach with the efficient optimization method; consequently, XGBoost runs more than ten times faster than other existing popular algorithms on a single machine and can obtain optimal results with little effort in hyper-parameter tuning (Chen and Guestrin, 2016). The XGBoost classifier is a supervised learning model that consists of an ensemble of decision trees. Suppose dataset D contains n examples with m features, D = {(x_i, y_i), x_i ∈ R^m, y_i ∈ R}; the output of an ensemble of K trees is

$$f(x) = w_{q(x)} \qquad (2)$$

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \qquad (3)$$

where x is the input well log dataset, y is the lithofacies, and w_q is the score of the associated leaf q.

The objective function to be optimized is

$$J^{(t)} = \sum_{i=1}^{n} L\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{i=1}^{t} \Omega(f_i) \qquad (4)$$

where the first term, L, is the loss function that measures the deviations between the prediction ŷ and the real y values. In this study, the multi-class log loss function is selected as the loss function and is defined as:

$$L = -\frac{1}{N}\sum_{i}^{N}\left(\sum_{j}^{M} y_{ij}\,\ln\left(p_{ij}\right)\right) \qquad (5)$$

where N is the number of samples, M is the number of classes, y_{ij} is the label of the ith sample for the jth class, which is either 0 or 1, and p_{ij} is the classification probability predicted by the classifier for the ith sample with the jth class.

The second term is the regularization function that controls the complexity of the model and avoids overfitting. In XGBoost, the regularization function is defined as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^{2} \qquad (6)$$

where T is the number of leaves, γ is the pseudo-regularization hyperparameter, λ is the L2-norm regularization coefficient, and w is the leaf weight.

3.2. Data resampling, validation, and model evaluation
Fig. 4. Typical well log shapes of lithofacies in a fluvial-lacustrine system, including a) cylindrical, b) bell, c) irregular, d) funnel shape, and e) symmetrical shapes.
Adapted from Cant (1992).
Table 1
Typical characteristics of the studied lithofacies.

| Lithofacies | Lithology | Sedimentary textures and structures | Textural stacking pattern | Well log shapes |
|---|---|---|---|---|
| Braided channel conglomerate and sandstone lithofacies (BCCS) | Conglomerate, conglomeratic sandstone, sandstone | Low-relief erosion surface, imbricated structure, cross bedding | Fining-upward | Low gamma, high porosity, cylindrical |
| Transverse/longitudinal bar sandstone lithofacies (LTBS) | Sandstone dominant, conglomerate | Cross bedding, gravel sheet, imbricated structure, parallel bedding, sand bedding, rhombohedral gravel mat body, reacting surface structure | Fining-upward | Low gamma, high porosity, bell |
| Meandering channel sandstone lithofacies (MCS) | Sandstone, conglomerate | High-relief erosion surface, imbricated structure, cross bedding | Fining-upward | Low gamma, high porosity, cylindrical |
| Point bar sandstone and mudstone lithofacies (PBSM) | Sandstone, siltstone, mudstone | Scour surface, lateral accretion structure, cross bedding, parallel bedding | Fining-upward | Low gamma, high porosity, bell |
| Floodplain mudstone lithofacies (FPM) | Mudstone, siltstone, carboniferous mudstone, coal | Horizontal bedding, mud crack, ripple mark | No trend | High gamma, high porosity, irregular |
| Alluvial plain sandstone and mudstone lithofacies (APSM) | Sandstone, mudstone, siltstone, carboniferous mudstone, coal | Ripple mark, exposure marks, horizontal bedding, mud crack | No trend | High gamma, high porosity, irregular |
| Crevasse splay sandstone and mudstone lithofacies (CSSM) | Siltstone, fine-grained sandstone | Ripple lamination, scour surface | Coarsening-upward | High gamma, high porosity, funnel |
| Mouth bar sandstone and mudstone lithofacies (MBSM) | Sandstone, siltstone, mudstone | Climbing ripple, small-scale festoon-type cross bedding | Coarsening-upward | High gamma, high porosity, funnel |
| Shallow lake sandstone and mudstone lithofacies (SLSM) | Sandstone, siltstone, mudstone | Bi-directional cross bedding, horizontal bedding, ripple lamination, ripple mark | Coarsening-upward then fining-upward | High gamma, high porosity, symmetrical |
Data resampling addresses the imbalanced raw dataset by creating a balanced dataset. The oversampling methods create synthetic samples to increase the proportions of the rare samples, whereas the undersampling methods remove samples to decrease the proportions of the abundant samples. The Synthetic Minority Oversampling Technique (SMOTE; Chawla et al., 2002) and Neighborhood Cleaning Rule (NCR; Laurikkala, 2001) are the selected oversampling and undersampling methods. Additionally, a combined method of SMOTE and NCR was also performed in this study. The performance of these three data resampling methods was compared; the best method was suggested in Section 5.1.

To create the most accurate machine learning models and protect against overfitting, the dataset was split into training, validation, and test datasets. For comparative analysis of the performances of models, the accuracy, F1-score, and area under the curve (AUC) were selected to evaluate the model performance. The accuracy is defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

F1-score is defined as:

$$F1 = \frac{2TP}{2TP + FP + FN} \qquad (8)$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. Both accuracy and F1 score are values between 0 and 1; the higher the values are, the better the model is. To visualize the predicted results, the normalized confusion matrix was used (Fig. 6a).

AUC measures the area under the Receiver Operating Characteristics (ROC) curve, in which the x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR; Fig. 6b). The inflection point of the ROC curve of a perfect classifier would fall into the top-left corner of the ROC graph with the TPR of 1 and FPR of 0. The TPR and FPR are defined as:

$$TPR = \frac{TP}{TP + FN} \qquad (9a)$$

$$FPR = \frac{FP}{TN + FP} \qquad (9b)$$

Then AUC is defined as:

$$AUC = \int_{0}^{1} TPR(x)\,dx \qquad (9c)$$

where x is the FPR.

Table 2
Hyper-parameters for grid search.

| Hyper-parameter | MLP | SVM | XGBoost |
|---|---|---|---|
| Hidden layer numbers | 2, 3, 4, 5, 6, 7, 8 | | |
| Neuron numbers | 10, 20, 50, 100 | | |
| Learning rate | 0.001, 0.01, 0.03, 0.10, 0.30 | | 0.01, 0.03, 0.10, 0.30 |
| Kernel | | Linear, poly, sigmoid | |
| Degree | | 3, 5, 7, 9 | |
| Tree depth | | | 3, 4, 5, 8, 10, 12, 15, 17, 20 |

The optimum hyper-parameters were selected for machine learning training and bolded in this table.

4. Results

The result of each model was obtained from a fine-tuned optimal model using the grid search method (Table 2). Additionally, each model adopted data resampling methods to increase its performance. For comparative analysis, accuracy, F1-score, and AUC were used to evaluate the model’s performance.

4.1. SVM performance

The optimal performance of the SVM classifier was obtained from the model with a seven-degree polynomial kernel function. Overall, the SVM failed to identify lithofacies. The SVM classifier obtained the accuracy, F1-score, and AUC of 0.41, 0.37, and 0.61 on the training dataset, and accuracy, F1-score, and AUC of 0.41, 0.37, and 0.61 on the test dataset (Table 3; Figs. 7 and 9). The limited variations between the performances on the training and test datasets suggest that the SVM classifier was unable to extract the complicated relationships in the dataset of well logs and lithofacies. In the normalized confusion matrix of the test dataset, the column of the APSM showed dark blue color, suggesting that most lithofacies were mislabeled as APSM (Fig. 8a). Moreover, the training process of the SVM classifier took 359.50 s (Table 3).

The data resampling methods improved the performance of the SVM classifier. The SVM classifier with the SMOTE (oversampling) reprocessed dataset increased its accuracy on the test dataset from 0.41 to 0.42, F1-score from 0.37 to 0.40, and AUC from 0.61 to 0.63. The NCR
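Table 3
Evaluation metrics and training time of SVM.

| Dataset | Training Acc | Training F1 | Training AUC | Validation Acc | Validation F1 | Validation AUC | Test Acc | Test F1 | Test AUC | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0.41 | 0.37 | 0.61 | 0.41 | 0.37 | 0.72 | 0.41 | 0.37 | 0.61 | |
| SMOTE | 0.43 | 0.41 | 0.62 | 0.43 | 0.41 | 0.63 | 0.42 | 0.40 | 0.63 | |
| NCR | 0.41 | 0.40 | 0.63 | 0.41 | 0.39 | 0.64 | 0.41 | 0.39 | 0.64 | |
| SMOTE + NCR | 0.45 | 0.41 | 0.74 | 0.44 | 0.40 | 0.74 | 0.42 | 0.44 | 0.75 | |

Acc is accuracy; F1 is F1-score; Time is the training time of machine learning models.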
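As a worked illustration of how the Acc, F1, and AUC columns relate to Eqs. (7) to (9c), the sketch below evaluates the formulas from confusion counts and integrates a ROC curve with the trapezoidal rule. The counts and ROC points are made-up toy numbers, not values from this study, and the function names are hypothetical. For the nine-class problem the counts would be taken per class (one-vs-rest); how the per-class scores were averaged is not stated in the text.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Eq. (7): fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Eq. (8): harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def true_positive_rate(tp, fn):
    """Eq. (9a)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Eq. (9b)."""
    return fp / (tn + fp)

def auc_trapezoid(fpr_points, tpr_points):
    """Eq. (9c) approximated with the trapezoidal rule over ROC points sorted by FPR."""
    x = np.asarray(fpr_points, dtype=float)
    y = np.asarray(tpr_points, dtype=float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Toy confusion counts for a single class treated as "positive".
tp, tn, fp, fn = 80, 90, 10, 20
acc = accuracy(tp, tn, fp, fn)      # 170 / 200
f1 = f1_score(tp, fp, fn)           # 160 / 190
tpr = true_positive_rate(tp, fn)    # 80 / 100
fpr = false_positive_rate(fp, tn)   # 10 / 100

# A three-point ROC curve through the operating point above.
auc = auc_trapezoid([0.0, fpr, 1.0], [0.0, tpr, 1.0])
```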
Fig. 8. Normalized confusion matrixes of SVM model performances with a) the original dataset and b) the SMOTE and NCR resampled dataset; normalized confusion
matrixes of MLP model performances with c) the original dataset and d) the SMOTE and NCR resampled dataset; Normalized confusion matrixes of XGBoost model
performances with e) the original dataset and f) the SMOTE and NCR resampled dataset. The scores were achieved on the test dataset; see text for abbreviations.
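A normalized confusion matrix of the kind shown in Fig. 8 can be computed as sketched below. The sketch assumes rows are true lithofacies and columns are predicted lithofacies, with each row divided by its total; the normalization axis used for the published figure is not stated, so this is an assumption, and the function name and toy labels are hypothetical.

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the share of true class i predicted as j."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=float)
    # Count each (true, predicted) pair; add.at handles repeated index pairs correctly.
    np.add.at(cm, (y_true, y_pred), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Leave rows of classes that never occur as zeros instead of dividing by zero.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# Toy example with three classes.
cm_norm = normalized_confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 0], n_classes=3)
```

Under this convention a dark column, such as the APSM column described for the SVM, means that samples of many true classes were assigned to that one predicted class.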
parts of the over-sampled classes. In addition to the single resample method, the combined method using SMOTE and NCR was the most effective in improving models’ performances. All machine learning models with the combined method achieved the best performances. The improvements were distinctive, especially for XGBoost, as suggested by
the accuracy and F1-score were both 0.90. Therefore, resampling was significant before training the machine learning model when the original dataset was imbalanced, and the combined method of oversampling and undersampling was better than a single method.
In this study, the XGBoost outperformed the SVM and MLP models, as indicated by its higher accuracy and lower computation demands. The XGBoost had an accuracy of 0.80 with the raw dataset and achieved its highest accuracy of 0.90 with the resampled dataset, indicating that 90% of the lithofacies were successfully identified by the XGBoost classifier. By contrast, SVM and MLP classifiers failed to provide accurate identifications. The SVM had an accuracy of approximately 0.40 using both raw and resampled datasets, suggesting that the SVM classifier was unable to extract the relationships between well logs and lithofacies. The MLP classifier had better performance than the SVM classifier, but the accuracy, F1-score, and AUC of the best-trained MLP were still lower than those of the XGBoost model. Theoretically, the MLP classifier should improve its performance by increasing hidden layers. However, a deep neural network with excessive hidden layers may overfit the training dataset or fail to converge due to the vanishing-gradient problem (Hanin, 2018). In this study, numbers of hidden layers from two to eight were investigated; the grid search results indicated that the best number of hidden layers was five and that extra hidden layers were incapable of improving the model’s performance. In addition to their lower accuracy, the SVM and MLP require greater computation costs. Both SVM and MLP classifiers required more than twice the training time of the XGBoost classifier (Tables 3–5). The XGBoost is a scalable tree algorithm with improved parallel computation and gradient converging. Therefore, XGBoost was the optimal algorithm in the lithofacies identification in this study.

Though the XGBoost classifier has an overall good performance on

Fig. 10. Loss curves of XGBoost model with the original and resampled dataset. The loss used is the log loss from a Python library, scikit-learn (Pedregosa, 2011).
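Table 4
Evaluation metrics and training time of MLP.

| Dataset | Training Acc | Training F1 | Training AUC | Validation Acc | Validation F1 | Validation AUC | Test Acc | Test F1 | Test AUC | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0.78 | 0.78 | 0.89 | 0.79 | 0.78 | 0.89 | 0.71 | 0.71 | 0.85 | |
| SMOTE | 0.87 | 0.87 | 0.94 | 0.87 | 0.87 | 0.94 | 0.80 | 0.80 | 0.91 | |
| NCR | 0.83 | 0.83 | 0.86 | 0.83 | 0.83 | 0.87 | 0.79 | 0.78 | 0.83 | |
| SMOTE + NCR | 0.86 | 0.86 | 0.90 | 0.86 | 0.86 | 0.89 | 0.82 | 0.82 | 0.87 | |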
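Table 5
Evaluation metrics and training time of XGBoost.

| Dataset | Training Acc | Training F1 | Training AUC | Validation Acc | Validation F1 | Validation AUC | Test Acc | Test F1 | Test AUC | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0.96 | 0.96 | 0.99 | 0.81 | 0.81 | 0.92 | 0.80 | 0.79 | 0.91 | |
| SMOTE | 0.98 | 0.98 | 0.99 | 0.88 | 0.88 | 0.92 | 0.87 | 0.87 | 0.92 | |
| NCR | 0.97 | 0.97 | 0.99 | 0.86 | 0.86 | 0.94 | 0.85 | 0.85 | 0.94 | |
| SMOTE + NCR | 0.98 | 0.98 | 0.99 | 0.90 | 0.90 | 0.93 | 0.90 | 0.90 | 0.94 | |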
manually checked based on domain knowledge. To address the challenges posed by the tremendous heterogeneity of lithofacies and well logs, manual intervention informed by knowledge of the geological background is likely one of the most reliable and efficient ways to make accurate lithofacies predictions in current circumstances.
Successful applications of machine learning in lithofacies identification are significant in subsurface exploration and palaeogeographic reconstruction. Subsurface exploration provides important natural resources for society. Palaeogeographic reconstructions are critical to understanding Earth’s evolution and can provide insights into fields such as paleoclimatology, plate tectonics, and geodynamics (Cao et al., 2017; Wang et al., 2021). Lithofacies identification is a fundamental procedure in both subsurface exploration and palaeogeographic reconstruction. By incorporating datasets from longer periods and larger spaces, the machine learning model can be used to understand basinal palaeogeographic evolution, locate target intervals in petroleum systems, refine palaeogeographic maps (e.g., Golonka, 2007; Scotese, 2021), or even provide insights into the palaeogeographic reconstruction of the Earth.
6. Conclusions
To investigate the feasibility of machine learning in lithofacies identification from well logs, three machine learning algorithms and three resampling methods were adopted in this study. The machine learning models were trained using the dataset of well logs and interpreted lithofacies of the Upper Triassic Xujiahe and Lower Jurassic Ziliujing formations in the northern part of Sichuan Basin, southwestern China.
The results of this study indicate that the XGBoost algorithm provides accurate lithofacies identifications and outperforms the SVM and MLP methods. The XGBoost model achieved both accuracy and F1-score of 0.80 and AUC of 0.91, indicating that 80% of the total lithofacies were predicted correctly. By contrast, the SVM and MLP classifiers failed to provide accurate identifications. The accuracies of the SVM and MLP models are 0.41 and 0.71; the F1-scores are 0.37 and 0.71; and the AUCs are 0.61 and 0.80. Moreover, XGBoost required only half of the training time of the SVM and MLP classifiers.
Additionally, resampling methods are effective in improving the performance of machine learning models, as suggested by the improvements in accuracies and F1-scores. The original dataset of well logs and interpreted lithofacies was resampled using SMOTE, NCR, and the combined method of SMOTE and NCR. All three resampling methods enhanced the accuracies, F1-scores, and AUC, and the improvements using the combined method were the greatest. With the dataset resampled using the combined method, the accuracy and F1-score of the XGBoost model increased from 0.80 to 0.90; the accuracies and F1-scores of the SVM and MLP models also increased by about 0.1.
This study indicates that machine learning is a reliable and efficient method to identify fluvial-lacustrine lithofacies in Sichuan Basin. Suggested future work includes the adoption of suitable autoencoders that can detect both low-frequency and high-frequency responses of well logs to lithofacies changes and the incorporation of more types of subsurface data. A promising direction is a machine learning model that provides accurate lithofacies identification from basin scale to global scale, offers insights into the paleogeographic conditions of the Earth, and helps forecast the sweet spots of hydrocarbon exploration.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank the Deep-time Digital Earth (DDE) program for supporting our project. We thank Sinopec Petroleum Exploration and Production Research Institute for providing the well data. We thank Xinbing Wang and Jie Ouyang of Shanghai Jiaotong University, and Youyuan Que of Chengdu University of Technology for their help. We thank four anonymous reviewers for their constructive comments. This work was financially supported by the National Natural Science Foundation of China (Grant Nos. 42050104, 42050102, and 41888101), the Everest Scientific Research Program of Chengdu University of Technology (Grant No. 2020ZF11402), and the Open Fund (PLC20211102) of the State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation of Chengdu University of Technology.
Appendix
Depth CAL GR KTH RS RD CNL DEN AC LF Pred_LF
References
Al-Mudhafar, W.J., 2017. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Pet. Explor. Prod. Technol. 7, 1023–1033.
Allen, D.R., 1975. Identification of sediments-their depositional environments and degree of compaction from well logs. In: Chilingarian, George V., Karl, H.W. (Eds.), Compaction of Coarse-Grained Sediments, Developments in Sedimentology. Elsevier, New York, pp. 349–402.
Asquith, G., Krygowski, D., 2004. AAPG Memoirs Basic Well Log Analysis (AAPG Special vol. s).
Baldwin, J.L., Bateman, R.M., Wheatley, C.L., 1990. Application of a neural network to the problem of mineral identification from well logs. In: The Log Analyst.
Zhu, Hongtao, et al., 2014. Three-dimensional facies architecture analysis using sequence stratigraphy and seismic sedimentology: Example from the Paleogene Dongying Formation in the BZ3-1 block of the Bozhong Sag, Bohai Bay Basin, China. Mar. Petrol. Geol. 51, 20–33.
Bhatt, A., Helle, H.B., n.d. Determination of Facies from Well Logs Using Modular Neural Networks.
Bestagini, Paolo, et al., 2017. A machine learning approach to facies classification using well logs. SEG Technical Program Expanded Abstracts 2137–2142.
Bize-Forest, N., Lima, L., Baines, V., Boyd, A., Abbots, F., Barnett, A., 2018. Using Machine-Learning for Depositional Facies Prediction in a Complex Carbonate Reservoir.
Cant, D.J., 1992. Subsurface facies analysis. In: James, R.G.W., N.P (Eds.), Facies Models: Response to Sea Level Changes. Geological Association, pp. 27–45.
Cao, W., Zahirovic, S., Flament, N., Williams, S., Golonka, J., Müller, R.D., 2017. Improving Global Paleogeography since the Late Paleozoic Using Paleobiology, pp. 5425–5439.
Catuneanu, O., 2006. Principles of Sequence Stratigraphy. Elsevier.
Chawla, N.V., 2009. Data mining for imbalanced datasets: an overview. In: Data Mining and Knowledge Discovery Handbook, pp. 875–886.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 785–794.
Chen, A., Zou, H., Ogg, J.G., Yang, S., Hou, M., Jiang, X., Xu, S., Zhang, X., 2020. Source-to-sink of Late Carboniferous Ordos Basin: constraints on crustal accretion margins converting to orogenic belts bounding the North China Block. Geosci. Front. 11, 2031–2052.
Delfiner, Pierre, Peyret, Olivier, Serra, Oberto, 1987. Automatic determination of lithology from well logs. SPE Format. Eval. 2 (03), 303–310.
Deng, T., Xu, C., Jobe, D., Xu, R., 2019. A comparative study of three supervised machine-learning algorithms for classifying carbonate vuggy facies in the Kansas Arbuckle formation. J. Form. Eval. Reserv. Descr. 60, 838–853.
Dubois, Martin, 2007. Comparison of four approaches to a rock facies classification problem. Comput. Geosci. 33 (5), 599–617.
Feng, R., 2021. Improving uncertainty analysis in well log classification by machine learning with a scaling algorithm. J. Petrol. Sci. Eng. 196.
Golonka, J., 2007. Late Triassic and early Jurassic palaeogeography of the world. Palaeogeogr. Palaeoclimatol. Palaeoecol. 244, 297–307.
Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, 2016. Deep learning. MIT press.
Guo, Tonglou, 2013. Evaluation of highly thermally mature shale-gas reservoirs in complex structural parts of the Sichuan Basin. J. Earth Sci. 24 (6), 863–873.
Hall, B., 2016. Facies Classification Using Machine Learning. Lead. Edge.
Hinton, G.E., Osindero, S., Teh, Y.-W., n.d. A Fast Learning Algorithm for Deep Belief Nets.
Hanin, B., 2018. Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in Neural Information Processing Systems, pp. 580–589.
Horne, J.C., Ferme, J.C., Caroccio, F.T., Baganz, B.P., 1978. Depositional models in coal exploration and mine planning in Appalachian regions. Am. Assoc. Petrol. Geol. Bull. 62, 2379–2411.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Science (80).
Lai, J., Wang, G., Wang, S., Cao, J., Li, M., Pang, X., Zhou, Z., Fan, X., Dai, Q., Yang, L., 2018. Review of diagenetic facies in tight sandstones: diagenesis, diagenetic minerals, and prediction via well logs. Earth Sci. Rev. 185, 234–258.
Laurikkala, J., 2001. Improving identification of difficult small classes by balancing class distribution. In: Conference on Artificial Intelligence in Medicine in Europe. Springer, Berlin, Heidelberg, pp. 63–66.
Laya, Juan, Tucker, Maurice, 2012. Facies analysis and depositional environments of Permian carbonates of the Venezuelan Andes: Palaeogeographic implications for Northern Gondwana. Palaeogeogr. Palaeoclimatol. Palaeoecol. 331, 1–26.
Li, Y., He, D., 2014. Evolution of tectonic-depositional environment and prototype basins of the Early Jurassic in Sichuan Basin and adjacent areas. Acta Pet. Sin. 35, 219–232.
Lim, J.-S., Kang, J.M., Kim, J., 1997. Multivariate statistical analysis for automatic electrofacies determination from well log measurements. In: All Days. SPE.
Liu, S., Deng, B., Jansa, L., Li, Z., Sun, W., Wang, G., Luo, Z., Yong, Z., 2018. Multi-stage basin development and hydrocarbon accumulations: a review of the Sichuan Basin at eastern margin of the Tibetan Plateau. J. Earth Sci. 29, 307–325.
Longadge, M.R., Snehlata, M., Dongre, S., Latesh Malik, D., 2013. Class Imbalance Problem in Data Mining: Review. International Journal of Computer Science and Network.
Ma, Yongsheng, et al., 2008. Petroleum geology of the Puguang sour gas field in the Sichuan Basin, SW China. Mar. Petrol. Geol. 25 (4–5), 357–370.
Ma, Y.S., Cai, X.Y., Zhao, P.R., Luo, Y., Zhang, X.F., 2010. Distribution and further exploration of the large-medium sized gas fields in Sichuan Basin. Acta Pet. Sin. 31, 347–354.
Meng, Q.R., Wang, E., Hu, J.-M., 2005. Mesozoic sedimentary evolution of the northwest Sichuan basin: implication for continued clockwise rotation of the South China block. Geol. Soc. Am. Bull. 117, 396–410.
Miall, A.D., 1995. Whither stratigraphy? Sediment. Geol. 100, 5–20.
Nazeer, Adeel, et al., 2016. Sedimentary facies interpretation of Gamma Ray (GR) log as basic well logs in Central and Lower Indus Basin of Pakistan. Geodesy Geodyn. 7 (6), 432–443.
Nielsen, Arne, Schovsbo, Niels, 2011. The Lower Cambrian of Scandinavia: Depositional environment, sequence stratigraphy and palaeogeography. Earth Sci. Rev. 107 (3–4), 207–310.
Pedregosa, F., et al., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Radwan, A.E., 2021. Modeling the depositional environment of the sandstone reservoir in the Middle Miocene Sidri member, Badri field, Gulf of Suez basin, Egypt: integration of gamma-ray log patterns and petrographic characteristics of lithology. Nat. Resour. Res. 30, 431–449.
Rider, M.H., 1990. Gamma-ray log shape used as a facies indicator: critical analysis of an oversimplified methodology. Geol. Soc. Spec. Publ. 48, 27–37.
Rogers, Samuel J., Fang, J.H., Karr, C.L., Stanley, D.A., 1992. Determination of lithology from well logs using a neural network. Am. Assoc. Petrol. Geol. Bull. 76, 731–739.
Scotese, Christopher, 2021. An atlas of Phanerozoic paleogeographic maps: the seas come in and the seas go out. Annu. Rev. Earth Planet Sci. 49, 679–728.
Selley, R.C., 1976. Subsurface environmental analysis of North Sea sediments. AAPG Bull. (Am. Assoc. Pet. Geol.) 60, 184–195.
Tan, M., Song, X., Yang, X., Wu, Q., 2015. Support-vector-regression machine technology for total organic carbon content prediction from wireline logs in organic shale: a comparative study. J. Nat. Gas Sci. Eng. 26, 792–802.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008.
Wang, C., Hazen, R.M., Cheng, Q., Stephenson, M.H., Zhou, C., Fox, P., Shen, S., Oberhänsli, R., Hou, Z., Ma, X., Feng, Z., Fan, J., Ma, C., Hu, X., Luo, B., Wang, J., Schiffries, C.M., 2021. The Deep-Time Digital Earth program: data-driven discovery in geosciences. Natl. Sci. Rev. 8, 2021.
Zheng, D.Y., Wu, S.X., 2021. Principal component analysis of textural characteristics of fluvio-lacustrine sandstones and controlling factors of sandstone textures. Geol. Mag. 158 (10), 1847–1861.
Zheng, D.Y., Yang, W., 2020. Provenance of upper Permian-lowermost Triassic sandstones, Wutonggou low-order cycle, Bogda Mountains, NW China: implications on the unroofing history of the eastern north Tianshan Suture. J. Palaeogeogr. 9.
Zhang, Li, et al., 2016. Lithologic characteristics and diagenesis of the Upper Triassic Xujiahe formation, Yuanba area, northeastern Sichuan Basin. Journal of Natural Gas Science and Engineering 35, 1320–1335.
Zheng, Rongcai, Dai, Zhaocheng, Luo, Qinglin, Wang, Xiaoping, Lei, Guangming, Jiang, Hao, Hu, Chen, 2011. Sedimentary system of the upper Triassic Xujiahe Formation in the Sichuan foreland basin. Nat. Gas. Ind. 31, 16–24.
Zheng, D., Wu, S., Hou, M., 2021. Fully connected deep network: an improved method to predict TOC of shale reservoirs from well logs. Mar. Petrol. Geol. 105205.