A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir
Summary
Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods
that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of
advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is
to provide some clarity to this issue from a methodological perspective: using production data from an unconventional shale-oil reservoir
as a test case, we show how to build robust predictive models and how to develop decision rules that help identify the factors
separating good wells from poor performers.
Introduction
The recent surge in US oil and gas production can be attributed primarily to the success in unlocking hydrocarbon resources from
unconventional reservoirs (Ahmed and Meehan 2016). These reservoirs are unconventional in the sense that the organic-rich source
rock itself is targeted for resource extraction. The extremely low permeability of unconventional reservoirs, generally in the nanodarcy
range, also requires the use of multistage hydraulic fracturing with horizontal wells. Reservoir modeling in such systems is an extremely
complicated task, given the need to simulate fluid flow in a network of induced natural fractures coupled to geomechanical effects and
other processes such as water blocking, non-Darcy flow in nanoscale pores, and adsorption/desorption (Cipolla et al. 2010; Ding et al.
2014). Current research has therefore been focused on the development of robust and computationally efficient mechanistic modeling
frameworks and software tools for modeling reservoir performance and optimizing production in unconventional reservoirs (Yan et al.
2017). The key issue with the routine application of comprehensive physics-based simulators is high computational cost. This is
because of the need to perform a large number of simulations to support practical decisions such as optimal well spacing, production
optimization, and field development. A practical alternative is the use of surrogate (proxy) models (dependent on the outputs of full-
physics simulators) that are ideal for repetitive calculations (Kulga et al. 2017). This is also an active area of development, which does
require the availability of a full-featured geology- and physics-based model of flow to an unconventional well. In the interim, data-
driven statistical approaches to understand the behavior of unconventional reservoirs dependent on only production data have emerged
as an attractive alternative (Mishra and Lin 2017), which is the subject of this paper.
There is a long tradition of using statistical methods to provide data-driven insights into system performance in health care, business,
environmental, and energy applications (Hastie et al. 2001). The terms “data mining,” “statistical learning,” “knowledge discovery,”
and “data analytics” have all been used interchangeably in this context. Essentially, the goal of such an exercise is to extract important
patterns and trends, and understand “what the data says,” using supervised and/or unsupervised learning (Hastie et al. 2001). In super-
vised learning, the value of an outcome is predicted using a number of inputs, with the training data set used to build a predictive model
or “learner” by means of techniques such as regression analysis, tree-based methods, support-vector machine, and neural networks. On
the other hand, unsupervised learning involves describing associations/patterns among a set of input measures (i.e., understanding how
the data are organized or clustered), using techniques such as cluster analysis, multidimensional scaling, self-organizing maps, and prin-
cipal-component analysis.
In recent years, several publications have dealt with the application of data mining/analytics for the assessment of unconventional
resources (LaFollette et al. 2012; Bhattacaharya et al. 2013; Mohaghegh 2013; Gupta et al. 2014). These studies cover a broad range
of techniques such as advanced nonparametric regression, tree-based modeling, classification-tree analysis, fuzzy clustering, and time-
series analysis. A search of the OnePetro database reveals similar applications for conventional oil and gas assets. These data-driven
models provide an easy pathway to real-time design and optimization, because the equivalent mechanistic models such as physics-based
simulators would be more time-consuming to set up, execute, and interpret.
Unfortunately, the application of advanced statistical algorithms is not typically a primary focus for petroleum engineers and geo-
scientists. Commercially available (Mathworks 2017; SAS 2017) and open-source (R Development Core Team 2014; Rossum 2007)
statistical software make these algorithms available to the larger community for use, along with robust testing. However, there remains
the issue of choosing the right algorithm(s) for the problem (as opposed to using one for all cases), applying the algorithm(s) with the
proper choice of user-defined parameters, avoiding the problem of data overfitting and resulting bias in fitted-model predictions, and
ensuring that the data-driven model makes physical sense in terms of variable selection and parameter importance.
The objective of this paper is to provide some clarity to this issue from a methodological perspective. Using production data from an
unconventional-shale-oil reservoir as a test case, we describe how to build robust predictive models and how to develop decision rules
that help identify factors separating good wells from poor performers. Our discussion will emphasize a thought process and analytical
framework that can be easily applied by geoscientists and petroleum engineers, working together with data scientists.
The paper is organized as follows. First, we describe the problem setting and perform an exploratory analysis on the input and output
data. Next, we outline the predictive input/output-modeling process on the full data set, including model-evaluation steps and good-
ness-of-fit metrics. This is followed by a presentation of predictive-model results. Decision-tree analysis is presented next for a subset
of the data that contains the top 25% and bottom 25% of the wells ordered in terms of production performance. Next, the issue of vari-
able importance is addressed for a variety of predictive-modeling approaches. Finally, some concluding remarks are presented regard-
ing the application of statistical learning methods for production optimization in unconventional reservoirs.
Problem Description
The techniques described in this paper will be illustrated on an example data set from west Texas, USA (Zhong et al. 2015). The study
area is the Delaware Basin, which overlaps with Loving, Ward, Winkler, and Reeves counties (Fig. 1). In this region, the Wolfcamp Shale is the completion target for the horizontal wells analyzed in this study.
Fig. 1—The map shows the study wells (colored circles) as well as the features of the surrounding Delaware Basin and Central
Basin Platform in west Texas. The well colors identify the cumulative production within the first 12 producing months, ordered
from purple (low) to red (high). The color of the terrain indicates elevation, again moving through the color spectrum from purple
(low) to red (high).
The 476 horizontal shale wells in the data set are primarily selected from Phantom Field and are listed in the public data as being
Wolfcamp (451WFMP) completions. In addition to the well identifications (IDs), the data set also contains 12 predictor variables and
three response variables. All the predictors relate to operational characteristics of the wells, including when the well was drilled, its
physical dimensions, stimulation details, and operator. The response measures cumulative well production (in barrels) over the first 12
producing months. A list of all variables in the data set is shown in Table 1.
The response of primary interest was M12CO, which measures the cumulative well production over the first 12 months in barrels. A
typical first step in exploratory data analysis is to examine a pairwise scatterplot of the response against the predictors to determine
whether any of the predictors have a strong marginal effect on the response (Hastie et al. 2001). This matrix of scatterplots shows the
relationship between all possible input variables (predictors) and output variables (responses), along with the empirical histogram for
all the variables along the diagonal. This plot also reveals strong correlations between pairs of predictors, which can lead to poor esti-
mates of model uncertainty, the severity of which depends on what sort of regression model is used. The pairwise scatterplot for this
data set is shown in Fig. 2. Note that of the 476 wells in the full data set, only 319 had nonmissing values for M12CO.
In the pairwise scatterplot, the top row and first column show the relationship between the response (M12CO) and each of the pre-
dictors individually. Note that none of these predictors shows a strong relationship with the response. LATLEN potentially has a posi-
tive association with the response, but the correlation is fairly weak. FLUID appears to also have some correlation with LATLEN,
although the outlier obscures this fact in the scatterplot.
Although 319 wells had nonmissing values for M12CO, many of those wells still had missing values for one or more of the predic-
tors. In fact, only 171 wells had complete data for the predictors and the response. Many of the methods used in subsequent analysis
require nonmissing predictors. To avoid losing nearly one-half of the wells in the analysis to missing data, imputation was used to fill in
missing entries. Many techniques exist for doing this, including replacing missing values with the mean or median of that variable over
the data set; generating values using a parametric distribution or regression model; filling in values using eigenvectors, principal compo-
nents, or partial-least-squares components (Geladi and Kowalski 1986); adding a “missing” indicator variable; or entering a value using
one or more “nearest neighbor” observations. In this case, a random-forest (RF) imputation method was used to fill in missing entries, which is a tech-
nique that falls into the “nearest-neighbor” category. The algorithm is derived from the RF predictive model (Breiman 2001), which
includes in the prediction a proximity score between each pair of observations. For each predictor, missing values are assigned by using
a weighted average of all nonmissing values across the data set, with the weights proportional to proximity scores. In other words, miss-
ing entries are given values that tend to agree most with nonmissing entries in similar observations in the data set. Note that this method
does not assume that interactions between predictors are the same across the input space, and can take local relationships into account
when entering missing values.
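As a concrete illustration, the sketch below performs a tree-based imputation in Python with scikit-learn. This is a minimal sketch, not the exact proximity-weighted RF imputation described above: scikit-learn does not expose Breiman's proximity scores directly, so an iterative imputer with a random-forest estimator is used as a reasonable stand-in. The file name and column list are hypothetical.

```python
# Sketch: tree-based imputation of missing predictor values (an approximation
# of the proximity-weighted RF imputation described in the text).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

predictors = ["COMPYR", "LATLEN", "FLUID", "PROP", "SurfX", "SurfY"]  # illustrative subset
wells = pd.read_csv("wolfcamp_wells.csv")                             # hypothetical file

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = pd.DataFrame(imputer.fit_transform(wells[predictors]), columns=predictors)
```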
Ordinary-Least-Squares (OLS) Regression. OLS regression (Draper et al. 1966), also called multiple linear regression, describes the
response as a linear combination of the predictors or functions of the predictors. Popular choices include main-effects models and quad-
ratic models. The former describes the response as a linear combination of the predictors only (i.e., a multidimensional plane). The lat-
ter includes pairwise interactions and quadratic terms, resulting in a surface that essentially comprises parabolas opening upward or
downward in each dimension. This can result in broad arching features or “saddles” within the surface. OLS regression assumes nor-
mally distributed residuals from the model fit. This is typically verified by plotting the residuals and using statistical tests to verify nor-
mality. If that inspection fails, some of the conclusions from the regression may not be correct.
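A minimal sketch of the two OLS variants just described (a main-effects model and a quadratic model with interactions and squared terms), written in Python with scikit-learn. Synthetic arrays stand in for the predictor matrix and response; a real analysis would substitute the imputed predictors and M12CO.

```python
# Sketch: main-effects vs. quadratic OLS fits on stand-in data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(171, 6))                        # stand-in predictor matrix
y = X @ rng.normal(size=6) + rng.normal(size=171)    # stand-in response

main_effects = LinearRegression().fit(X, y)

quad_terms = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad_terms.fit_transform(X)                 # adds x_i*x_j and x_i^2 columns
quadratic = LinearRegression().fit(X_quad, y)

print(main_effects.score(X, y), quadratic.score(X_quad, y))
```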
There exist extensions to OLS models that can capture more-diverse nonlinear behaviors than polynomials. For example, local linear
regression incorporates a weight matrix that predicts the response at a given location as a weighted linear combination of nearby obser-
vations. The weights typically are provided by a kernel function, such as a Gaussian, polynomial, or other exponential-type function.
Generalized linear models introduce nonlinearity through a link function. Finally, rather than fitting OLS models to the original predic-
tors, they can instead be fit to basis functions (e.g., natural cubic splines) that are derived from the predictors. For a review of these
methods and others, please refer to Hastie et al. (2001).
Fig. 2—This pairwise scatterplot shows the relationship between the response (M12CO, top-left corner) and a subset of the predictors (COMPYR, LATLEN, FLUID, PROP, SurfX, and SurfY). Plots on the diagonal contain histograms of the individual predictors and the response. Plots in the off-diagonal show the relationship between the variables in the associated row and column.
Decision Trees (DTs). DTs are useful tools for building simple, interpretable models to describe how a response relates to one or more predictor values
(Breiman et al. 1984). Examples of DT analysis for oil and gas applications include Perez et al. (2005), Yarus et al. (2006), and Popa
and Wood (2011). The common approach to constructing a DT is to use a classification-and-regression-tree model (Breiman et al.
1984), which recursively partitions the data set using splits on predictor values. At each branch in the tree, a predictor and threshold are
used to assign one set of observations to go down the left path, and the others to go down the right path. In regression trees, splits are
chosen to minimize an error metric (e.g., sum of squared residuals), and each terminal node yields a flat prediction at a constant value.
In classification trees, the goal is to predict a class label for each observation, and therefore the splits in the tree are chosen to maximally
separate the category labels of the observations between those two paths. Terminal nodes are assigned a group label that is predicted
for all observations reaching that node. DTs are typically “pruned” to earlier splits such that terminal nodes contain multiple
training observations.
Random Forest (RF). RF regression (Breiman 2001) is a tree-based approach that uses a technique called “bagging.” The model is an ensemble of sim-
ple regression trees, each of which contains splits on predictor values. Each split indicates whether an observation should take the left
or right branch of the tree dependent on a comparison of a specific predictor with a threshold value. The final nodes in the trees, called
leaves, contain the regression prediction. In RFs, each tree in the ensemble is trained using a bootstrap sample of the training data, and
a random subset of the predictors is considered for each split. This randomization not only allows each regression tree to focus on subtly
different aspects of the predictor/response relationship, but it also allows the ensemble to avoid overfitting to the training data, which is
characteristic of simple DTs.
From a geometric perspective, each regression tree in the RF model defines a step function that predicts constant values over rectan-
gular regions that partition the predictor space. By aggregating these trees, RF models can approximate arbitrarily complex nonlinear
surfaces, which makes them a powerful prediction tool. Other than selecting the number of trees in the ensemble, the only tunable
parameter of the RF model is the number of predictors considered at each split.
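A minimal sketch of an RF regression fit, highlighting the two tuning choices just mentioned (number of trees and number of predictors considered at each split). The parameter values and synthetic data are illustrative, not the settings used in the paper.

```python
# Sketch: random-forest regression with illustrative tuning values.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,   # number of trees in the ensemble
    max_features=4,     # predictors considered at each split
    random_state=0,
).fit(X, y)
y_hat = rf.predict(X)
```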
Gradient-Boosting Machines (GBMs). GBMs (Friedman 2001; Elith et al. 2008) are similar to RFs in the sense that they are also ensembles of regression trees. How-
ever, these trees are constructed sequentially rather than in parallel. Each new tree is constructed in such a way to compensate for the
shortcomings of the previous tree. That is, when one tree tends to fit poorly to the training data for particular types of predictor values,
the next tree will put more emphasis on observations in that problem area and make sure it predicts them well. The final model looks
like a linear-regression model with thousands of terms, where each term is a tree.
As is the case in RF models, the tree structure of GBMs provides the built-in capability to capture nonlinear behavior in the
response. However, GBMs require the specification of three tuning parameters, which include the maximum interaction size between
predictors; a shrinkage factor (i.e., learning rate); and the minimum number of observations allowed in the terminal nodes of the trees.
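A minimal sketch of a GBM fit showing where the three tuning parameters enter; the values chosen here are illustrative only.

```python
# Sketch: gradient-boosting regression with the three tuning parameters
# noted in the text -- interaction depth, shrinkage (learning rate), and
# minimum observations per terminal node.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=1000,     # number of sequential trees
    max_depth=3,           # limits the interaction size between predictors
    learning_rate=0.01,    # shrinkage factor
    min_samples_leaf=5,    # minimum observations in a terminal node
    random_state=0,
).fit(X, y)
```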
Support-Vector Regression (SVR). SVR (Drucker et al. 1997) is a technique closely related to the use of support-vector machines (Vapnik 2000), which are widely
used in classification tasks. These models use a simple linear-regression model to describe the response. However, the model is con-
structed in such a way that a “kernel trick” can be used to transform the data into a different space where the linear model makes sense.
This means that SVR models can fit nonlinear responses as long as a proper transformation function can be specified that transforms the
data to a space where the relationship is linear.
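A minimal sketch of an SVR fit with a radial-basis-function kernel, which supplies the nonlinear transformation (the “kernel trick”) described above. Standardizing the predictors first is a common practical step assumed here, and the kernel and regularization settings are illustrative.

```python
# Sketch: support-vector regression with an RBF kernel on standardized predictors.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1)).fit(X, y)
```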
Kriging Model (KM). The KM (Krige 1951; Cressie 1993), also called a Gaussian process, was originally developed for use in spatial
statistics. It contains two components: a trend and a spatial correlation structure. The trend term provides an underlying pattern to the
relationship between the response and the predictors (e.g., a linear-regression model or polynomial OLS regression model), and this pat-
tern will be relied on for prediction in regions where few data have been observed. Different types of Kriging assume different trend
terms. For example, ordinary Kriging assumes a constant mean across the predictor space, whereas universal Kriging assumes a polyno-
mial trend.
The correlation structure encourages response values to tend toward training responses where similar predictor levels were observed.
The influence a training observation exerts on the response is proportional to the similarity of the training predictors to the ones where
the response is being predicted. It should be noted that KMs are perfect interpolators. An assumption with this model is the correlation
structure itself, which implies that neighboring training observations will have similar responses. If two training observations are very
close in the predictor space, but have very different responses, this can cause issues in model fitting.
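A minimal sketch of a Gaussian-process (Kriging-type) fit using scikit-learn. A constant kernel times a squared-exponential kernel plays the role of the trend and spatial-correlation structure described above; with a near-zero noise term the model interpolates the training responses. The specific kernel and settings are an assumption for illustration.

```python
# Sketch: Gaussian-process (Kriging-type) regression with a squared-exponential
# correlation structure; near-zero alpha makes it an (approximate) interpolator.
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
km = GaussianProcessRegressor(kernel=kernel, alpha=1e-10, normalize_y=True).fit(Xs, y)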
Note that there are other modeling approaches not considered in the Wolfcamp analysis that appear elsewhere in oil and gas litera-
ture. The most popular of these is the artificial neural network (McCulloch and Pitts 1943; Hopfield 1982; Rumelhart et al. 1986;
Gevrey et al. 2003), which mimics a network of neurons in the brain, with each neuron calculating a linear combination of its inputs,
then passing that value into an activation function (typically a sigmoid) that measures the degree of excitement that neuron expresses
dependent on its inputs. From a mathematical perspective, a neural network is a chain of connected models, each of which applies a
nonlinear activation function to a linear combination of inputs, and is therefore just a nonlinear extension of linear models (Hastie et al.
2001). Artificial neural networks are becoming more ubiquitous in statistical software (Abadi et al. 2016), and can now be constructed
as easily as other machine-learning models. Other predictive techniques include multivariate adaptive-regression splines (Friedman
1991), generalized additive models (Tibshirani 1988), k-nearest neighbors (Hastie et al. 2001), elastic nets (Zou and Hastie 2005), and
naïve Bayes (Duda and Hart 1973; Langley et al. 1992).
Model Evaluation
One critical aspect of model selection is the evaluation of the goodness of fit, and the importance of this step is often overlooked.
A common approach to evaluating a model fit is to generate a scatterplot of actual response values in the training set against the
predicted response using the model. If all the points in the scatterplot lie near the 45° (1:1) line, this indicates a good model fit to the
training data. However, this does not necessarily mean that the model will work for future data collections.
For example, consider the model shown in Fig. 3, which is overfitting the training data set. That is, the model is placing too much
emphasis on reproducing the training set, and likely contains more degrees of freedom than are necessary to capture the underlying
shape of the curve producing these observations. This model is capturing not just the true underlying function, but also the noise in the
measurements, which makes it unlikely to produce good predictions going forward. However, in the model evaluation on the training
data, all the points lie along the 45° line, indicating a superb fit. Although this is a contrived example, many models can overfit when
certain conditions are met, and in a multidimensional space one cannot easily visualize the prediction surface generated by the model.
Fig. 3—This is an example of a poor model (red curve) that appears to fit well when evaluated solely against the training set.
A model like this is said to exhibit overfitting.
To avoid overfitting, it is important to move beyond using predictions on the training data as the sole measure of model quality. One
simple way to do this is to use an independent test set. This can either be a completely new data set (e.g., pilot data from a region where
the model is intended for use), or a “held out” portion of the training data set. In both cases, one can fit the model using the training por-
tion of the data set, and then evaluate the fit on the independent test observations.
A third method of model evaluation is called k-fold cross validation (Hastie et al. 2001). In this approach, the training data set is ran-
domly split into k different groups (commonly called “folds”). Next, each of the k groups is held out and the model is trained on the
remaining k–1 groups. That model is then used to make predictions on the group that was held out (Fig. 4). After cycling through all k
groups, there will be a single prediction for every observation in the data set, and the predictions were made using a model for which
that observation was not included in the training set. These cross-validated predictions can be used to evaluate the quality of the model-
fitting procedure, and can indicate whether any problems might be expected on future data collections.
Fig. 4—This diagram is a conceptual representation of k-fold cross validation. The full data set is split into k groups (here, k = 5), and each group is systematically “held out” as a test set for a model trained on the remaining k–1 groups. This yields one cross-validated prediction for each observation in the data set.
There are two important notes to make in this discussion of cross validation. First, one can extend the cross-validation procedure by
repeating the entire process with a different random selection of k groups. A repeated cross validation using r repeated runs of k ran-
domly selected groups will yield r different predictions on each of the observations. These can be averaged to compute goodness-of-fit
metrics, but they also give important information regarding the variability in model predictions depending on the characteristics of the
training set. Second, note that the models trained during cross validation are not the models one would use for prediction going forward;
rather, one would build a single predictive model using the full training set. The cross-validation procedure is only for evaluation pur-
poses and provides a better indicator on the robustness of the predictive model for future applications using new data.
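A minimal sketch of the k-fold and repeated cross-validation procedures described above, using scikit-learn's out-of-fold prediction helper on stand-in data; the model and fold counts are illustrative.

```python
# Sketch: 10-fold (and repeated) cross-validated predictions.
# cross_val_predict returns one out-of-fold prediction per observation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0)

cv_pred = cross_val_predict(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Repeated cross validation: rerun with r different random fold assignments
# to see how much the predictions vary with the makeup of the training folds.
repeats = [
    cross_val_predict(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=r))
    for r in range(5)
]
```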
Goodness-of-Fit Metrics
Many measures exist for evaluating the quality of a model fit. In this paper, two metrics are used: average-absolute error (AAE) and
mean-squared error (MSE). These two metrics are similar, and both attempt to capture the overall closeness of predictions to the evalua-
tion data. Let y_i be the true response for the ith observation, and ŷ_i be the predicted response for that observation. The AAE is defined
as the average magnitude of the difference between the true response and the predicted response (i.e., the average size of the residuals):

\mathrm{AAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad\qquad (1)

The MSE is defined similarly, except that the residuals are squared before they are averaged:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad\qquad (2)
Note that AAE has units matching those of the response (in bbl, in the example data set), whereas MSE is measured in squared units
of the response. Values closer to zero are desirable because they indicate smaller deviations between the truth and predictions (i.e.,
more-accurate prediction). MSE is typically preferred over AAE because of its well-known distributional properties, including being
continuously differentiable and being a sufficient statistic for normally distributed random processes.
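The two metrics translate directly into code; the short sketch below implements Eqs. 1 and 2 as written.

```python
# Sketch: the goodness-of-fit metrics of Eqs. 1 and 2.
import numpy as np

def aae(y_true, y_pred):
    """Average-absolute error: mean |y_i - yhat_i| (units of the response)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    """Mean-squared error: mean (y_i - yhat_i)^2 (squared response units)."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```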
Predictive-Modeling Results
Model-fitting results on the Wolfcamp data set are summarized in Fig. 5. Each plot shows the true response (M12CO) on the horizontal
axis and the predicted response on the vertical axis. Points on the diagonal dotted line indicate perfect prediction. Each row of plots
shows predictions from one type of model (OLS, RF, GBM, SVR, and KM), while each column shows results for a different model-
evaluation type.
The left column shows independent-validation results: a random 20% subset of the wells was held out as test data. The model was
then fit to the remaining 80% of the data set and evaluated on the 20% hold-out set. The points in
the plots in the left column only correspond to those predictions on the hold-out segment of the data set. For the cross-validation predic-
tions (center column), a 10-fold cross validation was used as a further refinement of the fivefold cross-validation approach shown in
Fig. 4. The points in these plots show the actual vs. the cross-validated predictions of each of the wells in the data set. The right column
shows the results from training and predicting on the full data set, which is the conventional approach to evaluating goodness of fit.
Notice that the independent validation and cross-validation results (left and center columns) tell a much-different story from the pre-
dictions on the full training set (right column). Compared with OLS regression, many of the models show a dramatic reduction in error
in both the AAE and MSE metrics, if one accepts the full-training-set approach for model evaluation. However, this reduction is more
modest for the other methods of model evaluation. The extreme case is the KM, which is a perfect interpolator and hence, by design,
forces the model fit through the training observations. Prediction on the full training set yields an apparent perfect predictive ability,
which clearly will not hold up in future data sets. One might argue that this would be obvious to an engineer or geoscientist examining
these plots. However, for a case like the RF, it is not so clear that the predictions on the training data are biased; it is only when com-
pared with the independent validation and cross-validation plots that the overfitting is revealed.
Having a more-realistic understanding of model performance has its own intrinsic merit, but it can also be useful for identifying
issues with data collection and availability. In this case, the poor predictive ability of the models, especially for high-producing wells,
likely results from a data deficiency. Because of restricted availability, the predictors in this data set only capture information regarding
well completion and operation, but nothing regarding the local geology around and within the well; thus, the addition of geological pre-
dictors may improve the accuracy of the models. Generally, when validated or cross-validated model performance is poorer than
expected, it can be an indicator of similar data shortcomings that can be investigated and addressed moving forward.
In summary, basing one’s expectations solely on the full-training-set-based predictions could lead to an optimistic perception of what
the model accuracy will be on future test data. This could lead to disappointment when those levels of accuracy are not met. Instead, it
may be wiser to adopt an alternative, albeit more robust, method of model evaluation such as a k-fold cross validation. Although the
results may not be as spectacular as far as goodness of fit on current data is concerned, they should align closer with actual model per-
formance on future data collections. Not only will this allow stakeholders to calibrate their expectations of the model as it is applied on
new data, but it can also identify data gaps that will inform further development in data collection and management practices.
DT Analysis
One strategy for tackling predictive problems is to simplify the question being asked. The predictive regression models developed in
the preceding section were attempting to pinpoint the exact cumulative first-year production (M12CO) for a given set of well character-
istics. However, suppose the real aim of this exercise is to give a simple “go/no go” solution on constructing a well. In this case, accu-
rate prediction of M12CO is not necessarily required. One needs only to predict whether the well will be a “good” well (relatively large
M12CO) or a “bad” well (relatively low M12CO). One way of simplifying this problem is to change the modeling effort from a regres-
sion problem to a classification problem. That is, the response can be binned into categories, and classification-tree models can be used
to predict into which category a well falls.
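A minimal sketch of this recasting: bin the response into top and bottom quartiles, drop the middle, and fit a shallow classification tree. The file, column names, and tree settings are illustrative assumptions, and the predictors are assumed to have been imputed as described earlier.

```python
# Sketch: converting the regression problem into a classification problem.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

wells = pd.read_csv("wolfcamp_wells.csv")                       # hypothetical file
lo, hi = wells["M12CO"].quantile([0.25, 0.75])
subset = wells[(wells["M12CO"] <= lo) | (wells["M12CO"] >= hi)].copy()
subset["label"] = (subset["M12CO"] >= hi).map({True: "Top 25%", False: "Bottom 25%"})

predictors = ["PROP", "LATLEN", "TVDSS"]                        # illustrative subset
subset = subset.dropna(subset=predictors)                       # or use imputed values

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(subset[predictors], subset["label"])
```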
For the Wolfcamp data set, the top 25% and bottom 25% of producing wells were identified, and the middle 50% of the wells were
removed. A classification tree was then built to separate the top and bottom 25% groups. The result is shown in Fig. 6. The tree begins
at the top of the figure, where the first split checks whether the proppant used is less than 1,405,000 lbm. If so, a well observation moves
down the left path; otherwise, it goes right. Subsequent splits work in the same way, until eventually the observation reaches a terminal
node that contains a prediction. The text at the terminal nodes in this tree indicates how many training observations of each type
(“Bottom 25%” or “Top 25%”) ended up in that node.
Fig. 5—Actual vs. predicted 12-month cumulative oil (×1,000 bbl) for each model type (rows: OLS, RF, GBM, SVR, and KM) under each evaluation type (columns: prediction on held-out test data, 10-fold cross-validation predictions, and prediction on the full training data set). Each panel is annotated with its AAE and MSE; the annotations recovered from the extraction for the RF row read AAE = 23.71, 25.54, and 11.11 and MSE = 1106.49, 1272.06, and 244.86 across the three columns.
[Fig. 6 tree, as extracted: root split on PROP < 1.405×10^6; further splits on TVDSS ≥ –8,294 and LATLEN ≥ 5,362; terminal-node counts (Bottom 25%/Top 25%): 26/0, 8/2, 7/3, 21/2, 12/58, and 6/15.]
Fig. 6—A DT that separates the top 25% and bottom 25% of producing wells. If the expression at a split is true, an observation
goes down the left branch; otherwise, it goes down the right branch. The fractions in the terminal nodes indicate the proportion of
training observations from each group in that node (listed as “Bottom 25%” and “Top 25%”).
DTs are easy to interpret. Not only do they indicate which predictors are influential in determining the response category, they also
identify critical values at which these categories change. In this case, there are two general paths to obtaining a top 25% producing
well. For wells using lower amounts of proppant (PROP < 1.405 × 10^6 lbm), the goal is to have a longer lateral (LATLEN ≥ 2,756 ft)
and a greater vertical depth (TVDSS < –8,294 ft). For wells using larger amounts of proppant (PROP ≥ 1.405 × 10^6 lbm), the goal is
again to have a greater vertical depth (TVDSS < –8,100 ft), and to have a lateral that is not too long (LATLEN < 5,362 ft).
Fig. 7 shows a view of this DT from the perspective of the wells in the predictor space. Fundamentally, DTs partition the predictor
space into blocks of similar observations. In a scatterplot of two predictors, this appears as vertical and horizontal segmentation of the
plot. In the top-left plot, the first split at PROP = 1.405 × 10^6 appears as a vertical division of the plot. Within each of those divisions,
the splits on LATLEN (2,756 ft on the left branch and 5,362 ft on the right) serve to further subdivide the plot. Unfortunately, 2D views
can only show so much when three predictors are involved in the tree. 3D visualization can be used in this case to better understand
where the top 25% of wells are compared with the bottom 25% (Fig. 8). In this case, the bottom 25% cases form a distinct cluster at
low proppant levels and short lateral lengths. The other bottom 25% wells are mixed with the top 25% wells, but tend toward higher
depths and low proppant levels.
Table 3 shows a “confusion matrix” that summarizes the separability of the two classes in the training set. The value in each cell
describes how many wells of the true category indicated in the row header were in a terminal node for which the majority category was
the one indicated in the column header. Because 62 of the 80 true bottom 25% wells were in “Bottom 25%” terminal nodes, this yields a cor-
rect identification rate of 62/80 = 77.5%. A similar calculation gives a correct identification rate of 73/80 = 91.3% for the top 25% wells.
Overall, the rate is (62 + 73)/160 = 84.4%. This indicates a reasonable ability to separate the two classes. The terminal nodes in the tree
can be examined to determine where the evidence for splitting the classes is perhaps a bit weak. In this case, the “Top 25%” node with
a 6:15 ratio and the “Bottom 25%” node with the 7:3 ratio indicate places where the evidence is weaker.
Variable Importance
In some applications, the objective may not be to build a predictive model for a given response, but instead to identify the drivers of
that response among a large set of predictors. This is typically called screening. For example, in the Wolfcamp data set, the aim may be
to identify operational characteristics of horizontal shale wells that tend to correlate with higher well performance. With fewer predic-
tors to focus on, a screening exercise could guide the development of simpler predictive models or make experimental designs feasible
with a smaller number of runs.
There are many different approaches to measuring variable importance. One easy approach that is not tied to a particular model is
called R2 loss (Mishra et al. 2009). This method works for any regression model, and the reasoning is that if an influential predictor is
removed from a model, the accuracy of that model will dramatically fall. Alternatively, if a superfluous predictor is removed from the
model, there should be little to no effect on the accuracy.
Fig. 7—These scatterplots show the splits made by the top 25% vs. bottom 25% DT. Note how the tree is fairly efficient at partition-
ing the predictor space into the region that contains primarily the top 25% wells.
Model fit can be assessed using pseudo-R2, which is defined in Eq. 3. Pseudo-R2 compares the sum of squared differences between
the true responses yi and predicted responses y^i to the overall sum of squares, which is proportional to the variance of the responses.
That is, it measures how much of the variability in the response is explained by the model. Note that while in a linear-regression model,
the pseudo-R2 is bounded between zero and unity, this is not the case for a general-regression model. When a regression model fits the
data worse than a flat line at the mean response does, the pseudo-R2 will be negative.
R_p^2 = 1 - \frac{SS_{\mathrm{model}}}{SS_{\mathrm{total}}} = 1 - \frac{\displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad\qquad (3)

To measure variable importance, one can compute the pseudo-R^2 using all the predictors, and then compute it again for a reduced model
that uses all the predictors except the predictor of interest. The R^2 loss is then the difference between the pseudo-R^2 of the full model
and that of the reduced model. A larger loss in the pseudo-R^2 indicates a predictor with higher influence on the response.
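A minimal sketch of the R^2-loss screening procedure: drop one predictor at a time, refit, and record the decrease in pseudo-R^2 (Eq. 3). Computing the pseudo-R^2 on cross-validated predictions, as done here, is one reasonable choice rather than a prescription from the text; the model and data are stand-ins.

```python
# Sketch: R2-loss variable importance via leave-one-predictor-out refitting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

def pseudo_r2(y_true, y_pred):
    ss_model = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_model / ss_total

def cv_r2(X_sub):
    pred = cross_val_predict(RandomForestRegressor(random_state=0), X_sub, y, cv=10)
    return pseudo_r2(y, pred)

full = cv_r2(X)
losses = {j: full - cv_r2(np.delete(X, j, axis=1)) for j in range(X.shape[1])}
```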
Figs. 9 and 10 show the R2-loss rankings for the Wolfcamp data set. When measuring variable importance, it can be useful to com-
pute the ranks using several different predictive models. This can give a more-robust sense of which predictors are important. In this
case, there is a great deal of disagreement among the models as to which predictors are the most influential. The depth (TVDSS) is pop-
ular among all models. Three of the four models also put weight on the amount of proppant used (PROP), the length of the lateral
(LATLEN), and the amount of fracturing fluid used (FLUID). From a physical standpoint, the importance of these variables
makes sense because they clearly affect the stimulated reservoir volume (LATLEN, PROP, FLUID), the productivity index of the well
(LATLEN), as well as the intrinsic energy in the reservoir (TVDSS)—all of which contribute to the cumulative production over the first
12 months.
The same reasoning applied to the R2 loss also holds true when other measures of model importance are used. For example, one
could measure the loss in other quality-of-fit measures such as the Akaike (1973) information criterion or the Bayesian information cri-
terion (Schwarz 1978). One potential problem with this approach is that it is not well-suited to situations where predictors are highly
correlated. For example, suppose a pair of important predictors happen to be correlated. When one of the predictors is removed from
the model, the other predictor can stand in to compensate for the loss, resulting in only a small reduction in the model fit. As a result,
both correlated predictors appear to be unimportant, and may be removed from future consideration. Indeed, this is the case in the
Wolfcamp data set for some pairs of predictors. For example, FLUID and PROP have a high correlation, which can be observed in the
pairwise scatterplot in Fig. 2. This underlines the importance of performing a preliminary exploratory data analysis before jumping into
model building. It can help to identify potential issues to consider moving forward.
Fig. 8—These plots show 3D views of the Wolfcamp data set. The three predictors on the axes are the ones that were selected for
splits in the top 25% vs. bottom 25% classification-and-regression-tree analysis. The bottom 25% of wells (yellow triangles) form
two groups. The first is a distinct cluster at low PROP and low LATLEN. The second are more mixed with the top 25% (blue circles),
but primarily occur either at higher true vertical depth subsea (TVDSS) or low PROP values.
Table 3—Confusion matrix summarizing the separability of the two classes in the training set.

                         Predicted Bottom 25% Wells   Predicted Top 25% Wells   Total   Correct ID Rate
True bottom 25% wells               62                          18                80        77.5%
True top 25% wells                   7                          73                80        91.3%
Total                               69                          91               160        84.4%
Other model-specific methods of measuring variable importance exist as well. For example, among the predictive models described
in Table 2, RFs and GBMs have custom methods for identifying influential predictors. In RFs, the prediction strength of each variable
is measured by calculating the increase of MSE when that variable is permuted while all others are left unchanged. The rationale behind
the permutation step is that if the predictor variable is not influential, rearranging its values among the training observations will not
change the prediction accuracy of the model significantly. For GBMs, the variable importance is dependent on the number of times a
predictor variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged
over all trees. Variable importance results for these two DT ensemble methods are shown in Fig. 11.
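A minimal sketch of the two model-specific importance measures just described. Note that scikit-learn's generic permutation-importance routine, used here, permutes predictors on a supplied data set rather than on the out-of-bag samples used by the classical RF measure; it is a stand-in, not the exact procedure. The GBM relative influence is exposed directly as split-gain-based feature importances.

```python
# Sketch: permutation importance for an RF and relative influence for a GBM.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
perm = permutation_importance(rf, X, y, scoring="neg_mean_squared_error",
                              n_repeats=10, random_state=0)
rf_importance = perm.importances_mean          # increase in MSE when a predictor is permuted

gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
gbm_importance = gbm.feature_importances_      # split-gain-based relative influence
```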
As is clear from the results shown in this section, variable-importance rankings can differ widely from method to method. It can
be useful to try several different methods, as well as the graphical comparative-assessment methodology discussed earlier, to gain more-
robust feedback regarding which variables are most influential in driving the response.
Fig. 9—R2-loss rankings of the predictors for the Wolfcamp data set. Average ranks over the four predictive models, as extracted: TVDSS (2.2), opt2B (4.8), SurfY (5.0), SurfX (5.2), FLUID (6.2), PROP (6.5), LATLEN (6.8), COMPYR (7.8), STAGE (8.8), PROPCON (8.8), DA (10.2), AZM (11.5), opt2C (11.5), opt2D (12.8), opt2A (13.8), opt2E (14.2).
Fig. 10—These plots show a different view of the R2-loss procedure on the Wolfcamp data set. Both plots show how the predictors
stack up in terms of the average rank over the four predictive models vs. the variability in those rankings. The left plot shows hori-
zontal boxplots of the rankings of the predictors, sorted from bottom to top by rank. The right plot shows a scatterplot of average
rank vs. the standard deviation of those ranks. TVDSS is clearly an influential predictor, with high rank and low variability. FLUID
also has a reliable rank in the middle of the pack. Finally, opt2A is determined as not important, with consistently low rankings.
Fig. 11—These plots show variable importance results on the Wolfcamp data set for RF (left) and GBM (right).
Concluding Remarks
Data from wells completed in the Wolfcamp Shale Formation in the Permian Basin are used to demonstrate how statistical methods can
provide data-driven insights into production performance. Predictive models for the first 12 months of production are built using multi-
ple input parameters characterizing well location, architecture, and completions. Regression techniques used include OLS, as well as
other advanced regression methods such RFs, SVR, GBM, and KM. Models are evaluated using goodness-of-fit metrics for the training
data set itself, a hold-out (validation) data set, and k-fold cross validation. In addition, DT analysis is applied to identify factors separat-
ing the top 25% of wells from the bottom 25% of wells. Finally, a variety of variable importance techniques are used to identify the
most-influential subset of parameters.
As far as regression analysis is concerned, our main conclusion is that relying solely on training-data goodness of fit for model
ranking will consistently favor perfect interpolators, such as Kriging, even though their predictive performance under
cross validation may not be equally robust. In other words, the problem may be more nuanced than finding a single model that best fits
the data. It may be more beneficial to build multiple input/output models during the data-fitting process, and use a statistical model aver-
aging approach based on the k-fold goodness-of-fit statistics to estimate the “weight” of each model and combine model predictions
(Mishra 2012).
DT-based methods have not been as common as regression-based approaches in oilfield exploration and production (E&P) applica-
tions. However, they offer greater interpretability and can be more amenable to the formulation of a decision problem. In particular, the
DT can provide useful insights as to what variables or combinations of variables drive high-end and low-end production performance.
In summary, we note that there is a growing trend toward the use of statistical and machine-learning techniques for oil and gas applications.
Nomenclature
R_p^2 = pseudo-R^2 (goodness of fit), see Eq. 3
SS_model = sum of squares for response variable explained by the model
SS_total = total sum of squares for response variable
y_i = true response for ith observation
ŷ_i = predicted response for ith observation
Acknowledgments
This study was supported by a Battelle Internal research and development grant. We thank our colleagues Rob Carnell and Rod Osborne
for a careful review of this manuscript. We also thank our technical editors for several useful comments and suggestions that helped
improve the readability of the paper.
References
Abadi, M., Barham, P., Chen, J. et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. Proc., 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI ’16), Savannah, Georgia, 2–4 November.
Ahmed, U. and Meehan, D. N. 2016. Unconventional Oil and Gas Resources: Exploitation and Development. Boca Raton, Florida: CRC Press.
Akaike, H. 1973. Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information
Theory, ed. B. N. Petrov and B. F. Csaki, 267–281. Budapest, Hungary: Academiai Kiado.
Bhattacaharya, S., Maucec, M., Yarus, J. et al. 2013. Causal Analysis and Data Mining of Well Stimulation Data Using Classification and Regression
Tree with Enhancements. Presented at the SPE Annual Technology Conference and Exhibition, New Orleans, 30 September–2 October. SPE-
166472-MS. https://doi.org/10.2118/166472-MS.
Breiman, L. 2001. Random Forests. Mach. Learn. 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, L., Friedman, J., Stone, C. J. et al. 1984. Classification and Regression Trees. Boca Raton, Florida: CRC Press.
Cipolla, C. L., Lolon, E. P., Erdle, J. C. et al. 2010. Reservoir Modeling in Shale-Gas Reservoirs. SPE Res Eval & Eng 13 (4): 638-653. SPE-125530-
PA. https://doi.org/10.2118/125530-PA.
Cressie, N. 1993. Statistics for Spatial Data. New York City: Wiley.
Dimitriadou, E., Hornik, K., Leisch, F. et al. 2011. e1071: Misc Functions of the Department of Statistics. TU Wien. R Package Version 1.6.
Ding, D. Y., Wu, Y.-S., Farah, N. et al. 2014. Numerical Simulation of Low Permeability Unconventional Gas Reservoirs. Presented at the SPE/EAGE
European Unconventional Resources Conference and Exhibition, Vienna, Austria, 25–27 February. SPE-167711-MS. https://doi.org/10.2118/
167711-MS.
Draper, N. R., Smith, H., and Pownell, E. 1966. Applied Regression Analysis, Vol. 3. New York City: Wiley.
Drucker, H., Burges, C. J., Kaufman, L. et al. 1997. Support Vector Regression Machines. Proc., 9th International Conference on Neural Information
Processing Systems, Denver, 3–5 December, 155–161.
Duda, R. O. and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York City: John Wiley & Sons.
Elith, J., Leathwick, J. R., and Hastie, T. 2008. A Working Guide to Boosted Regression Trees. J. Anim. Ecol. 77 (4): 802–813. https://doi.org/10.1111/
j.1365-2656.2008.01390.x.
Friedman, J. H. 1991. Multivariate Adaptive Regression Splines. Annal. Stat. 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.
Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annal. Stat. 29 (5): 1189–1232. https://doi.org/10.1214/aos/
1013203451.
Geladi, P. and Kowalski, B. R. 1986. Partial Least-Squares Regression: A Tutorial. Anal. Chim. Ac. 185: 1–17. https://doi.org/10.1016/0003-
2670(86)80028-9.
Gevrey, M., Dimopoulos, I., and Lek, S. 2003. Review and Comparison of Methods to Study the Contribution of Variables in Artificial Neural Network
Models. Ecol. Model. 160 (3): 249–264. https://doi.org/10.1016/S0304-3800(02)00257-0.
Gupta, S., Fuehrer, F., and Jeyachandra, B. C. 2014. Production Forecasting in Unconventional Resources Using Data Mining and Time Series Analysis.
Presented at the SPE/CSUR Unconventional Resources Conference, Calgary, 30 September–2 October. SPE-171588-MS. https://doi.org/10.2118/
171588-MS.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York City:
Springer.
Hopfield, J. J. 1982. Neural Networks and Physical Systems With Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. USA 79 (8): 2554–2558.
Jared Schuetter is a principal research statistician at Battelle Memorial Institute, and has been with the company for 8 years. His
research interests include exploratory data analysis, machine learning, image analysis, data visualization, and software devel-
opment. Schuetter holds a PhD degree in statistics from Ohio State University.
Srikanta Mishra is Institute Fellow and Chief Scientist for Energy at Battelle Memorial Institute. Previously, he worked for the geosys-
tems consulting company Intera, and as an adjunct professor of petroleum engineering at the University of Texas at Austin.
Mishra’s research interests include computational modeling and data-analytics applications for oil and gas problems. He is the
author of the book Applied Statistical Modeling and Data Analytics: A Practical Guide for the Petroleum Geosciences, published
by Elsevier, and will serve as an SPE Distinguished Lecturer during the 2018–2019 season. Mishra holds a PhD degree in petroleum
engineering from Stanford University.
Ming Zhong is a data scientist at Shell. Previously, he worked as a statistician with Baker Hughes, Capital One, and Abbott Labs.
Zhong’s current research interests focus on the application of machine learning in the oil and gas sector. He has authored or
coauthored more than 15 technical papers. Zhong holds a PhD degree in statistics from Texas A&M University. He is a member
of SPE.
Randy LaFollette is retired, and previously spent 39 years as a technical professional in the oil and gas industry, working for the
Western Company, Reservoirs Incorporated, BJ Services, and Baker Hughes. During his career, LaFollette developed new meth-
ods and concepts to display and analyze engineering data on geological maps to improve interpretation of unconventional-
reservoir-production results. He holds a bachelor’s degree in geological science from Lehigh University. LaFollette also volun-
teered extensively for SPE, the American Association of Petroleum Geologists, and the Houston Geological Society. He worked
to bring the benefits of multivariate statistical analysis to the study of unconventional reservoirs, coauthored numerous papers,
and served as an SPE Distinguished Lecturer during the 2015–2016 season.