A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir
Summary
Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods
that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of
advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is
to provide some clarity to this issue from a methodological perspective: using production data from an unconventional shale-oil reservoir
as a test case, we show how to build robust predictive models and how to develop decision rules that help identify the factors
separating good wells from poor performers.
Introduction
The recent surge in US oil and gas production can be attributed primarily to the success in unlocking hydrocarbon resources from
unconventional reservoirs (Ahmed and Meehan 2016). These reservoirs are unconventional in the sense that the organic-rich source
rock itself is targeted for resource extraction. The extremely low permeability of unconventional reservoirs, generally in the nanodarcy
range, also requires the use of multistage hydraulic fracturing with horizontal wells. Reservoir modeling in such systems is an extremely
complicated task, given the need to simulate fluid flow in a network of induced natural fractures coupled to geomechanical effects and
other processes such as water blocking, non-Darcy flow in nanoscale pores, and adsorption/desorption (Cipolla et al. 2010; Ding et al.
2014). Current research has therefore been focused on the development of robust and computationally efficient mechanistic modeling
frameworks and software tools for modeling reservoir performance and optimizing production in unconventional reservoirs (Yan et al.
2017). The key issue with the routine application of comprehensive physics-based simulators is high computational cost. This is
because of the need to perform a large number of simulations to support practical decisions such as optimal well spacing, production
optimization, and field development. A practical alternative is the use of surrogate (proxy) models (dependent on the outputs of full-
physics simulators) that are ideal for repetitive calculations (Kulga et al. 2017). This is also an active area of development, which does
require the availability of a full-featured geology- and physics-based model of flow to an unconventional well. In the interim, data-
driven statistical approaches to understand the behavior of unconventional reservoirs dependent on only production data have emerged
as an attractive alternative (Mishra and Lin 2017), which is the subject of this paper.
There is a long tradition of using statistical methods to provide data-driven insights into system performance in health care, business,
environmental, and energy applications (Hastie et al. 2001). The terms “data mining,” “statistical learning,” “knowledge discovery,”
and “data analytics” have all been used interchangeably in this context. Essentially, the goal of such an exercise is to extract important
patterns and trends, and understand “what the data says,” using supervised and/or unsupervised learning (Hastie et al. 2001). In super-
vised learning, the value of an outcome is predicted using a number of inputs, with the training data set used to build a predictive model
or “learner” by means of techniques such as regression analysis, tree-based methods, support-vector machine, and neural networks. On
the other hand, unsupervised learning involves describing associations/patterns among a set of input measures (i.e., understanding how
the data are organized or clustered), using techniques such as cluster analysis, multidimensional scaling, self-organizing maps, and prin-
cipal-component analysis.
In recent years, several publications have dealt with the application of data mining/analytics for the assessment of unconventional
resources (LaFollette et al. 2012; Bhattacaharya et al. 2013; Mohaghegh 2013; Gupta et al. 2014). These studies cover a broad range
of techniques such as advanced nonparametric regression, tree-based modeling, classification-tree analysis, fuzzy clustering, and time-
series analysis. A search of the OnePetro database reveals similar applications for conventional oil and gas assets. These data-driven
models provide an easy pathway to real-time design and optimization, because the equivalent mechanistic models such as physics-based
simulators would be more time-consuming to set up, execute, and interpret.
Unfortunately, the application of advanced statistical algorithms is not typically a primary focus for petroleum engineers and geo-
scientists. Commercially available (Mathworks 2017; SAS 2017) and open-source (R Development Core Team 2014; Rossum 2007)
statistical software make these algorithms available to the larger community for use, along with robust testing. However, there remains
the issue of choosing the right algorithm(s) for the problem (as opposed to using one for all cases), applying the algorithm(s) with the
proper choice of user-defined parameters, avoiding the problem of data overfitting and resulting bias in fitted-model predictions, and
ensuring that the data-driven model makes physical sense in terms of variable selection and parameter importance.
The objective of this paper is to provide some clarity to this issue from a methodological perspective. Using production data from an
unconventional-shale-oil reservoir as a test case, we describe how to build robust predictive models and how to develop decision rules
that help identify factors separating good wells from poor performers. Our discussion will emphasize a thought process and analytical
framework that can be easily applied by geoscientists and petroleum engineers, working together with data scientists.
The paper is organized as follows. First, we describe the problem setting and perform an exploratory analysis on the input and output
data. Next, we outline the predictive input/output-modeling process on the full data set, including model-evaluation steps and good-
ness-of-fit metrics. This is followed by a presentation of predictive-model results. Decision-tree analysis is presented next for a subset
of the data that contains the top 25% and bottom 25% of the wells ordered in terms of production performance. Next, the issue of vari-
able importance is addressed for a variety of predictive-modeling approaches. Finally, some concluding remarks are presented regard-
ing the application of statistical learning methods for production optimization in unconventional reservoirs.
Problem Description
The techniques described in this paper will be illustrated on an example data set from west Texas, USA (Zhong et al. 2015). The study
area is the Delaware Basin, which overlaps with Loving, Ward, Winkler, and Reeves counties (Fig. 1). In this region, the Wolfcamp Shale is the completion target for the horizontal wells analyzed in this study.
Fig. 1—The map shows the study wells (colored circles) as well as the features of the surrounding Delaware Basin and Central
Basin Platform in west Texas. The well colors identify the cumulative production within the first 12 producing months, ordered
from purple (low) to red (high). The color of the terrain indicates elevation, again moving through the color spectrum from purple
(low) to red (high).
The 476 horizontal shale wells in the data set are primarily selected from Phantom Field and are listed in the public data as being
Wolfcamp (451WFMP) completions. In addition to the well identifications (IDs), the data set also contains 12 predictor variables and
three response variables. All the predictors relate to operational characteristics of the wells, including when the well was drilled, its
physical dimensions, stimulation details, and operator. The response measures cumulative well production (in barrels) over the first 12
producing months. A list of all variables in the data set is shown in Table 1.
The response of primary interest was M12CO, which measures the cumulative well production over the first 12 months in barrels. A
typical first step in exploratory data analysis is to examine a pairwise scatterplot of the response against the predictors to determine
whether any of the predictors have a strong marginal effect on the response (Hastie et al. 2001). This matrix of scatterplots shows the
relationship between all possible input variables (predictors) and output variables (responses), along with the empirical histogram for
all the variables along the diagonal. This plot also reveals strong correlations between pairs of predictors, which can lead to poor esti-
mates of model uncertainty, the severity of which depends on what sort of regression model is used. The pairwise scatterplot for this
data set is shown in Fig. 2. Note that of the 476 wells in the full data set, only 319 had nonmissing values for M12CO.
In the pairwise scatterplot, the top row and first column show the relationship between the response (M12CO) and each of the pre-
dictors individually. Note that none of these predictors shows a strong relationship with the response. LATLEN potentially has a posi-
tive association with the response, but the correlation is fairly weak. FLUID appears to also have some correlation with LATLEN,
although the outlier obscures this fact in the scatterplot.
Although 319 wells had nonmissing values for M12CO, many of those wells still had missing values for one or more of the predic-
tors. In fact, only 171 wells had complete data for the predictors and the response. Many of the methods used in subsequent analysis
require nonmissing predictors. To avoid losing nearly one-half of the wells in the analysis to missing data, imputation was used to fill in
missing entries. Many techniques exist for doing this, including replacing missing values with the mean or median of that variable over
the data set; generating values using a parametric distribution or regression model; filling in values using eigenvectors, principal compo-
nents, or partial-least-squares components (Geladi and Kowalski 1986); adding a “missing” indicator variable; or entering a value using
one or more “nearest neighbor” observations. In this case, a random-forest (RF) imputation method was used to fill in missing entries, which is a tech-
nique that falls into the “nearest-neighbor” category. The algorithm is derived from the RF predictive model (Breiman 2001), which
includes in the prediction a proximity score between each pair of observations. For each predictor, missing values are assigned by using
a weighted average of all nonmissing values across the data set, with the weights proportional to proximity scores. In other words, miss-
ing entries are given values that tend to agree most with nonmissing entries in similar observations in the data set. Note that this method
does not assume that interactions between predictors are the same across the input space, and can take local relationships into account
when entering missing values.
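As a concrete illustration, the sketch below performs a tree-based imputation in Python with scikit-learn. This is a minimal sketch, not the exact proximity-weighted RF imputation described above: scikit-learn does not expose Breiman's proximity scores directly, so an iterative imputer with a random-forest estimator is used as a reasonable stand-in. The file name and column list are hypothetical.

```python
# Sketch: tree-based imputation of missing predictor values (an approximation
# of the proximity-weighted RF imputation described in the text).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

predictors = ["COMPYR", "LATLEN", "FLUID", "PROP", "SurfX", "SurfY"]  # illustrative subset
wells = pd.read_csv("wolfcamp_wells.csv")                             # hypothetical file

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = pd.DataFrame(imputer.fit_transform(wells[predictors]), columns=predictors)
```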
Ordinary-Least-Squares (OLS) Regression. OLS regression (Draper et al. 1966), also called multiple linear regression, describes the
response as a linear combination of the predictors or functions of the predictors. Popular choices include main-effects models and quad-
ratic models. The former describes the response as a linear combination of the predictors only (i.e., a multidimensional plane). The lat-
ter includes pairwise interactions and quadratic terms, resulting in a surface that essentially comprises parabolas opening upward or
downward in each dimension. This can result in broad arching features or “saddles” within the surface. OLS regression assumes nor-
mally distributed residuals from the model fit. This is typically verified by plotting the residuals and using statistical tests to verify nor-
mality. If that inspection fails, some of the conclusions from the regression may not be correct.
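A minimal sketch of the two OLS variants just described (a main-effects model and a quadratic model with interactions and squared terms), written in Python with scikit-learn. Synthetic arrays stand in for the predictor matrix and response; a real analysis would substitute the imputed predictors and M12CO.

```python
# Sketch: main-effects vs. quadratic OLS fits on stand-in data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(171, 6))                        # stand-in predictor matrix
y = X @ rng.normal(size=6) + rng.normal(size=171)    # stand-in response

main_effects = LinearRegression().fit(X, y)

quad_terms = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad_terms.fit_transform(X)                 # adds x_i*x_j and x_i^2 columns
quadratic = LinearRegression().fit(X_quad, y)

print(main_effects.score(X, y), quadratic.score(X_quad, y))
```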
There exist extensions to OLS models that can capture more-diverse nonlinear behaviors than polynomials. For example, local linear
regression incorporates a weight matrix that predicts the response at a given location as a weighted linear combination of nearby obser-
vations. The weights typically are provided by a kernel function, such as a Gaussian, polynomial, or other exponential-type function.
Generalized linear models introduce nonlinearity through a link function. Finally, rather than fitting OLS models to the original predic-
tors, they can instead be fit to basis functions (e.g., natural cubic splines) that are derived from the predictors. For a review of these
methods and others, please refer to Hastie et al. (2001).
Fig. 2—This pairwise scatterplot shows the relationship between the response (M12CO, top-left corner) and a subset of the predictors (COMPYR, LATLEN, FLUID, PROP, SurfX, and SurfY). Plots on the diagonal contain histograms of the individual predictors and the response. Plots in the off-diagonal show the relationship between the variables in the associated row and column.
Decision Trees (DTs). DTs are useful tools for building simple, interpretable models to describe how a response relates to one or more predictor values
(Breiman et al. 1984). Examples of DT analysis for oil and gas applications include Perez et al. (2005), Yarus et al. (2006), and Popa
and Wood (2011). The common approach to constructing a DT is to use a classification-and-regression-tree model (Breiman et al.
1984), which recursively partitions the data set using splits on predictor values. At each branch in the tree, a predictor and threshold are
used to assign one set of observations to go down the left path, and the others to go down the right path. In regression trees, splits are
chosen to minimize an error metric (e.g., sum of squared residuals), and each terminal node yields a flat prediction at a constant value.
In classification trees, the goal is to predict a class label for each observation, and therefore the splits in the tree are chosen to maximally
separate the category labels of the observations between those two paths. Terminal nodes are assigned a group label that is predicted
for all observations reaching that node. DTs are typically “pruned” to earlier splits such that terminal nodes contain multiple
training observations.
Random Forest (RF). RF regression (Breiman 2001) is a tree-based approach that uses a technique called “bagging.” The model is an ensemble of sim-
ple regression trees, each of which contains splits on predictor values. Each split indicates whether an observation should take the left
or right branch of the tree dependent on a comparison of a specific predictor with a threshold value. The final nodes in the trees, called
leaves, contain the regression prediction. In RFs, each tree in the ensemble is trained using a bootstrap sample of the training data, and
a random subset of the predictors is considered for each split. This randomization not only allows each regression tree to focus on subtly
different aspects of the predictor/response relationship, but it also allows the ensemble to avoid overfitting to the training data, which is
characteristic of simple DTs.
From a geometric perspective, each regression tree in the RF model defines a step function that predicts constant values over rectan-
gular regions that partition the predictor space. By aggregating these trees, RF models can approximate arbitrarily complex nonlinear
surfaces, which makes them a powerful prediction tool. Other than selecting the number of trees in the ensemble, the only tunable
parameter of the RF model is the number of predictors considered at each split.
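A minimal sketch of an RF regression fit, highlighting the two tuning choices just mentioned (number of trees and number of predictors considered at each split). The parameter values and synthetic data are illustrative, not the settings used in the paper.

```python
# Sketch: random-forest regression with illustrative tuning values.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,   # number of trees in the ensemble
    max_features=4,     # predictors considered at each split
    random_state=0,
).fit(X, y)
y_hat = rf.predict(X)
```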
Gradient-Boosting Machines (GBMs). GBMs (Friedman 2001; Elith et al. 2008) are similar to RFs in the sense that they are also ensembles of regression trees. How-
ever, these trees are constructed sequentially rather than in parallel. Each new tree is constructed in such a way to compensate for the
shortcomings of the previous tree. That is, when one tree tends to fit poorly to the training data for particular types of predictor values,
the next tree will put more emphasis on observations in that problem area and make sure it predicts them well. The final model looks
like a linear-regression model with thousands of terms, where each term is a tree.
As is the case in RF models, the tree structure of GBMs provides the built-in capability to capture nonlinear behavior in the
response. However, GBMs require the specification of three tuning parameters, which include the maximum interaction size between
predictors; a shrinkage factor (i.e., learning rate); and the minimum number of observations allowed in the terminal nodes of the trees.
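A minimal sketch of a GBM fit showing where the three tuning parameters enter; the values chosen here are illustrative only.

```python
# Sketch: gradient-boosting regression with the three tuning parameters
# noted in the text -- interaction depth, shrinkage (learning rate), and
# minimum observations per terminal node.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=1000,     # number of sequential trees
    max_depth=3,           # limits the interaction size between predictors
    learning_rate=0.01,    # shrinkage factor
    min_samples_leaf=5,    # minimum observations in a terminal node
    random_state=0,
).fit(X, y)
```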
Support-Vector Regression (SVR). SVR (Drucker et al. 1997) is a technique closely related to the use of support-vector machines (Vapnik 2000), which are widely
used in classification tasks. These models use a simple linear-regression model to describe the response. However, the model is con-
structed in such a way that a “kernel trick” can be used to transform the data into a different space where the linear model makes sense.
This means that SVR models can fit nonlinear responses as long as a proper transformation function can be specified that transforms the
data to a space where the relationship is linear.
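A minimal sketch of an SVR fit with a radial-basis-function kernel, which supplies the nonlinear transformation (the “kernel trick”) described above. Standardizing the predictors first is a common practical step assumed here, and the kernel and regularization settings are illustrative.

```python
# Sketch: support-vector regression with an RBF kernel on standardized predictors.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1)).fit(X, y)
```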
Kriging Model (KM). The KM (Krige 1951; Cressie 1993), also called a Gaussian process, was originally developed for use in spatial
statistics. It contains two components: a trend and a spatial correlation structure. The trend term provides an underlying pattern to the
relationship between the response and the predictors (e.g., a linear-regression model or polynomial OLS regression model), and this pat-
tern will be relied on for prediction in regions where few data have been observed. Different types of Kriging assume different trend
terms. For example, ordinary Kriging assumes a constant mean across the predictor space, whereas universal Kriging assumes a polyno-
mial trend.
The correlation structure encourages response values to tend toward training responses where similar predictor levels were observed.
The influence a training observation exerts on the response is proportional to the similarity of the training predictors to the ones where
the response is being predicted. It should be noted that KMs are perfect interpolators. An assumption with this model is the correlation
structure itself, which implies that neighboring training observations will have similar responses. If two training observations are very
close in the predictor space, but have very different responses, this can cause issues in model fitting.
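A minimal sketch of a Gaussian-process (Kriging-type) fit using scikit-learn. A constant kernel times a squared-exponential kernel plays the role of the trend and spatial-correlation structure described above; with a near-zero noise term the model interpolates the training responses. The specific kernel and settings are an assumption for illustration.

```python
# Sketch: Gaussian-process (Kriging-type) regression with a squared-exponential
# correlation structure; near-zero alpha makes it an (approximate) interpolator.
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
km = GaussianProcessRegressor(kernel=kernel, alpha=1e-10, normalize_y=True).fit(Xs, y)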
Note that there are other modeling approaches not considered in the Wolfcamp analysis that appear elsewhere in oil and gas litera-
ture. The most popular of these is the artificial neural network (McCulloch and Pitts 1943; Hopfield 1982; Rumelhart et al. 1986;
Gevrey et al. 2003), which mimics a network of neurons in the brain, with each neuron calculating a linear combination of its inputs,
then passing that value into an activation function (typically a sigmoid) that measures the degree of excitement that neuron expresses
dependent on its inputs. From a mathematical perspective, a neural network is a chain of connected models, each of which applies a
nonlinear activation function to a linear combination of inputs, and is therefore just a nonlinear extension of linear models (Hastie et al.
2001). Artificial neural networks are becoming more ubiquitous in statistical software (Abadi et al. 2016), and can now be constructed
as easily as other machine-learning models. Other predictive techniques include multivariate adaptive-regression splines (Friedman
1991), generalized additive models (Tibshirani 1988), k-nearest neighbors (Hastie et al. 2001), elastic nets (Zou and Hastie 2005), and
naïve Bayes (Duda and Hart 1973; Langley et al. 1992).
Model Evaluation
One critical aspect of model selection is the evaluation of the goodness of fit, and the importance of this step is often overlooked.
A common approach to evaluating a model fit is to generate a scatterplot of actual response values in the training set against the
predicted response using the model. If all the points in the scatterplot lie near the 45° (1:1) line, this indicates a good model fit to the
training data. However, this does not necessarily mean that the model will work for future data collections.
For example, consider the model shown in Fig. 3, which is overfitting the training data set. That is, the model is placing too much
emphasis on reproducing the training set, and likely contains more degrees of freedom than are necessary to capture the underlying
shape of the curve producing these observations. This model is capturing not just the true underlying function, but also the noise in the
measurements, which makes it unlikely to produce good predictions going forward. However, in the model evaluation on the training
data, all the points lie along the 45° line, indicating a superb fit. Although this is a contrived example, many models can overfit when
certain conditions are met, and in a multidimensional space one cannot easily visualize the prediction surface generated by the model.
Fig. 3—This is an example of a poor model (red curve) that appears to fit well when evaluated solely against the training set.
A model like this is said to exhibit overfitting.
To avoid overfitting, it is important to move beyond using predictions on the training data as the sole measure of model quality. One
simple way to do this is to use an independent test set. This can either be a completely new data set (e.g., pilot data from a region where
the model is intended for use), or a “held out” portion of the training data set. In both cases, one can fit the model using the training por-
tion of the data set, and then evaluate the fit on the independent test observations.
A third method of model evaluation is called k-fold cross validation (Hastie et al. 2001). In this approach, the training data set is ran-
domly split into k different groups (commonly called “folds”). Next, each of the k groups is held out and the model is trained on the
remaining k–1 groups. That model is then used to make predictions on the group that was held out (Fig. 4). After cycling through all k
groups, there will be a single prediction for every observation in the data set, and the predictions were made using a model for which
that observation was not included in the training set. These cross-validated predictions can be used to evaluate the quality of the model-
fitting procedure, and can indicate whether any problems might be expected on future data collections.
Fig. 4—This diagram is a conceptual representation of k-fold cross validation. The full data set is split into k groups (here, k = 5), and each group is systematically “held out” as a test set for a model trained on the remaining k–1 groups. This yields one cross-validated prediction for each observation in the data set.
There are two important notes to make in this discussion of cross validation. First, one can extend the cross-validation procedure by
repeating the entire process with a different random selection of k groups. A repeated cross validation using r repeated runs of k ran-
domly selected groups will yield r different predictions on each of the observations. These can be averaged to compute goodness-of-fit
metrics, but they also give important information regarding the variability in model predictions depending on the characteristics of the
training set. Second, note that the models trained during cross validation are not the models one would use for prediction going forward;
rather, one would build a single predictive model using the full training set. The cross-validation procedure is only for evaluation pur-
poses and provides a better indicator on the robustness of the predictive model for future applications using new data.
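A minimal sketch of the k-fold and repeated cross-validation procedures described above, using scikit-learn's out-of-fold prediction helper on stand-in data; the model and fold counts are illustrative.

```python
# Sketch: 10-fold (and repeated) cross-validated predictions.
# cross_val_predict returns one out-of-fold prediction per observation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0)

cv_pred = cross_val_predict(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Repeated cross validation: rerun with r different random fold assignments
# to see how much the predictions vary with the makeup of the training folds.
repeats = [
    cross_val_predict(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=r))
    for r in range(5)
]
```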
Goodness-of-Fit Metrics
Many measures exist for evaluating the quality of a model fit. In this paper, two metrics are used: average-absolute error (AAE) and
mean-squared error (MSE). These two metrics are similar, and both attempt to capture the overall closeness of predictions to the evalua-
tion data. Let y_i be the true response for the ith observation, and ŷ_i be the predicted response for that observation. The AAE is defined
as the average magnitude of the difference between the true response and the predicted response (i.e., the average size of the residuals):

\mathrm{AAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad\qquad (1)

The MSE is defined similarly, except that the residuals are squared before they are averaged:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad\qquad (2)
Note that AAE has units matching those of the response (in bbl, in the example data set), whereas MSE is measured in squared units
of the response. Values closer to zero are desirable because they indicate smaller deviations between the truth and predictions (i.e.,
more-accurate prediction). MSE is typically preferred over AAE because of its well-known distributional properties, including being
continuously differentiable and being a sufficient statistic for normally distributed random processes.
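The two metrics translate directly into code; the short sketch below implements Eqs. 1 and 2 as written.

```python
# Sketch: the goodness-of-fit metrics of Eqs. 1 and 2.
import numpy as np

def aae(y_true, y_pred):
    """Average-absolute error: mean |y_i - yhat_i| (units of the response)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    """Mean-squared error: mean (y_i - yhat_i)^2 (squared response units)."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```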
Predictive-Modeling Results
Model-fitting results on the Wolfcamp data set are summarized in Fig. 5. Each plot shows the true response (M12CO) on the horizontal
axis and the predicted response on the vertical axis. Points on the diagonal dotted line indicate perfect prediction. Each row of plots
shows predictions from one type of model (OLS, RF, GBM, SVR, and KM), while each column shows results for a different model-
evaluation type.
The left column shows independent-validation results: a random 20% subset of the wells was held out as test data. The model was
then fit to the remaining 80% of the data set and evaluated on the 20% hold-out set. The points in
the plots in the left column only correspond to those predictions on the hold-out segment of the data set. For the cross-validation predic-
tions (center column), a 10-fold cross validation was used as a further refinement of the fivefold cross-validation approach shown in
Fig. 4. The points in these plots show the actual vs. the cross-validated predictions of each of the wells in the data set. The right column
shows the results from training and predicting on the full data set, which is the conventional approach to evaluating goodness of fit.
Notice that the independent validation and cross-validation results (left and center columns) tell a much-different story from the pre-
dictions on the full training set (right column). Compared with OLS regression, many of the models show a dramatic reduction in error
in both the AAE and MSE metrics, if one accepts the full-training-set approach for model evaluation. However, this reduction is more
modest for the other methods of model evaluation. The extreme case is the KM, which is a perfect interpolator and hence, by design,
forces the model fit through the training observations. Prediction on the full training set yields an apparent perfect predictive ability,
which clearly will not hold up in future data sets. One might argue that this would be obvious to an engineer or geoscientist examining
these plots. However, for a case like the RF, it is not so clear that the predictions on the training data are biased; it is only when com-
pared with the independent validation and cross-validation plots that the overfitting is revealed.
Having a more-realistic understanding of model performance has its own intrinsic merit, but it can also be useful for identifying
issues with data collection and availability. In this case, the poor predictive ability of the models, especially for high-producing wells,
likely results from a data deficiency. Because of restricted availability, the predictors in this data set only capture information regarding
well completion and operation, but nothing regarding the local geology around and within the well; thus, the addition of geological pre-
dictors may improve the accuracy of the models. Generally, when validated or cross-validated model performance is poorer than
expected, it can be an indicator of similar data shortcomings that can be investigated and addressed moving forward.
In summary, basing one’s expectations solely on the full-training-set-based predictions could lead to an optimistic perception of what
the model accuracy will be on future test data. This could lead to disappointment when those levels of accuracy are not met. Instead, it
may be wiser to adopt an alternative, albeit more robust, method of model evaluation such as a k-fold cross validation. Although the
results may not be as spectacular as far as goodness of fit on current data is concerned, they should align closer with actual model per-
formance on future data collections. Not only will this allow stakeholders to calibrate their expectations of the model as it is applied on
new data, but it can also identify data gaps that will inform further development in data collection and management practices.
DT Analysis
One strategy for tackling predictive problems is to simplify the question being asked. The predictive regression models developed in
the preceding section were attempting to pinpoint the exact cumulative first-year production (M12CO) for a given set of well character-
istics. However, suppose the real aim of this exercise is to give a simple “go/no go” solution on constructing a well. In this case, accu-
rate prediction of M12CO is not necessarily required. One needs only to predict whether the well will be a “good” well (relatively large
M12CO) or a “bad” well (relatively low M12CO). One way of simplifying this problem is to change the modeling effort from a regres-
sion problem to a classification problem. That is, the response can be binned into categories, and classification-tree models can be used
to predict into which category a well falls.
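A minimal sketch of this recasting: bin the response into top and bottom quartiles, drop the middle, and fit a shallow classification tree. The file, column names, and tree settings are illustrative assumptions, and the predictors are assumed to have been imputed as described earlier.

```python
# Sketch: converting the regression problem into a classification problem.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

wells = pd.read_csv("wolfcamp_wells.csv")                       # hypothetical file
lo, hi = wells["M12CO"].quantile([0.25, 0.75])
subset = wells[(wells["M12CO"] <= lo) | (wells["M12CO"] >= hi)].copy()
subset["label"] = (subset["M12CO"] >= hi).map({True: "Top 25%", False: "Bottom 25%"})

predictors = ["PROP", "LATLEN", "TVDSS"]                        # illustrative subset
subset = subset.dropna(subset=predictors)                       # or use imputed values

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(subset[predictors], subset["label"])
```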
For the Wolfcamp data set, the top 25% and bottom 25% of producing wells were identified, and the middle 50% of the wells were
removed. A classification tree was then built to separate the top and bottom 25% groups. The result is shown in Fig. 6. The tree begins
at the top of the figure, where the first split checks whether the proppant used is less than 1,405,000 lbm. If so, a well observation moves
down the left path; otherwise, it goes right. Subsequent splits work in the same way, until eventually the observation reaches a terminal
node that contains a prediction. The text at the terminal nodes in this tree indicates how many training observations of each type
(“Bottom 25%” or “Top 25%”) ended up in that node.
Fig. 5—Actual vs. predicted 12-month cumulative oil (×1,000 bbl) for each model type (rows: OLS, RF, GBM, SVR, and KM) under each evaluation type (columns: prediction on held-out test data, 10-fold cross-validation predictions, and prediction on the full training data set). Each panel is annotated with its AAE and MSE; the annotations recovered from the extraction for the RF row read AAE = 23.71, 25.54, and 11.11 and MSE = 1106.49, 1272.06, and 244.86 across the three columns.
[Fig. 6 tree, as extracted: root split on PROP < 1.405×10^6; further splits on TVDSS ≥ –8,294 and LATLEN ≥ 5,362; terminal-node counts (Bottom 25%/Top 25%): 26/0, 8/2, 7/3, 21/2, 12/58, and 6/15.]
Fig. 6—A DT that separates the top 25% and bottom 25% of producing wells. If the expression at a split is true, an observation
goes down the left branch; otherwise, it goes down the right branch. The fractions in the terminal nodes indicate the proportion of
training observations from each group in that node (listed as “Bottom 25%” and “Top 25%”).
DTs are easy to interpret. Not only do they indicate which predictors are influential in determining the response category, they also
identify critical values at which these categories change. In this case, there are two general paths to obtaining a top 25% producing
well. For wells using lower amounts of proppant (PROP < 1.405 × 10^6 lbm), the goal is to have a longer lateral (LATLEN ≥ 2,756 ft)
and a greater vertical depth (TVDSS < –8,294 ft). For wells using larger amounts of proppant (PROP ≥ 1.405 × 10^6 lbm), the goal is
again to have a greater vertical depth (TVDSS < –8,100 ft), and to have a lateral that is not too long (LATLEN < 5,362 ft).
Fig. 7 shows a view of this DT from the perspective of the wells in the predictor space. Fundamentally, DTs partition the predictor
space into blocks of similar observations. In a scatterplot of two predictors, this appears as vertical and horizontal segmentation of the
plot. In the top-left plot, the first split at PROP = 1.405 × 10^6 appears as a vertical division of the plot. Within each of those divisions,
the splits on LATLEN (2,756 ft on the left branch and 5,362 ft on the right) serve to further subdivide the plot. Unfortunately, 2D views
can only show so much when three predictors are involved in the tree. 3D visualization can be used in this case to better understand
where the top 25% of wells are compared with the bottom 25% (Fig. 8). In this case, the bottom 25% cases form a distinct cluster at
low proppant levels and short lateral lengths. The other bottom 25% wells are mixed with the top 25% wells, but tend toward higher
depths and low proppant levels.
Table 3 shows a “confusion matrix” that summarizes the separability of the two classes in the training set. The value in each cell
describes how many wells of the true category indicated in the row header were in a terminal node for which the majority category was
the one indicated in the column header. Because 62 of the 80 true bottom 25% wells were in “Bottom 25%” terminal nodes, this yields a cor-
rect identification rate of 62/80 = 77.5%. A similar calculation gives a correct identification rate of 73/80 = 91.3% for the top 25% wells.
Overall, the rate is (62 + 73)/160 = 84.4%. This indicates a reasonable ability to separate the two classes. The terminal nodes in the tree
can be examined to determine where the evidence for splitting the classes is perhaps a bit weak. In this case, the “Top 25%” node with
a 6:15 ratio and the “Bottom 25%” node with the 7:3 ratio indicate places where the evidence is weaker.
Variable Importance
In some applications, the objective may not be to build a predictive model for a given response, but instead to identify the drivers of
that response among a large set of predictors. This is typically called screening. For example, in the Wolfcamp data set, the aim may be
to identify operational characteristics of horizontal shale wells that tend to correlate with higher well performance. With fewer predic-
tors to focus on, a screening exercise could guide the development of simpler predictive models or make experimental designs feasible
with a smaller number of runs.
There are many different approaches to measuring variable importance. One easy approach that is not tied to a particular model is
called R2 loss (Mishra et al. 2009). This method works for any regression model, and the reasoning is that if an influential predictor is
removed from a model, the accuracy of that model will dramatically fall. Alternatively, if a superfluous predictor is removed from the
model, there should be little to no effect on the accuracy.
Fig. 7—These scatterplots show the splits made by the top 25% vs. bottom 25% DT. Note how the tree is fairly efficient at partition-
ing the predictor space into the region that contains primarily the top 25% wells.
Model fit can be assessed using pseudo-R2, which is defined in Eq. 3. Pseudo-R2 compares the sum of squared differences between
the true responses yi and predicted responses y^i to the overall sum of squares, which is proportional to the variance of the responses.
That is, it measures how much of the variability in the response is explained by the model. Note that while in a linear-regression model,
the pseudo-R2 is bounded between zero and unity, this is not the case for a general-regression model. When a regression model fits the
data worse than a flat line at the mean response does, the pseudo-R2 will be negative.
R_p^2 = 1 - \frac{SS_{\mathrm{model}}}{SS_{\mathrm{total}}} = 1 - \frac{\displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad\qquad (3)

To measure variable importance, one can compute the pseudo-R^2 using all the predictors, and then compute it again for a reduced model
that uses all the predictors except the predictor of interest. The R^2 loss is then the difference between the pseudo-R^2 of the full model
and that of the reduced model. A larger loss in the pseudo-R^2 indicates a predictor with higher influence on the response.
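A minimal sketch of the R^2-loss screening procedure: drop one predictor at a time, refit, and record the decrease in pseudo-R^2 (Eq. 3). Computing the pseudo-R^2 on cross-validated predictions, as done here, is one reasonable choice rather than a prescription from the text; the model and data are stand-ins.

```python
# Sketch: R2-loss variable importance via leave-one-predictor-out refitting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

def pseudo_r2(y_true, y_pred):
    ss_model = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_model / ss_total

def cv_r2(X_sub):
    pred = cross_val_predict(RandomForestRegressor(random_state=0), X_sub, y, cv=10)
    return pseudo_r2(y, pred)

full = cv_r2(X)
losses = {j: full - cv_r2(np.delete(X, j, axis=1)) for j in range(X.shape[1])}
```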
Figs. 9 and 10 show the R2-loss rankings for the Wolfcamp data set. When measuring variable importance, it can be useful to com-
pute the ranks using several different predictive models. This can give a more-robust sense of which predictors are important. In this
case, there is a great deal of disagreement among the models as to which predictors are the most influential. The depth (TVDSS) is pop-
ular among all models. Three of the four models also put weight on the amount of proppant used (PROP), the length of the lateral
(LATLEN), and the amount of fracturing fluid used (FLUID). From a physical standpoint, the importance of these variables
makes sense because they clearly affect the stimulated reservoir volume (LATLEN, PROP, FLUID), the productivity index of the well
(LATLEN), as well as the intrinsic energy in the reservoir (TVDSS)—all of which contribute to the cumulative production over the first
12 months.
The same reasoning applied to the R2 loss also holds true when other measures of model importance are used. For example, one
could measure the loss in other quality-of-fit measures such as the Akaike (1973) information criterion or the Bayesian information cri-
terion (Schwarz 1978). One potential problem with this approach is that it is not well-suited to situations where predictors are highly
correlated. For example, suppose a pair of important predictors happen to be correlated. When one of the predictors is removed from
the model, the other predictor can stand in to compensate for the loss, resulting in only a small reduction in the model fit. As a result,
both correlated predictors appear to be unimportant, and may be removed from future consideration. Indeed, this is the case in the
Wolfcamp data set for some pairs of predictors. For example, FLUID and PROP have a high correlation, which can be observed in the
pairwise scatterplot in Fig. 2. This underlines the importance of performing a preliminary exploratory data analysis before jumping into
model building. It can help to identify potential issues to consider moving forward.
Fig. 8—These plots show 3D views of the Wolfcamp data set. The three predictors on the axes are the ones that were selected for
splits in the top 25% vs. bottom 25% classification-and-regression-tree analysis. The bottom 25% of wells (yellow triangles) form
two groups. The first is a distinct cluster at low PROP and low LATLEN. The second are more mixed with the top 25% (blue circles),
but primarily occur either at higher true vertical depth subsea (TVDSS) or low PROP values.
Table 3—Confusion matrix summarizing the separability of the two classes in the training set.

                         Predicted Bottom 25% Wells   Predicted Top 25% Wells   Total   Correct ID Rate
True bottom 25% wells               62                          18                80        77.5%
True top 25% wells                   7                          73                80        91.3%
Total                               69                          91               160        84.4%
Other model-specific methods of measuring variable importance exist as well. For example, among the predictive models described
in Table 2, RFs and GBMs have custom methods for identifying influential predictors. In RFs, the prediction strength of each variable
is measured by calculating the increase of MSE when that variable is permuted while all others are left unchanged. The rationale behind
the permutation step is that if the predictor variable is not influential, rearranging its values among the training observations will not
change the prediction accuracy of the model significantly. For GBMs, the variable importance is dependent on the number of times a
predictor variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged
over all trees. Variable importance results for these two DT ensemble methods are shown in Fig. 11.
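A minimal sketch of the two model-specific importance measures just described. Note that scikit-learn's generic permutation-importance routine, used here, permutes predictors on a supplied data set rather than on the out-of-bag samples used by the classical RF measure; it is a stand-in, not the exact procedure. The GBM relative influence is exposed directly as split-gain-based feature importances.

```python
# Sketch: permutation importance for an RF and relative influence for a GBM.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=171, n_features=12, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
perm = permutation_importance(rf, X, y, scoring="neg_mean_squared_error",
                              n_repeats=10, random_state=0)
rf_importance = perm.importances_mean          # increase in MSE when a predictor is permuted

gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
gbm_importance = gbm.feature_importances_      # split-gain-based relative influence
```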
As is clear from the results shown in this section, variable-importance rankings can differ widely from method to method. It can
be useful to try several different methods, as well as the graphical comparative-assessment methodology discussed earlier, to gain more-
robust feedback regarding which variables are most influential in driving the response.
Fig. 9—R2-loss rankings of the predictors for the Wolfcamp data set. Average ranks over the four predictive models, as extracted: TVDSS (2.2), opt2B (4.8), SurfY (5.0), SurfX (5.2), FLUID (6.2), PROP (6.5), LATLEN (6.8), COMPYR (7.8), STAGE (8.8), PROPCON (8.8), DA (10.2), AZM (11.5), opt2C (11.5), opt2D (12.8), opt2A (13.8), opt2E (14.2).
Fig. 10—These plots show a different view of the R2-loss procedure on the Wolfcamp data set. Both plots show how the predictors
stack up in terms of the average rank over the four predictive models vs. the variability in those rankings. The left plot shows hori-
zontal boxplots of the rankings of the predictors, sorted from bottom to top by rank. The right plot shows a scatterplot of average
rank vs. the standard deviation of those ranks. TVDSS is clearly an influential predictor, with high rank and low variability. FLUID
also has a reliable rank in the middle of the pack. Finally, opt2A is determined as not important, with consistently low rankings.
Fig. 11—These plots show variable importance results on the Wolfcamp data set for RF (left) and GBM (right).
Concluding Remarks
Data from wells completed in the Wolfcamp Shale Formation in the Permian Basin are used to demonstrate how statistical methods can
provide data-driven insights into production performance. Predictive models for the first 12 months of production are built using multi-
ple input parameters characterizing well location, architecture, and completions. Regression techniques used include OLS, as well as
other advanced regression methods such RFs, SVR, GBM, and KM. Models are evaluated using goodness-of-fit metrics for the training
data set itself, a hold-out (validation) data set, and k-fold cross validation. In addition, DT analysis is applied to identify factors separat-
ing the top 25% of wells from the bottom 25% of wells. Finally, a variety of variable importance techniques are used to identify the
most-influential subset of parameters.
As far as regression analysis is concerned, our main conclusion is that relying solely on training-data goodness of fit for model
ranking will consistently favor perfect interpolators, such as Kriging, even though their predictive performance under
cross validation may not be equally robust. In other words, the problem may be more nuanced than finding a single model that best fits
the data. It may be more beneficial to build multiple input/output models during the data-fitting process, and use a statistical model aver-
aging approach based on the k-fold goodness-of-fit statistics to estimate the “weight” of each model and combine model predictions
(Mishra 2012).
DT-based methods have not been as common as regression-based approaches in oilfield exploration and production (E&P) applica-
tions. However, they offer greater interpretability and can be more amenable to the formulation of a decision problem. In particular, the
DT can provide useful insights as to what variables or combinations of variables drive high-end and low-end production performance.
In summary, we note that there is a growing trend toward the use of statistical and machine-learning techniques for oil and gas applications.
Nomenclature
R_p^2 = pseudo-R^2 (goodness of fit), see Eq. 3
SS_model = sum of squares for response variable explained by the model
SS_total = total sum of squares for response variable
y_i = true response for ith observation
ŷ_i = predicted response for ith observation
Acknowledgments
This study was supported by a Battelle Internal research and development grant. We thank our colleagues Rob Carnell and Rod Osborne
for a careful review of this manuscript. We also thank our technical editors for several useful comments and suggestions that helped
improve the readability of the paper.
References
Abadi, M., Barham, P., Chen, J. et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. Proc., 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI ’16), Savannah, Georgia, 2–4 November.
Ahmed, U. and Meehan, D. N. 2016. Unconventional Oil and Gas Resources: Exploitation and Development. Boca Raton, Florida: CRC Press.
Akaike, H. 1973. Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information
Theory, ed. B. N. Petrov and B. F. Csaki, 267–281. Budapest, Hungary: Academiai Kiado.
Bhattacaharya, S., Maucec, M., Yarus, J. et al. 2013. Causal Analysis and Data Mining of Well Stimulation Data Using Classification and Regression
Tree with Enhancements. Presented at the SPE Annual Technology Conference and Exhibition, New Orleans, 30 September–2 October. SPE-
166472-MS. https://doi.org/10.2118/166472-MS.
Breiman, L. 2001. Random Forests. Mach. Learn. 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, L., Friedman, J., Stone, C. J. et al. 1984. Classification and Regression Trees. Boca Raton, Florida: CRC Press.
Cipolla, C. L., Lolon, E. P., Erdle, J. C. et al. 2010. Reservoir Modeling in Shale-Gas Reservoirs. SPE Res Eval & Eng 13 (4): 638-653. SPE-125530-
PA. https://doi.org/10.2118/125530-PA.
Cressie, N. 1993. Statistics for Spatial Data. New York City: Wiley.
Dimitriadou, E., Hornik, K., Leisch, F. et al. 2011. e1071: Misc Functions of the Department of Statistics. TU Wien. R Package Version 1.6.
Ding, D. Y., Wu, Y.-S., Farah, N. et al. 2014. Numerical Simulation of Low Permeability Unconventional Gas Reservoirs. Presented at the SPE/EAGE
European Unconventional Resources Conference and Exhibition, Vienna, Austria, 25–27 February. SPE-167711-MS. https://doi.org/10.2118/
167711-MS.
Draper, N. R., Smith, H., and Pownell, E. 1966. Applied Regression Analysis, Vol. 3. New York City: Wiley.
Drucker, H., Burges, C. J., Kaufman, L. et al. 1997. Support Vector Regression Machines. Proc., 9th International Conference on Neural Information
Processing Systems, Denver, 3–5 December, 155–161.
Duda, R. O. and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York City: John Wiley & Sons.
Elith, J., Leathwick, J. R., and Hastie, T. 2008. A Working Guide to Boosted Regression Trees. J. Anim. Ecol. 77 (4): 802–813. https://doi.org/10.1111/
j.1365-2656.2008.01390.x.
Friedman, J. H. 1991. Multivariate Adaptive Regression Splines. Annal. Stat. 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.
Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annal. Stat. 29 (5): 1189–1232. https://doi.org/10.1214/aos/
1013203451.
Geladi, P. and Kowalski, B. R. 1986. Partial Least-Squares Regression: A Tutorial. Anal. Chim. Ac. 185: 1–17. https://doi.org/10.1016/0003-
2670(86)80028-9.
Gevrey, M., Dimopoulos, I., and Lek, S. 2003. Review and Comparison of Methods to Study the Contribution of Variables in Artificial Neural Network
Models. Ecol. Model. 160 (3): 249–264. https://doi.org/10.1016/S0304-3800(02)00257-0.
Gupta, S., Fuehrer, F., and Jeyachandra, B. C. 2014. Production Forecasting in Unconventional Resources Using Data Mining and Time Series Analysis.
Presented at the SPE/CSUR Unconventional Resources Conference, Calgary, 30 September–2 October. SPE-171588-MS. https://doi.org/10.2118/
171588-MS.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York City:
Springer.
Hopfield, J. J. 1982. Neural Networks and Physical Systems With Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. USA 79 (8): 2554–2558.
Jared Schuetter is a principal research statistician at Battelle Memorial Institute, and has been with the company for 8 years. His
research interests include exploratory data analysis, machine learning, image analysis, data visualization, and software devel-
opment. Schuetter holds a PhD degree in statistics from Ohio State University.
Srikanta Mishra is Institute Fellow and Chief Scientist for Energy at Battelle Memorial Institute. Previously, he worked for the geosys-
tems consulting company Intera, and as an adjunct professor of petroleum engineering at the University of Texas at Austin.
Mishra’s research interests include computational modeling and data-analytics applications for oil and gas problems. He is the
author of the book Applied Statistical Modeling and Data Analytics: A Practical Guide for the Petroleum Geosciences, published
by Elsevier, and will serve as an SPE Distinguished Lecturer during the 2018–2019 season. Mishra holds a PhD degree in petroleum
engineering from Stanford University.
Ming Zhong is a data scientist at Shell. Previously, he worked as a statistician with Baker Hughes, Capital One, and Abbott Labs.
Zhong’s current research interests focus on the application of machine learning in the oil and gas sector. He has authored or
coauthored more than 15 technical papers. Zhong holds a PhD degree in statistics from Texas A&M University. He is a member
of SPE.
Randy LaFollette is retired, and previously spent 39 years as a technical professional in the oil and gas industry, working for the
Western Company, Reservoirs Incorporated, BJ Services, and Baker Hughes. During his career, LaFollette developed new meth-
ods and concepts to display and analyze engineering data on geological maps to improve interpretation of unconventional-
reservoir-production results. He holds a bachelor’s degree in geological science from Lehigh University. LaFollette also volun-
teered extensively for SPE, the American Association of Petroleum Geologists, and the Houston Geological Society. He worked
to bring the benefits of multivariate statistical analysis to the study of unconventional reservoirs, coauthored numerous papers,
and served as an SPE Distinguished Lecturer during the 2015–2016 season.