Data Analytics
Srikanta Mishra
[email protected]
+1-614-424-5712
New insights about the reservoir from “data mining” can help increase operational efficiencies

[Figure: big data (volume, velocity, variety) → examine data → understand “what does the data say” → prediction / learning → make better decisions → actionable information]
With analytics, you discern not only what your customers want but also how much they’re willing to pay and what keeps them loyal. You look beyond compensation costs to calculate your workforce’s exact contribution to your bottom line. And you don’t just track existing inventories; you also predict and prevent future inventory issues.
[Figure: overlapping terminology: data analytics, machine learning, statistical learning, data mining, knowledge discovery]
Unsupervised Learning
• Data reduction and clustering
• PCA, k-means, self-organizing maps
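A minimal base-R sketch of two of the unsupervised methods above on synthetic data:

  set.seed(1)
  X <- matrix(rnorm(300), ncol = 3)  # synthetic data: 100 obs x 3 vars

  pca <- prcomp(X, scale. = TRUE)    # PCA on the correlation scale
  summary(pca)                       # variance explained by each PC

  km <- kmeans(X, centers = 3)       # k-means clustering with k = 3
  table(km$cluster)                  # cluster sizes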
Supervised Learning
• Regression and classification
• Random forest, SVM, neural nets, kriging
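A minimal sketch of one supervised method from the list above, a single-hidden-layer neural network regression; this assumes the CRAN package nnet and synthetic data:

  library(nnet)
  set.seed(1)
  df <- data.frame(x = runif(200))
  df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.1)  # non-linear target

  nn <- nnet(y ~ x, data = df, size = 8, linout = TRUE, trace = FALSE)  # 8 hidden units
  predict(nn, data.frame(x = 0.25))  # prediction at a new input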
Predictive Maintenance
• Real-time prediction of system response (drilling, fluid injection)

Reservoir Management
• Identifying factors for improved performance
[Figure: Initial Well Potential (BOPD) vs. Net Pay (ft), 0–90 ft; fitted line y = 2.0626x + 97.397, R² = 0.5385]
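A minimal R sketch of fitting a straight line as in the figure above; the data here are synthetic stand-ins for the actual well data:

  set.seed(1)
  net_pay <- runif(50, 5, 90)                    # synthetic net pay, ft
  ipp <- 2 * net_pay + 100 + rnorm(50, sd = 30)  # synthetic initial potential, BOPD

  fit <- lm(ipp ~ net_pay)       # least-squares line y = a + b*x
  coef(fit)                      # intercept and slope
  summary(fit)$r.squared         # R² of the fit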
[Figure: regression diagnostics: residuals vs. X variable, and residuals vs. standard normal deviate (normal probability plot)]
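A minimal sketch of producing the two diagnostic plots above, continuing the fit object from the previous sketch (base R only):

  r <- resid(fit)
  plot(net_pay, r, ylab = "Residuals")   # residuals vs. predictor
  abline(h = 0, lty = 2)
  qqnorm(r); qqline(r)                   # residuals vs. standard normal deviate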
Robust measure for strength of association (linear/non-linear)

[Figure: correlation matrix for MMP and fluid/reservoir variables: T, C1, C2–C6, C7+, MW C5+, MW C7+, API, Vol, Int, V/I]
Multivariate Analysis
Ch. 5, Mishra and Datta-Gupta (2017)
• Eigenvalues: solution of |C − λI| = 0
  C = correlation matrix
  I = identity matrix
  λ = vector of eigenvalues

Correlation Matrix
       x1     x2     x3
x1  1.000  0.886  0.750
x2  0.886  1.000  0.889
x3  0.750  0.889  1.000

• An N (obs) × P (var) dataset has P eigenvalues

Eigenvalues
λ1 = 2.685
λ2 = 0.251
λ3 = 0.065

• Eigenvalue = variance of corresponding PC
• Σ λi = 3 = P (trace of C)
• Criteria for how many PCs to retain:
  ▪ Scree plot (keep all PCs above the “floor” level)
  ▪ Kaiser criterion (keep all PCs with eigenvalue > 1)
  ▪ Variance threshold (keep enough PCs to explain a chosen fraction of total variance)

[Figure: scree plot of eigenvalues with percent variance explained per PC]
• Eigenvectors: solution of (C − λiI) ui = 0
  C = correlation matrix
  I = identity matrix
  λi = eigenvalue for i-th PC
  ui = eigenvector for λi

Eigenvectors
     u1      u2      u3
  0.567   0.711   0.416
  0.597  −0.007  −0.802
  0.567  −0.703   0.429

• Eigenvectors are coefficients of variables in the linear equations defining the PCs
• They also define the rotation from original variable space to PC space
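A minimal base-R sketch reproducing the eigen-decomposition above from the 3×3 correlation matrix:

  C <- matrix(c(1.000, 0.886, 0.750,
                0.886, 1.000, 0.889,
                0.750, 0.889, 1.000), nrow = 3, byrow = TRUE)

  e <- eigen(C)
  e$values             # eigenvalues 2.685, 0.251, 0.065 = PC variances
  e$vectors            # columns are the eigenvectors u1, u2, u3
  sum(e$values)        # = 3 = P, the trace of C
  which(e$values > 1)  # Kaiser criterion: retain PC1 only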
Basic Concepts
Ch. 8, Mishra and Datta-Gupta (2017)
https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
Regression Tree
▪ Partition multidimensional space into rectangular regions with constant values or class labels
  [Figure: splits X1 < t1, X2 < t2, X2 < t3 leading to regions R1, R2, R3, R4]

Random Forest
▪ Build ensemble of trees using random subsets of observations and predictors

Gradient Boosting Machine
▪ Build sequence of trees that address shortcomings of each previous fitted tree

Support Vector Machine
▪ Find hyperplane maximizing separation of data using a non-linear transform of parameter space

Artificial Neural Network
▪ Inputs mapped to outputs via hidden units using a sequence of linear and non-linear transforms

Gaussian Process Emulation
▪ Multidimensional interpolation using the trend and autocorrelation structure of the data
[Figure: workflow: full dataset used to train, then predict]
https://christophm.github.io/interpretable-ml-book/
Recall (confusion matrix): https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html
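A minimal base-R sketch of building a confusion matrix and computing recall from hypothetical predicted and actual class labels:

  actual    <- factor(c(1, 1, 1, 0, 0, 1, 0, 1, 0, 0))
  predicted <- factor(c(1, 0, 1, 0, 1, 1, 0, 1, 0, 0))

  cm <- table(Predicted = predicted, Actual = actual)  # confusion matrix
  TP <- cm["1", "1"]; FN <- cm["0", "1"]
  TP / (TP + FN)   # recall: fraction of actual positives correctly identified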
Machine Learning Methods
• Variety of modeling methods
• Three types of model validation:
  − full training data
  − 10-fold cross-validation (CV)
  − held-out test data
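A minimal base-R sketch of 10-fold cross-validation for a regression model, reporting RMSE; the data and model here are hypothetical placeholders:

  set.seed(1)
  df <- data.frame(x = runif(200))
  df$y <- 3 * df$x + rnorm(200)                      # placeholder data
  folds <- sample(rep(1:10, length.out = nrow(df)))  # assign rows to 10 folds

  rmse <- sapply(1:10, function(k) {
    fit  <- lm(y ~ x, data = df[folds != k, ])        # train on 9 folds
    pred <- predict(fit, newdata = df[folds == k, ])  # predict held-out fold
    sqrt(mean((df$y[folds == k] - pred)^2))
  })
  mean(rmse)   # cross-validated RMSE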
Model Name   RMSE (×1,000 BBL)
M1           37.57
M2           37.45
M3a          36.21
M3b          36.15
LPM          47.12
QPM          40.03
SVR          39.00
RF           38.33
GBM          40.40
Crystalline Dolomite
• Identify vugs in a single well using image logs and core samples
• Output file is a Synthetic Vug Log, SVL (0–1): probability of vugs
• Random Forest
  ▪ R² = 0.96
  ▪ RMSE = 7.4 ft/hr
  ▪ Mean error ≈ 5%
• Linear regression
  ▪ R² = 0.42
  ▪ RMSE = 18.4 ft/hr
  ▪ Mean error ≈ 14%
• Classification tree analysis for identifying rock types from basic well log attributes
• Accounting for missing well logs
• Application for permeability prediction in Salt Creek field
• Identifying performance drivers and completion effectiveness for Marcellus shale wells
• Predictive model using ANN (Artificial Neural Networks)
• Role of different variables evaluated
• Building prognostic classifier for specific turbogenerator failures during startup
• Data from offshore facility – extraction of features
• RUSBoost and RF models
• Multi-fold validation approach for evaluation

[Figure: temperature (deg C) signals; test accuracy on validation set]
Example [5]
Arumugam et al., SPE-184062, 2016
• Processing of daily drilling data to identify drilling anomalies / best practices
  (example annotations: “Drill”, “Directional Drill”, “Connections increased”, “observed excess drag”, “observed fresh cuttings”)
▪ Information retrieval
▪ Knowledge management
Software Demo
• https://cran.r-project.org/
• https://rattle.togaware.com/
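A minimal sketch of getting started with the tools above; rattle provides a point-and-click data-mining GUI on top of R (installation from CRAN assumed):

  install.packages("rattle")   # from CRAN
  library(rattle)
  rattle()                     # launches the Rattle GUI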
[Figure: Random Forest demo output, N = 81]
▪ Data visualization/communication

• Beware the hype / manage expectations
• ML comes after posing the problem
• Don’t forget the physics
Regression/Classification Techniques
Ch. 8, Mishra and Datta-Gupta (2017)
[Figure: regression tree schematic: splits X1 < t1, X2 < t2, X2 < t3 partition the (X1, X2) plane at thresholds t1, t2, t3 into rectangular regions R1–R4; panels compare the true surface with the fitted regression tree]
• Advantages
▪ Interpretable
▪ Resistant to outliers
• Disadvantages
▪ Less accurate than other models
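A minimal sketch of fitting a regression tree like the schematic above, assuming the CRAN package rpart and synthetic data:

  library(rpart)
  set.seed(1)
  df <- data.frame(x1 = runif(300), x2 = runif(300))
  df$y <- ifelse(df$x1 < 0.5, 1, ifelse(df$x2 < 0.5, 2, 3)) + rnorm(300, sd = 0.1)

  tree <- rpart(y ~ x1 + x2, data = df)          # recursive binary splits
  print(tree)                                    # split rules and region means
  predict(tree, data.frame(x1 = 0.7, x2 = 0.3))  # prediction for a new point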
Random Forest
• Prediction
  ▪ Observation is passed through all of the trees in the ensemble
• Built-in cross-validation
  ▪ Since each tree sees only a subset of the data, the remaining observations are called out-of-bag samples
  ▪ For that tree, those out-of-bag samples are independent test data
• Advantages
▪ Can handle highly non-linear behavior
▪ Resistant to outliers
• Disadvantages
▪ Not easily interpretable
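A minimal sketch assuming the CRAN package randomForest, showing the built-in out-of-bag (OOB) error described above (continuing the synthetic df from the tree sketch):

  library(randomForest)
  rf <- randomForest(y ~ x1 + x2, data = df, ntree = 500)
  print(rf)                                    # reports the OOB error estimate
  predict(rf, data.frame(x1 = 0.7, x2 = 0.3))  # averages over all 500 trees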
Gradient Boosting Machine
• Prediction
  ▪ Observation is passed through all of the trees in the ensemble
• Algorithm
  ▪ Start from an initial constant model F_0(x), then repeat for m = 1, …, M:
    − Fit a model h_m(x) to the negative gradient of the loss; for squared error this is the residuals y − F_{m−1}(x)
    − Let F_m(x) = F_{m−1}(x) + h_m(x)
  ▪ Make predictions with the final model F_M(x)
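A minimal from-scratch sketch of this loop for squared-error loss, using rpart as the base learner h_m and the synthetic df from the earlier tree sketch; the shrinkage factor nu is an addition that is customary in practice:

  library(rpart)

  boost_fit <- function(df, M = 200, nu = 0.1) {
    Fm <- rep(mean(df$y), nrow(df))            # F_0: best constant model
    trees <- vector("list", M)
    for (m in 1:M) {
      df$r <- df$y - Fm                        # negative gradient = residuals
      trees[[m]] <- rpart(r ~ x1 + x2, data = df,
                          control = rpart.control(maxdepth = 2))
      Fm <- Fm + nu * predict(trees[[m]], df)  # F_m = F_{m-1} + nu * h_m
    }
    list(f0 = mean(df$y), trees = trees, nu = nu)
  }

  boost_predict <- function(model, newdata) {
    Fm <- rep(model$f0, nrow(newdata))
    for (tr in model$trees) Fm <- Fm + model$nu * predict(tr, newdata)
    Fm
  }

  gbm_model <- boost_fit(df)
  boost_predict(gbm_model, data.frame(x1 = 0.7, x2 = 0.3))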
• Advantages
▪ Invariant under all monotone transformations of the input variables
▪ Competitive accuracy
• Disadvantages
▪ Can easily overfit
▪ Can take a while to fit, but there are tricks for speeding this up
Support Vector Machine
▪ Alternate formulation via kernel functions:
  − Polynomial
  − Gaussian
  − Exponential
  − Hyperbolic Tangent
▪ Using kernels like the ones above can produce regression fits to non-linear surfaces
Kernel Trick - Schematic
• Advantages
▪ Can capture non-linear behavior
− Kernel function allows adaptability to many situations
▪ Accurate predictor compared to most methods
• Disadvantages
▪ Not easily interpretable
• Classification
  ▪ The separating hyperplane is β0 + xᵀβ = 0, with margin boundaries at β0 + xᵀβ = ±1
  ▪ Prediction is made using the sign of the hyperplane equation: Group A (Y = +1) vs. Group B (Y = −1)

[Figure: maximum-margin hyperplane separating Group A (Y = +1) from Group B (Y = −1)]
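A minimal sketch of an SVM classifier with a Gaussian (radial) kernel, assuming the CRAN package e1071 and synthetic two-class data:

  library(e1071)
  set.seed(1)
  df2 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
  df2$y <- factor(ifelse(df2$x1^2 + df2$x2^2 > 1, "A", "B"))  # non-linear boundary

  m <- svm(y ~ x1 + x2, data = df2, kernel = "radial")  # kernel trick handles non-linearity
  predict(m, data.frame(x1 = 0, x2 = 0))                # predicted class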