
SPE-196428-MS

From Data Collection to Data Analytics: How to Successfully Extract Useful Information from Big Data in the Oil & Gas Industry?

Mustafa A. Al-Alwani, Missouri University of Science and Technology; Larry K. Britt, NSI Fracturing;
Shari Dunn-Norman, Husam H. Alkinani, and Abo Taleb T. Al-Hameedi, Missouri University of Science and
Technology; Atheer M. Al-Attar, Enterprise Products; Mohammed M. Alkhamis, Missouri University of Science and
Technology; Waleed H. Al-Bazzaz, Kuwait Institute for Scientific Research

Copyright 2019, Society of Petroleum Engineers

This paper was prepared for presentation at the SPE/IATMI Asia Pacific Oil & Gas Conference and Exhibition held in Bali, Indonesia, 29-31 October 2019.

This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.

Abstract
Big data has become a major topic in many industries. Most recently, the oil and gas industry has developed a
special interest in data science as a result of the increasing availability of public-domain and commercial
databases. Utilizing and processing such data can help in making better future decisions. The aim of this
work is to provide an example and demonstrate methodologies for collecting and utilizing big data to support
better decision making in the oil and gas industry.
After reviewing a good number of papers and books on the applications of data analysis in the oil and gas
industry and other industries, and given that data analysis is the authors' area of expertise,
this paper was written to demonstrate real examples of data processing and validation workflows. This work
is intended to fill a gap in the literature, where many publications discuss only the importance
of data-driven analytics.
This paper provides an overview of the diverse and bulky data-generating sources in the oil and gas
industry, from the exploration phase to the end of the lifecycle of the well. It provides an example of
utilizing a public-domain database (FracFocus) and demonstrates a step-by-step workflow for collecting
and processing the data based on the objective of the analytics. Two real examples of descriptive and predictive
analytics are also demonstrated to show the power of having diverse, multi-source
databases. A framework for data validation and preparation is also shown to illustrate data quality checks
combined with best practices for data cleansing and outlier detection methodologies.
This paper provides a clear methodology for successfully applying data analysis, which can serve as
a guide for future data analysis applications in the oil and gas industry.

Introduction
The anecdotal saying that data is knowledge is not necessarily true. Data are generally events recorded
as they take place; processing and analyzing the data is what yields knowledge. With that knowledge, an
understanding of why things happen can be gained and then utilized in the desired direction to optimize
the outcomes.

Data analysis studies must go through several phases, starting with collecting and
understanding the data parameters, then passing through data visualization and descriptive
analysis, which are especially valuable for large and complex datasets. Visualizing the data helps convey insights
and an overall comprehension of the common trends buried within it. Good data visualization also
helps non-specialist or non-technical individuals understand technical and specialized data.
After understanding the data and investigating the trends, modeling and hypothesis testing follow
as the predictive analytics phase, which in turn leads to prescriptive analytics. The goal of data analytics is to

gain useful insights and extract valuable parameters that can help in producing informed decisions.
Big data can be defined as a collection of datasets that are large and complex in nature. The structured
and unstructured datasets grow large and fast over time as more data are generated and added to the original
database. The fast growth rate of such data means that traditional relational database systems and other
conventional statistical tools fail to manage these kinds of databases.
Starting with a big and complex dataset and subjecting it to rigorous analysis and interpretation phases
can generate educated solutions for business or operational success. To gain more knowledge,
descriptive analytics is taken a step further to investigate the trends and to understand the driving hypotheses
behind the data by building independent models and formulating assumptions about why the trends and patterns
exist. The resulting mathematical models can be run to validate the assumptions and theories behind
the trends and to generate artificial datasets. The generated models are then checked against other data sampled
from the original dataset that were never used to build the model.
Big data are characterized by the velocity at which they are generated, the volume of the items they represent,
the variety of data types they contain, and the degree of veracity they carry. Big data analytics comes into
play to find answers from the data by utilizing a combination of high-technology systems and mathematics,
which together are capable of processing all the information and providing valuable insights.
Data analytics starts with collecting the data, then storing, processing, and analyzing them to find patterns.
Data analytics can be classified into three major types. Descriptive analytics describes what happened in the past by
presenting the data through graphics and reports; it is not necessarily capable of explaining
why the patterns exist or what the future trend will be. Predictive analytics utilizes the data to predict what
could happen in the future or what the expected trends of certain parameters are. Prescriptive analytics
evaluates the outcomes of predictive analytics and decides how to proceed with future decisions or
alter previous designs. Big data analytics and the revolution of datafication help companies and public
administrations to better understand the data, find previously unnoticeable patterns, and provide better
solutions for existing and future operations.
The differences between traditional datasets and big datasets are characterized by many attributes.
The volume of data in traditional databases may reach gigabytes or even terabytes, while in big
datasets the volume can grow to petabytes and zettabytes. Traditional data are mostly structured (tables,
columns, rows), whereas big datasets are semi-structured and/or unstructured (no specific format, such as
emails, text, video, and audio files). Traditional data are stored centrally, while big data are stored in a
distributed fashion. The traditional data model is based on a strict schema, while big data use a flat schema.
Finally, traditional data exhibit complex interrelationships, while big data are almost flat with few relationships.
This study provides an overview of the data sources in the oil and gas industry, with an example from
the FracFocus database to demonstrate the process of collecting, processing, and analyzing the data based on the
objective of the analysis. In addition, a framework of data validation and preparation is shown to formulate
data quality rules combined with best practices of data cleansing and outlier detection methodologies. To
show the power of having multiple sources of data, two real examples of descriptive and predictive analytics
are also shown.

Oil and Gas Industry Data Sources


The oil and gas industry generates and stores an enormous amount of data. In the past, the industry was
tormented by accumulating too much data without the capability of deducing insightful outcomes
from the collected data (Feblowitz, 2013). The industry has recently started to transition from a
data collection mode to a proactive data utilization mode. A massive amount of data is generated
during all phases, from exploration to production and then abandonment.
The industry is going through an exponential increase in the number of sensors installed in upstream,
midstream, and downstream operations. The ultimate objectives of utilizing big data and analytics are to
reduce operational cost, increase safety, and enhance productivity. The following subsections briefly describe
the sources of data generated during the different phases of the lifecycle of oil and gas operations.

Exploration and Reservoir Modeling Phase


In the exploration phase, a big data approach is very important for processing seismic data. Seismic data
processing requires high-speed computational power and clustered, parallel, high-performance data
storage (Vega-Gorgojo et al., 2016). Such infrastructure is necessary to create 3D geological models
that explain the complex geological structures underground. The data obtained from seismic surveys are usually
coupled with and boosted by other data sources obtained from offset wells, such as rock types and logs. As
technology advances, an additional vast amount of data is being generated. For example, in offshore
exploration, the older narrow azimuth towed streamer (NATS) technology has been succeeded by wide-azimuth
(WAZ) acquisition, which generates more than six times the data of NATS. Seismic data recording is no longer
used only for exploration; it has recently been utilized to monitor wells that contain permanent geophones.
Those geophones collect data to investigate fluid front movement, monitor carbon capture and sequestration,
and detect microseismic events from fracture stimulation in nearby wells to map fracture growth and dimensions.
Seismic data centers collect and store up to 20 petabytes of information; to put that in perspective, this
represents 926 times the size of the data held by the U.S. Library of Congress (Beckwith, 2011). The new
techniques of machine learning and data analytics can be utilized as part of seismic interpretation
methods to find new discoveries.

Drilling Phase
Most modern rigs used to drill oil and gas wells are equipped with many sensors that continuously record
the entire operation from the time the well is spud until the well is completed and ready
for the production phase. Safety and performance are the two main key performance indicators
that drilling engineers and operators actively try to achieve. Data such as weight on
bit (WOB), rotary speed (RPM), pumping pressure (psi), and torque (lb-ft) are constantly
monitored and studied to optimize drilling time and reduce non-productive time (NPT). Multiple
sources of data are also recorded and collected by service companies on location, and all the data are gathered
and stored for each drilled well. Well logs are run several times during the drilling phase to collect formation
and fluid information; open hole logs can be acquired using wireline or logging-while-drilling
techniques. Cement bond logs are run to check the quality of the cement around the casing
and to make sure zonal isolation is achieved. Stress and geomechanical tests generate a large amount of data
associated with each well. Well control and safety monitoring data are also an important part of data
collection and monitoring. Data such as the percentage of gas in the mud while drilling through an active
reservoir formation and the types of gases coming to surface with the drilling mud (gas chromatography)
are constantly monitored and modeled to prevent accidents. The amount of drilling data generated, stored,
and interpreted is increasing dramatically with time, and as regulations require closer monitoring and the
capture of more data, data analytics is becoming more important as time progresses. Drilling data can be
utilized to minimize cost and NPT to achieve an optimum drilling

operation (Alkinani et al., 2018a; Alkinani et al., 2018b; Alkinani et al., 2018c; Alkinani et al., 2019; Al-
Hameedi et al., 2017a; Al-Hameedi et al., 2017b; Al-Hameedi et al., 2017c; Al-Hameedi et al., 2018).

Production Phase
Production data are very important and are always coupled with data from the other phases. The objective of much
of reservoir management and of all drilling and completion techniques is to improve well productivity.
Production data are used in data science as a gauging parameter (a response) against the other parameters that
engineers are trying to optimize.
Production data are also utilized in reservoir modeling by history matching the production to fine-tune
the geological and simulation models. Water production data can be compared to oil and gas data to avoid
completing near aquifers, as well as to support production monitoring decisions by adjusting artificial lift
device settings (e.g., controlling ESPs, plungers, and rod pumps).
Some operators compare production data against the geographic distribution of wells to select the best
investment areas to drill more wells by identifying the best reservoir properties and oil and gas pockets.
The development of the smart oilfield concept has flourished and enriched big data analytics. For every
well with an intelligent completion, a large amount of data is being generated and stored; hence, artificial
intelligence applications are being implemented to take actions based on pre-set algorithms. Different
sensors are installed as part of the downhole completion, and the data are transmitted by fiber optic cables.
Examples of the data being measured include the flow rate of every produced fluid phase, water and gas-oil
ratios, methane emissions during hydraulic fracturing fluid flowback, pressure and temperature along the
well all the way to the wellhead, and electrical submersible pump operating parameters and power usage.

Operations Phase
Operations encompass a big stream of structured and unstructured data. The data are varied and span a
large range of formats, from complex 3D models to sensor data. The speed at which data accumulate is also
challenging as a result of the enormous number of sensors incorporated in most operations. Applying
big data in the oil and gas industry drives reductions in well shutdowns and productivity impairment. Big
data analytics also helps extend the operational lifetime of equipment by predicting failure rates
and suggesting condition-based preventive maintenance. Real-time data streaming enables worldwide
operational support, which improves job quality while minimizing crews in remote locations
such as offshore environments. Service providers analyze the operational condition of downhole tools
such as mud motors and logging-while-drilling (LWD) assemblies by continuously analyzing data from several
sensors that measure temperature, pressure, and vibration to predict failure and provide surface warnings to
change the operational parameters, preventing potential tool failure and the consequential NPT and loss
of money. Overall, in all phases of the oil and gas industry, data analytics is used to leverage data-based
decision making in all operations and processes.
In all the E&P phases, the dramatic growth in data generation is not useful by itself. The ability to utilize
and integrate the diverse data sources to seek useful insights and provide data-driven actions and decisions
is the main target of using big data approaches in the oil and gas industry.

Data Collection and Formatting


Data collection and preparation are considered the first step in any data analytics effort. Data collection can be a
challenging task because all subsequent analyses depend on data quality and reliability. Before
embarking on any data-driven study, a well-defined objective statement should be formulated to decide
what type of data should be collected to give a meaningful output. The results of data analytics are the
end product; insights will be generated and decisions will be made based on those outcomes.
Understanding and organizing the data at the early phase of data collection saves the data analysts' time
when they sift through the data to extract useful insights for better-informed decisions that
can save time and cost. Engineers in the oil and gas industry spend a large portion of their time searching
for and assembling the data needed for their projects (Brulé, 2015).

A survey conducted in 2018 by General Electric and Accenture among oil and gas executives found that 81%
of the participants ranked big data and data acquisition among their top three priorities for 2018 (Mehta, 2018).
To start a data collection project, the first step is to identify the scope of the project and search for
potential sources. The state and local governments in the United States regulate the oil and gas industry
and require operators to report their data to the state, which also makes the data publicly available to the
community. The Texas Railroad Commission website (2019) is one example of a public data source that contains
production and well data files.

Other websites combine data from most of the states, and regulations require
operators to report specific data to them for the purpose of public information. One example is
the FracFocus website (2019), which is managed by the Ground Water Protection Council. FracFocus
contains all the chemical types and percentages pumped into hydraulically fractured wells. Commercial
databases such as IHS Markit (2019) and DrillingInfo (2019) have been developed to collect
and process oil and gas data and to provide validated, standardized, and easy-to-access organized
data. Those databases provide a large number of data elements on millions of well records, going
back to wells drilled and produced since 1859. Well data in the United States are obtained from the
regulatory agencies or directly from the operators. The following subsections demonstrate examples of data
collection and processing steps.

Combining Files from the Same Source


Once the objective of the study has been identified, the first step is to collect the data files from the source
and combine them into one database to make the data ready for analysis. In this paper, an example is
shown of how to utilize the public records of FracFocus to collect hydraulic fracturing chemical data and
integrate them with a production and completion database obtained from a commercial source (DrillingInfo).
Figure 1 shows a workflow for stacking more than one file to generate one final file for
the analysis phase. In this example, all the data on FracFocus were downloaded and tabulated in Excel
files. To stack data files together and merge them into one file, the data columns of each file must be exactly
the same in terms of column header names and data type formats. Each chemical component of
the wells from the FracFocus records was represented as one Excel row, with multiple columns covering all
the reported data. Due to the limit on how many rows each Excel sheet can contain (1,048,576 rows),
five Excel sheets were used to hold all 4,146,279 chemical records (rows) found for 142,978 wells.
The workflow was generated using a self-service data analytics platform (Alteryx), and the final combined
file was saved in txt format to be used for the next step of data cleaning and format checking.
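To make the stacking step concrete, the following is a minimal pandas sketch of the same idea; the file
pattern and output name are illustrative assumptions, and the paper's actual workflow was built in Alteryx.

```python
# A minimal sketch of the stacking step using pandas instead of Alteryx.
# File names are illustrative assumptions, not the paper's actual artifacts.
import glob
import pandas as pd

# Read every exported FracFocus sheet; all files must share identical
# column headers and data types for a clean vertical stack.
frames = [pd.read_excel(path) for path in sorted(glob.glob("fracfocus_part*.xlsx"))]

# Verify the schemas match before concatenating.
columns = frames[0].columns
assert all(f.columns.equals(columns) for f in frames), "column headers differ"

# Stack the sheets into one table and export as tab-delimited text.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("fracfocus_combined.txt", sep="\t", index=False)
```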

Figure 1—Example of Data Collecting and Stacking into One Data File

Data Cleaning and Formatting


Once all the extracted data are combined in one file, the second step is to format the data and select the columns
needed for the analysis. Figure 2 shows an example workflow of the data formatting and column selection process.
In this particular example, the hydraulic fracturing job start and end date columns were in a long
format that included hours and seconds, so a date formatting function was used to change them
to mm/dd/yyyy. A data cleaning tool was used to remove and handle whitespace, capitalization issues,
and nulls. A geographical tool was also used to convert the longitude and latitude data into map points to
plot the wells on a map for the descriptive analytics. A column selector function was used to eliminate
the data that have no meaning for the analysis and retain only the columns with useful parameters. Finally,
the cleaned and formatted file is exported to a txt file to be used as input to the next data processing step.

Figure 2—Example of Data Formatting and Columns Selection
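As a rough pandas equivalent of the Figure 2 workflow (column names such as "JobStartDate" are
assumptions for illustration, not FracFocus's exact schema):

```python
# A minimal pandas sketch of the cleaning step in Figure 2.
import pandas as pd

df = pd.read_csv("fracfocus_combined.txt", sep="\t")

# Reformat the long datetime strings to mm/dd/yyyy.
for col in ("JobStartDate", "JobEndDate"):
    df[col] = pd.to_datetime(df[col], errors="coerce").dt.strftime("%m/%d/%Y")

# Handle whitespace and capitalization issues, then drop fully null rows.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip().str.upper())
df = df.dropna(how="all")

# Retain only the columns that matter to the analysis.
keep = ["APINumber", "JobStartDate", "JobEndDate", "Latitude", "Longitude", "MassPct"]
df[keep].to_csv("fracfocus_cleaned.txt", sep="\t", index=False)
```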



Data Processing to Satisfy the Objectives


Once the data are stacked, formatted, cleaned, and selected, the next step is to process the data into a format
that is useful for the analysis purpose. In the following example, depicted in Figure 3, the
objective was to evaluate the quantity of proppant (sand) used in each fracturing job per well in the U.S.
Since FracFocus only reports the weight percentage of each chemical (including proppant),
and for each well there are several types of proppant with different mass percentages, a data processing
workflow was created to process the proppant data. The workflow starts by selecting the proppant-related
columns and the unique well number (API number). The workflow detects and removes data duplications
by comparing all the parameters and flagging the duplicated rows. The following step aggregates
the proppant data for each well into one grouped value while inspecting whether the well has been
refractured by comparing the API number with the job start and end dates. Wells that have been refractured
are grouped separately, and the final summed proppant value for each well is determined and
saved in a file that contains all the proppant-related data. The same procedure and workflow are
used to handle and process the other chemical ingredients separately. Once all the chemical ingredient
and water data files are created, a joining process creates the final FracFocus database
that includes all the water and chemical data processed and summed for each well.

Figure 3—Proppant Data Processing Example
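A hedged pandas sketch of the aggregation logic in Figure 3 might look like the following; the column
names are assumed, and the real workflow was built in Alteryx:

```python
# A minimal sketch of the proppant aggregation step, assuming columns named
# "APINumber", "MassPct" (proppant weight percent), "JobStartDate", and
# "JobEndDate"; these names are illustrative, not taken from the paper.
import pandas as pd

df = pd.read_csv("fracfocus_cleaned.txt", sep="\t")

# Remove exact duplicate rows (same well, job dates, and reported values).
df = df.drop_duplicates()

# A distinct (start, end) date pair for the same API number indicates a
# refracturing job, so group on the API number plus the job dates.
per_job = (
    df.groupby(["APINumber", "JobStartDate", "JobEndDate"], as_index=False)
      .agg(total_proppant_pct=("MassPct", "sum"))
)

# Flag wells with more than one job as refractured.
per_job["refractured"] = per_job.duplicated("APINumber", keep=False)
per_job.to_csv("proppant_per_job.txt", sep="\t", index=False)
```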

Combining Different Databases


To enlarge the database and include more parameters, such as well completion and production data,
another database has to be joined with the created chemicals database. In this example, DrillingInfo,
which contains the completion and production data of the wells, was chosen for integration with the
stimulation chemicals database. Figure 4 shows the workflow for joining the two databases. The two
files are imported from DrillingInfo and FracFocus, and a column selector tool is used after each
database to select the parameters to be joined. The join tool then joins all columns
(completion, production, and stimulation chemicals). In order to use the join tool, a common entity must
exist between the two databases to serve as the joining key; in this case, the API number
was used to join the parameters from both databases.

Figure 4—Example of Combining Two Different Databases
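In pandas terms, the join in Figure 4 reduces to a merge on the API number; the file and column names
below are assumptions:

```python
# A minimal sketch of the database join in Figure 4, using the API number
# as the common key between the two sources.
import pandas as pd

chemicals = pd.read_csv("proppant_per_job.txt", sep="\t")
completions = pd.read_csv("drillinginfo_completions.txt", sep="\t")

# Inner join keeps only wells present in both databases; switch to
# how="left" to keep every FracFocus well regardless of a match.
merged = chemicals.merge(completions, on="APINumber", how="inner")
merged.to_csv("joined_database.txt", sep="\t", index=False)
```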

Data Validation and Quality Assurance


In this section, a non-case-dependent framework is presented for data validation and preparation.
A data point passes a quality check when it can serve the planned purpose in an operation or when it can
lead to clear and correct decision making (Dai et al., 2018). The data coming from the oil and gas industry
are not always accurate and sometimes have missing values. Corrupted data cost the industry about
$60 billion per year (Nobakht and Mattar, 2009), and most datasets are expected to contain 1-5% error.
The more caution and validation applied during the data collection process, the lower the error percentage.
The most common error sources in the data are human errors, incorrect measurement setups, and malfunctions
in measuring equipment. Dealing with missing or suspect data requires a deep understanding of the
dataset and its behavior; a good understanding of the dataset leads to better decisions when it comes to
remedying and substituting for corrupted and missing data (Al Attar et al., 2016). The enforcement of data
quality best practices will not yield instantaneous results, but it will at least ensure a large signal-to-noise
ratio that accumulates as the data collection process progresses. Data quality and validation tests vary from
simple ad-hoc checks to structured, generalized tests, but in both cases wide knowledge of the data domain is
required. Data validation for oil and gas datasets differs significantly in context from a validity check for
a healthcare dataset; despite this contrast, validation still looks for similar patterns to evaluate data
quality and fitness for operational decision making.

Data Validation
Data validation is defined as the process of verifying whether the value of a data point comes from
a known finite or infinite set of acceptable values (UNECE, 2013). Data validation is also defined
as the process of ensuring that the final dataset complies with several predetermined quality characteristics
(Simon, 2013). Di Zio et al. (2016) adopted a definition that considers the communication between data
records at the variable level and at the field domain level, defining data validation as the process of
verifying whether or not a dataset's combination of values belongs to a set of acceptable combinations.

Levels of Data Validation


There are several kinds of data checks that target the data at different levels. Di Zio et al. (2016) divided
data validation into two levels:

i. Technical integrity of the file, and

ii. Logical and statistical consistency of the data.

The second category is divided into the sub-levels shown in Table 1.

Table 1—Data Validation Levels

Level 0: Checking the file within the IT guidelines. Example: column separators are valid, the number
of columns matches the expected number, and the column formats match the expected formats.

Level 1: Checking within the element dataset, i.e., the statistical information included in the file itself.
Example: this stage implements ad-hoc rules such as: is the measured depth always greater than or equal
to the true vertical depth? Is the flow value always positive?

Level 2: Checking the integrity against all similar files from a statistical point of view. Example: the files
are checked for whether they are similar to, or different revisions of, each other.

Level 3: Checking the integrity against all similar files from a statistical point of view, but from a
different data source. Example: the files are checked for whether they are similar to, or different revisions
of, each other, but across data sources.

Level 4: Checking data that describe the same phenomenon but come from different data sources or domains.
Example: checking the well TVD from FracFocus against the value from DrillingInfo.

Level 5: The consistency of the data across different providers. Example: the gross perforated interval
calculated from one dataset as the MD of the bottom perforation minus the MD of the top perforation should
come close to the gross perforated interval reported in a different dataset.
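To illustrate, the level 0 and level 1 checks from Table 1 can be expressed as simple pandas rules; the
column names here ("MD", "TVD", "FlowRate") are assumed for the sketch:

```python
# A minimal sketch of level 0/1 validation checks from Table 1.
import pandas as pd

df = pd.read_csv("joined_database.txt", sep="\t")

# Level 0: technical integrity - the file parses and has the expected schema.
expected = {"APINumber", "MD", "TVD", "FlowRate"}
missing = expected - set(df.columns)
assert not missing, f"missing columns: {missing}"

# Level 1: ad-hoc statistical rules within the file itself.
bad_depth = df[df["MD"] < df["TVD"]]   # measured depth must be >= TVD
bad_flow = df[df["FlowRate"] < 0]      # flow must be non-negative
print(f"{len(bad_depth)} rows violate MD >= TVD, "
      f"{len(bad_flow)} rows have negative flow")
```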

Accuracy and Precision


Accuracy and precision are two terms often used without reference to their actual meanings, as they describe two
different aspects of data measures. Accuracy refers to the agreement between an expected (theoretical) result
and an actual value, for example, the predicted value of the reservoir temperature versus the measured
value. The percentage difference between these two values (Equation 1) describes the accuracy of
that prediction:

$\text{Accuracy error}\ (\%) = \left| \dfrac{x_{\text{predicted}} - x_{\text{measured}}}{x_{\text{measured}}} \right| \times 100$    (1)
Precision, in summary, is how reproducible the experimental data are. Low differences between
sequential realizations or experimental trials indicate a high-precision experiment. Precision can be a
good indicator of how sound the experimental setup is, taking into consideration that the domain must be highly
reproducible, because some domains have high natural variability that yields low precision. For example, suppose
a density measurement tool is run three times to measure the density of the same reservoir fluid under the same
conditions. The larger the differences among the three measurements, the lower the precision of the tool.
Precision can be quantified using descriptive statistical measures such as the range and the standard deviation.
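As a small illustration with made-up numbers, Equation 1 and the range and standard deviation measures
can be computed as follows:

```python
# A small sketch quantifying accuracy (Equation 1) and precision for the
# density-tool example; the numbers are illustrative assumptions.
import statistics

predicted_density = 0.85          # expected (theoretical) value, g/cm3
runs = [0.842, 0.848, 0.845]      # three repeat measurements, g/cm3

measured = statistics.mean(runs)
accuracy_error = abs(predicted_density - measured) / measured * 100

# Precision: spread of the repeated runs.
precision_range = max(runs) - min(runs)
precision_std = statistics.stdev(runs)

print(f"accuracy error: {accuracy_error:.2f}%")
print(f"range: {precision_range:.3f}, std dev: {precision_std:.4f}")
```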

Error Sources in the Data


Some examples of sources of errors in the data are as follows:
1. Human error: these errors include misreading and misinterpreting the data recorded
by field personnel.
2. Data collection setup error: the way data are collected may allow errors to
propagate. An example is when the single-well flow rate of four producing wells is calculated
by dividing the total flow rate by four. This assumes that the rates from the four producers are equal,
and in fact, they most likely are not.

Nobakht and Mattar (2009) mention some error sources related
to production and injection data, including:
i. Averaging the rates of a specific well from an adjacent group of wells.
ii. Incorrect assumptions (e.g., assuming single-phase flow while the GOR is increasing).
iii. Incorrect location of the pressure gauge (e.g., placed downstream of a choke that is not fully
open).

iv. Incorrect synchronization (i.e., taking the injection and production data with a time shift
between them and treating them as if measured at the same time).
3. Device malfunction, which results in corrupted data: missing or corrupted data can have an impact
on the decisions that need to be made, and an error in one dataset may propagate to parameters in another
dataset.

Outlier Detection Methodologies and Treatments


Hawkins (1980) defined an outlier as an observation in the dataset that deviates significantly from other
observations, raising the suspicion that the observation was generated by a different mechanism or
approach. The complexity of dealing with outliers comes from the method used to decide whether a data point
is corrupted or reflects normal behavior, which requires an understanding of that specific dataset. Three
methods to identify outliers, namely box plots, the descriptive statistics method, and the local outlier factor
(LOF), are discussed in the following subsections.

Box Plot
Box plots are a useful and easy visualization tool for detecting outliers. There are five elements of a box plot:
the maximum, minimum, median, and first and third quartiles, as shown in Figure 5. The difference
between the first and third quartiles is called the interquartile range (IQR). Data points falling outside the
maximum or minimum whiskers of the box plot can be potential outliers (Kirkman, 1992).

Figure 5—Box Plot (Kirkman, 1992)
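A minimal sketch of the IQR rule on a one-dimensional sample follows; the conventional 1.5 x IQR whisker
length is an assumption here, as the text does not state the whisker definition:

```python
# Box-plot (IQR) outlier rule on a 1D sample of flow rates.
import numpy as np

rates = np.array([310.0, 295.0, 305.0, 300.0, 920.0, 290.0, 315.0])

q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker fences

outliers = rates[(rates < low) | (rates > high)]
print(f"IQR fences: [{low:.1f}, {high:.1f}], outliers: {outliers}")
```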



Descriptive Statistics Method


The assumption here is that the data are normally distributed according to the Central Limit Theorem (Franklin
and Brodeur, 1997). This method flags any data point that falls outside six standard deviations from the
mean as a potential outlier. An example of flow rate outlier detection using
this method is shown in Figure 6.

Figure 6—Descriptive Statistics Method to Detect Outliers (Al Attar et al., 2016)
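A sketch of this rule in NumPy, keeping the six-standard-deviation threshold used above (the sample data
are synthetic):

```python
# Descriptive-statistics rule: flag points lying outside k standard
# deviations of the mean (k follows the text's choice of six).
import numpy as np

def sigma_outliers(x: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Return the values of x farther than k standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()
    return x[np.abs(x - mu) > k * sigma]

flow = np.random.default_rng(0).normal(500.0, 20.0, size=1000)
flow[10] = 5000.0                    # inject a corrupted reading
print(sigma_outliers(flow))          # -> [5000.]
```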

Local Outlier Factor (LOF)


LOF was developed by Breunig et al. (2000). The k-distance of a point A is defined as the distance from A
to its kth nearest neighbor, as shown in Figure 7. Let $N_k(A)$ be the set of the k nearest neighbors of A.
The reachability distance between A and a point B in $N_k(A)$ is defined as:

$\text{reach-dist}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\}$    (2)

Figure 7—Local Outlier Factor (LOF) Method Definition (Al Attar et al., 2016)

where $d(A, B)$ is the distance between A and B (e.g., $|Y(A) - Y(B)|$ for a one-dimensional variable) and
$B \in N_k(A)$; i.e., the reachability distance between two points A and B is the true distance, but at least
the k-distance of B.

The local reachability density (lrd) is defined as:

$\text{lrd}_k(A) = \left( \dfrac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}{|N_k(A)|} \right)^{-1}$    (3)

The LOF is then given by:

$\text{LOF}_k(A) = \dfrac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \dfrac{\text{lrd}_k(B)}{\text{lrd}_k(A)}$    (4)

A LOF value substantially greater than 1 indicates that point A lies in a sparser region than its neighbors
and is therefore a potential outlier.
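scikit-learn ships an implementation of this method; a minimal sketch with synthetic two-dimensional data
(the parameter choices are illustrative):

```python
# LOF outlier detection with scikit-learn's implementation of
# Breunig et al. (2000).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
X[:3] = [[8.0, 8.0], [9.0, 7.5], [-7.0, 9.0]]   # plant three outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 marks outliers

print("flagged points:", np.where(labels == -1)[0])
print("LOF scores:", -lof.negative_outlier_factor_[labels == -1])
```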

Descriptive Data Analytics


Descriptive analysis uses the data to investigate and understand hidden trends and insights by utilizing
statistics and data visualization. It has become a growing topic in the oil and gas industry, especially after
the availability of public and commercial databases and the advancement of statistical and visualization
software. Many commercial databases have their own dashboard-based analytics, which
help users drill down into the data and select the parameters of interest whose trends they would like
to investigate and observe. Descriptive analysis should not be done without an overall
knowledge of the parameters and the process being analyzed; misleading insights and decisions might
otherwise be inferred because of an inability to justify some of the events within the data. Figure 8 illustrates an
example of a descriptive dashboard and statistical parameters for proppant used in the United States over
the past 8 years.

Figure 8—Example of Descriptive Analyses using Dashboards and Statistics

Predictive Data Analytics


The oil and gas industry deals with many variables that have complex relationships with each
other. As in all industries, investment in oil and gas involves a great deal of risk. To understand the
intensity of those risks and to make informed decisions, the industry uses data generated at
every stage of the investment, from seismic exploration and reservoir identification to drilling, production,
refining operations, and more. The bulk of data generated and stored every second from multiple
systems and sensors needs to be analyzed to help optimize operations and predict productivity based
on the most influential parameters. Having the data without the proper tools for processing and modeling is
like having unrefined crude oil: it is not very useful without proper refining and processing into
useful end products. Predictive analytics provides optimized predictions by investigating the relationships
between the response parameters and gauging their influence on the predicted parameters. It converts raw
data into mathematical models that generate actionable insights by combining deep knowledge of the investigated
domain with advanced data science analytics after filtering out noise and undesirable faulty data points.
The objective of utilizing predictive analytics is to improve operations and reduce cost and time.
Machine learning and intelligent systems are very important for solving problems that cannot be modeled
with analytical tools, and their utilization can help minimize NPT and cost. Examples of intelligent systems
include, but are not limited to, regression, fuzzy logic, genetic algorithms, probabilistic reasoning,
recurrent neural networks (RNN), and evolutionary computing (Mohaghegh, 2000; Ertekin, 2005).
An example would be utilizing data to predict well loading before it occurs and taking action
to avoid a production shutdown caused by the water level rising in the well until the hydrostatic
pressure exceeds the reservoir pressure and halts production. Another example
is using machine learning approaches such as the partial least squares (PLS) method to predict initial gas
production in the Marcellus shale from completion and stimulation parameters such as lateral
length, total proppant mass, and the concentrations of other stimulation chemicals such as friction reducer,
biocide, corrosion inhibitor, surfactant, and clay control.

Figure 9 shows the predicted versus actual initial gas production in the Marcellus using PLS.

Figure 9—Predicted and Actual Initial Gas Production in the Marcellus
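A hedged sketch of the PLS idea using scikit-learn follows; the features mirror the parameters listed
above, but the data are synthetic placeholders rather than the paper's Marcellus dataset:

```python
# A minimal sketch of PLS regression for production prediction.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
# Columns: lateral length, total proppant mass, friction reducer, biocide,
# corrosion inhibitor, surfactant, clay control (arbitrary synthetic units).
X = rng.uniform(0.0, 1.0, size=(n, 7))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0.0, 0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pls = PLSRegression(n_components=3)
pls.fit(X_train, y_train)
print("R^2 on held-out wells:", pls.score(X_test, y_test))
```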

Conclusion
Data analytics has become important for the oil and gas industry due to the large amounts of data available
from exploration, drilling, production, and operations. Utilizing the available data will help in making better
future decisions. However, there are many challenges in collecting, formatting, validating,
managing, and analyzing the data that require close attention from the people who work with them. The
following conclusions were made based on this study:

• Big data analytics and the revolution of datafication helped companies and public administrations
to better understand the data, find previously unnoticeable patterns, and provide better solutions
for existing and future operations.
• There is a substantial transition in the oil and gas industry towards data-driven operations.

• Many operators have incorporated data science into their organizational structure and are
training the next generation of engineers to be hybrid engineers who are experts in both their area of
specialization and data science.
• In all the E&P phases, the dramatic growth in data generation is not useful by itself. The ability to
utilize and integrate the diverse data sources to seek useful insights and provide data-driven actions
and decisions is the main target of using big data approaches in the oil and gas industry.
• To join two different databases, a common entity must exist between the two databases to be
identified as the joining parameter.
• Datasets, especially oil and gas data, are case-specific when it comes to data cleaning and
transformation.
• Although the data validation process seems very ad hoc, many aspects of data validation
remain valid and applicable across all types of oil and gas datasets. It is also recommended
that a walkthrough framework be implemented, especially for oil and gas applications.

References
Al-Hameedi, A. T., Alkinani, H. H., Dunn-Norman, S., Flori, R. E., Hilgedick, S. A., & Amer, A. S. (2017a). Limiting Key Drilling
Parameters to Avoid or Mitigate Mud Losses in the Hartha Formation, Rumaila Field, Iraq. J Pet Environ Biotechnol
8: 345. doi:10.4172/2157-7463.1000345.
Al-Hameedi, A. T. T., Alkinani, H. H., Dunn-Norman, S., Flori, R. E., Hilgedick, S. A., Alkhamis, M. M., & Alsaba, M. T.
(2018, August 16). Predictive Data Mining Techniques for Mud Losses Mitigation. Society of Petroleum Engineers.
doi:10.2118/192182-MS.
Al-Hameedi, A. T., Dunn-Norman, S., Alkinani, H. H., Flori, R. E., & Hilgedick, S. A. (2017b, August 28). Limiting
Drilling Parameters to Control Mud Losses in the Dammam Formation, South Rumaila Field, Iraq. American Rock
Mechanics Association.
Al-Hameedi, A.T., Dunn-Norman, S., Alkinani, H.H., Flori, R.E., and Hilgedick, S.A. (2017c). Limiting Drilling
Parameters to Control Mud Losses in the Shuaiba Formation, South Rumaila Field, Iraq. Paper AADE-17- NTCE- 45
accepted, and it was presented at the 2017 AADE National Technical Conference and Exhibition held at the Hilton
Houston.
Alkinani, H. H., Al-Hameedi, A. T. T., Dunn-Norman, S., Flori, R. E., Hilgedick, S. A., Amer, A. S., & Alsaba, M. T.
(2018c). Economic Evaluation and Uncertainty Assessment of Lost Circulation Treatments and Materials in the Hartha
Formation, Southern Iraq. Society of Petroleum Engineers. doi:10.2118/192097-MS.
Alkinani, H. H., Al-Hameedi, A. T., Flori, R. E., Dunn-Norman, S., Hilgedick, S. A., & Alsaba, M. T. (2018a). Updated
Classification of Lost Circulation Treatments and Materials with an Integrated Analysis and their Applications. Society
of Petroleum Engineers. doi:10.2118/190118-MS.
Alkinani, H.H., Al-Hameedi, A.T.T., Dunn-Norman, S., Flori, R.E., Alsaba, M.T., Amer, A.S., and Hilgedick, S.A. 2019.
"Using Data Mining to Stop or Mitigate Lost Circulation." Journal of Petroleum Science and Engineering, vol. 173
(February 2019), 1097–1108. https://doi.org/10.1016/j.petrol.2018.10.078.
Alkinani, H.H., Al-Hameedi, A.T.T., Dunn-Norman, S., Flori, R.E., Hilgedick, S.A., Al-Maliki, M.A., Alshawi, Y.Q.,
Alsaba, M.T., and Amer, A.S. 2018b. "Examination of the Relationship Between Rate of Penetration and Mud
Weight Based on Unconfined Compressive Strength of the Rock." Journal of King Saud University - Science. https://
doi.org/10.1016/j.jksus.2018.07.020.
Al Attar, A. A., Hughes, R. G., & Hassan, O. F. (2016). Handling Missing and Corrupted Data in Waterflood Surveillance,
Using Reservoir Linear Characterization Models. Society of Petroleum Engineers. doi:10.2118/182207-MS.
Beckwith, R. 2011. Managing Big Data: Cloud Computing and Co-Location Centers. J. Pet. Tech. 63 (10): 42–45
Breunig, Markus M, Hans-Peter Kriegel, Raymond T Ng, and Jorg Sander. 2000. LOF: identifying density-based local
outliers. ACM Sigmod Record, Volume 29. ACM, 93–104.
Dai, W., Cheng, X., Jiao, K., & Vigh, D. (2018). Least-squares reverse time migration with dynamic time warping. Society
of Exploration Geophysicists.
Di Zio, M., Fursova, N., Gelsema, T., Gießing, S., Guarnera, U., Petrauskien, J., Quensel-von Kalben, L., Scanu, M.,
ten Bosch, K.O., van der Loo, M., and Walsdorfer, K. (2016). Methodology for data validation 1.0. Retrieved March
24, 2019.
DrillingInfo, website link: https://info.drillinginfo.com/, retrieved on March/20/2019.
Ertekin, T. (2005). Virtual Intelligence - A Panacea or Hype for Long-standing Reservoir Engineering Issues. Society of
Petroleum Engineers.
Feblowitz, J. (2013). Analytics in Oil and Gas: The Big Deal About Big Data. Society of Petroleum Engineers.
doi:10.2118/163717-MS
FracFocus, website link: https://fracfocus.org/, retrieved on March/01/2019.
Franklin, S., & Brodeur, M. (1997). A practical application of a robust multivariate outlier detection method. Proceedings of
the Survey Research Methods Section, American Statistical Association, 186–191.
Hawkins, Douglas M. 1980. Identification of outliers. Volume 11. Springer.
IHS Markit, website link: https://ihsmarkit.com/index.html, retrieved on March/20/2019.
Kirkman, T. W. (1992). Statistics to Use. Retrieved March 24, 2019, from http://www.physics.csbsju.edu/stats/
Brulé, M. R. (2015). The Data Reservoir: How Big Data Technologies Advance Data Management and Analytics in
E&P. IBM Software Group.
Mehta, A. (2016). Tapping the Value from Big Data Analytics. Society of Petroleum Engineers. doi:10.2118/1216-0040-
JPT
Mohaghegh, S. (2000). Virtual-Intelligence Applications in Petroleum Engineering: Part 1—Artificial Neural Networks.
Society of Petroleum Engineers. doi:10.2118/58046-JPT.
Nobakht, M., & Mattar, L. (2009). Diagnostics of Data Quality for Analysis of Production Data. Petroleum Society of
Canada. doi:10.2118/2009-137.

Simon A. (2013). Definition of validation levels and other related concepts v01307.
Working document. Available from https://webgate.ec.europa.eu/fpfis/mwikis/essvalidserv/images/3/30/
Eurostat__definition_validation_levels_and_other_related_concepts_v01307.doc.
Texas Railroad Commission, website link: https://www.rrc.state.tx.us/, retrieved on March/02/2019
UNECE (2013). Glossary of terms on statistical data editing. Retrieved March 24, 2019, from http://www1.unece.org/
stat/platform/display/kbase/Glossary.
Vega-Gorgojo, G., Fjellheim, R., Roman, D., Akerkar, R., & Waaler, A. (2016). Big Data in the Oil & Gas Upstream
Industry - A Case Study on the Norwegian Continental Shelf. Oil Gas European Magazine, 42, 67–77.
