Introduction to Data Analytics (2)
Contents
• What is data and data collection?
Field data is raw data that is collected in an uncontrolled "in situ" environment.
Experimental data is data that is generated within the context of a scientific investigation by
observation and recording.
Ways of data collection
• Automatic data collection
Environmental impact on data collection
Systematic errors
Impact of Machinery wear and tear
• Manual input
Numerical data
String valued data
Typos
Measurement reading issues
Non-standardized data
Different opinions
What is data and data collection?
• Types of data
Raw data through measurement
o Field
o Laboratory
Derived data
o Numerical data
o Categorical data
o Qualitative data
Identifying data
Recording of measurements
Illustrative data
A sample data set
• Example data set for CO2 immiscible flooding projects
Typical questions for data analytics projects
• What is my problem?
How to obtain as much relevant data as possible for your data analytics project?
Data analytics and relevant terminology
• Data analysis
• Data analytics
• Data science
• Data mining
• Big data
Analysis versus Analytics
• Analysis is focused on understanding the past: what happened
On average, there are about 500 million tweets sent every day.
According to Nerdwallet, the average smartphone owner uses 2 to 5 GB on his or her cell phone plan each month.
On average, office workers each receive 110 to 120 emails per day, equaling approximately 124 billion emails on any given day.
According to the 2019 Federal Reserve Payments Study, total card payment transactions reached 131.2 billion with a value of $7.08
trillion in 2018, representing growth of 8.9 percent in volume year-over-year.
A typical data analytics project workflow
A. Identify/formulate your problem
B. Data collection and preparation
C. Data exploration
o Transformation and selection
D. Select modeling methods and build your models
E. Validate the model
F. Deploy the model
G. Evaluate and monitor the results
o Theoretical/physical modeling
Data are prevalent everywhere, and modeling based on data and algorithms is
often cheaper than theoretical/physical modeling and relies on fewer prior assumptions.
Four levels of data analytics
Summary
• Data analytics has a few other names;
• A data analytics project starts with data, and therefore data and its quality
are important for a successful project;
• Different data types and data collection methods have inherent potential
for data quality problems;
• SPE facilitators
• Reading materials
Applications of data analytics in petroleum engineering
SPE Facilitator
• JPT October 2017: https://www.spe.org/en/jpt/jpt-article-detail/?art=3402
• SPE community: https://connect.spe.org/home?_ga=2.101392321.1933232530.1535376349-965409611.1534017735
• SPE
• SPE free seminar on data analytics
• SPE publications
Grade Breakdown
Homework: 30%
Midterm: 20%
By storage
o Paper report
o Computer storage
Data & Sources of Data Quality Problems
• Data types
By format
o Unstructured
o Structured
o Images
o Signals
• Production
Core sampling and tests: reservoir characterization, performance evaluation
Fluid property tests
Well testing: reservoir behavior
Well stimulation related data
Petroleum fluid production
Pressure information
Sources of Data Quality Problems
o Manual errors
o Measurement errors
o Precision induced errors
o Environment induced errors
Typical Data Quality Problems
• Data types for an entity
Identifying information
Name: string
Non-identifying information
Numerical properties: numbers
Image: pixels
Typical Data Quality Problems
• Data problems for string values
Lack of standardization: varied representations for the same thing
Typos
Categorical values
Qualitative values
Descriptive values
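The string-value problems above (varied representations and typos) can be illustrated with a minimal sketch. The lookup table, the lithology labels, and the function names are hypothetical and chosen only for illustration; a real project would build the table from its own data.

```python
import difflib
import re
from typing import Optional

# Hypothetical mapping from known variants to one canonical label.
CANONICAL = {
    "sandstone": "sandstone",
    "sand stone": "sandstone",
    "ss": "sandstone",
    "carbonate": "carbonate",
    "carb": "carbonate",
}

def normalize(value: str) -> str:
    """Lower-case the value and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()

def standardize(value: str) -> Optional[str]:
    """Return the canonical label, or None if the value is not recognized."""
    return CANONICAL.get(normalize(value))

def suggest(value: str) -> Optional[str]:
    """Propose the closest canonical label for a likely typo (similarity >= 0.8)."""
    labels = sorted(set(CANONICAL.values()))
    matches = difflib.get_close_matches(normalize(value), labels, n=1, cutoff=0.8)
    return matches[0] if matches else None

raw = ["Sandstone", " sand-stone ", "SS", "Carb.", "sandstoen"]
result = [standardize(v) for v in raw]
print(result)                # unrecognized values come back as None
print(suggest("sandstoen"))  # candidate correction for manual review
```

Values that map to None are candidates for manual review; the suggestion is only a hint and should not be applied automatically.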
Typical Data Quality Problems
• Data problems for numerical values
Outlying data: single-attribute analysis
Out-of-range data
Special cases
Errors
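Out-of-range and outlying values are different checks, and a short sketch may help separate them. The porosity values and the valid range of 0 to 1 are hypothetical, and the 1.5 × IQR fence is one common convention, not necessarily the rule used in this course.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (low, high) outlier fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def check_values(values, valid_min, valid_max):
    """Split problems into out-of-range values and statistical outliers."""
    low, high = iqr_fences(values)
    out_of_range = [v for v in values if not (valid_min <= v <= valid_max)]
    outliers = [v for v in values if v < low or v > high]
    return out_of_range, outliers

# Hypothetical porosity fractions; physically valid values lie in [0, 1].
porosity = [0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.95, -0.05]
out_of_range, outliers = check_values(porosity, 0.0, 1.0)
print("out of range:", out_of_range)
print("outliers:", outliers)
```

In this sample, 0.95 is inside the valid range but still flagged as an outlier, while -0.05 fails both checks.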
• For a variable
Metadata information: data type, size, discrete values, uniqueness, occurrence of null
values, typical string patterns, and abstract type recognition.
The metadata can then be used to discover problems such as illegal values, misspellings, missing
values, varying value representation, and duplicates
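A minimal sketch of such a per-variable metadata profile follows. The well identifiers, the `profile` function, and the pattern encoding (letters become `A`, digits become `9`) are assumptions made for illustration.

```python
import re
from collections import Counter

def pattern_of(text):
    """Abstract a string into a pattern: letters become 'A', digits become '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", text))

def profile(values):
    """Collect basic metadata for one variable (column)."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "types": Counter(type(v).__name__ for v in present),
        "distinct": len(set(present)),
        "is_unique": len(set(present)) == len(present),
        "max_length": max((len(str(v)) for v in present), default=0),
        "patterns": Counter(pattern_of(str(v)) for v in present),
    }

# Hypothetical well identifiers with a missing value, a duplicate,
# and one entry that deviates from the dominant pattern.
well_ids = ["W-001", "W-002", "w002", None, "W-002", "W-010"]
report = profile(well_ids)
for key, value in report.items():
    print(key, value)
```

A rare pattern (here `A999` against the dominant `A-999`) points to a varying representation, and `is_unique` being false for an identifier points to duplicates.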
• Dispersion parameters:
Standard deviation
Range
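For reference, the two dispersion parameters can be written as follows for observations $x_1, \dots, x_n$ with mean $\bar{x}$. The sample form with $n-1$ in the denominator is assumed here; the slides do not state which form is intended.

```latex
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2},
\qquad
R = x_{\max} - x_{\min}
```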
Single variant analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range
Visualization
o Boxplot:
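The descriptive statistics in this to-do list can be computed with the Python standard library, as sketched below. The viscosity values are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def describe(values):
    """Return the basic descriptive statistics for one variable."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }

# Hypothetical oil viscosities in cp, with one suspiciously large value.
viscosity_cp = [2, 3, 3, 4, 5, 6, 40]
summary = describe(viscosity_cp)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median, as in this sample, is a quick hint that a few large values dominate and deserve a closer look.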
Single variant analysis for data quality problems
• To-do list
o min, max
o Column chart
[Figure: (a) Projects. Column chart of project counts by formation type; x-axis: Formation Type; legend entries include WGs, CDGs, and BGs.]
Single variant analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range
Visualization
o Histogram
[Figure: histogram of record counts by viscosity interval; x-axis: Viscosity, cp; y-axis: Records; bar labels include C1, C2, C3, C5, C8, C9, and C10.]
Single variant analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range
Visualization
o Boxplot
o Histogram
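A histogram is a count of records per interval, which the following sketch computes without a plotting library. The bin edges and the viscosity values are assumptions, chosen as unequal-width intervals with an open-ended last bin.

```python
from bisect import bisect_right

def histogram(values, edges):
    """Count values per half-open bin [edges[i], edges[i+1]).

    The last bin is open-ended: [edges[-1], +inf).
    """
    counts = [0] * len(edges)
    for v in values:
        i = bisect_right(edges, v) - 1
        if i < 0:
            raise ValueError(f"{v} is below the first bin edge {edges[0]}")
        counts[i] += 1
    return counts

# Hypothetical bin edges (cp) and viscosity records.
edges = [0, 10, 100, 300, 500, 1000]
viscosity_cp = [1.5, 8, 12, 45, 250, 700, 5000, 9]
counts = histogram(viscosity_cp, edges)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    upper = edges[i + 1] if i + 1 < len(edges) else "+inf"
    print(f"[{edges[i]}, {upper}): {'#' * count} ({count})")
```

Empty or nearly empty bins in the middle of the range, or a heavily populated open-ended bin, are worth checking against the data source.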
Single variant analysis for data quality problems
• Combination of histogram and boxplot
Steam Flooding
Multiple variant analysis for data quality problems
• Scatter plot and scatter plot matrix
Multiple variant analysis for data quality problems
• Correlation matrix
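A correlation matrix holds one correlation coefficient per pair of variables. The sketch below assumes the Pearson coefficient, which the slides do not specify, and uses hypothetical column names and values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical reservoir data, one list per variable.
data = {
    "depth_ft": [1000, 2000, 3000, 4000, 5000],
    "temperature_F": [90, 110, 130, 150, 170],
    "porosity": [0.30, 0.26, 0.24, 0.19, 0.15],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

For data quality work, a pair that is expected to correlate strongly but does not (or the reverse) is a prompt to inspect the underlying records.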
Multiple variant analysis for data quality problems
• Combination of scatter plot & boxplots
Summary
• Single variant analysis for data quality problems
Basic descriptive statistics
Boxplot
Histogram
Combination of boxplot & histogram