Introduction to Data Analytics (2)

The document provides an introduction to data analytics, covering key concepts such as data types, data collection methods, and the importance of data quality. It outlines a typical workflow for data analytics projects and highlights the relevance of data analytics in petroleum engineering. Additionally, it discusses various data quality problems and methods for their identification and enhancement.

Introduction to Data Analytics
Contents
• What is data and data collection?

• Data analytics and relevant terminology

• A typical data analytics project workflow

• Data analytics is important

• Data analytics is important for a petroleum engineer


What is data and data collection?
• From Merriam-Webster
• From dictionary.com
• From Wikipedia
Data is measured, collected and reported, and analyzed, whereupon it can be
visualized using graphs, images or other analysis tools.

Raw data ("unprocessed data") is a collection of numbers or characters before it has
been "cleaned" and corrected by researchers.

Field data is raw data that is collected in an uncontrolled "in situ" environment.

Experimental data is data that is generated within the context of a scientific investigation by
observation and recording.
Ways of data collection
• Automatic data collection
Environmental impact on data collection
Systematic errors
Impact of machinery wear and tear

• Manual input
Numerical data
String-valued data

Typos
Measurement reading issues
Non-standardized data
Different opinions
What is data and data collection?
• Types of data
Raw data through measurement
o Field
o Laboratory

Derived data (structured data)
o Numerical data
o Categorical data
o Qualitative data

Identifying data

Illustrative data (un- or semi-structured data)


What is data and data collection?
• Functions of data in a data set
Identifying purposes

Recording of measurements

Derived data for analysis

Intermediate or/and final analysis results

Data quality is important


Discussion on Data quality problems
• Types of data
Raw data through measurement
o Field
o Laboratory

Derived data
o Numerical data
o Categorical data
o Qualitative data

Identifying data

Illustrative data
A sample data set
• Example data set for CO2 immiscible flooding projects
Typical questions for data analytics projects
• What is my problem?

• What are my goals?

• Do I have all the relevant data for my problem and my goals?

• How can I obtain as much relevant data as possible for my data analytics project?
Data analytics and relevant terminology
• Data analysis

• Data analytics

• Data science

• Data mining

• Big data
Analysis versus Analytics
• Analysis is focused on understanding the past; what happened

• Analytics focuses on why it happened and what will happen next

• Thus, analytics is most often what we do.


Data analytics, data science, data mining, big data
• Data analytics, data science, data mining
Used interchangeably

Slightly different emphases

All are interdisciplinary, involving data quality investigation, data transformation, knowledge and insight discovery, and prediction and forecasting with statistical and/or AI methods

Data analytics and data mining: modeling → prediction

Data science: data and database management


Data analytics, data science, data mining, big data
• Big data
Opinions differ on how large a data set must be to count as big data

The emphasis is on advanced analytics tools, not on data size

Examples of big data
o IDC predicts that by 2025, the world’s data will grow to 175 zettabytes. (To put that in perspective, if you attempted to download 175 ZB at the average current internet connection speed, it would take you 1.8 billion years.)
o On average, there are about 500 million tweets sent every day.
o According to Nerdwallet, the average smartphone owner uses 2 to 5 GB on his or her cell phone plan each month.
o Walmart processes one million customer transactions per hour.
o Amazon records $283,000 in sales per minute.
o On average, office workers each receive 110 to 120 emails per day, which adds up to approximately 124 billion emails on any given day.
o According to the 2019 Federal Reserve Payments Study, total card payment transactions reached 131.2 billion with a value of $7.08 trillion in 2018, representing growth of 8.9 percent in volume year-over-year.
A typical data analytics project workflow
A. Identify/formulate your problem
B. Data collection and preparation
C. Data exploration
o Transformation and selection
D. Select modeling methods and build your models
E. Validate the model
F. Deploy the model
G. Evaluate and monitor the results

Repeat A-G if necessary to tune the model and results
Data analytics is important
• Paradigm shift (West Virginia: Dr. Shahab Mohaghegh)
o Empirical modeling

o Theoretical/physical modeling

o Physical modeling: numerical solution

o Data-driven proxy modeling

Data are now available almost everywhere, and modeling based on data and algorithms is
cheaper and relies on fewer physical assumptions.
Four levels of data analytics
Summary
• Data analytics has a few other names;

• A data analytics project starts with data, and therefore data and its quality
are important for a successful project;

• Different data types and data collection methods carry inherent risks
of data quality problems;

• Data analytics is important;

• Data analytics is important for a future petroleum engineer.


Data Analytics & Relevance to Petroleum Engineering
• Significance of data analytics & Demands in petroleum engineering

• SPE facilitators

• Reading materials
Applications of data analytics in petroleum engineering
SPE Facilitator
• JPT October 2017: https://www.spe.org/en/jpt/jpt-article-detail/?art=3402

• SPE community: https://connect.spe.org/home?_ga=2.101392321.1933232530.1535376349-965409611.1534017735

Petroleum Data Driven Analytics Technical Section: https://connect.spe.org/pd2a/home

Digital Energy: https://connect.spe.org/dets/home


Reading Material
• Links:
Big Data Analytics: What Can It Do for Petroleum Engineers and Geoscientists?
Shale Analytics
YouTube videos
• https://www.youtube.com/watch?v=phbScbosgbw (10 minutes)
• https://www.youtube.com/watch?v=T6LM4fCPvJs (17 minutes)
• https://www.youtube.com/watch?v=EkQI53SDeYc (44 minutes)

• SPE
• SPE free seminar on data analytics
• SPE publications
Grade Breakdown
Homework: 30%

Class practice: 20%

Midterm: 20%

Final evaluation: 20%

Final project: 10%


A Full Cycle of A Data Analytics Project
1. To define the problem

2. To identify relevant data & prepare data

3. To explore the data

4. To select and construct models to address the business problem

5. To validate the models

6. To deploy the model

7. To evaluate the results

8. Repeat the process until good results are achieved
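
The cycle above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the data values are made up, the model is a one-variable least-squares line chosen for brevity, and the helper names (`prepare`, `fit_line`, `rmse`, `predict`) are hypothetical, not part of the course material.

```python
from statistics import mean

# Steps 1-2: hypothetical problem (predict y from x) and made-up raw data.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]

def prepare(records):
    """Step 2: drop records with a missing value."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def fit_line(points):
    """Step 4: ordinary least squares for y = a + b * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

def rmse(model, points):
    """Steps 5 and 7: root-mean-square error of the model on a set of points."""
    a, b = model
    return mean([(y - (a + b * x)) ** 2 for x, y in points]) ** 0.5

clean = prepare(raw)                      # step 2: data preparation
train, holdout = clean[:-1], clean[-1:]   # step 3 would explore `clean` before this split
model = fit_line(train)                   # step 4: build the model
validation_error = rmse(model, holdout)   # step 5: validate on unseen data

def predict(x):
    """Step 6: the 'deployed' model is just a callable here."""
    a, b = model
    return a + b * x
```

If `validation_error` is too large for the problem at hand, step 8 applies: go back, collect or clean more data, or pick another model, and repeat.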


Data Types & Sources of Data Quality Problems
• Data types

• Petroleum related data types

• Sources of data quality problems

• Typical data quality problems


Data & Sources of Data Quality Problems
• Data types
By data acquisition methods
o Manual typing
o Measurement /observation
o Digitization
o Signal analysis
o Image processing
o Sensor devices

By storage
o Paper report
o Computer storage
Data & Sources of Data Quality Problems
• Data types
By format
o Unstructured
o Structured
o Images
o Signals

By value format
o Numerical
o String
Petroleum Engineering Related Data
• Exploration
Varied surveys: gravity survey, magnetic survey, passive seismic reflection survey, aerial survey, seismic survey
Exploratory drilling & logs & log interpretation
Geostructural layering information

• Production
Core sampling and tests → reservoir characterization, performance evaluation
Fluid property tests
Well testing → reservoir behavior
Well stimulation related data
Petroleum fluid production
Pressure information
Sources of Data Quality Problems
o Manual errors
o Measurement errors
o Precision induced errors
o Environment induced errors
Typical Data Quality Problems
• Data types for an entity
Identifying information
o Name: string
o ID: string + number, numbers

Non-identifying information
o Numerical properties: numbers
o Categorical properties: string
o Qualitative properties: string
o Descriptive properties: string
o Image: pixels
Typical Data Quality Problems
• Data problems for string values
Lack of standardization: varied representations for the same thing
Typos

• Consequences of data quality problems


Identifying information

Categorical values

Qualitative values

Descriptive values
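
As a rough illustration of the two string problems named above (varied representations and typos), the following sketch normalizes hand-typed values and maps near-misses onto a reference vocabulary. The entries, the vocabulary, and the 0.8 similarity cutoff are assumptions made for the example; `difflib.get_close_matches` is from the Python standard library.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical formation-type entries typed in by hand.
entries = ["Sandstone", "sandstone ", "SANDSTONE", "Sand stone",
           "Carbonate", "carbonat", "Dolomite"]

# Assumed reference vocabulary; in practice it would come from domain experts.
standard = ["sandstone", "carbonate", "dolomite"]

def normalize(value):
    """Lower-case and remove all whitespace so trivial variants collapse."""
    return "".join(value.lower().split())

def standardize(value, vocabulary, cutoff=0.8):
    """Return the closest standard term, or None if nothing is similar enough."""
    key = normalize(value)
    if key in vocabulary:
        return key
    matches = get_close_matches(key, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Group the raw spellings under the standard term they map to.
variants = defaultdict(set)
for entry in entries:
    variants[standardize(entry, standard)].add(entry)
```

Values that map to `None` are the ones worth sending back for manual review rather than correcting automatically.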
Typical Data Quality Problems
• Data problems for numerical values
Outlying data: single-attribute analysis
Out-of-range data
Special cases
Errors

Inconsistent data: multiple-attribute analysis
Against common sense
Against relationships among attributes

• Consequences of data quality problems
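
The two kinds of numerical problems above could be checked along these lines. This is a sketch, not a prescribed method: the records, field names, admissible ranges, and the two-phase assumption (oil and water saturations sum to 1) are all illustrative.

```python
# Hypothetical reservoir records; values are made up.
records = [
    {"id": "P1", "porosity": 0.21, "oil_sat": 0.60, "water_sat": 0.40},
    {"id": "P2", "porosity": 21.0, "oil_sat": 0.55, "water_sat": 0.45},  # percent entered as a fraction?
    {"id": "P3", "porosity": 0.18, "oil_sat": 0.70, "water_sat": 0.50},  # saturations sum to 1.2
]

# Single-attribute check: assumed admissible range for each attribute.
valid_range = {"porosity": (0.0, 1.0), "oil_sat": (0.0, 1.0), "water_sat": (0.0, 1.0)}

def out_of_range(record):
    """Return the names of attributes whose value falls outside its range."""
    return [name for name, (low, high) in valid_range.items()
            if not low <= record[name] <= high]

def inconsistent(record, tolerance=1e-6):
    """Multiple-attribute check: in an assumed two-phase system the
    saturations must add up to 1."""
    return abs(record["oil_sat"] + record["water_sat"] - 1.0) > tolerance

range_flags = {r["id"]: out_of_range(r) for r in records if out_of_range(r)}
consistency_flags = [r["id"] for r in records if inconsistent(r)]
```

Note that record P3 passes every single-attribute check; only the relationship between attributes exposes it.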


Data Quality Analysis
• For a data set

• For a variable

• For multiple variables


Data Quality Analysis for A Data Set
• Data profiling: the process of examining the data available from an existing information source and collecting statistics or informative summaries about that data. It is used to:
Improve the ability to search data by tagging it with keywords or descriptions, or by assigning it to a category
Assess data quality, including whether the data conforms to particular standards or patterns
Assess the risk involved in integrating data in new applications, including the challenges of joins
Understand data challenges early in any data-intensive project, so that late project surprises are avoided; finding data problems late in the project can lead to delays and cost overruns
Gain an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality
Data Profiling
• To-do list
Descriptive statistics: minimum, maximum, mean, mode, percentile, standard deviation,
frequency, variation, outliers, aggregates such as count and sum

Metadata information: data type, size, discrete values, uniqueness, occurrence of null
values, typical string patterns, and abstract type recognition.
The metadata can then be used to discover problems such as illegal values, misspellings, missing
values, varying value representation, and duplicates

Individual attribute/column/property: usage, frequency distribution of different values, type

Cross-column/attribute/property: approximate dependency
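
A small part of the to-do list above (null counts, uniqueness, value types, most frequent value, and basic descriptive statistics per column) can be sketched as follows. The table, its column names, and the `profile_column` helper are hypothetical; dedicated profiling tools cover far more than this.

```python
from collections import Counter
from statistics import mean

# Hypothetical table as a list of rows; None marks a missing value.
rows = [
    {"well": "A-1", "formation": "Sandstone", "depth_ft": 5200.0},
    {"well": "A-2", "formation": "sandstone", "depth_ft": 6100.0},
    {"well": "A-2", "formation": "Carbonate", "depth_ft": None},
    {"well": "B-7", "formation": None, "depth_ft": 4800.0},
]

def profile_column(rows, name):
    """Collect metadata and simple statistics for one column."""
    values = [row[name] for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "unique": len(set(present)) == len(present),  # False means duplicates exist
        "types": sorted({type(v).__name__ for v in present}),
        "top": Counter(present).most_common(1),
    }
    # Descriptive statistics only make sense for numerical columns.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present), mean=mean(present))
    return summary

profile = {name: profile_column(rows, name) for name in rows[0]}
```

Here the profile already hints at three problems: a duplicated well ID, a missing depth, and two spellings of the same formation counted as distinct values.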


Data Profiling
• Data profiling in Excel worksheets
Reading material:
• onePetro:
Facilitating data quality improvement in the oil and gas sector
Summary
• Data quality problems for a particular variable
Data quality problems for strings
Data quality problems for numerical values

• Data quality for a dataset or multiple variables
o Data profiling

• Next: Tools for data quality identification and methods of data quality enhancement
Data Quality Problem Identification and Enhancement
• Problems in numerical properties
Outlying data
Inconsistent data

• Methods to identify data quality problems
Univariate (single-variable) analysis: statistics + domain knowledge
o Range: out-of-range

Multivariate (multiple-variable) analysis: statistics + domain knowledge
o Scatter plot
o Multi-dimensional analysis
o Dependency
Univariate analysis for data quality problems
• Location parameters:
Mean: average
Median
Mode

• Dispersion parameters:
Standard deviation

Coefficient of variation: standard deviation / mean

Quartile coefficient of dispersion: half the interquartile range divided by the average of the quartiles, i.e. (Q3 - Q1)/(Q3 + Q1)

Range
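
The location and dispersion parameters above can be computed with Python's standard `statistics` module. The sample values are made up, and two choices here are assumptions rather than anything the slides specify: `stdev` is the sample (n - 1) standard deviation, and the quartiles use the "inclusive" interpolation method.

```python
from statistics import mean, median, mode, quantiles, stdev

values = [2, 4, 4, 5, 7, 9, 11]  # made-up measurements

# Location parameters
location = {
    "mean": mean(values),
    "median": median(values),
    "mode": mode(values),
}

# Quartiles; other interpolation methods give slightly different cut points.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# Dispersion parameters
dispersion = {
    "std_dev": stdev(values),
    "coef_variation": stdev(values) / mean(values),
    "quartile_coef_dispersion": (q3 - q1) / (q3 + q1),
    "range": max(values) - min(values),
}
```

The two coefficients are dimensionless, which makes them convenient for comparing the spread of variables measured in different units.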
Univariate analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range

Visualization
o Boxplot:
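
A boxplot flags points beyond its whiskers as potential outliers. The sketch below reproduces that rule numerically, assuming the common 1.5 × IQR whisker convention and the "inclusive" quartile method; the slides do not state which convention their plots use, and the viscosity values are made up.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return the values outside the whisker fences, plus the fences themselves."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high], (low, high)

viscosity_cp = [1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 2.6, 3.0, 45.0]  # hypothetical readings
outliers, fences = boxplot_outliers(viscosity_cp)
```

A flagged value is a candidate for review, not automatically an error: domain knowledge decides whether 45 cp is a typo or a genuinely heavy oil.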
Univariate analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range

Visualization
o Column chart

[Figure: two column charts of project counts, each labeled "(a) Projects". One plots counts against oil recovery factor, % (bins 1-5, 5-10, 10-20, 20-30, 30-40, 40-50) for carbonate and sandstone; the other plots counts against formation type. Legend: WGs, CDGs, BGs.]
Univariate analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range

Visualization
o Histogram

[Figure: histogram of the number of records (y-axis, 0 to 80) by viscosity, cp (x-axis, binned into intervals up to [1000, +∞)); bar groups labeled C1, C2, C3, C5, C8, C9, C10.]
Univariate analysis for data quality problems
• To-do list
Descriptive statistics
o min, max
o Median, average, mode
o Standard deviation
o Range

Visualization
o Boxplot:
o Histogram
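
A histogram is simply a count of records per interval. The following sketch bins values into half-open intervals with an open-ended last bin and prints a text histogram; the viscosity values and bin edges are assumptions chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter

def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin is open-ended."""
    counts = Counter()
    for v in values:
        i = bisect_right(edges, v) - 1
        if i < 0:
            label = f"< {edges[0]}"
        elif i == len(edges) - 1:
            label = f"[{edges[i]}, +inf)"
        else:
            label = f"[{edges[i]}, {edges[i + 1]})"
        counts[label] += 1
    return counts

viscosity_cp = [0.5, 3, 8, 15, 40, 120, 250, 900, 4000, 25000]  # hypothetical
edges = [1, 10, 100, 1000, 10000]                               # assumed bin edges
counts = histogram(viscosity_cp, edges)

# Text rendering: one '#' per record in the bin.
for label, n in counts.items():
    print(f"{label:>14} {'#' * n}")
```

Unequal or logarithmic bin widths, as used here, are a natural choice for a property such as viscosity that spans several orders of magnitude.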
Univariate analysis for data quality problems
• Combination of histogram and boxplot
Steam Flooding
Multivariate analysis for data quality problems
• Scatter plot and scatter plot matrix
Multivariate analysis for data quality problems
• Correlation matrix
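
A correlation matrix holds one correlation coefficient for every pair of variables. The sketch below assumes the Pearson coefficient (the slides do not say which coefficient is meant) and uses made-up columns with hypothetical names.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical well data.
columns = {
    "depth_ft": [4000, 5000, 6000, 7000, 8000],
    "temperature_F": [120, 140, 160, 180, 200],
    "porosity": [0.25, 0.22, 0.21, 0.17, 0.15],
}
matrix = correlation_matrix(columns)
```

For data quality work, the interesting entries are pairs that should be strongly related by physics but are not, or records that break an otherwise strong relationship.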
Multivariate analysis for data quality problems
• Combination of scatter plot & boxplots
Summary
• Univariate analysis for data quality problems
Basic descriptive statistics
Boxplot
Histogram
Combination of boxplot & histogram

• Multivariate analysis for data quality problems
Scatter plot & scatter plot matrix
Scatter plot & boxplot
