3. Basic Python Packages for Data Analytics

Uploaded by

Đỗ Anh

10/26/2024 Dr. Mai Cao Lan - Faculty of Geology & Petroleum Engineering, HCMUT
Chapter Outline

- Introduction
- NumPy
- Pandas
- Matplotlib

Introduction
| Feature | NumPy | Pandas | Matplotlib |
|---|---|---|---|
| Primary Purpose | Efficient numerical computations and array operations | Data manipulation and analysis | Data visualization |
| Data Structures | Multi-dimensional arrays | Series and DataFrames | Not a data structure library |
| Core Functionality | Vectorized operations, linear algebra, random number generation | Data cleaning, transformation, aggregation, time series analysis | Creating various types of plots (line, scatter, bar, etc.) |
| Typical Use Cases | Scientific computing, machine learning | Data analysis, data wrangling, exploratory data analysis (EDA) | Data visualization for publications, presentations, and dashboards |
| Relationship to Other Libraries | Foundation for many other scientific Python libraries (SciPy, Scikit-learn) | Often used in conjunction with NumPy and Matplotlib | Seaborn is built on top of Matplotlib |
| Key Strengths | Speed, efficiency, numerical precision | Flexibility, ease of use, rich data structures | Customization, wide range of plot types |
| Common Use Cases | Image processing, numerical simulations, machine learning | Data cleaning, feature engineering, exploratory data analysis | Creating publication-quality plots, visualizing data distributions |

Introduction
| Use Case | Purpose | Suitable package |
|---|---|---|
| 1 | Calculate the average porosity of different rock formations. | ??? |
| 2 | Visualize the relationship between oil production and reservoir pressure. | ??? |
| 3 | Perform matrix operations on reservoir simulation data. | ??? |
| 4 | Load a CSV file containing well log data. | ??? |
| 5 | Perform linear regression on production data. | ??? |
| 6 | Create a scatter plot of gas production vs. reservoir pressure for different wells. | ??? |
| 7 | Calculate total and average daily production rates from well data. | ??? |
| 8 | Clean and merge production data from multiple CSV files. | ??? |

NumPy package
NumPy, short for Numerical Python, is a Python package.

Fundamental package for scientific computing with Python.

Provides an N-dimensional array object.

Includes linear algebra and random number capabilities, among others.

Open source.

Installing and importing NumPy
NumPy can be installed with a package manager such as pip or conda:

- pip install numpy

- conda install numpy

NumPy is imported into a Python script with the import statement:

- import numpy as np

Creating and Manipulating Arrays in NumPy
Creating arrays: array(), zeros(), ones(), arange(), linspace()

Array attributes: ndim, shape, size, dtype.

Reshaping arrays: reshape(), flatten(), ravel().

Creating and Manipulating Arrays in NumPy
| Command | Purpose | Syntax | Example |
|---|---|---|---|
| numpy.array() | Creates an array from a list or tuple. | numpy.array(object) | numpy.array([1, 2, 3]) |
| numpy.zeros() | Creates an array filled with zeros. | numpy.zeros(shape) | numpy.zeros((2, 3)) |
| numpy.ones() | Creates an array filled with ones. | numpy.ones(shape) | numpy.ones((3, 2)) |
| numpy.arange() | Creates an array with evenly spaced values. | numpy.arange(start, stop, step) | numpy.arange(0, 10, 2) |
| numpy.linspace() | Creates an array with a specified number of evenly spaced values. | numpy.linspace(start, stop, num) | numpy.linspace(0, 1, 5) |
| ndim | Gets the number of dimensions of an array. | array.ndim | my_array.ndim |
| shape | Gets the shape (dimensions) of an array. | array.shape | zeros_array.shape |
| size | Gets the total number of elements in an array. | array.size | zeros_array.size |
| dtype | Gets the data type of the array elements. | array.dtype | my_array.dtype |
| reshape() | Changes the shape of an array. | array.reshape(new_shape) | my_array.reshape((2, 2)) |
| flatten() | Flattens an array into one dimension. | array.flatten() | reshaped_array.flatten() |
| ravel() | Returns a flattened array, may return a view of the original data. | array.ravel() | reshaped_array.ravel() |
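The creation functions, attributes, and reshaping methods listed above can be tried together in one short script. This is a minimal sketch; the variable names follow the examples in the table where possible, and the remaining names (`evens`, `grid`, `flat_copy`, `flat_view`) are chosen here only for illustration.

```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3])            # from a Python list
zeros_array = np.zeros((2, 3))     # 2 rows x 3 columns, filled with 0.0
ones_array = np.ones((3, 2))       # 3 rows x 2 columns, filled with 1.0
evens = np.arange(0, 10, 2)        # start, stop (excluded), step
grid = np.linspace(0, 1, 5)        # 5 values, both end points included

# Array attributes
print(zeros_array.ndim)    # 2
print(zeros_array.shape)   # (2, 3)
print(zeros_array.size)    # 6
print(zeros_array.dtype)   # float64

# Reshaping arrays
my_array = np.arange(4)                     # [0 1 2 3]
reshaped_array = my_array.reshape((2, 2))   # [[0 1] [2 3]]
flat_copy = reshaped_array.flatten()        # always returns a copy
flat_view = reshaped_array.ravel()          # returns a view when possible

# Changing the copy leaves the original untouched
flat_copy[0] = 99
print(reshaped_array[0, 0])   # 0
```

The practical difference between flatten() and ravel() is memory: flatten() always allocates new data, while ravel() avoids the copy whenever the array layout allows it.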
Array Operations
Element-wise operations: addition, subtraction, multiplication, and division are applied element by element.

Aggregation functions: compute basic properties of the array from the values of ALL its elements, such as:

- sum(): sum of all elements of the array

- mean(), std(): the mean and standard deviation of all elements in the array

- max(), min(): the maximum and minimum values in the array
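The element-wise operations and aggregation functions above can be illustrated with two small arrays. This is a minimal sketch; the arrays `a` and `b` hold made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

# Two arrays of the same shape (illustrative values)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Element-wise operations: each result has the same shape as a and b
total = a + b    # [11. 22. 33. 44.]
diff = b - a     # [ 9. 18. 27. 36.]
prod = a * b     # [ 10.  40.  90. 160.]
ratio = b / a    # [10. 10. 10. 10.]

# Aggregation functions: each reduces the whole array to one number
print(a.sum())    # 10.0
print(a.mean())   # 2.5
print(a.std())    # population standard deviation, sqrt(1.25)
print(a.max())    # 4.0
print(a.min())    # 1.0
```

Note that std() computes the population standard deviation by default; passing ddof=1 gives the sample standard deviation instead.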

Indexing and Slicing
Basic indexing and slicing (array[start:stop:step]).

Boolean indexing and masking.
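Both indexing styles named above can be sketched on a single one-dimensional array. The array `arr` and the thresholds used in the masks are illustrative choices, not part of the original slides.

```python
import numpy as np

arr = np.arange(10)        # [0 1 2 3 4 5 6 7 8 9]

# Basic indexing and slicing: array[start:stop:step], stop is excluded
first = arr[0]             # 0
last = arr[-1]             # 9 (negative indices count from the end)
middle = arr[2:7]          # [2 3 4 5 6]
every_other = arr[::2]     # [0 2 4 6 8]
reversed_arr = arr[::-1]   # [9 8 7 6 5 4 3 2 1 0]

# Boolean indexing: a comparison produces a mask of True/False values
mask = arr > 5
selected = arr[mask]       # [6 7 8 9]

# Masks are combined with & (and) and | (or); parentheses are required
in_range = arr[(arr > 2) & (arr < 7)]   # [3 4 5 6]
```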

Linear Algebra with NumPy

Matrix multiplication: numpy.dot(), @ operator.

Transpose, inverse, and determinant of a matrix:

- numpy.transpose

- numpy.linalg.inv

- numpy.linalg.det

Eigenvalues and eigenvectors:

- numpy.linalg.eig
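The functions listed above can be exercised on small matrices whose results are easy to verify by hand. This is a minimal sketch; the matrices `M` and `N` are illustrative and deliberately different from the quiz matrices.

```python
import numpy as np

# Illustrative 2x2 matrices (M is triangular, so its eigenvalues are 2 and 3)
M = np.array([[2.0, 0.0],
              [1.0, 3.0]])
N = np.array([[1.0, 4.0],
              [0.0, 1.0]])

# Matrix multiplication: @ and np.dot agree for 2-D arrays
product = M @ N                     # [[2. 8.] [1. 7.]]
same_product = np.dot(M, N)

# Transpose, determinant, inverse (inverse and determinant need a square matrix)
M_T = np.transpose(M)               # [[2. 1.] [0. 3.]]
det_M = np.linalg.det(M)            # 2*3 - 0*1 = 6
M_inv = np.linalg.inv(M)

# Eigenvalues and eigenvectors: column i of eigenvectors belongs to eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eig(M)

print(product)
print(det_M)
print(np.allclose(M @ M_inv, np.eye(2)))   # True
```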

Linear Algebra with NumPy
Quizzes:

1. Create two matrices A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]].
Compute their matrix multiplication.

2. Transpose and invert the matrix C, and calculate its determinant:

C = [[9, 5, 4, 7], [1, 4, 3, 1], [6, 8, 4, 0]]

Exercises with NumPy
Exercises:

1. Create a random 3D array with dimensions 3×4×5. Slice out the second "plane" (2D slice) along the first axis.

2. Create a random 2D array with dimensions 4×4. Find all elements that are greater than a threshold of 0.5.

3. Solve the following system of linear equations:

Exercises with NumPy
Exercises:

4. Create a NumPy array with random integers between 0 and 15, of size 1×20. Identify and print all elements that are greater than 5 and less than 10.

5. Create a NumPy array with 10 random integers between 1 and 5. Check if all elements in the array are equal and print the result (True/False).

6. Create a NumPy array with random integers between -3 and 3, of size 1×15. Find and print the indices of all elements that are equal to 0.
Pandas Package

Pandas provides data manipulation tools in Python.

Pandas is normally used in combination with numerical computing tools like NumPy, data visualization libraries like Matplotlib, or machine learning tools such as scikit-learn.

Pandas fully supports the NumPy array data structure.
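The interplay between Pandas and NumPy arrays can be sketched as a round trip: build a DataFrame from an array, apply NumPy functions to its columns, and convert back. The column names and numbers below are made-up illustrative values, not data from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 NumPy array: one row per sample
values = np.array([[0.12, 150.0],
                   [0.18, 320.0],
                   [0.25, 900.0]])

# A DataFrame can be built directly from a NumPy array
df = pd.DataFrame(values, columns=["porosity", "permeability_md"])

# NumPy functions accept pandas objects
mean_porosity = np.mean(df["porosity"])        # a single number
log_perm = np.log10(df["permeability_md"])     # element-wise, returns a Series

# Converting back to a plain NumPy array
back_to_numpy = df.to_numpy()

print(df)
print(mean_porosity)
```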

Pandas Data Structure: DataFrame

Pandas Series & DataFrame

Series vs. DataFrame

| Properties | Series | DataFrame |
|---|---|---|
| Dimensionality | 1D (one-dimensional) | 2D (two-dimensional) |
| Data Type | Homogeneous (one data type) | Heterogeneous (multiple data types) |
| Access | Access elements via index labels | Access data via column names and row labels |
| Axis Labels | Single axis (index) | Two axes (row index and column labels) |
| Examples | Column of data, single data point | Table, spreadsheet, multiple rows/columns |

Pandas Series

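Creating Pandas Series

import pandas as pd
import numpy as np
# build a NumPy array of single-character strings
data = np.array(['p', 'y', 't', 'h', 'o', 'n'])
# wrap the array in a Series; a default integer index (0..5) is created
ps = pd.Series(data)
print(ps)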

>>>
???

Pandas Series Accessing
There are two ways to access an element of a Series:

– Accessing an element by position, using []

– Accessing an element by label (index)
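The two access styles can be sketched on a small Series. The Series `s` below is hypothetical and unrelated to the quiz data. The sketch uses the explicit `.iloc` (position) and `.loc` (label) accessors, since plain `[]` with an integer on a label-indexed Series can be ambiguous.

```python
import pandas as pd

# Hypothetical Series with string labels
s = pd.Series([10, 20, 30, 40], index=["w", "x", "y", "z"])

# Access by position (0-based)
by_position = s.iloc[1]         # 20

# Access by label; s["y"] gives the same result
by_label = s.loc["y"]           # 30

# Slicing by position excludes the stop position
pos_slice = s.iloc[1:3]         # values at positions 1 and 2

# Slicing by label includes the stop label
label_slice = s.loc["x":"z"]    # values at labels x, y, z

print(by_position, by_label)
print(label_slice)
```

The asymmetry between the two slices is worth remembering: positional slices follow the usual Python rule (stop excluded), while label slices include both end labels.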

Quiz: Access and print the following:

- The info associated with index label 'C'.

- The info at the second position (index 1).

- A subset of the data from index 'B' to index 'D'.

Data: ['name', 'age', 'address', 'phone', 'email']

Index: ['A', 'B', 'C', 'D', 'E']

Using Series.apply(function) in Pandas
Syntax:

Series.apply(func, convert_dtype=True, args=(), **kwds)
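As an illustration of the syntax above, the sketch below applies a user-defined function to every element of a Series, once without and once with extra positional arguments passed through `args`. The functions `psi_to_bar` and `scale` and the pressure values are hypothetical examples, chosen to be different from the quiz that follows.

```python
import pandas as pd

def psi_to_bar(p_psi):
    """Convert a pressure from psi to bar (1 psi = 0.0689476 bar)."""
    return p_psi * 0.0689476

def scale(value, factor):
    """Multiply a value by a factor supplied through args."""
    return value * factor

# Hypothetical pressures in psi
pressures_psi = pd.Series([0.0, 1000.0, 2800.0])

# apply() calls the function once per element and returns a new Series
pressures_bar = pressures_psi.apply(psi_to_bar)

# Extra positional arguments are passed as a tuple through args
doubled = pressures_psi.apply(scale, args=(2,))

print(pressures_bar)
print(doubled)
```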

Quiz: Given the temperature data below, stored in a Series called celsius_series:

Data: [0, 20, 37, 100, 37.5]

- Define a function called celsius_to_fahrenheit that takes a Celsius temperature and returns the equivalent Fahrenheit temperature using the formula: F = (C*9/5)+32

- Use the apply() method to convert the temperatures in celsius_series to Fahrenheit.

Pandas DataFrame
Creating a DataFrame:

Creating Pandas DataFrame

import pandas as pd
# each dictionary key becomes a column name, each list a column of values
data = {
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 28, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print(df)

>>>
???

Practical Exercise
The PVT fluid data from a gas well in the Anaconda Gas Field is given below:

| p (psig) | μ (cp) | Z |
|---|---|---|
| 0.0 | 0.01270 | 1.000 |
| 400.0 | 0.01286 | 0.937 |
| 800.0 | 0.01390 | 0.882 |
| 1200.0 | 0.01530 | 0.832 |
| 1600.0 | 0.01680 | 0.794 |
| 2000.0 | 0.01840 | 0.770 |
| 2400.0 | 0.02010 | 0.763 |
| 2800.0 | 0.02170 | 0.775 |
| 3200.0 | 0.02340 | 0.797 |
| 3600.0 | 0.02500 | 0.827 |
| 4000.0 | 0.02660 | 0.860 |
| 4400.0 | 0.02831 | 0.896 |

The well is producing at a stabilized bottom-hole flowing pressure of 2800 psi. The wellbore radius is 0.3 ft. The following additional data is available:

k = 65 md; h = 15 ft; T = 600 °R; Pr = 4400 psi; re = 1000 ft.

1. Calculate the gas flow rate in Mscf/day

2. Draw the graph of m(p) versus pressure
Practical Exercise (cont’d)
The production flow rate of a gas well can be estimated using well and reservoir data as follows:

$$q_g = \frac{k h \left[ m(p_r) - m(p_{wf}) \right]}{1424\, T \left[ \ln\left( \frac{0.472\, r_e}{r_w} \right) + s \right]}$$

where the real-gas pseudo pressure m(p) is defined as:

$$m(p) = \int_0^p \frac{2p}{\mu Z}\, dp$$
Practical Exercise (cont’d)
Trapezoidal Method
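One possible way to apply the trapezoidal method to this exercise is sketched below. It integrates 2p/(μZ) over the tabulated pressures to build m(p), then evaluates the flow-rate formula. Two assumptions are made here that the slides do not state: the skin factor s is taken as 0, and the tabulated pressures are used directly as given. The variable names are chosen for this sketch only.

```python
import numpy as np

# PVT table from the exercise
p = np.array([0.0, 400.0, 800.0, 1200.0, 1600.0, 2000.0,
              2400.0, 2800.0, 3200.0, 3600.0, 4000.0, 4400.0])
mu = np.array([0.01270, 0.01286, 0.01390, 0.01530, 0.01680, 0.01840,
               0.02010, 0.02170, 0.02340, 0.02500, 0.02660, 0.02831])
z = np.array([1.000, 0.937, 0.882, 0.832, 0.794, 0.770,
              0.763, 0.775, 0.797, 0.827, 0.860, 0.896])

# Integrand of the pseudo-pressure integral at each tabulated pressure
integrand = 2.0 * p / (mu * z)

# Trapezoidal rule: area of each panel = average height * width,
# accumulated from p = 0 so that m[i] approximates m(p[i])
panel_areas = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(p)
m = np.concatenate(([0.0], np.cumsum(panel_areas)))

# Well and reservoir data from the exercise
k, h, T = 65.0, 15.0, 600.0      # md, ft, degrees Rankine
re, rw = 1000.0, 0.3             # ft
p_r, p_wf = 4400.0, 2800.0       # psi
s = 0.0                          # ASSUMPTION: skin factor not given, taken as 0

# Pseudo-pressures at reservoir and bottom-hole flowing pressure
m_pr = np.interp(p_r, p, m)
m_pwf = np.interp(p_wf, p, m)

# Gas flow rate in Mscf/day
qg = k * h * (m_pr - m_pwf) / (1424.0 * T * (np.log(0.472 * re / rw) + s))

print("m(p) values:", m)
print("Gas flow rate (Mscf/day):", qg)
```

The arrays `p` and `m` can be passed straight to plt.plot(p, m) to draw the graph of m(p) versus pressure requested in the second part of the exercise.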

Practical Exercise (cont’d)

Matplotlib Package
Basic Plotting

# import the libs
import matplotlib.pyplot as plt
import numpy as np

# prepare data: 200 points over one full period of the sine function
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
# plot the data
plt.plot(x, y)
# set axis labels and figure title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Basic Plotting')
# set grid on
plt.grid()
# show the plot
plt.show()

Matplotlib – Basic Plotting

Matplotlib – Color, Marker & Line Style

Color, Marker & Line Style

import numpy as np
import matplotlib.pyplot as plt
# prepare data
x = np.arange(10)
y = x
# plot the data with customized plot properties
plt.plot(x, y,
         color='m',
         linestyle='--',
         linewidth=1.5,
         marker='^',
         markersize=5)
# set grid on
plt.grid()
# show the plot
plt.show()

Matplotlib – Color, Marker & Line Style

Matplotlib – Multiline Plots
Multiline Plots

# import the libs
import matplotlib.pyplot as plt
import numpy as np

# prepare data
x = np.arange(10)
# plot four curves; each format string sets a marker and a dashed line
plt.plot(x, -x**2, 'o--')
plt.plot(x, -x**3, '^--')
plt.plot(x, -2*x, '>--', x, -2**x, '*--')
# legend entries follow the order in which the curves were plotted
plt.legend(['-x**2', '-x**3', '-2*x', '-2**x'], loc='lower left')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Multiline Plotting')
plt.grid()
plt.show()

Matplotlib – Multiline Plots

Matplotlib Package
Practice

- Generate an array of 100 points ranging from 0 to 10 (inclusive) using NumPy.

- Calculate the following functions for each point:

+ Function 1: y1 = sin(x)

+ Function 2: y2 = cos(x)

+ Function 3: y3 = x^2

- Create a line plot for each function using a different color, marker, and line style. Add a title and labels for the x-axis and y-axis to each plot, and include a legend to distinguish between the lines.