l9 Scientific Python Proc
Radoslav Škoviera
Table of contents
Overview
Modules
  NumPy: Numerical Data Handling
    Loading Data
  Pandas: Tabular Data Processing
    Loading Data
    Creating tables
    Data Exploration
    Accessing Data
    Filtering and selection
    Grouping and aggregation
    Saving Data
  SciPy: Scientific Computations
    optimize: Function Minimization & Curve Fitting
    integrate: Numerical Integration & ODE Solvers
    interpolate: Data Interpolation
    signal: Digital Signal Processing
    fft: Fourier Transforms
    stats: Statistical Functions & Tests
    sparse: Sparse Matrix Tools
    spatial: KD-Tree for Nearest Neighbors
    Further Reading
  scikit-learn: Machine Learning Preprocessing
    Preprocessing (sklearn.preprocessing)
    Data Splitting and Model Selection (sklearn.model_selection)
    Estimators
    Evaluation Metrics (sklearn.metrics)
  scikit-image: Image Processing
    I/O Plugins (skimage.io)
    Color Space (skimage.color)
    Filtering (skimage.filters)
    Morphology (skimage.morphology)
    Geometric Transforms (skimage.transform)
Overview
NumPy (Numerical Python) is the foundational package for numerical computing in Python.
It provides support for large, multi-dimensional arrays and matrices, along with a collection
of mathematical functions to operate on these arrays efficiently. NumPy’s array-oriented
computing is essential for scientific computing tasks, enabling high-performance operations
on large datasets.
SciPy is a library used for scientific and technical computing. It builds on NumPy by adding
a collection of algorithms and high-level commands for data manipulation and analysis. SciPy
includes modules for optimization, integration, interpolation, eigenvalue problems, algebraic
and differential equations, and others, making it a powerful tool for scientific applications.
Pandas is a library for data manipulation and analysis. It offers data structures and operations
for manipulating numerical tables and time series. Pandas introduces two new data structures
to Python: Series and DataFrame, which are built on top of NumPy arrays. These structures
allow for fast and efficient data manipulation.
To use Pandas with Excel files, you need to install the openpyxl package.
!pip install scikit-learn
scikit-learn is a machine learning library for the Python programming language. It features
various classification, regression, and clustering algorithms, including support-vector machines,
random forests, gradient boosting, k-means, and DBSCAN. Designed to interoperate with the
Python numerical and scientific libraries NumPy and SciPy, scikit-learn is widely used for its
simplicity and efficiency in implementing machine learning models.
Modules
NumPy: Numerical Data Handling
Loading Data
We have covered NPY and NPZ loading in previous lectures. It is also possible to load
structured text (CSV) files using NumPy.
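As a minimal sketch of CSV loading with NumPy, the example below parses a small in-memory CSV with np.genfromtxt. The column names and values are made up for illustration; in practice the first argument would be a file path.

```python
import io

import numpy as np

# Hypothetical CSV content; in practice pass a file path instead of a StringIO.
csv_text = "x,y\n0.0,1.5\n1.0,2.5\n2.0,4.0\n"

# skip_header=1 drops the header row; every remaining field is parsed as float.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(arr.shape)

# names=True keeps the header and returns a structured array with named fields.
rec = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)
print(rec["y"].mean())
```

The structured-array form works for simple files, but mixed column types and missing values quickly become awkward to handle this way.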
Even though NumPy can load CSV data, it is best to use the library dedicated to loading and
processing of tabular data: Pandas.
import pandas as pd
Loading Data
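A minimal sketch of loading tabular data with pd.read_csv is given here. The rows are invented and only mimic the column layout of the sales table explored later; the name df_example is hypothetical. Excel files can be read in a similar way with pd.read_excel, which relies on the openpyxl package mentioned in the Overview.

```python
import io

import pandas as pd

# Hypothetical rows mimicking the column layout of the sales table used later.
csv_text = (
    "order_id,order_date,category,region,sales,quantity,returned\n"
    "A001,2023-01-01,Books,North,12.50,3,False\n"
    "A002,2023-01-02,Home,South,30.10,7,True\n"
)

# parse_dates converts the listed columns to datetime64 while loading.
df_example = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])
print(df_example.dtypes)
print(df_example.head())
```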
Creating tables
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(data)
print(df)
Data Exploration
print("--" * 20)
print("Data types")
print(df_csv.dtypes)
print("--" * 20)
print("Summary statistics")
print(df_csv.describe())
print("--" * 20)
print("Unique values")
print(df_csv.nunique())
print("--" * 20)
2 category 1000 non-null object
3 region 1000 non-null object
4 sales 1000 non-null float64
5 quantity 1000 non-null int64
6 returned 1000 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 48.0+ KB
----------------------------------------
Data types
order_id object
order_date datetime64[ns]
category object
region object
sales float64
quantity int64
returned bool
dtype: object
----------------------------------------
Summary statistics
       order_date            sales         quantity
count  1000                  1000.000000   1000.000000
mean   2024-05-14 12:00:00   21.866070     9.956000
min    2023-01-01 00:00:00   4.340000      1.000000
25%    2023-09-07 18:00:00   13.725000     5.000000
50%    2024-05-14 12:00:00   19.770000     10.000000
75%    2025-01-19 06:00:00   26.900000     15.000000
max    2025-09-26 00:00:00   86.240000     19.000000
std    NaN                   11.861339     5.346649
----------------------------------------
Unique values
order_id 1000
order_date 1000
category 4
region 4
sales 870
quantity 19
returned 2
dtype: int64
----------------------------------------
Single column stats
Mean sales: 21.86607
----------------------------------------
Accessing Data
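The printed results in this section are consistent with the access patterns sketched here on the small table from "Creating tables". This is a reconstruction under that assumption, not necessarily the exact code that produced the output.

```python
import pandas as pd

# Same small table as in "Creating tables".
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85, 92, 78],
})

print("Column:", df["Age"])                        # single column -> Series
print("Multiple columns:", df[["Name", "Score"]])  # list of names -> DataFrame
print("Row by index:", df.iloc[1])                 # position-based
print("Row by label:", df.loc[1])                  # label-based
print("Rows by index:", df.iloc[1:3])              # end position excluded
print("Rows by label:", df.loc[1:2])               # end label included
print("Rows by condition:", df[df["Age"] > 30])    # boolean mask
print("Column by index:", df.iloc[:, 1])
print("Column by label:", df.loc[:, "Age"])
```

Note the slicing difference: iloc follows Python's half-open convention, while loc includes the end label.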
Column: 0 25
1 30
2 35
Name: Age, dtype: int64
Multiple columns: Name Score
0 Alice 85
1 Bob 92
2 Charlie 78
Row by index: Name Bob
Age 30
Score 92
Name: 1, dtype: object
Row by label: Name Bob
Age 30
Score 92
Name: 1, dtype: object
Rows by index: Name Age Score
1 Bob 30 92
2 Charlie 35 78
Rows by label: Name Age Score
1 Bob 30 92
2 Charlie 35 78
Rows by condition: Name Age Score
2 Charlie 35 78
Column by index: 0 25
1 30
2 35
Name: Age, dtype: int64
Column by label: 0 25
1 30
2 35
Name: Age, dtype: int64
Filtering and selection
# Filter by condition
print("Filter by condition:", df[df['Age'] >= 30])
print("Filter by multiple conditions:", df[(df['Age'] >= 30) & (df['Score'] < 90)])
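Grouping and aggregation
# Total sales per region
df_csv.groupby('region').agg({'sales': 'sum'})
# Mean sale value and number of sales per category
df_csv.groupby('category').agg({'sales': ['mean', 'count']})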
             sales
             mean       count
category
Books        21.376166  253
Clothing     22.035588  238
Electronics  22.467233  253
Home         21.598516  256
Saving Data
# Save to CSV
df.to_csv('cleaned_data.csv', index=False)
# Save to Excel
df.to_excel('cleaned_data.xlsx', index=False)
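import numpy as np
from scipy.optimize import minimize
from matplotlib import pyplot as plt

def f(x):
    return x**2 + 10*np.sin(x)

# Assumed setup (not shown in the original): start the search at x0 = 0
# and sample the curve on a range that matches the plotted axis.
res = minimize(f, x0=0)
x_vals = np.linspace(-6, 4, 200)
y_vals = f(x_vals)

plt.plot(x_vals, y_vals)
plt.scatter(res.x, f(res.x), color='red')
plt.annotate(f"Minimum at x={res.x}", (res.x[0], f(res.x)[0]))
plt.show()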
[Figure: plot of f(x) with the located minimum marked in red and annotated "Minimum at x=[-1.30644012]"]
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
# Define model: sine wave with amplitude a and angular frequency b
def model(x, a, b):
    return a * np.sin(b * x)
# x, y: 1-D arrays of measured data (defined earlier, not shown)
# Fit parameters; cov is the estimated covariance matrix of params
params, cov = curve_fit(model, x, y)
print("Fitted params:", params)
# Plot
plt.scatter(x, y, label='Data')
plt.plot(x, model(x, *params), 'r-', label='Fit')
plt.legend()
plt.show()
Fitted params: [3.55125647 1.29908483]
[Figure: scatter plot of the data with the fitted curve; legend: Data, Fit]
integrate: Numerical Integration & ODE Solvers
Key Functions
import numpy as np
from scipy.integrate import quad
# Exact value of this integral: sqrt(pi)/2
f = lambda t: np.exp(-t**2)
result, error = quad(f, 0, np.inf)
print("Integral of exp(-t^2) from 0 to ∞ =", result)
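scipy.integrate also contains ODE solvers. As a hedged sketch, the example below integrates dy/dt = -2y with y(0) = 1 using solve_ivp and compares the result with the exact solution exp(-2t). The test equation, time span, and tolerances are chosen purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    # Right-hand side of dy/dt = -2*y (illustrative test equation).
    return -2.0 * y

t_eval = np.linspace(0.0, 2.0, 21)
sol = solve_ivp(rhs, t_span=(0.0, 2.0), y0=[1.0], t_eval=t_eval,
                rtol=1e-8, atol=1e-10)

# Compare against the known analytic solution.
exact = np.exp(-2.0 * t_eval)
max_err = np.max(np.abs(sol.y[0] - exact))
print("success:", sol.success, "max abs error:", max_err)
```

sol.y has one row per state variable, so sol.y[0] is the trajectory of the single unknown here.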
interpolate: Data Interpolation
scipy.interpolate offers classes and functions for one‑ and multi‑dimensional interpolation
and smoothing splines.
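from scipy.interpolate import interp1d
# x_raw, y_raw: 1-D arrays of sample positions and values (defined earlier, not shown)
# Create interpolator
f_linear = interp1d(x_raw, y_raw, kind='linear')
f_cubic = interp1d(x_raw, y_raw, kind='cubic')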
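To make the difference between the two interpolators concrete, here is a self-contained sketch. It assumes the raw data are a coarse sampling of sin(x) on [0, 10], which is an illustrative choice, and measures how far each interpolant strays from the true function on a dense grid.

```python
import numpy as np
from scipy.interpolate import interp1d

# Assumed sample data: a coarse sampling of sin(x) on [0, 10].
x_raw = np.linspace(0, 10, 11)
y_raw = np.sin(x_raw)

f_linear = interp1d(x_raw, y_raw, kind="linear")
f_cubic = interp1d(x_raw, y_raw, kind="cubic")

# Evaluate both interpolators on a dense grid inside the sampled range.
x_fine = np.linspace(0, 10, 201)
err_linear = np.max(np.abs(f_linear(x_fine) - np.sin(x_fine)))
err_cubic = np.max(np.abs(f_cubic(x_fine) - np.sin(x_fine)))
print("max error (linear):", err_linear)
print("max error (cubic): ", err_cubic)
```

Both interpolants pass through the raw points exactly; they differ only in how they fill the gaps between them.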
[Figure: raw data points with both interpolants; legend: Raw data points, Linear interpolation, Cubic interpolation]
signal: Digital Signal Processing
scipy.signal provides signal processing tools: filtering, spectral analysis, window functions,
and convolution.
Key Functions
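import numpy as np
from scipy.signal import butter, sosfiltfilt
# Design filter: 4th-order Butterworth low-pass, cutoff at 0.2 x Nyquist frequency
sos = butter(4, 0.2, btype='low', output='sos')
# Create noisy signal
t = np.linspace(0, 1, 500)
sig = np.sin(2*np.pi*5*t) + 0.5*np.random.randn(500)
# Apply zero-phase filter (forward and backward pass, so no phase shift)
filtered = sosfiltfilt(sos, sig)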
[Figure: signal over time; legend: Noisy, Filtered, Smoothed]
fft: Fourier Transforms
scipy.fft provides Fast Fourier Transform routines for one- and multi-dimensional arrays.
Key Functions
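import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
# Signal: 50 Hz and 120 Hz sine components
t = np.linspace(0, 1, 400)
x = np.sin(2*np.pi*50*t) + 0.5*np.sin(2*np.pi*120*t)
# Compute FFT
X = fft(x)
freqs = fftfreq(t.size, d=t[1]-t[0])
# Plot the signal and the positive-frequency half of the spectrum
fig, [ax1, ax2] = plt.subplots(2, 1)
ax1.plot(t, x)
ax1.set_title("Time Domain")
ax2.plot(freqs[:200], np.abs(X)[:200])
ax2.set_title("Magnitude Spectrum")
ax2.set_xlabel("Frequency (Hz)")
ax2.set_ylabel("Amplitude")
plt.tight_layout()
plt.show()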
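The spectrum can also be read numerically rather than from a plot. The sketch below is an illustrative variant, not the code above: it assumes a sampling rate of 400 Hz with the endpoint excluded, so that both sine frequencies fall exactly on FFT bins, and uses rfft, which returns only the non-negative frequencies of a real signal.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

fs = 400                      # assumed sampling rate in Hz
t = np.arange(fs) / fs        # exactly one second, endpoint excluded
x = np.sin(2*np.pi*50*t) + 0.5*np.sin(2*np.pi*120*t)

# rfft returns only the non-negative frequencies of a real-valued signal.
X = rfft(x)
freqs = rfftfreq(x.size, d=1/fs)

# Scale magnitudes so that a unit-amplitude sine gives a peak of about 1.
amplitude = 2 * np.abs(X) / x.size

# The two largest peaks should sit at the two sine frequencies.
top2 = np.sort(freqs[np.argsort(amplitude)[-2:]])
print("Dominant frequencies (Hz):", top2)
```

With this scaling, the peak heights approximate the amplitudes of the individual sine components.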
[Figure: two panels. Top: "Time Domain". Bottom: "Magnitude Spectrum", with x-axis "Frequency (Hz)" and y-axis "Amplitude"]
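stats: Statistical Functions & Tests
from scipy.stats import ttest_ind
# a, b: 1-D arrays holding the two independent samples (defined earlier, not shown)
stat, p = ttest_ind(a, b)
print("t-statistic =", stat, "p-value =", p)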
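A self-contained sketch of the two-sample t-test follows. The samples are synthetic (two normal populations whose means differ by 1), and Welch's variant is selected here as an illustrative choice; neither is taken from the original data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Assumed synthetic samples: two normal populations whose means differ by 1.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=1.0, scale=1.0, size=200)

# Welch's variant (equal_var=False) does not assume equal population variances.
stat, p = ttest_ind(a, b, equal_var=False)
print("t-statistic =", stat, "p-value =", p)

# A small p-value suggests the difference in sample means is unlikely under
# the null hypothesis of equal population means.
print("reject H0 at the 5% level:", p < 0.05)
```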
sparse: Sparse Matrix Tools
scipy.sparse supports sparse matrix representations for memory-efficient storage and fast
arithmetic on large, sparse arrays.
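As a minimal sketch of the idea, the example below builds a small matrix in CSR (compressed sparse row) format from coordinate triplets. The matrix contents are made up; real use cases involve far larger matrices where the dense form would not fit in memory.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build a 4x4 matrix with only four non-zero entries from (row, col, value) triplets.
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 2, 2, 1])
vals = np.array([1.0, 2.0, 3.0, 4.0])
A = csr_matrix((vals, (rows, cols)), shape=(4, 4))

print("stored non-zeros:", A.nnz, "of", A.shape[0] * A.shape[1], "entries")

# Matrix-vector products work directly on the sparse representation.
v = np.ones(4)
print("A @ v =", A @ v)

# Convert to a dense array only when the matrix is small enough to fit in memory.
print(A.toarray())
```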
Key Classes
Key Methods
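import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import KDTree
# Build a KD-tree over 100 random 2-D points
points = np.random.rand(100, 2)
tree = KDTree(points)
# Query the 5 nearest neighbors of the point (0.5, 0.5)
dist, idx = tree.query([0.5, 0.5], k=5)
print("Nearest indices:", idx)
# Plot the query point (red) and its nearest neighbors (green)
plt.scatter([0.5], [0.5], c='r')
plt.scatter(points[idx, 0], points[idx, 1], c='g')
plt.show()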
[Figure: scatter plot of the query point (red) and its five nearest neighbors (green)]
Further Reading
scikit-learn provides a unified interface to many machine-learning algorithms and data tools
from preprocessing to model selection and evaluation.
Preprocessing (sklearn.preprocessing)
• StandardScaler: Centers each feature to zero mean and scales it to unit variance. Important for algorithms that assume centered features with comparable variance (e.g. SVM, regularized linear models).
• MinMaxScaler: Scales each feature to a fixed range ([0, 1] by default), preserving the shape of the original distribution but sensitive to outliers.
• RobustScaler: Uses the median and IQR (inter-quartile range) for centering and scaling, which makes it robust to outliers.
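The three scalers listed above can be checked against their defining formulas. The sketch below uses a tiny made-up feature column with one outlier and assumes the default settings of each scaler (for RobustScaler, the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Small made-up feature column with one outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: (x - mean) / std, using the population std (ddof=0).
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)

# MinMaxScaler: (x - min) / (max - min), mapping onto [0, 1] by default.
manual_mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# RobustScaler: (x - median) / IQR, with IQR = 75th - 25th percentile.
q25, q50, q75 = np.percentile(x, [25, 50, 75], axis=0)
manual_rb = (x - q50) / (q75 - q25)

print(np.allclose(manual_std, StandardScaler().fit_transform(x)))
print(np.allclose(manual_mm, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_rb, RobustScaler().fit_transform(x)))
```

In the min-max result the outlier pushes the four ordinary values into a narrow band near 0, whereas the robust scaling keeps them spread over roughly [-1, 0.5].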
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
N = 200
input_data = np.random.randn(N, 1)**2 * 100 - 50
sc1 = StandardScaler().fit(input_data)
data_std = sc1.transform(input_data)
sc2 = MinMaxScaler().fit(input_data)
data_mm = sc2.transform(input_data)
sc3 = RobustScaler().fit(input_data)
data_rb = sc3.transform(input_data)
# Plot
x_vals = np.arange(N)
fig, ax = plt.subplots(4, 1, figsize=(16, 8))
ax[0].scatter(x_vals, input_data)
ax[1].scatter(x_vals, data_std)
ax[2].scatter(x_vals, data_mm)
ax[3].scatter(x_vals, data_rb)
ax[0].set_title('Original Data')
ax[1].set_title('Standard Scaler')
ax[2].set_title('MinMax Scaler')
ax[3].set_title('Robust Scaler')
plt.tight_layout()
plt.show()
[Figure: four scatter panels titled "Original Data", "Standard Scaler", "MinMax Scaler", and "Robust Scaler"]
TRAIN: [0 1 2 4 5 7 8 9] TEST: [3 6]
TRAIN: [1 2 3 4 5 6 7 9] TEST: [0 8]
TRAIN: [0 1 3 4 5 6 7 8] TEST: [2 9]
TRAIN: [0 2 3 5 6 7 8 9] TEST: [1 4]
TRAIN: [0 1 2 3 4 6 8 9] TEST: [5 7]
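The TRAIN/TEST lines above have the form printed by a shuffled 5-fold split over 10 samples. A minimal sketch that produces output of this form is given below; the data and the random seed are assumptions, so the exact indices will generally differ from those shown.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up samples with 2 features
y = np.arange(10)

# 5-fold cross-validation: every sample lands in exactly one test fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)

# A single hold-out split, here with 20% of the samples reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```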
Estimators
Linear Models
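from sklearn.linear_model import LinearRegression, LogisticRegression
# X_train: training features; y_tr_reg / y_tr_clf: regression / classification targets (defined earlier, not shown)
lr = LinearRegression().fit(X_train, y_tr_reg)
logr = LogisticRegression().fit(X_train, y_tr_clf)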
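An end-to-end sketch of both linear models follows. The data are synthetic (a noisy line for regression and a sign-based label for classification), and the variable names mirror the snippet above only for readability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up data: a noisy linear target and a binary label derived from the feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y_reg = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)
y_clf = (X[:, 0] > 0).astype(int)

X_train, X_test, y_tr_reg, y_te_reg, y_tr_clf, y_te_clf = train_test_split(
    X, y_reg, y_clf, test_size=0.25, random_state=0
)

lr = LinearRegression().fit(X_train, y_tr_reg)
logr = LogisticRegression().fit(X_train, y_tr_clf)

print("slope:", lr.coef_[0], "intercept:", lr.intercept_)
print("R^2 on test set:", lr.score(X_test, y_te_reg))
print("accuracy on test set:", logr.score(X_test, y_te_clf))
```

Both estimators share the fit / predict / score interface, which is what makes them interchangeable inside scikit-learn pipelines.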
Clustering
• DBSCAN: Density-based clustering; identifies arbitrarily shaped clusters.
# Plot
fig, ax = plt.subplots(2, 1, figsize=(8, 16))
ax[0].scatter(X[:, 0], X[:, 1], c=km.labels_)
ax[1].scatter(X[:, 0], X[:, 1], c=db.labels_)
ax[0].set_title('KMeans')
ax[1].set_title('DBSCAN')
plt.show()
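The plotting code above expects a feature array X and fitted models km and db. A hedged sketch of how such objects could be produced is shown below; the blob data and all hyperparameters (n_clusters, eps, min_samples) are illustrative assumptions, not the values behind the figure.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Made-up data: three well-separated 2-D blobs.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=0)

# KMeans needs the number of clusters up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

n_db_clusters = len(set(db.labels_) - {-1})
print("KMeans clusters:", len(set(km.labels_)))
print("DBSCAN clusters:", n_db_clusters, "noise points:", np.sum(db.labels_ == -1))
```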
[Figure: two scatter panels titled "KMeans" and "DBSCAN", with points colored by cluster label]
Evaluation Metrics (sklearn.metrics)
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
Accuracy: 1.0
MSE: 0.009999999999999468
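Values such as the accuracy and MSE above come from functions in sklearn.metrics. The sketch below applies accuracy_score and mean_squared_error to small made-up label and prediction arrays, so the numbers it prints are unrelated to the ones shown above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions, for illustration only.
y_true_clf = np.array([0, 1, 1, 0, 1])
y_pred_clf = np.array([0, 1, 0, 0, 1])
acc = accuracy_score(y_true_clf, y_pred_clf)   # fraction of matching labels
print("Accuracy:", acc)

y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.5, 2.0, 2.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)  # mean of squared errors
print("MSE:", mse)
```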
• imread/imsave: Read/write images in multiple formats via plugins (PIL, imageio, tifffile, GDAL).
• ImageCollection, MultiImage: Efficiently handle batches of images or multi-frame TIFFs.
import os
import matplotlib.pyplot as plt
from skimage import io
# base_path = os.path.dirname(__file__)
base_path = os.getcwd()
path = os.path.join(base_path, 'bridge.jpg')
img = io.imread(path)
plt.imshow(img)
io.imsave(os.path.join(base_path, 'out.png'), img)
from skimage.color import rgb2gray, gray2rgb
gray = rgb2gray(img)
rgb = gray2rgb(gray)
plt.imshow(gray, cmap='gray')
Filtering (skimage.filters)
from skimage.filters import gaussian, sobel, threshold_otsu
edges = sobel(gray)
blur = gaussian(img, sigma=5)
th = threshold_otsu(gray)
binary = gray > th
plt.imshow(blur)
plt.show()
plt.imshow(edges)
plt.show()
plt.imshow(binary, cmap='gray')
plt.show()
Morphology (skimage.morphology)
plt.imshow(opened, cmap='gray')
plt.show()
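The variable opened above is the result of a morphological opening. As a minimal sketch, assuming a made-up binary image rather than the thresholded photo, the example below shows how opening removes an isolated pixel while largely preserving a bigger object.

```python
import numpy as np
from skimage.morphology import disk, opening

# Made-up binary image: one 10x10 square plus a single isolated "noise" pixel.
binary = np.zeros((30, 30), dtype=np.uint8)
binary[10:20, 10:20] = 1
binary[3, 3] = 1

# Opening = erosion followed by dilation with the same footprint.
# Structures smaller than the footprint vanish; larger ones roughly keep their shape.
opened = opening(binary, disk(1))

print("foreground pixels before:", binary.sum(), "after:", opened.sum())
```

Larger footprints remove larger specks, at the cost of rounding off more of the corners of the objects that remain.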
Geometric Transforms (skimage.transform)
plt.imshow(small)
plt.show()
plt.imshow(rotated)
plt.show()
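The names small and rotated suggest a resized and a rotated copy of the image. A hedged sketch of how such arrays could be produced with skimage.transform is given below; the synthetic ramp image, the target size, and the angle are assumptions for illustration.

```python
import numpy as np
from skimage.transform import resize, rotate

# Made-up grayscale image: a horizontal intensity ramp with values in [0, 1].
img = np.tile(np.linspace(0.0, 1.0, 100), (80, 1))

# resize takes the target (rows, cols); anti_aliasing smooths before downsampling.
small = resize(img, (40, 50), anti_aliasing=True)

# rotate turns the image counter-clockwise by the given angle in degrees;
# resize=True enlarges the output so the rotated corners are not cropped.
rotated = rotate(img, 45, resize=True)

print("original:", img.shape, "small:", small.shape, "rotated:", rotated.shape)
```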