
Programming for Engineers

Lecture 9 - Scientific Python: Loading and Processing Data

Radoslav Škoviera

Table of contents

Overview

Modules
    NumPy: Numerical Data Handling
        Loading Data
    Pandas: Tabular Data Processing
        Loading Data
        Creating tables
        Data Exploration
        Accessing Data
        Filtering and selection
        Grouping and aggregation
        Saving Data
    SciPy: Scientific Computations
        optimize: Function Minimization & Curve Fitting
        integrate: Numerical Integration & ODE Solvers
        interpolate: Data Interpolation
        signal: Digital Signal Processing
        fft: Fourier Transforms
        stats: Statistical Functions & Tests
        sparse: Sparse Matrix Tools
        spatial: KD-Tree for Nearest Neighbors
        Further Reading
    scikit-learn: Machine Learning Preprocessing
        Preprocessing (sklearn.preprocessing)
        Data Splitting and Model Selection (sklearn.model_selection)
        Estimators
        Evaluation Metrics (sklearn.metrics)
    scikit-image: Image Processing
        I/O Plugins (skimage.io)
        Color Space (skimage.color)
        Filtering (skimage.filters)
        Morphology (skimage.morphology)
        Geometric Transforms (skimage.transform)

Overview

1. NumPy: Numerical Data Handling

NumPy (Numerical Python) is the foundational package for numerical computing in Python.
It provides support for large, multi-dimensional arrays and matrices, along with a collection
of mathematical functions to operate on these arrays efficiently. NumPy’s array-oriented
computing is essential for scientific computing tasks, enabling high-performance operations
on large datasets.
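
As a quick taste of this array-oriented style, here is a minimal sketch (an illustrative example, not from the lecture):

import numpy as np

# Vectorized arithmetic: the operation is applied to every element at once
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a * 2 + 1)       # element-wise multiply and add, no Python loop
print(a.mean(axis=0))  # column means: [2. 3.]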

2. SciPy: Scientific Computations

!pip install scipy

SciPy is a library used for scientific and technical computing. It builds on NumPy by adding
a collection of algorithms and high-level commands for data manipulation and analysis. SciPy
includes modules for optimization, integration, interpolation, eigenvalue problems, algebraic
and differential equations, and others, making it a powerful tool for scientific applications.

3. Pandas: Tabular Data Processing

!pip install pandas

Pandas is a library for data manipulation and analysis. It offers data structures and operations
for manipulating numerical tables and time series. Pandas introduces two new data structures
to Python: Series and DataFrame, which are built on top of NumPy arrays. These structures
allow for fast and efficient data manipulation.
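
A minimal sketch of the two structures (illustrative values, not from the lecture):

import pandas as pd

# A Series is a labelled 1-D array; a DataFrame is a table of aligned columns
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'price': s, 'qty': [1, 2, 3]})
print(s['b'])       # 20
print(df.loc['a'])  # row 'a': price 10, qty 1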
To use Pandas with Excel files, you need to install the openpyxl package.

!pip install openpyxl

4. scikit-learn: Machine Learning Preprocessing

!pip install scikit-learn

scikit-learn is a machine learning library for the Python programming language. It features
various classification, regression, and clustering algorithms, including support-vector machines,
random forests, gradient boosting, k-means, and DBSCAN. Designed to interoperate with the
Python numerical and scientific libraries NumPy and SciPy, scikit-learn is widely used for its
simplicity and efficiency in implementing machine learning models.

5. scikit-image: Image Processing

!pip install scikit-image

scikit-image is a collection of algorithms for image processing. It is designed to interoperate
with NumPy and SciPy, providing a versatile toolkit for image analysis. scikit-image includes
algorithms for segmentation, geometric transformations, color space manipulation, analysis,
filtering, morphology, feature detection, and more. It is widely used in academic research and
industry for processing and analyzing images.

Modules

NumPy: Numerical Data Handling

Loading Data

We have covered NPY and NPZ loading in previous lectures. It is also possible to load
structured text (CSV) files using NumPy.

import numpy as np

# Load data from a text file
data = np.loadtxt('data.csv', delimiter=',')

# Save array to a file
np.savetxt('output.csv', data, delimiter=',')

Pandas: Tabular Data Processing

Even though NumPy can load CSV data, it is best to use the library dedicated to loading and
processing tabular data: Pandas.

import pandas as pd

Loading Data

df_csv = pd.read_csv('sales.csv', parse_dates=['order_date'])
df_excel = pd.read_excel('sales.xlsx')

Creating tables

# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78]
}

df = pd.DataFrame(data)
print(df)

      Name  Age  Score
0    Alice   25     85
1      Bob   30     92
2  Charlie   35     78

Data Exploration

print("Display first few rows")
print(df_csv.head())
print("--" * 20)

print("Display last few rows")
print(df_csv.tail())
print("--" * 20)

print("DataFrame info")
df_csv.info()
print("--" * 20)

print("Data types")
print(df_csv.dtypes)
print("--" * 20)

print("Summary statistics")
print(df_csv.describe())
print("--" * 20)

print("Unique values")
print(df_csv.nunique())
print("--" * 20)

print("Single column stats")
print("Mean sales:", df_csv['sales'].mean())
print("--" * 20)
Display first few rows
    order_id order_date     category region  sales  quantity  returned
0  ORD100083 2023-03-25        Books  South  19.06        14     False
1  ORD100366 2024-01-02     Clothing   East  30.88        11     False
2  ORD100564 2024-07-18        Books  South  10.31        11     False
3  ORD100490 2024-05-05        Books   East  20.66        11     False
4  ORD100507 2024-05-22  Electronics  South  19.22        13     False
----------------------------------------
Display last few rows
      order_id order_date  category region  sales  quantity  returned
995  ORD100387 2024-01-23     Books  North  11.48         5     False
996  ORD100324 2023-11-21     Books   East  28.33         3     False
997  ORD100861 2025-05-11     Books  South  37.67         6     False
998  ORD100708 2024-12-09  Clothing  North  13.89         6     False
999  ORD100484 2024-04-29  Clothing  North  16.61        18     False
----------------------------------------
DataFrame info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   order_id    1000 non-null   object
 1   order_date  1000 non-null   datetime64[ns]
 2   category    1000 non-null   object
 3   region      1000 non-null   object
 4   sales       1000 non-null   float64
 5   quantity    1000 non-null   int64
 6   returned    1000 non-null   bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 48.0+ KB
----------------------------------------
Data types
order_id              object
order_date    datetime64[ns]
category              object
region                object
sales                float64
quantity               int64
returned                bool
dtype: object
----------------------------------------
Summary statistics
                order_date        sales     quantity
count                 1000  1000.000000  1000.000000
mean   2024-05-14 12:00:00    21.866070     9.956000
min    2023-01-01 00:00:00     4.340000     1.000000
25%    2023-09-07 18:00:00    13.725000     5.000000
50%    2024-05-14 12:00:00    19.770000    10.000000
75%    2025-01-19 06:00:00    26.900000    15.000000
max    2025-09-26 00:00:00    86.240000    19.000000
std                    NaN    11.861339     5.346649
----------------------------------------
Unique values
order_id      1000
order_date    1000
category         4
region           4
sales          870
quantity        19
returned         2
dtype: int64
----------------------------------------
Single column stats
Mean sales: 21.86607
----------------------------------------

Accessing Data

print("Column:", df['Age'])  # Column
print("Multiple columns:", df[['Name', 'Score']])  # Multiple columns
print("Row by index:", df.iloc[1])  # Row by index
print("Row by label:", df.loc[1])  # Row by label
print("Rows by index:", df.iloc[1:3])  # Rows by index
print("Rows by label:", df.loc[1:3])  # Rows by label
print("Rows by condition:", df[df['Age'] > 30])  # Rows by condition
print("Column by index:", df.iloc[:, 1])  # Column by index
print("Column by label:", df['Age'])  # Column by label

Column: 0 25
1 30
2 35
Name: Age, dtype: int64
Multiple columns: Name Score
0 Alice 85
1 Bob 92
2 Charlie 78
Row by index: Name Bob
Age 30
Score 92
Name: 1, dtype: object
Row by label: Name Bob
Age 30
Score 92
Name: 1, dtype: object
Rows by index: Name Age Score
1 Bob 30 92
2 Charlie 35 78
Rows by label: Name Age Score
1 Bob 30 92
2 Charlie 35 78
Rows by condition: Name Age Score
2 Charlie 35 78
Column by index: 0 25
1 30
2 35
Name: Age, dtype: int64
Column by label: 0 25
1 30
2 35
Name: Age, dtype: int64

Filtering and selection

# Filter by condition
print("Filter by condition:", df[df['Age'] >= 30])

print("Filter by multiple conditions:", df[(df['Age'] >= 30) & (df['Score'] < 90)])

Filter by condition:       Name  Age  Score
1      Bob   30     92
2  Charlie   35     78
Filter by multiple conditions:       Name  Age  Score
2  Charlie   35     78

Grouping and aggregation

df_csv.groupby('region').agg({'sales': 'sum'})
df_csv.groupby('category').agg({'sales': ['mean', 'count']})

                 sales
                  mean count
category
Books        21.376166   253
Clothing     22.035588   238
Electronics  22.467233   253
Home         21.598516   256

Saving Data

# Save to CSV
df.to_csv('cleaned_data.csv', index=False)

# Save to Excel
df.to_excel('cleaned_data.xlsx', index=False)

SciPy: Scientific Computations

optimize: Function Minimization & Curve Fitting

The scipy.optimize package offers algorithms for function minimization (scalar or
multi-dimensional), root-finding, and curve-fitting.

Key Functions & Classes

• scipy.optimize.minimize: General-purpose minimization of scalar functions of one or more variables.
• scipy.optimize.curve_fit: Non-linear least squares fitting of a function to data.
• scipy.optimize.root: Find roots of a function (a short sketch follows this list).
• scipy.optimize.least_squares: Solve nonlinear least-squares problems with bounds.
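
scipy.optimize.root is not exercised in the lecture; a minimal sketch (an added example) using the same function that is minimized below:

import numpy as np
from scipy.optimize import root

def f(x):
    return x**2 + 10*np.sin(x)

# Find a zero crossing of f; x0 = -2.0 is an illustrative starting guess
sol = root(f, x0=-2.0)
print("Root at x =", sol.x)  # f has zeros at x = 0 and near x ≈ -2.48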

Example: Minimizing a Non‑Convex Function

import numpy as np
from scipy.optimize import minimize
from matplotlib import pyplot as plt

def f(x):
    return x**2 + 10*np.sin(x)

res = minimize(f, x0=0.0, method='BFGS')
print("Minimum at x =", res.x, "with value f(x) =", res.fun)

x_vals = np.linspace(res.x - 5, res.x + 5, 100)
y_vals = f(x_vals)

plt.plot(x_vals, y_vals)
plt.scatter(res.x, f(res.x), color='red')
plt.annotate(f"Minimum at x={res.x}", (res.x, f(res.x)))
plt.show()

Minimum at x = [-1.30644012] with value f(x) = -7.945823375615215

[Figure: f(x) plotted around the found minimum, annotated "Minimum at x=[-1.30644012]"]

Example: Curve Fitting

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Generate synthetic data
x = np.linspace(0, 10, 50)
y = 3.5 * np.sin(1.3 * x) + np.random.normal(0, 0.5, x.size)

# Define model
def model(x, a, b):
    return a * np.sin(b * x)

# Fit parameters
params, cov = curve_fit(model, x, y)
print("Fitted params:", params)

# Plot
plt.scatter(x, y, label='Data')
plt.plot(x, model(x, *params), 'r-', label='Fit')
plt.legend()
plt.show()

Fitted params: [3.55125647 1.29908483]

[Figure: noisy data points with the fitted curve a*sin(b*x) overlaid]

integrate: Numerical Integration & ODE Solvers

scipy.integrate provides functions to compute definite integrals, solve ordinary differential
equations (ODEs), and perform multi-dimensional integration.

Key Functions

• scipy.integrate.quad: Adaptive quadrature for single integrals.
• scipy.integrate.solve_ivp: Modern ODE solver interface (see the sketch after the integral example).

Example: Definite Integral

from scipy.integrate import quad

f = lambda t: np.exp(-t**2)
result, error = quad(f, 0, np.inf)
print("Integral of exp(-t^2) from 0 to ∞ =", result)

Integral of exp(-t^2) from 0 to ∞ = 0.8862269254527579
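
solve_ivp is listed above but not demonstrated; a minimal sketch (an added example) solving the exponential-decay ODE y' = -2y:

from scipy.integrate import solve_ivp

# Integrate y'(t) = -2*y(t) with y(0) = 1 over t in [0, 5]
sol = solve_ivp(lambda t, y: -2 * y, t_span=(0, 5), y0=[1.0])
print("y(5) ≈", sol.y[0, -1])  # analytic answer is exp(-10)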

interpolate: Data Interpolation

scipy.interpolate offers classes and functions for one‑ and multi‑dimensional interpolation
and smoothing splines.

Key Classes & Functions

• scipy.interpolate.interp1d: 1-D linear and spline interpolation.
• scipy.interpolate.griddata: Interpolation over irregular 2-D data.
• scipy.interpolate.BarycentricInterpolator, UnivariateSpline, RectBivariateSpline.

Example: 1‑D Interpolation

from scipy.interpolate import interp1d

# Original coarse data
x_raw = np.linspace(0, 10, 10)
y_raw = np.sin(x_raw)

# Create interpolators
f_linear = interp1d(x_raw, y_raw, kind='linear')
f_cubic = interp1d(x_raw, y_raw, kind='cubic')

# Evaluate at finer grid
x_fine = np.linspace(0, 10, 100)
plt.plot(x_raw, y_raw, 'o', label='Raw data points')
plt.plot(x_fine, f_linear(x_fine), '-', label='Linear interpolation')
plt.plot(x_fine, f_cubic(x_fine), '--', label='Cubic interpolation')
plt.legend()
plt.show()

[Figure: raw sample points with linear and cubic interpolation curves on a finer grid]

signal: Digital Signal Processing

scipy.signal provides signal processing tools: filtering, spectral analysis, window functions,
and convolution.

Key Functions

• scipy.signal.butter, scipy.signal.sosfilt / filtfilt: Digital filter design and


application.
• scipy.signal.welch: Power spectral density estimation.
• scipy.signal.convolve, correlate, decimate, resample.

Example: Butterworth Low‑Pass Filter

from scipy.signal import butter, sosfiltfilt, convolve

# Design filter
sos = butter(4, 0.2, btype='low', output='sos')

# Create noisy signal
t = np.linspace(0, 1, 500)
sig = np.sin(2*np.pi*5*t) + 0.5*np.random.randn(500)

# Apply zero-phase filter
filtered = sosfiltfilt(sos, sig)

# Convolution with Gaussian kernel
kx = np.arange(-4, 5)
kernel = np.exp(-kx**2 / 2)
kernel /= kernel.sum()
print(kernel)
smoothed = convolve(sig, kernel, mode='same')

plt.plot(t, sig, alpha=0.5, label='Noisy')
plt.plot(t, filtered, 'r-', label='Filtered')
plt.plot(t, smoothed, 'g--', label='Smoothed')
plt.legend()
plt.show()

[1.33830625e-04 4.43186162e-03 5.39911274e-02 2.41971446e-01
 3.98943469e-01 2.41971446e-01 5.39911274e-02 4.43186162e-03
 1.33830625e-04]

[Figure: noisy signal with the zero-phase-filtered and Gaussian-smoothed versions overlaid]
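
welch is listed above but not demonstrated; a minimal sketch (an added example) estimating the power spectral density of the noisy signal from this example (fs=500 matches its 500 samples over one second):

from scipy.signal import welch

f_psd, Pxx = welch(sig, fs=500, nperseg=256)
plt.semilogy(f_psd, Pxx)  # the 5 Hz tone should stand out above the noise floor
plt.xlabel("Frequency (Hz)")
plt.ylabel("Power spectral density")
plt.show()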

fft: Fourier Transforms

scipy.fft provides Fast Fourier Transform routines for one- and multi-dimensional arrays.

Key Functions

• scipy.fft.fft, ifft: Forward and inverse 1‑D FFT.


• scipy.fft.fft2, ifft2: 2‑D transforms.
• rfft, irfft: Real-input optimized transforms.

Example: 1‑D FFT Spectral Analysis

from scipy.fft import fft, fftfreq

# Signal
t = np.linspace(0, 1, 400)
x = np.sin(2*np.pi*50*t) + 0.5*np.sin(2*np.pi*120*t)
# Compute FFT
X = fft(x)
freqs = fftfreq(t.size, d=t[1]-t[0])
fig, [ax1, ax2] = plt.subplots(2, 1)
ax1.plot(t, x)
ax1.set_title("Time Domain")
ax2.plot(freqs[:200], np.abs(X)[:200])
plt.title("Magnitude Spectrum")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.tight_layout()
plt.show()

[Figure: time-domain signal (top) and magnitude spectrum with peaks at 50 Hz and 120 Hz (bottom)]
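
For the real-valued signal above, the rfft variant computes only the non-negative half of the spectrum; a minimal sketch (an addition, reusing x and t from this example):

from scipy.fft import rfft, rfftfreq

Xr = rfft(x)                             # roughly half the length of fft(x)
freqs_r = rfftfreq(t.size, d=t[1]-t[0])  # non-negative frequencies only
print(Xr.shape, freqs_r.shape)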

stats: Statistical Functions & Tests

scipy.stats provides a vast collection of statistical distributions, descriptive statistics, and
hypothesis tests.

Key Functions & Classes

• scipy.stats.norm, gamma, …: Continuous distributions.


• scipy.stats.ttest_ind, ttest_rel: T‑tests for independent and related samples.
• scipy.stats.pearsonr, spearmanr: Correlation coefficients.
• scipy.stats.kstest, chisquare: Goodness‑of‑fit tests.

Example: Two‑Sample T‑Test

from scipy.stats import ttest_ind

# Generate two samples
a = np.random.randn(100) + 0.5
b = np.random.randn(100)

stat, p = ttest_ind(a, b)
print("t-statistic =", stat, "p-value =", p)

t-statistic = 3.2832265469906057 p-value = 0.0012132727069695383
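
The distribution objects are listed above but not exercised; a minimal sketch (an added example) using the standard normal distribution:

from scipy.stats import norm

print(norm.pdf(0.0))        # density at 0, ≈ 0.3989
print(norm.cdf(1.96))       # P(X <= 1.96), ≈ 0.975
samples = norm.rvs(size=5)  # five random draws from N(0, 1)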

sparse: Sparse Matrix Tools

scipy.sparse supports sparse matrix representations for memory‑efficient storage and fast
arithmetic on large, sparse arrays.

Key Classes

• lil_matrix, csr_matrix, csc_matrix, coo_matrix.
• Methods: .dot(), .tocsc(), .toarray().
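
This module has no example in the lecture; a minimal sketch (an added example) of building a CSR matrix from coordinate data and multiplying it with a vector:

import numpy as np
from scipy.sparse import coo_matrix

# Two non-zero entries in a 1000 x 1000 matrix
rows = np.array([0, 500])
cols = np.array([1, 2])
vals = np.array([3.0, 7.0])
sp = coo_matrix((vals, (rows, cols)), shape=(1000, 1000)).tocsr()

print(sp.nnz)               # 2 stored entries out of a million cells
v = np.ones(1000)
print(sp.dot(v)[[0, 500]])  # sparse matrix-vector product: [3. 7.]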

spatial: KD‑Tree for Nearest Neighbors

scipy.spatial.KDTree provides efficient nearest‑neighbor searches in k‑dimensional space.

Key Methods

• query, query_ball_point, query_pairs.

Example: Nearest‑Neighbor Query

from scipy.spatial import KDTree

points = np.random.rand(100, 2)
tree = KDTree(points)
dist, idx = tree.query([0.5, 0.5], k=5)
print("Nearest indices:", idx)

# Plot points and nearest neighbors
plt.scatter(points[:, 0], points[:, 1])
plt.scatter([0.5], [0.5], c='r')
plt.scatter(points[idx, 0], points[idx, 1], c='g')
plt.show()

Nearest indices: [65 98 46 48 92]

[Figure: random 2-D points, with the query point in red and its five nearest neighbors in green]

Further Reading

• Official SciPy Reference: https://docs.scipy.org/doc/scipy/reference/
• "SciPy Lecture Notes" for deeper dives: https://scipy-lectures.org/

scikit-learn: Machine Learning Preprocessing

scikit-learn provides a unified interface to many machine-learning algorithms and data tools,
from preprocessing to model selection and evaluation.

Preprocessing sklearn.preprocessing

Scaling & Normalization

• StandardScaler: Centers features to zero mean and unit variance. Critical for algorithms assuming Gaussian-distributed features (e.g. SVM, linear models).
• MinMaxScaler: Scales features to a fixed range [0, 1], preserving the shape of the original distribution but sensitive to outliers.
• RobustScaler: Uses the median and IQR (inter-quartile range) for centering/scaling, robust to outliers.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

N = 200
input_data = np.random.randn(N, 1)**2 * 100 - 50
sc1 = StandardScaler().fit(input_data)
data_std = sc1.transform(input_data)
sc2 = MinMaxScaler().fit(input_data)
data_mm = sc2.transform(input_data)
sc3 = RobustScaler().fit(input_data)
data_rb = sc3.transform(input_data)

# Plot
x_vals = np.arange(N)
fig, ax = plt.subplots(4, 1, figsize=(16, 8))
ax[0].scatter(x_vals, input_data)
ax[1].scatter(x_vals, data_std)
ax[2].scatter(x_vals, data_mm)
ax[3].scatter(x_vals, data_rb)
ax[0].set_title('Original Data')
ax[1].set_title('Standard Scaler')
ax[2].set_title('MinMax Scaler')
ax[3].set_title('Robust Scaler')
plt.tight_layout()
plt.show()

[Figure: four panels comparing the original data with the StandardScaler, MinMaxScaler, and RobustScaler outputs]

Data Splitting and Model Selection sklearn.model_selection

• train_test_split: Quick split for train/test sets.
• KFold, StratifiedKFold: Cross-validation iterators; StratifiedKFold preserves the class distribution in each fold.

from sklearn.model_selection import train_test_split, KFold

X = np.arange(100).reshape(10, 10)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

kf = KFold(n_splits=5, shuffle=True)

for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [0 1 2 4 5 7 8 9] TEST: [3 6]
TRAIN: [1 2 3 4 5 6 7 9] TEST: [0 8]
TRAIN: [0 1 3 4 5 6 7 8] TEST: [2 9]
TRAIN: [0 2 3 5 6 7 8 9] TEST: [1 4]
TRAIN: [0 1 2 3 4 6 8 9] TEST: [5 7]

Estimators

Linear Models

• LinearRegression: Ordinary Least Squares regression.
• LogisticRegression: Regularized logistic classifier; supports l1/l2 penalties.

from sklearn.linear_model import LinearRegression, LogisticRegression

X_train = np.arange(100).reshape(10, 10)
y_tr_reg = np.arange(10)
y_tr_clf = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

lr = LinearRegression().fit(X_train, y_tr_reg)
logr = LogisticRegression().fit(X_train, y_tr_clf)

print("Linear Regression intercept:", lr.intercept_)
print("Linear predictions:")
print([f"{v:.2f}" for v in lr.predict(X_train)])
print("Actual values:")
print(y_tr_reg)

print("Logistic Regression intercept:", logr.intercept_)
print("Logistic predictions:")
print(logr.predict(X_train))
print("Actual values:")
print(y_tr_clf)

Linear Regression intercept: -0.4499999999999993
Linear predictions:
['0.00', '1.00', '2.00', '3.00', '4.00', '5.00', '6.00', '7.00', '8.00', '9.00']
Actual values:
[0 1 2 3 4 5 6 7 8 9]
Logistic Regression intercept: [-36.85911953]
Logistic predictions:
[0 0 0 0 1 1 1 1 1 1]
Actual values:
[0 0 0 0 1 1 1 1 1 1]

Clustering

• KMeans: Centroid-based clustering.
• DBSCAN: Density-based clustering; identifies arbitrarily shaped clusters. (The example below uses HDBSCAN, a hierarchical variant of DBSCAN.)

from sklearn.cluster import KMeans, HDBSCAN

X = np.vstack([
    np.random.rand(100, 2) * 1000 + 1500,
    np.random.randn(200, 2) * 1000 - 400
])
km = KMeans(n_clusters=3).fit(X)
db = HDBSCAN().fit(X)

# Plot
fig, ax = plt.subplots(2, 1, figsize=(8, 16))
ax[0].scatter(X[:, 0], X[:, 1], c=km.labels_)
ax[1].scatter(X[:, 0], X[:, 1], c=db.labels_)
ax[0].set_title('KMeans')
ax[1].set_title('HDBSCAN')
plt.show()

[Figure: the same point set clustered by KMeans (top) and HDBSCAN (bottom)]

Evaluation Metrics sklearn.metrics

• Classification: accuracy_score, confusion_matrix, roc_auc_score (a confusion_matrix sketch follows the example below).
• Regression: mean_squared_error, r2_score.
• Clustering: silhouette_score, adjusted_rand_score.

from sklearn.metrics import accuracy_score, mean_squared_error

X_train = np.arange(100).reshape(10, 10)
X_test = np.arange(101, 201).reshape(10, 10)
y_train = np.arange(10)
y_test = np.arange(10, 20).astype(float)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print("Accuracy:", accuracy_score(y_test.astype(int), y_pred.astype(int)))
print("MSE:", mean_squared_error(y_test, y_pred))

Accuracy: 1.0
MSE: 0.009999999999999468
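
confusion_matrix is listed above but not shown; a minimal sketch (an added example) on hand-made binary labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_hat))
# rows = true class, columns = predicted class:
# [[1 1]
#  [1 2]]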

scikit-image: Image Processing

scikit-image is a library of image processing algorithms built on NumPy and SciPy.

I/O Plugins skimage.io

• imread/imsave: Read/write images in multiple formats via plugins (PIL, imageio, tifffile, GDAL).
• ImageCollection, MultiImage: Efficiently handle batches of images or multi-frame TIFFs.

from skimage import io
import os
# base_path = os.path.dirname(__file__)
base_path = os.getcwd()
path = os.path.join(base_path, 'bridge.jpg')
img = io.imread(path)
plt.imshow(img)
io.imsave(os.path.join(base_path, 'out.png'), img)

[Figure: the loaded bridge.jpg image displayed with matplotlib]

# Load a collection of images
col = io.imread_collection(
    os.path.join(base_path, 'multi_frames', 'frames_*.png')
)
stack = io.concatenate_images(col)

Color Space (skimage.color)

• rgb2gray, gray2rgb, rgb2hsv: Convert between color spaces (an rgb2hsv sketch follows the example below).

from skimage.color import rgb2gray, gray2rgb
gray = rgb2gray(img)
rgb = gray2rgb(gray)
plt.imshow(gray)

[Figure: the grayscale-converted image]
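
rgb2hsv is listed above but not demonstrated; a minimal sketch (an addition) that reuses the img loaded earlier:

from skimage.color import rgb2hsv

hsv = rgb2hsv(img)  # channels are hue, saturation, value, each in [0, 1]
plt.imshow(hsv[:, :, 0], cmap='hsv')  # display the hue channel
plt.show()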

Filtering (skimage.filters)

• Edge Detectors: sobel, scharr, prewitt compute image gradients.
• Noise Reduction & Thresholding: gaussian, median smooth noise; threshold_otsu picks a global threshold for binarization.

from skimage.filters import sobel, gaussian, threshold_otsu
from matplotlib import pyplot as plt

edges = sobel(gray)
blur = gaussian(img, sigma=5)
th = threshold_otsu(gray)
binary = gray > th

plt.imshow(blur)
plt.show()
plt.imshow(edges)
plt.show()
plt.imshow(binary, cmap='gray')
plt.show()

[Figures: the Gaussian-blurred image and the Sobel edge map]

[Figure: the Otsu-thresholded binary image]

Morphology (skimage.morphology)

• Basic Ops: erosion, dilation, opening, closing.
• Advanced: skeletonize, remove_small_objects, area_closing.

from skimage.morphology import opening, remove_small_objects

mask = gray > th
clean = remove_small_objects(mask, min_size=512)
opened = opening(clean)

plt.imshow(opened, cmap='gray')
plt.show()

[Figure: the binary mask after small-object removal and morphological opening]

Geometric Transforms (skimage.transform)

• resize, rotate, warp for image warping.
• Hough Transforms: hough_line, probabilistic_hough_line for line detection (a short sketch follows the figures below).

from skimage.transform import resize, rotate

small = resize(img, (256, 256))
rotated = rotate(small, 45)

plt.imshow(small)
plt.show()
plt.imshow(rotated)
plt.show()

[Figures: the image resized to 256×256, and the resized image rotated by 45 degrees]
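
The Hough transforms are listed above but not demonstrated; a minimal sketch (an added example) of probabilistic line detection on an edge map, reusing the grayscale image from earlier (parameter values are illustrative):

from skimage.feature import canny
from skimage.transform import probabilistic_hough_line

edge_map = canny(gray)  # binary edge map as input for line detection
lines = probabilistic_hough_line(edge_map, threshold=10, line_length=50, line_gap=3)
for p0, p1 in lines[:5]:
    print("segment from", p0, "to", p1)  # each segment is a pair of (x, y) endpoints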
