Lecture02. ML Pipeline (Chapter 2)
Fall 2024
Classification or Regression?
Which algorithm (linear, polynomial, neural network, kernel, tree, k-NN)?
Example: California Housing Prices (1990)
Given California housing prices (by district),
train a model to predict a district's median housing price.
Load the dataset
Demo Code:
Create an isolated environment: DA515
Install scikit-learn (code: pip install scikit-learn)
For us, the data is saved on disk in the same folder as your code.
import pandas as pd
housing = pd.read_csv("CA_housing.csv")
Explore the dataset
Check the number of rows and columns
Check the data types (int, float, categorical, …)
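A minimal sketch of these checks with pandas (output not shown here):
housing.shape     # (number of rows, number of columns)
housing.info()    # column dtypes and non-null counts
housing.head()    # first five rows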
Missing values
In this example, we do not have a lot of missing values:
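A minimal sketch for counting missing values per column (uses the housing DataFrame loaded above):
housing.isnull().sum()    # number of missing values in each column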
Categorical data
# check the categorical data
ocean_proximity 20640 non-null object
housing["ocean_proximity"].value_counts()
<1H OCEAN 9034
INLAND 6496
NEAR OCEAN 2628
NEAR BAY 2270
ISLAND 5
# install matplotlib
! pip install matplotlib
Plot a histogram of every numeric attribute with .hist()
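A minimal sketch (the bin count and figure size are arbitrary choices):
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(12, 8))   # one histogram per numeric column
plt.show()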
Observations
The distributions vary widely across attributes; outliers need to be removed (or otherwise handled).
For Machine Learning
Data pre-processing:
Missing values
Categorical data
Feature selection/engineering
Data scaling
Data sampling:
using the stratify parameter (see the sketch below)
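A minimal sketch of stratified sampling with train_test_split; the income_cat column and its bins are illustrative assumptions (they are not defined elsewhere in these slides):
import pandas as pd
from sklearn.model_selection import train_test_split

# bucket median_income into categories so the split preserves their proportions
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
                               labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(housing, test_size=0.2,
                                       stratify=housing["income_cat"],
                                       random_state=100)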
For missing values 1/2
You can:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
# option 3: fill the missing values with the median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
Use SimpleImputer to fill in missing values 2/2
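A minimal sketch with Scikit-Learn's SimpleImputer; the median strategy matches the manual fill above, and dropping the text column first is an assumption (the imputer needs numeric input):
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)     # numeric attributes only
filled = imputer.fit_transform(housing_num)                # fills each column's NaNs with its median
housing_num = pd.DataFrame(filled, columns=housing_num.columns,
                           index=housing_num.index)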
Text and Categorical Attributes
Check the first 10 samples of the categorical attribute "ocean_proximity":
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
Use value_counts() (or unique() for the distinct values)
# count all distinct values.
housing_cat.ocean_proximity.value_counts()
Convert: category text => ordinal numbers
Scikit-Learn provides OrdinalEncoder to represent the 5 categories
with the number list [0 1 2 3 4]:
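A minimal sketch with OrdinalEncoder (variable names follow the earlier slides):
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)   # the 5 category labels, in encoded order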
Correct Encoding: One-Hot encoding
1-hot example:
Problem: TOO MANY VARIABLES
Use the one-hot encoder:
# Use pd.get_dummies()
housing_cat_1hot = pd.get_dummies(housing_cat).astype(int)
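As an alternative to pd.get_dummies, a minimal sketch with Scikit-Learn's OneHotEncoder (the sparse_output argument exists in scikit-learn 1.2+; older versions use sparse instead):
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(sparse_output=False)   # return a dense array instead of a sparse matrix
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(cat_encoder.categories_)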
Feature selection:
keep only the important, relevant features.
There are several different methods:
recursive feature elimination, importance ranking, PCA, etc.
Feature engineering: create combined attributes, for example:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Matrix of Correlations
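A minimal sketch for computing the correlation matrix (numeric_only requires a recent pandas; older versions ignore non-numeric columns by default):
# pairwise correlations between the numeric attributes
corr_matrix = housing.corr(numeric_only=True)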
Figure: standard correlation coefficient of various datasets (source: Wikipedia; public domain image)
Feature Selection
Correlations with median_house_value
# Looking for correlations:
# for ML, keep only the important features
(Assumes a linear relation)
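A minimal sketch: rank the features by their linear correlation with the target (reuses corr_matrix from the sketch above):
corr_matrix["median_house_value"].sort_values(ascending=False)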
Separate X and y: features and target
Separate the independent variables (features X) from the dependent variable (target y):
# training X
X = housing.drop("median_house_value", axis=1)
# label Y
y = housing["median_house_value"]
Splitting and Scaling
Which one should be done first?
I prefer to split first and scale afterward, fitting the scaler on the training data only so that no test-set information leaks into training.
Feature Scaling
Now all data are numerical.
Scaling is very important:
for distance computation
for optimization
Two ways:
Normalization: (x - x_min) / (x_max - x_min) => [0, 1]
Standardization: (x - mu) / sigma => N(0, 1)
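A minimal sketch of both scalers with Scikit-Learn; it assumes the X_train/X_test split created later in these slides and that all features are numeric (categorical attributes already encoded):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()                        # or MinMaxScaler() for [0, 1] normalization
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics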
Why do data scaling:
Lesson of the widow's mite
This poor widow put in more than all the other contributors
to the treasury. For they have all contributed from their
surplus wealth, but she, from her poverty, has contributed
all she had, her whole livelihood (Wikipedia)
Feature Scaling illustration: y = b + w1*x1 + w2*x2
[Figure: two contour plots of the loss L over (w1, w2). Left: unscaled inputs (x1 in 1, 2, …; x2 in 100, 200, …). Right: both features scaled to a similar range (1, 2, …), which makes the loss contours easier to optimize.]
Data Splitting: 80 vs 20
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split():
# random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=100)
Now visualize the training data
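A minimal sketch (an assumed example, not shown on the slide): a geographic scatter plot of the training districts, colored by the target value:
import matplotlib.pyplot as plt

plt.scatter(X_train["longitude"], X_train["latitude"],
            c=y_train, cmap="jet", alpha=0.4, s=10)
plt.colorbar(label="median_house_value")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()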
ML models
Regression: for continuous data (Y)
Linear or polynomial
Tree
Random Forest
K-NN (HOMEWORK)
SVM
ANN
Kernel
….
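A minimal sketch (an assumed example) of fitting two of these models on the training split; it assumes all features in X_train are numeric (categorical attributes already encoded):
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

forest_reg = RandomForestRegressor(random_state=100)
forest_reg.fit(X_train, y_train)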
Regression Evaluation Metrics (1/3)
1. Mean Square Error (MSE) or Root Mean Square Error (RMSE):
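For reference, the standard definitions (n samples, y_i the observed values, ŷ_i the predictions):
MSE = (1/n) * Σ_{i=1..n} (y_i - ŷ_i)^2
RMSE = sqrt(MSE)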
Regression Evaluation Metrics (2/3)
2. Mean Absolute Error (MAE):
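For reference, the standard definition (same notation as above):
MAE = (1/n) * Σ_{i=1..n} |y_i - ŷ_i|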
Regression Evaluation Metrics (3/3)
3. R Squared / Adjusted R Squared:
Simple linear regression (r² instead of R²)
R² quantifies the degree of linear correlation between Y_obs and Y_pred, i.e., it assesses the goodness of fit.
https://en.wikipedia.org/wiki/Coefficient_of_determination
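A minimal sketch (an assumed example): compute all three metrics on the test split with scikit-learn, using the lin_reg model from the earlier sketch:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  R^2={r2:.3f}")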
Limitation of using R squared
R-squared Is Not Valid for Nonlinear Regression
https://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/
Grid Search: search for the best hyperparameter values over a user-defined grid
Example (Random Forest): 12 combinations + 6 combinations in total, as in the sketch below:
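A minimal sketch with GridSearchCV; this particular parameter grid is one possible choice that yields the 12 + 6 combination counts above:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},                 # 3 x 4 = 12
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},  # 2 x 3 = 6
]
forest_reg = RandomForestRegressor(random_state=100)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)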
Short Summary
Finally
Your System
Homework: K-NN for Appraisal
FYI: Real data sources
UCI Data Repository
http://archive.ics.uci.edu/ml/index.php
Kaggle
https://www.kaggle.com/datasets
Google datasets
https://cloud.google.com/public-datasets/
Government (Agriculture/Commerce/Education/FDA, …)
https://catalog.data.gov/dataset
END
• Read book Chapter 2
• Practice the code
• Do your homework 1