Lecture02. ML Pipeline (Chapter 2)
Fall 2024
Classification or Regression?
Which algorithm (linear, polynomial, neural network, kernel, tree, k-NN)?
Example: California Housing Prices (1990)
Given California housing prices (by district),
train a model to predict a district's median housing price.
Load the dataset
Demo Code:
Create an isolated environment: DA515
Install scikit-learn (code: pip install scikit-learn)
For us, the data is saved on disk in the same folder as your code.
import pandas as pd
housing = pd.read_csv("CA_housing.csv")
Explore the dataset
Check the number of rows and columns
Check the data types (int, float, categorical, …)
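A minimal sketch of these checks with pandas (output not shown here):
housing.shape     # (number of rows, number of columns)
housing.info()    # column dtypes and non-null counts
housing.head()    # first five rows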
Missing values
In this example, we do not have a lot of missing values:
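A minimal sketch for counting missing values per column (uses the housing DataFrame loaded above):
housing.isnull().sum()    # number of missing values in each column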
Categorical data
# check the categorical data
ocean_proximity 20640 non-null object
housing["ocean_proximity"].value_counts()
<1H OCEAN 9034
INLAND 6496
NEAR OCEAN 2628
NEAR BAY 2270
ISLAND 5
# install matplotlib
! pip install matplotlib
Plot a histogram of every numeric attribute with .hist()
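A minimal sketch (the bin count and figure size are arbitrary choices):
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(12, 8))   # one histogram per numeric column
plt.show()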
Observations
The distributions vary widely across attributes; outliers need to be removed (or otherwise handled).
For Machine Learning
Data pre-processing:
Missing values
Categorical data
Feature selection/engineering
Data scaling
Data sampling:
using the stratify parameter (see the sketch below)
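A minimal sketch of stratified sampling with train_test_split; the income_cat column and its bins are illustrative assumptions (they are not defined elsewhere in these slides):
import pandas as pd
from sklearn.model_selection import train_test_split

# bucket median_income into categories so the split preserves their proportions
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
                               labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(housing, test_size=0.2,
                                       stratify=housing["income_cat"],
                                       random_state=100)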
For missing values 1/2
You can:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
# option 3: fill the missing values with the median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
Use SimpleImputer to fill in missing values 2/2
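A minimal sketch with Scikit-Learn's SimpleImputer; the median strategy matches the manual fill above, and dropping the text column first is an assumption (the imputer needs numeric input):
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)     # numeric attributes only
filled = imputer.fit_transform(housing_num)                # fills each column's NaNs with its median
housing_num = pd.DataFrame(filled, columns=housing_num.columns,
                           index=housing_num.index)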
Text and Categorical Attributes
Check the first 10 samples of the categorical attribute "ocean_proximity":
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
Use value_counts() (or unique() for the distinct values)
# count all distinct values.
housing_cat.ocean_proximity.value_counts()
Convert: category text => ordinal numbers
Scikit-Learn provides OrdinalEncoder to represent the 5 categories
with the number list [0 1 2 3 4]:
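A minimal sketch with OrdinalEncoder (variable names follow the earlier slides):
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)   # the 5 category labels, in encoded order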
Correct Encoding: One-Hot encoding
1-hot example:
Problem: TOO MANY VARIABLES
Use the one-hot encoder:
# Use pd.get_dummies()
housing_cat_1hot = pd.get_dummies(housing_cat).astype(int)
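As an alternative to pd.get_dummies, a minimal sketch with Scikit-Learn's OneHotEncoder (the sparse_output argument exists in scikit-learn 1.2+; older versions use sparse instead):
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(sparse_output=False)   # return a dense array instead of a sparse matrix
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(cat_encoder.categories_)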
Feature selection:
keep only the important, relevant features.
There are several different methods:
recursive feature elimination, importance ranking, PCA, etc.
Feature engineering: create combined attributes, for example:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Matrix of Correlations
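A minimal sketch for computing the correlation matrix (numeric_only requires a recent pandas; older versions ignore non-numeric columns by default):
# pairwise correlations between the numeric attributes
corr_matrix = housing.corr(numeric_only=True)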
Figure: standard correlation coefficient of various datasets (source: Wikipedia; public domain image)
Feature Selection
Correlations with median_house_value
# Looking for correlations:
# for ML, keep only the important features
(Assumes a linear relation)
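A minimal sketch: rank the features by their linear correlation with the target (reuses corr_matrix from the sketch above):
corr_matrix["median_house_value"].sort_values(ascending=False)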
Separate X and y: features and target
Separate the independent variables (features X) from the dependent variable (target y):
# training X
X = housing.drop("median_house_value", axis=1)
# label Y
y = housing["median_house_value"]
Splitting and Scaling
Which one should be done first?
I prefer to split first and scale afterward, fitting the scaler on the training data only so that no test-set information leaks into training.
Feature Scaling
Now all data are numerical.
Scaling is very important:
for distance computation
for optimization
Two ways:
Normalization: (x - x_min) / (x_max - x_min) => [0, 1]
Standardization: (x - mu) / sigma => N(0, 1)
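A minimal sketch of both scalers with Scikit-Learn; it assumes the X_train/X_test split created later in these slides and that all features are numeric (categorical attributes already encoded):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()                        # or MinMaxScaler() for [0, 1] normalization
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics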
Why do data scaling:
Lesson of the widow's mite
This poor widow put in more than all the other contributors
to the treasury. For they have all contributed from their
surplus wealth, but she, from her poverty, has contributed
all she had, her whole livelihood (Wikipedia)
Feature Scaling illustration: y = b + w1*x1 + w2*x2
[Figure: two contour plots of the loss L over (w1, w2). Left: unscaled inputs (x1 in 1, 2, …; x2 in 100, 200, …). Right: both features scaled to a similar range (1, 2, …), which makes the loss contours easier to optimize.]
Data Splitting: 80 vs 20
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split():
# random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=100)
Now visualize the training data
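A minimal sketch (an assumed example, not shown on the slide): a geographic scatter plot of the training districts, colored by the target value:
import matplotlib.pyplot as plt

plt.scatter(X_train["longitude"], X_train["latitude"],
            c=y_train, cmap="jet", alpha=0.4, s=10)
plt.colorbar(label="median_house_value")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()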
ML models
Regression: for continuous data (Y)
Linear or polynomial
Tree
Random Forest
K-NN (HOMEWORK)
SVM
ANN
Kernel
….
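A minimal sketch (an assumed example) of fitting two of these models on the training split; it assumes all features in X_train are numeric (categorical attributes already encoded):
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

forest_reg = RandomForestRegressor(random_state=100)
forest_reg.fit(X_train, y_train)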
Regression Evaluation Metrics (1/3)
1. Mean Square Error (MSE) or Root Mean Square Error (RMSE):
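For reference, the standard definitions (n samples, y_i the observed values, ŷ_i the predictions):
MSE = (1/n) * Σ_{i=1..n} (y_i - ŷ_i)^2
RMSE = sqrt(MSE)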
Regression Evaluation Metrics (2/3)
2. Mean Absolute Error (MAE):
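For reference, the standard definition (same notation as above):
MAE = (1/n) * Σ_{i=1..n} |y_i - ŷ_i|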
Regression Evaluation Metrics (3/3)
3. R Squared / Adjusted R Squared:
Simple linear regression (r² instead of R²)
R² quantifies the degree of linear correlation between Y_obs and Y_pred, i.e., it assesses the goodness of fit.
https://en.wikipedia.org/wiki/Coefficient_of_determination
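A minimal sketch (an assumed example): compute all three metrics on the test split with scikit-learn, using the lin_reg model from the earlier sketch:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  R^2={r2:.3f}")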
Limitation of using R squared
R-squared Is Not Valid for Nonlinear Regression
https://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/
Grid Search: search for the best hyperparameter values over a user-defined grid
Example (Random Forest): 12 combinations + 6 combinations in total, as in the sketch below:
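A minimal sketch with GridSearchCV; this particular parameter grid is one possible choice that yields the 12 + 6 combination counts above:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},                 # 3 x 4 = 12
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},  # 2 x 3 = 6
]
forest_reg = RandomForestRegressor(random_state=100)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)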
Short Summary
Finally
Your System
Homework: K-NN for Appraisal
FYI: Real data sources
UCI Data Repository
http://archive.ics.uci.edu/ml/index.php
Kaggle
https://www.kaggle.com/datasets
Google datasets
https://cloud.google.com/public-datasets/
Government (Agriculture/Commerce/Education/FDA, …)
https://catalog.data.gov/dataset
END
• Read book Chapter 2
• Practice the code
• Do your homework 1