Lecture Slides - ML - Part 2

The document outlines the end-to-end machine learning process, emphasizing the importance of problem definition, data acquisition, visualization, transformation, model selection, evaluation, and deployment. It details various techniques for data transformation, including feature engineering and handling missing values, as well as model evaluation metrics for regression and classification tasks. The document concludes by highlighting the necessity of monitoring and maintaining models post-launch to ensure their effectiveness.

Uploaded by

Neeraja

End-to-end ML

Varol Kayhan, PhD

End-to-End ML (and our Agenda)
• Problem definition
• Get the data
• Discover and visualize the data
• Data transformation
• Select a model and train
• Evaluate the model
• Fine tuning
• Present solution
• Launch, monitor, maintain

Problem Definition
• Select SPECIFIC goals/problems!
• Avoid broad problems!
• Several questions to ask:
• What is the expected benefit?
• Is it supervised or unsupervised?
• Is it regression or classification (binary or multi-class)?

• (There could be a single course on "Problem Definition" alone)

Get the Data
• What is the unit of analysis?
• What are the predictors (features)?
• What is the target/outcome variable?
• Where is the data coming from?
• Internal, external, combined?
• How much data do you have (i.e., how many rows?)

• (There could be a single course on "Getting the Data" alone)

Discover and Visualize
• Use visualization
• Histograms, boxplots, scatter plots
• Use multiple variables in one plot when possible
• Observe descriptive statistics
• Mean, median, std.dev, min, max
• Observe correlations
• Reduce variables (if needed)

• (There could be a single course on "Discovery and Viz." alone)
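
A minimal sketch of the descriptive-statistics and correlation bullets, using pandas. The five rows are illustrative values only (an assumption for this example), and the column names are hypothetical.

```python
import pandas as pd

# Illustrative toy data (assumption): five customers.
df = pd.DataFrame({
    "Age": [23, 21, 25, 20, 24],
    "Salary": [58560, 78224, 86571, 78841, 63839],
    "Default": [1, 1, 0, 1, 0],
})

# Descriptive statistics per column: mean, median, std. dev., min, max.
summary = df.agg(["mean", "median", "std", "min", "max"])
print(summary)

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr)
```

Highly correlated predictors found this way are candidates for the "reduce variables" step.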

Data Transformation
• Perform "feature engineering" (if needed)
• Derive variables using other variables
• Convert continuous to categorical
• Create new categories from existing categories
• Etc.
• Transform skewed variables if needed

• (There could be a single course on "Data Transformation" alone)
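
The feature-engineering bullets can be sketched with pandas and NumPy. Everything here is an assumption made for illustration: the toy data, the derived ratio, the age cut points, and the choice of a log transform for a skewed variable.

```python
import numpy as np
import pandas as pd

# Illustrative toy data (assumption).
df = pd.DataFrame({
    "Age": [23, 21, 25, 20, 24],
    "Salary": [58560, 78224, 86571, 78841, 63839],
})

# Derive a variable using other variables (hypothetical feature).
df["Salary_to_age_ratio"] = df["Salary"] / df["Age"]

# Convert continuous to categorical. The bins are (0, 21], (21, 23], (23, 120];
# these cut points are hypothetical.
df["Age_group"] = pd.cut(df["Age"], bins=[0, 21, 23, 120],
                         labels=["<=21", "22-23", "24+"])

# A log transform is one common way to reduce right skew in a
# strictly positive variable.
df["Salary_log"] = np.log(df["Salary"])

print(df)
```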

Data Transformation
• Convert categorical text to numerical values:
• "Ordinal encoding" OR "one-hot-encoding"

Original data:
Size        Color   Price
Compact     White   22,000
Compact     Red     22,750
Mid-size    Black   25,000
Full-size   White   32,000
Mid-size    Red     26,000

Encoded data (Size: ordinal encoding; Color: one-hot-encoding):
Size   White   Red   Black   Price
1      1       0     0       22,000
1      0       1     0       22,750
2      0       0     1       25,000
3      1       0     0       32,000
2      0       1     0       26,000
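
One way to produce this encoding is with scikit-learn's OrdinalEncoder and OneHotEncoder. This is a sketch, not necessarily how the slide was generated: the explicit category orders and the "+ 1" shift (OrdinalEncoder counts from 0) are choices made here to match the 1/2/3 coding.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cars = pd.DataFrame({
    "Size": ["Compact", "Compact", "Mid-size", "Full-size", "Mid-size"],
    "Color": ["White", "Red", "Black", "White", "Red"],
    "Price": [22000, 22750, 25000, 32000, 26000],
})

# Ordinal encoding: state the order explicitly so that
# Compact < Mid-size < Full-size. Codes start at 0, so add 1.
size_encoder = OrdinalEncoder(categories=[["Compact", "Mid-size", "Full-size"]])
size_codes = size_encoder.fit_transform(cars[["Size"]]).ravel().astype(int) + 1

# One-hot encoding: one 0/1 column per color, in the order White, Red, Black.
color_encoder = OneHotEncoder(categories=[["White", "Red", "Black"]])
color_dummies = color_encoder.fit_transform(cars[["Color"]]).toarray().astype(int)

print(size_codes)
print(color_dummies)
```

Ordinal encoding suits Size because the categories have a natural order; Color has none, so each color gets its own column.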

Data Transformation
• Standardize continuous variables (why???)
• Standardized value = (value - mean) / std. dev.

Age   Salary   Default   Age_std   Salary_std   Default
23    58,560   1          0.19     -1.26        1
21    78,224   1         -0.77      0.43        1
25    86,571   0          1.16      1.15        0
20    78,841   1         -1.25      0.49        1
24    63,839   0          0.68     -0.81        0

Age: Mean = 22.6, Std.Dev. = 2.07
Salary: Mean = 73,207, Std.Dev. = 11,595

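
The standardized values can be reproduced with a few lines of pandas. A sketch under one assumption: the slide's standard deviations (2.07 and 11,595) are sample standard deviations, which is what pandas' std() computes by default (ddof=1).

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 21, 25, 20, 24],
    "Salary": [58560, 78224, 86571, 78841, 63839],
    "Default": [1, 1, 0, 1, 0],
})

cols = ["Age", "Salary"]

# Standardized value = (value - mean) / std. dev., column by column.
# DataFrame.std() uses the sample standard deviation (ddof=1).
standardized = (df[cols] - df[cols].mean()) / df[cols].std()
standardized = standardized.round(2).add_suffix("_std")

print(standardized)
```

scikit-learn's StandardScaler divides by the population standard deviation (ddof=0) instead, so its output differs slightly on small samples. Either way, the variables end up on a common scale, so a variable measured in large units (Salary) does not dominate one measured in small units (Age).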
Data Transformation
• Identify and deal with missing values
• Drop those rows/columns?
• Impute values (mean/median/etc.)?

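Example 1:
Age         Salary      Default
(missing)   58,560      1
(missing)   78,224      1
25          (missing)   0
20          (missing)   1
24          63,839      0
What to do?

Example 2:
Age         Salary      Default
23          58,560      1
21          78,224      1
25          86,571      0
20          (missing)   1
24          63,839      0
What to do?

Example 3:
Age         Salary      Default
23          (missing)   1
21          78,224      1
25          (missing)   0
20          (missing)   1
24          (missing)   0
What to do?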
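
A sketch of the two options (drop or impute) with pandas and scikit-learn. The data frame assumes the first example has Age missing in the first two rows and Salary missing in the next two; the median strategy is just one of the imputation choices listed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumed layout of the missing cells (see the lead-in).
df = pd.DataFrame({
    "Age": [np.nan, np.nan, 25, 20, 24],
    "Salary": [58560, 78224, np.nan, np.nan, 63839],
    "Default": [1, 1, 0, 1, 0],
})

# Identify: count the missing values per column.
missing_per_column = df.isna().sum()

# Option 1: drop rows (or columns) that contain missing values.
dropped_rows = df.dropna()          # keeps only complete rows
dropped_cols = df.dropna(axis=1)    # keeps only complete columns

# Option 2: impute each missing value with the column median.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(missing_per_column)
print(imputed)
```

Here dropping rows leaves a single row and dropping columns leaves only the target, which is why imputation is worth considering when the missing cells are scattered.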

Data Transformation
• Identify and deal with outliers:
• How do you define outliers?
• Are outliers legitimate or data entry errors?
• For each outlier: eliminate, include, or correct?
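
One common answer to "how do you define outliers?" is the 1.5 x IQR rule used by boxplots. The sketch below applies it to hypothetical prices; the rule itself and the extreme value are assumptions for illustration, not the only valid definition.

```python
import pandas as pd

# Hypothetical prices; the last value is deliberately extreme.
prices = pd.Series([22000, 22750, 25000, 32000, 26000, 250000])

# Interquartile range and the usual boxplot fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier])
```

Flagging is only the first step: whether a flagged value is a legitimate observation or a data entry error still has to be decided case by case.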
Data Transformation
• Do you do data transformation BEFORE or AFTER the train/test split?
• BEFORE is easier – recommended for beginners
• AFTER is better – recommended for real-world applications
Data Transformation
• There are two functions we need to know about:

• fit_transform()
• Learns and uses the parameters needed to transform the data (e.g., mean, std.dev, etc.)
• Must be used to transform the TRAINING data

• transform()
• Uses the learned parameters to transform the data
• Can only be used after fit_transform() has been used
• Must be used to transform the TEST data
Data Transformation
• What do fit_transform() and transform() do?

Train set: fit_transform() learns the parameters (Price: Mean = 25,550, Stddev = 3,537) and transforms the data.

Before:
Size        Color    Price
Compact     White    22,000
Compact     Red      22,750
Mid-size    Black    25,000
Full-size   White    32,000
Mid-size    Red      26,000

After fit_transform():
Size   White   Red   Black   Price
1      1       0     0       -1.00
1      0       1     0       -0.79
2      0       0     1       -0.16
3      1       0     0        1.82
2      0       1     0        0.13

Test set: transform() applies the parameters learned from the train set.

Before:
Size        Color    Price
Mid-size    Purple   23,000

After transform():
Size   White   Red   Black   Price
2      0       0     0       -0.72
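
A sketch of the same idea with scikit-learn's StandardScaler and OneHotEncoder (the Size column is left out for brevity). The handle_unknown="ignore" setting is an assumption made here so that a color never seen in training, such as Purple, becomes all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "Color": ["White", "Red", "Black", "White", "Red"],
    "Price": [22000, 22750, 25000, 32000, 26000],
})
test = pd.DataFrame({"Color": ["Purple"], "Price": [23000]})

# fit_transform() on the TRAINING data: learns mean and std, then scales.
scaler = StandardScaler()
train_price = scaler.fit_transform(train[["Price"]])

# transform() on the TEST data: reuses the training mean and std,
# i.e. (23000 - 25550) / 3537, roughly -0.72.
test_price = scaler.transform(test[["Price"]])

# The encoder follows the same pattern; unknown colors become all zeros.
encoder = OneHotEncoder(categories=[["White", "Red", "Black"]],
                        handle_unknown="ignore")
train_color = encoder.fit_transform(train[["Color"]]).toarray()
test_color = encoder.transform(test[["Color"]]).toarray()

print(scaler.mean_[0], scaler.scale_[0])
print(test_price, test_color)
```

StandardScaler uses the population standard deviation, which is why the learned value here is about 3,537.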
Data Transformation
• Do not use fit_transform() on train and test separately!!!

Train set, fit_transform() (Price: Mean = 23,250, Stddev = 1,275):
Size        Color   Price
Compact     White   22,000
Compact     Red     22,750
Mid-size    Black   25,000

Size   White   Red   Black   Price
1      1       0     0       -0.98
1      0       1     0       -0.39
2      0       0     1        1.37

Test set, separate fit_transform() (Price: Mean = 46,000, Stddev = 2,944):
Size        Color   Price
Mid-size    Blue    43,000
Full-size   White   45,000
Mid-size    Red     50,000

Size   White   Red   Blue   Price
1      0       0     1      -1.02
2      1       0     0      -0.34
1      0       1     0       1.36

Instead, use only transform() on the test set:
Size   White   Red   Black   Price
2      0       0     0       15.49
unk    1       0     0       17.06
2      0       1     0       20.98
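
The price columns of this example can be checked with StandardScaler. A minimal sketch that covers only the Price variable:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"Price": [22000, 22750, 25000]})
test = pd.DataFrame({"Price": [43000, 45000, 50000]})

# Wrong: a second fit_transform() relearns mean and std from the test set,
# so the test prices look "typical" even though they are far above
# anything seen in training.
wrong = StandardScaler().fit_transform(test).ravel()

# Right: fit on the training data only, then transform() the test data.
scaler = StandardScaler().fit(train)
right = scaler.transform(test).ravel()

print(wrong.round(2))
print(right.round(2))
```

With the correct approach the test prices land 15 to 21 training standard deviations above the training mean, which is exactly the information a model trained on the train set needs to see.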
Data Transformation (fit_transform)
• When to use fit_transform() and transform()
Train Models
• Select the right algorithm(s) for the task:
Regression Tasks          Classification Tasks
DecisionTreeRegressor     DecisionTreeClassifier
SVR                       SVC
RandomForestRegressor     RandomForestClassifier
Etc.                      Etc.

• Build alternative models to compare
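
A sketch of building alternative models to compare, using the three regression estimators named above. The synthetic data, the 80/20 split, and the default hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption), split into train and test sets.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit every candidate on the same training data and score it on the
# same test data. score() returns R^2 for regressors.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```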

Model Evaluation
• Regression evaluation metrics: (to be discussed later)
• Mean squared error (MSE)
• Root mean squared error (RMSE)
• Classification evaluation metrics: (to be discussed later)
• Accuracy
• Sensitivity/specificity
• F1 score
• ROC curve
• Cross-validation! (Next slide)
• Goal: generate robust results (rather than one-off/non-replicable results)
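
The regression and classification metrics listed above are all available in sklearn.metrics. The sketch below uses small hypothetical vectors of true values, predictions, and predicted probabilities; the area under the ROC curve (AUC) stands in for the curve itself.

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             recall_score, roc_auc_score)

# Regression: hypothetical true and predicted prices.
y_true = [22000, 25000, 32000]
y_pred = [23000, 24000, 30000]
mse = mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5

# Classification: hypothetical labels, predicted labels, and probabilities.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
probabilities = [0.9, 0.4, 0.2, 0.8, 0.1, 0.6, 0.7, 0.3]

accuracy = accuracy_score(labels, predicted)
sensitivity = recall_score(labels, predicted)               # recall of class 1
specificity = recall_score(labels, predicted, pos_label=0)  # recall of class 0
f1 = f1_score(labels, predicted)
auc = roc_auc_score(labels, probabilities)

print(mse, rmse)
print(accuracy, sensitivity, specificity, f1, auc)
```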
Model Evaluation
• Cross-validation has two use cases:
• Data set is small (a single train/test split will not generate robust results)
• Data set is large, but train/test split created lucky/unlucky samples
• There are several influential outliers in the data???
• 5-fold cross-validation:
Entire data set, split into five parts:

Fold #1:   Test    Train   Train   Train   Train     Train = …, Test = …
Fold #2:   Train   Test    Train   Train   Train     Train = …, Test = …
Fold #3:   Train   Train   Test    Train   Train     Train = …, Test = …
Fold #4:   Train   Train   Train   Test    Train     Train = …, Test = …
Fold #5:   Train   Train   Train   Train   Test      Train = …, Test = …

Overall: Train = Avg(trains), Test = Avg(tests)
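
5-fold cross-validation can be sketched with scikit-learn's cross_validate, which returns one train score and one test score per fold. The synthetic data, the decision tree, and the accuracy metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5: each of the five parts serves as the test set exactly once.
results = cross_validate(model, X, y, cv=5, scoring="accuracy",
                         return_train_score=True)

# Overall result = average over the five folds.
overall_train = results["train_score"].mean()
overall_test = results["test_score"].mean()

print(results["test_score"])
print(overall_train, overall_test)
```

A large gap between the averaged train and test scores is a sign of overfitting; a large spread across folds suggests a single train/test split would have been unreliable.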
Fine tuning
• Change hyperparameters
• Grid search!
• Define the set of parameters
• Perform a (random) search!
• (Exhaustive search not recommended!)
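
A sketch of a random search over a defined set of hyperparameters with scikit-learn's RandomizedSearchCV. The data, the random forest, the candidate values, and n_iter=5 are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic classification data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Define the set of parameters: 3 x 4 x 3 = 36 possible combinations.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

# Random search: try only n_iter of the combinations, each evaluated
# with 5-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

GridSearchCV would evaluate all 36 combinations instead, which becomes expensive quickly as the grid grows.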
Present Solution
• Is the "error" acceptable?
• Can you interpret the model (black-box or white box)?
• Can you identify "variable importance"?
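
For tree-based models, one way to identify "variable importance" is the feature_importances_ attribute. A sketch on synthetic data with hypothetical feature names (x0 to x4):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, sorted from most to least
# important. They are normalized to sum to 1.
importances = sorted(zip(feature_names, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)

for name, value in importances:
    print(f"{name}: {value:.3f}")
```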
Launch, Monitor, Maintain
• Is the model successful/effective?
• Conduct A/B experiments to test model effectiveness
Conclusion
[Diagram: the end-to-end ML process (problem definition; get the data; discover and visualize the data; data transformation; select a model and train; evaluate the model; fine tuning; present solution; launch, monitor, maintain), with two callouts: "Most time consuming & error prone" and "We will perform these steps in Python"]
