L3 - End To End Machine Learning Project

machine learning notes

Uploaded by

its.sharuu

Toronto Metropolitan University

Faculty of Engineering and Architectural Science

Department of Aerospace Engineering

End-to-End Machine Learning Project
AER850: Intro to Machine Learning
Steps Involved in a Machine Learning Project

1. Identifying objectives and variables
2. Splitting data into train and test datasets
3. Data visualization
4. Data cleaning and preprocessing
5. Variable selection
6. Model training
7. Model evaluation
8. Fine tuning

AER850: Intro to Machine Learning | © Reza Faieghi, 2024 2

Step 1: Identifying Objectives and Variables

• Supervised or unsupervised?
• Regression or classification?
• Independent and dependent variables?
• Types of data?

• Example: Use California census data to build a model of housing prices in the state.
This data includes metrics such as the population, median income, and median
housing price for each district in California. Your model should learn from this data
and be able to predict the median housing price in any district, given all the other
metrics.
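
As a rough illustration of separating independent and dependent variables for this example, the sketch below uses a tiny made-up table. The column names and all values are assumptions chosen for illustration; they are not taken from the actual census data.

```python
import pandas as pd

# Hypothetical district-level records (invented values, illustrative column names).
districts = pd.DataFrame({
    "population": [1200, 850, 3000, 450],
    "median_income": [3.5, 5.1, 2.8, 7.2],
    "median_house_value": [210000, 340000, 150000, 480000],
})

# Dependent variable: the quantity the model should predict.
target = "median_house_value"
y = districts[target]

# Independent variables: all the other metrics.
X = districts.drop(columns=[target])

print(list(X.columns))
print(y.name, y.dtype)
```

Because the target is a continuous number and labels are available, this setup would be a supervised regression problem.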


Types of Data

• Numerical
  • Discrete
    • The number of speakers, cameras, processor cores, or SIM cards supported by a smartphone.
  • Continuous
    • The temperature and operating frequency of a smartphone processor.
• Categorical
  • Nominal
    • Colors of smartphones: it is not possible to state that “Red” is greater than “Blue”. The same holds for a
      person's gender, where the categories male, female, and others have no inherent order.
  • Ordinal
    • Clothing sizes, which have an order: small < medium < large, or a letter grading system, where an A+
      is definitely greater than a B.

• Categorical data are often expressed as text, which machine learning models cannot
process directly. Data encoding methods are required to convert categorical
data into numbers. One-hot encoding is one common method to handle nominal
data.
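
The encoding idea can be sketched as follows. This is a minimal example on made-up data, assuming pandas is available; the column names and the integer codes for the ordinal column are illustrative choices, not something prescribed by the slides.

```python
import pandas as pd

# Hypothetical toy data: one nominal column and one ordinal column.
items = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal data: one-hot encoding creates one 0/1 column per category,
# so no artificial order is imposed on the colors.
one_hot = pd.get_dummies(items["color"], prefix="color", dtype=int)

# Ordinal data: map each category to an integer that preserves the order.
size_order = {"small": 0, "medium": 1, "large": 2}
size_encoded = items["size"].map(size_order)

print(one_hot)
print(size_encoded.tolist())
```

Each row of the one-hot table contains exactly one 1, marking the color of that item.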
Step 2: Splitting Data Into Train and Test Datasets

• The need for test data

• Once a machine learning model is trained, it is important to test it on data that the model has not
seen yet.
• Testing a model on the same data that has been used for training is not an accurate evaluation of
the model, and must be avoided.
• It is a good practice to randomly select a small portion of data (e.g., 20%), and set it aside for
testing.
• Stratified sampling is often the method of choice for creation of test data.
• When the available data is limited, the train and test subsets are created via
cross-validation folds. More on this later.
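
A random 80/20 split could look like the sketch below. The data is synthetic, and the choice of scikit-learn's `train_test_split` is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 independent variables, 1 dependent variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Set aside 20% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The test portion is then left untouched until model evaluation.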


Stratified Sampling

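• A method of sampling from a population that can be partitioned into
subpopulations (strata). Each subpopulation is sampled separately, so that the
sample preserves the proportions of the subpopulations in the population.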
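
To illustrate, the sketch below stratifies a split on a made-up, imbalanced class label. Using the `stratify` argument of scikit-learn's `train_test_split` is one possible way to do this, and the 80/20 class balance is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y samples each class (stratum) separately, so the test set
# keeps the class proportions of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("class 1 share in full data:", y.mean())
print("class 1 share in test data:", y_test.mean())
```

For a continuous variable such as median income, the values would first have to be binned into categories before they can serve as strata.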

Step 3: Data Visualization

• Plots, e.g., scatter plots, color maps, 3D visualizations

• Data distributions, e.g., histograms
• Correlation matrix


• Always aim for simple visualization methods, and
gradually increase the complexity, if needed.
• Complicated plots are not always useful.

https://clauswilke.com/dataviz/no-3d.html
Correlation Matrix

• A correlation matrix is a table showing correlation coefficients between variables.
Each cell in the table shows the correlation between two variables.
• Correlation describes the statistical relationship or dependence between two variables.
This statistical dependence is reported using a correlation coefficient.
• There are several correlation coefficients. Care must be taken in choosing the
appropriate correlation coefficient based on the nature of the data.
• Two common correlation coefficients:
  • Pearson correlation coefficient (aka Pearson’s r): sensitive to linear relationships.
  • Spearman’s rank correlation coefficient (aka Spearman’s ρ): sensitive to monotonic relationships,
    including nonlinear ones, and suited to ranked data.
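
The difference between the two coefficients can be seen in a small sketch. The data below is synthetic, and using pandas `DataFrame.corr` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y_linear depends linearly on x, while y_cubic depends on x
# monotonically but nonlinearly.
x = np.linspace(-3, 3, 50)
df = pd.DataFrame({"x": x, "y_linear": 2 * x + 1, "y_cubic": x ** 3})

# Each call returns a square table with one row and one column per variable.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Here Pearson's r between x and y_cubic falls below 1 because the relationship is not linear, while Spearman's ρ stays at 1 because the ranks of the two variables agree exactly.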

Data Snooping Bias

• Visualization of data is an important step in selecting an appropriate machine learning
model to train.

• However, visualization must be done only on the train dataset.

• If the test dataset is included during visualization, then model selection will be
biased, because information outside of the training dataset is “leaked” into model
selection, creating a data snooping bias.

• Data snooping refers to statistical inference that the researcher decides to perform
after looking at the data (as contrasted with pre-planned inference, which the
researcher plans before looking at the data).


Step 4: Data Cleaning and Preprocessing

• Data imputation
  • The process of replacing missing data with substitute values.
  • Common ways to handle missing data include removing the affected data points, or filling the
    gaps with a neutral value (often 0) or the average value.
• Handling text and categorical data
  • Data encoding methods
• Data scaling
  • Analogy: to make a mixed fruit juice, we mix the fruits not by their size but in the
    right proportions. Likewise, scaling keeps variables with large numeric ranges from dominating the others.
  • Common methods:
    • Standardization, to give data zero mean and unit standard deviation.
    • Normalization (aka min-max scaling), to map data into unitless ranges, typically [0,1] or [-1,1].
    • Scaling to unit length
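
The imputation and scaling methods listed above can be sketched with scikit-learn. The feature matrix is made up, and the specific classes used (`SimpleImputer`, `StandardScaler`, `MinMaxScaler`, `Normalizer`) are one possible choice, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical feature matrix with one missing value (np.nan).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0],
              [4.0, 600.0]])

# Imputation: replace the missing entry with the mean of its column (400.0 here).
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Standardization: each column gets zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X_imputed)

# Normalization (min-max scaling): each column is mapped into [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_imputed)

# Scaling to unit length: each row (sample) is rescaled to Euclidean norm 1.
X_unit = Normalizer(norm="l2").fit_transform(X_imputed)

print(X_imputed)
print(X_std.round(3))
print(X_minmax.round(3))
print(X_unit.round(3))
```

Note that standardization and min-max scaling act on columns (variables), whereas scaling to unit length acts on rows (samples).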


Step 5: Variable Selection

• Rule of thumb: Having a lower number of independent variables is desired, to avoid
unnecessary model complexity.
• Linearly dependent variables
  • A sequence of vectors $v_1, v_2, \dots, v_n$ is said to be linearly dependent if there exist scalars
    $a_1, a_2, \dots, a_n$, not all zero, such that
    $$a_1 v_1 + a_2 v_2 + \dots + a_n v_n = 0.$$
    If, for example, $a_1$ is nonzero, the above equation can be written as
    $$v_1 = \frac{-a_2}{a_1} v_2 + \dots + \frac{-a_n}{a_1} v_n.$$
  • A correlation matrix is often used to detect linearly dependent variables and trim the number of
    independent variables.

• Dimensionality reduction, e.g., principal component analysis (PCA)
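
Both ideas can be sketched on synthetic data. The 0.95 correlation threshold and the variable names are assumptions chosen for illustration. Note also that a pairwise correlation check only catches a variable that is (nearly) a multiple of one other variable, not every linear combination.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Hypothetical variables: "c" is linearly dependent on "a" (c = 3a).
df = pd.DataFrame({"a": a, "b": b, "c": 3.0 * a})

# Drop any variable that is highly correlated with an earlier variable.
corr = df.corr().abs()
threshold = 0.95  # illustrative cut-off
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i][col] > threshold).any()
]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)

# PCA: project the three original variables onto two principal components.
pca = PCA(n_components=2).fit(df.to_numpy())
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Because one of the three variables is redundant, two principal components capture essentially all of the variance in this example.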

Step 6: Model Training

• Rule of thumb: start with a simple model, and increase model complexity if needed.
• Examples of commonly used models:
  • Linear and logistic regression models
  • Support vector machines
  • Decision trees
  • Random forests
  • K-means
  • K-nearest neighbours
  • Neural networks
• Performance index
  • Mean squared error
  • Cross-entropy
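
As a minimal sketch of training a simple model and computing a performance index, the example below fits a linear regression to synthetic data and reports the mean squared error. The data-generating rule (slope 3, intercept 2, Gaussian noise) is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical training data: y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 3.0 * X_train[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# Start with a simple model: ordinary least-squares linear regression.
model = LinearRegression().fit(X_train, y_train)

# Performance index for regression: mean squared error on the training data.
train_mse = mean_squared_error(y_train, model.predict(X_train))

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("training MSE:", train_mse)
```

For a classification model, cross-entropy would take the place of the mean squared error; scikit-learn exposes it as `sklearn.metrics.log_loss`.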

Step 7: Model Evaluation

• At this stage, the data that was set aside for testing in Step 2 will be used to evaluate
the model.
• The test data must pass through the same pipeline that was created for the train
data. If data-dependent values were used in the pipeline, the same values that were
obtained for the train data must be applied to the test data. For example, during
standardization, the mean and standard deviation of the train data must be used to
standardize the test data; otherwise, there is a leakage of information.
• Prediction results for the test data must be evaluated using the same performance index
that was used for evaluating the training procedure.
• Use K-fold cross validation.
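
The point about reusing train statistics can be sketched as follows. The data is synthetic, and the model and scaler are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two independent variables.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=1.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train data only: mean and standard deviation come from X_train.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Apply the SAME train statistics to the test data; do not call fit() on X_test.
X_test_scaled = scaler.transform(X_test)

# Evaluate with the same performance index used during training (MSE here).
test_mse = mean_squared_error(y_test, model.predict(X_test_scaled))
print("test MSE:", test_mse)
```

Calling `fit` or `fit_transform` on the test data would recompute the mean and standard deviation from it, which is the leakage the slide warns about.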

K-Fold Cross Validation
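
In K-fold cross validation, the data is divided into K folds; each fold serves once as the held-out set while the model is trained on the remaining K-1 folds. A minimal sketch, assuming K = 5, synthetic data, and scikit-learn's `cross_val_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two independent variables.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Wrapping the scaler and the model in a pipeline refits the scaler inside
# every fold, so no held-out information leaks into the preprocessing.
model = make_pipeline(StandardScaler(), LinearRegression())

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")

# scikit-learn reports negated MSE (higher is better), so flip the sign.
mse_per_fold = -scores
print("MSE per fold:", mse_per_fold)
print("mean MSE:", mse_per_fold.mean())
```

The spread of the per-fold errors gives a rough sense of how sensitive the model is to the particular split.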

Step 8: Fine Tuning

• The key to fine tuning is to find a trade-off between training error and test error.

• Rule of thumb: train a complex model to reach low training error (overfitting). At this
stage, training error is low; thus, underfitting is ruled out. Next, start gradually
decreasing the model complexity until the least test error is found. This will
constitute the best fit.
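
This rule of thumb can be sketched with a decision tree, using tree depth as the measure of model complexity. The data is synthetic, the depth range is arbitrary, and a held-out split is assumed here to play the role of the test error; these are illustrative choices only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data following a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Start from a complex (deep) tree and gradually decrease the complexity.
results = {}
for depth in range(12, 0, -1):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, tree.predict(X_train))
    held_err = mean_squared_error(y_held, tree.predict(X_held))
    results[depth] = (train_err, held_err)

# The best fit is the complexity with the lowest held-out error.
best_depth = min(results, key=lambda d: results[d][1])
print("best depth:", best_depth, "errors:", results[best_depth])
```

The deepest tree has the lowest training error, but its held-out error is typically not the lowest, which is the overfitting the rule of thumb starts from.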

General Recommendations to Reduce Training Error

1. Start with a simple model

2. Increase the number of independent variables
3. Extract new features from the independent variables
4. Apply systematic hyperparameter optimization methods, e.g., grid search
5. If the training error is not satisfactory, try a more complex model, and repeat the
above steps in the order provided.
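
Grid search, mentioned in item 4, can be sketched as below. The model (K-nearest neighbours), the parameter grid, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy data following a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# Every combination in the grid is evaluated with 5-fold cross validation.
param_grid = {
    "n_neighbors": [1, 3, 5, 10, 20],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated MSE:", -search.best_score_)
```

With 5 values for one hyperparameter and 2 for the other, the search evaluates 10 combinations in total.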

Scikit-Learn Design

• Scikit-Learn is a common machine learning package that provides many useful
algorithms for a variety of machine learning problems; therefore, it is important to
understand the overall architecture of its algorithms or objects.

• There are three primary categories of scikit-learn objects:
  • Estimators: estimate some parameters based on a dataset. They have a .fit() member.
  • Transformers: estimators that also apply a transformation on the dataset. They have a
    .transform() member, and for convenience, they have a .fit_transform() member as well.
  • Predictors: estimators that also make predictions. They have a .predict() member.

• Each scikit-learn object has a set of learned parameters that are named with a trailing
underscore, like “coef_” in linear models or “coefs_” in neural network models. They can be
accessed via <object>.<parameter>
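
The three categories can be seen in a short sketch. The choice of `StandardScaler` as the transformer and `LinearRegression` as the predictor, as well as the toy data, are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data following y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Transformer: .fit() estimates the mean and scale, .transform() applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# .fit_transform() does both steps in one call.
X_scaled_again = StandardScaler().fit_transform(X)

# Predictor: .fit() estimates the coefficients, .predict() makes predictions.
reg = LinearRegression()
reg.fit(X, y)
pred = reg.predict(np.array([[5.0]]))

# Learned parameters carry a trailing underscore.
print("scaler.mean_:", scaler.mean_)
print("reg.coef_:", reg.coef_, "reg.intercept_:", reg.intercept_)
print("prediction for x = 5:", pred)
```

Attributes such as `mean_` and `coef_` only exist after `.fit()` has been called, which is a quick way to tell learned parameters apart from the settings passed to the constructor.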
