Toronto Metropolitan University

Faculty of Engineering and Architectural Science


Department of Aerospace Engineering

End-to-End Machine Learning Project
AER850: Intro to Machine Learning
Steps Involved in a Machine Learning Project

1. Identifying objectives and variables
2. Splitting data into train and test datasets
3. Data visualization
4. Data cleaning and preprocessing
5. Variable selection
6. Model training
7. Model evaluation
8. Fine tuning

AER850: Intro to Machine Learning | © Reza Faieghi, 2024 2


Step 1: Identifying Objectives and Variables

• Supervised or unsupervised?
• Regression or classification?
• Independent and dependent variables?
• Types of data?

• Example: Use California census data to build a model of housing prices in the state.
This data includes metrics such as the population, median income, and median
housing price for each district in California. Your model should learn from this data
and be able to predict the median housing price in any district, given all the other
metrics.



Types of Data

• Numerical
  • Discrete
    • The number of speakers, cameras, processor cores, or SIM cards supported by a smartphone.
  • Continuous
    • The temperature and operating frequency of a smartphone processor.
• Categorical
  • Nominal
    • Colors of smartphones: it is not possible to state that “Red” is greater than “Blue”. The same holds for a person's gender, whose categories have no inherent order.
  • Ordinal
    • Clothing sizes, which have an order (small < medium < large), or a letter grading system, where an A+ is greater than a B.

• Categorical data are often expressed as text, which machine learning models cannot process directly. Data encoding methods are required to convert categorical data into numbers. One-hot encoding is a common method for handling nominal data.
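
As a rough illustration of one-hot encoding, the sketch below turns a nominal feature into one 0/1 column per category. The helper `one_hot_encode` and the sample colors are hypothetical, written only for this example; in practice a library encoder such as scikit-learn's `OneHotEncoder` would normally be used.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))          # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                 # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


colors = ["Red", "Blue", "Green", "Blue"]     # hypothetical nominal feature
categories, encoded = one_hot_encode(colors)
print(categories)                             # column labels
for color, row in zip(colors, encoded):
    print(color, row)
```

No column is "greater" than another, which is why this encoding suits nominal data; an ordinal feature could instead be mapped to ordered integers.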
Step 2: Splitting Data Into Train and Test Datasets

• The need for test data
  • Once a machine learning model is trained, it is important to test it on data that the model has not seen yet.
  • Testing a model on the same data that was used for training does not give an accurate evaluation of the model and must be avoided.
  • It is good practice to randomly select a small portion of the data (e.g., 20%) and set it aside for testing.
  • Stratified sampling is often the method of choice for creating test data.
  • When the available data is limited, the train and test subsets are created via cross-validation folds. More on this later.
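
A minimal sketch of the random 80/20 split described above, using only the standard library. The helper name `split_train_test`, the seed, and the toy dataset are assumptions made for illustration; they are not part of the course material.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` and hold out `test_fraction` of it for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)     # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))                       # hypothetical dataset of 100 samples
train, test = split_train_test(data)
print(len(train), len(test))
```

Fixing the seed keeps the same samples in the test set across runs, so the model never gets to see them during development.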



Stratified Sampling

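• A method of sampling from a population that can be partitioned into subpopulations (strata). Each subpopulation is sampled separately, typically in proportion to its size, so that the sample preserves the makeup of the population.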
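
The idea can be sketched as follows: group the samples by stratum, then hold out the same fraction of each group. The function `stratified_split` and the 80/20 class mix are hypothetical choices for this example.

```python
import random
from collections import defaultdict


def stratified_split(samples, strata, test_fraction=0.2, seed=0):
    """Hold out `test_fraction` of every stratum so proportions are preserved."""
    groups = defaultdict(list)
    for sample, stratum in zip(samples, strata):
        groups[stratum].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for stratum in sorted(groups):            # sorted for a reproducible order
        members = groups[stratum]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical population: 80 samples of class "A" and 20 of class "B".
labels = ["A"] * 80 + ["B"] * 20
samples = list(enumerate(labels))             # (sample id, label) pairs
train, test = stratified_split(samples, labels)
n_b = sum(1 for _, label in test if label == "B")
print(n_b, "of", len(test), "test samples are class B")
```

A purely random split could, by chance, leave the rarer class under-represented in the test set; sampling per stratum avoids that.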

Step 3: Data Visualization

• Plots, e.g., scatter plots, color maps, 3D visualizations
• Data distributions, e.g., histograms
• Correlation matrix



• Always aim for simple visualization methods, and
gradually increase the complexity, if needed.
• Complicated plots are not always useful.

https://clauswilke.com/dataviz/no-3d.html
Correlation Matrix

• A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
• Correlation describes the statistical relationship or dependence between two variables. This statistical dependence is reported using a correlation coefficient.
• There are several correlation coefficients. Care must be taken in choosing the appropriate correlation coefficient based on the nature of the data.
• Two common correlation coefficients:
  • Pearson correlation coefficient (aka Pearson’s r): measures the strength of linear relationships.
  • Spearman’s rank correlation coefficient (aka Spearman’s ρ): measures the strength of monotonic relationships, whether linear or not, and is suited to ranked data.
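
To make the difference concrete, the sketch below computes both coefficients for a monotonic but nonlinear relationship. The helpers are written for this example only and assume no tied values; the data are made up.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]                       # monotonic but nonlinear relationship
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because y always increases with x, the ranks agree perfectly and Spearman's ρ reaches 1, while Pearson's r stays below 1 since the points do not lie on a straight line.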

Data Snooping Bias

• Visualizing the data is an important step in selecting an appropriate machine learning model to train.

• However, visualization must be done only on the train dataset.

• If the test dataset is included in the visualization, model selection will be biased, because information from outside the training dataset is “leaked” into model selection, creating a data snooping bias.

• Data snooping refers to statistical inference that the researcher decides to perform
after looking at the data (as contrasted with pre-planned inference, which the
researcher plans before looking at the data).



Step 4: Data Cleaning and Preprocessing

• Data imputation
  • The process of replacing missing data with substitute values.
  • Common ways to handle missing data include removing the affected data points, or filling the missing entries with a neutral element (often 0) or an average value.
• Handling text and categorical data
  • Data encoding methods
• Data scaling
  • Features measured on very different scales should be brought to comparable ranges so that no feature dominates only because of its magnitude. As an analogy, a mixed fruit juice is made by combining fruits in the right proportions, not according to their size.
  • Common methods:
    • Standardization, to rescale data to zero mean and unit standard deviation.
    • Normalization (aka min-max scaling), to map data into unitless ranges, typically [0,1] or [-1,1].
    • Scaling to unit length
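
The preprocessing steps above can be sketched in a few lines of plain Python. All function names and the sample values are hypothetical and chosen for illustration; missing entries are assumed to be marked with `None`.

```python
import math


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Map values linearly onto the range [low, high]."""
    lo, hi = min(column), max(column)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


def scale_to_unit_length(vector):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


raw = [2.0, None, 4.0, 6.0]                   # hypothetical feature with a gap
filled = impute_mean(raw)                     # the gap becomes the observed mean
standardized = standardize(filled)
normalized = min_max_scale(filled)
unit = scale_to_unit_length([3.0, 4.0])
print(filled, standardized, normalized, unit, sep="\n")
```

The sketch does not guard against constant columns, where the standard deviation or the range is zero; real pipelines need to handle that case.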



Step 5: Variable Selection

• Rule of thumb: a lower number of independent variables is desirable, to avoid unnecessary model complexity.
• Linearly dependent variables
  • A sequence of vectors $v_1, v_2, \dots, v_n$ is said to be linearly dependent if there exist scalars $a_1, a_2, \dots, a_n$, not all zero, such that
    $$a_1 v_1 + a_2 v_2 + \dots + a_n v_n = 0.$$
    Since at least one of the scalars is nonzero (say $a_1$), the above equation can be written as
    $$v_1 = \frac{-a_2}{a_1} v_2 + \dots + \frac{-a_n}{a_1} v_n.$$
  • A correlation matrix is often used to detect linearly dependent variables and trim the number of independent variables.
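
As a small illustration of trimming variables with correlations, the sketch below flags feature pairs whose Pearson correlation is close to ±1. The feature values and the 0.95 threshold are assumptions made for this example, not recommended settings.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical features: x2 is an exact multiple of x1, x3 is unrelated.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

threshold = 0.95                              # assumed cut-off for "redundant"
names = list(features)
redundant_pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if abs(pearson(features[a], features[b])) > threshold:
            redundant_pairs.append((a, b))

print(redundant_pairs)                        # one variable of each pair can be dropped
```

Note that a pairwise check of this kind only reveals dependence between two variables at a time; a variable that is a combination of several others may go unnoticed.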

• Dimensionality reduction, e.g., principal component analysis (PCA)
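
A minimal PCA sketch, assuming NumPy is available: the data are centered, decomposed with a singular value decomposition, and projected onto the leading directions. The function `pca` and the synthetic two-feature dataset are illustrative assumptions.

```python
import numpy as np


def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)           # PCA works on centered data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ Vt[:n_components].T, ratio[:n_components]


# Hypothetical data: the second feature is nearly twice the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2.0 * x1 + 0.05 * rng.normal(size=200)])

X_reduced, ratio = pca(X, n_components=1)
print(X_reduced.shape, ratio)
```

Because the two features are almost linearly dependent, a single component captures nearly all of the variance, so the second dimension can be dropped with little loss.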



Step 6: Model Training

• Rule of thumb: start with a simple model, and increase model complexity if needed.
• Examples of commonly used models:
  • Linear and logistic regression models
  • Support vector machines
  • Decision trees
  • Random forests
  • K-means
  • K-nearest neighbours
  • Neural networks
• Performance Index
  • Mean squared error
  • Cross-entropy
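
The two performance indices listed above can be written out directly. This is a sketch with made-up targets and predictions; the cross-entropy shown is the binary form, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)       # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])          # (1 + 4) / 2
confident = binary_cross_entropy([1, 0], [0.9, 0.2])
hesitant = binary_cross_entropy([1, 0], [0.6, 0.5])
print(mse, confident, hesitant)
```

Mean squared error is the usual choice for regression, while cross-entropy is used for classification, where it rewards confident correct probabilities and penalizes confident wrong ones.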



Step 7: Model Evaluation

• At this stage, the data that was set aside for testing in Step 2 is used to evaluate the model.
• The test data must pass through the same pipeline that was created for the train data. If data-dependent values were used in the pipeline, the same values that were obtained from the train data must be applied to the test data. For example, during standardization, the mean and standard deviation of the train data must be used to standardize the test data; otherwise, there is a leakage of information.
• Prediction results for the test data must be evaluated using the same performance index that was used for evaluating the training procedure.
• Use K-fold cross validation to obtain a more reliable estimate of model performance.
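
The standardization rule above can be sketched with a small class that learns its statistics once and reuses them. The class `Standardizer` and the numbers are hypothetical; the `fit`/`transform` naming simply mirrors the scikit-learn convention.

```python
import math


class Standardizer:
    """Learns mean and standard deviation from one dataset and reuses them."""

    def fit(self, column):
        self.mean_ = sum(column) / len(column)
        self.std_ = math.sqrt(
            sum((v - self.mean_) ** 2 for v in column) / len(column)
        )
        return self

    def transform(self, column):
        return [(v - self.mean_) / self.std_ for v in column]


train = [10.0, 20.0, 30.0]                    # hypothetical train feature
test = [40.0]                                 # hypothetical test feature

scaler = Standardizer().fit(train)            # statistics come from train data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)          # reuse train mean and std, never refit
print(test_scaled)
```

Calling `fit` again on the test data would recompute the mean and standard deviation from samples the model is not supposed to have seen, which is exactly the leakage the slide warns about.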



K-Fold Cross Validation
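
In K-fold cross validation the data are divided into K folds; each fold is used once for validation while the remaining K-1 folds are used for training, and the K scores are averaged. The sketch below only generates the index splits; the helper `k_fold_indices` is hypothetical, and the data are assumed to be shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=5))
for number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {number}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Every sample is used for validation exactly once, which is what makes the averaged score more reliable than a single split when data are limited.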



Step 8: Fine Tuning

• The key to fine tuning is to find a trade-off between training error and test error.

Rule of thumb: train a complex model to reach a low training error (overfitting). At this stage, the training error is low; thus, underfitting is ruled out. Next, gradually decrease the model complexity until the lowest test error is found. This constitutes the best fit.
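
The rule of thumb can be tried out on synthetic data, as in the sketch below. It assumes NumPy is available and uses the polynomial degree as a stand-in for model complexity; the data-generating function, noise level, and degree range are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical noisy samples of a smooth function."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.2 * rng.normal(size=n)
    return x, y


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

results = {}
for degree in range(9, 0, -1):                # start complex, then simplify
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),   # training error
        mse(y_test, np.polyval(coeffs, x_test)),     # held-out error
    )
    print(degree, results[degree])

best_degree = min(results, key=lambda d: results[d][1])
print("degree with lowest held-out error:", best_degree)
```

In practice the held-out error used for tuning usually comes from a separate validation set or from cross validation, so that the final test set remains untouched until the very end.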



General Recommendations to Reduce Training Error

1. Start with a simple model
2. Increase the number of independent variables
3. Extract new features from the independent variables
4. Apply systematic hyperparameter optimization methods, e.g., grid search
5. If the training error is not satisfactory, try a more complex model, and repeat the above steps in the order provided.
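
Grid search, mentioned in item 4, evaluates every combination of candidate hyperparameter values and keeps the best one. The sketch below does this for a tiny one-dimensional k-nearest-neighbours regressor; the model, the data, and the search space are assumptions made for illustration, and scikit-learn's `GridSearchCV` provides a full implementation with cross validation.

```python
import itertools
import math
import random


def knn_predict(x_train, y_train, x, k, weighted):
    """Predict y at x from the k nearest training points (1-D inputs)."""
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]  # closer counts more
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def validation_mse(x_train, y_train, x_val, y_val, k, weighted):
    errors = [(knn_predict(x_train, y_train, x, k, weighted) - y) ** 2
              for x, y in zip(x_val, y_val)]
    return sum(errors) / len(errors)


rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(80)]           # hypothetical data
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]
x_train, y_train, x_val, y_val = xs[:60], ys[:60], xs[60:], ys[60:]

grid = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}  # assumed search space
scores = {}
for k, weighted in itertools.product(grid["k"], grid["weighted"]):
    scores[(k, weighted)] = validation_mse(x_train, y_train, x_val, y_val, k, weighted)

best = min(scores, key=scores.get)
print("best (k, weighted):", best, "validation MSE:", round(scores[best], 4))
```

The cost grows with the product of the grid sizes, so grids are usually kept coarse at first and refined around the best region.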



Scikit-Learn Design

• Scikit-Learn is a widely used machine learning package that provides many useful algorithms for a variety of machine learning problems; it is therefore important to understand the overall architecture of its objects.

• There are three primary categories of scikit-learn objects:
  • Estimators: estimate some parameters based on a dataset. They have a .fit() member.
  • Transformers: estimators that also apply a transformation to the dataset. They have a .transform() member and, for convenience, a .fit_transform() member as well.
  • Predictors: estimators that also make predictions. They have a .predict() member.

• Parameters that a scikit-learn object learns from data are named with a trailing underscore, like “coef_”. They can be accessed via <object>.<parameter>
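
A short sketch of these conventions, assuming scikit-learn and NumPy are installed. It uses `StandardScaler` as a transformer and `LinearRegression` as a predictor on made-up data that follow y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data following y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

scaler = StandardScaler()                     # a transformer
X_scaled = scaler.fit_transform(X)            # fit() then transform() in one call
print(scaler.mean_, scaler.scale_)            # learned parameters end with "_"

model = LinearRegression()                    # a predictor
model.fit(X_scaled, y)
print(model.coef_, model.intercept_)

X_new = scaler.transform(np.array([[5.0]]))   # reuse the fitted scaler
prediction = model.predict(X_new)
print(prediction)
```

Attributes such as `mean_` and `coef_` exist only after `fit()` has been called, which is what the trailing underscore signals.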



Suggested Reading

Chapter 2: pages 35 – 84
Check out the code: https://github.com/ageron/handson-ml2
Check out the exercises. The solutions are given in Appendix A.
