L3 - End To End Machine Learning Project
End-to-End Machine Learning Project
AER850: Intro to Machine Learning
Steps Involved in a Machine Learning Project
Identifying objectives and variables
Data visualization
Model training
Model evaluation
• Supervised or unsupervised?
• Regression or classification?
• Independent and dependent variables?
• Types of data?
• Example: Use California census data to build a model of housing prices in the state.
This data includes metrics such as the population, median income, and median
housing price for each district in California. Your model should learn from this data
and be able to predict the median housing price in any district, given all the other
metrics.
• Numerical
• Discrete
• The number of speakers, cameras, processor cores, or SIM cards supported by a smartphone.
• Continuous
• The temperature and operating frequency of a smartphone processor.
• Categorical
• Nominal
• Colors of smartphones. It is not possible to state that “Red” is greater than “Blue”. The same applies to a
person's gender, whose categories have no inherent order.
• Ordinal
• Size of clothing, which has an order: small < medium < large, or a letter grading system, where an A+
is definitely greater than a B.
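The four data types above can be represented explicitly in pandas. The following is a minimal sketch; the `products` table and its column names are hypothetical and only illustrate the distinction between nominal and ordinal categories.

```python
import pandas as pd

# Hypothetical product records, one column per data type.
products = pd.DataFrame({
    "cameras": [2, 3, 1],                  # numerical, discrete
    "cpu_temp_c": [41.5, 38.2, 45.0],      # numerical, continuous
    "color": ["Red", "Blue", "Red"],       # categorical, nominal
    "size": ["small", "large", "medium"],  # categorical, ordinal
})

# Nominal: categories carry no order.
products["color"] = pd.Categorical(products["color"])

# Ordinal: declare the order small < medium < large explicitly.
products["size"] = pd.Categorical(
    products["size"], categories=["small", "medium", "large"], ordered=True
)

print(products.dtypes)
print(products["size"].min())  # ordered categories support min/max and comparisons
```

Declaring the order explicitly matters because pandas would otherwise treat the size labels as unordered, just like the colors.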
https://clauswilke.com/dataviz/no-3d.html
AER850: Intro to Machine Learning | © Reza Faieghi, 2024 9
Correlation Matrix
• If the test dataset is included during visualization, model selection will be
biased, because information from outside the training dataset “leaks” into model
selection, creating a data snooping bias.
• Data snooping refers to statistical inference that the researcher decides to perform
after looking at the data (as contrasted with pre-planned inference, which the
researcher plans before looking at the data).
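One way to guard against data snooping is to split the data before any exploration and to look only at the training portion. The sketch below assumes synthetic data and an 80/20 split; both are illustrative choices, not values prescribed by the course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first, so the test set is never seen during exploration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Explore (here: a correlation matrix) using the training data only.
train_corr = np.corrcoef(X_train, rowvar=False)
print(train_corr.round(2))
```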
• Data imputation
• The process of replacing missing data with substitute values.
• Some common methods include removing data points that contain missing values, or replacing
the missing values with a neutral element (often 0) or the average value.
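The three imputation options above can be sketched as follows, assuming a small hypothetical table with missing entries. `SimpleImputer` is scikit-learn's imputation transformer; the column names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Option 1: remove data points (rows) that contain missing values.
dropped = df.dropna()

# Option 2: replace missing values with a neutral element (here 0).
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(df)

# Option 3: replace missing values with the column average.
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)

print(dropped)
print(zero_filled)
print(mean_filled)
```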
• Handling text and categorical data
• Data encoding methods
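Two common encoding methods are one-hot encoding for nominal data and ordinal encoding for ordered categories. The sketch below uses scikit-learn's `OneHotEncoder` and `OrdinalEncoder`; the color and size values are assumptions borrowed from the data type examples.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Each input must be 2-D: one row per sample, one column per feature.
colors = np.array([["Red"], ["Blue"], ["Red"]])
sizes = np.array([["small"], ["large"], ["medium"]])

# Nominal data: one binary column per category, so no order is implied.
onehot = OneHotEncoder()
color_codes = onehot.fit_transform(colors).toarray()

# Ordinal data: integer codes that follow the stated order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)

print(onehot.categories_)  # column order of the one-hot matrix
print(color_codes)
print(size_codes.ravel())
```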
• Data scaling
• Analogy: to make a mixed fruit juice, we combine the fruits in the right proportions, not
according to their sizes.
• Common methods:
• Standardization
• Normalization (aka min-max scaling), to map data into unitless ranges, typically [0,1] or [-1,1].
• Scaling to unit length
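The three scaling methods can be sketched with scikit-learn's `StandardScaler`, `MinMaxScaler`, and `Normalizer`. The data values are made up, and using `Normalizer` (which rescales each sample to unit L2 norm) for "scaling to unit length" is one interpretation, stated here as an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Standardization: each column gets zero mean and unit standard deviation.
standardized = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): each column is mapped to [0, 1].
normalized = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Scaling to unit length: each row is divided by its Euclidean norm.
unit_length = Normalizer(norm="l2").fit_transform(X)

print(standardized)
print(normalized)
print(unit_length)
```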
If at least one of the scalars is nonzero (e.g. $a_1$), the above equation can be written as
$$v_1 = \frac{-a_2}{a_1} v_2 + \dots + \frac{-a_n}{a_1} v_n$$
• A correlation matrix is often used to detect linearly dependent variables and to trim the number of
independent variables.
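A minimal sketch of this trimming step is shown below, assuming synthetic data in which the hypothetical column `x3` is an exact multiple of `x1`. The 0.95 threshold is an arbitrary illustrative choice. Note that a correlation matrix only reveals pairwise linear relationships.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 3.0 * df["x1"]  # linearly dependent on x1 by construction

corr = df.corr()  # Pearson correlation matrix

# Keep only the upper triangle so each pair is inspected once.
threshold = 0.95
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.abs().where(mask)

# Drop any column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
trimmed = df.drop(columns=to_drop)

print(corr.round(2))
print("dropped:", to_drop)
```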
• Rule of thumb: start with a simple model, and increase model complexity if needed.
• Examples of commonly used models:
• Linear and logistic regression models
• Support vector machines
• Decision trees
• Random forests
• K-means
• K-nearest neighbours
• Neural networks
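The rule of thumb can be illustrated by fitting a simple model first and a more complex one second, then comparing them on held-out validation data. This is only a sketch on synthetic data with a linear ground truth; the two models and their settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a linear ground truth.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Start simple; move to the more complex model only if it helps.
models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

val_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_mse[name] = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name}: validation MSE = {val_mse[name]:.4f}")
```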
• Performance Index
• Mean squared error
• Cross-entropy
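Both performance indices are available in `sklearn.metrics`. The sketch below computes the mean squared error for a regression example and the binary cross-entropy (called `log_loss` in scikit-learn) for a classification example, and checks the latter against its definition. All numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Regression: mean squared error between targets and predictions.
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
mse = mean_squared_error(y_true, y_pred)

# Binary classification: cross-entropy between labels and predicted
# probabilities of the positive class.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])
cross_entropy = log_loss(labels, probs)

# The same quantity computed directly from its definition.
manual = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(f"MSE = {mse:.4f}")
print(f"cross-entropy = {cross_entropy:.4f} (manual: {manual:.4f})")
```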
• At this stage, the data that was set aside for testing in Step 2 will be used to evaluate
the model.
• The test data must pass through the same pipeline that was created for the training
data. If data-dependent values were used in the pipeline, the same values that were
obtained from the training data must be applied to the test data. For example, during
standardization, the mean and standard deviation of the training data must be used to
standardize the test data; otherwise, there is a leakage of information.
• Prediction results for the test data must be evaluated using the same performance index
that was used for evaluating the training procedure.
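A scikit-learn `Pipeline` is one way to enforce this: calling `fit` on the training data stores the training mean and standard deviation, and `predict` reuses them on the test data. The sketch below assumes synthetic data and a linear model, with mean squared error as the shared performance index.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the scaler statistics from the training data only.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# predict() applies the stored training mean/std to the test data.
scaler = pipeline.named_steps["standardscaler"]
train_mse = mean_squared_error(y_train, pipeline.predict(X_train))
test_mse = mean_squared_error(y_test, pipeline.predict(X_test))

print("scaler mean (from training data):", scaler.mean_)
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Calling `fit` or `fit_transform` on the test data would recompute the statistics and reintroduce the leakage described above.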
• Use K-fold cross validation.
• The key to fine tuning is to find a trade-off between training error and test error.
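K-fold cross validation and fine tuning can be combined with `GridSearchCV`. The sketch below tunes the depth of a decision tree on synthetic data and reports both training and validation error, so the trade-off is visible. The model, the parameter grid, and the use of 5 folds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# 5-fold cross validation over a grid of tree depths.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 4, 8, 16]},
    scoring="neg_mean_squared_error",
    cv=cv,
    return_train_score=True,
)
search.fit(X, y)

# Scores are negated MSE, so flip the sign for reporting.
results = search.cv_results_
for depth, train_score, val_score in zip(
    results["param_max_depth"],
    results["mean_train_score"],
    results["mean_test_score"],
):
    print(f"max_depth={depth}: train MSE={-train_score:.3f}, "
          f"validation MSE={-val_score:.3f}")

print("best parameters:", search.best_params_)
```

Deeper trees keep lowering the training error, while the validation error typically stops improving at some point; the depth chosen by the search reflects that trade-off.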
• Each fitted scikit-learn object has a set of learned parameters that are named with a trailing
underscore, like “coef_”. They can be accessed via <object>.<parameter>
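As a small sketch of this convention, a fitted `LinearRegression` exposes its learned slope and intercept as `coef_` and `intercept_`. The data below is made up so that the exact relationship is y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free data generated from y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression().fit(X, y)

# Learned parameters exist only after fit() and end with an underscore.
print("coef_:", model.coef_)
print("intercept_:", model.intercept_)
```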
Chapter 2: pages 35 – 84
Check out the code: https://github.com/ageron/handson-ml2
Check out the exercises. The solutions are given in Appendix A.