04_MLModelingBasics
Thanks to Christian Kaestner for preparing the material for these slides
© 2024 Fraunhofer USA, Inc. - Center Mid-Atlantic
© 2024 University of Maryland
Learning Goals
▪ Explain the benefits and drawbacks of notebooks
▪ Demonstrate effective use of computational notebooks
▪ Understand how machine learning learns models from labeled data (basic model)
▪ Explain the steps of a typical machine learning pipeline and their responsibilities and challenges
▪ Understand the role of hyper-parameters
▪ Appropriately use vocabulary for machine learning concepts
▪ Evaluate a machine-learned classifier
Target array
▪ In addition to the feature matrix X, we also generally work with a label or target array, which by convention we will usually call y.
▪ The target array is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array.
▪ The target array may have continuous numerical values, or discrete classes/labels.
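The conventions above can be sketched with a small NumPy example (the values here are illustrative, loosely based on the housing data used later in these slides):

```python
import numpy as np

# Hypothetical feature matrix X: 5 samples, 3 features each
X = np.array([
    [0.006, 18.0, 2.3],
    [0.027,  0.0, 7.0],
    [0.027,  0.0, 7.0],
    [0.032,  0.0, 2.1],
    [0.069,  0.0, 2.1],
])
# Target array y: one label per sample, one-dimensional by convention
y = np.array([240_000, 216_000, 347_000, 334_000, 362_000])

assert X.shape == (5, 3)  # (n_samples, n_features)
assert y.shape == (5,)    # (n_samples,) -- one-dimensional
```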
▪ Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the fit() method of the model instance.
5. Apply the model to new data. For supervised learning, we often predict labels for unknown data using the predict() method.
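The five steps above can be sketched with a simple linear model (the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a model class

model = LinearRegression(fit_intercept=True)       # 2. choose hyperparameters

X = np.array([[1.0], [2.0], [3.0], [4.0]])         # 3. features matrix ...
y = np.array([2.0, 4.0, 6.0, 8.0])                 #    ... and target vector

model.fit(X, y)                                    # 4. fit the model

pred = model.predict(np.array([[5.0]]))            # 5. apply to new data
# The data follows y = 2x exactly, so pred is (numerically) 10.0
```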
▪ We want to evaluate the model on data it has not seen before, so we split the data into a training set and a testing set.
▪ Some datasets come with a separate test set; other times we need to do the split ourselves. We hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance.
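Doing the split ourselves is a one-liner with scikit-learn's `train_test_split` (the data below is synthetic; `test_size=0.3` holds out 30% for evaluation):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 synthetic samples, 2 features
y = np.arange(10)                 # 10 synthetic labels

# Hold out 30% of the data; fix random_state for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Train on (X_train, y_train); evaluate only on the held-out (X_test, y_test)
```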
Machine Learning: Learning Functions from Data
▪ Machine learning learns a function by observing data
▪ Typically used when writing that function manually is hard because the problem is hard or complex.
▪ Examples:
• Detecting cancer in an image
• Transcribing an audio file
• Detecting spam
• Predicting recidivism
• Detecting suspicious activity on a credit card
Supervised Machine Learning
▪ Given a training dataset containing instances (x_i, y_i)
▪ learn a function f with f(x_i) = y_i
▪ that "fits" the given training set and "generalizes" to other data.
Crime Rate  %Large Lots  %Industrial  Near River  # Rooms  ...  Price
0.006       18           2.3          0           6        ...  240,000
0.027       0            7.0          0           6        ...  216,000
0.027       0            7.0          0           7        ...  347,000
0.032       0            2.1          0           6        ...  334,000
0.069       0            2.1          0           7        ...  362,000
[Figure sources: Geeksforgeeks.org, Towards AI]
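Fitting such a function from the housing table above can be sketched with a decision tree regressor (the rows are copied from the table; a fully grown tree will memorize this tiny training set exactly, which also previews the overfitting concern discussed later):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Rows from the housing table: crime rate, %large lots, %industrial,
# near river, # rooms (remaining columns elided)
X = np.array([
    [0.006, 18, 2.3, 0, 6],
    [0.027,  0, 7.0, 0, 6],
    [0.027,  0, 7.0, 0, 7],
    [0.032,  0, 2.1, 0, 6],
    [0.069,  0, 2.1, 0, 7],
])
y = np.array([240_000, 216_000, 347_000, 334_000, 362_000])  # prices

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# An unrestricted tree splits until leaves are pure, so it reproduces
# the training labels exactly -- it "fits" but may not "generalize"
```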
▪ Gain Ratio
▪ p_i: the fraction of items labeled with class i in the set
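Using the p_i defined above, the standard Shannon entropy used in decision-tree splitting criteria (H = -sum_i p_i log2 p_i) can be computed as follows; the function name is illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum_i p_i * log2(p_i),
    where p_i is the fraction of items labeled with class i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A 50/50 split has the maximal entropy of 1 bit; a pure set has 0 bits
assert entropy(["yes", "no", "yes", "no"]) == 1.0
assert entropy(["yes", "yes"]) == 0.0
```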
Example - Golfing: f(Outlook, Temperature, Humidity, Windy) = Play golf or not
Degrees of freedom
Demo
▪ Navigate to:
▪ https://colab.research.google.com/github/jGiltinan/SE4AI_DecisionTree/blob/master/golf_TrainTestSplit.ipynb
Scikit Learn - Model Validation via Cross-Validation
• Method: repeatedly partition the data into training and validation sets, train and evaluate the model on each partition, and average the results
• Many split strategies, including:
• leave-one-out: evaluate on each datapoint, using all other data for training
• k-fold: k equal-sized partitions; train on k−1 partitions, evaluate on the held-out one
• repeated random sub-sampling (Monte Carlo)
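The k-fold strategy is directly supported by scikit-learn's `cross_val_score`; this sketch uses the bundled iris dataset as a stand-in for any classification data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k-fold with k=5: five equal-sized partitions; train on four,
# evaluate on the held-out one, once per partition
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

mean_accuracy = scores.mean()  # average the per-fold results
```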
Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” 2014.
▪ Classification:
▪ Categorical cross-entropy (CCE): L(y, ŷ) = −Σ_i y_i log(ŷ_i), with one-hot true labels y
▪ Sparse CCE: the same loss, but true labels are integer class indices, so it does not require one-hot encoding
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
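A minimal NumPy sketch of both variants (function names are illustrative); the two compute the same loss, differing only in how the true labels are encoded:

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred):
    # CCE = -sum_i y_i * log(p_i), averaged over samples
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=1))

def sparse_cce(y_true_idx, y_pred):
    # Same loss: pick the predicted probability of the true class
    # directly by integer index -- no one-hot encoding needed
    return -np.mean(np.log(y_pred[np.arange(len(y_true_idx)), y_true_idx]))

y_onehot = np.array([[1, 0, 0], [0, 1, 0]])  # one-hot labels
y_idx = np.array([0, 1])                     # same labels as class indices
p = np.array([[0.7, 0.2, 0.1],               # predicted probabilities
              [0.1, 0.8, 0.1]])

assert np.isclose(categorical_cross_entropy(y_onehot, p), sparse_cce(y_idx, p))
```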
“Why Should I Trust You?”: Explaining the Predictions of Any Classifier, Ribeiro et al., 2016
https://github.com/marcotcr/lime