
Model Selection and Training

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
In This Segment

• Model Selection and Training


• For regression problem
• For classification problem

• Multi-class classification
• Multi-output classification
Model Selection and Training

• Select based on training data


• If a prediction label/output is available, use a regression or classification model
• Regression if the output is real-valued
• Classification if the output is discrete (binary/integer)
• Else, an unsupervised model is used.
• For the house price prediction problem, use a regression model, since median
house prices are available along with the training data (predictors)
• Example: Linear Regression

• A better model may be needed to improve prediction accuracy


Regression Model Selection and Training
• Decision-tree-based regression produces low error on the training data

• but its cross-validation error is not satisfactory

• Better accuracy can be obtained with a random-forest-based regressor (see the sketch below)
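A minimal sketch of this model comparison with scikit-learn and 5-fold cross-validation. The California housing dataset stands in for the course's house-price data, and the specific models and settings are illustrative assumptions, not the lecture's exact setup.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    # cross_val_score returns negative MSE, so flip the sign and take the root
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    rmse = np.sqrt(-scores)
    print(f"{name}: mean RMSE = {rmse.mean():.3f} (std {rmse.std():.3f})")

Comparing the forest's cross-validation RMSE with the tree's illustrates the improvement the slide refers to.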
Classification Model
MNIST Dataset Classification
• A set of 70,000 small images of handwritten
digits
• Each image 28x28 pixel with intensity 0 (black) –
255 (white)
• Input data represented as 70000 x 784 matrix
• Each image is labeled with the digit it represents.
Classification Model Training
Detect a ‘5’
• Segment the dataset into 60,000 training images and 10,000 test images

• Shuffle the training dataset

• Train a classification model for detecting ‘5’. The target output for a training
instance corresponding to an image of ‘5’ is 1; otherwise the target output is 0

• Perform cross-validation as in the regression problem and try out multiple
classification models to achieve acceptable performance (a sketch follows)
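A hedged sketch of this workflow with scikit-learn: the choice of SGDClassifier below is illustrative (the slides do not prescribe a specific classifier); everything else follows the split, shuffle, and binarize steps above.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(np.uint8)

# 60,000 training images / 10,000 test images, then shuffle the training set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffle_idx = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_idx], y_train[shuffle_idx]

# Binary target: 1 for images of '5', 0 otherwise
y_train_5 = (y_train == 5).astype(int)

clf = SGDClassifier(random_state=42)
print(cross_val_score(clf, X_train, y_train_5, cv=3, scoring="accuracy"))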
Multiclass Classification

• Multiclass classifiers (aka multinomial classifiers) can distinguish between more


than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers)
are capable of handling multiple classes directly.
• Many (such as Support Vector Machine classifiers or Linear classifiers) are strictly
binary
• One-versus-all (OvA) or One-versus-rest strategy using multiple binary classifiers.
• e.g., for MNIST classification, train 10 binary classifiers, one for each digit (a 0-detector,
a 1-detector, a 2-detector, and so on).
• get the decision score from each classifier for that image and select the class whose
classifier outputs the highest score.
• One-versus-one (OvO) strategy
• train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to
distinguish 0s and 2s, another for 1s and 2s, and so on.
• If there are N classes, you need to train N × (N – 1) / 2 classifiers.
• Run an image through all 45 classifiers and see which class wins the most duels.
• Main advantage of OvO is each classifier only needs to be trained on the part of
the training set for the two classes that it must distinguish
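A hedged sketch of both strategies using scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers around a binary LinearSVC; the smaller load_digits dataset and the choice of base classifier are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)          # 10 digit classes

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))  # trains 10 binary classifiers
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000))   # trains 10 * 9 / 2 = 45 classifiers

print("OvR accuracy:", cross_val_score(ovr, X, y, cv=3).mean())
print("OvO accuracy:", cross_val_score(ovo, X, y, cv=3).mean())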
Multi Label / Output Classification

• In multilabel classification, the classifier should output multiple binary labels for
each instance.
• e.g., if the classifier has been trained to recognize three faces, A, B, and C, then
when it is shown a picture of A and C, it should output [1, 0, 1]

• In multioutput-multiclass or simply multioutput classification each label can


be multiclass, i.e., it can have more than two possible values.
• e.g., in a system to remove noise from images, the input is a noisy digit image and the
output is a clean digit image, represented as an array of pixel intensities
• the classifier’s output is multilabel: one label per pixel
• each label can have multiple values (pixel intensity ranges from 0 to 255)
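A minimal multilabel sketch with scikit-learn: the two labels below ("large digit" and "odd digit") are invented purely for illustration, and k-nearest neighbours is used because it supports multilabel targets natively.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Two binary labels per instance: [digit >= 7, digit is odd]
y_multilabel = np.c_[y >= 7, y % 2 == 1]

knn = KNeighborsClassifier()
knn.fit(X, y_multilabel)
print(knn.predict(X[:1]))   # e.g. [[False False]] for the first image (a '0')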
Thank You!
In our next segment: Model Evaluation
Model Evaluation

• Metrics for Performance Evaluation


• How to evaluate the performance of a model?

• Methods for Performance Evaluation


Metrics for Performance Evaluation
Focus on the predictive capability of a model
• Confusion Matrix

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a (TP)       b (FN)
CLASS      Class=No     c (FP)       d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a (TP)       b (FN)
CLASS      Class=No     c (FP)       d (TN)

• Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy

• Consider a 2-class problem


• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %


• Accuracy is misleading because the model does not detect any class 1 example
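A small worked example of this pitfall, assuming scikit-learn: a classifier that always predicts class 0 on the 9990/10 split above scores 99.9% accuracy while detecting nothing.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 9990 + [1] * 10)
X = np.zeros((10000, 1))                      # features are irrelevant here

clf = DummyClassifier(strategy="constant", constant=0).fit(X, y_true)
y_pred = clf.predict(X)

print(accuracy_score(y_true, y_pred))   # 0.999
print(recall_score(y_true, y_pred))     # 0.0 -- no class 1 example is detected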
Measures for Imbalanced Classes

Precision (p) = a / (a + c)

Recall (r) = a / (a + b)

F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
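A short sketch that computes these measures from a confusion matrix, assuming scikit-learn; the toy labels are made up for illustration.

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # d, c, b, a

p = tp / (tp + fp)          # precision = a / (a + c)
r = tp / (tp + fn)          # recall    = a / (a + b)
f = 2 * r * p / (r + p)     # F-measure = 2a / (2a + b + c)

print(p, r, f)
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))   # matches the hand-computed values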
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning
algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets
Learning Curve

 Learning curve shows how accuracy changes with varying sample size
 Requires a sampling schedule for creating the learning curve:
 Arithmetic sampling (Langley, et al.)
 Geometric sampling (Provost et al.)
 Effect of small sample size:
- Bias in the estimate
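A minimal learning-curve sketch with scikit-learn's learning_curve utility, using a geometric sampling schedule; the dataset and estimator are illustrative choices.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5,
    train_sizes=np.geomspace(0.1, 1.0, num=5))   # geometric sampling schedule

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> CV accuracy {score:.3f}")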
Methods of Estimation

• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Stratified sampling
• oversampling vs undersampling
• Bootstrap
• Sampling with replacement
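A hedged sketch of three of these estimation methods (holdout, k-fold cross-validation, bootstrap) with scikit-learn; the dataset and model are illustrative.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import resample

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout:", model.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation (k = 10)
print("10-fold CV:", cross_val_score(model, X, y, cv=10).mean())

# Bootstrap: sample n instances with replacement, test on the out-of-bag rest
idx = resample(np.arange(len(X)), replace=True, random_state=0)
oob = np.setdiff1d(np.arange(len(X)), idx)
print("bootstrap:", model.fit(X[idx], y[idx]).score(X[oob], y[oob]))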
ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals


• Characterize the trade-off between positive hits and false alarms
• ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
• Performance of each classifier is represented as a point on the ROC curve
• changing the algorithm’s threshold, the sample distribution, or the cost matrix changes
the location of the point
ROC Curve

• 1-dimensional data set containing 2 classes (positive and


negative)
• any point located at x > t is classified as positive

At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
ROC Curve

(TPR, FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal

• Diagonal line:
• Random guessing
• Below diagonal line:
• prediction is opposite of the
true class
Using ROC for Model Comparison

 No model consistently outperforms the other
 M1 is better for small FPR
 M2 is better for large FPR

 Area Under the ROC curve


 Ideal:
 Area = 1
 Random guess:
 Area = 0.5
How to Construct an ROC curve

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)

Instance   P(+|A)   True Class
    1       0.95        +
    2       0.93        +
    3       0.87        -
    4       0.85        -
    5       0.85        -
    6       0.85        +
    7       0.76        -
    8       0.53        +
    9       0.43        -
   10       0.25        +
How to construct an ROC curve

ROC Curve:

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
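A hedged sketch that reproduces this computation with scikit-learn's roc_curve (drop_intermediate=False keeps every threshold); the scores and labels are copied from the example above.

from sklearn.metrics import roc_auc_score, roc_curve

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]     # '+' = 1, '-' = 0

fpr, tpr, thresholds = roc_curve(labels, scores, drop_intermediate=False)
print("thresholds:", thresholds)
print("TPR:", tpr)
print("FPR:", fpr)
print("AUC:", roc_auc_score(labels, scores))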


Thank You!
In our next segment: Hyperparameter Optimization
Hyperparameters
Machine Learning systems are configured by many parameters that are not learned from the data

• Gradient Descent
• e.g., Learning rate, how long to run
• Mini-batch
• Batch size
• Regularization constant
• Many Others
• will be discussed in upcoming sessions
Hyperparameter Optimization

• Also called metaparameter optimization


• Also called tuning

• How to find best values of hyperparameters?


Tuning By Hand

• Just fiddle with the parameters until you get the results you want

• Probably the most common type of hyperparameter optimization

• Upsides: the results are generally pretty good…

• Downsides: lots of effort, and no theoretical guarantees


Grid Search

• Define some grid of parameters you want to try


• Try all the parameter values in the grid
• By running the whole system for each setting of parameters
• Then choose the setting with the best result
• Essentially a brute force method
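A minimal grid-search sketch with scikit-learn's GridSearchCV; the model and the grid values are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)    # brute force: trains 3 x 3 x 5 = 45 models
print(search.best_params_, search.best_score_)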
Downsides of Grid Search

• As the number of parameters increases, the cost of grid search increases


exponentially!
• Need some way to choose the grid properly
• Sometimes this can be as hard as the original hyperparameter
optimization problem
• Can’t take advantage of any insight you have about the system!
Making Grid Search Fast

• Early stopping to the rescue


• Run all the grid points for one epoch, then discard the half that performed worst, then
run for another epoch, discard half again, and continue (successive halving; see the sketch after this list).

• Can take advantage of parallelism


• Run all the different parameter settings independently on different servers in a
cluster.
• An embarrassingly parallel task.
• Downside: doesn’t reduce the energy cost.
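The "discard half each round" idea is what scikit-learn implements as successive halving; a hedged sketch with the (experimental) HalvingGridSearchCV follows, with an illustrative model and grid.

from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required import)
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

X, y = load_digits(return_X_y=True)
param_grid = {"max_depth": [3, 5, 10, None], "n_estimators": [25, 50, 100]}

search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0), param_grid,
    factor=2,               # keep the best half of the candidates each round
    resource="n_samples",   # give the survivors more training data each round
    cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)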
One Variant: Random Search

• This is just grid search, but with randomly chosen points instead of points
on a grid.
• scikit-learn implements this as RandomizedSearchCV

• This mitigates the curse of dimensionality


• Don’t need to increase the number of grid points exponentially as the number
of dimensions increases.

• Problem: with random search, not necessarily going to get anywhere near the
optimal parameters in a finite sample.
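A minimal random-search sketch (note the scikit-learn class is named RandomizedSearchCV); the sampling distributions and the model are illustrative.

from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

param_distributions = {
    "n_estimators": randint(20, 200),   # sampled at random, not enumerated on a grid
    "max_depth": randint(3, 30),
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)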
An Alternative: Bayesian Optimization

• Statistical approach for minimizing noisy black-box functions.

• Idea: learn a statistical model of the function from hyperparameter values to


the loss function
• Then choose parameters to minimize the loss

• Main benefit: choose the hyperparameters to test not at random, but in a way that
gives the most information about the model
• This lets it learn faster than grid search
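A heavily hedged sketch of this idea using gp_minimize from scikit-optimize (an external package, not part of scikit-learn); the objective function and search space are illustrative only.

from skopt import gp_minimize
from skopt.space import Integer
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def objective(params):
    n_estimators, max_depth = params
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth, random_state=0)
    return -cross_val_score(clf, X, y, cv=3).mean()   # minimize negative CV accuracy

space = [Integer(20, 200, name="n_estimators"),
         Integer(3, 30, name="max_depth")]

# A Gaussian-process surrogate model picks each next point to evaluate
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, -result.fun)   # best hyperparameters and best CV accuracy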
Effect of Bayesian Optimization

• Downside: it’s a pretty heavyweight method


• The updates are not as simple-to-implement as grid search

• Upside: empirically it has been demonstrated to get better results in fewer


experiments
• Compared with grid search and random search

• Pretty widely used method


• Lots of research opportunities here.
Cross-Validation

• Partition off part of the available data to create a validation dataset that we don’t
use for training.

• Then use that set to evaluate the hyperparameters.

• Typically, multiple rounds of cross-validation are performed using different


partitions
• Can get a very good sense of how good the hyperparameters are
• But at a significant computational cost!
Thank You!
In our next segment: Machine Learning Pipeline
Machine Learning Pipeline

Dr. Sugata Ghosal


[email protected]
In This Segment

• What is MLOps
• DevOps vs MLOps

• Level 0 MLOps: manual process

• Level 1 MLOps: Continuous Training

• Level 2 MLOps: Continuous Integration, Delivery

• Frameworks
What is MLOps?
Apply DevOps principles to ML systems
• An engineering culture and practice that aims at unifying ML system
development (Dev) and ML system operation (Ops).

• Automation and monitoring at all steps of ML system construction, including


integration, testing, releasing, deployment and infrastructure management.

• Given relevant training data for their use case, data scientists can implement and

train an ML model with good predictive performance on an offline validation
(holdout) dataset.

• However, the real challenge is building an integrated ML system and

continuously operating it in production.
Ecosystem of ML System Components
Only a small fraction of a real-world ML system is composed of ML code
DevOps Vs. MLOps

• DevOps for developing and operating large-scale software systems provides


benefits such as
• shortening the development cycles
• increasing deployment velocity, and
• dependable releases.
• Two key concepts
• Continuous Integration (CI)
• Continuous Delivery (CD)
• An ML system is a software system, so similar practices apply to reliably build
and operate at scale.
• However, ML systems differ from other software systems
• Team skills: focus on exploratory data analysis, model development, and
experimentation.
• Development: ML is experimental in nature.
• The challenge is tracking what worked and what did not, maintaining reproducibility, and
maximizing code reusability.
• Testing: Additional testing needed for data validation, trained model quality
evaluation, and model validation.
DevOps Vs. MLOps

• Deployment: deploying an ML system involves a multi-step pipeline to automatically
retrain and deploy the model.
• this adds complexity
• steps that data scientists perform manually before deployment to train and validate
new models must be automated.
• Production: ML models can have reduced performance due to constantly
evolving data profiles.
• Need to track summary statistics of the data and
• monitor the online performance of the model to send notifications or roll back
when values are suboptimal
• ML and other software systems are similar in CI of source control, unit /
integration testing, and CD of the software module / package.
• However, in ML,
• CI is also about testing and validating data, data schemas, and models.
• CD is a system (an ML training pipeline) that automatically deploys another
service (model prediction service).
• Continuous training (CT) is a new property, unique to ML systems, that is
concerned with automatically retraining the model in production and
serving the models.
MLOps Level 0: Manual ML Steps
• Manual, script-driven, and interactive process.
• Disconnection between ML and operations, possibly leading to training-
serving skew
• Infrequent release iterations. No CI, CD, active performance monitoring
• Deploy trained Model as a prediction service
• Deployment process is concerned only with deploying the trained model as a
prediction service, e.g., a microservice with a REST API
MLOps Level 1

• Perform continuous training (CT) by automating the ML pipeline

• Achieves continuous delivery of the model prediction service

• Adds automated data and model validation steps to the pipeline

• Needs pipeline triggers and metadata management
Data and Model Validation

• Data validation: Required prior to model training to decide whether to retrain


the model or stop the execution of the pipeline based on following
• Data values skews: significant changes in the statistical properties of data,
triggering retraining
• Data schema skews: downstream pipeline steps, including data processing and
model training, receive data that doesn't comply with the expected schema.
• stop the pipeline to release a fix or an update that handles these changes in the schema.
• Schema skews include receiving unexpected features, receiving features with
unexpected values, or not receiving all the expected features

• Model validation: Required after retraining the model with the new data.
Evaluate and validate the model before promoting to production. This offline
model validation step consists of
• Producing evaluation metrics using the trained model on test data to assess the
model quality.
• Comparing those evaluation metrics against the current production model, a baseline
model, or other business-requirement models.
• Ensuring the consistency of model performance on various data segments
• Testing the model for deployment, including infrastructure compatibility and consistency with the prediction service API
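A hedged sketch of simple data-validation checks (schema skew and value skew) using pandas; the expected schema, the drift threshold, and the training_stats layout are illustrative assumptions, not a production recipe.

import pandas as pd

EXPECTED_COLUMNS = {"median_income": "float64", "households": "int64"}

def validate_batch(df: pd.DataFrame, training_stats: pd.DataFrame) -> bool:
    # Schema skew: missing / unexpected features or wrong dtypes -> stop the pipeline
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns or str(df[col].dtype) != dtype:
            raise ValueError(f"schema skew on column {col!r}: stop the pipeline")

    # Data value skew: compare batch statistics with training-time statistics
    for col in EXPECTED_COLUMNS:
        drift = abs(df[col].mean() - training_stats.loc["mean", col])
        if drift > 3 * training_stats.loc["std", col]:
            return True    # significant drift -> trigger retraining
    return False

Here training_stats is assumed to be the summary statistics (e.g., from DataFrame.describe()) saved at training time.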
Level 2: CI/CD Pipeline Automation
Stages of CI/CD Automation Pipeline

1) Development and experimentation: iteratively try new ML algorithms and


modeling. The output is the source code of the ML pipeline steps that are
then pushed to a source repository.
2) Pipeline continuous integration: build source code and run various tests.
The outputs of this stage are pipeline components (packages,
executables, and artifacts).
3) Pipeline continuous delivery: deploy artifacts produced by the CI stage to
the target environment.
4) Automated training: automatically executed in production based on a
schedule or trigger. The output is a trained model pushed to the model
registry.
5) Model continuous delivery: serve the trained model as a prediction
service for the predictions.
6) Monitoring: collect statistics on the model performance based on live
data. The output is a trigger to execute the pipeline or to execute a new
experiment cycle.
Stages of the CI/CD automated ML
pipeline
Continuous Integration

• Pipeline and its components are built, tested, and packaged when
• new code is committed or
• pushed to the source code repository.

• Besides building packages, container images, and executables, CI process


can include
• Unit testing feature engineering logic.
• Unit testing the different methods implemented in your model.
• For example, a function that accepts a categorical data column and encodes the
column as a one-hot feature.
• Testing for training convergence
• Testing for NaN values due to dividing by zero or manipulating small or large
values.
• Testing that each component in the pipeline produces the expected artifacts.
• Testing integration between pipeline components.
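A hedged sketch of what two such CI unit tests might look like with pytest; the one_hot_encode feature function is a hypothetical example, not code from the lecture.

import numpy as np
import pandas as pd

def one_hot_encode(column: pd.Series) -> pd.DataFrame:
    return pd.get_dummies(column)

def test_one_hot_encoding_shape():
    col = pd.Series(["red", "green", "blue", "green"])
    encoded = one_hot_encode(col)
    assert encoded.shape == (4, 3)              # one column per category
    assert (encoded.sum(axis=1) == 1).all()     # exactly one hot bit per row

def test_no_nan_values_in_features():
    features = np.array([1.0, 0.5, 2.0]) / np.array([2.0, 1.0, 4.0])
    assert not np.isnan(features).any()         # guard against divide-by-zero artifacts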
Continuous Delivery

• Continuously delivers new pipeline implementations to the target
environment
• which in turn deliver prediction services for the newly trained model.
• For rapid and reliable continuous delivery of pipelines and models, consider
• Verifying the compatibility of the model with the target infrastructure
• e.g., required packages are installed in the serving environment
• Availability of memory, compute, and accelerator resources.
• Testing the prediction service by calling the service API for the updated model
• Testing prediction service performance, such as throughput, latency.
• Validating the data either for retraining or batch prediction.
• Verifying that models meet the predictive performance targets prior to deployment.
• Automated deployment to a test environment, triggered by new code to the
development branch.
• Semi-automated deployment to a pre-production environment, triggered by code
merging
• Manual deployment to the production environment from pre-production.
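A hedged smoke-test sketch for a deployed prediction service; the endpoint URL, payload, and latency budget are hypothetical placeholders for whatever the real serving environment exposes.

import time
import requests

ENDPOINT = "http://localhost:8080/predict"   # hypothetical REST endpoint

def smoke_test(payload, max_latency_s=0.5):
    start = time.time()
    response = requests.post(ENDPOINT, json=payload, timeout=5)
    latency = time.time() - start

    assert response.status_code == 200, "prediction service not reachable"
    assert "prediction" in response.json(), "unexpected response schema"
    assert latency < max_latency_s, f"latency {latency:.3f}s exceeds target"

smoke_test({"features": [8.3252, 41.0, 6.98, 1.02]})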
Frameworks
Cloud Vendors are providing MLOps framework
• https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
• Kubeflow and Cloud Build

• Amazon AWS MLOps

• Microsoft Azure MLOps


Thank You!
In our next segment: Linear Regression
