L3 Overview of ML Model Development Lifecycle-1
Figure: overall workflow — Data → Data pre-processing → ML model development → AI algorithm development → Optimizer → DSS, supporting applications such as Modelling & Simulation and APC.
Data Preparation: Preprocessing and Data Wrangling
Median:
• The "middle" value of a dataset.
• A robust alternative to the mean: it is not affected by outliers as long as less than 50% of the data is contaminated.
Kth percentile:
• The score below which k% of the dataset falls.
Ex: For N ∈ {1 .. 1000}, the median (Q2) ≈ 500, the 25th percentile (Q1) ≈ 250, and the 75th percentile (Q3) ≈ 750.
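A quick way to check these definitions is with NumPy; the sketch below uses the 1..1000 example above (the exact quartile values depend on NumPy's default interpolation).

```python
# Minimal sketch: median and percentiles with NumPy, using the 1..1000 example above.
import numpy as np

data = np.arange(1, 1001)

print(np.median(data))          # Q2 -> 500.5 (the "middle" value)
print(np.percentile(data, 25))  # Q1 -> 250.75 with NumPy's default linear interpolation
print(np.percentile(data, 75))  # Q3 -> 750.25
```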
Covariance is a measure of how much two variables change together. For two data sets X and Y it is estimated as

$$\mathrm{Cov}(X, Y) = \sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$$

where $\mu_x$ and $\mu_y$ are the means of X and Y.

The correlation coefficient $\rho$ depicts the strength of the relationship between the data sets. For the same data sets X and Y,

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

The value of $\rho$ always lies in $[-1, 1]$.
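Both quantities can be computed directly with NumPy; a small sketch with made-up x and y values (assumed data) is below.

```python
# Minimal sketch (made-up data): covariance and correlation with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Population covariance, matching the 1/N formula above.
cov_xy = np.cov(x, y, bias=True)[0, 1]

# Correlation coefficient rho = sigma_XY / (sigma_X * sigma_Y), always in [-1, 1].
rho = cov_xy / (x.std() * y.std())

# np.corrcoef returns the same value directly.
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(cov_xy, rho)
```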
Standard Score (z-score): choose this method when you want to reduce the influence of outliers and preserve the relative distances between values. It is also a common choice when downstream methods assume approximately normally distributed inputs.
Min-Max Scaling: opt for this when you need values in a bounded range. However, be mindful that it is sensitive to outliers.
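Both transformations are available in scikit-learn; the sketch below (with assumed toy data containing one outlier) contrasts the two.

```python
# Minimal sketch (assumed toy data): standard score vs. min-max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # note the single outlier at 100

# Standard score (z-score): zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: values bounded to [0, 1]; the outlier squeezes the rest near 0.
X_mm = MinMaxScaler().fit_transform(X)

print(X_std.ravel())
print(X_mm.ravel())
```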
Model Development
Figure: the model development workflow — Exploratory Data Analysis (EDA) → model development → model deployment — starting from the data and targets, with noise reduction applied during preparation.
Exploratory Data Analysis (EDA)
Example EDA operations (sample analysis in Python):
• Identify the number of rows and columns and their labels.
• Check for any missing / null values.
• Count how many columns are integer / real (float) typed.
• Calculate the mean, standard deviation, minimum and maximum values, and the quantiles of the data.
An example EDA walkthrough is given in: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
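With pandas, these checks are one-liners; the sketch below assumes the data has been loaded from a hypothetical data.csv file.

```python
# Minimal sketch of the EDA checks above, assuming a hypothetical CSV file.
import pandas as pd

df = pd.read_csv("data.csv")        # assumed file name

print(df.shape)                     # number of rows and columns
print(df.columns.tolist())          # column labels
print(df.isnull().sum())            # missing / null values per column
print(df.dtypes.value_counts())     # how many integer / float / object columns
print(df.describe())                # mean, std, min, max and quantiles
```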
1. Define the metrics used in measuring the model's performance (see the Python sketch after step 2).
• R², Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) are commonly used metrics for regression models that output numeric values.
• Precision, Recall, F1, and Area Under the receiver operating characteristic (ROC) Curve (AUC) are commonly used for classification models that output categorical values.
2. Clean the data set.
• Remove duplicated records.
• Handle data with missing values, either by imputing (filling in) the missing values or by removing the problematic rows or columns.
• Find any potential outliers and handle them.
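All of the metrics named in step 1 are available in scikit-learn; the sketch below uses small made-up arrays (assumed values) just to show the calls.

```python
# Minimal sketch (made-up arrays): the regression and classification metrics from step 1.
import numpy as np
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_percentage_error,
                             precision_score, recall_score, f1_score, roc_auc_score)

# Regression metrics
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6])
print(r2_score(y_true, y_pred))
print(np.sqrt(mean_squared_error(y_true, y_pred)))      # RMSE
print(mean_absolute_percentage_error(y_true, y_pred))   # MAPE

# Classification metrics
y_true_c = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.7])            # predicted scores
y_pred_c = (y_prob >= 0.5).astype(int)                  # hard labels at a 0.5 threshold
print(precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c),
      f1_score(y_true_c, y_pred_c),
      roc_auc_score(y_true_c, y_prob))                  # AUC uses the scores, not the labels
```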
3. Split data into three sets (the percentages can vary according to problem nature and data availability; see the sketch after step 4).
• The training set (commonly 70% of the available data) is used to train the model.
• The validation set (commonly 10%) is used to optimize model hyperparameters (for example, the regularization weight of a linear regression model).
• The test set (commonly 20%) is used only to measure the performance of the final trained model on unseen data.
• If the output classes are highly imbalanced, resample the data to balance the output classes:
  • over-sample records from the rare classes and under-sample records from the frequent classes;
  • optionally, create synthetic data records for the rare classes.
4. Choose the model structure.
• Tune the model's hyperparameters (for example, the regularization weight in linear regression and the number of nearest-neighbour data points in KNN classification).
• For each hyperparameter configuration, a model is trained on the training data and its performance is evaluated with the validation data.
• The model with the best performance will typically be selected as the final model. However, depending on requirements, we may opt to trade performance for a model that is more explainable, fairer, or one that supports faster inference.
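A minimal sketch of steps 3 and 4, assuming a synthetic regression data set and Ridge regression as the model family, is shown below; the 70/10/20 split and the list of candidate regularization weights are illustrative choices.

```python
# Minimal sketch (assumed data and model family): 70/10/20 split plus validation-based tuning.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# Carve off the 20% test set first, then split the remainder into 70% train / 10% validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.125, random_state=42)

# Try several regularization weights; keep the one with the best validation RMSE.
best_alpha, best_rmse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
```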
5. Evaluate the final model’s performance based on metrics defined in step 1 (“Define
the metrics used in measuring the model’s performance”) using the test set.
• If we are happy with the performance of the final model, we can save the final
model into the model repository and deploy the model into production.
• If not, we need to go back to do more data exploration, create new features, and
repeat the training cycle.
• Follow company policies on model governance and check to make sure the model is not using protected attributes (such as gender, age, etc.) in making decisions or predictions.
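Persisting the accepted model into a repository is often done with joblib; a minimal sketch with assumed file names is below (the trained estimator stands in for whatever final model step 4 produced).

```python
# Minimal sketch (assumed names and paths): saving the accepted model and loading it back later.
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
final_model = Ridge(alpha=1.0).fit(X, y)          # stands in for the model selected in step 4

joblib.dump(final_model, "final_model.joblib")    # save into the model repository (assumed path)

restored = joblib.load("final_model.joblib")      # later: load the model for deployment
print(restored.predict(X[:3]))
```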
• There are two ways to expose the model: package the model as a library function called by
application code or package the model as a RESTful API service hosted by the model serving
platform.
• Optionally, we can deploy the latest model incrementally to a limited percentage of users and, in
parallel, run an A/B test to compare it with the currently deployed production model. Based on the
performance difference, we can decide whether we should roll forward to the latest model.
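A minimal sketch of the RESTful-API option, assuming Flask and the final_model.joblib artifact from the previous step, is shown below; the route name and payload format are illustrative. The library-function alternative is simply importing the saved model and calling its predict method from application code.

```python
# Minimal sketch (assumed framework and file names): serving the model as a RESTful API with Flask.
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("final_model.joblib")        # artifact produced by the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[0.1, 0.2, 0.3, 0.4, 0.5]]}
    features = np.array(request.get_json()["features"])
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```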
• Monitor the model’s performance to detect whether there is drift in the following:
• Distribution of input variables: if new data deviates significantly from the training data, it is a good
indication that something in the environment has changed.
• Model performance metrics: if the model shows degraded performance on new data relative to its
performance during the training phase, model retraining may be necessary.
• If drift is detected, retrain the model with the latest refreshed data. Model retraining may also be scheduled
in some cases. For example, you may choose to retrain the model every week on the latest data.
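One simple way to flag drift in an input variable's distribution is a two-sample Kolmogorov-Smirnov test from SciPy; the sketch below uses synthetic training and production samples and an assumed significance threshold.

```python
# Minimal sketch (synthetic data): flagging input-distribution drift with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature values seen during training
new_feature = rng.normal(loc=0.5, scale=1.0, size=1000)     # the same feature in recent production data

stat, p_value = ks_2samp(train_feature, new_feature)

# A very small p-value means the new data no longer looks like the training data,
# which is a trigger to investigate and possibly retrain the model.
if p_value < 0.01:                                          # assumed significance threshold
    print("Drift detected: retrain the model on refreshed data.")
```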
EDA Toolbox
• Data sets: Numpy, Scikit-learn
• Preprocessing: Scikit-learn
• Exploratory Data Analysis (EDA): Matplotlib, Seaborn, Plotly
• Model development (training, validation): Scikit-learn, Scipy, Tensorflow, PyTorch, Keras