L3 Overview of ML Model Development Lifecycle-1

Industries: Oil and Gas, Petrochemical, Pulp and Paper, Water and Wastewater, Metal Industries

Applications pipeline: Data → Data pre-processing → ML model development → AI algorithm development → Optimizer → DSS
Supporting layers: Modelling & Simulation, APC

Dr. Senthilmurugan Subbiah, Department of Chemical Engineering, IITG.

Overview of ML model development lifecycle
(in Python and MATLAB as software tools)

Machine learning model development lifecycle: overview
• Data collection: data gathering and planning
• Preprocessing: data preparation and data wrangling
• Exploratory Data Analysis (EDA): analyse the data and engineer the model
• Model development (training, validation): train the model, test the model, improve the model performance
• Deployment: monitoring and maintenance

February 5, 2024 | Slide 2


Mean and Standard deviation

• For a given dataset, the mean is the sum of the data divided by the number of data points.
• The mean provides the central tendency of the data, i.e. the value around which the data is centered.

Let X = {x1, x2, …, xn}; then

  Mean(μ) = (x1 + x2 + … + xn) / n

• Variance provides the spread of the data around the mean:

  For a population:  variance σ² = Σ(xi − μ)² / N
  For a sample:      variance s² = Σ(xi − x̄)² / (N − 1)

• Numerically, variance is the average of the squared deviations from the mean.
• The units of variance are the square of the data units and are therefore difficult to relate to the actual data.
• The square root of the variance is the standard deviation, e.g. for a sample:

  Standard deviation s = √( Σ(xi − x̄)² / (N − 1) )
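The formulas above can be checked numerically. A minimal sketch in NumPy (the data values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / len(x)                              # mu = (x1 + ... + xn) / n
var_pop = ((x - mean) ** 2).sum() / len(x)           # population variance: divide by N
var_sample = ((x - mean) ** 2).sum() / (len(x) - 1)  # sample variance: divide by N - 1
std_sample = np.sqrt(var_sample)                     # standard deviation

# NumPy equivalents: np.var(x) divides by N; np.var(x, ddof=1) divides by N - 1.
```

For this dataset the mean is 5 and the population variance is 4, which is easy to verify by hand from the squared deviations.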
Median and Percentiles

Median:
• The "middle" value of a dataset.
• Plays a role equivalent to the mean but is robust: it is not affected by outliers as long as less than 50% of the data is contaminated.

Kth percentile:
• The score below which k% of the dataset falls.

Example: if N ∈ {1 … 1000}, then the median (Q2) ≈ 500, the 25th percentile (Q1) ≈ 250, and the 75th percentile (Q3) ≈ 750 (exact values depend on the interpolation method used).
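The example above can be reproduced with NumPy; note that NumPy's default linear interpolation gives values slightly offset from the round numbers on the slide:

```python
import numpy as np

data = np.arange(1, 1001)     # N = 1, 2, ..., 1000

q2 = np.median(data)          # the "middle" value
q1 = np.percentile(data, 25)  # 25th percentile
q3 = np.percentile(data, 75)  # 75th percentile
# With linear interpolation: q2 = 500.5, q1 = 250.75, q3 = 750.25
```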


Covariance and Correlation coefficient

Covariance is a measure of how much two variables change together. For two data sets X and Y it is estimated as

  Cov(X, Y) = σXY = (1/N) Σᵢ₌₁..N (xi − μx)(yi − μy)

where μx and μy are the means of X and Y.

The correlation coefficient (ρ) depicts the strength of the relationship between the data sets. For the same data sets X and Y,

  ρ = σXY / (σX σY)

The value of ρ is always between [−1, 1].
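A small sketch of both formulas on made-up paired data, where y is an exact linear function of x so that ρ comes out as 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2x, so rho should be exactly 1

# Population covariance: (1/N) * sum((x_i - mu_x) * (y_i - mu_y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation coefficient: rho = cov / (sigma_x * sigma_y), always in [-1, 1]
rho = cov_xy / (x.std() * y.std())
```

NumPy's `np.corrcoef(x, y)` computes the same coefficient directly.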


Normalization
Different types

For a given vector X:

• Standard score:   Z = (X − μ) / σ
• Min-Max scaling:  X̂ = (X − Xmin) / (Xmax − Xmin)

Where,
  Z = standard score
  X̂ = normalized vector
  μ = mean of the vector
  σ = standard deviation
  Xmin = minimum value of the vector
  Xmax = maximum value of the vector
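Both normalizations in a minimal NumPy sketch (the input vector is made up; scikit-learn's `StandardScaler` and `MinMaxScaler` implement the same transformations for DataFrames):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z = (x - x.mean()) / x.std()                 # standard score: mean 0, std 1
x_hat = (x - x.min()) / (x.max() - x.min())  # min-max scaling: values in [0, 1]
```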


Use Cases for Standard Score

• When you want to compare scores from different distributions.
• In machine learning algorithms that assume normally distributed data, like Linear Regression, Logistic Regression, and Neural Networks.
• When dealing with features with different units or scales, as Z-score normalization removes the units.

Example in Chemical Engineering: in process optimization, if you're analyzing variables like temperature, pressure, and reaction time, which are on different scales, Z-score normalization can bring them to a common scale, facilitating a more straightforward comparative analysis.


Use Cases for Min-Max Scaling

• When you need values in a bounded interval.
• In algorithms that do not assume any distribution of the data, like Decision Trees and K-Nearest Neighbors.
• In image processing and neural networks working with image data, where pixel intensities need to be normalized.

Example in Chemical Engineering: if you're developing novel desalination technologies and need to normalize the range of salinity levels or contaminant concentrations across different samples, Min-Max Scaling can be used to bring all values into a consistent range for better comparability.


Choosing Between Standard Score and Min-Max Scaling

Standard Score: Choose this method when you need to handle outliers and want to
preserve relative distances between values. It's also useful when the data needs to be
normally distributed.

Min-Max Scaling: Opt for this when you need values in a bounded range. However, be
mindful that it can be sensitive to outliers.



Machine learning model development lifecycle

Data collection and preprocessing → Exploratory Data Analysis (EDA) → Model Development → Model Deployment


Data collection
Requirement collection and Data Gathering

Model requirements:
• Model Requirements: Define the ML model's purpose and its operational range.
• Data Availability: Ensure sufficient data for training, access to ongoing data updates, and evaluate synthetic data use to cut costs.
• Applicability: Confirm the solution addresses the problem and assess the feasibility of using ML for the issue.
• Legal Constraints: Secure data source permissions, uphold ethical data collection standards, and consider societal, user, and safety impacts.
• Robustness & Scalability: Evaluate the application's robustness and capacity for growth.
• Explainability: Ensure the ML model's decisions are interpretable and results are verifiable.
• Resource Availability: Check for adequate computational, storage, and network resources, and qualified personnel.

Data gathering process:
• Identify Data Sources: Distinguish between labeled/unlabeled and real-time/offline data.
• Data Collection Protocol: Set up a system for accumulating data into a database.
• Data Integration: Merge data from various sources sequentially into the database.


Preprocessing

Data pre-processing involves the following activities:

• Data Exploration
  • Understand the nature of the data we have to work with.
  • Understand the characteristics, format, and quality of the data.
  • A better understanding of the data leads to a more effective outcome.
  • Find correlations, general trends, and outliers.
• Data Wrangling/Cleaning
  • Filling of missing values
  • Removal of duplicate data
  • Removal of invalid data
  • Noise reduction
• Data Processing/Mapping
  • Feature (input) and label (output) selection
  • Dealing with imbalanced classes
  • Feature engineering
  • Data augmentation
  • Normalizing and scaling the data
• Data Management
  • Identify the best data storage solutions
  • Data versioning for reproducibility
  • Storing metadata and creating a data interconnectivity platform
  • Ensure a constant data stream for model training
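The wrangling steps above can be sketched with pandas. The DataFrame and its column names below are hypothetical, not from the HRT dataset; the imputation and cut-off choices are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feed_rate": [1.0, 1.2, np.nan, 1.1, 1.1, 50.0],  # a missing value and an implausible spike
    "density":   [2.5, 2.6, 2.4, 2.6, 2.6, 2.5],
})

df = df.drop_duplicates()                                            # remove duplicate records
df["feed_rate"] = df["feed_rate"].fillna(df["feed_rate"].median())   # impute missing values
df = df[df["feed_rate"] < 10]                                        # drop an obviously invalid reading
df["feed_rate_smooth"] = df["feed_rate"].rolling(2, min_periods=1).mean()  # simple noise reduction
```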
Example for Data Cleaning and Mapping

Example of data cleaning using sample data from a High Rate Thickener (HRT):
• Filling of missing values
• Removal of duplicate data
• Removal of invalid data
• Noise reduction

Data Processing/Mapping: mapping of inputs and outputs for the High Rate Thickener (HRT) problem.


Example for Data Cleaning and Mapping
High Rate Thickener (HRT)

[Figure: visualization of the data; data cleaning to remove outliers using percentile cut-offs.]
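The percentile cut-off approach can be sketched as follows; the 1st/99th percentile levels are a common choice rather than the specific cut-offs used for the HRT data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sensor readings with two gross outliers appended
data = np.concatenate([rng.normal(50.0, 5.0, 1000), [500.0, -400.0]])

lo, hi = np.percentile(data, [1, 99])       # percentile cut-off levels
cleaned = data[(data >= lo) & (data <= hi)]  # keep only values inside the cut-offs
```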


Example for Data Cleaning and Mapping
High Rate Thickener

[Figures: correlation coefficients among the data, with the target variables highlighted; optimization architecture.]
Exploratory Data Analysis (EDA)

EDA is an approach to analysing data using visual techniques. It involves the following aspects of data analysis:
• performing initial investigations on data to discover patterns,
• spotting anomalies,
• testing hypotheses, and
• checking assumptions with the help of summary statistics and graphical representations.

EDA involves the following tasks:
• Organise the data using a delimiter function.
• Identify the number of rows and columns and their labels.
• Check for any missing / null values.
• Determine how many columns are integer / real / float.
• Calculate the mean, standard deviation, minimum and maximum values, and the quantiles of the data.
• Identify the relations/interactions between input and output variables by plotting a heat map, correlation matrix, boxplot, and distribution plot.

An example of an EDA operation is given in: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
Example for EDA operation: sample analysis in Python

Organise the data using a delimiter function. Sample colon-delimited data:

pH (-):2:6:2.4:6.5:7:7.5
TDS (g/l):10:20:10.1:12:19:18.5
BOD (g/l):1:0.1:1.2:0.2:0.3:0.15
COD (g/l):1:0.2:1.5:0.1:0.2:0.3
Temp (C):23:23:21:24:25:20
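A sketch of parsing the colon-delimited sample above with pandas: the first field on each line is the variable name and the remaining fields are its values, so the parsed table is transposed to get one column per variable:

```python
import io
import pandas as pd

raw = """pH (-):2:6:2.4:6.5:7:7.5
TDS (g/l):10:20:10.1:12:19:18.5
BOD (g/l):1:0.1:1.2:0.2:0.3:0.15
COD (g/l):1:0.2:1.5:0.1:0.2:0.3
Temp (C):23:23:21:24:25:20"""

# sep=":" splits on the delimiter; index_col=0 takes the variable names;
# .T transposes so each variable becomes a column with 6 observations.
df = pd.read_csv(io.StringIO(raw), sep=":", header=None, index_col=0).T
```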


Example for EDA operation: sample analysis in Python

• Identify the number of rows and columns and their labels.
• Check for any missing / null values.
• Determine how many columns are integer / real / float.
• Calculate the mean, standard deviation, minimum and maximum values, and the quantiles of the data.
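These checks map onto a handful of standard pandas calls; a sketch on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pH": [6.5, 7.0, np.nan, 7.5], "TDS": [12.0, 19.0, 18.5, 10.1]})

n_rows, n_cols = df.shape    # number of rows and columns
labels = list(df.columns)    # column labels
missing = df.isnull().sum()  # missing / null values per column
dtypes = df.dtypes           # integer / float type of each column
summary = df.describe()      # mean, std, min, max and the quantiles
```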


Example for EDA operation: sample analysis in Python

Heat map and correlation matrix:

df.corr()  # returns the correlation matrix, where "df" is the DataFrame
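A fuller sketch on made-up process data; the column values are chosen so that temperature and pressure are exactly linearly related. Passing the result to `seaborn.heatmap(corr)` would render the heat map shown on the slide:

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [20, 21, 23, 24, 25],
    "pressure":    [1.0, 1.1, 1.3, 1.4, 1.5],   # exactly linear in temperature
    "yield":       [0.50, 0.52, 0.55, 0.58, 0.60],
})

corr = df.corr()  # pairwise Pearson correlation coefficients
```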


Example for EDA operation: sample analysis in Python

[Figures: boxplot and distribution plot of the sample data.]


Model development (training, validation)

1. Define the metrics for estimating the model's performance.
2. Clean the data set.
3. Create input features from existing attributes.
4. Split the data into three sets.
5. If the output classes are highly imbalanced, resample the data to balance the output classes.
6. Choose the model structure.
7. Tune the hyperparameters.
8. Evaluate the final model's performance based on the metrics defined.
Model development (training, validation)

1. Define the metrics used in measuring the model's performance.
   • R², Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) are commonly used metrics for regression models that output numeric values.
   • Precision, Recall, F1, and Area Under the receiver operating characteristic (ROC) Curve (AUC) are commonly used for classification models that output categorical values.
   • You also need to choose a loss function that reflects your choice of model performance metrics.

2. Clean the data set.
   • Remove duplicated records.
   • Handle data with missing values: impute or fill in missing values, or remove problematic rows or columns.
   • Find any potential outliers and remove them, but only if they represent errors; true outliers are often important data points.


Model development (training, validation)

3. Split the data into three sets (the percentages can vary according to the nature of the problem and data availability).
   • The training set (commonly 70% of available data) is used to train the model.
   • The validation set (commonly 10%) is used to optimize model hyperparameters (for example, the regularization weight of a linear regression model).
   • The test set (commonly 20%) is used only to measure the performance of the final trained model on unseen data.
   • If the output classes are highly imbalanced, resample the data to balance the output classes: over-sample records from the rare classes and under-sample records from the frequent classes. Optionally, create synthetic data records for the rare classes.

4. Choose the model structure.
   • Tune the model's hyperparameters (for example, the regularization coefficient in linear regression or the number of nearest-neighbour data points in KNN classification).
   • For each hyperparameter configuration, a model is trained on the training data and its performance is evaluated on the validation data.
   • The model with the best performance will typically be selected as the final model. However, depending on requirements, we may opt to trade performance for a model that is more explainable, fairer, or one that supports faster inference.
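The 70/10/20 split can be sketched with shuffled indices as below; applying scikit-learn's `train_test_split` twice is a common equivalent:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100                      # total number of records
idx = rng.permutation(n)     # shuffle before splitting

n_train, n_val = int(0.7 * n), int(0.1 * n)
train_idx = idx[:n_train]                 # 70% for training
val_idx = idx[n_train:n_train + n_val]    # 10% for hyperparameter tuning
test_idx = idx[n_train + n_val:]          # 20% held out for final evaluation
```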


Model development (training, validation)

5. Evaluate the final model's performance based on the metrics defined in step 1 ("Define the metrics used in measuring the model's performance") using the test set.
   • If we are happy with the performance of the final model, we can save it into the model repository and deploy it into production.
   • If not, we need to go back to do more data exploration, create new features, and repeat the training cycle.
   • Follow company policies on model governance and check to make sure the model is not using protected attributes (such as gender, age, etc.) in making decisions or predictions.


Deployment and Maintenance

6. Deploy the model into a production serving environment.
   • There are two ways to expose the model: package the model as a library function called by application code, or package the model as a RESTful API service hosted by the model serving platform.
   • Optionally, we can deploy the latest model incrementally to a limited percentage of users and, in parallel, run an A/B test to compare it with the currently deployed production model. Based on the performance difference, we can decide whether to roll forward to the latest model.

• Monitor the model's performance to detect whether there is drift in the following:
   • Distribution of input variables: if new data deviates significantly from the training data, it is a good indication that something in the environment has changed.
   • Model performance metrics: if the model shows degraded performance on new data relative to its performance during the training phase, model retraining may be necessary.

• If drift is detected, retrain the model with the latest refreshed data. Model retraining may also be scheduled in some cases; for example, you may choose to retrain the model every week on the latest data.
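A minimal input-drift check along the lines described above. The helper name `mean_drift` and the standard-error threshold are illustrative assumptions, not from the slides; production systems often use formal statistical tests such as Kolmogorov-Smirnov instead:

```python
import numpy as np

def mean_drift(train: np.ndarray, new: np.ndarray, n_se: float = 3.0) -> bool:
    """Flag drift when the new data's mean shifts by more than n_se
    standard errors relative to the training data."""
    se = train.std(ddof=1) / np.sqrt(len(new))
    return bool(abs(new.mean() - train.mean()) > n_se * se)

rng = np.random.default_rng(1)
train = rng.normal(50.0, 5.0, 5000)
same = rng.normal(50.0, 5.0, 500)     # same distribution as training
shifted = rng.normal(60.0, 5.0, 500)  # mean shifted by 2 sigma: clear drift
```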


Oil Well Drilling Process Overview

1. Mud tank                 8. Standpipe           15. Monkey board          22. Bell nipple
2. Shale shakers            9. Kelly hose          16. Stand of drill pipe   23. Blowout preventers
3. Suction line            10. Goose-neck          17. Pipe rack floor       24. Drill string
4. Mud pump                11. Travelling block    18. Swivel                25. Drill bit
5. Motor or power source   12. Drill line          19. Kelly drive           26. Casing head
6. Vibrating hose          13. Crown block         20. Rotary table          27. Flow line
7. Draw-works              14. Derrick             21. Drill floor

https://dtetechnology.wordpress.com/2014/05/04/components-of-a-land-based-rotary-drilling-platform
Example: Data collection for a Drilling DSS

A sliding window of recent data is used for training. An initial window (e.g. samples 1-5) trains the model, which then makes predictions and optimizations over the future data (samples 6, 7, …). As new samples arrive, the oldest samples are dropped and the window slides forward to form a new training window (e.g. samples 2-6).
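The sliding-window scheme above can be sketched as a small generator; the window width of 5 mirrors the samples shown in the slide's diagram:

```python
from typing import Iterator, List

def sliding_windows(data: List[int], width: int) -> Iterator[List[int]]:
    """Yield fixed-width training windows; each step drops the oldest
    sample and appends the newest one."""
    for start in range(len(data) - width + 1):
        yield data[start:start + width]

windows = list(sliding_windows([1, 2, 3, 4, 5, 6, 7], width=5))
# windows[0] == [1, 2, 3, 4, 5]  (initial window for training)
# windows[1] == [2, 3, 4, 5, 6]  (sample 1 dropped, sample 6 added)
```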


Process lifecycle in MATLAB
Commonly used toolboxes

• Data collection: Database, Datafeed, and OPC
• Preprocessing: Statistics and Machine Learning, Signal Processing, System Identification, Wavelet, Text Analytics, Image Processing, Fuzzy Logic
• Exploratory Data Analysis (EDA): EDA Toolbox, Data sets
• Model development (training, validation): Statistics and Machine Learning, Classification Learner, Regression Learner, Deep Learning
• Deployment: MATLAB Coder, MATLAB Production Server


Process lifecycle in Python
Commonly used libraries

• Data collection/scraping: SQLAlchemy, requests, BeautifulSoup
• Data manipulation: Pandas, Dask (parallel computing)
• Preprocessing: NumPy, Scikit-learn
• Exploratory Data Analysis (EDA): Scikit-learn, Matplotlib, Seaborn, Plotly
• Model development (training, validation): Scikit-learn, SciPy, TensorFlow, PyTorch, Keras
• Deployment: Flask, Django, Streamlit, FastAPI (web frameworks); TensorFlow Serving, TorchServe
