L3 Overview of ML Model Development Lifecycle-1
Figure: overall workflow — Data → Data pre-processing → ML model development → AI algorithm development → Optimizer → DSS, supporting applications such as Modelling & Simulation and APC.
Data Preparation: Preprocessing and Data Wrangling
Median:
• The "middle" value of a dataset.
• A robust alternative to the mean: it is not affected by outliers as long as less than 50% of the data is contaminated.
Kth percentile:
• The score below which k% of the dataset falls.
Ex: For N ∈ {1 .. 1000}, the median (Q2) ≈ 500, the 25th percentile (Q1) ≈ 250, and the 75th percentile (Q3) ≈ 750.
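A quick way to check these definitions is with NumPy; the sketch below uses the 1..1000 example above (the exact quartile values depend on NumPy's default interpolation).

```python
# Minimal sketch: median and percentiles with NumPy, using the 1..1000 example above.
import numpy as np

data = np.arange(1, 1001)

print(np.median(data))          # Q2 -> 500.5 (the "middle" value)
print(np.percentile(data, 25))  # Q1 -> 250.75 with NumPy's default linear interpolation
print(np.percentile(data, 75))  # Q3 -> 750.25
```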
Covariance is a measure of how much two variables change together. For two data sets X and Y it is estimated as

$$\mathrm{Cov}(X, Y) = \sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$$

where $\mu_x$ and $\mu_y$ are the means of X and Y.

The correlation coefficient $\rho$ depicts the strength of the relationship between the data sets. For the same data sets X and Y,

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

The value of $\rho$ always lies in $[-1, 1]$.
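Both quantities can be computed directly with NumPy; a small sketch with made-up x and y values (assumed data) is below.

```python
# Minimal sketch (made-up data): covariance and correlation with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Population covariance, matching the 1/N formula above.
cov_xy = np.cov(x, y, bias=True)[0, 1]

# Correlation coefficient rho = sigma_XY / (sigma_X * sigma_Y), always in [-1, 1].
rho = cov_xy / (x.std() * y.std())

# np.corrcoef returns the same value directly.
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(cov_xy, rho)
```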
Standard Score (z-score): choose this method when you want to reduce the influence of outliers and preserve the relative distances between values. It is also a common choice when downstream methods assume approximately normally distributed inputs.
Min-Max Scaling: opt for this when you need values in a bounded range. However, be mindful that it is sensitive to outliers.
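Both transformations are available in scikit-learn; the sketch below (with assumed toy data containing one outlier) contrasts the two.

```python
# Minimal sketch (assumed toy data): standard score vs. min-max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # note the single outlier at 100

# Standard score (z-score): zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: values bounded to [0, 1]; the outlier squeezes the rest near 0.
X_mm = MinMaxScaler().fit_transform(X)

print(X_std.ravel())
print(X_mm.ravel())
```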
Model Development
Figure: the model development workflow — Exploratory Data Analysis (EDA) → model development → model deployment — starting from the data and targets, with noise reduction applied during preparation.
Exploratory Data Analysis (EDA)
Example EDA operations (sample analysis in Python):
• Identify the number of rows and columns and their labels.
• Check for any missing / null values.
• Count how many columns are integer / real (float) typed.
• Calculate the mean, standard deviation, minimum and maximum values, and the quantiles of the data.
An example EDA walkthrough is given in: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
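With pandas, these checks are one-liners; the sketch below assumes the data has been loaded from a hypothetical data.csv file.

```python
# Minimal sketch of the EDA checks above, assuming a hypothetical CSV file.
import pandas as pd

df = pd.read_csv("data.csv")        # assumed file name

print(df.shape)                     # number of rows and columns
print(df.columns.tolist())          # column labels
print(df.isnull().sum())            # missing / null values per column
print(df.dtypes.value_counts())     # how many integer / float / object columns
print(df.describe())                # mean, std, min, max and quantiles
```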
1. Define the metrics used in measuring the model's performance (see the Python sketch after step 2).
• R², Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) are commonly used metrics for regression models that output numeric values.
• Precision, Recall, F1, and Area Under the receiver operating characteristic (ROC) Curve (AUC) are commonly used for classification models that output categorical values.
2. Clean the data set.
• Remove duplicated records.
• Handle data with missing values, either by imputing (filling in) the missing values or by removing the problematic rows or columns.
• Find any potential outliers and handle them.
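All of the metrics named in step 1 are available in scikit-learn; the sketch below uses small made-up arrays (assumed values) just to show the calls.

```python
# Minimal sketch (made-up arrays): the regression and classification metrics from step 1.
import numpy as np
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_percentage_error,
                             precision_score, recall_score, f1_score, roc_auc_score)

# Regression metrics
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6])
print(r2_score(y_true, y_pred))
print(np.sqrt(mean_squared_error(y_true, y_pred)))      # RMSE
print(mean_absolute_percentage_error(y_true, y_pred))   # MAPE

# Classification metrics
y_true_c = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.7])            # predicted scores
y_pred_c = (y_prob >= 0.5).astype(int)                  # hard labels at a 0.5 threshold
print(precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c),
      f1_score(y_true_c, y_pred_c),
      roc_auc_score(y_true_c, y_prob))                  # AUC uses the scores, not the labels
```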
3. Split data into three sets (the percentages can vary according to problem nature and data availability; see the sketch after step 4).
• The training set (commonly 70% of the available data) is used to train the model.
• The validation set (commonly 10%) is used to optimize model hyperparameters (for example, the regularization weight of a linear regression model).
• The test set (commonly 20%) is used only to measure the performance of the final trained model on unseen data.
• If the output classes are highly imbalanced, resample the data to balance the output classes:
  • over-sample records from the rare classes and under-sample records from the frequent classes;
  • optionally, create synthetic data records for the rare classes.
4. Choose the model structure.
• Tune the model's hyperparameters (for example, the regularization weight in linear regression and the number of nearest-neighbour data points in KNN classification).
• For each hyperparameter configuration, a model is trained on the training data and its performance is evaluated with the validation data.
• The model with the best performance will typically be selected as the final model. However, depending on requirements, we may opt to trade performance for a model that is more explainable, fairer, or one that supports faster inference.
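A minimal sketch of steps 3 and 4, assuming a synthetic regression data set and Ridge regression as the model family, is shown below; the 70/10/20 split and the list of candidate regularization weights are illustrative choices.

```python
# Minimal sketch (assumed data and model family): 70/10/20 split plus validation-based tuning.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# Carve off the 20% test set first, then split the remainder into 70% train / 10% validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.125, random_state=42)

# Try several regularization weights; keep the one with the best validation RMSE.
best_alpha, best_rmse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
```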
5. Evaluate the final model’s performance based on metrics defined in step 1 (“Define
the metrics used in measuring the model’s performance”) using the test set.
• If we are happy with the performance of the final model, we can save the final
model into the model repository and deploy the model into production.
• If not, we need to go back to do more data exploration, create new features, and
repeat the training cycle.
• Follow company policies on model governance and check to make sure the model is not using protected attributes (such as gender, age, etc.) in making decisions or predictions.
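Persisting the accepted model into a repository is often done with joblib; a minimal sketch with assumed file names is below (the trained estimator stands in for whatever final model step 4 produced).

```python
# Minimal sketch (assumed names and paths): saving the accepted model and loading it back later.
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
final_model = Ridge(alpha=1.0).fit(X, y)          # stands in for the model selected in step 4

joblib.dump(final_model, "final_model.joblib")    # save into the model repository (assumed path)

restored = joblib.load("final_model.joblib")      # later: load the model for deployment
print(restored.predict(X[:3]))
```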
• There are two ways to expose the model: package the model as a library function called by
application code or package the model as a RESTful API service hosted by the model serving
platform.
• Optionally, we can deploy the latest model incrementally to a limited percentage of users and, in
parallel, run an A/B test to compare it with the currently deployed production model. Based on the
performance difference, we can decide whether we should roll forward to the latest model.
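A minimal sketch of the RESTful-API option, assuming Flask and the final_model.joblib artifact from the previous step, is shown below; the route name and payload format are illustrative. The library-function alternative is simply importing the saved model and calling its predict method from application code.

```python
# Minimal sketch (assumed framework and file names): serving the model as a RESTful API with Flask.
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("final_model.joblib")        # artifact produced by the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[0.1, 0.2, 0.3, 0.4, 0.5]]}
    features = np.array(request.get_json()["features"])
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```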
• Monitor the model’s performance to detect whether there is drift in the following:
• Distribution of input variables: if new data deviates significantly from the training data, it is a good
indication that something in the environment has changed.
• Model performance metrics: if the model shows degraded performance on new data relative to its
performance during the training phase, model retraining may be necessary.
• If drift is detected, retrain the model with the latest refreshed data. Model retraining may also be scheduled
in some cases. For example, you may choose to retrain the model every week on the latest data.
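One simple way to flag drift in an input variable's distribution is a two-sample Kolmogorov-Smirnov test from SciPy; the sketch below uses synthetic training and production samples and an assumed significance threshold.

```python
# Minimal sketch (synthetic data): flagging input-distribution drift with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature values seen during training
new_feature = rng.normal(loc=0.5, scale=1.0, size=1000)     # the same feature in recent production data

stat, p_value = ks_2samp(train_feature, new_feature)

# A very small p-value means the new data no longer looks like the training data,
# which is a trigger to investigate and possibly retrain the model.
if p_value < 0.01:                                          # assumed significance threshold
    print("Drift detected: retrain the model on refreshed data.")
```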
EDA Toolbox
• Data sets: Numpy, Scikit-learn
• Preprocessing: Scikit-learn
• Exploratory Data Analysis (EDA): Matplotlib, Seaborn, Plotly
• Model development (training, validation): Scikit-learn, Scipy, Tensorflow, PyTorch, Keras