6 Workflow

The document outlines a comprehensive workflow for data science and machine learning projects, detailing steps from data ingestion and preprocessing to model training, evaluation, and deployment. It emphasizes the importance of data quality, feature engineering, model selection, and hyperparameter tuning to ensure effective model performance. Additionally, it provides specific techniques for evaluating model accuracy and refining models for optimal results.

Workflow > Overall

Steps followed for each data science / machine learning project:

DATA
o Ingestion
o Cleaning and preprocessing
o Exploration
o Encoding
o Scaling and normalization

MODEL
o Data reshuffling and partitioning
o Training
o Evaluation
o Hyperparameter tuning
o Inference
Workflow > Data > Sourcing & Ingestion

Identify a data source and ingest the data:

o Source
• Validity
• Quality
• Cost

o Ingestion
• Size (big, small): Spark (Hadoop), Pandas (Python), …
• Format (Schema)
• Method (API, SFTP, …)

o Storage and representation
• File: What format should the input be? Parquet, JSON, CSV, …
• Database: What type of database? SQL, NoSQL, Graph
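
As a rough illustration, a minimal Pandas ingestion sketch (file names and columns are hypothetical; the Parquet output assumes pyarrow or fastparquet is installed):

    import pandas as pd

    # Read a (hypothetical) CSV export; parse dates at ingestion time
    df = pd.read_csv("transactions.csv", parse_dates=["created_at"])

    # Basic size and schema checks before anything else
    print(df.shape)
    print(df.dtypes)

    # Persist in a columnar format for faster downstream analytics
    df.to_parquet("transactions.parquet", index=False)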
Workflow > Data > Preprocessing

Cleaning and standardization of the data to make it amenable to analytics and training:

o Missing values
• Do nothing (if algorithm allows)
• Delete records (bias and data size issues)
• Impute (if possible, using simple or sophisticated algorithms)

o Duplicate records
• Remove (exact matching or more sophisticated record matching)

o Cleaning and standardization
• Clean and standardize (e.g. phones, postal codes, SSNs, etc.):
(787) 333-3333, 7873333333, 787.333.3333, …

• Map categorical values onto standard values:
5th Ave, Fifth Av, 5 Avenue, …
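
A minimal Pandas sketch of these cleaning steps, using a small hypothetical dataframe:

    import pandas as pd

    df = pd.DataFrame({
        "phone": ["(787) 333-3333", "787.333.3333", None, "7873333333"],
        "age":   [34, 34, 29, None],
    })

    # Missing values: impute a simple statistic (or drop, or do nothing, per the options above)
    df["age"] = df["age"].fillna(df["age"].median())

    # Duplicate records: exact-match removal
    df = df.drop_duplicates()

    # Standardization: keep digits only, so all phone formats collapse to one representation
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)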
Workflow > Data > Feature engineering

Transforming raw data into features that are suitable for machine learning:

o Selection
Choose the most relevant features and exclude irrelevant ones

o Transformation
Transform features to make them more amenable to modeling
• Logarithmic transform, Box-Cox transform, …

o Scaling
Scale features to a common range (e.g., [0, 1]) to remove sensitivity to feature magnitudes and improve performance.

o Encoding categorical variables
Encode categorical values into a numerical format, e.g. one-hot encoding, …

o Dimensionality reduction: Reduce complexity and computational load
• Eliminate correlated features
• Extract the most important combinations of features
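
A minimal scikit-learn sketch of these transformations (column names are illustrative; the sparse_output argument assumes scikit-learn 1.2 or later):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.decomposition import PCA

    df = pd.DataFrame({
        "income": [32000, 54000, 120000, 47000],
        "city":   ["SJU", "NYC", "SJU", "BOS"],
    })

    # Transformation: log transform to reduce skew
    df["log_income"] = np.log1p(df["income"])

    # Scaling: map the feature to the [0, 1] range
    df[["income_scaled"]] = MinMaxScaler().fit_transform(df[["income"]])

    # Encoding: one-hot encode the categorical column
    city_ohe = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

    # Dimensionality reduction: keep the top principal components
    reduced = PCA(n_components=2).fit_transform(
        np.hstack([df[["log_income", "income_scaled"]].to_numpy(), city_ohe])
    )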
Workflow > Data > Partitioning

Split the data set into training and validation/test datasets to validate the model independently:

o Random sampling or reshuffling
Remove order artifacts

o Split into two or more partitions
§ Training
70-80% of the data, used to train the model, i.e. fit the model parameters
§ Validation
(optional) used for hyperparameter tuning
§ Test
Used to evaluate model performance, i.e. measure how the model performs on unseen data
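
A minimal sketch of a shuffled 70/15/15 split with scikit-learn (the proportions and the iris dataset are just illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # First carve out a held-out test set, then split the remainder into train/validation
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.15, shuffle=True, random_state=42
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42
    )

    print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15%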
Workflow > Model > Design

Choose the most appropriate model for the task at hand:

o Supervised or unsupervised
• Has labels or not

o Regression or classification
• Numerical or categorical output
• Number of categories

o Traditional or neural network
• Simplest model that captures the nature of the problem
Workflow > Model > Training

Calculate the model parameters using an optimization algorithm to minimize a measure of the error (the difference between the ground truth and the prediction):

o Loss (cost) function:
Measures how well the model's predictions match the actual target values

o Minimization algorithm:
Variants of steepest gradient descent

o Hyperparameters:
Algorithm parameters that control the optimization efficiency
• Epochs
• Learning rate
• Batch size
(Figure: example cost function landscape)
o Infrastructure:
Hardware on which to train the model efficiently, e.g. CPUs, GPUs, clusters, etc.
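
A minimal sketch of a mini-batch training loop, using scikit-learn's SGDClassifier as a stand-in for the optimizer (hyperparameter values are illustrative; the "log_loss" name assumes a recent scikit-learn):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier

    X, y = load_iris(return_X_y=True)

    # Hyperparameters that control the optimization (illustrative values)
    epochs, learning_rate, batch_size = 20, 0.01, 32

    # Logistic loss is the cost function; gradient descent variants minimize it
    model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=learning_rate)

    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        order = rng.permutation(len(X))              # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # One gradient step on the mini-batch
            model.partial_fit(X[idx], y[idx], classes=np.unique(y))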
Workflow > Model > Evaluation > Classification

Calculate measures for the goodness of the prediction:

o Accuracy
Proportion of correctly classified instances out of all instances

o Precision
Measures the accuracy of positive predictions. It's the ratio of true positive
predictions to all positive predictions.

o Recall (Sensitivity)
Measures the ability to correctly identify all relevant instances. It's the ratio
of true positive predictions to all actual positives.

o Confusion Matrix
Table that summarizes the model's classification performance, including true
positives, true negatives, false positives, and false negatives.

o Receiver Operating Characteristic (ROC)
Visualizes the trade-off between true positive rate and false positive rate. The Area Under the Curve (AUC) quantifies the overall performance.
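
A minimal scikit-learn sketch computing these measures on hypothetical predictions:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 confusion_matrix, roc_auc_score)

    y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions
    y_score = [0.2, 0.7, 0.9, 0.8, 0.4, 0.1, 0.6, 0.3]   # predicted probabilities

    print(accuracy_score(y_true, y_pred))    # correct / all instances
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
    print(roc_auc_score(y_true, y_score))    # area under the ROC curve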
Workflow > Model > Evaluation > Regression

Measures of how far the predictions are from the ground truth:

o Mean Absolute Error (MAE)
Average absolute difference between predicted and actual values. It provides an understanding of the model's average prediction error.

o Mean Squared Error (MSE)
Average squared difference between predicted and actual values. MSE penalizes large errors more than MAE.

o R-squared (R2)
Proportion of the variance in the dependent variable explained by the
model. It ranges from 0 to 1, with higher values indicating a better fit.
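
A minimal scikit-learn sketch of the regression measures on hypothetical predictions:

    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.8, 5.4, 2.0, 8.0]

    print(mean_absolute_error(y_true, y_pred))  # MAE: average |error|
    print(mean_squared_error(y_true, y_pred))   # MSE: average squared error
    print(r2_score(y_true, y_pred))             # R^2: proportion of variance explained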
Workflow > Model > Refinement

Refine the hyperparameters of the model to maximize performance:

o Grid Search:
Systematically evaluates all possible combinations of hyperparameters to
find the best set.

o Bayesian Optimization:
Probabilistic model-based optimization technique that uses a surrogate model of the objective to guide the search for optimal hyperparameters efficiently.

o Genetic Algorithms:
Use populations of hyperparameter configurations and evolve them over
generations, selecting and mutating the best-performing configurations.
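
A minimal sketch of grid search with cross-validation in scikit-learn (the estimator and parameter grid are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Every combination of C and kernel is trained and scored by cross-validation
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)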
Workflow > Model > Deployment

Deploy the trained model to make predictions:

o Save model:
Save model weights.

o Infrastructure:
Hardware on which to perform inference efficiently.

o Method of delivery:
Batch, API endpoint, …
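
A minimal sketch of saving a trained model and reloading it for inference with joblib (the model and file name are illustrative):

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Save the fitted model (parameters/weights) to disk
    joblib.dump(model, "model.joblib")

    # Later, in the serving environment: load once, then predict in batch or behind an API
    loaded = joblib.load("model.joblib")
    print(loaded.predict(X[:5]))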
