6 Workflow
DATA
o Ingestion
o Cleaning and preprocessing
o Exploration
o Encoding
o Scaling and normalization
MODEL
o Data reshuffling and partitioning
o Training
o Evaluation
o Hyperparameter tuning
o Inference
Workflow > Data > Sourcing & Ingestion
o Source
• Validity
• Quality
• Cost
o Ingestion
• Size (big, small): Spark (Hadoop), Pandas (Python), …
• Format (Schema)
• Method (API, SFTP, …)
o Missing values
• Do nothing (if algorithm allows)
• Delete records (bias and data size issues)
• Impute (if possible, using simple or sophisticated algorithms)
o Duplicate records
• Remove (exact matching or more sophisticated record matching)
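A minimal sketch of these ingestion and cleaning steps with pandas; the file name and column names (raw_records.csv, age, customer_id) are illustrative assumptions, not part of the original workflow:

    import pandas as pd

    # Ingest the raw data; "raw_records.csv" is a hypothetical file standing in
    # for whatever source the pipeline actually reads from.
    df = pd.read_csv("raw_records.csv")

    # Missing values: impute a numeric column with its median (a simple strategy),
    # then drop rows where a critical identifier is still missing.
    df["age"] = df["age"].fillna(df["age"].median())
    df = df.dropna(subset=["customer_id"])

    # Duplicate records: remove exact duplicates, keeping the first occurrence.
    df = df.drop_duplicates()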
Transforming raw data into features that are suitable for machine learning:
o Selection
Choose the most relevant features and exclude irrelevant ones
o Transformation
Transform features to make them more amenable to modeling
• Logarithmic transform, Box-Cox transform, …
o Scaling
Scale features to a common range (e.g., [0, 1]) to remove sensitivity to feature
magnitudes and improve performance.
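A possible sketch of the transformation and scaling steps using NumPy and scikit-learn; the toy DataFrame and column names are illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Small illustrative dataset with a skewed, strictly positive feature.
    df = pd.DataFrame({"age": [22, 35, 58, 41],
                       "income": [1800, 3200, 12500, 4100]})

    # Logarithmic transform to compress the long tail of "income".
    df["income_log"] = np.log1p(df["income"])

    # Scale both features to the common range [0, 1].
    scaler = MinMaxScaler(feature_range=(0, 1))
    df[["age_scaled", "income_scaled"]] = scaler.fit_transform(df[["age", "income_log"]])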
§ Validate
(optional) used for hyperparameter tuning
§ Test
Evaluate model performance, i.e., measure how it performs on unseen data
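One common way to reshuffle and carve out these partitions with scikit-learn; the synthetic data and the 80/10/10 split are illustrative choices:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the full feature matrix X and labels y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] > 0).astype(int)

    # Reshuffle and partition: 80% train, 10% validation (for tuning), 10% test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)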
Workflow > Model > Design
o Supervised or unsupervised
• Has labels or not
o Regression or classification
• Categorical or numerical output
• Number of categories
o Minimization algorithm:
Variants of steepest gradient descent
o Hyperparameters:
Algorithm parameters that control the efficiency of the optimization (see the sketch after this list)
• Epochs
• Learning rate
• Batch size
(Figure: example cost function landscape)
o Infrastructure:
Hardware on which to train the model efficiently, e.g. CPUs, GPUs,
clusters, etc.
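A minimal sketch of how these hyperparameters enter a mini-batch gradient-descent loop, here for plain NumPy linear regression on synthetic data; all values are illustrative:

    import numpy as np

    # Synthetic regression data standing in for the real training set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

    # Hyperparameters that control the optimization.
    epochs, learning_rate, batch_size = 20, 0.05, 32

    w = np.zeros(3)
    for epoch in range(epochs):
        order = rng.permutation(len(X))                 # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ residual / len(idx)   # gradient of the MSE cost
            w -= learning_rate * grad                   # steepest-descent step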
Workflow > Model > Evaluation > Classification
o Accuracy
Proportion of correctly classified instances out of all instances
o Precision
Measures the accuracy of positive predictions. It's the ratio of true positive
predictions to all positive predictions.
o Recall (Sensitivity)
Measures the ability to correctly identify all relevant instances. It's the ratio
of true positive predictions to all actual positives.
o Confusion Matrix
Table that summarizes the model's classification performance, including true
positives, true negatives, false positives, and false negatives.
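These metrics can be computed directly with scikit-learn given true and predicted labels; the label arrays below are illustrative:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, confusion_matrix)

    # Illustrative ground-truth and predicted labels.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(accuracy_score(y_true, y_pred))    # correct predictions / all predictions
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted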
Workflow > Model > Evaluation > Regression
Measures of how far the predictions are from the ground truth:
o R-squared (R2)
Proportion of the variance in the dependent variable that is explained by the
model. It typically ranges from 0 to 1, with higher values indicating a better fit.
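R² can likewise be computed with scikit-learn from predictions and ground truth; the values below are illustrative:

    from sklearn.metrics import r2_score

    # Illustrative ground-truth and predicted values.
    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.8, 5.3, 2.9, 6.6]

    # R^2 = 1 - SS_residual / SS_total; values close to 1 indicate a better fit.
    print(r2_score(y_true, y_pred))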
Workflow > Model > Refinement
o Grid Search:
Systematically evaluates all possible combinations of hyperparameters to
find the best set (sketched after this list).
o Bayesian Optimization:
Probabilistic model-based optimization technique that uses the results of previous
evaluations to guide the search for optimal hyperparameters efficiently.
o Genetic Algorithms:
Use populations of hyperparameter configurations and evolve them over
generations, selecting and mutating the best-performing configurations.
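As one example, scikit-learn's GridSearchCV implements the exhaustive grid-search strategy; the toy dataset, model, and parameter grid below are placeholders:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Toy dataset and model standing in for the real problem.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # Exhaustively evaluate every combination in the grid with 5-fold cross-validation.
    param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)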
Workflow > Model > Deployment
o Save model:
Save model weights.
o Infrastructure:
Hardware on which to perform inference efficiently.
o Method of delivery:
Batch, API endpoint, …
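A minimal sketch of saving a trained model with joblib and reloading it for batch inference; the model, file name, and data are illustrative:

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Train and persist the fitted model (its learned weights are stored inside).
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, "model.joblib")

    # Later, on the inference infrastructure: reload and run batch predictions.
    restored = joblib.load("model.joblib")
    batch_predictions = restored.predict(X[:10])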