6 Workflow

The document outlines a comprehensive workflow for data science and machine learning projects, detailing steps from data ingestion and preprocessing to model training, evaluation, and deployment. It emphasizes the importance of data quality, feature engineering, model selection, and hyperparameter tuning to ensure effective model performance. Additionally, it provides specific techniques for evaluating model accuracy and refining models for optimal results.

Workflow > Overall

Steps followed for each data science / machine learning project:

DATA
o Ingestion
o Cleaning and preprocessing
o Exploration
o Encoding
o Scaling and normalization

MODEL
o Data reshuffling and partitioning
o Training
o Evaluation
o Hyperparameter tuning
o Inference
Workflow > Data > Sourcing & Ingestion

Identify a data source and ingest the data:

o Source
• Validity
• Quality
• Cost

o Ingestion
• Size (big, small): Spark (Hadoop), Pandas (Python), …
• Format (Schema)
• Method (API, SFTP, …)

o Storage and representation
• File: What format should the input be? Parquet, JSON, CSV, …
• Database: What type of database? SQL, NoSQL, Graph
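
As a rough illustration, a minimal Pandas ingestion sketch (file names and columns are hypothetical; the Parquet output assumes pyarrow or fastparquet is installed):

    import pandas as pd

    # Read a (hypothetical) CSV export; parse dates at ingestion time
    df = pd.read_csv("transactions.csv", parse_dates=["created_at"])

    # Basic size and schema checks before anything else
    print(df.shape)
    print(df.dtypes)

    # Persist in a columnar format for faster downstream analytics
    df.to_parquet("transactions.parquet", index=False)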
Workflow > Data > Preprocessing

Cleaning and standardization of the data to make it amenable to analytics and training:

o Missing values
• Do nothing (if algorithm allows)
• Delete records (bias and data size issues)
• Impute (if possible, using simple or sophisticated algorithms)

o Duplicate records
• Remove (exact matching or more sophisticated record matching)

o Cleaning and standardization
• Clean and standardize (e.g. phones, postal codes, SSNs, etc.):
(787) 333-3333, 7873333333, 787.333.3333, …

• Map categorical values onto standard values:
5th Ave, Fifth Av, 5 Avenue, …
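
A minimal Pandas sketch of these cleaning steps, using a small hypothetical dataframe:

    import pandas as pd

    df = pd.DataFrame({
        "phone": ["(787) 333-3333", "787.333.3333", None, "7873333333"],
        "age":   [34, 34, 29, None],
    })

    # Missing values: impute a simple statistic (or drop, or do nothing, per the options above)
    df["age"] = df["age"].fillna(df["age"].median())

    # Duplicate records: exact-match removal
    df = df.drop_duplicates()

    # Standardization: keep digits only, so all phone formats collapse to one representation
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)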
Workflow > Data > Feature engineering

Transforming raw data into features that are suitable for machine learning:

o Selection
Choose the most relevant features and exclude irrelevant ones

o Transformation
Transform features to make them more amenable to modeling
• Logarithmic transform, Box-Cox transform, …

o Scaling
Scale features to a common range (e.g., [0, 1]) to remove sensitivity to feature magnitudes and improve performance.

o Encoding categorical variables
Encode categorical values into a numerical format, e.g. one-hot encoding, …

o Dimensionality reduction: Reduce complexity and computational load
• Eliminate correlated features
• Extract the most important combinations of features
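
A minimal scikit-learn sketch of these transformations (column names are illustrative; the sparse_output argument assumes scikit-learn 1.2 or later):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.decomposition import PCA

    df = pd.DataFrame({
        "income": [32000, 54000, 120000, 47000],
        "city":   ["SJU", "NYC", "SJU", "BOS"],
    })

    # Transformation: log transform to reduce skew
    df["log_income"] = np.log1p(df["income"])

    # Scaling: map the feature to the [0, 1] range
    df[["income_scaled"]] = MinMaxScaler().fit_transform(df[["income"]])

    # Encoding: one-hot encode the categorical column
    city_ohe = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

    # Dimensionality reduction: keep the top principal components
    reduced = PCA(n_components=2).fit_transform(
        np.hstack([df[["log_income", "income_scaled"]].to_numpy(), city_ohe])
    )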
Workflow > Data > Partitioning

Split the data set into training and validation/test datasets to validate the model independently:

o Random sampling or reshuffling
Remove order artifacts

o Split into two or more partitions
§ Training
70-80% of the data, used to train the model, i.e. fit the model parameters
§ Validation
(optional) used for hyperparameter tuning
§ Test
Used to evaluate model performance, i.e. measure how the model performs on unseen data
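
A minimal sketch of a shuffled 70/15/15 split with scikit-learn (the proportions and the iris dataset are just illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # First carve out a held-out test set, then split the remainder into train/validation
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.15, shuffle=True, random_state=42
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42
    )

    print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15%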
Workflow > Model > Design

Choose the most appropriate model for the task at hand:

o Supervised or unsupervised
• Has labels or not

o Regression or classification
• Numerical or categorical output
• Number of categories

o Traditional or neural network
• Simplest model that captures the nature of the problem
Workflow > Model > Training

Calculate the model parameters using an optimization algorithm to minimize a measure of the error (the difference between the ground truth and the prediction):

o Loss (cost) function:
Measures how well the model's predictions match the actual target values

o Minimization algorithm:
Variants of steepest gradient descent

o Hyperparameters:
Algorithm parameters that control the optimization efficiency
• Epochs
• Learning rate
• Batch size
(Figure: example cost function landscape)
o Infrastructure:
Hardware on which to train the model efficiently, e.g. CPUs, GPUs, clusters, etc.
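
A minimal sketch of a mini-batch training loop, using scikit-learn's SGDClassifier as a stand-in for the optimizer (hyperparameter values are illustrative; the "log_loss" name assumes a recent scikit-learn):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier

    X, y = load_iris(return_X_y=True)

    # Hyperparameters that control the optimization (illustrative values)
    epochs, learning_rate, batch_size = 20, 0.01, 32

    # Logistic loss is the cost function; gradient descent variants minimize it
    model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=learning_rate)

    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        order = rng.permutation(len(X))              # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # One gradient step on the mini-batch
            model.partial_fit(X[idx], y[idx], classes=np.unique(y))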
Workflow > Model > Evaluation > Classification

Calculate measures for the goodness of the prediction:

o Accuracy
Proportion of correctly classified instances out of all instances

o Precision
Measures the accuracy of positive predictions. It's the ratio of true positive
predictions to all positive predictions.

o Recall (Sensitivity)
Measures the ability to correctly identify all relevant instances. It's the ratio
of true positive predictions to all actual positives.

o Confusion Matrix
Table that summarizes the model's classification performance, including true
positives, true negatives, false positives, and false negatives.

o Receiver Operating Characteristic (ROC)
Visualizes the trade-off between true positive rate and false positive rate. The Area Under the Curve (AUC) quantifies the overall performance.
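
A minimal scikit-learn sketch computing these measures on hypothetical predictions:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 confusion_matrix, roc_auc_score)

    y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions
    y_score = [0.2, 0.7, 0.9, 0.8, 0.4, 0.1, 0.6, 0.3]   # predicted probabilities

    print(accuracy_score(y_true, y_pred))    # correct / all instances
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
    print(roc_auc_score(y_true, y_score))    # area under the ROC curve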
Workflow > Model > Evaluation > Regression

Measures of how far the predictions are from the ground truth:

o Mean Absolute Error (MAE)
Average absolute difference between predicted and actual values. It provides an understanding of the model's average prediction error.

o Mean Squared Error (MSE)
Average squared difference between predicted and actual values. MSE penalizes large errors more than MAE.

o R-squared (R2)
Proportion of the variance in the dependent variable explained by the
model. It ranges from 0 to 1, with higher values indicating a better fit.
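
A minimal scikit-learn sketch of the regression measures on hypothetical predictions:

    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.8, 5.4, 2.0, 8.0]

    print(mean_absolute_error(y_true, y_pred))  # MAE: average |error|
    print(mean_squared_error(y_true, y_pred))   # MSE: average squared error
    print(r2_score(y_true, y_pred))             # R^2: proportion of variance explained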
Workflow > Model > Refinement

Refine the hyperparameters of the model to maximize performance:

o Grid Search:
Systematically evaluates all possible combinations of hyperparameters to
find the best set.

o Bayesian Optimization:
Probabilistic model-based optimization technique that uses a surrogate model of the objective to guide the search for optimal hyperparameters efficiently.

o Genetic Algorithms:
Use populations of hyperparameter configurations and evolve them over
generations, selecting and mutating the best-performing configurations.
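
A minimal sketch of grid search with cross-validation in scikit-learn (the estimator and parameter grid are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Every combination of C and kernel is trained and scored by cross-validation
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)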
Workflow > Model > Deployment

Deploy the trained model to make predictions:

o Save model:
Save model weights.

o Infrastructure:
Hardware on which to perform inference efficiently.

o Method of delivery:
Batch, API endpoint, …
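
A minimal sketch of saving a trained model and reloading it for inference with joblib (the model and file name are illustrative):

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Save the fitted model (parameters/weights) to disk
    joblib.dump(model, "model.joblib")

    # Later, in the serving environment: load once, then predict in batch or behind an API
    loaded = joblib.load("model.joblib")
    print(loaded.predict(X[:5]))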
