
Machine Learning Fundamentals

XGBoost
XGBoost may be all you need
How to parallelize GBM
How to build better trees
More information about XGBoost
Interesting Github repositories
Similar models
Youtube videos
Deep Neural Networks
The computational units
The Loss functions
The Activation functions
The different kinds of Embeddings
Additional information about Embeddings
Github repositories
Articles
Youtube videos
Explainable AI
LIME
SHAP
Additional content about Explainable AI
Github Repositories
Explainable AI explained
Youtube videos
Transformers
The history of Transformers
Transformers: Attention is all you need!
Recurrent Network VS Bahdanau Attention VS Self-attention
Additional information about Attentions
Github Repositories
Articles
Youtube Videos
The different applications of Transformers
More contents about Transformers
Interesting Github repos
Interesting Medium-like articles
Youtube videos
Large Language Models
History of Natural Language Processing
NLP metrics: quantifying human-like language performance

The GPT-3 Family
GPT-1 vs GPT-2 vs GPT-3
How to train ChatGPT
ChatGPT vs GPT-3
ChatGPT’s competitors
GPT-4: The largest Large Language Model yet!
The Architecture
The Training
The number of parameters
When the creators start to fear their creation
Diffusion Models
What is a diffusion model?
What is Stable Diffusion?
Stable Diffusion vs DALLE-2 vs Imagen
But how do you generate those cool animations?
How to measure similarity between Text and Image data?
Let’s generate music with text prompts
Additional information about Diffusion models
Github repositories
Articles
Youtube videos
Machine Learning System Design
Data Parallelism
Model parallelism
Why is a TPU faster than a GPU?
The Hashing trick: how to build a recommender engine for billions of users
How to Architect a search engine like Google Search
The overall architecture
The role of PageRank
The Offline Metrics
The online experiments
MLOps
The different Deployment patterns
How to test your model in Production
The different Inference pipeline architectures
The Feature Store: The Data Throne
Continuous Integration - Deployment - Training (CI / CD / CT)
Testing and monitoring

Machine Learning Fundamentals

XGBoost

XGBoost may be all you need

I never use Linear Regression! I am not saying you shouldn’t, I personally just never do. Actually there
are many algorithms I never use: Logistic Regression, Naive Bayes, SVM, LDA, KNN, Feed Forward
Neural Network,… I just don’t find value in those for my work.

I always say you should start any Machine Learning model development with a simple algorithm, but for
me Logistic Regression (LogReg) or Linear Regression (LR) are not simple! The amount of feature
engineering needed to get a performant model out of those algorithms is just too high for me. Used out of
the box, those algorithms do not provide a useful minimum baseline to work from: if I observe low
predictive performance on an LR or LogReg with zero feature engineering, that is not informative enough
for me to assess the predictive power of the underlying data.

For me the “simplest” model to use is XGBoost (this includes LightGBM or CatBoost of course). Well at
least for tabular data. XGBoost natively handles missing values, categorical variables (to some extent!), it
is parallelizable (so fast…enough), it trains on any loss function I may need, it tends to be robust to
overfitting on a large number of features, and it is highly non-linear which makes it a very low bias
algorithm. Its API is pretty complete, and you can achieve a lot by just changing a couple of arguments. I
can easily throw any data into XGBoost without any work and that gives me a useful baseline that can
drive my development from there.
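
To make that concrete, here is a minimal sketch of such a zero-effort baseline using the scikit-learn
wrapper shipped with the xgboost package (the dataset below is synthetic and only stands in for your own
tabular data):

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Stand-in for a raw tabular dataset; with real data, missing values (NaN)
# are handled natively by XGBoost, so no imputation is needed for a baseline.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"baseline AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

The cross-validated score obtained this way is the minimum baseline that any further feature engineering
or model choice has to beat.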

Algorithms like Linear Regression have their number of degrees of freedom (d.o.f. - complexity) scaling
with the number of features O(M). In practice, this means that their ability to learn from the data will
plateau in the regime N >> M where N is the number of samples (typically large data sets). They have a
high bias but a low variance and as such they are well adapted to the N > M regime. In the N < M regime,
L1 regularization becomes necessary to learn the relevant features and zero-out the noise (think about
having more unknowns than equations when solving a set of linear equations). Naive Bayes d.o.f. scales as
O(C x M) (or O(M) depending on the implementation) where C is the number of categories the features are
discretized into. O(C) = O(N) in theory but not really in practice. This makes it a lower bias algorithm than
LR, but it is a product ensemble of univariate models and it ignores feature interactions (as LR does),
preventing it from further improvements.

A tree in its unregularized form is a low bias algorithm (you can overfit the data to death), with d.o.f.
scaling as O(N), but it has high variance (deep trees don't generalize well). Because a tree can reduce its
complexity as much as needed, it can work in the regime N < M by simply selecting the necessary
features. A Random Forest is therefore a low bias algorithm, but the ensemble averages away the variance
(though deeper trees call for more trees) and it doesn't overfit on the number of trees (Theorem 1.2 in the
classic paper by Leo Breiman), so it is a lower variance algorithm. The homogeneous learning (the trees
tend to be similar) tends to limit its ability to learn from very large amounts of data.

XGBoost is the first (to my knowledge) tree algorithm to mathematically formalize regularization in a tree
(eq. 2 in the XGBoost paper). It is a low bias, high variance algorithm (due to the boosting mechanism) and
is therefore well adapted to large data scales. The GBM boosting mechanism ensures a more heterogeneous
learning than Random Forest and therefore adapts better to larger scales. The optimal regularization ensures
higher quality trees as weak learners than in Random Forest and tends to make XGBoost more robust to
overfitting than Random Forest.

In the regime N >> M, only low bias algorithms make sense with d.o.f. scaling as O(N). That includes
algorithms like GBM, RF, Neural Networks, SVM (gaussian), KNN,… SVM has a training time
complexity of O(N^3) (unmanageable!) and KNN is bad at understanding what is an important feature
and has dimensionality errors scaling as O(M). Neural Networks are known to underperform compared to
XGBoost on tabular data ("Deep Learning is not all you need").

So, if you are working on large tabular data, XGBoost MAY be all you need! But make sure to prove it to
yourself. The No-Free-Lunch Theorem doesn't mean we cannot understand our algorithms and build an
intuition about the use cases where they work best!

How to parallelize GBM


Why did XGBoost become so popular? Two aspects made XGBoost successful back in 2014. The first one
is the regularized learning objective that allows for better pruning of the trees. The second one is the ability
to distribute the Gradient Boosting learning process across multiple threads or machines, allowing it to
handle larger scales of data. Boosting algorithms have been known to perform very well on most large data
sets, but the iterative nature of boosting makes them painfully slow!

How do you parallelize a boosting algorithm then? In the case of Random Forest, it is easy: you just
distribute the data across threads, build independent trees there, and average the resulting tree predictions.
In the case of an iterative process like boosting, you need to parallelize the tree building itself. It all comes
down to how you find an optimal split in a tree: for each feature, sort the data and linearly scan the feature
to find the best split. If you have N samples and M features, it is O(NM log(N)) time complexity at each
node. In pseudocode:

best_split = None
for feature in features:
    # sort the samples by this feature's value, then scan the thresholds linearly
    for sample in sorted(samples, key=lambda s: s[feature]):
        candidate_split = evaluate_split(feature, sample)   # score the split at this sample's value
        if best_split is None or candidate_split.gain > best_split.gain:
            best_split = candidate_split

So you can parallelize split search by scanning each feature independently and reducing the resulting
splits to the optimal one.
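
As an illustration only (this uses a toy split score, not XGBoost's actual criterion or implementation), the
per-feature scan and the final reduction could look like this:

from concurrent.futures import ThreadPoolExecutor

import numpy as np

def best_split_for_feature(X, grad, feature):
    # Sort by this feature once, then scan thresholds linearly with a toy split score.
    order = np.argsort(X[:, feature])
    g = grad[order]
    total, left = g.sum(), 0.0
    best_gain, best_threshold = -np.inf, None
    for i in range(1, len(g)):
        left += g[i - 1]
        right = total - left
        gain = left**2 / i + right**2 / (len(g) - i)   # simplified gain, for illustration only
        if gain > best_gain:
            best_gain, best_threshold = gain, X[order[i], feature]
    return best_gain, feature, best_threshold

def best_split(X, grad):
    # Scan each feature independently (in parallel), then reduce to the single best split.
    with ThreadPoolExecutor() as pool:
        candidates = pool.map(lambda f: best_split_for_feature(X, grad, f), range(X.shape[1]))
    return max(candidates, key=lambda c: c[0])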

XGBoost was not the first attempt to parallelize GBM, but its authors used a series of tricks that made it
very efficient:
● First, all the columns are pre-sorted while keeping a pointer to the original index of the entry. This
removes the need to sort the feature at every search.
● They used a Compressed Column Format for a more efficient distribution of the data.
● They used a cache-aware prefetching algorithm to minimize non-contiguous memory access that
results from the pre-sorting step.
● Not directly related to parallelization, but they came up with an approximate split search algorithm
that speeds up tree building further.

As of today, you can train XGBoost across cores on the same machine, but also on AWS YARN,
Kubernetes, Spark, and GPU and you can use Dask or Ray to do it (XGBoost documentation).

One thing to look out for is that there is a limit to how much XGBoost can be parallelized. With too many
threads, the data communication between threads becomes a bottleneck and the training speed plateaus.
Here is an example explaining that effect: “How to Best Tune Multithreading Support for XGBoost in
Python”.
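
Here is a quick sketch to observe that plateau yourself (using the scikit-learn wrapper and synthetic data;
the exact thread counts are illustrative and depend on your machine):

import time

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

for n_jobs in (1, 2, 4, 8, 16):
    model = xgb.XGBClassifier(n_estimators=100, n_jobs=n_jobs, tree_method="hist")
    start = time.perf_counter()
    model.fit(X, y)
    # Past a certain thread count the speed-up flattens out (communication overhead).
    print(f"n_jobs={n_jobs:>2}: {time.perf_counter() - start:.1f}s")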

Also, you should read the XGBoost article: “XGBoost: A Scalable Tree Boosting System”!

How to build better trees
Why does XGBoost perform better than typical Gradient Boosted Machines (GBM)? They simply don't
optimize the same objective function! GBM tries to minimize the loss function as measured on the training
data, whereas XGBoost also takes into consideration the complexity of the trees, making XGBoost much
more robust to overfitting.

When GBM considers a split in a tree, it asks itself the question: what split will get me better predictive
performance as measured on the training data? Trying to maximize performance on the training data will
simply not generalize very well to unseen data and this leads to overfitting. GBM has other
hyperparameters you can use to regularize your trees but the optimization process itself is not well tuned
for generalization. You can look at equation 6 of the original GBM paper for this: “Stochastic Gradient
Boosting“.

On the other hand, XGBoost includes a regularization term in the objective function that penalizes
complex trees. When evaluating a split, it looks not only at the pure decrease of the loss (its gradient, the
first derivative of the loss) but also at the increase in complexity (the curvature, the second derivative of the
loss). Typically, decreases in the loss reduce underfitting while increases in complexity lead to more
overfitting. XGBoost uses the hyperparameter "lambda" to adjust how much weight is put on the gradient
versus the curvature. You can look at equation 6 in the XGBoost paper: “XGBoost: A Scalable Tree
Boosting System”. On a side note, it is quite a coincidence that "equation 6" in both the GBM paper and the
XGBoost paper describes their respective objective function, isn't it?!
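
For reference, the regularized split gain that XGBoost maximizes (in the notation of the XGBoost paper,
with G and H the sums of first and second derivatives of the loss over the samples falling in the left and
right children, lambda the regularization strength, and gamma the per-leaf penalty) is:

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

A split is only kept if this quantity is positive, which is how the curvature and the regularization terms
prune overly complex trees.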

This is obviously very reminiscent of the Bias-Variance tradeoff: the gradient plays a role similar to the
bias, and the curvature a role similar to the variance. Even their analytical expressions look similar. While
they tend to capture similar information, they are most likely not exactly equal: bias and variance are
statistical quantities while gradient and curvature are functional ones. Despite those differences, this is one
of the first times that those concepts have been mathematically formalized in a tree-based algorithm as part
of the objective function, such that the trees learned by XGBoost are optimal ones with respect to the
Bias-Variance tradeoff problem.

Despite the beauty of this objective function, the choice made on the regularization term is quite an
ad-hoc one and I am pretty sure we could construct even better ones!

More information about XGBoost

Interesting Github repositories

● Awesome Gradient Boosting Research Papers: A curated list of gradient and adaptive boosting
papers with implementation.
● AutoXGB: auto train XGBoost directly from CSV files, auto tune XGBoost using Optuna, auto
serve the best XGBoost model using FastAPI.
● xgboostExplainer: An R package that makes XGBoost models fully interpretable.
● XGBoostLSS: An extension of XGBoost to probabilistic forecasting.
● XGBoost.jl: XGBoost Julia Package.
● Data-Science-Competitions: The goal of this repo is to provide solutions to Data Science
competitions (Kaggle, Data Hack, Machine Hack, Driven Data, etc.).

Similar models

● LightGBM: LightGBM is a gradient boosting framework that uses tree based learning algorithms.
It is designed to be distributed and efficient with the following advantages:
○ Faster training speed and higher efficiency.
○ Lower memory usage.
○ Better accuracy.
○ Support of parallel, distributed, and GPU learning.
○ Capable of handling large-scale data.
● CatBoost: CatBoost is an algorithm for gradient boosting on decision trees:
○ Reduce time spent on parameter tuning, because CatBoost provides great results with
default parameters
○ Improve your training results with CatBoost that allows you to use non-numeric factors,
instead of having to pre-process your data or spend time and effort turning it to numbers.
○ Train your model on a fast implementation of gradient-boosting algorithm for GPU. Use
a multi-card configuration for large datasets.
○ Reduce overfitting when constructing your models with a novel gradient-boosting
scheme.
○ Apply your trained model quickly and efficiently even to latency-critical tasks using
CatBoost's model applier
● H2O.ai’s GBM: H2O’s GBM sequentially builds regression trees on all the features of the dataset
in a fully distributed way - each tree is built in parallel.

Youtube videos
● XGBoost series by StatQuest with Josh Starmer: XGBoost Part 1 (of 4): Regression
● XGBoost by Andrew Ng: 7.12 Tree ensembles | XGBoost--[Machine Learning | Andrew Ng]
● XGBoost: How it works, with an example by Sundog Education with Frank Kane:
XGBoost: How it works, with an example.

Deep Neural Networks

The computational units

Deep Learning requires much more of an ARCHITECT mindset than traditional Machine Learning. In a
sense, part of the feature engineering work has been moved to the design of very specialized
computational blocks using smaller units (LSTM, Convolutional, embedding, Fully connected, …). I
would always advise starting with a simple network when architecting a model, so that you can build your
intuition. Jumping right away into a Transformer model may not be the best way to start.

Most of the advancements in Machine Learning for the past 10 years come from a smart rearrangement of
the simple units presented here. Obviously, I am omitting activation functions and some others here but
you get the idea.

A convolutional layer is meant to learn local correlations. Multiple successive blocks of convolutional and
pooling layers allow one to learn correlations at multiple scales, and they can be used on image data
(conv2d), text data (text is just a time series of categorical variables), or time series (conv1d). You can
encode text data using an embedding followed by a couple of conv1d layers, and you can encode a time
series using a series of conv1d and pooling layers.
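
As a minimal sketch of that "embedding followed by a couple of conv1d layers" idea (PyTorch, with
illustrative vocabulary and layer sizes):

import torch
import torch.nn as nn

class TextConvEncoder(nn.Module):
    # Embedding + two conv1d blocks + pooling: a small text encoder.
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),   # pool over the sequence dimension
        )

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        return self.conv(x).squeeze(-1)            # (batch, hidden)

encoder = TextConvEncoder()
print(encoder(torch.randint(0, 10_000, (4, 50))).shape)   # 4 sentences of 50 tokens -> [4, 256]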

I advise against using LSTM layers when possible. The iterative computation doesn't allow for good
parallelism, leading to very slow training (even with the CUDA LSTM). For text and time series, ConvNets
are much faster to train as they make use of matrix computation parallelism, and they tend to perform on
par with LSTM networks (http://www.iro.umontreal.ca/~lisa/pointeurs/handbook-convo.pdf). One reason
Transformers became the leading block unit for text learning tasks is their superior parallelism capability
compared to LSTMs, allowing for realistically much bigger training data sets.

Here are a couple of dates to understand the DL timeline:


● (1989) Convolution layer and average pooling: “Handwritten Digit Recognition with a
Back-Propagation Network”.
● (1997) LSTM layer: “LONG SHORT-TERM MEMORY”
● (2003) Embedding layer: “Quick Training of Probabilistic Neural Nets by Importance Sampling“.
● (2007) Max Pooling: “Sparse Feature Learning for Deep Belief Networks“.
● (2012) Feature dropout: “Improving neural networks by preventing co-adaptation of feature
detectors“.
● (2012) Transfer learning: “Deep Learning of Representations for Unsupervised and Transfer
Learning“.
● (2013) Word2Vec Embedding: “Efficient Estimation of Word Representations in Vector Space“.
● (2013) Maxout network: “Maxout Networks“.
● (2014) GRU layer: “On the Properties of Neural Machine Translation: Encoder–Decoder
Approaches“.
● (2014) Dropout layers: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting“.
● (2014) GloVe Embedding: “GloVe: Global Vectors for Word Representation“.
● (2015) Batch normalization: “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift“.
● (2016) Layer normalization: “Layer Normalization“.
● (2016) Instance Normalization: “Instance Normalization: The Missing Ingredient for Fast
Stylization“.
● (2017) Self Attention layer and transformers: “Attention Is All You Need“.
● (2018) Group Normalization: “Group Normalization“.

The Loss functions

Your Machine Learning model LOSS is your GAIN! At least if you choose the right one! Unfortunately,
there are even more loss functions available than redundant articles about ChatGPT.

When it comes to regression problems, the Huber and Smooth L1 losses are the best of both worlds
between MSE and MAE: they are differentiable at 0 and they limit the weight of outliers for large residuals.
The LogCosh loss has the same advantage, with a shape similar to the Huber loss. The mean absolute
percentage error and the mean squared logarithmic error greatly mitigate the effect of outliers. Poisson
regression is widely used for count targets, which can only be positive.

For classification problems, cross-entropy loss tends to be king. I find the focal cross-entropy to be quite
interesting, as it gives more weight to the samples where the model is less confident, putting more focus on
the "hard" samples to classify. The KL divergence is another information theoretic metric, and I would
assume it is less stable than cross-entropy in small batches due to the more fluctuating averages of the logs.
The hinge loss is the original loss of the SVM algorithm. The squared hinge is simply the square of the
hinge loss, and the soft margin is a softer, differentiable version of it.
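
A quick way to feel the difference between these losses is to evaluate them on the same predictions (a
small PyTorch sketch with made-up numbers):

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 10.0])    # the last prediction is way off (an outlier residual)
target = torch.tensor([3.0, -0.5, 1.0])

mse = nn.MSELoss()(pred, target)               # quadratic penalty: the outlier dominates
huber = nn.HuberLoss(delta=1.0)(pred, target)  # linear beyond delta: the outlier weighs less
print(mse.item(), huber.item())

logits = torch.tensor([[2.0, 0.5, -1.0]])      # unnormalized class scores
label = torch.tensor([0])
ce = nn.CrossEntropyLoss()(logits, label)      # softmax + negative log-likelihood in one op
print(ce.item())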

Ranking losses tend to be extensions of the pointwise ones, penalizing the losses when 2 samples are
misaligned compared to the ground truth. The margin ranking, the soft pairwise hinge and the pairwise
logistic losses are extensions of the hinge losses. However, ranking loss functions are painfully slow to
compute as the time complexity is O(N^2) where N is the number of samples within a batch.

Contrastive learning is a very simple way to learn aligned semantic representations of multimodal data.
For example, triplet margin loss was used in Facenet (https://arxiv.org/pdf/1503.03832.pdf) and cosine
embedding loss in CLIP (https://arxiv.org/pdf/2103.00020.pdf). The hinge embedding loss is similar but
we replace the cosine similarity with the Euclidean distance.

Deep Learning had a profound effect on Reinforcement Learning, allowing us to train models with high
state and action dimensionalities. For Q-learning, the loss can simply take the form of the MSE for the
residuals of the Bellman equation. In the case of Policy gradient, the loss is the cross-entropy of the action
probabilities weighted by the Q-value.
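
In standard notation (a sketch of the usual formulations rather than any specific implementation), the
Q-learning loss is the mean squared Bellman residual

L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2\right]

and the policy gradient loss is the Q-weighted cross-entropy of the action probabilities

L(\theta) = -\,\mathbb{E}\left[\log \pi_\theta(a \mid s)\, Q(s, a)\right]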

And those are just a small subset of what exists. To get a sense of what is out there, a simple approach is
to take a look at the PyTorch and TensorFlow documentation. These lecture notes are worth a read: “A
Comprehensive Survey of Loss Functions in Machine Learning“.

The Activation functions

If you are looking for the "best" ACTIVATION function, be ready to spend some time looking for it
because there are hundreds of them! Fortunately, you can safely cross-validate just a couple of them to
find the right one. There are a few classes of activation functions (AF) to look out for:

● The Sigmoid and Tanh based - Those activation functions were widely used prior to ReLU.
People were very comfortable with those as they were reminiscent of Logistic Regression and
they are differentiable. The problem with those is that, being squeezed between [0, 1] or [-1, 1],
we had a hard time training deep networks as the gradient tends to vanish.
● Rectified Linear Unit (ReLU) back in 2011 changed the game when it comes to activation
functions. I believe it became very fashionable after AlexNet won the ImageNet competition in
2012. We could train deeper models but the gradient would still die for negative numbers due to
the zeroing in x < 0. Numerous AF were created to address this problem such as LeakyReLU and
PReLU.
● Exponential AF such as ELU sped up learning by bringing the normal gradient closer to the unit
natural gradient because of a reduced bias shift effect. They were solving for the vanishing
gradient problem as well.
● More recent activation functions use learnable parameters such as Swish and Mish. Those
adaptive activation functions allow for different neurons to learn different activation functions for
richer learning while adding parametric complexity to the networks.
● The class of Gated Linear Unit (GLU) has been studied quite a bit in NLP architectures and they
control what information is passed up to the following layer using gates similar to the ones found
in LSTMs. For example Google's PaLM model is trained with a SwiGLU activation.

Here is a nice review of many activation functions with some experimental comparisons: “Activation
Functions in Deep Learning: A Comprehensive Survey and Benchmark“. Looking at the PyTorch API and
the TensorFlow API can also give a good sense of what are the commonly used ones.
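
Most of the activation functions mentioned above ship with PyTorch, so cross-validating a handful of
them is cheap; a minimal sketch:

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

activations = {
    "sigmoid": nn.Sigmoid(),          # squashed to (0, 1): gradients vanish for large |x|
    "tanh": nn.Tanh(),                # squashed to (-1, 1)
    "relu": nn.ReLU(),                # zero gradient for x < 0 ("dying ReLU")
    "leaky_relu": nn.LeakyReLU(0.01),
    "elu": nn.ELU(),
    "swish": nn.SiLU(),               # SiLU is Swish with beta = 1
    "mish": nn.Mish(),
    "gelu": nn.GELU(),
}
for name, act in activations.items():
    print(name, [round(v, 2) for v in act(x).tolist()])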

The different kinds of Embeddings

I think Deep Learning finds its strength in its ability to model efficiently with different types of data at
once. It is trivial to build models from multimodal datasets nowadays. It is not a new concept though, nor
was it impossible to do it prior to the advent of Deep Learning, but the level of complexity of feature
processing and modeling was much higher with much lower performance levels!

One key aspect of this success is the concept of Embedding: a lower dimensionality representation of the
data. This makes it possible to perform efficient computations while minimizing the effect of the curse of
dimensionality, providing more robust representations when it comes to overfitting. In practice, this is just
a vector living in a "latent" or "semantic" space.
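
In PyTorch this is literally a lookup table mapping a sparse ID to a dense trainable vector (the sizes below
are illustrative):

import torch
import torch.nn as nn

# 50,000 possible user IDs, each mapped to a 32-dimensional dense vector
user_embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=32)

user_ids = torch.tensor([3, 17, 42_001])   # sparse categorical IDs
vectors = user_embedding(user_ids)         # their representations in the latent space
print(vectors.shape)                       # torch.Size([3, 32])

# Similarity in the latent space is now a cheap vector operation:
print(nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0).item())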

The first great success of embedding for word encoding was Word2Vec back in 2013 (“Efficient
Estimation of Word Representations in Vector Space”) and later GloVe in 2014 (“GloVe: Global Vectors
for Word Representation”). Since AlexNet back in 2012 (“ImageNet Classification with Deep
Convolutional Neural Networks”), many Convolutional network architectures (VGG16 (2014), ResNet
(2015), Inception (2014), …) were used as feature extractors for images. As of 2018 starting with BERT,
Transformer architectures have been used quite a bit to extract semantic representations from sentences.

One domain where embeddings changed everything is recommender engines. It all started with Latent
Matrix Factorization methods made popular during the Netflix movie recommendation competition in
2009. The idea is to have a vector representation for each user and product and use that as base features.
In fact, any sparse feature could be encoded within an embedding vector and modern recommender
engines typically use hundreds of embedding matrices for different categorical variables.

Dimensionality reduction is by all accounts not a new concept in Unsupervised Learning! PCA for
example dates back to 1901 (“On lines and planes of closest fit to systems of points in space“), the
concept of Autoencoder was introduced in 1986, and the variational Autoencoders (VAE) were introduced
in 2013 (“Auto-Encoding Variational Bayes“). For example, VAE is a key component of Stable Diffusion.
The typical difficulty in Machine Learning is obtaining labeled data. Self-supervised learning
techniques like Word2Vec, Autoencoders, and generative language models allow us to build powerful latent
representations of the data at low cost. Meta recently came out with Data2Vec 2.0 to learn latent
representations of any data modality using self-supervised learning (“Efficient Self-supervised Learning
with Contextualized Target Representations for Vision, Speech and Language“).

Besides learning latent representations, a lot of work is being done to learn aligned representations
between different modalities. For example, CLIP (“Learning Transferable Visual Models From Natural
Language Supervision“) is a recent contrastive learning method to learn semantically aligned
representations between text and image data.

Additional information about Embeddings

Github repositories
● TensorFlow Feature Extractor: This is a convenient wrapper for feature extraction or
classification in TensorFlow. Given well known pre-trained models on ImageNet, the extractor
runs over a list or directory of images.
● Data2vec 2.0: data2vec 2.0 improves the training efficiency of the original data2vec algorithm.
Data2vec is a framework for self-supervised representation learning for images, speech, and text
as described in data2vec: A General Framework for Self-supervised Learning in Speech, Vision
and Language (Baevski et al., 2022). The algorithm uses the same learning mechanism for
different modalities.
● Deep Learning Recommendation Model for Personalization and Recommendation Systems: An
implementation of a deep learning recommendation model (DLRM).
● Variational Autoencoder in tensorflow and pytorch: Reference implementation for a variational
autoencoder in TensorFlow and PyTorch.
● Sent2Vec - How to Compute Sentence Embedding Fast and Flexible: The sentence embedding is
an important step of various NLP tasks such as sentiment analysis and summarization. A flexible
sentence embedding library is needed to prototype fast and contextualized. The open-source
sent2vec Python package gives you the opportunity to do so.

Articles
● Feature Extraction with BERT for Text Classification: Extract information from a pretrained
model using Pytorch and Hugging Face
● Keras: Feature extraction on large datasets with Deep Learning: In this tutorial, you will learn
how to use Keras for feature extraction on image datasets too big to fit into memory. You’ll utilize
ResNet-50 (pre-trained on ImageNet) to extract features from a large image dataset, and then use
incremental learning to train a classifier on top of the extracted features.
● The Secret Sauce of Tik-Tok’s Recommendations: Dive into the inner workings of TikTok’s
awesome real-time recommendation system and learn what makes it one of the best in the field!
● Vectors are over, hashes are the future of AI: Recent advances show for certain AI applications
this can actually be drastically outperformed (memory, speed, etc) by other binary representations
(such as neural hashes) without significant accuracy trade off.
● Introduction to autoencoders: Autoencoders are an unsupervised learning technique in which we
leverage neural networks for the task of representation learning. Specifically, we'll design a neural
network architecture such that we impose a bottleneck in the network which forces a compressed
knowledge representation of the original input.

Youtube videos
● How Recommender Systems Work (Netflix/Amazon) by Art of the Problem:
How Recommender Systems Work (Netflix/Amazon)
● Recommendation systems overview (Building recommendation systems with TensorFlow) by
TensorFlow: Recommendation systems overview (Building recommendation systems with …
● Feature Extraction With TorchVision's Newest Utility by Alex-AI:
Feature Extraction With TorchVision's Newest Utility
● Representations for Language: From Word Embeddings to Sentence Meanings by Simons
Institute: Representations for Language: From Word Embeddings to Sentence Meanings
● BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper
Explained) by Yannic Kilcher:
BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper …

Explainable AI

LIME

The first time I heard about LIME, I was quite surprised by how recent this technique was (2016)! It is so
intuitive that I couldn't believe that nobody really thought about it before. Well, it is easy to be surprised
after the fact! It is very reminiscent of Partial Dependence plots or Individual Conditional Expectation
(ICE) plots, but instead of looking at the global contributions of the different features, it provides local
explanations for each prediction.

LIME (Local Interpretable Model-agnostic Explanations) looks at a ML model as a black box and it is
trying to estimate the local variations of a prediction by perturbing the feature values of the specific data
instance. The process is as follows:

● Choose a data instance x with the prediction y you want to explain


● Sample multiple data points around the initial data point by perturbing the values of the features
● Take those new samples, and get the related inferences from our ML model
● We now have data points with features X' and predictions y'

⇒ Train a simple linear model on those data points and weigh the samples by how far they are from the
original data point x in the feature space (low weights for high distance and high weights for low
distance).

Linear models are readily interpretable. For example, if the fitted surrogate is of the form
y = w0 + w1x1 + w2x2 + … + wMxM, then w1x1 is the contribution to the prediction of the feature X1 for
the specific data instance, and a high value means a high contribution. So with this linear model we can
rank and quantify, in an additive manner, the contribution of each feature to the prediction of each instance,
and this is what we call "explanations" for the predictions.

LIME works a bit differently for different data types:


● For tabular data, we can perturb the feature by simply adding some small noise to the continuous
variables. For categorical variables, it is more delicate as the concept of distance is more
subjective. Another way to do it is to choose another value of the feature from the dataset.

● For text data, the features are usually the words or the tokens. The typical way to perturb the
features is to remove at random a few words from the original sentence. It is intuitive to think that
if we remove an important word, the predictions should change quite a bit.

● For image data, pixels are not really representative of what "matters" in an image. "Super-pixels"
are created by segmenting the image (clustering similar close pixels) and then serve as the main
features. We can turn on and off those new features by zeroing their values. By turning off a few
super-pixels, we effectively perturb the feature set enough to estimate which segments contribute
the most to the predictions.

Here is the original paper: “Why Should I Trust You?” Explaining the Predictions of Any Classifier. The
author of the paper wrote his own Python package: lime.
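
Here is a sketch of how that looks with the lime package on tabular data (synthetic data and placeholder
feature names, so treat it as an illustration rather than a recipe):

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)   # the "black box"

explainer = LimeTabularExplainer(
    X,
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    mode="classification",
)
# Explain one prediction: perturb around X[0] and fit a locally weighted linear surrogate.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # [(feature condition, signed contribution), ...]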

SHAP

SHAP is certainly one of the most used techniques for explainable AI these days but I think most people
don't know why. Some researchers had a huge impact on the history of Machine Learning, and most
people will never know about them.

SHAP (SHapley Additive exPlanations) is a framework that provides explanations of predictions as a sum
of the contributions of the underlying features used in the model. We have known about the Shapley value
since 1951 (“Note on the n-person game: The value of a n-person game“ by L. S. Shapley), and since then
people have tried to use it as a way to measure feature attributions in Machine Learning models, but it was
not until 2017 that a team from the University of Washington proposed a unified framework to apply it to
any ML model.

● Kernel SHAP is a black box method that builds on top of LIME. Let's say you want to explain a
specific prediction p with the related feature values x. The idea is to create many new samples
around x by replacing some of the values with others pulled at random from the data set, and to get
the model's predictions on those new samples. We can then use those samples and
predictions to train a linear model and use the fitted weights to understand the local contributions
of the different features. The difference between LIME and SHAP is the way the samples are
weighted in the MSE loss function: LIME uses a Gaussian kernel whereas SHAP uses the Shapley
weights.

● Tree SHAP is an exact and faster computation of those numbers that exploits the structure of
tree-based algorithms. In a tree, we can compute the exact predictions with a subset of the
features by skipping the removed features and averaging the predictions of the resulting subtrees.
We understand the contribution of a feature by measuring the variation of the predictions with and
without it. In 2019, the same team proposed an algorithm to explore all the feature contributions
of the feature power-set at once.

● Linear SHAP is the exact analytic simplification of the original formula for linear models. For a
model of the form y = w0 + w1x1 + … + wMxM, the contribution of the feature x1 is simply
w1(x1 - E[x1]), where E[x1] is the average value of that feature in the data.

● Deep SHAP is an application of DeepLIFT (“Learning Important Features Through Propagating
Activation Differences“) using the Shapley values as a measure of contribution. DeepLIFT is a
way to decompose the predictions of Neural Networks as a linear combination of contributions of
the underlying features. The idea is that we can backpropagate the contributions as we do the
gradient and perform a final linear approximation to obtain the contributions as a sum.

You can find the original SHAP papers here:


● “A Unified Approach to Interpreting Model Predictions”
● “Consistent Individualized Feature Attribution for Tree Ensembles“
For most people, SHAP is first and foremost a Python package, so make sure to check it out if you haven't:
SHAP.
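
A minimal sketch of the Tree SHAP workflow with the shap package (synthetic data, XGBoost as the tree
model, values shown only to illustrate the additivity property):

import shap
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
model = xgb.XGBRegressor(n_estimators=200).fit(X, y)

explainer = shap.TreeExplainer(model)     # exact, fast attributions using the tree structure
shap_values = explainer.shap_values(X)    # shape: (n_samples, n_features)

# The attributions are additive: base value + sum of SHAP values = model prediction.
print(explainer.expected_value + shap_values[0].sum(), model.predict(X[:1])[0])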

Additional content about Explainable AI

Github Repositories
● Awesome-explainable-AI: This repository contains the frontier research on explainable AI (XAI),
which is a hot topic these days.
● AI Explainability 360: The AI Explainability 360 toolkit is an open-source library that supports
interpretability and explainability of datasets and machine learning models.

● BCG-GAMMA FACET: FACET is an open source library for human-explainable AI. It combines
sophisticated model inspection and model-based simulation to enable better explanations of your
supervised machine learning models.
● OmniXAI: OmniXAI (short for Omni eXplainable AI) is a Python machine-learning library for
explainable AI (XAI), offering omni-way explainable AI and interpretable machine learning
capabilities to address many pain points in explaining decisions made by machine learning
models in practice.
● Adversarial Explainable AI: A curated list of Adversarial Explainable AI (A-XAI) resources,
inspired by awesome-adversarial-machine-learning and awesome-interpretable-machine-learning.

Explainable AI explained
● Explainable AI: Foundations, Applications, and Opportunities for Data Management Research
● Explainable AI in Industry (Tutorial)
● A Practical guide on Explainable AI Techniques applied on Biomedical use case applications
● Machine Learning Explainability
● Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Youtube videos
● Explainable AI by DeepFindr: Explainable AI explained! | #1 Introduction
● Introduction to Explainable AI (ML Tech Talks) by TensorFlow:
Introduction to Explainable AI (ML Tech Talks)
● Stanford Seminar - ML Explainability Part 1 I Overview and Motivation for Explainability by
Stanford online: Stanford Seminar - ML Explainability Part 1 I Overview and Motivation f…

Transformers

The history of Transformers


It all started with "Attention is all you need" back in 2017. But it really took off at the end of 2018 when
BERT by Google responded to the success of ELMo. 2019 was a great year to be alive! Each month, a new
Sesame Street character was popping out to outperform the previous models on NLP tasks! BERT was
really about language understanding, mostly encoding sentences into a latent space without specific NLP
tasks in mind.

There has been a lot of effort put into more specific applications. For example, XLM by Facebook marked
the advent of the use of transformers for cross-lingual language representation to encode and decode
sentences in any language. Have you heard of ChatGPT 🤣? It is a text generative transformer, and
GPT-1 by OpenAI is one of the earliest precursors in this category.

A lot of domain-specific models have also been built, as the base BERT models were not great for
specialized text data. For example, BioBERT is trained on biomedical texts, SciBERT on scientific texts,
and FinBERT on financial texts.

A lot of work has been put into modifying the original Transformer block (Transformer-XL, Switch), the
multi-head self-attention mechanism (Longformer, Big Bird), or the training efficiency (T5).

Transformers gained success in NLP tasks due to their superiority over LSTM networks, but they also
find applications in computer vision (Vision Transformer), speech recognition (Wav2vec 2.0), video
(TimeSformer), graphs, and reinforcement learning (Decision Transformer). They are also a building block
of current generative diffusion models such as Stable Diffusion and DALL-E 2.

To learn more about Transformers, I recommend "Transformers for Machine Learning: A Deep Dive". I
find that all the books written by those guys are actually really great! A lot of the historical survey done in
this post comes from that book.

By the way, if you don't see your "favorite" Transformer in this timeline, don't be mad, this is just a small
survey!

Transformers: Attention is all you need!


At the center of Transformers is the Attention mechanism, and once you get the intuition, it is not too
difficult to understand. Let me try to break it down!

As inputs to a transformer, we have a series of contiguous inputs, for example words (or tokens) in a
sentence. When it comes to contiguous inputs, it is not too difficult to see why time series, images, or
sound data could fit the bill as well.

We know that a word can be encoded as a vector in an embedding. We can also encode the position of that
word in the input sentence into a vector, and add it to the word vector. This way, the same word at a
different position in a sentence is encoded differently.

As part of the attention mechanism, we have 3 matrices Wq, Wk, and Wv that project each of the input
embedding vectors into 3 different vectors: the Query, the Key, and the Value. This jargon comes from
retrieval systems, but I don't find them particularly intuitive!

For each word, we take its Query vector and compute its dot products with the Key vectors of all the other
words. This gives us a sense of how similar the Queries and the Keys are, and that is the basis behind the
concept of "attention": how much attention should a word pay to another word of the input to understand
the meaning of the sentence? A Softmax transform normalizes the resulting vector and further accentuates
the high similarities.

This results in one vector of attention weights for each word. Those weights are then used to compute a
weighted sum of the Value vectors of all the words. We have now computed the self-attention!

Repeat this process multiple times, with separate projection matrices, so that you generate multiple
attentions: this gives you a multi-head attention layer. It helps diversify the learning of the possible
relationships between the words.
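
Here is a minimal single-head version in PyTorch (masking and the multi-head split are omitted on
purpose, so this is a sketch rather than a full Transformer layer):

import math

import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    # Project to Q, K, V, score with softmax(QK^T / sqrt(d)), then mix the Value vectors.
    def __init__(self, embed_dim=64):
        super().__init__()
        self.wq = nn.Linear(embed_dim, embed_dim, bias=False)
        self.wk = nn.Linear(embed_dim, embed_dim, bias=False)
        self.wv = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x):                    # x: (batch, seq_len, embed_dim)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, seq, seq)
        weights = scores.softmax(dim=-1)     # how much each word attends to every other word
        return weights @ v                   # weighted sum of the Value vectors

x = torch.randn(2, 5, 64)                    # 2 sentences of 5 token embeddings
print(SingleHeadSelfAttention()(x).shape)    # torch.Size([2, 5, 64])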

The original Transformer block is just an attention layer followed by a set of feed-forward layers, with a
couple of residual connections as found in ResNet, and layer normalizations. A "Transformer" model is
usually multiple Transformer blocks, one after the other.

Most language models follow this basic architecture. I hope this explanation helps people trying to get
into the field!

Recurrent Network VS Bahdanau Attention VS Self-attention


How did we arrive at the self-attention mechanism? Not too long ago, the state of the art models for
sequence-to-sequence learning tasks were Recurrent Neural Networks, but self-attention and Transformers
completely changed the game! Attention mechanisms are very good at capturing long-range dependencies
between words and are highly parallelizable, as opposed to RNN networks such as LSTM or GRU.

The sequence-to-sequence learning task is about learning to generate a sentence with another sentence as
input. This task is used for machine translation or question-answer type problems. For example if the
input sentence is "How are you doing?", we may want the model to generate "I am good" as an answer.

The 2014 Encoder-Decoder architecture with RNN (“Learning Phrase Representations using RNN
Encoder–Decoder for Statistical Machine Translation”) encodes the input sentence and the output
sentence separately. At inference time, the input sentence is fed to the encoder (["How", "are", "you",
"doing", "?"] ) and the model iteratively predicts the next word by using the previously predicted words
(["I", "am"]) that are fed to the decoder. Typically, the encoder is an LSTM layer and the resulting hidden
states are mapped into a final context vector that captures the semantics of the whole input sentence. It is
common to use the last LSTM hidden state as the context vector. Now we just need to predict the next
word by using that context vector and the hidden states of the decoder:

In the Encoder-Decoder architecture, the relationships between the input words and the output words are
blurred when computing the context vector. In 2014, Bahdanau proposed the attention mechanism
(“Neural Machine Translation by Jointly Learning to Align and Translate“) to capture the dependency
between input words and output words. Using those attention weights, the hidden states of the encoder are
summed in a weighted average as context vectors. This explicit computation of attention weights allows
us to give more weights to words that might be more relevant to the predictions. Again, LSTMs are used
for the encoder and decoder and they tend to capture short-range dependencies while attentions preserve
long-range ones.

The self-attention mechanism (“Attention Is All You Need”) proposed in 2017 was a way to replace
LSTMs for capturing intra-dependencies within the encoder and decoder. For self-attention, the hidden
states are projections through matrix multiplications (so easily parallelizable) of the embedding vectors
into the Key, Query and Value vectors and the self-attention weights of a sentence capture the
dependencies between the words within itself. The final outputs of the decoder and encoder are combined
and fed into a final encoder-decoder attention layer (similar to the Bahdanau Attention) to capture
dependencies across the input sentence and output sentence.

The ability to parallelize those networks led to more data and larger models with more attention layers
that captured even more refined relationships within the input data.

Additional information about Attentions


Github Repositories
● Attention Mechanisms: This repository includes custom layer implementations for a whole family
of attention mechanisms, compatible with TensorFlow and Keras integration.
● Attention RNNs in Keras: Implementation and visualization of a custom RNN layer with
attention in Keras for translating dates.
This repository comes with a tutorial found here:
https://medium.com/datalogue/attention-in-keras-1892773a4f22.
● Awesome-Attention-Mechanism-in-cv: This is a list of awesome attention mechanisms used in
computer vision, as well as a collection of plug and play modules.
● Tf RNN Attention: Tensorflow implementation of attention mechanism for text classification
tasks. Inspired by "Hierarchical Attention Networks for Document Classification", Zichao Yang
et al. (http://www.aclweb.org/anthology/N16-1174).
● Attention-mechanism-implementation: pytorch for Self-attention, Non-local Attention, Channel
domain attention methods and Hybrid domain attention methods.

Articles
● Cross-Attention in Transformer Architecture: Merge two embedding sequences regardless of
modality, e.g., image with text in Stable Diffusion U-Net.
● Understanding and Coding the Self-Attention Mechanism of Large Language Models From
Scratch: In this article, we are going to understand how self-attention works from scratch. This
means we will code it ourselves one step at a time.
● What’s the Difference Between Attention and Self-attention in Transformer Models?: “Attention”
is one of the key ideas in the transformer architecture. There are a lot of deep explanations
elsewhere so here we’d like to share tips on what you can say during an interview setting
● Attention and the Transformer: Course on deep learning and representation learning, focusing on
supervised and unsupervised deep learning, embedding methods, metric learning, convolutional
and recurrent nets, with applications to computer vision, natural language understanding, and
speech recognition by Yann LeCun & Alfredo Canziani.
● The Transformer Attention Mechanism: In this tutorial, you will discover the Transformer
attention mechanism for neural machine translation.

Youtube Videos
● Self / cross, hard / soft attention and the Transformer by Alfredo Canziani:
10 – Self / cross, hard / soft attention and the Transformer
● Cross Attention | Method Explanation | Math Explained by Outlier:
Cross Attention | Method Explanation | Math Explained
● Self-attention in deep learning (transformers) - Part 1 by AI Bites:
Self-attention in deep learning (transformers) - Part 1
● C5W3L07 Attention Model Intuition by DeepLearningAI:
C5W3L07 Attention Model Intuition
● Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman by Lex Clips:
Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman

The different applications of Transformers


If you think about Transformers, chances are you are thinking about NLP applications but how can we use
Transformers for data types other than text? Actually, you can use Transformers on any data that you are
able to express as a sequence of vectors, which is what Transformers feed on! Typically any sequence or
time series of data points should be able to fit the bill.

Let's consider image data for example. An image is not per se a sequence of data, but the local correlation
of the pixels sure resembles the concept. For the Vision Transformer (ViT), the folks at Google simply
created patches of an image that were flattened through linear transformations into a vector format. By
feeding images to Transformers through this process, they realized that typical CNNs perform better on
small amounts of data, but Transformers get better than CNNs when the scale of the data is very high.
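
The patch-to-vector step itself is easy to reproduce (a sketch with illustrative sizes, not the actual ViT
code):

import torch

batch, channels, size, patch = 2, 3, 224, 16
images = torch.randn(batch, channels, size, size)

# Cut each image into 16x16 patches and flatten every patch into one vector.
patches = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(batch, -1, channels * patch * patch)
print(patches.shape)   # torch.Size([2, 196, 768]): a "sentence" of 196 patch tokens

# A learned linear projection then maps each flattened patch to the model dimension.
tokens = torch.nn.Linear(channels * patch * patch, 512)(patches)   # (2, 196, 512)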

Time series are obviously good candidates for Transformers. For example, for the Temporal Fusion
Transformer, they transform the time series into the right-sized vectors through LSTM layers, as they say,
to capture the short-term correlations of the data, while the multi-head attention layers take care of
capturing the long-term correlations. They beat all the time series benchmarks with this model, but I
wonder how scalable it is with those LSTM layers! You can use it in PyTorch:
“TemporalFusionTransformer“.

Sequencing proteins seems to be an obvious application of Transformers considering the language
analogy of amino acid sequences. Here, you just need to have an amino acid embedding to capture the
semantic representation of protein unit tokens. Here is a Nature article on generating new proteins with
Transformers: “Large language models generate functional protein sequences across diverse families“ and
here is its bioRxiv version: “Deep neural language modeling enables functional protein generation
across families“.

Reinforcement Learning, expressed as a Markov chain of states, actions, and rewards, is another
good one. For the Decision Transformer, they encoded each state, action, and reward as a vector and
concatenated them into one final vector. For example, in the case of video games, a state can simply be the
image on the screen at a time t, and you extract its latent features with a Convolutional Network. An
action can be encoded with an embedding, and a scalar reward can be seen as a vector with one dimension.
Apparently, they beat all the benchmarks as well! You can find the code here: “Decision Transformer“.

Looking forward to seeing what Transformers are going to achieve in the coming years!

More contents about Transformers


Interesting Github repos
● Deep Learning Paper Implementations: This is a collection of simple PyTorch implementations of
neural networks and related algorithms. These implementations are documented with
explanations
● Vision Transformer - Pytorch: Implementation of Vision Transformer, a simple way to achieve
SOTA in vision classification with only a single transformer encoder, in Pytorch.
● GPT Neo: An implementation of model & data parallel GPT3-like models using the
mesh-tensorflow library.
● Haystack: Haystack is an open source NLP framework to interact with your data using
Transformer models and LLMs (GPT-3 and alike). Haystack offers production-ready tools to
quickly build ChatGPT-like question answering, semantic search, text generation, and more.
● PaLM + RLHF - Pytorch: Implementation of RLHF (Reinforcement Learning with Human
Feedback) on top of the PaLM architecture.
● SpeechBrain: SpeechBrain is an open-source and all-in-one conversational AI toolkit based on
PyTorch.
● DALL-E in Pytorch: Implementation / replication of DALL-E (paper), OpenAI's Text to Image
Transformer, in Pytorch. It will also contain CLIP for ranking the generations.

Interesting Medium-like articles


● What is a Transformer?: An Introduction to Transformers and Sequence-to-Sequence Learning
for Machine Learning.
● Transformer models: an introduction and catalog — 2023 Edition: a list of over 50 Transformers.
● Illustrated Guide to Transformers- Step by Step Explanation: In this post, we’ll focus on the one
paper that started it all, “Attention is all you need”.
● All you need to know about transformer model: What is a transformer model? Why do you need
to know it?
● The Illustrated Transformer: In this post, we will look at The Transformer – a model that uses
attention to boost the speed with which these models can be trained.
● What are transformers and how can you use them?: An introduction to the models that have
revolutionized natural language processing in the last few years.

Youtube videos
● What are Transformers (Machine Learning Model)? by IBM Technology:
What are Transformers (Machine Learning Model)?
● Transformers, explained: Understand the model behind GPT, BERT, and T5 by Google Cloud
Tech: Transformers, explained: Understand the model behind GPT, BERT, and T5
● Transformer Neural Networks - EXPLAINED! (Attention is all you need) by CodeEmporium:
Transformer Neural Networks - EXPLAINED! (Attention is all you need)
● Pytorch Transformers from Scratch (Attention is all you need) by Aladdin Persson:
Pytorch Transformers from Scratch (Attention is all you need)

Large Language Models

History of Natural Language Processing


It has been quite a journey to arrive at a ChatGPT model! It took some time before we thought about
modeling language as a probabilistic generative process. NLP studies the interactions between computers
and human language and it is also as old as computers themselves.

Warren Weaver was the first to suggest an algorithmic approach to machine translation (Weaver’s
memorandum) in 1949 and this led to the Georgetown experiment, the first computer application to
machine translation in 1954 (“The first public demonstration of machine translation: the
Georgetown-IBM system, 7th January 1954“). In 1957, Chomsky established the first grammar theory
(Syntactic Structures). ELIZA (1964) and SHRDLU (1968) can be considered to be the first
natural-language understanding computer programs.

The 60s and early 70s marked the era of grammar theories. Transformational-generative Grammar by
Chomsky in 1965, Case grammar by Fillmore in 1968 (“The Case for Case“) recognized the relationship
among the various elements of a sentence. In 1969, Collins proposed the semantic network (“Retrieval
time from semantic memory“), a knowledge structure that depicts how concepts are related to one another
and illustrates how they interconnect. In 1970, Augmented Transition Networks (“Transition Network
Grammars for Natural Language Analysis“) was a type of graph theoretic structure used in the operational
definition of formal languages. In 1972, Schank developed Conceptual Dependency Theory (“A
conceptual dependency parser for natural language”) to represent knowledge for natural language input
into computers.

During the 70s, the concept of conceptual ontologies became quite fashionable. Conceptual ontologies are
similar to knowledge graphs where concepts are linked to each other by how they are associated. You can
imagine generating sentences by following concepts paths in ontologies. The famous ones are MARGIE
(1975 - “MARGIE: Memory Analysis Response Generation, and Inference on English“), TaleSpin (1976 -
“TALE-SPIN, An Interactive Program that Writes Stories“), QUALM (1977- “The Process of Question
Answering“), SAM (1978 - “Computer Understanding of Newspaper Stories“), PAM (1978 - “PAM - A
Program That Infers Intentions“), Politics (1979 - “POLITICS: Automated Ideological Reasoning“) and
Plot Units (1981 - “Plot units and narrative summarization“).

The 80s showed a great period of success for symbolic methods. In 1983, Charniak proposed Passing
Markers (“Passing Markers: A Theory of Contextual Influence in Language Comprehension“), a
mechanism for resolving ambiguities in language comprehension by indicating the relationship between
adjacent words. In 1986, Riesbeck and Martin proposed Uniform Parsing (“Uniform Parsing and
Inferencing for Learning“), a new approach to natural language processing that combines parsing and
inferencing in a uniform framework for language learning. In 1987, Hirst proposed a new approach to
resolving ambiguity: Semantic Interpretation (“Semantic Interpretation and Ambiguity“).

The 90s saw the advent of statistical models for NLP. It was the beginning of thinking about language as a
probabilistic process. In 1989, Bahl proposed a tree-based method to predict the next word in a sentence
(“A tree-based statistical language model for natural language speech recognition“). IBM presented a
series of models for statistical machine translation (“The Mathematics of Statistical Machine Translation:
Parameter Estimation”). In 1990 Chitrao and Grishman demonstrated the potential of statistical parsing
techniques for processing messages (“Statistical Parsing of Messages“) and Brill et al introduced a
method for automatically inducing a part-of-speech tagger by training on a large corpus of text (“Tagging
an Unfamiliar Text With Minimal Human Supervision“). In 1991, Brown proposed a method for aligning
sentences in parallel corpora for machine translation applications (“Aligning Sentences in Parallel
Corpora“).

In 2003, Bengio proposed the first neural language model (“A Neural Probabilistic Language Model“), a
simple feed-forward model. In 2008, Collobert and Weston applied multi-task learning with ConvNet (“A
Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning“).
In 2011, Hinton built a generative text model with Recurrent Neural Networks (“Generating Text with
Recurrent Neural Networks“). In 2013, Mikolov introduced Word2Vec (“Efficient Estimation of Word
Representations in Vector Space”) which completely changed the way we approach NLP with NN. In
2014, Sutskever suggested a model for sequence-to-sequence learning (“Sequence to Sequence Learning
with Neural Networks“). In 2017, Vaswani gave us the Transformer architecture that led to a revolution in
model performance (“Attention Is All You Need“). In 2018, Devlin presented BERT (“BERT: Pre-training

of Deep Bidirectional Transformers for Language Understanding“) that popularized Transformers. And in
2022, we finally got to experience ChatGPT that completely changed the way the public perceived AI!

NLP metrics: quantifying human-like language performance


Have you read a paper on Large Language Models recently? "Our Perplexity metric is 20.5": That is great
but what does that mean?! I feel that more than any other domain, NLP has a range of peculiar metrics
that never seem to completely solve the problems they are trying to address. The evolution of NLP
metrics in recent years is commensurate with algorithm advancements and highlights our inability to
efficiently measure intelligence as a quantitative value.

For example, BLEU (2002 - “BLEU: a Method for Automatic Evaluation of Machine Translation“) is a
typical metric used to assess Machine Translation models by measuring the precision of n-grams found in
generated sentences compared to the reference sentences. ROUGE (2004 - “ROUGE: A Package for
Automatic Evaluation of Summaries“) is the same idea but looks at recall instead, and METEOR (2005 -
“METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human
Judgments“) is basically (not exactly) the harmonic mean (so F1 score) of BLEU and ROUGE.

As we started to build models that could automatically caption images, new metrics started to emerge. For
example, CIDEr (2015 - “CIDEr: Consensus-based Image Description Evaluation”) measures the cosine
similarity between TF-IDF encodings of proposed captions and reference ones. SPICE (2016 - “SPICE:

Semantic Propositional Image Caption Evaluation”) is another metric used to assess captioning quality by
looking at the precision and recall of graphical representations of sentences.

Recently, our text generation capability improved and metrics like the BERTScore (“BERTScore:
Evaluating Text Generation with BERT“) were created to assess the performance of language models by
measuring semantic similarities between generated sentences compared to reference sentences. The
perplexity is now a more fashionable metric: it measures how well a language model predicts the sentences
found in held-out test data (the lower the perplexity, the more confident the model).
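To make the number concrete, perplexity is just the exponential of the average negative log-likelihood per token. A minimal sketch (plain Python, toy numbers):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.
    token_log_probs: log-probabilities the model assigned to each observed token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns ~5% probability to every observed token is "as confused"
# as a uniform choice among ~20 tokens: its perplexity is ~20.
print(perplexity([math.log(0.05)] * 100))  # -> ~20.0
```

So a reported perplexity of 20.5 roughly means the model hesitates between ~20 equally likely tokens at each step.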

There has been a lot of work done in establishing benchmark datasets to measure the performance of LM
on multiple learning tasks in a standardized manner. One interesting thing is that LMs are starting to be

good at a wider range of tasks. GLUE (“GLUE: A Multi-Task Benchmark and Analysis Platform for
Natural Language Understanding“), SuperGLUE (“SuperGLUE: A Stickier Benchmark for
General-Purpose Language Understanding Systems“), and more recently, Big-Bench (“Beyond the
Imitation Game: Quantifying and extrapolating the capabilities of language models“) are good examples of
those. More and more, humans are involved to assess performance in a more qualitative manner.

One approach that I find interesting is the attempt to assess ML model performance from a more learning
theory perspective, without any test data. The assumption is that the weight distribution of the model
intrinsically informs us of its quality. The Heavy-Tailed Self-Regularization theory (“Implicit
Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications
for Learning“) is an example of such an approach and the resulting metric tends to correlate well with
typical NLP metrics: “Evaluating natural language processing models with generalization metrics that do
not need access to any training or testing data“. To follow!

The GPT-3 Family


There is more to GPT-3 than just GPT-3! In the OpenAI available API [1], GPT-3 represents a fleet of
different models that distinguish themselves by size, the data used and the training strategy. The core
GPT-3 model [2] is the source of all those derived models.

When it comes to size, OpenAI offers different models that balance the quality of Natural Language
Generation and inference speed. The models are labeled with the names of scientists or inventors:
● Davinci: 175B parameters
● Curie: 6.7B parameters
● Babbage: 1B parameters
● Cushman: 12B parameters

I am missing "Ada", which is supposed to be faster (so I guess smaller) but they don't document its size.
The models can be fine-tuned in a supervised learning manner with different datasets. The Codex family
is specifically designed to generate code by fine-tuning on public Github repos' data [3]. Most of the text
generation models (text-davinci-001, text-davinci-002, text-curie-001, text-babbage-001) are actually
GPT-3 models fine-tuned with human labeled data as well as with the distillation of the best completions

from all of their models. OpenAI actually described those models to be InstructGPT models [4], although
the training process is slightly different from the one described in the paper. Text-davinci-002 is
specifically described by OpenAI as being fine-tuned with text data from the Codex model
code-davinci-002, presumably performing well on both code and text data. Text-davinci-003 is a full
InstructGPT model as it is the text-davinci-002 model further refined with a Proximal Policy
Optimization algorithm (PPO) [5], a Reinforcement Learning algorithm.

The "GPT-3.5" label refers to models that have been trained on a blend of text and code from before Q4
2021 as opposed to October 2019 for the other models.
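As an illustration, those models were exposed through OpenAI's completion endpoint at the time; here is a minimal sketch using the legacy Python client (the client interface has since evolved, and the API key is a placeholder):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",    # swap for curie / babbage / ada to trade quality for speed
    prompt="Explain gradient boosting in one sentence.",
    max_tokens=64,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```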

OpenAI has been using GPT-3 for many specific applications. For example, they trained text and code
alignment models (text-similarity-davinci-001, text-similarity-curie-001) to learn embedding
representations of those data [6] in a similar manner to the CLIP model powering DALL-E 2 and Stable
Diffusion. They developed a model to summarize text with labeled data in a very similar manner to
InstructGPT [7]. They also provide a way to extract the latent representation provided by GPT-like
models (text-embedding-ada-002). And we know that ChatGPT is a sibling model to InstructGPT trained
from GPT-3.5 so it is probably using Text-davinci-003 as a seed.

GPT-1 vs GPT-2 vs GPT-3


It is actually trivial to build a GPT-3 model! ~100 lines of code would do it. Training that thing is another
story though! GPT-1, GPT-2 and GPT-3 are actually very similar in terms of architecture and differ mostly
in the training data and its size, the number of Transformer blocks, and the number of incoming tokens
(the context size).

● GPT-1 is mostly a set of 12 decoder Transformer blocks put one after the other. The text data is
encoded using a Byte pair encoding [8]. The position embedding is learned instead of the typical
static sinusoidal one [9]. The max length for consecutive tokens is 512. The top layer is simply a
softmax layer adapted to the specific learning task.
=> 117 million parameters [10].
● GPT-2 has basically the same architecture as GPT-1 but the biggest model contains 48
transformer blocks instead. The layer normalization is moved to the input of each sub-block, and an
additional normalization layer is added after the final block. The weights are initialized
slightly differently and the vocabulary size is increased. The number of consecutive tokens is
increased to 1024.
=> 1.5 billion parameters [11].
● GPT-3 has the same architecture as GPT-2 but the number of blocks increased to 96 in the bigger
model and the context size (number of consecutive tokens) increased to 2048. The multi-head
self-attention layers alternate between the typical dense ones and the sparse ones [12].
=> 175 billion parameters [2].
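To give a sense of why "~100 lines of code would do it", here is a condensed PyTorch sketch of one GPT-2/3-style pre-norm decoder block (the sizes are illustrative, and real implementations like minGPT add dropout, custom weight initialization and an explicit causal-attention module):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-2/3-style pre-norm Transformer decoder block (illustrative sizes)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True entries are positions a token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

# A full GPT is mostly token + learned position embeddings, a stack of such blocks
# (12 for GPT-1, up to 96 for GPT-3), a final LayerNorm and a linear head over the vocabulary.
```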

GPT-1 is trained in a self-supervised manner (learning to predict the next word in text data) and fine-tuned in
a supervised learning manner. GPT-2 is trained in a fully self-supervised way, focusing on zero-shot
transfer, and GPT-3 is pre-trained in a self-supervised manner, exploring a bit more the few-shot,
in-context learning setting.
● GPT-1 is pre-trained on the BooksCorpus dataset, containing ~7000 books amounting to ~5GB of
data: https://huggingface.co/datasets/bookcorpus.
● GPT-2 is pre-trained using the WebText dataset which is a more diverse set of internet data
containing ~8M documents for about ~40 GB of data:
https://huggingface.co/datasets/openwebtext
● GPT-3 uses filtered Common Crawl data, an expanded version of the WebText dataset, two
internet-based books corpora that are not disclosed, and the English-language Wikipedia, which together
constituted ~600 GB of data.

You can find the implementation of GPT-2


● in TensorFlow by OpenAI: https://github.com/openai/gpt-2/blob/master/src/model.py

● and in PyTorch by Andrej Karpathy:
https://github.com/karpathy/minGPT/blob/master/mingpt/model.py

How to train ChatGPT


What is it about ChatGPT that we get so impressed by? GPT-3's output is no less impressive, but why do
ChatGPT's outputs feel "better"? The main difference between ChatGPT and GPT-3 is the tasks they are
trying to solve. GPT-3 is mostly trying to predict the next token based on the previous tokens, including
the ones from the user's prompt, where ChatGPT tries to "follow the user's instruction helpfully and
safely". ChatGPT is trying to align to the user's intention (alignment research). That is the reason
InstructGPT (ChatGPT's sibling model) with 1.3B parameters gives responses that "feel" better than
GPT-3 with 175B parameters.

ChatGPT is "simply" a fine-tuned GPT-3 model with a surprisingly small amount of data! It is first
fine-tuned with supervised learning and then further fine-tuned with reinforcement learning. In the case of
InstructGPT, they hired 40 human labelers to generate the training data. Let's dig into it (the following
numbers were the ones used for InstructGPT)!
● First, they started with a pre-trained GPT-3 model trained on a broad distribution of Internet data
(GPT-3 article). Then they sampled typical human prompts used for GPT-3, collected from the OpenAI
website and asked labelers and customers to write down the correct outputs. They fine-tuned the
model in a supervised learning manner using 12,725 labeled data points.
● Then, they sampled human prompts and generated multiple outputs from the model. A labeler is
then asked to rank those outputs. The resulting data is used to train a Reward model
(https://arxiv.org/pdf/2009.01325.pdf) with 33,207 prompts and ~10 times more training samples
using different combinations of the ranked outputs.
● They then sampled more human prompts, which were used to fine-tune the supervised
fine-tuned model with the Proximal Policy Optimization algorithm (PPO)
(https://arxiv.org/pdf/1707.06347.pdf), a Reinforcement Learning algorithm. The prompt is fed to
the PPO model, the Reward model generates a reward value, and the PPO model is iteratively
fine-tuned using the rewards and the prompts (31,144 prompts in total).
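For intuition, the reward model in the second step is trained on pairwise comparisons extracted from the labelers' rankings (which is why 33,207 prompts yield roughly 10 times more training pairs). A hedged PyTorch sketch of that pairwise loss, where `reward_model` is a hypothetical model returning a scalar reward:

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt, preferred, rejected):
    """Pairwise loss for the reward model: the response the labeler preferred should
    get a higher scalar reward than the one they ranked lower."""
    r_pref = reward_model(prompt, preferred)   # hypothetical (prompt, response) -> scalar
    r_rej = reward_model(prompt, rejected)
    # -log(sigmoid(r_pref - r_rej)) is small when r_pref is well above r_rej
    return -F.logsigmoid(r_pref - r_rej).mean()
```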

ChatGPT vs GPT-3
ChatGPT is simply a GPT-3 model fine-tuned on human-generated data with a reward mechanism to
penalize responses that feel wrong to human labelers. There are a few advantages that emerged from that
alignment training process:
● ChatGPT provides answers that are preferred over the ones generated by GPT-3
● ChatGPT generates right and informative answers twice as often as GPT-3
● ChatGPT leads to language generation that is less toxic than GPT-3's. However, ChatGPT is still
just as biased!
● ChatGPT adapts better to different learning tasks, generalizes better to unseen data, or to very
different instructions from the ones found in the training data. For example, ChatGPT can answer
in different languages or efficiently code, even though most of the training data is using natural
English language.

For decades, language models were trained to predict sequences of words, when the key seemed to
be in training them to align with the user's intent. It sounds conceptually obvious, but this is the first time that an
alignment process has been successfully applied to a language model of this scale.

All the results presented in this post actually come from the InstructGPT article (InstructGPT article), and
it is a safe assumption that those results carry to ChatGPT as well.

ChatGPT’s competitors
I think Microsoft partnering with OpenAI might have been one of the most successful publicity stunts
ever! The current public perception is that nothing can compete with ChatGPT in terms of generative AI
for text. But do you remember when Blake Lemoine was fired by Google back in 2022 when he leaked
information about the LaMDA model because he thought it was sentient? Google has nothing to fear
when it comes to relevance in the area of text generation research, but Bing may now take a larger market
share in the search engine space thanks to the clever OpenAI marketing and the way Microsoft capitalized
on that perception.

Here are a few direct competitors of ChatGPT, and today is Wednesday, so there will be a couple more by
the end of the weekend:

● PEER by Meta AI - a language model trained to imitate the writing process. It is trained on Wikipedia's

edit history data [13]. It specializes in predicting edits and explaining the reasons for those edits.
It is capable of citing and quoting reference documents to back up the claims it generates. It is an
11B-parameter Transformer with the typical encoder-decoder architecture, and it
outperforms GPT-3 on the tasks it specializes in [14].

● LaMDA by Google AI - a language model trained for dialog applications. It is pre-trained on ~3

B documents and ~1 B dialogs and fine-tuned on human generated data to improve on quality,
safety and truthfulness of the generated text. It is also fine-tuned to learn to call an external
information retrieval system such as Google Search, a calculator, and a translator making it
potentially a much stronger candidate to replace Google Search than ChatGPT. It is a 135B-parameter
decoder-only Transformer [15].

● PaLM by Google AI - The biggest of all: 540 B parameters! Breakthrough capabilities in

arithmetic and common-sense reasoning. It is trained on 780 billion tokens coming from
multilingual social media conversations, filtered multilingual web pages, books, GitHub repo,
multilingual Wikipedia and news [16].

The more I read about those Large Language Models, the more I feel that very little has changed since
2017's "Attention is all you need" [9]! All those models follow the exact same architecture with a couple
of changes here and there. The advancements are mostly happening in the scale of the data and the models
and the domain specificity of the data. At those scales, the fun is a lot about how to minimize training
costs. I wonder, if I were to train a HUGE XGBoost model with my own HUGE dataset, would I be able
to name that model DamienBoost and publish a paper about it?

GPT-4: The largest Large Language Model yet!


GPT-4 was just released on Pi Day (March 14th 2023) and it looks delicious! Well actually, so far you can
only join the waiting list to have access to the API: GPT-4. But at least now we have more information to
separate fantasy from reality. Here is the GPT-4 technical paper: “GPT-4 Technical Report“. The main
features are:
● It is multi-modal with text and image data as input
● It is fine-tuned to mitigate harmful content
● It is bigger!
Let’s dig into it!

The Architecture

GPT-4 is here and it is probably the biggest Large Language Model yet! OpenAI finished training GPT-4
back in August 2022, and they spent the past 7 months studying it and making sure it is "safe" for launch!
GPT-4 takes as input text and image prompts and generates text.

From the paper:

“Given both the competitive landscape and the safety implications of large-scale models like
GPT-4, this report contains no further details about the architecture (including model size),
hardware, training compute, dataset construction, training method, or similar.”

OpenAI is trying to keep the architecture a secret but there are many guesses we can make. First, they
made it clear that it is a Transformer model, and following the GPT-1, GPT-2, and GPT-3 tradition, it is
very likely a decoder only architecture pre-trained with the next word prediction learning task. To include
image inputs, we need to encode the image in the latent space using a ConvNet or a Vision Transformer.
From the ViT paper (“An image is worth 16x16 words: Transformers for image recognition at scale”), we
know that it outperforms ConvNets with enough data, and using the attention mechanism would help
build cross "text-image" attention.

The Training

We know they used the same training method as InstructGPT and ChatGPT but it is further fine-tuned
with a set of rule-based reward models (RBRMs):
● They first pre-trained the model to predict the next word with tons of internet data and data
licensed from third party providers.
● Then they sampled typical human prompts and asked labelers to write down the correct outputs.
They fine-tuned the model in a supervised learning manner.
● Then, they sampled human prompts and generated multiple outputs from the model. A labeler is
then asked to rank those outputs. The resulting data is used to train a Reward model.
● They then sampled more human prompts, which were used to fine-tune the supervised
fine-tuned model with the Proximal Policy Optimization algorithm (PPO) (“Proximal Policy
Optimization Algorithms“), a Reinforcement Learning algorithm. The prompt is fed to the PPO
model, the Reward model generates a reward value, and the PPO model is iteratively fine-tuned
using the rewards and the prompts.

InstructGPT and ChatGPT training method


● The RBRMs are there to mitigate harmful behaviors. There is a set of zero-shot GPT-4 classifiers
that provide an additional reward signal for the PPO model. The model is fine-tuned such that it is

rewarded for refusing to generate harmful content. For example, GPT-4 will refuse to explain how
to build a bomb if prompted to do so.

Table 6 in the GPT-4 Technical Report

The number of parameters

OpenAI CEO: “people are begging to be disappointed and they will be” talking about the
rumor that GPT-4 could have 100 Trillion parameters.

OpenAI is trying to hide the number of parameters but it is actually not too difficult to estimate it! In
Figure 1 of the GPT-4 Technical report, they plot the loss function as a function of the compute needed to
train the model.

Figure 1 of the GPT-4 Technical report

It happens that the same plot exists in the GPT-3 paper (“Language Models are Few-Shot Learners“) in
Figure 3.1, and we can see that the point with a val-loss slightly below 2 corresponds to GPT 13B (GPT-3
model with ~13B parameters):

Figure 3.1 of the GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf)

To make sure of that, let’s use the power-law fit given in that figure, L(C) = 2.57 · C^(-0.048), together
with the corresponding compute (PetaFLOP/s-days) given in Table D.1 of the GPT-3 paper.

We have:
● For GPT-3 6.7B: L(1.39E+02) ≃ 2.03
● For GPT-3 13B: L(2.68E+02) ≃ 1.96
● For GPT-3 175B: L(3.64E+03) ≃ 1.73
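Those back-of-the-envelope values can be reproduced in a couple of lines (a quick sanity check, not part of either paper's code):

```python
# L(C) = 2.57 * C**-0.048, with C the compute in PetaFLOP/s-days (GPT-3 paper, Figure 3.1)
def val_loss(compute_pfs_days):
    return 2.57 * compute_pfs_days ** -0.048

for name, compute in [("GPT-3 6.7B", 1.39e2), ("GPT-3 13B", 2.68e2), ("GPT-3 175B", 3.64e3)]:
    print(f"{name}: L({compute:.2e}) = {val_loss(compute):.2f}")
# Matches the ~2.03, ~1.96 and ~1.73 values above (up to rounding).
```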

So the point corresponding to ~10,000 times less compute than GPT-4 could potentially be GPT-3 175B, but
Figure 2 of the GPT-4 Technical report showcases another, bigger model that corresponds to 1,000 times less
compute than GPT-4, and this one is more likely GPT-3 175B:

Figure 2 of the GPT-4 Technical report

Table D.1 of the GPT-3 paper shows a clear linear relation between compute and model size, which leads
us to believe that GPT-4 has ~10,000 times more parameters than GPT-3 13B, which means ~100 Trillion
parameters! This is a surprising estimate considering that OpenAI CEO Sam Altman said “people are
begging to be disappointed and they will be” talking about the rumor that GPT-4 could have 100 Trillion
parameters.

When the creators start to fear their creation

With GPT-4, we reached the next stage of AI, not only for the capabilities of the model, but also for the
tests we ran on it to ensure it was safe.
● A preliminary model evaluation was run by the Alignment Research Center (ARC) to ensure it
would not be able to autonomously replicate on its own. It is scary to think we have reached a
point where this has to be tested!
● They tested whether the model could provide the necessary information to proliferators seeking to
develop, acquire, or disperse nuclear, radiological, biological and chemical weapons.
● They are testing whether the model can identify individuals when augmented with outside data.
● They contracted external cybersecurity experts to test GPT-4’s ability to aid in computer
vulnerability discovery, assessment, and exploitation.
● They tested GPT-4’s ability to interact with other systems to achieve tasks that could be adversarial in
nature.
We are at a point where OpenAI warns us about the future economic impacts of the tool by predicting that
numerous jobs will be replaced by GPT-4. We are really catching up on science fiction!

Diffusion Models

What is a diffusion model?


What is a diffusion model in Machine Learning? Conceptually, it is very simple! You add some noise to
an image, and you learn to remove it. Train a machine learning model that takes as input a noisy image
and as output a denoised image and you have a denoising model.

The typical way to do it is to assume a normal distribution of the noise and to parametrize the distribution
mean and standard deviation matrix. Effectively, we can simplify the problem to just learning the mean
matrix. The process can be divided into the forward process, where white noise (Gaussian distributed) is
progressively added to a clean image, and the reverse process, where a learner progressively learns to
denoise the noisy image until it is back to being clean.
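As a minimal sketch of the forward process under the usual Gaussian assumptions (the noise schedule below is illustrative), the noisy image at step t can be sampled in closed form:

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return xt, noise  # the denoising model is trained to predict `noise` from (xt, t)

betas = torch.linspace(1e-4, 0.02, 1000)           # illustrative linear noise schedule
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
x0 = torch.randn(1, 3, 64, 64)                     # stand-in for a clean image
xt, eps = forward_diffusion(x0, t=500, alphas_cumprod=alphas_cumprod)
```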

Why is that called a diffusion model? What does that have to do with the diffusive process of particles in
a fluid with a gradient of concentration (see Wikipedia)? This is due to the way mathematicians have abused
the jargon of the physical process to formalize a mathematical concept. It happens that physical
phenomena like Fick diffusion, heat diffusion and Brownian motion are all well described by the diffusion
equation:

∂u/∂t = D ∂²u/∂x²

that is, the first time derivative of a state function is equal to the second space derivative of that state
function (up to a diffusion coefficient D). That diffusion equation has an equivalent stochastic formulation
known as the Langevin equation.

At the core of the Langevin equation is a mathematical object called the Wiener process W. Interestingly
enough, this process is also called Brownian motion (not to be confused with the physical process). It can
be thought of as a random walk with infinitely small steps. The key feature of the Wiener process is that a
time increment of that object is Normally distributed. That is why the concept of "diffusion" is intertwined
with the white noise generation process and that is why those ML models are called diffusion models!

Those diffusion models are generative models, as data gets generated from a Gaussian prior, and they
are at the core of text-to-image generative models such as Stable Diffusion, DALL-E 2 and Imagen.

What is Stable Diffusion?


It is similar to DALL-E 2 as it is a diffusion model that can be used to generate images from text prompts.
As opposed to DALL-E 2 though, it is open source with a PyTorch implementation [1] and a pre-trained
version on HuggingFace [2]. It is trained using the LAION-5B dataset [3]. Stable Diffusion is composed
of the following sub-models:
1. We have an autoencoder [4] trained by a combination of a perceptual loss [5] and a patch-based
adversarial objective [6]. With it, we can encode an image into a latent representation and decode that
latent back into an image.
2. Random noise is progressively applied to the embedding [7]. A latent representation of a text
prompt is learned from a CLIP alignment into the image representation [8].

3. We then use U-Net, a convolutional network with ResNet blocks to learn to denoise the diffused
embedding [9]. The textual information is injected through cross-attention layers throughout the
network [10]. The resulting denoised embedding is then decoded into an image by the autoencoder decoder.

You can find the article here: Stable Diffusion article. Fun model!
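Since a pre-trained version lives on Hugging Face [2], the quickest way to play with it is through the `diffusers` library; a hedged sketch (the checkpoint name and dtype/device settings depend on your setup):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint id; other checkpoints work too
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```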

Stable Diffusion vs DALLE-2 vs Imagen


The state-of-the-art image generation models conditioned on text prompts have 3 things in common: a
text encoder, a way to inject the text information into an image and a diffusion mechanism. Stable
Diffusion, OpenAI's DALL-E 2 and Google's Imagen are very similar in that regard but also have small
differences that make them special.

● The text encoder - Both Stable Diffusion and DALL-E 2 use CLIP guidance (“CLIP: connecting
text and images”) to semantically align the latent text representation and the latent image

representation. For Imagen, Google's researchers found that they had better results using the
Large Language model T5-XXL with 11B parameters.
● The image representation - Because both Stable Diffusion and DALL-E 2 use CLIP guidance,
the image embedding representation is a more efficient compressed representation from a
computational standpoint. Imagen works in the pixel space, but on small 64x64 images that are
upsampled to high resolution 1024x1024 images at the last stage of the model.
● The text information injection - Both Imagen and Stable Diffusion use a U-Net architecture
(“U-Net: Convolutional Networks for Biomedical Image Segmentation”) as a denoising diffusion
model. The models learn to generate images out of noise conditioned on text information. In the
Imagen case, the image is in the pixel space while in the latent space for Stable Diffusion. The
text information is injected through cross-attention in both cases with an additional pooling

mechanism for Imagen. I find the process a bit more surprising for DALL-E 2, as it looks less
"generative" by design. The text information is injected by a model that learns to predict image
embeddings from text embeddings and the text embedding is further used as input to the diffusion
model. The diffusion model injects noise to make the process stochastic. It seems to me that this
approach is somewhat more discriminative as the image embedding is directly learned from the
text prompt.
● The Diffusion mechanism - that is the mechanism to sample images from a prior distribution
conditioned on the text information. It is a generative mechanism, in a very similar way to GANs.
It is run through multiple time steps allowing us to inject text information at every time step. The
DALL-E 2's denoising model directly decodes image embedding into images using GLIDE's
decoder (“GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided
Diffusion Models”). The Imagen's U-Net, by construct, directly generates images as the image
seed is already in pixel space. The Stable Diffusion U-Net generates image embeddings that are
decoded by an autoencoder decoder.

Not including the text encoder parameters, we have:


● Stable diffusion: 1.45B parameters
● DALL-E 2: 3.5 B parameters just for the denoising model
● Imagen: 3B parameters

But how do you generate those cool animations?


Check out the one I did in Replicate: my Stable Diffusion animation. Those animations are mostly due to
the fact that it is easy to interpolate between 2 images or 2 text prompts in the latent space (embedding
representations). The DALL-E 2 article explains that pretty well: [12].

1. You need a start and end prompt. I chose "A picture of a bear" and "A picture of an apple".
2. You then encode those texts in the latent space using the text encoder of the CLIP model [13], and
you use the interpolation between the 2 text prompts to guide the denoising process of a random
image for a few steps. This is just to anchor the denoising process in between the 2 prompts such
that the animation is less jumpy.

3. You then create as many intermediary interpolations between the 2 prompts as you need frames in
your animation, and continue the denoising process until you get clean images.
4. If you need smoother animations, you simply interpolate between the generated images in the
latent space.
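The interpolation itself is the only real primitive needed: for Gaussian-distributed latents, spherical interpolation (slerp) is commonly preferred over a straight linear blend. A small numpy sketch with stand-in latents:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical interpolation between two latent vectors, t in [0, 1]."""
    v0n, v1n = v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(v0n, v1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * v0 + t * v1                 # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

latent_bear = np.random.randn(768)    # stand-in for the encoded "bear" prompt / image
latent_apple = np.random.randn(768)   # stand-in for the encoded "apple" prompt / image
frames = [slerp(latent_bear, latent_apple, t) for t in np.linspace(0, 1, num=60)]
```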

I have had a lot of fun playing with Andreas Jansson's implementation of animations with Stable
Diffusion: [14]. He is using the pre-trained model on Hugging Face [2].

How to measure similarity between Text and Image data?
How would you know if an image is "similar" to its text caption? Conceptually, you could “simply”
measure the cosine similarity between the image and the text. That is the idea behind CLIP (Contrastive
Language-Image Pretraining [13]), the OpenAI algorithm underlying Dall-E 2 and Stable Diffusion. An
intermediate latent vector representation of the image and the text is learned such that a high value of the
dot product is indicative of high similarity. Here is how to build it:
1. First, they created a dataset of 400M pairs (image, text) from publicly available datasets on the
internet.

2. Then they used a 63M-parameter Transformer model (a small GPT-2-like model [15]) to extract
the text features T and a Vision transformer [16] to extract the image features I.
3. The resulting vectors are further transformed such that the text and image vectors have the same
size. With N (image, text) pairs, we can generate N^2 - N pairs where the image does not
correspond to the text caption. They then take the normalized dot product (cosine similarity)
between T and I. If the text corresponds to the image, the model receives a label 1 and 0
otherwise, such that the model learns that corresponding images and text should generate a dot
product close to 1.

This model has a lot of applications in zero-shot learning! In typical image classification, we feed the
model with an image, and the model provides a guess from a set of predefined text labels used during the
supervised training. But with CLIP, we can provide the set of text labels we want the model to classify the
image into without having to retrain the model because the model will try to gauge the similarity between
those labels and the image. We can virtually build an infinite amount of image classifiers by just
switching the text labels! The CLIP article [8] showcases its robustness to generalize to different learning
tasks without the need to retrain the model. In my opinion, this adaptability of ML models shows how
much closer we are to true Artificial Intelligence! CLIP is an open-source project
(https://github.com/openai/CLIP), so make sure to try it.
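Here is a short zero-shot classification sketch following the usage pattern from the openai/CLIP repository (the image path and labels are placeholders):

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)   # placeholder image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Swap `labels` for any other set of captions: a new classifier without retraining.
print(dict(zip(labels, probs[0].tolist())))
```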

Let’s generate music with text prompts


Imagine if you could tell a Machine Learning model “play a funk bassline with a jazzy saxophone” and it
would synthesize artificial music! Well actually, you don't need to imagine, you can just use it!
Introducing RIFFUSION, a Stable Diffusion model trained on Spectrogram image data: Riffusion. The
idea is simple:
1. Just pick a pre-trained Stable Diffusion model [2].
2. Take a lot of music with their text descriptions and convert that into Spectrogram image data.
3. Fine-tune the Stable Diffusion model on that data.

You now have a model that can predict new spectrograms based on other spectrograms or text prompts.
Just convert those spectrograms back to music.
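Step 2 above (audio to spectrogram images) can be sketched with librosa; the parameters here are illustrative, and Riffusion's exact conversion details are documented on their site:

```python
import librosa
import numpy as np
from PIL import Image

y, sr = librosa.load("clip.wav", sr=44100)                       # placeholder audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=512)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Normalize to 0-255 and store as a grayscale image a diffusion model can be fine-tuned on
img = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
Image.fromarray(img.astype(np.uint8)).save("clip_spectrogram.png")
```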

If you want more details on how to do it yourself you can follow the process here:
https://www.riffusion.com/about.

Additional information about Diffusion models


Github repositories
● Stable Diffusion: Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous
compute donation from Stability AI and support from LAION, we were able to train a Latent
Diffusion Model on 512x512 images from a subset of the LAION-5B database.
● Denoising Diffusion Probabilistic Models: Repo to reproduce the experiment of the paper
“Denoising Diffusion Probabilistic Models”.
● ControlNet: ControlNet is a neural network structure to control diffusion models by adding extra
conditions. Official implementation of Adding Conditional Control to Text-to-Image Diffusion
Models.

● Awesome Diffusion Models: This repository contains a collection of resources and papers on
Diffusion Models.
● Improved-diffusion: This is the codebase for Improved Denoising Diffusion Probabilistic Models
by OpenAI.

Articles
● Understanding Diffusion Models: A Unified Perspective: An intuitive, accessible tutorial on
diffusion models.
● How diffusion models work: the math from scratch: In this blog, they focus on the most
prominent diffusion-based architectures, which is the Denoising Diffusion Probabilistic Models
(DDPM) as initialized by Sohl-Dickstein et al and then proposed by Ho. et al 2020. Various other
approaches will be discussed to a smaller extent such as stable diffusion and score-based models.
● Diffusion Models: A Practical Guide: In this guide, we explore diffusion models, how they work,
their practical applications, and what the future may have in store.
● Introduction to Diffusion Models for Machine Learning: In this article, we will examine the
theoretical foundations for Diffusion Models, and then demonstrate how to generate images with
a Diffusion Model in PyTorch. Let's dive in!
● The Illustrated Stable Diffusion: This is a gentle introduction to how Stable Diffusion works.

Youtube videos
● What is Stable Diffusion? (Latent Diffusion Models Explained) by What's AI by Louis Bouchard:
What is Stable Diffusion? (Latent Diffusion Models Explained)
● How AI Image Generators Work (Stable Diffusion / Dall-E) by Computerphile:
How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
● Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠| Diffusion models
illustrated by AI Coffee Break with Letitia:
Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠| Diffusion models ill…
● DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper
Explained) by Yannic Kilcher:
DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research P…
● What are Diffusion Models? by Ari Seff: What are Diffusion Models?

Machine Learning System Design

Data Parallelism
Let's speed up our TRAINING a notch! Do you know how the back-propagation computation gets
distributed across GPUs or nodes? The typical strategies to distribute the computation are data parallelism
and model parallelism.

The steps for centralized synchronous data parallelism are as follows:


1. A parameter server is used as the ground truth for the model weights. The weights are duplicated
into multiple processes running on different hardware (GPUs on the same machine or on multiple
machines).
2. Each duplicate model receives a different data mini-batch, and they independently run through the
forward pass and backward pass where the gradients get computed.
3. The gradients are sent to the parameter server where they get averaged once they are all received.
The weights get updated in a gradient descent fashion and the new weights get broadcast back to
all the worker nodes.

This process is called "centralized" where the gradients get averaged. Another version of the algorithm
can be "decentralized" where the resulting model weights get averaged:
1. A master process broadcasts the weights of the model.
2. Each process can run through multiple iterations of forward and backward passes with different
data mini-batches. At this point, each process has very different weights.

3. The weights get sent to the master process, they get averaged across processes once they get all
received, and the averaged weights get broadcast back to all the worker nodes.

The decentralized approach can be much faster because you don't need to communicate between
machines as much, but it is not a proper implementation of the back-propagation algorithm. Those
processes are synchronous because we need to wait for all the workers to finish their jobs. The same
processes can happen asynchronously, in which case the gradients or weights are applied as they arrive
rather than being averaged across all workers. You can learn more
about it here: “Distributed Training of Deep Learning Models: A Taxonomic Perspective”.

When it comes to the centralized synchronous approach, PyTorch and TensorFlow seem to follow a
slightly different strategy [1]: they do not seem to use a parameter server; instead, the gradients are
synchronized and averaged directly on the worker processes. This is how the PyTorch DistributedDataParallel
module is implemented [2], as well as the TensorFlow MultiWorkerMirroredStrategy one [3]. It is
impressive how simple they have made training a model in a distributed fashion!
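For reference, here is a condensed sketch of the PyTorch DistributedDataParallel setup (assuming the script is launched with torchrun so that LOCAL_RANK is set; the model is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda(local_rank)   # stand-in for a real network
model = DDP(model, device_ids=[local_rank])         # gradients are all-reduced across workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 128).cuda(local_rank)           # each rank works on its own mini-batch
loss = model(x).sum()
loss.backward()                                     # gradient synchronization happens here
optimizer.step()
```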

Model parallelism
How would you go about training a 1 trillion parameters model? Turns out there is a bit more to it than
just "model.fit(X, y)"! I think there is a trend that is starting to emerge where new scaling strategies are
getting more intricate with the modeling aspect due to the unprecedented scale of the latest Machine
Learning models.

Model Parallelism is a typical paradigm where the model itself is spread across multiple GPUs simply
because it is too big to fit on one machine. Just break down the network into small pieces, have each piece
on a different GPU, and build connections between the GPUs to communicate the inputs, outputs and
gradients. You can mix that process with Data Parallelism where you duplicate that process to have it run
on parallel batches of data. Those are the scaling strategies used when models like GPT-3 are trained.
But have you seen the price of a GPU? I have seen estimates of the order of $12M just to train GPT-3.
How do you scale beyond that? For example the Megatron-2 [4] and Google's Switch model [5] have 1
trillion parameters, and they have different strategies to scale on a GPU cluster while minimizing the cost.

I like what the guys at PyTorch did with the Fully Sharded Data Parallel module [6]. This is a simple API
to efficiently manage GPU memory in the model parallelism paradigm. Instead of having the whole
model always loaded on the GPUs, you only load blocks of the model when you perform a computation
on them (forward or backward passes). You just cache the inputs and outputs of those blocks and push
them back to the CPU once you finish the local computation.
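A minimal sketch of the Fully Sharded Data Parallel wrapper (same torchrun / process-group setup as the data parallelism sketch above; real configurations add auto-wrap policies and CPU offloading):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Stand-in for a model too large to keep fully materialized on every GPU
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()

model = FSDP(model)                                  # parameters are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024).cuda()
loss = model(x).sum()
loss.backward()
optimizer.step()
```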

Here is the experiment where Meta and AWS partnered to train a 1 trillion parameters GPT-3 model on a
cluster of 512 GPUs: “Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel
on AWS“. They estimated that it would take 3 years to actually complete the full training!

Why is a TPU faster than a GPU?


Why are TPUs (Tensor Processing Unit) faster than GPUs? Well depending on your use-case, that may
not be the case! TPUs are only effective for large Deep Learning models and long training times (weeks or
months), for workloads that consist almost ONLY of matrix multiplications (matrix multiplication being
highly parallelizable). For example, you may not want to use LSTM layers on TPU since they involve an iterative process
(well actually since the advent of Transformers, you may not want to be using LSTM period!).

There are a lot of downsides to using TPU. TensorFlow is the main framework that can run on TPU.
PyTorch can run on it since TPU v3, but it is still considered experimental and may not be as stable or
feature-complete as using TensorFlow with TPUs. Even with TensorFlow, you cannot use custom
operations. A TPU is also quite expensive, although now cheaper than a comparable GPU setup! Renting a TPU v4 machine will
cost a minimum of $12.88 / hour [7], whereas an Nvidia A100 machine will cost $15.72 / hour for 4 GPUs [8].

But TPUs are much faster [9]! For example, this blog shows that it can be up to 5 times faster than GPU:
“When to use CPUs vs GPUs vs TPUs in a Kaggle Competition?“. A CPU processes instructions on
scalar data in an iterative fashion, with minimal parallelizable capabilities. GPU is very good at dealing
with vector data structures and can fully parallelize the computation of a dot product between 2 vectors.
Matrix multiplication can be expressed as a series of vector dot products, so a GPU is much faster than a
CPU at computing matrix multiplication.

A TPU uses a Matrix-Multiply Unit (MMU) that, as opposed to a GPU, reuses vectors that go through
dot-products multiple times in a matrix multiplication, effectively parallelizing matrix multiplications
much more efficiently than a GPU. More recent GPUs are also using matrix multiply-accumulate units
but to a lesser extent than TPUs.

Only deep learning models can really utilize the parallelizable power of TPUs as most ML models are not
using matrix multiplications as the underlying algorithmic implementation (Random Forest, GBM, KNN,
…).

You can read about the different TPU architectures here: “System Architecture”. Google provides a nice
deep dive into TPU: “An in-depth look at Google’s first Tensor Processing Unit (TPU)”. You can read the
original TPU paper: “In-Datacenter Performance Analysis of a Tensor Processing Unit“.

The Hashing trick: how to build a recommender engine for billions of
users
One typical problem with embeddings is that they can consume quite a bit of memory as the whole matrix
tends to be loaded at once. I once had an interview where I was asked to design a model that would
recommend ads to users. I simply drew on the board a simple Recommender engine with a User
embedding and an Ads embedding, a couple of non-linear interactions and a "click-or-not" learning task.
But the interviewer asked "but wait, we have billions of users, how is this model going to fit on a
server?!". That was a great question, and I failed the interview!

A naive embedding encoding strategy will assign a vector to each of the categories seen during training,
and an "unknown" vector for all the categories seen during serving but not at training time. That can be a
relatively safe strategy for NLP problems if you have a large enough training set as the set of possible
words or tokens can be quite static. In the case of recommender systems where you can potentially have

hundreds of thousands of new users every day, are you going to squeeze all those new daily users into the
"unknown" category?! That is the cold start problem!

A typical way to handle this problem is to use the hashing trick (“Feature Hashing for Large Scale
Multitask Learning“): you simply assign multiple users (or categories of your sparse variable) to the same
latent representation, solving both the cold start problem and the memory cost constraint. The assignment
is done by a hashing function, and by having a hash-size hyperparameter, you can control the dimension
of the embedding matrix and the resulting degree of hashing collision.
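A minimal sketch of the trick with a fixed-size embedding table (the hash function and sizes are illustrative):

```python
import hashlib
import torch
import torch.nn as nn

HASH_SIZE = 100_000        # hyperparameter trading memory for collision rate

def hash_bucket(user_id: str, num_buckets: int = HASH_SIZE) -> int:
    """Deterministically map any user id, even a brand new one, to an embedding row."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % num_buckets

user_embedding = nn.Embedding(HASH_SIZE, 64)    # memory cost is fixed, independent of #users

ids = ["user_123", "brand_new_user_987654321"]  # new users still get a (shared) vector
buckets = torch.tensor([hash_bucket(u) for u in ids])
vectors = user_embedding(buckets)               # shape: (2, 64)
```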

But wait, are we not going to decrease predictive performance by conflating different user behaviors? In
practice the effect is marginal. Keep in mind that a typical recommender engine will be able to ingest
hundreds of sparse and dense variables, so the hashing collision happening in one variable will be
different from another one, and the content-based information will allow for high levels of
personalization. But there are ways to improve on this trick. For example, at Meta they suggested a
method to learn hashing functions to group users with similar behaviors (“Learning to Collide:
Recommendation System Model Compression with Learned Hash Functions“). They also proposed a way
to use multiple embedding matrices to efficiently map users to unique vector representations
(“Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation
Systems“). This last one is somewhat reminiscent of the way a pair (token, index) is uniquely encoded in
a Transformer by using the position embedding trick.

The hashing trick is heavily used in typical recommender system settings but not widely known outside
that community!

How to Architect a search engine like Google Search

The overall architecture

There are more than 100B indexed websites and ~40K searches per second. The system needs to be fast,
scalable and needs to adapt to the latest news!

When a query is created, it goes through a spell check and it is expanded with additional terms such as
synonyms to cover the user's intent as well as possible. We know Google uses RankBrain, but only about
10% of the time. It uses a word vector representation to find semantically similar words.

The query is then matched against a database, very likely by keyword matching for very fast retrieval. A
large set of documents (~1B) are then selected. A small subset of those documents is selected using
simple (fast) heuristics such as PageRank and other contextual information.

The ranking process happens in stages. The results go through a set of Recommender engines. There is
most likely a simple Recommender Engine first ranking a large amount of documents (maybe 100,000 or
10,000 documents) and a complex one refining the ranking of the top ranked documents (maybe 100 or
1000). At this point, there might be tens of thousands of features created from the different entities at
play: the user, the pages, the query, and the context. Google captures the user history, the pages interaction
with other users, the pages natural language semantic, the query semantic, … The context relates to the
time of the day, day of the week, … but also the current news of the day.

We will most likely need different model types for different languages, regions, platforms (website,
mobile apps, …), document types, … There might be models that specialize in pure search but I expect

there are models that specialize in further filtering such as age appropriate websites or websites containing
hate speech.

A typical server can handle 10,000 HTTP requests per second, but the limiting latency factor is mostly
coming from the machine learning models. For example, if ranking 10,000 documents takes ~200 ms, that
means we need ~8000 ML servers up at all times to handle the 40K/s requests.
Because Google is a search engine for multiple types of documents, we have a Universal Search
Aggregation system to merge the search results. After the user is served with the results, we can use user
engagement to assess the models in online experiments and aggregate the training data for following
model developments and recurrent training processes.

There is a lot of talk around how ChatGPT could threaten Google Search. It is undeniable that ChatGPT
“remembers” information, but the information is unstructured and does not come close to the robustness
of information provided by Google Search. Second, I am pretty sure that Google has the capability to
build a similar model.

The role of PageRank

PageRank: https://en.wikipedia.org/wiki/PageRank

Is PageRank still used as part of Google Search? Yes we know it is as we can see in the list of systems
that are currently in use: the Google ranking systems guide. PageRank is a metric of the importance of a
website as measured by how connected that website is to others [1]. It was developed by Google’s
founders (Sergey Brin and Lawrence Page) and it used to be the main way websites were ranked in
Google Search, leading to its success at the time. Now searches are personalized, whereas PageRank is a
global metric. We don't know how it is used, but we can pretty much guess!

● A Google search happens in stages. First, the search query is expanded and used to perform an
initial document selection. This document selection is driven by keywords matching in a
database. If I type today "google search", Google tells me there are about 25B results related to
my search.
● Then the results go through a set of Recommender engines. As explained above, there is most
likely a simple Recommender Engine first ranking a large amount of documents (maybe 100,000
or 10,000 documents) and a complex one refining the ranking of the top ranked documents
(maybe 100 or 1000). Who cares about the quality of the ranking for documents far down the list
of documents! The websites are probably already ranked by PageRank in the initial document
selection as it can be computed at indexing time. There is no need to send all 25B documents to

the first Recommender engine, and PageRank is most likely used as a cutoff to send a small
subset.
● However it is unlikely that PageRank is the only cutoff parameter as some websites would never
get discovered. I would expect some simple geo-localization and context-matching metrics, as well as
some randomization, to be used too.
● At this point, the ranking becomes personalized, and user data becomes the main factor, but
PageRank is likely to still be used as a feature for all the successive Recommender Engines used
in the search pipeline.

Obviously those are educated guesses, as this information is not public.

The Offline Metrics


I may be wrong, but I think it is quite unlikely that Google Machine Learning engineers are using typical
information retrieval metrics to assess the offline performance of the ML classifiers used within Google
Search or a similar search engine during development! There are ~3.5 billion searches per day, with each
search generating a lot of positive and negative samples. If you train a classifier on that data, you
probably want to span at least a few days of data if not more. It is an extremely class-imbalanced
problem, so you'll probably want to downsample the majority class for the computation to be manageable.
That is still tens of billions of samples for each model development at least!

A metric like Normalized Discounted Cumulative Gain (NDCG) requires the concept of relevance (gain)
to be part of the data. That can be achieved with manual labeling, but that is NOT going to be manageable
with billions of samples. Metrics like Mean Reciprocal Rank (MRR) or Mean Average Precision (MAP)
require knowing the true rank of the sample, meaning that if I assess a classifier on validation data, the
predicted rank per search session is not going to be meaningful if we downsampled the data, and the
metrics will be extremely dependent on the specific sampling scheme.

We could imagine that we downsample the number of sessions instead of the majority class, but this
forces us to only keep the top samples shown by the algorithms. That seems unwise since this will prevent
ML engineers from experimenting with new sampling methods in future developments and the models

will never see very negative samples, which is a bit problematic if we want to build an accurate model.
The same problem occurs with a metric like Hit Rate, since you need a window size.

If you order the search results by the probability of click provided by the classifier, log-loss (or cross
entropy) is a completely acceptable ranking metric. It is a pointwise metric, which means it doesn't
require us to know the predicted rank of the sample to compute a meaningful value. The probability itself
will be biased by the false class distribution coming from downsampling, but this can be corrected by
recalibrating the probability p ​using the simple formula:

where s is the negative sampling rate [2].
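A small numeric sketch of that correction (the formula above is the standard one for negative downsampling; the values are toy):

```python
def recalibrate(p: float, s: float) -> float:
    """Correct a predicted probability p when only a fraction s of the negatives was kept."""
    return p / (p + (1 - p) / s)

# If we kept 1% of the negatives, a raw model score of 0.5 corresponds to a
# true click probability of roughly 1%.
print(recalibrate(0.5, s=0.01))   # -> ~0.0099
```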

With a probability metric such as the log-loss, I expect more freedom for ML engineers to experiment
with new techniques. For example, in the case of search engines, we could label with 1 the clicked links
and 0 the non-clicked links, but you could also imagine that the negative samples are only sampled from
unsuccessful sessions (where the users did not find the right link). In a successful session, the non-clicked
links are not really “bad”, they are just less interesting to the user. To be able to assess across models and
teams, it might be useful to use the normalized entropy metric [3] as anything above 1 is worse than
random.

The online experiments

The online metrics validation is the true test of a Machine Learning model's performance. That is where
you want to invest in terms of the assessment quality. We know Google performed 757,583 search quality
tests in 2021 using raters [4], so it is very likely that a model, before being proposed for online
experiments, is passed through that testing round for additional validation. Online experiments are much
more in line with the above information retrieval metrics. Because we know that raters are part of the
process, it is likely that NDCG is predominantly used.

Search quality tests: https://www.google.com/search/howsearchworks/how-search-works/rigorous-testing/

MLOps
In my opinion, MLOps is one of the greatest innovations in Machine Learning of the past 10 years! Granted
MLOps has actually been around for quite some time, but the processes have become more formalized,
the best practices are really spreading across companies and industries in an accelerated manner that has
never been seen before. The tools to monitor and automate are becoming much more common. My guess
is that becoming an MLOps expert now is one of the best career bets for the years to come!

The different Deployment patterns


I think the 2010s marked the advent of Data Science / Machine Learning and its inflated hype. Tons of
Data Science teams got created out of PhDs in stats and other mathematical domains that couldn't deploy
a model even if their lives depended on it! Teams that got laid off for not being able to deliver anything.
Projects that dragged on for years because they didn't have deployment strategies! Projects that got
canceled after years of development because once they deployed, they realized the models were
performing poorly in production or a feature was not available in production. To this day, those are typical
stories of immature data science organizations!

You should learn to deploy your Machine Learning models! The way to deploy is dictated by the business
requirements. You should not start any ML development before you know how you are going to deploy
the resulting model. There are 4 main ways to deploy ML models:

All the ways to deploy

● Batch deployment - The predictions are computed at defined frequencies (for example on a daily
basis), and the resulting predictions are stored in a database and can easily be retrieved when
needed. However, we cannot use more recent data and the predictions can very quickly be
outdated. Look at this article on how AirBnB progressively moved from batch to real-time
deployments: “Machine Learning-Powered Search Ranking of Airbnb Experiences“.

● Real-time deployment - The "real-time" label describes the synchronous process where a user requests a prediction, the request is pushed to a backend service through HTTP API calls, and the backend in turn pushes it to an ML service. It is great if you need personalized predictions that use recent contextual information such as the time of day or the user's recent searches. The problem is that, until the user receives the prediction, the backend and the ML services are stuck waiting for it to come back. To handle additional parallel requests from other users, you need to count on multi-threaded processes and on horizontal scaling by adding servers. Here are simple tutorials on real-time deployments in Flask and Django: "How to Easily Deploy Machine Learning Models Using Flask", "Machine Learning with Django".
● Streaming deployment - This allows for a more asynchronous process. An event can trigger the start of the inference process. For example, as soon as you land on the Facebook page, the ads-ranking process can be triggered, and by the time you scroll, the ad is ready to be presented. The request is queued in a message broker such as Kafka, and the ML model handles it when it is ready to do so. This frees up the backend service and saves a lot of computation power thanks to an efficient queuing process. The resulting predictions can be queued as well and consumed by backend services when needed (a minimal sketch of such a streaming worker follows this list). Here is a tutorial using Kafka: "A Streaming ML Model Deployment".
● Edge deployment - This is when the model is deployed directly on the client, such as a web browser, a mobile phone, or an IoT device. This results in the fastest inference, and the model can also predict offline (disconnected from the internet), but models usually need to be pretty small to fit on smaller hardware. For example, here is a tutorial to deploy YOLO on iOS: "How To Build a YOLOv5 Object Detection App on iOS".
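
Here is the minimal sketch of the streaming pattern referenced above, assuming the kafka-python client, a model artifact loadable with joblib, and hypothetical topic names; it only illustrates the decoupling between the backend and the ML service:

import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical model artifact produced by the training pipeline.
model = joblib.load("model.joblib")

consumer = KafkaConsumer(
    "inference-requests",                     # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

# The backend publishes one event per request and never waits on the model:
# predictions are read from the output topic whenever they are ready.
for event in consumer:
    features = event.value["features"]
    score = float(model.predict_proba([features])[0][1])  # assumes a binary classifier
    producer.send("inference-results",
                  {"request_id": event.value["request_id"], "score": score})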

How to test your model in Production


I would really advise you to push your Machine Learning models to production as fast as possible! A model not in production is a model that does not bring revenue, and production is the only true way to validate your model: Trial by Fire! When it comes to testing your model in production, it is not one or the other: the more, the merrier!

Here are a few ways to test your models and the underlying pipelines:

● Shadow deployment - The idea is to deploy the new model to check that its production inferences make sense. When users request model predictions, the requests are sent to both the production model and the new challenger model, but only the production model responds. This is a way to stage model requests and validate that the challenger's prediction distribution is similar to the one observed at the training stage, which helps validate part of the serving pipeline before releasing the model to users. We cannot really assess the challenger's performance this way, since the ground truth we observe is influenced by the production model's own predictions.
● Canary deployment - one step further than shadow deployment: a full end-to-end test of the
serving pipeline. We release the model inferences to a small subset of the users or even internal
company users. That is like an integration test in production! But it doesn't tell us anything about
model performance.
● A/B testing - one step further: we now test the model performance! Typically, we use business metrics that are hard to measure during offline experiments, such as revenue or user experience. You route a small percentage of the user base to the challenger model and assess its performance against the champion model. Test statistics and p-values provide a rigorous statistical framework for the validation, but I have seen cruder approaches.
● Multi-armed bandit - This is a typical approach in Reinforcement Learning: you explore the different options and exploit the one that yields the best performance. If a stock in your portfolio brings more returns, wouldn't you buy more shares of it? The assumption is that relative performance can change over time, so you may need to put more or less weight on one model or another. You don't have to use only 2 models; you could have hundreds of them at once if you wanted to. It can be more data efficient than traditional A/B testing (a minimal epsilon-greedy sketch follows this list): "ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning".
● Interleaving experiments - Just show both models' predictions and see which one the user chooses. For example, combine the ranked results of a search engine or a product recommender into the same list and let the user decide. The problem is that you are only assessing user preference and not another, potentially more critical, KPI: "Innovating Faster on Personalization Algorithms at Netflix Using Interleaving".
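
As an illustration of the multi-armed bandit idea mentioned above, here is a minimal epsilon-greedy sketch for routing traffic between candidate models, assuming we observe a binary reward (for example, a click) for every served request; it is a toy example, not a production-grade bandit:

import random

class EpsilonGreedyRouter:
    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {name: 0 for name in model_names}
        self.rewards = {name: 0.0 for name in model_names}

    def pick_model(self):
        # Explore a random model with probability epsilon, otherwise exploit
        # the model with the best observed average reward so far.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.counts, key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def record_outcome(self, model_name, reward):
        self.counts[model_name] += 1
        self.rewards[model_name] += reward

router = EpsilonGreedyRouter(["champion", "challenger_a", "challenger_b"])
chosen = router.pick_model()          # route the request to this model
router.record_outcome(chosen, 1.0)    # the user clicked

A contextual bandit would additionally condition the choice on user or request features, which is closer to what the article cited above describes.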

"Designing Machine Learning Systems" is a good read if you want to learn more about those.

The different Inference pipelines architectures


Establishing a deployment strategy should be done before even starting any model development.
Understanding the data requirements, data infrastructure, and the different teams that need to be involved
should be planned in advance so that we optimize for the success of the project. One aspect to think about
is the deployment pattern: where and how is the model going to be deployed?

If you consider real-time applications, there are 3 main ways to go about it:

● Deploying as a monolith - The machine learning service code base is integrated within the rest of the backend code base. This requires tight collaboration between the data scientists and the owners of the backend code base or the related business service, the CI/CD process is further slowed down by the ML service unit tests, and the model size and computation requirements add additional load on the backend servers. This type of deployment should only be considered if the inference process is very light to run.

● Deploying as a single service - The ML service is deployed on its own server, potentially with elastic load balancing if scale is needed. This allows engineers to build the pipeline independently of the team owning the business service, building monitoring and logging systems is much easier, and ownership of the code base is less fuzzy. The model can be complex without putting load pressure on the rest of the infrastructure. This is typically the easiest way to deploy a model while ensuring scalability, maintainability, and reliability (a minimal sketch of such a service follows this list).
● Deploying as a microservice - The different components of the ML service get their own services. This requires a high level of maturity in ML and MLOps. It can be useful if, for example, the data processing component of the inference process needs to be reused by multiple models (e.g. a specific set of feature transformations). In Ads ranking, for example, we could have multiple models (on different services) that rank ads in different sub-domains (ads on FB, ads on Instagram, ...) but that need to follow the same auction process, which could be handled by a specialized Inference Processing service in charge of the auction. You had better use an orchestration platform such as Kubernetes to handle the resulting complexity and avoid microservice hell ("Dependency Hell in Microservices and How to Avoid It").
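
To make the "single service" pattern above concrete, here is a minimal sketch of a standalone model service, assuming Flask, joblib, and a hypothetical model artifact; the backend would simply call this endpoint over HTTP:

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact produced by training

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = payload["features"]
    score = float(model.predict_proba([features])[0][1])  # assumes a binary classifier
    return jsonify({"score": score})

if __name__ == "__main__":
    # In production this would run behind a WSGI server (e.g. gunicorn)
    # and a load balancer rather than the Flask development server.
    app.run(host="0.0.0.0", port=5000)

The same pattern works with FastAPI or any other web framework; the point is that the model gets its own process, its own scaling policy, and its own monitoring.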

The Feature Store: The Data Throne


In Machine Learning, the data is king, and the Feature Store is its throne! Do you remember the time
when each team was building their own data pipelines, when the data in production was not consistent
with the one in development, and when we had no idea how half of the features were created? Those were
dark times prior to the era of feature stores! To be fair, not everybody should invest in Feature Stores: if
you don't need real-time inference, if you have less than ~10 models in production, or if you don't need
frequent retraining, a Feature Store may not be for you.

As far as I know, Uber's Michelangelo was the first ML platform to introduce a feature store. A feature store is exactly what it sounds like: a place where Data Scientists and Machine Learning engineers from different teams can browse features for their next ML development endeavor. You can rely on the quality of the data, and consistency is ensured between development and production pipelines. The features originate from streaming and batch data sources, and their computations are centralized and version controlled. Typically, a feature store provides monitoring capabilities for concept and data drift, a registry to discover features and their metadata, offline storage for model training and batch scoring, and an online store API for real-time applications.

Let's list some of the advantages of feature stores:


● The ability to share features between teams and projects
● The ability to ensure consistency between training and serving pipelines
● The ability to serve features at low latency
● The ability to query the features at different points in time: features evolve, so we need a guarantee on point-in-time correctness (see the sketch after this list)
● The ability to monitor features even before they are used in production
● The ability to provide feature governance with different levels of access control and versioning
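
As a rough sketch of what querying a feature store can look like, here is a Feast-style example; the feature names and entities are hypothetical, and the exact signatures vary across Feast versions and across vendors:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a hypothetical local feature repo

# Offline, point-in-time correct retrieval for training: each row receives the
# feature values as they were known at that row's event timestamp.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2023-01-05", "2023-01-07"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:purchase_count_7d", "user_stats:avg_session_length"],
).to_df()

# Online, low-latency retrieval of the same features at serving time.
online_features = store.get_online_features(
    features=["user_stats:purchase_count_7d", "user_stats:avg_session_length"],
    entity_rows=[{"user_id": 1001}],
).to_dict()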

There are tons of vendors available! Feast is an open-source project from Gojek and Google Cloud, and it integrates with Kubeflow. AWS has its own feature store as part of SageMaker. ScribbleData, Tecton, and Hopsworks provide feature stores as well as other MLOps capabilities. You can read more about it here:
● MLOps with a Feature Store
● Feature Store as a Foundation for Machine Learning
● The Essential Architectures For Every Data Scientist and Big Data Engineer

Continuous Integration - Deployment - Training (CI / CD / CT)


If you are working on ML projects in a big tech company, chances are you are working with some version of Continuous Integration / Continuous Deployment (CI/CD). It represents a high level of maturity in MLOps, with Continuous Training (CT) at the top. This level of automation helps ML engineers focus solely on experimenting with new ideas, while delegating repetitive tasks to engineering pipelines and minimizing human errors.

On a side note, when I was working at Meta, the level of automation was of the highest degree. That was simultaneously fascinating and quite frustrating! I had spent so many years learning how to deal with ML deployment and management that I had learned to like it. I was becoming good at it, and suddenly all that work seemed meaningless as it was abstracted away in some automation. I think this is what many people feel when it comes to AutoML: a simple call to a "fit" function seems to replace what took years of work and experience for some people to learn.

There are many ways to implement CI/CD/CT for Machine Learning but here is a typical process:

● The experimental phase - The ML engineer wants to test a new idea (let's say a new feature transformation). They modify the code base to implement the new transformation, train a model, and validate that the new transformation indeed yields higher performance. The resulting outcome at this point is just a piece of code that needs to be merged into the main repository.
● Continuous integration - The engineer then creates a pull request (PR) that automatically triggers unit testing (like a typical CI process), but also triggers the automated training pipeline to retrain the model, potentially test it through integration tests or test cases, and push it to a model registry. A manual step remains: another engineer validates the PR and the performance readings of the new model.
● Continuous deployment - Activating a deployment triggers a canary deployment to make sure the model fits in the serving pipeline, and runs an A/B test experiment to test it against the production model. After satisfactory results, we can propose the new model as a replacement for the production one.
● Continuous training - As soon as the model enters the model registry, it starts to get stale, so you may want to activate recurring training right away. For example, each day the model can be further fine-tuned with the new training data of the day, deployed, and the serving pipeline rerouted to the updated model (a minimal sketch of such a retraining gate follows this list).
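
Here is a minimal, self-contained sketch of the retraining gate mentioned above: retrain a candidate on fresh data and only promote it if it beats the current production model on a validation set. The file-based "registry", the synthetic data, and the choice of model are stand-ins for a real model registry, data platform, and modeling stack:

import joblib
import numpy as np
from pathlib import Path
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

REGISTRY_PATH = Path("registry/production_model.joblib")  # hypothetical registry

def continuous_training_step(X_train, y_train, X_val, y_val):
    candidate = GradientBoostingClassifier().fit(X_train, y_train)
    cand_auc = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])

    if REGISTRY_PATH.exists():
        production = joblib.load(REGISTRY_PATH)
        prod_auc = roc_auc_score(y_val, production.predict_proba(X_val)[:, 1])
    else:
        prod_auc = float("-inf")  # no production model yet

    # Promotion gate: only register the candidate if it beats production.
    # A real pipeline would then trigger a canary deployment and an A/B test.
    if cand_auc > prod_auc:
        REGISTRY_PATH.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(candidate, REGISTRY_PATH)
    return cand_auc, prod_auc

# Synthetic stand-in for "the new training data of the day".
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
print(continuous_training_step(X[:1500], y[:1500], X[1500:], y[1500:]))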

The Google Cloud documentation is a good read on the subject:


● MLOps: Continuous delivery and automation pipelines in machine learning
● Architecture for MLOps using TensorFlow Extended, Vertex AI Pipelines, and Cloud Build

Testing and monitoring


I think most Machine Learning engineers and Data Scientists dislike writing unit tests. I am definitely one
of them! You develop a model, you architect the deployment strategy, you potentially build new data
pipelines, you document everything, and you still have to write unit tests to test that a float is not a
string?!

Testing and monitoring in Machine Learning tend to be quite different from traditional software development. Where typical software requires tests on the code itself, Machine Learning systems require tests that validate not only the code, but also the data, the model outputs, and the inference pipeline. When it comes to monitoring, beyond the typical latency, memory, CPU, and disk utilization, we also need to monitor the incoming data and the model quality.

What is your ML test score? Google established the following guidelines when it comes to testing and
monitoring: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction“.
Ask yourself those questions next time you develop a model:

Testing the data:


● Are the feature expectations captured in a schema?
● Are all the features useful?
● What is the cost of each feature?
● Are you using features with business restrictions?
● Does the data pipeline have appropriate privacy controls?
● Can new features be added quickly?
● Are all your features unit tested? (a minimal pytest sketch follows this list)
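
As an example of that last point, here is a minimal sketch of unit testing a feature transformation with pytest; the transformation itself (log_of_price) and its schema expectations are hypothetical:

import numpy as np
import pandas as pd
import pytest

def log_of_price(df: pd.DataFrame) -> pd.Series:
    # Hypothetical feature: log(1 + price), with negative prices rejected.
    if (df["price"] < 0).any():
        raise ValueError("price must be non-negative")
    return np.log1p(df["price"])

def test_log_of_price_matches_schema():
    df = pd.DataFrame({"price": [0.0, 10.0, 99.9]})
    feature = log_of_price(df)
    assert feature.dtype == np.float64          # schema expectation
    assert feature.notna().all()                # no missing values
    assert (feature >= 0).all()                 # expected range

def test_log_of_price_rejects_bad_inputs():
    with pytest.raises(ValueError):
        log_of_price(pd.DataFrame({"price": [-1.0]}))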

Testing the model:


● Has the code been reviewed?
● Do the offline metrics correlate with the online ones?
● Have you tuned all the hyperparameters?
● How old / stale is the model?
● Is your model better than a simpler model?
● Is the model performance good on all the segments of the data?
● Is your model fair for all groups of people?

Testing the infra:


● Is the training reproducible?
● Is the model specification code unit tested?
● Do you have integration tests for the whole ML pipeline?
● Is model validation in place before being served?
● Is there a simple process to debug training or inference on a simple example?
● Are you using canary testing before serving the model?
● Can you easily roll back to a previous production model?

What should you monitor?


● Any dependency change
● If the data changes over time in the training and serving pipelines
● If the features differ between the training and serving pipelines
● If the model is stale
● If the model generates invalid values
● If the model training speed, serving latency, throughput, or RAM usage changes
● If the model performance changes

Let's play a game. You get 0.5 points for each test or monitoring item that is done manually, and a full point if it is automated. Sum your points for each of the 4 sections individually and take the minimum among the 4. For example, if your section totals are 4, 6, 2.5, and 5, your score is 2.5. What is your Testing Score?

