Data Science Guide
2024 EDITION
DATA SCIENCE FULL ARCHIVE
Most ML models are trained independently without any interaction with other
models. However, in the realm of real-world ML, there are many powerful
learning techniques that rely on model interactions to improve performance.
The following image summarizes four such well-adopted and must-know training methodologies:
1) Transfer Learning
By training a model on the related task first, we can capture the core patterns of
the task of interest. Later, we can adjust the last few layers to capture
task-specific behavior.
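As a rough illustration (a minimal PyTorch sketch, not the book's code; the torchvision backbone, layer names, and class count are assumptions), we can freeze a pretrained network and swap only its head:

import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on a related task (here, ImageNet classification).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained layers so they retain the core patterns already learned.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the last layer with a fresh head for the task-specific behavior.
num_classes = 10  # assumed number of classes in the new task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are updated during training.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)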
2) Fine-tuning
Fine-tuning involves updating the weights of some or all layers of the pre-trained
model to adapt it to the new task.
The idea may appear similar to transfer learning, but in fine-tuning, we typically
do not replace the last few layers of the pre-trained network.
3) Multi-task Learning
The model shares knowledge across tasks, aiming to improve generalization and
performance on each task.
It can help in scenarios where tasks are related, or they can benefit from shared
representations.
In fact, the motive for multi-task learning is not just to improve generalization.
We can also save compute power during training by having a shared layer and
task-specific segments.
4) Federated Learning
Let’s discuss it in the next chapter.
An Introduction to Federated Learning
In my opinion, federated learning is among those very powerful ML techniques
that is not given the true attention it deserves.
Let’s understand this topic in this chapter and why I consider this to be an
immensely valuable skill to have.
The Problem
Modern devices (like smartphones) have access to a wealth of data that can be
suitable for ML models.
To get some perspective, consider the number of images you have on your phone
right now, the number of keystrokes you press daily, etc.
But applications can have millions of users. The amount of data we can train ML
models on is unfathomable.
The problem is that almost all data available on modern devices is private.
The Solution
Federated learning smartly addresses this challenge of training ML models on
private data.
Send a global model to the user’s device, train a model on private data, and
retrieve it back.
As a result, the central server does not need the enormous computing that it
would have demanded otherwise.
Building a Multi-task Learning (MTL) Model
Most ML models are trained on one task. As a result, many struggle to intuitively
understand how a model can be trained on multiple tasks simultaneously.
To reiterate, in MTL, the network has a few shared layers and task-specific
segments. During backpropagation, gradients are accumulated from all branches,
as depicted in the animation below:
Consider we want our model to take a real value (x) as input and generate two
outputs:
● sin(x)
● cos(x)
● We have some fully connected layers in self.model → These are the shared
layers.
● Furthermore, we have the output-specific layers to predict sin(x) and cos(x), as sketched below.
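A minimal sketch of such a network (layer sizes and attribute names are assumptions; the book's notebook may differ):

import torch
import torch.nn as nn

class MTLModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared layers: a common representation used by both tasks.
        self.model = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
        )
        # Task-specific heads: one output for sin(x) and one for cos(x).
        self.sin_head = nn.Linear(32, 1)
        self.cos_head = nn.Linear(32, 1)

    def forward(self, x):
        shared = self.model(x)
        return self.sin_head(shared), self.cos_head(shared)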
We are almost done. The final part of this implementation is to train the model.
Let’s use mean squared error as the loss function. The training loop is
implemented below:
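Continuing the sketch above (hyperparameters are placeholders), the two MSE losses are summed so that gradients from both branches accumulate into the shared layers:

model = MTLModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.linspace(-3, 3, 512).unsqueeze(1)   # inputs
y_sin, y_cos = torch.sin(x), torch.cos(x)     # targets for the two tasks

for epoch in range(200):
    pred_sin, pred_cos = model(x)
    # Total loss = sum of the task-specific losses; backpropagation sends
    # gradients from both heads into the shared layers.
    loss = loss_fn(pred_sin, y_sin) + loss_fn(pred_cos, y_cos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()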
With this, we have trained our MTL model. Also, we get a decreasing loss, which
depicts that the model is being trained.
And that’s how we train an MTL model. You can extend the same idea to build
any MTL model of your choice.
Do remember that building an MTL model on unrelated tasks will not produce
good results.
Or…
At times, I also use dynamic task weights, which could be inversely proportional
to the validation accuracy achieved on that task.
My rationale behind this technique is that in an MTL setting, some tasks can be
easy while others can be difficult.
If the model achieves high accuracy on one task during training, we can safely
reduce its loss contribution so that the model focuses more on the second task.
You can download the notebook for this chapter here: https://bit.ly/3ztY5hy.
Active Learning
There’s not much we can do to build a supervised system when the data we begin
with is unlabeled.
Using unsupervised techniques (if they fit the task) can be a solution, but unsupervised systems rarely match the performance of supervised ones.
Self-supervised learning is when we have an unlabeled dataset (say text data), but
we somehow figure out a way to build a supervised learning model out of it.
In a nutshell, its core objective is to predict the next token based on previously
predicted tokens (or the given context).
The model is only supposed to learn the mapping from previous tokens to the
next token.
At this stage, the only option one typically notices is annotating the dataset. However, data annotation is difficult, expensive, time-consuming, and tedious.
As the name suggests, the idea is to build the model with active human feedback
on examples it is struggling with. The visual below summarizes this:
While there’s no rule on how much data should be labeled, I have used active
learning (successfully) while labeling as low as ~1% of the dataset, so try
something in that range.
Of course, this won’t be a perfect model, but that’s okay. Next, generate
predictions on the dataset we did not label:
Since many of these predictions may be unreliable, we need to be a bit selective with the type of model we choose.
More specifically, we need a model that, either implicitly or explicitly, can also
provide a confidence level with its predictions.
Probabilistic models (ones that provide a probabilistic estimate for each class) are
typically a good fit here.
This is because one can determine a proxy for confidence level from probabilistic
outputs.
In the above two examples, consider the gap between 1st and 2nd highest
probabilities:
● In example #1, the gap is large. This can indicate that the model is quite
confident in its prediction.
● In example #2, the gap is small. This can indicate that the model is NOT
quite confident in its prediction.
Now, go back to the predictions generated above and rank them in order of
confidence:
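A minimal sketch of this ranking step (assuming a scikit-learn-style model with predict_proba; names like X_unlabeled are placeholders), using the gap between the top two probabilities as the confidence proxy:

import numpy as np

probs = model.predict_proba(X_unlabeled)      # shape: (n_samples, n_classes)

# Confidence proxy: gap between the highest and second-highest probabilities.
top2 = np.sort(probs, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]

# Smallest margins = least confident predictions = best candidates
# to send for human annotation in the next round.
least_confident_idx = np.argsort(margin)[:100]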
● The model is already quite confident with the first two instances. There’s
no point checking those.
● Instead, it would be best if we (the human) annotate the instances with
which it is least confident.
To get some more perspective, consider the image below. Logically speaking,
which data point’s human label will provide more information to the model? I
know you already know the answer.
Thus, in the next step, we provide our human labels for the low-confidence predictions and feed them back to the model along with the previously labeled dataset:
Repeat this a few times and stop when you are satisfied with the performance.
The only thing that you have to be careful about is generating confidence
measures.
If you mess this up, it will affect every subsequent training step.
While combining the low-confidence data with the seed data, we can also use the
high-confidence data. The labels would be the model’s predictions.
Faster Model Training with Momentum-based Optimization
And, of course, there are various ways to speed up model training, like:
● Batch processing
● Leverage distributed training using frameworks like PySpark MLLib.
● Use better Hyperparameter Optimization, like Bayesian Optimization,
which we will discuss in this chapter.
● and many other techniques.
Issues with Gradient Descent
In gradient descent, every parameter update solely depends on the current
gradient. This is clear from the gradient weight update rule shown below:
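In standard notation (mine, not necessarily the book's figure), with w as the parameters, η the learning rate, and L the loss, the rule is:

w_{t+1} = w_t - \eta \, \nabla_w L(w_t)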
Imagine this is the loss function contour plot, and the optimal location
(parameter configuration where the loss function is minimum) is marked here:
Simply put, this plot illustrates how gradient descent moves towards the optimal
solution. At each iteration, the algorithm calculates the gradient of the loss
function at the current parameter values and updates the weights.
How Does Momentum Solve the Problem?
Momentum-based optimization slightly modifies the update rule of gradient
descent. More specifically, it also considers a moving average of past gradients:
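One common formulation (notation mine) keeps an exponentially decaying moving average of past gradients and steps along it:

v_t = \beta \, v_{t-1} + (1 - \beta) \, \nabla_w L(w_t), \qquad w_{t+1} = w_t - \eta \, v_t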
How?
As this moving average gets added to the gradient updates, it helps the
optimization algorithm take larger steps in the desired direction.
This time, the gradient update trajectory shows much smaller oscillations in the
vertical direction, and it also manages to reach an optimum under the same
number of epochs as earlier.
If you want to have a more hands-on experience, check out this tool:
https://bit.ly/4cOrJN1.
Mixed Precision Training
Motive
Typical deep learning libraries are really conservative when it comes to assigning
data types.
The data type assigned by default is usually 64-bit or 32-bit, when there is also
scope for 16-bit, for instance. This is also evident from the code below:
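A quick check along those lines (a minimal sketch):

import numpy as np
import torch

print(np.array([1.0, 2.0]).dtype)        # float64: NumPy defaults to 64-bit
print(torch.tensor([1.0, 2.0]).dtype)    # torch.float32: PyTorch defaults to 32-bit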
Moreover, since float16 is only half the size of float32, its usage reduces the
memory required to train the network. This also allows us to train larger models,
train on larger mini-batches (resulting in even more speedup), etc.
Mixed precision training is a pretty reliable and widely adopted technique in the
industry to achieve this.
This is a list of some models I found that were trained using mixed precision:
It’s pretty clear that mixed precision training is quite popularly used, but we don’t get to hear about it often.
The Caveat of Training with Half Precision…
From the above discussion, it must be clear that as we use a low-precision data
type (float16), we might unknowingly introduce some numerical inconsistencies
and inaccuracies.
To avoid them, there are some best practices for mixed precision training that I
want to talk about next, along with the code.
Mixed Precision Training in PyTorch
Leveraging mixed precision training in PyTorch requires a few modifications in
the existing network training implementation. Consider this is our current
PyTorch model training implementation:
The first thing we introduce here is a scaler object that will scale the loss value:
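With PyTorch's automatic mixed precision utilities, that scaler can be created as follows (a sketch):

from torch.cuda.amp import GradScaler

scaler = GradScaler()   # dynamically scales the loss to avoid float16 underflow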
We do this because, at times, the original loss value can be so low, that we might
not be able to compute gradients in float16 with full precision. Such situations
may not produce any update to the model’s weights.
Scaling the loss to a higher numerical range ensures that even small gradients
can contribute to the weight updates.
But these minute gradients can only be accommodated into the weight matrix
when the weight matrix itself is represented in high precision, i.e., float32. Thus,
as a conservative measure, we tend to keep the weights in float32.
That said, the loss scaling step is not entirely necessary because, in my
experience, these little updates typically appear towards the end stages of the
model training. Thus, it can be fair to assume that small updates may not
drastically impact the model performance. But don’t take this as a definite conclusion; it’s something I want you to validate when you use mixed precision training.
The mixed-precision settings in the forward pass are carried out by the torch.autocast context manager:
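A rough sketch of one training step (model, loss_fn, optimizer, and train_loader are placeholders; scaler is the object created earlier):

from torch.cuda.amp import autocast

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    # The forward pass runs the eligible ops in float16 under autocast.
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Scale the loss, backpropagate, then unscale and apply the fp32 update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()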
Done!
The efficacy of mixed precision scaling over traditional training is evident from
the image below:
Gradient Checkpointing
Neural networks primarily utilize memory in two ways: storing the model parameters (along with their gradients and optimizer states) and storing the activations computed during the forward pass. Activations are the bigger concern here because their memory utilization scales proportionately with the batch size.
That said, there’s a pretty incredible technique that lets us increase the batch size
while maintaining the overall memory utilization.
How Gradient Checkpointing Works
Gradient checkpointing is based on two key observations about how neural networks typically work: the activations computed during the forward pass are needed again during the backward pass, and those activations can always be recomputed from earlier activations if we discard them. The procedure follows directly from these observations:
● Step 1) Divide the network into segments before the forward pass:
● Step 2) During the forward pass, only store the activations of the first layer
in each segment. Discard the rest when they have been used to compute
the activations of the next layer.
Done!
To summarize, the idea is that we don’t need to store all the intermediate
activations in memory. Instead, storing a few of them and recomputing the rest
only when they are needed can significantly reduce the memory requirement. The
whole idea makes intuitive sense as well.
Of course, as we compute some activations twice, this does come at the cost of
increased run-time, which can typically range between 15-25%. So there’s always
a tradeoff between memory and run-time.
That said, another advantage is that it allows us to use a larger batch size, which
can slightly (not entirely though) counter the increased run-time.
Here’s a demo.
Gradient Checkpointing in PyTorch
To utilize this, we begin by importing the necessary libraries and functions:
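A minimal sketch for a purely sequential model, using torch.utils.checkpoint (layer sizes and the segment count are assumptions):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(8)])
x = torch.randn(64, 512, requires_grad=True)

# Split the network into 4 segments: only the segment-boundary activations are
# stored; the rest are recomputed on the fly during the backward pass.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()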
Gradient Accumulation
Under memory constraints, it is always recommended to train the neural network with a small batch size.
Confused?
Why Do Memory Issues Arise When Training Complex Networks?
The bigger the network, the more activations a network must store in memory.
Also, under memory constraints, having a large batch size will result in:
How does gradient accumulation solve the memory problem while emulating a larger batch size?
Consider we are training a neural network on mini-batches.
● On every mini-batch:
○ Run the forward pass while storing the activations.
○ During backward pass:
■ Compute the loss
■ Compute the gradients
■ Update the weights
Gradient accumulation modifies the last step of the backward pass, i.e., weight
updates. More specifically, instead of updating the weights on every mini-batch,
we can do this:
For instance, say we want to use a batch size of 64. However, current memory can
only support a batch size of 16.
No worries!
Thus, effectively, we trained with a batch size of 16*8 (=128), even more than the 64 we originally intended.
Implementation
Let’s look at how we can implement this. In PyTorch, a typical training loop is
implemented as follows:
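A sketch of such a loop with the gradient-accumulation modification applied (the accumulation step count and the model, loss, and loader names are placeholders):

accumulation_steps = 8   # e.g., 8 mini-batches of 16 = an effective batch of 128

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient matches a large-batch average.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()              # gradients keep accumulating in .grad

    # Update the weights only once every `accumulation_steps` mini-batches.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()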
Departing Note
Before we end, it is essential to note that gradient accumulation is NOT a remedy
to improve run-time in memory-constrained situations. In fact, we can also verify
this from my experiment:
Of course, it’s true that we are updating the weights only after a few iterations.
So, it will be a bit faster than updating on every iteration. Yet, we are still
processing and computing gradients on small mini-batches, which is the core
operation here.
Nonetheless, the good thing is that even if you are not under memory constraints,
you can still use gradient accumulation.
Strategies for Multi-GPU Training
By default, deep learning models only utilize a single GPU for training, even if
multiple GPUs are available.
1) Model parallelism
● Different parts (or layers) of the model are placed on different GPUs.
● Useful for huge models that do not fit on a single GPU.
● However, model parallelism also introduces severe bottlenecks as it
requires data flow between GPUs when activations from one GPU are
transferred to another GPU.
2) Tensor parallelism
3) Data parallelism
4) Pipeline parallelism
● So the issue with standard model parallelism is that 1st GPU remains idle
when data is being propagated through layers available in 2nd GPU:
Regularization with Label Smoothing
For every instance in single-label classification datasets, the entire probability
mass belongs to a single class, and the rest are zero. This is depicted below:
The issue is that, at times, such label distributions excessively motivate the model
to learn the true class for every sample with pretty high confidence. This can
impact its generalization capabilities.
Simply put, this can be thought of as asking the model to be “less overconfident”
during training and prediction while still attempting to make accurate
predictions.
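In PyTorch, this is a one-line change when using cross-entropy (a sketch; the smoothing value is an assumption):

import torch.nn as nn

# Spreads a small amount (here 10%) of probability mass over the other classes,
# discouraging overly confident predictions.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)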
In this experiment, I trained two neural networks on the Fashion MNIST dataset
with the exact same weight initialization.
The model with label smoothing resulted in a better test accuracy, i.e., better
generalization.
When Not to Use Label Smoothing
After using label smoothing for many of my projects, I have also realized that it is
not well suited for all use cases. So it’s important to know when you should not
use it.
See, if you only care about getting the final prediction correct and improving generalization, label smoothing will be a pretty handy technique. However, I wouldn’t recommend utilizing it if you care about the model’s output confidence, i.e., well-calibrated probabilities. For instance:
● The model without label smoothing outputs 99% probability for class 3.
● With label smoothing, although the prediction is still correct, the
confidence drops to 74%.
This is something to keep in mind when using label smoothing. Nonetheless, the
technique is indeed pretty promising for regularizing deep learning models. You
can download the code notebook for this chapter here: https://bit.ly/4ePt08d.
Focal Loss
Binary classification tasks are typically trained using the binary cross-entropy (BCE) loss function:
That said, one limitation of BCE loss is that it weighs probability predictions for
both classes equally, which is evident from its symmetry:
For more clarity, consider the table below, which depicts two instances, one from
the minority class and another from the majority class, both with the same loss:
This causes problems when we use BCE for imbalanced datasets, wherein most
instances from the dominating class are “easily classifiable.” Thus, a given loss value from a majority-class instance should (ideally) be weighed LESS than the same loss value from a minority-class instance.
Focal loss is a pretty handy and useful alternative to address this issue. It is
defined as follows:
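In the usual formulation (for the positive class, with p the predicted probability and γ the focusing parameter), it reads:

\text{FL}(p) = -(1 - p)^{\gamma} \, \log(p)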
Plotting BCE (class y=1) and Focal loss (for class y=1 and γ=3), we get the
following curve:
As shown in the figure above, focal loss reduces the contribution of the
predictions the model is pretty confident about. Also, the higher the value of γ
(Gamma), the more downweighing takes place, as shown in this plot below:
Moving on, while the Focal loss function reduces the contribution of confident
predictions, we aren’t done yet.
To address this, we must add another weighing parameter (α), which is the
inverse of the class frequency, as depicted below:
Thus, the final loss function comes out to be the following:
By using both downweighing and inverse weighing, the model gradually learns
patterns specific to the hard examples instead of always being overly confident in
predicting easy instances.
Next, I trained two neural network models (with the same architecture of 2
hidden layers):
The decision region plot and test accuracy for these two models is depicted
below:
It is clear that:
● The model trained with BCE loss (left) always predicts the majority class.
● The model trained with focal loss (right) focuses relatively more on
minority class patterns. As a result, it performs better.
How Dropout Actually Works
Some time back, I was invited by a tech startup to conduct their ML interviews. I
interviewed 12 candidates and mostly asked practical ML questions.
However, there were some conceptual questions as well, like the one below,
which I intentionally asked every candidate:
How does Dropout work?
A typical candidate’s answer
In a gist, the idea is to zero out neurons randomly in a neural network. This is
done to regularize the network.
Dropout is only applied during training, and which neuron activations to zero out
(or drop) is decided using a Bernoulli distribution:
Why do you zero out neuron activations while training the network using Dropout?
Candidates: To ensure that the network does not solely depend on a few specific neurons, i.e., to regularize it.
Now, on to the part most people overlook…
Of course, I am not saying that the above details are incorrect. They are correct.
However, this is just 50% of how Dropout works, and disappointingly, most
resources don’t cover the remaining 50%. If you too are only aware of the 50%
details I mentioned above, continue reading.
How Dropout Actually Works
To begin, we must note that Dropout is only applied during training, but not
during the inference/evaluation stage:
Now, consider that a neuron’s input is computed using 100 neurons in the
previous hidden layer:
As a result (assuming each of those activations is 1), the input received by the blue neuron will be 100, as depicted below:
Now, during training, if we were using Dropout with, say, a 40% dropout rate,
then roughly 40% of the yellow neuron activations would have been zeroed out.
As a result, the input received by the blue neuron would have been around 60:
However, the above point is only valid for the training stage.
If the same scenario had existed during the inference stage instead, then the
input received by the blue neuron would have been 100.
During training, the average neuron inputs are significantly lower than those
received during inference.
More formally, using Dropout significantly affects the scale of the activations.
However, it is desired that the neurons throughout the model must receive the
roughly same mean (or expected value) of activations during training and
inference. To address this, Dropout performs one additional step.
The idea is to scale the remaining active inputs during training. The simplest way to do this is by scaling all activations during training by a factor of 1/(1-p), where p is the dropout rate. For instance, using this technique on the neuron input of 60, we get the following (recall that we set p=40%):
As depicted above, scaling the neuron input brings it to the desired range, which
makes training and inference stages coherent for the network.
Verifying This Experimentally
In fact, we can verify that typical implementations of Dropout, from PyTorch, for
instance, do carry out this step. Let’s define a dropout layer as follows:
Now, let’s consider a random tensor and apply this dropout layer to it:
What’s more, the retained values are precisely the same as we would have
obtained by explicitly scaling the input tensor with 1/(1-p):
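A quick way to run that check (a minimal sketch):

import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.4
drop = nn.Dropout(p=p)
drop.train()                       # Dropout is active only in training mode

x = torch.ones(10)
out = drop(x)
print(out)                         # zeros for dropped entries, 1/(1-p) ≈ 1.667 otherwise

# The retained values match explicit scaling of the input by 1/(1-p).
print(torch.allclose(out[out != 0], x[out != 0] / (1 - p)))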
Issues with Dropout in CNNs
When it comes to training neural networks, it is always recommended to use Dropout to improve their generalization power.
This applies not just to CNNs but to all other neural networks. And I am sure you
already know the above details, so let’s get into the interesting part.
The Problem with Using Dropout in CNNs
The core operation that makes CNNs so powerful is convolution, which allows
them to capture local patterns, such as edges and textures, and helps extract
relevant information from the input.
Here, if we were to apply traditional Dropout, the input features would look something like this:
But this isn’t found to be that effective specifically for convolution layers. To
understand this, consider we have some image data. In every image, we would
find that nearby features (or pixels) are highly correlated spatially.
For instance, imagine zooming in on the pixel level of the digit ‘9’. Here, we
would notice that the red pixel (or feature) is highly correlated with other features
in its vicinity:
Thus, dropping the red feature using Dropout will likely have little effect, since its information can still reach the next layer through its correlated neighbors.
Simply put, the nature of the convolution operation defeats the entire purpose of
the traditional Dropout procedure.
The Solution
DropBlock is a much better, effective, and intuitive way to regularize CNNs. The
core idea in DropBlock is to drop a contiguous region of features (or pixels)
rather than individual pixels. This is depicted below:
DropBlock Parameters
DropBlock has two main parameters: block_size (the size of the contiguous block to drop) and a drop rate that controls how many activation units are dropped:
To apply DropBlock, first, we create a binary mask on the input sampled from the
Bernoulli distribution:
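A rough sketch of that masking idea in plain PyTorch (not the official implementation; the drop rate, block size, and simplified seed probability are assumptions):

import torch
import torch.nn.functional as F

def dropblock(x, drop_rate=0.1, block_size=3):
    # x: feature map of shape (batch, channels, height, width)
    gamma = drop_rate / (block_size ** 2)            # simplified seed probability
    seeds = torch.bernoulli(torch.full_like(x, gamma))
    # Expand every sampled seed into a block_size x block_size region.
    block_mask = F.max_pool2d(seeds, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask
    # Rescale surviving activations, as in standard (inverted) Dropout.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)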
The efficacy of DropBlock over Dropout is evident from the results table below:
There’s also a library for DropBlock, called “dropblock,” which also provides a linear scheduler for the drop rate.
So the thing is that the researchers who proposed DropBlock found the
technique to be more effective when the drop_rate was increased gradually.
The DropBlock library implements the scheduler. But of course, there are ways to
do this in PyTorch as well. So it’s entirely up to you which implementation you
want to use:
What Do Activation Functions Actually Do?
Everyone knows the objective of an activation function in a neural network. They
let the network learn non-linear patterns. There is nothing new here, and I am
sure you are aware of that too.
In this chapter, let me share a unique perspective on this, which would really help
you understand the internal workings of a neural network.
I have supported this chapter with plenty of visuals for better understanding.
Also, for simplicity, we shall consider a binary classification use case.
Background
The data undergoes a series of transformations at each hidden layer:
Thus, to make accurate predictions, the data received by the output layer from
the last hidden layer MUST BE linearly separable.
To summarize….
While transforming the data through all its hidden layers and just before
reaching the output layer, a neural network is constantly hustling to project the
data to a space where it somehow becomes linearly separable. If it does, the
output layer becomes analogous to a logistic regression model, which can easily
handle this linearly separable data.
To visualize the input transformation, we can add a dummy hidden layer with
just two neurons right before the output layer and train the neural network again.
This way, we can easily visualize the transformation. We expect that if we plot
the activations of this 2D dummy hidden layer, they must be linearly separable.
The below visual precisely depicts this.
As we notice above, while the input data was linearly inseparable, the input
received by the output layer is indeed linearly separable.
This transformed data can be easily handled by the output classification layer.
And this shows that all a neural network is trying to do is transform the data into
a linearly separable form before reaching the output layer.
Shuffle Your Data Before Training
Some training issues are easy to spot, but the problem arises when the cause isn’t that apparent, and it may take some serious time to debug if you are unaware of such issues. In this chapter, I want to
talk about one such data-related mistake, which I once committed during my
early days in machine learning. Admittedly, it took me quite some time to figure
it out back then because I had no idea about the issue.
An Experiment
Consider a classification neural network trained using mini-batch gradient
descent.
Mini-batch gradient descent: update the model weights using one small batch of data at a time.
And, of course, before training, we ensure that both networks (one trained on label-ordered data and the other on shuffled data) had the same initial weights, learning rate, and other settings.
Why does this happen?
Now, if you think about it for a second, overall, both models received the same
data, didn’t they? Yet, the order in which the data was fed to these models totally
determined their performance. I vividly remember that when I faced this issue, I
knew that my data was ordered by labels.
Yet, it never occurred to me that ordering may influence the model performance
because the data will always be the same regardless of the ordering.
But in the case of mini-batch gradient descent, the weights are updated after
every mini-batch. Thus, the prediction and weight update on a subsequent
mini-batch is influenced by the previous mini-batches.
In the context of label-ordered data, where samples of the same class are grouped
together, mini-batch gradient descent will lead the model to learn patterns
specific to the class it excessively saw early on in training. In contrast, randomly
ordered data ensures that each mini-batch contains a balanced representation of
classes. This allows the model to learn a more comprehensive set of features
throughout the training process.
Of course, the idea of shuffling is not valid for time-series datasets as their
temporal structure is important. The good thing is that if you happen to use, say,
PyTorch DataLoader, you are safe. This is because it already implements
shuffling. But if you have a custom implementation, ensure that you are not
making any such error.
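For reference, this is the relevant flag (a sketch; train_dataset is a placeholder):

from torch.utils.data import DataLoader

# shuffle=True reshuffles the training data at the start of every epoch.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)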
Before I end, one thing that you must ALWAYS remember when training neural networks is that these models can proficiently latch onto patterns that don’t truly exist in your dataset. So never give them any chance to do so.
Model Compression
Knowledge Distillation for Model Compression
Model accuracy alone (or an equivalent performance metric) rarely determines
which model will be deployed.
What is Knowledge Distillation?
In a gist, the idea is to train a smaller/simpler model (called the “student” model)
that mimics the behavior of a larger/complex model (called the “teacher” model).
But with consistent training, a smaller model may get (almost) as good as the
larger one.
A Knowledge Distillation Example
In the interest of time, let’s say we have already trained the following CNN model
on the MNIST dataset (I have provided the full Jupyter notebook towards the end,
don’t worry):
Being a classification model, the output will be a probability distribution over the
<N> classes:
Thus, we can train the student model such that its probability distribution
matches that of the teacher model.
To do this, we can train the student by minimizing the KL divergence between its output distribution and the teacher's:
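A sketch of that objective (the temperature T, the weighting α, and the mixed-in hard-label term are common choices, not necessarily the book's exact setup):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=3.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard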
Done!
The following image compares the training loss and validation accuracy of the
two models:
However, it is still pretty promising, given that it was only composed of simple
feed-forward layers.
Also, as depicted below, the student model is approximately 35% faster than the teacher model, which is a significant improvement in inference run-time for only a marginal drop in the test accuracy.
That said, one of the biggest downsides of knowledge distillation is that one must
still train a larger teacher model first to train the student model.
In the next chapter, let’s discuss one more technique to compress ML models and
reduce their memory footprint.
Activation Pruning
Once we complete network training, we are almost always left with plenty of
useless neurons — ones that make nearly zero contribution to the network’s
performance, but they still consume memory.
In other words, there is a high percentage of neurons, which, if removed from the
trained network, will not affect the performance remarkably:
And, of course, I am not saying this as a random and uninformed thought. I have
experimentally verified this over and over across my projects.
Thus, they can be pruned from the network, as they will have very little impact on
the model’s output.
For pruning, we can decide on a pruning threshold (λ) and prune all neurons
whose activations are less than this threshold.
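A rough sketch of that idea for one fully connected layer (the threshold and the way activations are collected are assumptions; this zeroes out the weights of low-activation neurons rather than physically removing them):

import torch

@torch.no_grad()
def prune_low_activation_neurons(layer, activations, threshold=0.4):
    # activations: outputs of `layer` collected over a validation set,
    # with shape (n_samples, n_neurons).
    mean_act = activations.abs().mean(dim=0)
    dead = mean_act < threshold            # neurons contributing very little
    layer.weight[dead, :] = 0.0            # silence their incoming weights
    if layer.bias is not None:
        layer.bias[dead] = 0.0
    return int(dead.sum())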
At a pruning threshold λ=0.4, the validation accuracy of the model drops by just
0.62%, but the number of parameters drops by 72%.
That is a huge reduction, with both models being almost equally good! Of course, there is a trade-off because we are not doing quite as well as the original model. But in many cases, especially when deploying ML models, accuracy is not the only metric that drives these decisions.
Deployment
Deploy ML Models Right from a Jupyter Notebook
The core objective of model deployment is to obtain an API endpoint that can be
used for inference purposes:
So, in this chapter, I want to help you simplify this process. More specifically, we
shall learn how to deploy any ML model right from a Jupyter Notebook in just
three simple steps using the Modelbit API.
Deploying with Modelbit
Assume we have already trained our model.
● Next, we log in to Modelbit from our Jupyter Notebook (make sure you
have created an account here: Modelbit)
Simply put, this function contains the code that will be executed at inference.
Thus, it will be responsible for returning the prediction.
We must specify the input parameters required by the model in this method.
Also, we can name it anything we want.
For our linear regression case, the inference function can be as follows:
● We define the inference function.
● Next, we specify the input of the model as a parameter of this method.
● We validate the input for its data type.
● Finally, we return the prediction.
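Putting the bullets above together, the flow looks roughly like this (a sketch assuming Modelbit's login()/deploy() pattern; the function name, input validation, and model object are illustrative):

import modelbit

mb = modelbit.login()                        # step 1: authenticate from the notebook

def predict_price(x: float) -> float:        # step 2: the inference function
    if not isinstance(x, (int, float)):      # validate the input's data type
        raise TypeError("x must be numeric")
    return float(model.predict([[x]])[0])    # `model` is pickled and shipped automatically

mb.deploy(predict_price)                     # step 3: deploy and get an API endpoint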
One good thing about Modelbit is that every dependency of the function (the
model object in this case) is pickled and sent to production automatically along
with the function. Thus, we can reference any object in this method. Once we
have defined the function, we can proceed with deployment as follows:
We have successfully deployed the model in three simple steps, right from the Jupyter Notebook! Once our model has been successfully deployed, it will appear in our Modelbit dashboard.
As shown above, Modelbit provides an API endpoint. We can use it for inference
purposes as follows:
The first number in the list is the input ID. All entries following the ID in a list
are the function parameters.
Lastly, we can also specify specific versions of the libraries or Python used while
deploying our model. This is depicted below:
Isn’t that cool, simple, and elegant over traditional deployment approaches?
Ways to Test an ML Model in Production
Despite rigorously testing an ML model locally (on validation and test sets), it
could be a terrible idea to instantly replace the previous model with a new model.
1) A/B Testing
2) Canary Testing
3) Interleaved Testing
4) Shadow Testing
Periodic Retraining, Logging, and the Model Registry
Real-world ML deployment is never just about “deployment” — host the model
somewhere, obtain an API endpoint, integrate it into the application, and you are
done!
1) Periodic retraining
But updating does not simply mean overwriting the previous version.
2) Model registry
Another practical idea is to maintain a model registry for deployments. Let’s
understand what it is.
However, when we use a model registry, we version models separately from the
code. Let me give you an intuitive example to understand this better. Imagine our
deployed model takes three inputs to generate a prediction:
While writing the inference code, we overlooked that, at times, one of the inputs
might be missing. We realized this by analyzing the model’s logs.
We may want to fix this quickly (at least for a while) before we decide on the next
steps more concretely. Thus, we may decide to update the inference code by
assigning a dummy value for the missing input.
This will allow the model to still process the incoming request. But does this quick fix require training a new model? No, right?
Here, we only need to update the inference code. The model will remain the
same.
But if we were to version the model and code together, it would lead to a
redundant model and take up extra space.
LLMs
The Memory Requirements of Training GPT-2 (XL)
GPT-2 (XL) has 1.5 Billion parameters, and its parameters consume ~3GB of
memory in 16-bit precision.
What’s your estimate for the minimum memory needed to train GPT-2 on a
single GPU?
● Optimizer → Adam
● Batch size → 32
● Number of transformer layers → 48
● Sequence length → 1000
So, what's your estimate of the minimum memory required to train this GPT-2 model on a single GPU?
One can barely train a 3GB GPT-2 model on a single GPU with 32GB of
memory.
But how could that be even possible? Where does all the memory go?
Let’s understand.
There are so many fronts on which the model consistently takes up memory
during training.
1) Optimizer states, gradients, and parameter memory
Mixed precision training is widely used to speed up model training.
Both the forward and backward propagation are performed using the 16-bit
representations of weights and gradients.
Moreover, the updates at the end of the backward propagation are still performed
under 32-bit for effective computation. I am talking about the circled step in the
image below:
While many practitioners use it just because it is popular, they don’t realize that
during training, Adam stores two optimizer states to compute the updates —
momentum and variance of the gradients:
Thus, if the model has Φ parameters, then these two optimizer states will
consume:
Lastly, as shown in the figure above, the final updates are always adjusted in the
32-bit representation of the model weights. This leads to:
That’s 16*Φ, or 24GB of memory, which is ridiculously higher than the 3GB
memory utilized by 16-bit parameters.
2) Activations
For big deep learning models, like LLMs, Activations take up significant memory
during training.
More formally, the total number of activations computed in one transformer block of GPT-2 is:
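One commonly cited estimate (from the Megatron-LM activation-recomputation analysis, not necessarily the exact figure used here) puts the half-precision activation memory per transformer layer at roughly:

s \cdot b \cdot h \left( 34 + 5 \, \frac{a \cdot s}{h} \right) \ \text{bytes}

where b is the batch size, s the sequence length, h the hidden dimension, and a the number of attention heads.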
Memory fragmentation adds further overhead: it occurs when there are small, unused gaps between allocated memory blocks, leading to inefficient use of the available memory.
Conclusion
In the above discussion, we considered a relatively small model — GPT-2 (XL)
with 1.5 Billion parameters, which is tiny compared to the scale of models being
trained these days.
However, the discussion may have helped you reflect on the inherent challenges
of building LLMs. Many people often say that GPTs are only about stacking more
and more layers in the model and making the network bigger.
If it was that easy, everybody would have been doing it. From this discussion, you
may have understood that it’s not as simple as appending more layers.
Even one additional layer can lead to multiple GBs of additional memory
requirement. Multi-GPU training is at the forefront of these models, which we
covered in an earlier chapter in this book.
Full Fine-tuning vs. LoRA Fine-tuning vs. RAG
Here’s a visual which illustrates “full-model fine-tuning,” “fine-tuning with
LoRA,” and “retrieval augmented generation (RAG).”
All three techniques are used to augment the knowledge of an existing model
with additional data.
1) Full fine-tuning
Fine-tuning means adjusting the weights of a pre-trained model on a new dataset
for better performance.
While this fine-tuning technique has been successfully used for a long time,
problems arise when we use it on much larger models — LLMs, for instance,
primarily because of:
● Their size.
● The cost involved in fine-tuning all weights.
● The cost involved in maintaining all large fine-tuned models.
2) LoRA fine-tuning
LoRA fine-tuning addresses the limitations of traditional fine-tuning. The core
idea is to decompose the weight matrices (some or all) of the original model into
low-rank matrices and train them instead. For instance, in the graphic below, the
bottom network represents the large pre-trained model, and the top network
represents the model with LoRA layers.
The idea is to train only the LoRA network and freeze the large model.
But the LoRA network in the graphic appears to have more neurons than the original model, so how does that help? To understand this, note that neurons themselves don't have anything to do with the memory consumed by the network.
They are just used to illustrate the dimensionality transformation from one layer
to another.
It is the weight matrices (or the connections between two layers) that take up
memory. Thus, we must be comparing these connections instead:
Looking at the above visual, it is pretty clear that the LoRA network has
relatively very few connections.
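A minimal sketch of a LoRA-style layer (the rank, scaling, and initialization are common defaults, not the only choices):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # freeze the pre-trained weights
        # Low-rank update W + (alpha/rank) * B @ A, with far fewer trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)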
3) RAG
Retrieval augmented generation (RAG) is another pretty cool way to augment
neural networks with additional information, without having to fine-tune the
model.
There are seven steps, which are also marked in the above visual. In short: embed the additional documents, store the embeddings in a vector database, embed the incoming user query, run a similarity search to retrieve the most relevant chunks, add the retrieved context to the prompt, pass the augmented prompt to the LLM, and return the generated response.
In fact, even its name entirely justifies what we do with this technique:
Of course, there are many problems with RAG too, such as:
● RAGs involve similarity matching between the query vector and the vectors
of the additional documents. However, questions are structurally very
different from answers.
● Typical RAG systems are well-suited only for lookup-based
question-answering systems. For instance, we cannot build a RAG pipeline
to summarize the additional data. The LLM never gets info about all the
documents in its prompt because the similarity matching step only
retrieves top matches.
So, it’s pretty clear that RAG has both pros and cons.
LLM Fine-tuning Techniques
Traditional fine-tuning (depicted below) is infeasible with LLMs because these
models have billions of parameters and are hundreds of GBs in size, and not
everyone has access to such computing infrastructure.
But today, we have many optimal ways to fine-tune LLMs, and five popular
techniques are depicted below:
- VeRA: In LoRA, every layer has a different pair of low-rank matrices A and
B, and both matrices are trained. In VeRA, however, matrices A and B are
frozen, random, and shared across all model layers. VeRA focuses on
learning small, layer-specific scaling vectors, denoted as b and d, which are
the only trainable parameters in this setup.
- LoRA+: In LoRA, both matrices A and B are updated with the same
learning rate. Authors found that setting a higher learning rate for matrix
B results in more optimal convergence.
ML Fundamentals
Run-time Complexity of 10 ML Algorithms
Here’s the run-time complexity of the 10 most popular ML algorithms.
But why even care about run time? There are multiple reasons why I always care
about run time and why you should too.
For instance, you’ll be up for a big surprise if you use SVM or t-SNE on a dataset
with plenty of samples.
● In a random forest, all decision trees may have different depths. But here, I
have assumed that they are equal.
● During inference in kNN, we first find the distance to all data points. This
gives a list of distances of size (total samples).
○ Then, we find the k-smallest distances from this list.
○ The run-time to determine the k-smallest values may depend on the
implementation.
■ Sorting and selecting the k-smallest values will be O(n log n).
■ But if we use a priority queue, it will take O(n log k).
● In t-SNE, there’s a learning step. Since the major run-time comes from
computing the pairwise similarities in the high-dimensional space, we
have ignored that step.
Nonetheless, the table still accurately reflects the general run-time of each of
these algorithms.
The Most Important Mathematical Definitions in Data Science
Is it really important to know the mathematical details behind algorithms in data science and machine learning?
This is a question that so many people have, especially those who are just getting
started.
Short answer: Yes, it’s important, and here’s why I say so.
In fact, I know many data scientists (mainly on the applied side) who do not
entirely understand the mathematical details but can still build and deploy
models.
Nothing wrong.
However, when I talk to them, I also see some disconnect between “What they
are using” and “Why they are using it.”
If it feels like you are one of them, it’s okay. This problem can be solved.
That said, if you genuinely aspire to excel in this field, building a curiosity for the
underlying mathematical details holds exponential returns.
To help you take that first step, I prepared the following visual, which lists some
of the most important mathematical formulations used in Data Science and
Statistics (in no specific order).
Before reading ahead, look at them one by one and count how many of them you already know:
Some of the terms are pretty self-explanatory, so I won’t go through each of them,
like:
● Eigenvectors: the non-zero vectors that do not change their direction when a linear transformation is applied. They are widely used in dimensionality reduction techniques like PCA.
How to Reliably Evaluate Multiclass-Classification Models
ML model building is typically an iterative process. Given some dataset:
● We train a model.
● We evaluate it.
● And we continue to improve it until we are satisfied with the performance.
Here, the efficacy of any model improvement strategy (say, introducing a new
feature) is determined using some sort of performance metric.
Iteratively improving multiclass-classification models with plain Accuracy alone can be misleading because it hides whether the model is at least getting close to the correct label.
Let’s understand.
Limitations of Accuracy
In probabilistic multiclass-classification models, Accuracy is determined using
the output label that has the highest probability:
Now, it’s possible that the actual label is not predicted with the highest
probability by the model, but it’s in the top “k” output labels.
For instance, in the image below, the actual label (Class C) is not the highest
probability label, but it’s at least in the top 2 predicted probabilities (Class B and
Class C):
And what if in an earlier version of our model, the output probability of Class C
was the lowest, as depicted below:
Nonetheless, Accuracy entirely discards this as it only cares about the highest
probability label.
The Solution
Whenever I am building and iteratively improving any probabilistic multiclass
classification model, I always use the top-k accuracy score. As the name suggests,
it computes whether the correct label is among the k labels with the highest predicted probabilities or not.
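scikit-learn ships this metric directly (a sketch; the model and validation arrays are placeholders):

from sklearn.metrics import top_k_accuracy_score

probs = model.predict_proba(X_val)              # shape: (n_samples, n_classes)
print(top_k_accuracy_score(y_val, probs, k=3))  # fraction of samples whose true label is in the top 3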
As you may have already guessed, top-1 accuracy score is the traditional
Accuracy score. This is a much better indicator to assess whether my model
improvement efforts are translating into meaningful enhancements in predictive
performance or not.
For instance, if the top-3 accuracy score goes from 75% to 90%, this totally
suggests that whatever we did to improve the model was effective:
● Earlier, the correct prediction was in the top 3 labels only 75% of the time.
● But now, the correct prediction is in the top 3 labels 90% of the time.
As a result, one can effectively redirect their engineering efforts in the right
direction. Of course, what I am saying should only be used to assess the model
improvement efforts.
Guiding Model Improvement Efforts with a Human Baseline
Back in 2019, I was working with an ML research group in Germany, where a Ph.D. student was collecting human labels for a dataset that already had ground-truth labels available. This made me curious about why gathering human labels was necessary for him. So I asked.
Consider we are building a multiclass classification model. Say it’s a model that
classifies an input image as a rock, paper, or scissors:
For simplicity, let’s assume there’s no class imbalance. Calculating the class-wise
validation accuracies gives us the following results:
Question: Which class should you focus on improving to further enhance the model's overall performance?
But this might not be true. And this is precisely what that Ph.D. student wanted
to verify by collecting human labels. Let’s say that the human labels give us the
following results:
Based on this, do you still think the model performs the worst on the “Scissor”
class?
No, right?
I mean, of course, the model has the least accuracy on the “Scissor” class, and I
am not denying it. However, with more context, we notice that the model is doing
a pretty good job classifying the “Scissor” class. This is because an average
human is achieving just 2% higher accuracy in comparison to what our model is
able to achieve.
However, the above results astonishingly reveal that it is the “Rock” class instead
that demands more attention. The accuracy difference between an average
human and the model is way too high (13%). Had we not known this, we would
have continued to improve the “Scissor” class, when in reality, “Rock” requires
more improvement.
Ever since I learned this technique, I have found it super helpful to determine my
next steps for model improvement, if possible. I say “if possible” because I
understand that many datasets are hard for humans to interpret and label.
Nonetheless, if it is feasible to set up such a “human baseline,” one can get so
much clarity into how the model is performing.
As a result, one can effectively redirect their engineering efforts in the right
direction.
Of course, I am not claiming that this will be universally useful in all use cases.
For instance, if the model is already performing better than the baseline, the
model improvements from there on will have to be guided based on past results.
Yet, in such cases, surpassing a human baseline at least helps us validate that the
model is doing better than what a human can do.
Loss Functions Used by Popular ML Algorithms
The below visual depicts the most commonly used loss functions by various ML
algorithms.
Commonly Used Loss Functions for Regression and Classification
The below visual depicts some commonly used loss functions in regression and
classification tasks.
How to Actually Use Train, Validation, and Test Sets
It is pretty conventional to split the given data into train, test, and validation sets.
However, there are quite a few misconceptions about how they are meant to be
used, especially the validation and test sets.
In this chapter, let’s clear them up and see how to truly use train, validation, and
test sets.
● Train
● Validation
● Test
At this point, just assume that the test data does not even exist. Forget about it
instantly.
Begin with the train set. This is your whole world now.
● You analyze it
● You transform it
● You use it to determine features
● You fit a model on it
Based on validation performance, improve the model. Here’s how you iteratively
build your model:
Until...
You reach a point where you start overfitting the validation set.
This indicates that you have exploited (or polluted) the validation set.
No worries.
Merge it with the train set and generate a new split of train and validation.
Note: Rely on cross-validation if needed, especially when you don’t have much
data. You may still use cross-validation if you have enough data. But it can be
computationally intensive.
Now, if you are happy with the model’s performance, evaluate it on test data.
If the test performance is unsatisfactory, most people go back to improving the model and then evaluate on the same test set again. This is not allowed!
Your professor taught you in the classroom. All in-class lessons and examples are
the train set.
The professor gave you take-home assignments, which acted like validation sets.
You got some wrong and some right. Based on this, you adjusted your topic
fundamentals, i.e., improved the model.
Now, if you keep solving the same take-home assignment repeatedly, you will
eventually overfit it, won’t you?
The final exam day paper is your test set. If you do well, awesome!
But if you fail, the professor cannot give you the exact exam paper next time, can
they? This is because you know what’s inside.
Of course, by evaluating a model on the test set, the model never gets to “know”
the precise examples inside that set. But the issue is that the test set has been
exposed now.
Your previous evaluation will inevitably influence any further evaluations on that
specific test set. That is why you must always use a specific test set only ONCE.
Once you do, merge it with the train and validation set and generate an entirely
new split.
Repeat.
And that is how you use train, validation, and test sets in machine learning.
Cross-Validation Techniques
Tuning and validating machine learning models on a single validation set can be
misleading and sometimes yield overly optimistic results.
This can occur due to a lucky random split of data, which results in a model that
performs exceptionally well on the validation set but poorly on new, unseen data.
Cross validation involves repeatedly partitioning the available data into subsets,
training the model on a few subsets, and validating on the remaining subsets.
The main advantage of cross validation is that it provides a more robust and
unbiased estimate of model performance compared to the traditional validation
method.
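For a quick sense of what this looks like in code (a sketch using scikit-learn; the model and data are placeholders):

from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)   # one score per held-out fold
print(scores.mean(), scores.std())             # a more robust performance estimate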
Below are five of the most commonly used and must-know cross validation
techniques.
Leave-One-Out Cross-Validation
K-Fold Cross-Validation
Rolling Cross-Validation
Blocked Cross-Validation
Stratified Cross-Validation
What to Do After Cross-Validation?
Let me ask you a question.
But before I do that, I need to borrow your imagination for just a moment.
Imagine you are building a machine learning model. After using cross-validation to determine the optimal hyperparameters of your regressor, would you deploy the best model obtained during cross-validation, or would you retrain a new model on the entire dataset with those hyperparameters?
The Recommendation
My strong preference has almost always been “retraining a new model with
entire data.”
There are, of course, some considerations to keep in mind, which I have learned
through the models I have built and deployed. That said, in most cases, retraining
is the ideal way to proceed.
Let me explain.
h e i h o l
We would want to retrain a new model because, in a way, we are already satisfied
with the cross-validation performance, which, by its very nature, is an out-of-fold
metric.
An out-of-fold data is data that has not been seen by the model during the
training. An out-of-fold metric is the performance on that data.
In other words, we already believe that the model aligns with how we expect it to
perform on unseen data.
Thus, incorporating this unseen validation set in the training data and retraining
the model will MOST LIKELY have NO effect on its performance on unseen data
after deployment (assuming a sudden covariate shift hasn't kicked in, which is a
different issue altogether).
If, however, we were not satisfied with the cross-validation performance itself, we
wouldn’t even be thinking about finalizing a model in the first place.
It’s hard for me to recall any instance where retraining did something
disastrously bad to the overall model.
In fact, I vividly remember one instance wherein, while I was productionizing the
model (it took me a couple of days after retraining), the team had gathered some
more labeled data.
The model didn’t show any performance degradation when I evaluated it (just to
double-check). As an added benefit, this also helped ensure that I had made no
errors while productionizing my model.
Some Considerations
Here, please note that it’s not a rule that you must always retrain a new model.
The field itself and the tasks one can solve are pretty diverse, so one must be
open-minded while solving the problem at hand. One of the reasons I wouldn’t
want to retrain a new model is that it takes days or weeks to train the model.
In fact, even if we retrain a new model, there are MANY business situations in
which stakes are just too high.
Thus, one can never afford to be negligent about deploying a model without
re-evaluating it — transactional fraud, for instance.
In such cases, I have seen that while a team works on productionizing the model,
data engineers gather some more data in the meantime.
Before deploying, the team would do some final checks on that dataset.
The newly gathered data is then considered in the subsequent iterations of model
improvements.
Double Descent and the Bias-Variance Trade-off
It is well-known that as the number of model parameters increases, we typically
overfit the data more and more. For instance, consider fitting a polynomial
regression model trained on this dummy dataset below:
In your opinion, what should happen to the train and test losses of our model?
It is expected that as we’ll increase the degree (m) and train the polynomial
regression model:
This is because, with a higher degree, the model will find it easier to contort its
regression fit through each training data point, which makes sense.
Why does the test loss increase to a certain point but then decrease?
Well…what you are seeing is called the “double descent phenomenon,” which is
quite commonly observed in many ML models, especially deep learning models.
In fact, this whole idea is deeply rooted in why LLMs, although massively big (billions or even trillions of parameters), can still generalize pretty well.
And it’s hard to accept it because this phenomenon directly challenges the
traditional bias-variance trade-off we learn in any introductory ML class:
Putting it another way, training very large models, even with more parameters
than training data points, can still generalize well.
To the best of my knowledge, this is still an open question, and it isn’t entirely
clear why neural networks exhibit this behavior.
There are some theories, however, such as this one around regularization:
it could be that the model applies some sort of implicit regularization, with
which it can focus on just the right number of parameters for generalization.
Statistical Foundations
MLE vs. EM — what’s the difference?
Maximum likelihood estimation (MLE) and expectation maximization (EM) are
two popular techniques to determine the parameters of statistical models.
Due to their applicability in MANY statistical models, I have seen them come up in
plenty of data science interviews as well, especially the distinction between the
two. The following visual summarizes how they work:
Maximum likelihood estimation (MLE)
MLE starts with a labeled dataset and aims to determine the parameters of the
statistical model we are trying to fit.
This gives our parameter estimates that would have most likely generated the
given data.
But what do we do if we don’t have true labels? We still want to estimate the
parameters, don’t we?
MLE, as you may have guessed, will not be applicable. The true label (y), being
unobserved, makes it impossible to define a likelihood function like we did
earlier.
Expectation maximization (EM)
EM is an iterative optimization technique to estimate the parameters of
statistical models. It is particularly useful when we have an unobserved (or
hidden) label. One example situation could be as follows:
As depicted above, we assume that the data was generated from multiple
distributions (a mixture). However, the observed/complete data does not contain
that information. In other words, the observed dataset does not have information
about whether a specific row was generated from distribution 1 or distribution 2.
Had it contained the label (y) information, we would have already used MLE.
EM helps us with parameter estimates of such datasets. The core idea behind EM
is as follows:
● We will update the likelihood function (L) using the new posterior
probabilities.
● Again, maximizing it will give us a new estimate for the parameters (θ).
● And this process goes on and on until convergence.
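To make this concrete, here is a minimal sketch (not the book’s original code) using scikit-learn’s GaussianMixture, which runs exactly this E-step / M-step loop under the hood and recovers the mixture parameters without ever seeing the hidden labels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Observed data: a mixture of two Gaussians, but WITHOUT the source labels
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1.5, 500)]).reshape(-1, 1)

# GaussianMixture repeats the E-step / M-step loop until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

print(gmm.means_.ravel())        # estimated means of the two hidden distributions
print(gmm.predict_proba(x[:5]))  # posterior probabilities computed in the E-step
```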
Confidence interval and prediction interval
Statistical estimates always have some uncertainty.
Consider a simple example of modeling house prices just based on its area. A
prediction wouldn’t tell the true value of a house based on its area. This is
because different houses of the same size can have different prices.
Instead, what it predicts is the mean value related to the outcome at a particular
input.
Let’s understand.
Let’s fit a linear regression model using statsmodel and print a part of the
regression summary:
What does the 95% confidence interval, i.e., the [0.025, 0.975] columns in the summary, tell us?
So, if we gathered more such samples and fit an OLS to each sample, the true
coefficient (which we can only know if we had the data for the entire population)
would lie 95% of the time in this confidence interval.
The confidence interval we saw above was for the coefficient, so what does the
confidence interval represent in this case?
Similar to what we discussed above, the data is just a sample of the population.
However, if we gathered more such samples and fit an OLS to each dataset, the
true mean value (which we can only know if we had the data for the entire
population) for this specific input would lie 95% of the time in this
confidence interval.
…we notice that it is wider than the confidence interval. Why is it, and what does
this interval tell?
What we saw above with confidence interval was about estimating the true
population mean at a specific input.
What we are talking about now is obtaining an interval where the true value for
an input can lie.
Thus, this additional uncertainty appears because in our dataset, for the same
value of input x, there can be multiple different values of the outcome. This is
depicted below:
Thus, it is wider than the confidence interval. Plotting it across the entire input
range, we get the following plot:
Given that the model is predicting a mean value (as depicted below), we have to
represent the prediction uncertainty that the actual value can lie anywhere in the
prediction interval:
A 95% prediction interval tells us that we can be 95% sure that the actual value
of this observation will fall within this interval.
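In statsmodels, both intervals can be obtained from the same fitted model. A minimal sketch, reusing the hypothetical model fitted earlier in this chapter, might look like this:

```python
import numpy as np

# Query the model at a specific input, say x = 4 (column order: [intercept, x])
x_new = np.array([[1.0, 4.0]])
frame = model.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_* -> 95% confidence interval for the mean response at x = 4
# obs_ci_*  -> 95% prediction interval for an individual observation at x = 4
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```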
So to summarize:
Why is OLS called an unbiased estimator?
The OLS estimator for linear regression (shown below) is known as an unbiased
estimator.
Background
The goal of statistical modeling is to make conclusions about the whole
population.
In other words, given that we cannot observe (or collect data of) the entire
population, we cannot obtain the true parameter (β) for the population:
Thus, we must obtain parameter estimates (B̂) on samples and infer the true
parameter (β) for the population from those estimates:
And, of course, we want these sample estimates (B̂) to be reliable to determine the
actual parameter (β).
True population model
When using a linear regression model, we assume that the response variable (Y)
and features (X) for the entire population are related as follows:
Expected value of OLS estimates
The closed-form solution of OLS is given by:
What’s more, as discussed above, using OLS on different samples will result in
different parameter estimates:
Simply put, the expected value is the average value of the parameters if we run
OLS on many samples.
Here, substitute y = Xβ + ε:
See, we can do that substitution because even if we don’t know the parameter β
for the whole population, we know that the sample was drawn from the
population.
Thus, the equation in terms of the true parameters (y = Xβ + ε) still holds for
the sample.
Now, even if we were to draw samples from this population data, the true
equation y_i = x_i β + ε_i would still be valid on the sampled data points,
wouldn’t it?
Simplifying, we get:
The expected value of parameter estimates on the samples equals the true
parameter value β.
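A quick simulation makes this tangible. The sketch below (my own toy setup, not from the book) repeatedly samples data from a known population, runs OLS on each sample, and averages the estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 3.0])              # [intercept, slope] of the population

estimates = []
for _ in range(5_000):                        # many independent samples
    x = rng.uniform(0, 10, 50)
    X = np.column_stack([np.ones(50), x])
    y = X @ true_beta + rng.normal(0, 2, 50)
    estimates.append(np.linalg.inv(X.T @ X) @ X.T @ y)   # OLS closed form

print(np.mean(estimates, axis=0))             # ~[2.0, 3.0]: unbiased on average
```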
An important takeaway
Many people misinterpret unbiasedness with the idea that the parameter
estimates from a single run of OLS on a sample are equal to the true parameter
values.
And, of course, all this is based on the assumption that we have good
representative samples and that the assumptions of linear regression are not
violated.
Bhattacharyya distance
Assessing the similarity between two probability distributions is quite helpful at
times. For instance, imagine we have a labeled dataset (X, y).
The core idea is to approximate the overlap between two distributions, which
measures the “closeness” between the two distributions under consideration.
Here, we have an observed distribution (blue). Next, we measure its distance from:
● Gaussian → 0.19
● Gamma → 0.03
A high Bhattacharyya distance indicates less overlap or more dissimilarity. This
lets us conclude that the observed distribution resembles a Gamma distribution.
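A minimal, histogram-based sketch of this comparison (my own approximation of the idea; the bin count and the 1e-12 guard are arbitrary choices) is shown below:

```python
import numpy as np

def bhattacharyya_distance(p_samples, q_samples, bins=30):
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))        # Bhattacharyya coefficient (overlap)
    return -np.log(bc + 1e-12)         # less overlap -> larger distance

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=2.0, size=5_000)
print(bhattacharyya_distance(observed, rng.gamma(2.0, 2.0, 5_000)))    # small
print(bhattacharyya_distance(observed, rng.normal(4.0, 2.8, 5_000)))   # larger
```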
KL Divergence vs. Bhattacharyya distance
Now, many often get confused between KL Divergence and Bhattacharyya
distance. Effectively, both are quantitative measures to determine the “similarity”
between two distributions.
The core idea behind KL Divergence is to assess how much information is lost
when one distribution is used to approximate another distribution.
The more information is lost, the more the KL Divergence and, consequently, the
less the “similarity”. Also, approximating a distribution Q using P may not be the
same as doing the reverse — P using Q. This makes KL Divergence asymmetric
in nature.
Moving on, Bhattacharyya distance measures the overlap between two distributions.
Bhattacharyya distance has many applications, not just in machine learning but
in many other domains. For instance:
The only small caveat is that Bhattacharyya distance does not satisfy the triangle
inequality, so that’s something to keep in mind.
Why prefer Mahalanobis distance over Euclidean distance
During distance calculation, Euclidean distance assumes independent axes.
Thus, Euclidean distance will produce misleading results if your features are
correlated. For instance, consider this dummy dataset below:
Clearly, the features are correlated. Here, consider three points marked P1, P2,
and P3 in this dataset.
Yet, Euclidean distance ignores this, and P2 and P3 come out to be equidistant from
P1, as depicted below:
Mahalanobis distance, on the other hand, accounts for this correlation structure.
As a result, it can measure how far away a data point is from the distribution,
which Euclidean distance cannot.
Referring to the earlier dataset again, with Mahalanobis distance, P2 comes out
to be closer to P1 than P3.
How does it work?
In a gist, the objective is to construct a new coordinate system with independent
and orthogonal axes. The steps are:
Another use case we typically do not hear of often, but that does exist, is a variant of
kNN implemented with Mahalanobis distance instead.
Scipy implements the Mahalanobis distance, which you can check here:
https://bit.ly/3LjAymm.
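A minimal sketch of the contrast on correlated features (my own dummy data, not the book’s) could look like this:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 1_000)
X = np.column_stack([x1, 0.9 * x1 + rng.normal(0, 0.3, 1_000)])   # correlated features

VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
center = X.mean(axis=0)

p2 = np.array([1.0, 1.0])     # along the correlation direction
p3 = np.array([1.0, -1.0])    # against the correlation direction
print(euclidean(center, p2), euclidean(center, p3))              # roughly equal
print(mahalanobis(center, p2, VI), mahalanobis(center, p3, VI))  # p3 is much farther
```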
11 ways to determine data normality
Many ML models assume (or work better) under the presence of normal
distribution.
For instance:
Thus, being aware of the ways to test normality is extremely crucial for data
scientists. The visual below depicts the 11 essential ways to test normality.
1) Plotting methods
● Histogram
● KDE Plot
● Violin Plot
While plotting is often reliable, it is a subjective approach and prone to errors.
2) Statistical methods:
1) Shapiro-Wilk test:
● Finds a statistic using the correlation between the observed data and
the expected values under a normal distribution.
● The p-value indicates the likelihood of observing such a correlation
if the data were normally distributed.
● A high p-value indicates a normal distribution.
2) KS test:
● Measures the max difference between the cumulative distribution
functions (CDF) of observed and normal distribution.
● The output statistic is based on the max difference between the two
CDFs.
● A high p-value indicates a normal distribution.
3) Anderson-Darling test:
● Measures the differences between the observed data and the
expected values under a normal distribution.
● Emphasizes the differences in the tail of the distribution.
● This makes it particularly effective at detecting deviations in the
extreme values.
4) Lilliefors test:
3) Distance measures
Distance measures are another reliable and more intuitive way to test normality.
See, the problem is that a single distance value needs more context for
interpretability.
For instance, if the distance between two distributions is 5, is this large or small?
● Find the distance between the observed distribution and multiple reference
distributions.
● Select the reference distribution with the minimum distance to the
observed distribution.
1) Bhattacharyya distance:
3) KL Divergence:
● It is not entirely a "distance metric" per se, but can be used in this
case.
● Measure information lost when one distribution is approximated
using another distribution.
● The more information is lost, the more the KL Divergence.
● Choose the distribution that has the least KL divergence from the
observed distribution.
Probability vs. likelihood
In data science and statistics, many folks often use “probability” and “likelihood”
interchangeably.
While writing this chapter, I searched for their meaning in the Cambridge
Dictionary. Here’s what it says:
Anyway.
Let’s understand!
Likelihood, on the other hand, is about explaining events that have already
occurred.
Assume you have collected some 2D data and wish to fit a straight line with two
parameters — slope (m) and intercept (c).
Here, likelihood is defined as the support provided by a data point for some
particular parameter values in your model.
In maximum likelihood estimation, you have some observed data and you are
trying to determine the specific set of parameters (θ) that maximize the likelihood
of observing the data.
For instance:
To summarize…
In probability:
In likelihood:
11 key probability distributions in data science
Statistical models assume an underlying data generation process.
This is exactly what lets us formulate the generation process, using which we
define the maximum likelihood estimation (MLE) step.
Thus, when dealing with statistical models, the model performance becomes
entirely dependent on:
how well the distribution we assume aligns with the true data generation process, since that assumption is what the MLE step then optimizes.
The visual below depicts the 11 most important distributions in data science:
Normal distribution
Bernoulli distribution
Binomial distribution
Poisson distribution
Exponential distribution
Gamma distribution
Beta distribution
Uniform distribution
Log-normal distribution
Student’s t-distribution
Weibull distribution
● Models the waiting time for an event.
● O en employed to analyze time-to-failure data.
Common misinterpretation of continuous probability distributions
Consider the following probability density function of a continuous probability
distribution. Say it represents the time one may take to travel from point A to B.
Q) What is the probability that an individual will take exactly three minutes (x = 3) to reach point B?
● 1/3 (or 0.33)
● The area under the curve over the interval x = (…, …)
● The area under the curve over the interval x = (…, …)
And I intentionally kept only wrong answers here so that you never forget
something fundamentally important about continuous probability distributions.
● It should be defined for all real numbers (can be zero for some values).
Now, there are infinitely many possible values that a continuous random variable
may take.
Thus, answering our original question, the probability that one will take three
minutes to reach point B is ZERO.
More formally, the probability that a random variable will take values in the
interval [a, b] is:
From the above probability estimation over an interval, we can also verify that the
probability of obtaining a specific value is indeed zero.
By substituting a = b, we get:
● The probability density function does not depict the exact probability of
obtaining a specific value.
● Estimating the probability for a precise value of the random value makes
no sense because it is infinitesimally small.
Instead, we use the probability density function to calculate the probability over
an interval of values.
Feature Definition and Engineering
Introduction
11 types of variables in a dataset
In any tabular dataset, we typically categorize the columns as either a feature or a
target.
However, there are so many variables that one may find/define in their dataset,
which I want to discuss in this chapter. These are depicted in the image below:
1-2) Independent and dependent variables
These are the most common and fundamental to ML.
Independent variables are the features that are used as input to predict the
outcome. They are also referred to as predictors/features/explanatory variables.
The dependent variable is the outcome that is being predicted. It is also called
the target, response, or output variable.
3-4) Confounding and correlated variables
Confounding variables are typically found in a cause-and-effect study (causal
inference).
These variables are not of primary interest in the cause-and-effect equation but
can potentially lead to spurious associations.
To exemplify, say we want to measure the effect of ice cream sales on the sales of
air conditioners.
As you may have guessed, these two measurements are highly correlated.
● There is a high correlation between ice cream sales and sales of air
conditioners.
● But the sales of air conditioners (effect) are NOT caused by ice cream sales.
Also, in this case, the air conditioner and ice cream sales are correlated variables.
5) Control variables
In the above example, to measure the true effect of ice cream sales on air
conditioner sales, we must ensure that the temperature remains unchanged
throughout the study.
More formally, these are variables that are not the primary focus of the study but
are crucial to account for to ensure that the effect we intend to measure is not
biased or confounded by other factors.
6) Latent variables
A variable that is not directly observed but is inferred from other observed
variables.
For instance, we use clustering algorithms because the true labels do not exist,
and we want to infer them somehow.
7) Interaction variables
As the name suggests, these variables represent the interaction effect between
two or more variables, and are often used in regression analysis.
To summarize, the core idea is to study two or more variables together rather
than independently.
8-9) Stationary and non-stationary variables
The concept of stationarity often appears in time-series analysis.
On the flip side, if a variable’s statistical properties change over time, they are
called non-stationary variables.
That is why, typically, using direct values of the non-stationary feature (like the
absolute value of the stock price) is not recommended.
10) Lagged variables
Talking of time series, lagged variables are pretty commonly used in feature
engineering and data analytics.
As the name suggests, a lagged variable represents previous time points’ values of
a given variable, essentially shifting the data series by a specified number of
periods/rows.
For instance, when predicting next month’s sales figures, we might include the
sales figures from the previous month as a lagged variable.
11) Leaky variables
Yet again, as the name suggests, these variables (unintentionally) provide
information about the target variable that would not be available at the time of
prediction.
This leads to overly optimistic model performance during training but fails to
generalize to new data.
For example, consider a medical imaging dataset where each sample consists of multiple images (e.g., different views of the same
patient’s body part), and the model is intended to detect the severity of a disease.
In this case, randomly splitting the images into train and test sets will result in
data leakage.
This is because images of the same patient will end up in both the training and
test sets, allowing the model to “see” information from the same patient during
training and testing.
Here’s a paper that committed this mistake (and later corrected it):
To avoid this, a patient must only belong to the test or train/val set, not both.
Let’s get into more detail about the issue with random splitting below.
Cyclical feature encoding
In typical machine learning datasets, we mostly find features that progress
continuously from one value to another. For instance:
However, there is one more type of feature, which, in most cases, deserves special
feature engineering effort but is often overlooked. These are cyclical features, i.e.,
features with a recurring pattern (or cycle).
Unlike other features that progress continuously (or have no inherent order),
cyclical features exhibit periodic behavior and repeat after a specific interval. For
instance, the hour-of-the-day, the day-of-the-week, and the month-of-the-year are
all common examples of cyclical features. Talking specifically about, say, the
hour-of-the-day, its value can range between 0 to 23:
Moreover, the distance between “0” and “1” must be the same as the distance
between “23” and “0”.
However, standard representation does not fulfill these properties. Thus, the
value “23” is far from “0”. In fact, the distance property isn’t satisfied either.
Now, think about it for a second. Intuitively speaking, don’t you think this feature
deserves special feature engineering, i.e., one that preserves the inherent natural
property?
Cyclical feature encoding
One of the most common techniques to encode such a feature is using
trigonometric functions, specifically, sine and cosine. These are helpful because
sine and cosine are periodic, bounded, and defined for all real values.
Since sine and cosine are defined on a circle, the 24 hour values can be placed
uniformly around a circle whose full revolution corresponds to an angle of 2π.
The central angle (2π) represents 24 hours. Thus, the linear feature values can be
easily converted into cyclical features as follows:
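A minimal sketch of this transformation (assuming an hour-of-the-day column named "hour") is shown below:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": np.arange(24)})

# Map the 24-hour cycle onto the full circle (2*pi), then take sine and cosine
angle = 2 * np.pi * df["hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

# Hours 23 and 0 are now neighbours in the (sin, cos) plane, as they should be
print(df.loc[[0, 1, 23], ["hour", "hour_sin", "hour_cos"]])
```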
The benefit of doing this is how neatly the engineered feature satisfies the
properties we discussed earlier:
…or rather, I should say that the standard linear representation of the
hour-of-the-day feature results in an underutilization of information that the
model could otherwise benefit from. Had it been the day-of-the-week instead, the
central angle (2π) would have represented 7 days.
The same idea can be extended to all sorts of cyclical features you may find in
your dataset:
The point is that as you will inspect the dataset features, you will intuitively
know which features are cyclical and which are not.
Typically, the model will find it easier to interpret the engineered features and
utilize them in modeling the dataset accurately.
Feature discretization
During model development, one of the techniques that many don’t experiment
with is feature discretization. As the name suggests, the idea behind
discretization is to transform a continuous feature into discrete features.
Why, when, and how would you do that? Let’s understand in this chapter.
Motivation
My rationale for using feature discretization has almost always been simple: “It
just makes sense to discretize a feature.”
For instance, say we model this transaction dataset without discretization. This
would result in some coefficients for each feature, which would tell us the
influence of each feature on the final prediction.
But if you think again, in our goal of understanding spending behavior, are we
really interested in learning the correlation between exact age and spending
behavior?
It makes very little sense to do that. Instead, it makes more sense to learn the
correlation between different age groups and spending behavior.
As a result, discretizing the age feature can potentially unveil much more
valuable insights than using it as a raw feature.
Common techniques
Now that we understand the rationale, there are 2 techniques that are widely
preferred.
So, in a way, we get to use a simple linear model but still get to learn non-linear
patterns.
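As a minimal sketch, scikit-learn’s KBinsDiscretizer can bin a continuous column such as age into one-hot-encoded groups. The bin count and strategy below are arbitrary choices, not a recommendation:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=(500, 1)).astype(float)

binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
age_groups = binner.fit_transform(age)

print(binner.bin_edges_[0])   # the learned age-group boundaries
print(age_groups[:3])         # each row is now a one-hot "age group" vector
```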
Simply put, “signal” refers to the meaningful or valuable information in the data.
Binning a feature helps us mitigate the influence of minor fluctuations, which are
often mere noise.
Each bin acts as a means of “smoothing” out the noise within specific data
segments.
Of course, discretization also discards some information. To avoid this, don’t overly
discretize all features. Instead, use it when it makes intuitive sense, as we saw earlier.
Of course, its utility can vastly vary from one application to another, but at times,
I have found that:
Categorical data encoding techniques
Here are 7 ways to encode categorical features:
One-hot encoding
● Each category is represented by a binary vector of 0s and 1s.
● Each category gets its own binary feature, and only one of them is "hot"
(set to 1) at a time, indicating the presence of that category.
● Number of features = Number of unique categorical labels
Dummy encoding
● Same as one-hot encoding but with one additional step.
● After one-hot encoding, we drop one of the binary features.
● We do this to avoid the dummy variable trap (discussed in this chapter).
● Number of features = Number of unique categorical labels - 1
Effect encoding
● Similar to dummy encoding but with one additional step.
● Alter the row with all zeros to -1.
● This ensures that the resulting binary features represent not only the
presence or absence of specific categories but also the contrast between
the reference category and the absence of any category.
● Number of features = Number of unique categorical labels - 1.
Label encoding
● Assign each category a unique label.
● Label encoding introduces an inherent ordering between categories, which
may not be the case.
● Number of features = 1.
Ordinal encoding
● Similar to label encoding — assign a unique integer value to each category.
● The assigned values have an inherent order, meaning that one category is
considered greater or smaller than another.
● Number of features = 1.
Count encoding
● Also known as frequency encoding.
● Encodes categorical features based on the frequency of each category.
● Thus, instead of replacing the categories with numerical values or binary
representations, count encoding directly assigns each category with its
corresponding count.
● Number of features = 1.
Binary encoding
● Combination of one-hot encoding and ordinal encoding.
● It represents categories as binary code.
● Each category is first assigned an ordinal value, and then that value is
converted to binary code.
● The binary code is then split into separate binary features.
● Useful when dealing with high-cardinality categorical features (or a high
number of features) as it reduces the dimensionality compared to one-hot
encoding.
● Number of features ≈ log₂(n), where n is the number of unique categories.
While these are some of the most popular techniques, do note that these are not
the only techniques for encoding categorical data.
Shuffle feature importance
I often find “Shuffle Feature Importance” to be a handy and intuitive technique
to measure feature importance.
As the name suggests, it observes how shuffling a feature influences the model
performance. The visual below illustrates this technique in four simple steps:
Simply put, if we randomly shuffle just one feature and everything else stays the
same, then the performance drop will indicate how important that feature is.
● If the performance drop is low → This means the feature has a very low
influence on the model’s predictions.
● If the performance drop is high → This means that the feature has a very
high influence on the model’s predictions.
● It requires no repetitive model training. Just train the model once and
measure the feature importance.
● It is pretty simple to use and quite intuitive to interpret.
● This technique can be used for all ML models that can be evaluated.
However, there is one caveat: say two features are highly correlated, and one of them is permuted/shuffled.
In this case, the model will still have access to the feature through its correlated
feature.
One way to handle this is to cluster highly correlated features and only keep one
feature from each cluster.
The probe method for feature selection
Real-world ML development is all about achieving a sweet balance between
speed, model size, and performance.
One of the best ways to:
● improve speed,
● reduce size, and
● maintain (or minimally degrade) performance…
…is by using feature selection. The idea is to select the most useful subset of
features from the dataset.
While there are many methods for feature selection, I have often found the
“Probe Method” to be pretty reliable, practical and intuitive to use.
This can be especially useful in cases where we have plenty of features, and we
wish to discard those that don’t contribute to the model.
Of course, one shortcoming is that when using the Probe Method, we must train
multiple models:
1. Train the first model with the random feature and discard useless features.
2. Keep training new models until the random feature is ranked as the least
important feature (although, typically, it does not take many iterations to
converge).
3. Train the final model without the random feature.
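A minimal sketch of one round of the probe method is shown below; the probe is just a random noise column appended to the data, and everything else is standard scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1_000, n_features=8, n_informative=4, random_state=0)

rng = np.random.default_rng(0)
X_probe = np.column_stack([X, rng.normal(size=len(X))])   # append the random probe

model = RandomForestRegressor(random_state=0).fit(X_probe, y)
importances = model.feature_importances_

keep = [i for i in range(X.shape[1]) if importances[i] > importances[-1]]
print("features ranked above the random probe:", keep)
# Drop the rest, retrain, and repeat until the probe ranks last;
# then train the final model without the probe.
```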
Regression
Where squared error (MSE) comes from
But have you ever wondered why we specifically use the squared error?
See, many functions can potentially minimize the difference between observed
and predicted values. But of all the possible choices, what is so special about the
squared error?
Here, epsilon is an error term that captures the random noise for a specific data
point (i).
We assume the noise is drawn from a Gaussian distribution with zero mean
based on the central limit theorem:
Thus, the probability of observing the error term can be written as:
Substituting the error term from the linear regression equation, we get:
For a specific set of parameters θ, the above tells us the probability of observing a
data point (i).
We further write it as a product for individual data points because we assume all
observations are independent.
Thus, we get:
Since the log function is monotonic, we use the log-likelihood and maximize it.
This is called maximum likelihood estimation (MLE).
Simplifying, we get:
To reiterate, the objective is to find the θ that maximizes the above expression.
But the first term is independent of θ. Thus, maximizing the above expression is
equivalent to minimizing the second term. And if you notice closely, it’s precisely
the squared error.
Thus, you can maximize the log-likelihood by minimizing the squared error. And
this is the origin of least-squares in linear regression. See, there’s clear proof and
reasoning behind using squared error as a loss function in linear regression.
Sklearn linear regression has no hyperparameters
Almost all ML models we work with have some hyperparameters, such as:
● Learning rate
● Regularization
● Layer size (for neural network), etc.
But as shown in the image below, why don’t we see any hyperparameter in
Sklearn’s Linear Regression implementation?
How does OLS work?
But because X might be a non-square matrix, its inverse may not be defined.
To resolve this, first, we multiply with the transpose of X on both sides, as shown
below:
Next, we take the collective inverse of the product to get the following:
● No hyperparameters.
● No randomness. Thus, it will always return the same solution, which is also
optimal.
Of course, do note that there is a significant tradeoff between run time and
convenience when using OLS vs. gradient descent.
Thus, when we have many features, it may not be a good idea to use the
LinearRegression() class. Instead, use the SGDRegressor() class from Sklearn.
Thus, when we use OLS, we trade run-time for finding an optimal solution
without hyperparameter tuning.
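To see both sides of the trade-off, here is a minimal sketch that computes the closed-form OLS solution directly with NumPy and confirms that sklearn’s LinearRegression lands on the same answer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 0.1, 200)

# Closed-form OLS: theta = (X^T X)^-1 X^T y (with an explicit intercept column)
Xb = np.column_stack([np.ones(len(X)), X])
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

sk = LinearRegression().fit(X, y)
print(theta)                       # [intercept, coefficients...]
print(sk.intercept_, sk.coef_)     # the same solution, no hyperparameters involved
```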
Poisson regression vs. linear regression
Linear regression comes with its own set of challenges/assumptions. For
instance, after modeling, the output can be negative for some inputs.
But this may not make sense at times — predicting the number of goals scored,
number of calls received, etc. Thus, it is clear that it cannot model count (or
discrete) data.
For instance:
Thus, if the above assumptions do not hold, linear regression won’t help.
Instead, in this specific case, what you may need is Poisson regression.
It is a type of generalized linear model (GLM) that is used to model count data. It
works by estimating a Poisson distribution parameter (λ), which is directly linked
to the expected number of events in a given interval.
For instance:
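As a concrete (and entirely hypothetical) illustration, here is a minimal sketch of fitting a Poisson GLM with statsmodels on simulated count data; the coefficients 0.4 and 1.1 are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
lam = np.exp(0.4 + 1.1 * x)              # log link: log(lambda) = 0.4 + 1.1 * x
y = rng.poisson(lam)                     # count-valued target

X = sm.add_constant(x)
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print(poisson_model.params)              # roughly [0.4, 1.1]
print(poisson_model.predict(X[:3]))      # expected counts, never negative
```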
How to build linear models
In this chapter, I will help you cultivate what I think is one of the MOST
overlooked and underappreciated skills in developing linear models.
I can guarantee that harnessing this skill will give you a lot of clarity and
intuition in the modeling stages.
It’s just that, in our specific use case, the data generation process didn’t perfectly
align with what linear regression is designed to handle. In other words, earlier
when we trained a linear regression model, we inherently assumed that the data
was sampled from a normal distribution. But that was not true in this Poisson
regression case.
You’d start appreciating the importance of data generation when you realize that
literally every member of the generalized linear model family stems from altering
the data generation process.
For instance:
See…
Every linear model makes an assumption and is then derived from an underlying
data generation process.
Thus, developing a habit of holding for a second and thinking about the data
generation process will give you so much clarity in the modeling stages.
I am confident this will help you get rid of that annoying and helpless habit of
relentlessly using a specific sklearn algorithm without truly knowing why you are
using it.
Consequently, you’d know which algorithm to use and, most importantly, why.
This improves your credibility as a data scientist and allows you to approach data
science problems with intuition and clarity rather than hit-and-trial.
In fact, once you understand the data generation process, you will automatically
get to know about most of the assumptions of that specific linear model.
Dummy variable trap
This is often called the Dummy Variable Trap. It is bad because the model has
redundant features. Moreover, the regression coefficients aren’t reliable in the
presence of multicollinearity.
Visually assess linear regression performance
Linear regression assumes that the model residuals (actual - predicted) are
normally distributed. If the model is underperforming, it may be due to a
violation of this assumption.
Here, I often use a residual distribution plot to verify this and determine the
model’s performance. As the name suggests, this plot depicts the distribution of
residuals (actual - predicted), as shown below:
Thus, the more normally distributed the residual plot looks, the more confident
we can be about our model. This is especially useful when the regression line is
difficult to visualize, i.e., in a high-dimensional dataset.
Why?
Of course, this was just about validating one assumption — the normality of
residuals.
Statsmodel regression summary
Statsmodel provides one of the most comprehensive summaries for regression
analysis.
Yet, I have seen so many people struggling to interpret the critical model details
mentioned in this report. In this chapter, let’s understand the entire summary
report provided by statsmodel and why it is so important.
Section 1
The first column of the first section lists the model’s settings (or config). This
part has nothing to do with the model’s performance.
If your data has categorical features, statsmodel will one-hot encode them. But in
that process, it will drop one of the one-hot encoded features.
This is done to avoid the dummy variable trap, which we discussed in an earlier
chapter (this chapter).
○ For instance, in this case, 0.927 means that the current model
captures 92.7% of the original variability in the training data.
○ Statsmodel reports R² on the training data, so you must not overly
optimize for it. If you do, it will lead to overfitting.
● Adj. R-squared:
○ This adjusts R² for the number of features, so adding features that do
not genuinely improve the fit is penalized.
● Log-Likelihood:
○ This tells us the log-likelihood that the given data was generated by
the estimated model.
○ The higher the value, the more likely the data was generated by this
model.
● AIC and BIC:
○ Like adjusted R-squared, these are performance metrics to
determine goodness of fit while penalizing complexity.
○ Lower AIC and BIC values indicate a better fit.
Section 2
The second section provides details related to the features:
● t and P>|t|:
○ Earlier, we used F-statistic to determine the statistical significance
of the model as a whole.
○ t-statistic is more granular on that front as it determines the
significance of every individual feature.
○ P>|t| is the associated p-value with the t-statistic.
○ A small p-value (typically less than 0.05) indicates that the feature is
statistically significant.
○ For instance, the feature “X” has a p-value of ~0.6. This means that if “X”
truly had no effect on “Y”, we would still see an estimate this large about 60% of the time, so there is no evidence that “X” is significant.
○ See, the coefficients we have obtained from the model are just
estimates. They may not be absolute true coefficients of the process
that generated the data.
○ Thus, the estimated parameters are subject to uncertainty, aren’t
they?
○ Note: The interval between the [0.025, 0.975] quantiles covers 95% of the
distribution, roughly the area within two standard deviations (about ±1.96)
of the mean of a normal distribution.
○ For instance, the interval for sin_X is (0.092, 6.104). So although the
estimated coefficient is 3.09, we can be 95% confident that the true
coefficient lies in the range (0.092, 6.104).
Section 3
Details in the last section of the report test the assumptions of linear regression.
● Durbin-Watson:
○ This measures autocorrelation between residuals.
○ Autocorrelation occurs when the residuals are correlated, indicating
that the error terms are not independent.
○ But linear regression assumes that residuals are not correlated.
○ The Durbin-Watson statistic ranges between 0 and 4.
■ A value close to 2 indicates no autocorrelation.
■ Values closer to 0 indicate positive autocorrelation.
■ Values closer to 4 indicate negative autocorrelation.
● Jarque-Bera (JB) and Prob(JB):
○ They solve the same purpose as Omnibus and Prob(Omnibus) —
measuring the normality of residuals.
● Condition Number:
○ This tests multicollinearity.
○ Multicollinearity occurs when two features are correlated, or two or
more features determine the value of another feature.
○ A standalone value for Condition Number can be difficult to
interpret so here’s how I use it:
■ Add features one by one to the regression model and notice
any spikes in the Condition Number.
● The first section tells us about the model’s config, the overall performance
of the model, and its statistical significance.
● The second section tells us about the statistical significance of individual
features, the model’s confidence in finding the true coefficient, etc.
● The last section lets us validate the model’s assumptions, which are
immensely critical to linear regression’s performance.
Now you know how to interpret the entire regression summary from statsmodel.
Generalized linear models (GLMs)
A linear regression model is undeniably an extremely powerful model, in my
opinion. However, it makes some strict assumptions about the type of data it can
model, as depicted below.
We need models that relax these assumptions, and generalized linear models (GLMs)
precisely do that. They relax the assumptions of linear regression to make linear
models more adaptable to real-world datasets.
Why GLMs?
Linear regression is pretty restricted in terms of the kind of data it can model.
For instance, its assumed data generation process looks like this:
Zero-inflated regression
The target variable of typical regression datasets is somewhat evenly distributed.
But, at times, the target variable may have plenty of zeros. Such datasets are
called zero-inflated datasets.
They may raise many problems during regression modeling. This is because a
regression model can not always predict exact “zero” values when, ideally, it
should. For instance, consider simple linear regression. The regression line will
output exactly “zero” only once (if it has a non-zero slope).
This issue persists not only in higher dimensions but also in complex models like
neural nets for regression.
● First, train a binary classifier to predict whether the true target is zero or non-zero.
● Next, train a regression model only on those data points with a non-zero
true target.
During prediction:
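Putting the two stages together, a minimal sketch might look as follows. The prediction rule used here (predict zero unless the classifier says otherwise, then fall back to the regressor) is my reading of the approach described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))
y = np.where(rng.random(2_000) < 0.6, 0.0,
             np.abs(3 * X[:, 0] + rng.normal(0, 0.5, 2_000)))   # zero-inflated target

clf = GradientBoostingClassifier().fit(X, (y > 0).astype(int))  # zero vs. non-zero
reg = GradientBoostingRegressor().fit(X[y > 0], y[y > 0])       # non-zero rows only

is_nonzero = clf.predict(X).astype(bool)
y_pred = np.where(is_nonzero, reg.predict(X), 0.0)
print(y_pred[:10])
```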
Its effectiveness over the regular regression model is evident from the image
below:
Huber regression
One big problem with regression models is that they are sensitive to outliers.
Consider linear regression. Even a few outliers can significantly impact Linear
Regression performance, as shown below:
And it isn’t hard to identify the cause of this problem. Essentially, the loss
function (MSE) scales quickly with the residual term (true-predicted).
Thus, even a few data points with a large residual can impact parameter
estimation.
Huber loss (or Huber Regression) precisely addresses this problem. In a gist, it
attempts to reduce the error contribution of data points with large residuals.
One simple, intuitive, and obvious way to do this is by applying a threshold (δ) on
the residual term:
● If the residual is smaller than the threshold, use MSE (no change here).
● Otherwise, use a loss function which has a smaller output than MSE —
linear, for instance.
● For residuals smaller than the threshold (δ) → we use MSE.
● Otherwise, we use a linear loss function, which has a smaller output than MSE.
Its effectiveness is evident from the following image:
● Linear regression is affected by outliers.
● Huber regression is more robust.
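A minimal sketch with scikit-learn’s HuberRegressor (its epsilon parameter plays the role of the threshold discussed here, applied to scaled residuals):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 200)
y[:10] += 60                                     # inject a few large outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)

print(ols.coef_, huber.coef_)   # Huber's slope stays much closer to the true value (3)
```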
How do we determine the threshold (δ)?
While trial and error is one way, I often like to create a residual plot, as depicted
below. This plot is generally called a lollipop plot because of its
appearance.
One good thing is that we can create this plot for any dimensional dataset. The
objective is just to plot (true-predicted) values, which will always be 1D.
Here’s another interesting idea.
By using a linear loss function in Huber regressor, we intended to reduce the
large error contributions that would have happened otherwise by using MSE.
Thus, we can further reduce that error contribution by using, say, a square root
loss function, as shown below:
It is clear that the error contribution of the square root loss function is the lowest
for all residuals above the threshold δ.
Decision Trees and Ensemble Methods
Condense a random forest into a decision tree
There’s an interesting technique, using which, we can condense an entire random
forest model into a single decision tree.
The benefits?
Technique walkthrough
Let’s fit a decision tree model on the following dummy dataset. It produces a
decision region plot shown on the right.
In fact, we must note that, by default, a decision tree can always 100% overfit any
dataset (we will use this information shortly). This is because it is always allowed
to grow until all samples have been classified correctly.
Next, let’s fit a random forest model instead. This time, the decision region plot
suggests that we don’t have a complex decision boundary. The test accuracy has
also improved (69.5% to 74%).
We know that the random forest model has learned some rules that generalize on
unseen data.
So, how about we train a decision tree on the predictions generated by the
random forest model on the training set?
● Train a random forest model. This will learn some rules from the training
set which are expected to generalize on unseen data (due to Bagging).
● Generate predictions on X, which produces the output y'. These
predictions will capture the essence of the rules learned by the random
forest model.
● Finally, train a decision tree model on (X, y'). Here, we want to
intentionally overfit this mapping as this mapping from (X) to (y') is a proxy
for the rules learned by the random forest model.
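A minimal sketch of these three steps on a synthetic dataset (the dataset and model settings are placeholders, not the ones used in the book):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_proxy = rf.predict(X_tr)                       # the forest's rules, as labels

dt = DecisionTreeClassifier(random_state=0).fit(X_tr, y_proxy)   # overfit on purpose

print("random forest test accuracy:", rf.score(X_te, y_te))
print("condensed tree test accuracy:", dt.score(X_te, y_te))
```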
The decision region plot we get with the new decision tree is pretty similar to
what we saw with the random forest earlier:
Measuring the test accuracy of the decision tree and random forest model, we
notice them to be similar too:
In fact, this approach also significantly reduces the run-time, as depicted below:
On top of that, a single condensed tree is far easier to interpret: if we have 100
trees in a random forest, there’s no way we can interpret them all.
A departing note
I devised this very recently. I also tested this approach on a couple of datasets,
and they produced promising results.
But it won’t be fair to make any conclusions based on just two instances.
While the idea makes intuitive sense, I understand there could be some potential
flaws that are not evident right now.
So, I am not saying that you should adopt this technique right away.
Instead, I would advise testing this approach on your random forest use cases.
Consider reverting to me with what you discover.
In the next chapter, let’s understand a technique to transform a decision tree into
matrix operations, which can run on GPUs.
Transform a decision tree into matrix operations
Inference using a decision tree is an iterative process. We traverse a decision tree
by evaluating the condition at a specific node in a layer until we reach a leaf
node.
In this chapter, let’s learn a superb technique to represent inference from a
decision tree in the form of matrix operations.
As a result:
Setup
Consider a binary classification dataset with 5 features.
Let’s say we get the following tree structure after fitting a decision tree on the
above dataset:
Notation
Before proceeding ahead, let’s assume that:
Tree to matrices
Note: The matrices we define below may look a bit arbitrary at first. Follow along
patiently; everything will fall into place once we run the inference step at the end.
Derive all matrices
1) Matrix A
A maps input features to evaluation nodes, so it’s a matrix of shape
(number of features × e), where e is the number of evaluation nodes.
A specific entry is set to “1” if the corresponding node in the column evaluates
the corresponding feature in the row. For instance, in our decision tree, “Node 0”
evaluates “Feature 2”.
Thus, the corresponding entry will be “1” and all other entries will be “0.”
2) Matrix B
The entries of matrix B are the threshold values at each evaluation node. Thus, its shape is
1×e.
3) Matrix C
This is a matrix between every pair of leaf nodes and evaluation nodes, with
evaluation nodes along the rows and leaf nodes along the columns. Thus, its
dimensions are (e × number of leaf nodes).
● “1” if the corresponding leaf node in the column lies in the left sub-tree of
the corresponding evaluation node in the row.
● “-1” if the corresponding leaf node in the column lies in the right sub-tree
of the corresponding evaluation node in the row.
● “0” if the corresponding leaf node and evaluation node have no link.
For instance, in our decision tree, the “leaf node 4” lies in the left sub-tree of
both “evaluation node 0” and “evaluation node 1”. Thus, the corresponding values
will be 1.
4) Matrix (or vector) D
The entries of vector D are the sum of non-negative entries in every column of
Matrix C:
5) Matrix E
If a leaf node classifies a sample to “Class 1”, the corresponding entry will be 1,
and the other cell entry will be 0.
For instance, “leaf node 4” outputs “Class 1”, thus the corresponding entries for
the first row will be (1, 0):
We repeat this for all other leaf nodes to get the following matrix as Matrix E:
With this, we have compiled our decision tree into matrices. To recall, these are
the five matrices we have created so far:
● Matrix A captures which input feature was used at each evaluation node.
● Matrix B stores the threshold of each evaluation node.
● Matrix C captures whether a leaf node lies in the left or right sub-tree of a
specific evaluation node or has no relation to it.
● Matrix D stores the sum of non-negative entries in every column of Matrix
C.
● Finally, Matrix E maps from leaf nodes to their class labels.
Inference using matrices
Say this is our input feature vector X (5 dimensions):
The whole inference can now be done using just these matrix operations:
● XA < B gives:
The final prediction comes out to be “Class 1,” which is indeed correct! Notice
that we carried out the entire inference process using only matrix operations:
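To make the pipeline fully concrete, here is a minimal NumPy sketch on a tiny hypothetical tree (two evaluation nodes, three leaves, five features). It is not the exact tree used above, but it follows the same A, B, C, D, E construction:

```python
import numpy as np

# Hypothetical tree: node 0 tests feature 2 < 0.5 (left -> node 1, right -> leaf 2)
#                    node 1 tests feature 0 < 1.0 (left -> leaf 0, right -> leaf 1)
A = np.zeros((5, 2)); A[2, 0] = 1; A[0, 1] = 1   # feature evaluated at each node
B = np.array([0.5, 1.0])                         # thresholds of the two nodes
C = np.array([[ 1,  1, -1],                      # +1: leaf in left subtree, -1: right
              [ 1, -1,  0]])
D = (C == 1).sum(axis=0)                         # count of +1 entries per leaf
E = np.array([[0, 1],                            # leaf 0 -> class 1
              [1, 0],                            # leaf 1 -> class 0
              [1, 0]])                           # leaf 2 -> class 0

x = np.array([[0.2, 0.0, 0.3, 0.0, 0.0]])        # one input sample
Q = (x @ A < B).astype(int)                      # which split conditions hold
leaf = (Q @ C == D).astype(int)                  # exactly one leaf matches its path
print(leaf @ E)                                  # [[0, 1]] -> class 1
```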
As a result, the inference operation can largely benefit from parallelization and
GPU capabilities.
The run-time efficacy of this technique is evident from the image below:
Interactively prune a decision tree
One thing I really like about decision trees is that their learned rules are easy to
visualize and interpret. This is not always possible with other intuitive and simple
models like linear regression. But decision trees stand out in this respect.
Nonetheless, one thing I often find a bit time-consuming and somewhat
hit-and-trial-driven is pruning a decision tree.
Why prune?
The problem is that, under default conditions, decision trees ALWAYS 100%
overfit the dataset, as depicted in this image:
But the above visualisation is pretty non-elegant, tedious, messy, and static (or
non-interactive). I recommend using an interactive Sankey diagram to prune
decision trees. This is depicted below:
This instantly gives an estimate of the node’s impurity, based on which, we can
visually and interactively prune the tree in seconds. For instance, in the full
decision tree shown below, pruning the tree at a depth of two appears reasonable:
You can download the code notebook for the interactive decision tree here:
https://bit.ly/4bBwY1p. Instructions are available in the notebook.
Why decision trees must be inspected thoroughly after training
Every split in a decision tree is of the form (feature ≤ threshold), i.e., a boundary
perpendicular to one of the feature axes. In other words, every decision tree
progressively segregates the feature space using such perpendicular boundaries
to split the data.
In fact, if we plot this decision tree, we notice that it creates so many splits just to
fit this easily separable dataset, which a model like logistic regression, support
vector machine (SVM), or even a small neural network can easily handle:
It becomes more evident if we zoom into this decision tree and notice how close
the thresholds of its split conditions are:
This is a bit concerning because it clearly shows that the decision tree is
meticulously trying to mimic a diagonal decision boundary, which hints that it
might not be the best model to proceed with. To double-check this, I often do the
following:
For instance, the PCA projections on the above dataset are shown below:
This lets us determine that we might be better off using some other algorithm
instead.
Or, we can spend some time engineering better features that the decision tree
model can easily work with using its perpendicular data splits.
At this point, if you are thinking, why can’t we use the decision tree trained on the
PCA projections (X_pca)?
While nothing stops us from doing that, do note that PCA components are not
interpretable, and maintaining feature interpretability can be important at times.
Thus, whenever you train your next decision tree model, consider spending some
time inspecting what it’s doing.
A learning worth noting
My intention is not to point out flaws of decision trees here, or to suggest that
building models like random forests on top of them is a bad idea.
My point is to encourage you to watch out for the behaviour and split conditions
of decision trees, and to know when their axis-aligned splits do not model your
data well.
Decision trees ALWAYS overfit!
In addition to the above inspection, there’s one more thing you need to be careful
of when using decision trees. This is about overfitting.
The thing is that, by default, a decision tree (in sklearn’s implementation, for
instance), is allowed to grow until all leaves are pure. This happens because a
standard decision tree algorithm greedily selects the best split at each node.
This makes its nodes more and more pure as we traverse down the tree. As the
model correctly classifies ALL training instances, it leads to 100% overfitting, and
poor generalization.
Fitting a decision tree on this dataset gives us the following decision region plot:
It is pretty evident from the decision region plot, the training and test accuracy
that the model has entirely overfitted our dataset.
One way to address this is cost-complexity pruning (CCP). As depicted above, CCP
results in a much simpler and more acceptable decision region plot.
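A minimal sketch of applying cost-complexity pruning in scikit-learn is shown below. The dataset is synthetic, and the choice of alpha (the midpoint of the pruning path) is just a placeholder; in practice you would tune it with cross-validation:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1_000, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(full_tree.score(X_tr, y_tr), full_tree.score(X_te, y_te))   # ~1.0 on train

alphas = full_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2])
pruned.fit(X_tr, y_tr)
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))   # simpler, generalizes better
```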