Data Science Guide
2024 EDITION
DATA SCIENCE FULL ARCHIVE
Most ML models are trained independently without any interaction with other
models. However, in the realm of real-world ML, there are many powerful
learning techniques that rely on model interactions to improve performance.
The following image summarizes four such well-adopted and must-know training methodologies:
1) Transfer Learning
By training a model on the related task first, we can capture the core patterns of
the task of interest. Later, we can adjust the last few layers to capture
task-specific behavior.
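As a rough illustration (a minimal PyTorch sketch, not the book's code; the torchvision backbone, layer names, and class count are assumptions), we can freeze a pretrained network and swap only its head:

import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on a related task (here, ImageNet classification).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained layers so they retain the core patterns already learned.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the last layer with a fresh head for the task-specific behavior.
num_classes = 10  # assumed number of classes in the new task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are updated during training.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)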
2) Fine-tuning
Fine-tuning involves updating the weights of some or all layers of the pre-trained
model to adapt it to the new task.
The idea may appear similar to transfer learning, but in fine-tuning, we typically
do not replace the last few layers of the pre-trained network.
3) Multi-task Learning
The model shares knowledge across tasks, aiming to improve generalization and
performance on each task.
It can help in scenarios where tasks are related, or they can benefit from shared
representations.
In fact, the motive for multi-task learning is not just to improve generalization.
We can also save compute power during training by having a shared layer and
task-specific segments.
4) Federated Learning
Let’s discuss it in the next chapter.
An Introduction to Federated Learning
In my opinion, federated learning is among those very powerful ML techniques
that is not given the true attention it deserves.
Let’s understand this topic in this chapter and why I consider this to be an
immensely valuable skill to have.
The Problem
Modern devices (like smartphones) have access to a wealth of data that can be
suitable for ML models.
To get some perspective, consider the number of images you have on your phone
right now, the number of keystrokes you press daily, etc.
But applications can have millions of users. The amount of data we can train ML
models on is unfathomable.
The problem is that almost all data available on modern devices is private.
The Solution
Federated learning smartly addresses this challenge of training ML models on
private data.
Send a global model to the user’s device, train a model on private data, and
retrieve it back.
As a result, the central server does not need the enormous computing that it
would have demanded otherwise.
Building a Multi-task Learning (MTL) Model
Most ML models are trained on one task. As a result, many struggle to intuitively
understand how a model can be trained on multiple tasks simultaneously.
To reiterate, in MTL, the network has a few shared layers and task-specific
segments. During backpropagation, gradients are accumulated from all branches,
as depicted in the animation below:
Consider we want our model to take a real value (x) as input and generate two
outputs:
● sin(x)
● cos(x)
● We have some fully connected layers in self.model → These are the shared
layers.
● Furthermore, we have the output-specific layers to predict sin(x) and cos(x), as sketched below.
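A minimal sketch of such a network (layer sizes and attribute names are assumptions; the book's notebook may differ):

import torch
import torch.nn as nn

class MTLModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared layers: a common representation used by both tasks.
        self.model = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
        )
        # Task-specific heads: one output for sin(x) and one for cos(x).
        self.sin_head = nn.Linear(32, 1)
        self.cos_head = nn.Linear(32, 1)

    def forward(self, x):
        shared = self.model(x)
        return self.sin_head(shared), self.cos_head(shared)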
We are almost done. The final part of this implementation is to train the model.
Let’s use mean squared error as the loss function. The training loop is
implemented below:
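Continuing the sketch above (hyperparameters are placeholders), the two MSE losses are summed so that gradients from both branches accumulate into the shared layers:

model = MTLModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.linspace(-3, 3, 512).unsqueeze(1)   # inputs
y_sin, y_cos = torch.sin(x), torch.cos(x)     # targets for the two tasks

for epoch in range(200):
    pred_sin, pred_cos = model(x)
    # Total loss = sum of the task-specific losses; backpropagation sends
    # gradients from both heads into the shared layers.
    loss = loss_fn(pred_sin, y_sin) + loss_fn(pred_cos, y_cos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()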
With this, we have trained our MTL model. Also, we get a decreasing loss, which
depicts that the model is being trained.
And that’s how we train an MTL model. You can extend the same idea to build
any MTL model of your choice.
Do remember that building an MTL model on unrelated tasks will not produce
good results.
Or…
At times, I also use dynamic task weights, which could be inversely proportional
to the validation accuracy achieved on that task.
My rationale behind this technique is that in an MTL setting, some tasks can be
easy while others can be difficult.
If the model achieves high accuracy on one task during training, we can safely
reduce its loss contribution so that the model focuses more on the second task.
You can download the notebook for this chapter here: https://bit.ly/3ztY5hy.
Active Learning
There’s not much we can do to build a supervised system when the data we begin
with is unlabeled.
Using unsupervised techniques (if they fit the task) can be a solution, but unsupervised systems rarely match the performance of supervised ones.
Self-supervised learning is when we have an unlabeled dataset (say text data), but
we somehow figure out a way to build a supervised learning model out of it.
In a nutshell, its core objective is to predict the next token based on previously
predicted tokens (or the given context).
The model is only supposed to learn the mapping from previous tokens to the
next token.
At this stage, the only option one typically notices is annotating the dataset. However, data annotation is difficult, expensive, time-consuming, and tedious.
As the name suggests, the idea is to build the model with active human feedback
on examples it is struggling with. The visual below summarizes this:
While there’s no rule on how much data should be labeled, I have used active
learning (successfully) while labeling as low as ~1% of the dataset, so try
something in that range.
Of course, this won’t be a perfect model, but that’s okay. Next, generate
predictions on the dataset we did not label:
Since many of these predictions may be unreliable, we need to be a bit selective with the type of model we choose.
More specifically, we need a model that, either implicitly or explicitly, can also
provide a confidence level with its predictions.
Probabilistic models (ones that provide a probabilistic estimate for each class) are
typically a good fit here.
This is because one can determine a proxy for confidence level from probabilistic
outputs.
In the above two examples, consider the gap between 1st and 2nd highest
probabilities:
● In example #1, the gap is large. This can indicate that the model is quite
confident in its prediction.
● In example #2, the gap is small. This can indicate that the model is NOT
quite confident in its prediction.
Now, go back to the predictions generated above and rank them in order of
confidence:
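A minimal sketch of this ranking step (assuming a scikit-learn-style model with predict_proba; names like X_unlabeled are placeholders), using the gap between the top two probabilities as the confidence proxy:

import numpy as np

probs = model.predict_proba(X_unlabeled)      # shape: (n_samples, n_classes)

# Confidence proxy: gap between the highest and second-highest probabilities.
top2 = np.sort(probs, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]

# Smallest margins = least confident predictions = best candidates
# to send for human annotation in the next round.
least_confident_idx = np.argsort(margin)[:100]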
● The model is already quite confident with the first two instances. There’s
no point checking those.
● Instead, it would be best if we (the human) annotate the instances with
which it is least confident.
To get some more perspective, consider the image below. Logically speaking,
which data point’s human label will provide more information to the model? I
know you already know the answer.
Thus, in the next step, we provide our human labels for the low-confidence predictions and feed them back to the model along with the previously labeled dataset:
Repeat this a few times and stop when you are satisfied with the performance.
The only thing that you have to be careful about is generating confidence
measures.
If you mess this up, it will affect every subsequent training step.
While combining the low-confidence data with the seed data, we can also use the
high-confidence data. The labels would be the model’s predictions.
Faster Model Training with Momentum-based Optimization
And, of course, there are various ways to speed up model training, like:
● Batch processing
● Leverage distributed training using frameworks like PySpark MLLib.
● Use better Hyperparameter Optimization, like Bayesian Optimization,
which we will discuss in this chapter.
● and many other techniques.
Issues with Gradient Descent
In gradient descent, every parameter update solely depends on the current
gradient. This is clear from the gradient weight update rule shown below:
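In standard notation (mine, not necessarily the book's figure), with w as the parameters, η the learning rate, and L the loss, the rule is:

w_{t+1} = w_t - \eta \, \nabla_w L(w_t)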
Imagine this is the loss function contour plot, and the optimal location
(parameter configuration where the loss function is minimum) is marked here:
Simply put, this plot illustrates how gradient descent moves towards the optimal
solution. At each iteration, the algorithm calculates the gradient of the loss
function at the current parameter values and updates the weights.
How Does Momentum Solve the Problem?
Momentum-based optimization slightly modifies the update rule of gradient
descent. More specifically, it also considers a moving average of past gradients:
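One common formulation (notation mine) keeps an exponentially decaying moving average of past gradients and steps along it:

v_t = \beta \, v_{t-1} + (1 - \beta) \, \nabla_w L(w_t), \qquad w_{t+1} = w_t - \eta \, v_t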
How?
As this moving average gets added to the gradient updates, it helps the
optimization algorithm take larger steps in the desired direction.
This time, the gradient update trajectory shows much smaller oscillations in the
vertical direction, and it also manages to reach an optimum under the same
number of epochs as earlier.
If you want to have a more hands-on experience, check out this tool:
https://bit.ly/4cOrJN1.
Mixed Precision Training
Motive
Typical deep learning libraries are really conservative when it comes to assigning
data types.
The data type assigned by default is usually 64-bit or 32-bit, when there is also
scope for 16-bit, for instance. This is also evident from the code below:
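A quick check along those lines (a minimal sketch):

import numpy as np
import torch

print(np.array([1.0, 2.0]).dtype)        # float64: NumPy defaults to 64-bit
print(torch.tensor([1.0, 2.0]).dtype)    # torch.float32: PyTorch defaults to 32-bit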
Moreover, since float16 is only half the size of float32, its usage reduces the
memory required to train the network. This also allows us to train larger models,
train on larger mini-batches (resulting in even more speedup), etc.
Mixed precision training is a pretty reliable and widely adopted technique in the
industry to achieve this.
This is a list of some models I found that were trained using mixed precision:
It’s pretty clear that mixed precision training is quite popularly used, but we don’t get to hear about it often.
The Caveat of Training with Half Precision…
From the above discussion, it must be clear that as we use a low-precision data
type (float16), we might unknowingly introduce some numerical inconsistencies
and inaccuracies.
To avoid them, there are some best practices for mixed precision training that I
want to talk about next, along with the code.
Mixed Precision Training in PyTorch
Leveraging mixed precision training in PyTorch requires a few modifications in
the existing network training implementation. Consider this is our current
PyTorch model training implementation:
The first thing we introduce here is a scaler object that will scale the loss value:
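With PyTorch's automatic mixed precision utilities, that scaler can be created as follows (a sketch):

from torch.cuda.amp import GradScaler

scaler = GradScaler()   # dynamically scales the loss to avoid float16 underflow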
We do this because, at times, the original loss value can be so low, that we might
not be able to compute gradients in float16 with full precision. Such situations
may not produce any update to the model’s weights.
Scaling the loss to a higher numerical range ensures that even small gradients
can contribute to the weight updates.
But these minute gradients can only be accommodated into the weight matrix
when the weight matrix itself is represented in high precision, i.e., float32. Thus,
as a conservative measure, we tend to keep the weights in float32.
That said, the loss scaling step is not entirely necessary because, in my
experience, these little updates typically appear towards the end stages of the
model training. Thus, it can be fair to assume that small updates may not
drastically impact the model performance. But don’t take this as a definite conclusion; it’s something I want you to validate when you use mixed precision training.
The mixed-precision settings in the forward pass are carried out by the torch.autocast context manager:
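A rough sketch of one training step (model, loss_fn, optimizer, and train_loader are placeholders; scaler is the object created earlier):

from torch.cuda.amp import autocast

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    # The forward pass runs the eligible ops in float16 under autocast.
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Scale the loss, backpropagate, then unscale and apply the fp32 update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()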
Done!
The efficacy of mixed precision scaling over traditional training is evident from
the image below:
Gradient Checkpointing
Neural networks primarily utilize memory in two ways: storing the model parameters (along with their gradients and optimizer states) and storing the activations computed during the forward pass. Activations are the bigger concern here because their memory utilization scales proportionately with the batch size.
That said, there’s a pretty incredible technique that lets us increase the batch size
while maintaining the overall memory utilization.
How Gradient Checkpointing Works
Gradient checkpointing is based on two key observations about how neural networks typically work: the activations computed during the forward pass are needed again during the backward pass, and those activations can always be recomputed from earlier activations if we discard them. The procedure follows directly from these observations:
● Step 1) Divide the network into segments before the forward pass:
● Step 2) During the forward pass, only store the activations of the first layer
in each segment. Discard the rest when they have been used to compute
the activations of the next layer.
Done!
To summarize, the idea is that we don’t need to store all the intermediate
activations in memory. Instead, storing a few of them and recomputing the rest
only when they are needed can significantly reduce the memory requirement. The
whole idea makes intuitive sense as well.
Of course, as we compute some activations twice, this does come at the cost of
increased run-time, which can typically range between 15-25%. So there’s always
a tradeoff between memory and run-time.
That said, another advantage is that it allows us to use a larger batch size, which
can slightly (not entirely though) counter the increased run-time.
Here’s a demo.
Gradient Checkpointing in PyTorch
To utilize this, we begin by importing the necessary libraries and functions:
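A minimal sketch for a purely sequential model, using torch.utils.checkpoint (layer sizes and the segment count are assumptions):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(8)])
x = torch.randn(64, 512, requires_grad=True)

# Split the network into 4 segments: only the segment-boundary activations are
# stored; the rest are recomputed on the fly during the backward pass.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()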
Gradient Accumulation
Under memory constraints, it is always recommended to train the neural network with a small batch size.
Confused?
Why Do Memory Issues Arise When Training Complex Networks?
The bigger the network, the more activations a network must store in memory.
Also, under memory constraints, having a large batch size will result in:
How does gradient accumulation solve the memory problem while emulating a larger batch size?
Consider we are training a neural network on mini-batches.
● On every mini-batch:
○ Run the forward pass while storing the activations.
○ During backward pass:
■ Compute the loss
■ Compute the gradients
■ Update the weights
Gradient accumulation modifies the last step of the backward pass, i.e., weight
updates. More specifically, instead of updating the weights on every mini-batch,
we can do this:
For instance, say we want to use a batch size of 64. However, current memory can
only support a batch size of 16.
No worries!
Thus, effectively, we trained with a batch size of 16*8 (=128), even more than the 64 we originally intended.
Implementation
Let’s look at how we can implement this. In PyTorch, a typical training loop is
implemented as follows:
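A sketch of such a loop with the gradient-accumulation modification applied (the accumulation step count and the model, loss, and loader names are placeholders):

accumulation_steps = 8   # e.g., 8 mini-batches of 16 = an effective batch of 128

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient matches a large-batch average.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()              # gradients keep accumulating in .grad

    # Update the weights only once every `accumulation_steps` mini-batches.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()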
Departing Note
Before we end, it is essential to note that gradient accumulation is NOT a remedy
to improve run-time in memory-constrained situations. In fact, we can also verify
this from my experiment:
Of course, it’s true that we are updating the weights only after a few iterations.
So, it will be a bit faster than updating on every iteration. Yet, we are still
processing and computing gradients on small mini-batches, which is the core
operation here.
Nonetheless, the good thing is that even if you are not under memory constraints,
you can still use gradient accumulation.
Strategies for Multi-GPU Training
By default, deep learning models only utilize a single GPU for training, even if
multiple GPUs are available.
1) Model parallelism
● Different parts (or layers) of the model are placed on different GPUs.
● Useful for huge models that do not fit on a single GPU.
● However, model parallelism also introduces severe bottlenecks as it
requires data flow between GPUs when activations from one GPU are
transferred to another GPU.
2) Tensor parallelism
3) Data parallelism
4) Pipeline parallelism
● So the issue with standard model parallelism is that 1st GPU remains idle
when data is being propagated through layers available in 2nd GPU:
Regularization with Label Smoothing
For every instance in single-label classification datasets, the entire probability
mass belongs to a single class, and the rest are zero. This is depicted below:
The issue is that, at times, such label distributions excessively motivate the model
to learn the true class for every sample with pretty high confidence. This can
impact its generalization capabilities.
Simply put, this can be thought of as asking the model to be “less overconfident”
during training and prediction while still attempting to make accurate
predictions.
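In PyTorch, this is a one-line change when using cross-entropy (a sketch; the smoothing value is an assumption):

import torch.nn as nn

# Spreads a small amount (here 10%) of probability mass over the other classes,
# discouraging overly confident predictions.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)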
In this experiment, I trained two neural networks on the Fashion MNIST dataset
with the exact same weight initialization.
The model with label smoothing resulted in a better test accuracy, i.e., better
generalization.
When Not to Use Label Smoothing
After using label smoothing for many of my projects, I have also realized that it is
not well suited for all use cases. So it’s important to know when you should not
use it.
See, if you only care about getting the final prediction correct and improving generalization, label smoothing will be a pretty handy technique. However, I wouldn’t recommend utilizing it if you care about the model’s output confidence, i.e., well-calibrated probabilities. For instance:
● The model without label smoothing outputs 99% probability for class 3.
● With label smoothing, although the prediction is still correct, the
confidence drops to 74%.
This is something to keep in mind when using label smoothing. Nonetheless, the
technique is indeed pretty promising for regularizing deep learning models. You
can download the code notebook for this chapter here: https://bit.ly/4ePt08d.
Focal Loss
Binary classification tasks are typically trained using the binary cross-entropy (BCE) loss function:
That said, one limitation of BCE loss is that it weighs probability predictions for
both classes equally, which is evident from its symmetry:
For more clarity, consider the table below, which depicts two instances, one from
the minority class and another from the majority class, both with the same loss:
This causes problems when we use BCE for imbalanced datasets, wherein most
instances from the dominating class are “easily classifiable.” Thus, a given loss value from a majority-class instance should (ideally) be weighed LESS than the same loss value from a minority-class instance.
Focal loss is a pretty handy and useful alternative to address this issue. It is
defined as follows:
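In the usual formulation (for the positive class, with p the predicted probability and γ the focusing parameter), it reads:

\text{FL}(p) = -(1 - p)^{\gamma} \, \log(p)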
Plotting BCE (class y=1) and Focal loss (for class y=1 and γ=3), we get the
following curve:
As shown in the figure above, focal loss reduces the contribution of the
predictions the model is pretty confident about. Also, the higher the value of γ
(Gamma), the more downweighing takes place, as shown in this plot below:
Moving on, while the Focal loss function reduces the contribution of confident
predictions, we aren’t done yet.
To address this, we must add another weighing parameter (α), which is the
inverse of the class frequency, as depicted below:
Thus, the final loss function comes out to be the following:
By using both downweighing and inverse weighing, the model gradually learns
patterns specific to the hard examples instead of always being overly confident in
predicting easy instances.
Next, I trained two neural network models (with the same architecture of 2
hidden layers):
The decision region plot and test accuracy for these two models is depicted
below:
It is clear that:
● The model trained with BCE loss (left) always predicts the majority class.
● The model trained with focal loss (right) focuses relatively more on
minority class patterns. As a result, it performs better.
How Dropout Actually Works
Some time back, I was invited by a tech startup to conduct their ML interviews. I
interviewed 12 candidates and mostly asked practical ML questions.
However, there were some conceptual questions as well, like the one below,
which I intentionally asked every candidate:
How does Dropout work?
A typical candidate’s answer
In a gist, the idea is to zero out neurons randomly in a neural network. This is
done to regularize the network.
Dropout is only applied during training, and which neuron activations to zero out
(or drop) is decided using a Bernoulli distribution:
Why do you zero out neuron activations while training the network using Dropout?
Candidates: To ensure that the network does not solely depend on a few specific neurons, i.e., to regularize it.
Now, on to the part most people overlook…
Of course, I am not saying that the above details are incorrect. They are correct.
However, this is just 50% of how Dropout works, and disappointingly, most
resources don’t cover the remaining 50%. If you too are only aware of the 50%
details I mentioned above, continue reading.
How Dropout Actually Works
To begin, we must note that Dropout is only applied during training, but not
during the inference/evaluation stage:
Now, consider that a neuron’s input is computed using 100 neurons in the
previous hidden layer:
As a result (assuming each of those activations is 1), the input received by the blue neuron will be 100, as depicted below:
Now, during training, if we were using Dropout with, say, a 40% dropout rate,
then roughly 40% of the yellow neuron activations would have been zeroed out.
As a result, the input received by the blue neuron would have been around 60:
However, the above point is only valid for the training stage.
If the same scenario had existed during the inference stage instead, then the
input received by the blue neuron would have been 100.
During training, the average neuron inputs are significantly lower than those
received during inference.
More formally, using Dropout significantly affects the scale of the activations.
However, it is desired that the neurons throughout the model must receive the
roughly same mean (or expected value) of activations during training and
inference. To address this, Dropout performs one additional step.
The idea is to scale the remaining active inputs during training. The simplest way to do this is by scaling all activations during training by a factor of 1/(1-p), where p is the dropout rate. For instance, using this technique on the neuron input of 60, we get the following (recall that we set p=40%):
As depicted above, scaling the neuron input brings it to the desired range, which
makes training and inference stages coherent for the network.
Verifying This Experimentally
In fact, we can verify that typical implementations of Dropout, from PyTorch, for
instance, do carry out this step. Let’s define a dropout layer as follows:
Now, let’s consider a random tensor and apply this dropout layer to it:
What’s more, the retained values are precisely the same as we would have
obtained by explicitly scaling the input tensor with 1/(1-p):
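A quick way to run that check (a minimal sketch):

import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.4
drop = nn.Dropout(p=p)
drop.train()                       # Dropout is active only in training mode

x = torch.ones(10)
out = drop(x)
print(out)                         # zeros for dropped entries, 1/(1-p) ≈ 1.667 otherwise

# The retained values match explicit scaling of the input by 1/(1-p).
print(torch.allclose(out[out != 0], x[out != 0] / (1 - p)))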
Issues with Dropout in CNNs
When it comes to training neural networks, it is always recommended to use Dropout to improve their generalization power.
This applies not just to CNNs but to all other neural networks. And I am sure you
already know the above details, so let’s get into the interesting part.
The Problem with Using Dropout in CNNs
The core operation that makes CNNs so powerful is convolution, which allows
them to capture local patterns, such as edges and textures, and helps extract
relevant information from the input.
Here, if we were to apply traditional Dropout, the input features would look something like this:
But this isn’t found to be that effective specifically for convolution layers. To
understand this, consider we have some image data. In every image, we would
find that nearby features (or pixels) are highly correlated spatially.
For instance, imagine zooming in on the pixel level of the digit ‘9’. Here, we
would notice that the red pixel (or feature) is highly correlated with other features
in its vicinity:
Thus, dropping the red feature using Dropout will likely have little effect, since its information can still reach the next layer through its correlated neighbors.
Simply put, the nature of the convolution operation defeats the entire purpose of
the traditional Dropout procedure.
The Solution
DropBlock is a much better, effective, and intuitive way to regularize CNNs. The
core idea in DropBlock is to drop a contiguous region of features (or pixels)
rather than individual pixels. This is depicted below:
DropBlock Parameters
DropBlock has two main parameters: block_size (the size of the contiguous block to drop) and a drop rate that controls how many activation units are dropped:
To apply DropBlock, first, we create a binary mask on the input sampled from the
Bernoulli distribution:
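A rough sketch of that masking idea in plain PyTorch (not the official implementation; the drop rate, block size, and simplified seed probability are assumptions):

import torch
import torch.nn.functional as F

def dropblock(x, drop_rate=0.1, block_size=3):
    # x: feature map of shape (batch, channels, height, width)
    gamma = drop_rate / (block_size ** 2)            # simplified seed probability
    seeds = torch.bernoulli(torch.full_like(x, gamma))
    # Expand every sampled seed into a block_size x block_size region.
    block_mask = F.max_pool2d(seeds, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask
    # Rescale surviving activations, as in standard (inverted) Dropout.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)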
The efficacy of DropBlock over Dropout is evident from the results table below:
There’s also a library for DropBlock, called “dropblock,” which also provides a linear scheduler for the drop rate.
So the thing is that the researchers who proposed DropBlock found the
technique to be more effective when the drop_rate was increased gradually.
The DropBlock library implements the scheduler. But of course, there are ways to
do this in PyTorch as well. So it’s entirely up to you which implementation you
want to use:
What Do Activation Functions Actually Do?
Everyone knows the objective of an activation function in a neural network. They
let the network learn non-linear patterns. There is nothing new here, and I am
sure you are aware of that too.
In this chapter, let me share a unique perspective on this, which would really help
you understand the internal workings of a neural network.
I have supported this chapter with plenty of visuals for better understanding.
Also, for simplicity, we shall consider a binary classification use case.
Background
The data undergoes a series of transformations at each hidden layer:
Thus, to make accurate predictions, the data received by the output layer from
the last hidden layer MUST BE linearly separable.
To summarize….
While transforming the data through all its hidden layers and just before
reaching the output layer, a neural network is constantly hustling to project the
data to a space where it somehow becomes linearly separable. If it does, the
output layer becomes analogous to a logistic regression model, which can easily
handle this linearly separable data.
To visualize the input transformation, we can add a dummy hidden layer with
just two neurons right before the output layer and train the neural network again.
This way, we can easily visualize the transformation. We expect that if we plot
the activations of this 2D dummy hidden layer, they must be linearly separable.
The below visual precisely depicts this.
As we notice above, while the input data was linearly inseparable, the input
received by the output layer is indeed linearly separable.
This transformed data can be easily handled by the output classification layer.
And this shows that all a neural network is trying to do is transform the data into
a linearly separable form before reaching the output layer.
Shuffle Your Data Before Training
Some training issues are easy to spot, but the problem arises when the cause isn’t that apparent, and it may take some serious time to debug if you are unaware of such issues. In this chapter, I want to
talk about one such data-related mistake, which I once committed during my
early days in machine learning. Admittedly, it took me quite some time to figure
it out back then because I had no idea about the issue.
An Experiment
Consider a classification neural network trained using mini-batch gradient
descent.
Mini-batch gradient descent: update the model weights using one small batch of data at a time.
And, of course, before training, we ensure that both networks (one trained on label-ordered data and the other on shuffled data) had the same initial weights, learning rate, and other settings.
Why does this happen?
Now, if you think about it for a second, overall, both models received the same
data, didn’t they? Yet, the order in which the data was fed to these models totally
determined their performance. I vividly remember that when I faced this issue, I
knew that my data was ordered by labels.
Yet, it never occurred to me that ordering may influence the model performance
because the data will always be the same regardless of the ordering.
But in the case of mini-batch gradient descent, the weights are updated after
every mini-batch. Thus, the prediction and weight update on a subsequent
mini-batch is influenced by the previous mini-batches.
In the context of label-ordered data, where samples of the same class are grouped
together, mini-batch gradient descent will lead the model to learn patterns
specific to the class it excessively saw early on in training. In contrast, randomly
ordered data ensures that each mini-batch contains a balanced representation of
classes. This allows the model to learn a more comprehensive set of features
throughout the training process.
Of course, the idea of shuffling is not valid for time-series datasets as their
temporal structure is important. The good thing is that if you happen to use, say,
PyTorch DataLoader, you are safe. This is because it already implements
shuffling. But if you have a custom implementation, ensure that you are not
making any such error.
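For reference, this is the relevant flag (a sketch; train_dataset is a placeholder):

from torch.utils.data import DataLoader

# shuffle=True reshuffles the training data at the start of every epoch.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)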
Before I end, one thing that you must ALWAYS remember when training neural networks is that these models can proficiently latch onto patterns that don’t truly exist in your dataset. So never give them any chance to do so.
Model Compression
Knowledge Distillation for Model Compression
Model accuracy alone (or an equivalent performance metric) rarely determines
which model will be deployed.
What is Knowledge Distillation?
In a gist, the idea is to train a smaller/simpler model (called the “student” model)
that mimics the behavior of a larger/complex model (called the “teacher” model).
But with consistent training, a smaller model may get (almost) as good as the
larger one.
A Knowledge Distillation Example
In the interest of time, let’s say we have already trained the following CNN model
on the MNIST dataset (I have provided the full Jupyter notebook towards the end,
don’t worry):
Being a classification model, the output will be a probability distribution over the
<N> classes:
Thus, we can train the student model such that its probability distribution
matches that of the teacher model.
To do this, we can train the student by minimizing the KL divergence between its output distribution and the teacher's:
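A sketch of that objective (the temperature T, the weighting α, and the mixed-in hard-label term are common choices, not necessarily the book's exact setup):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=3.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard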
Done!
The following image compares the training loss and validation accuracy of the
two models:
However, it is still pretty promising, given that it was only composed of simple
feed-forward layers.
Also, as depicted below, the student model is approximately 35% faster than the teacher model, which is a significant improvement in inference run-time for only a marginal drop in the test accuracy.
That said, one of the biggest downsides of knowledge distillation is that one must
still train a larger teacher model first to train the student model.
In the next chapter, let’s discuss one more technique to compress ML models and
reduce their memory footprint.
Activation Pruning
Once we complete network training, we are almost always left with plenty of
useless neurons — ones that make nearly zero contribution to the network’s
performance, but they still consume memory.
In other words, there is a high percentage of neurons, which, if removed from the
trained network, will not affect the performance remarkably:
And, of course, I am not saying this as a random and uninformed thought. I have
experimentally verified this over and over across my projects.
Thus, they can be pruned from the network, as they will have very little impact on
the model’s output.
For pruning, we can decide on a pruning threshold (λ) and prune all neurons
whose activations are less than this threshold.
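A rough sketch of that idea for one fully connected layer (the threshold and the way activations are collected are assumptions; this zeroes out the weights of low-activation neurons rather than physically removing them):

import torch

@torch.no_grad()
def prune_low_activation_neurons(layer, activations, threshold=0.4):
    # activations: outputs of `layer` collected over a validation set,
    # with shape (n_samples, n_neurons).
    mean_act = activations.abs().mean(dim=0)
    dead = mean_act < threshold            # neurons contributing very little
    layer.weight[dead, :] = 0.0            # silence their incoming weights
    if layer.bias is not None:
        layer.bias[dead] = 0.0
    return int(dead.sum())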
At a pruning threshold λ=0.4, the validation accuracy of the model drops by just
0.62%, but the number of parameters drops by 72%.
That is a huge reduction, with both models being almost equally good! Of course, there is a trade-off because we are not doing quite as well as the original model. But in many cases, especially when deploying ML models, accuracy is not the only metric that drives these decisions.
Deployment
Deploy ML Models Right from a Jupyter Notebook
The core objective of model deployment is to obtain an API endpoint that can be
used for inference purposes:
So, in this chapter, I want to help you simplify this process. More specifically, we
shall learn how to deploy any ML model right from a Jupyter Notebook in just
three simple steps using the Modelbit API.
Deploying with Modelbit
Assume we have already trained our model.
● Next, we log in to Modelbit from our Jupyter Notebook (make sure you
have created an account here: Modelbit)
Simply put, this function contains the code that will be executed at inference.
Thus, it will be responsible for returning the prediction.
We must specify the input parameters required by the model in this method.
Also, we can name it anything we want.
For our linear regression case, the inference function can be as follows:
● We define the inference function.
● Next, we specify the input of the model as a parameter of this method.
● We validate the input for its data type.
● Finally, we return the prediction.
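Putting the bullets above together, the flow looks roughly like this (a sketch assuming Modelbit's login()/deploy() pattern; the function name, input validation, and model object are illustrative):

import modelbit

mb = modelbit.login()                        # step 1: authenticate from the notebook

def predict_price(x: float) -> float:        # step 2: the inference function
    if not isinstance(x, (int, float)):      # validate the input's data type
        raise TypeError("x must be numeric")
    return float(model.predict([[x]])[0])    # `model` is pickled and shipped automatically

mb.deploy(predict_price)                     # step 3: deploy and get an API endpoint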
One good thing about Modelbit is that every dependency of the function (the
model object in this case) is pickled and sent to production automatically along
with the function. Thus, we can reference any object in this method. Once we
have defined the function, we can proceed with deployment as follows:
We have successfully deployed the model in three simple steps, right from the Jupyter Notebook! Once our model has been successfully deployed, it will appear in our Modelbit dashboard.
As shown above, Modelbit provides an API endpoint. We can use it for inference
purposes as follows:
The first number in the list is the input ID. All entries following the ID in a list
are the function parameters.
Lastly, we can also specify specific versions of the libraries or Python used while
deploying our model. This is depicted below:
Isn’t that cool, simple, and elegant over traditional deployment approaches?
Ways to Test an ML Model in Production
Despite rigorously testing an ML model locally (on validation and test sets), it
could be a terrible idea to instantly replace the previous model with a new model.
1) A/B Testing
2) Canary Testing
3) Interleaved Testing
4) Shadow Testing
Periodic Retraining, Logging, and the Model Registry
Real-world ML deployment is never just about “deployment” — host the model
somewhere, obtain an API endpoint, integrate it into the application, and you are
done!
1) Periodic retraining
But updating does not simply mean overwriting the previous version.
2) Model registry
Another practical idea is to maintain a model registry for deployments. Let’s
understand what it is.
However, when we use a model registry, we version models separately from the
code. Let me give you an intuitive example to understand this better. Imagine our
deployed model takes three inputs to generate a prediction:
While writing the inference code, we overlooked that, at times, one of the inputs
might be missing. We realized this by analyzing the model’s logs.
We may want to fix this quickly (at least for a while) before we decide on the next
steps more concretely. Thus, we may decide to update the inference code by
assigning a dummy value for the missing input.
This will allow the model to still process the incoming request. But does this quick fix require training a new model? No, right?
Here, we only need to update the inference code. The model will remain the
same.
But if we were to version the model and code together, it would lead to a
redundant model and take up extra space.
LLMs
The Memory Requirements of Training GPT-2 (XL)
GPT-2 (XL) has 1.5 Billion parameters, and its parameters consume ~3GB of
memory in 16-bit precision.
What’s your estimate for the minimum memory needed to train GPT-2 on a
single GPU?
● Optimizer → Adam
● Batch size → 32
● Number of transformer layers → 48
● Sequence length → 1000
So, what's your estimate of the minimum memory required to train this GPT-2 model on a single GPU?
One can barely train a 3GB GPT-2 model on a single GPU with 32GB of
memory.
But how could that be even possible? Where does all the memory go?
Let’s understand.
There are so many fronts on which the model consistently takes up memory
during training.
1) Optimizer states, gradients, and parameter memory
Mixed precision training is widely used to speed up model training.
Both the forward and backward propagation are performed using the 16-bit
representations of weights and gradients.
Moreover, the updates at the end of the backward propagation are still performed
under 32-bit for effective computation. I am talking about the circled step in the
image below:
While many practitioners use it just because it is popular, they don’t realize that
during training, Adam stores two optimizer states to compute the updates —
momentum and variance of the gradients:
Thus, if the model has Φ parameters, then these two optimizer states will
consume:
Lastly, as shown in the figure above, the final updates are always adjusted in the
32-bit representation of the model weights. This leads to:
That’s 16*Φ, or 24GB of memory, which is ridiculously higher than the 3GB
memory utilized by 16-bit parameters.
2) Activations
For big deep learning models, like LLMs, Activations take up significant memory
during training.
More formally, the total number of activations computed in one transformer block of GPT-2 is:
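One commonly cited estimate (from the Megatron-LM activation-recomputation analysis, not necessarily the exact figure used here) puts the half-precision activation memory per transformer layer at roughly:

s \cdot b \cdot h \left( 34 + 5 \, \frac{a \cdot s}{h} \right) \ \text{bytes}

where b is the batch size, s the sequence length, h the hidden dimension, and a the number of attention heads.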
Memory fragmentation adds further overhead: it occurs when there are small, unused gaps between allocated memory blocks, leading to inefficient use of the available memory.
Conclusion
In the above discussion, we considered a relatively small model — GPT-2 (XL)
with 1.5 Billion parameters, which is tiny compared to the scale of models being
trained these days.
However, the discussion may have helped you reflect on the inherent challenges
of building LLMs. Many people often say that GPTs are only about stacking more
and more layers in the model and making the network bigger.
If it was that easy, everybody would have been doing it. From this discussion, you
may have understood that it’s not as simple as appending more layers.
Even one additional layer can lead to multiple GBs of additional memory
requirement. Multi-GPU training is at the forefront of these models, which we
covered in an earlier chapter in this book.
Full Fine-tuning vs. LoRA Fine-tuning vs. RAG
Here’s a visual which illustrates “full-model fine-tuning,” “fine-tuning with
LoRA,” and “retrieval augmented generation (RAG).”
All three techniques are used to augment the knowledge of an existing model
with additional data.
1) Full fine-tuning
Fine-tuning means adjusting the weights of a pre-trained model on a new dataset
for better performance.
While this fine-tuning technique has been successfully used for a long time,
problems arise when we use it on much larger models — LLMs, for instance,
primarily because of:
● Their size.
● The cost involved in fine-tuning all weights.
● The cost involved in maintaining all large fine-tuned models.
2) LoRA fine-tuning
LoRA fine-tuning addresses the limitations of traditional fine-tuning. The core
idea is to decompose the weight matrices (some or all) of the original model into
low-rank matrices and train them instead. For instance, in the graphic below, the
bottom network represents the large pre-trained model, and the top network
represents the model with LoRA layers.
The idea is to train only the LoRA network and freeze the large model.
But the LoRA network in the graphic appears to have more neurons than the original model, so how does that help? To understand this, note that neurons themselves don't have anything to do with the memory consumed by the network.
They are just used to illustrate the dimensionality transformation from one layer
to another.
It is the weight matrices (or the connections between two layers) that take up
memory. Thus, we must be comparing these connections instead:
Looking at the above visual, it is pretty clear that the LoRA network has
relatively very few connections.
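A minimal sketch of a LoRA-style layer (the rank, scaling, and initialization are common defaults, not the only choices):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # freeze the pre-trained weights
        # Low-rank update W + (alpha/rank) * B @ A, with far fewer trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)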
3) RAG
Retrieval augmented generation (RAG) is another pretty cool way to augment
neural networks with additional information, without having to fine-tune the
model.
There are seven steps, which are also marked in the above visual. In short: embed the additional documents, store the embeddings in a vector database, embed the incoming user query, run a similarity search to retrieve the most relevant chunks, add the retrieved context to the prompt, pass the augmented prompt to the LLM, and return the generated response.
In fact, even its name entirely justifies what we do with this technique:
Of course, there are many problems with RAG too, such as:
● RAGs involve similarity matching between the query vector and the vectors
of the additional documents. However, questions are structurally very
different from answers.
● Typical RAG systems are well-suited only for lookup-based
question-answering systems. For instance, we cannot build a RAG pipeline
to summarize the additional data. The LLM never gets info about all the
documents in its prompt because the similarity matching step only
retrieves top matches.
So, it’s pretty clear that RAG has both pros and cons.
LLM Fine-tuning Techniques
Traditional fine-tuning (depicted below) is infeasible with LLMs because these
models have billions of parameters and are hundreds of GBs in size, and not
everyone has access to such computing infrastructure.
But today, we have many optimal ways to fine-tune LLMs, and five popular
techniques are depicted below:
- VeRA: In LoRA, every layer has a different pair of low-rank matrices A and
B, and both matrices are trained. In VeRA, however, matrices A and B are
frozen, random, and shared across all model layers. VeRA focuses on
learning small, layer-specific scaling vectors, denoted as b and d, which are
the only trainable parameters in this setup.
- LoRA+: In LoRA, both matrices A and B are updated with the same
learning rate. Authors found that setting a higher learning rate for matrix
B results in more optimal convergence.
ML Fundamentals
Run-time Complexity of 10 ML Algorithms
Here’s the run-time complexity of the 10 most popular ML algorithms.
But why even care about run time? There are multiple reasons why I always care
about run time and why you should too.
For instance, you’ll be up for a big surprise if you use SVM or t-SNE on a dataset
with plenty of samples.
● In a random forest, all decision trees may have different depths. But here, I
have assumed that they are equal.
● During inference in kNN, we first find the distance to all data points. This
gives a list of distances of size (total samples).
○ Then, we find the k-smallest distances from this list.
○ The run-time to determine the k-smallest values may depend on the
implementation.
■ Sorting and selecting the k-smallest values will be O(n log n).
■ But if we use a priority queue, it will take O(n log k).
● In t-SNE, there’s a learning step. Since the major run-time comes from
computing the pairwise similarities in the high-dimensional space, we
have ignored that step.
Nonetheless, the table still accurately reflects the general run-time of each of
these algorithms.
The Most Important Mathematical Definitions in Data Science
Is it really important to know the mathematical details behind algorithms in data science and machine learning?
This is a question that so many people have, especially those who are just getting
started.
Short answer: Yes, it’s important, and here’s why I say so.
In fact, I know many data scientists (mainly on the applied side) who do not
entirely understand the mathematical details but can still build and deploy
models.
Nothing wrong.
However, when I talk to them, I also see some disconnect between “What they
are using” and “Why they are using it.”
If it feels like you are one of them, it’s okay. This problem can be solved.
That said, if you genuinely aspire to excel in this field, building a curiosity for the
underlying mathematical details holds exponential returns.
To help you take that first step, I prepared the following visual, which lists some
of the most important mathematical formulations used in Data Science and
Statistics (in no specific order).
Before reading ahead, look at them one by one and count how many of them you already know:
Some of the terms are pretty self-explanatory, so I won’t go through each of them,
like:
● Eigenvectors: the non-zero vectors that do not change their direction when a linear transformation is applied. They are widely used in dimensionality reduction techniques like PCA.
How to Reliably Evaluate Multiclass-Classification Models
ML model building is typically an iterative process. Given some dataset:
● We train a model.
● We evaluate it.
● And we continue to improve it until we are satisfied with the performance.
Here, the efficacy of any model improvement strategy (say, introducing a new
feature) is determined using some sort of performance metric.
Iteratively improving multiclass-classification models with plain Accuracy alone can be misleading because it hides whether the model is at least getting close to the correct label.
Let’s understand.
Limitations of Accuracy
In probabilistic multiclass-classification models, Accuracy is determined using
the output label that has the highest probability:
Now, it’s possible that the actual label is not predicted with the highest
probability by the model, but it’s in the top “k” output labels.
For instance, in the image below, the actual label (Class C) is not the highest
probability label, but it’s at least in the top 2 predicted probabilities (Class B and
Class C):
And what if in an earlier version of our model, the output probability of Class C
was the lowest, as depicted below:
Nonetheless, Accuracy entirely discards this as it only cares about the highest
probability label.
The Solution
Whenever I am building and iteratively improving any probabilistic multiclass
classification model, I always use the top-k accuracy score. As the name suggests,
it computes whether the correct label is among the k labels with the highest predicted probabilities or not.
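scikit-learn ships this metric directly (a sketch; the model and validation arrays are placeholders):

from sklearn.metrics import top_k_accuracy_score

probs = model.predict_proba(X_val)              # shape: (n_samples, n_classes)
print(top_k_accuracy_score(y_val, probs, k=3))  # fraction of samples whose true label is in the top 3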
As you may have already guessed, top-1 accuracy score is the traditional
Accuracy score. This is a much better indicator to assess whether my model
improvement efforts are translating into meaningful enhancements in predictive
performance or not.
For instance, if the top-3 accuracy score goes from 75% to 90%, this totally
suggests that whatever we did to improve the model was effective:
● Earlier, the correct prediction was in the top 3 labels only 75% of the time.
● But now, the correct prediction is in the top 3 labels 90% of the time.
As a result, one can effectively redirect their engineering efforts in the right
direction. Of course, what I am saying should only be used to assess the model
improvement efforts.
Guiding Model Improvement Efforts with a Human Baseline
Back in 2019, I was working with an ML research group in Germany, where a Ph.D. student was collecting human labels for a dataset that already had ground-truth labels available. This made me curious about why gathering human labels was necessary for him. So I asked.
Consider we are building a multiclass classification model. Say it’s a model that
classifies an input image as a rock, paper, or scissors:
For simplicity, let’s assume there’s no class imbalance. Calculating the class-wise
validation accuracies gives us the following results:
Question: Which class should you focus on improving to further enhance the model's overall performance?
But this might not be true. And this is precisely what that Ph.D. student wanted
to verify by collecting human labels. Let’s say that the human labels give us the
following results:
Based on this, do you still think the model performs the worst on the “Scissor”
class?
No, right?
I mean, of course, the model has the least accuracy on the “Scissor” class, and I
am not denying it. However, with more context, we notice that the model is doing
a pretty good job classifying the “Scissor” class. This is because an average
human is achieving just 2% higher accuracy in comparison to what our model is
able to achieve.
However, the above results astonishingly reveal that it is the “Rock” class instead
that demands more attention. The accuracy difference between an average
human and the model is way too high (13%). Had we not known this, we would
have continued to improve the “Scissor” class, when in reality, “Rock” requires
more improvement.
Ever since I learned this technique, I have found it super helpful to determine my
next steps for model improvement, if possible. I say “if possible” because I
understand that many datasets are hard for humans to interpret and label.
Nonetheless, if it is feasible to set up such a “human baseline,” one can get so
much clarity into how the model is performing.
As a result, one can effectively redirect their engineering efforts in the right
direction.
Of course, I am not claiming that this will be universally useful in all use cases.
For instance, if the model is already performing better than the baseline, the
model improvements from there on will have to be guided based on past results.
Yet, in such cases, surpassing a human baseline at least helps us validate that the
model is doing better than what a human can do.
Loss Functions Used by Popular ML Algorithms
The below visual depicts the most commonly used loss functions by various ML
algorithms.
Commonly Used Loss Functions for Regression and Classification
The below visual depicts some commonly used loss functions in regression and
classification tasks.
How to Actually Use Train, Validation, and Test Sets
It is pretty conventional to split the given data into train, test, and validation sets.
However, there are quite a few misconceptions about how they are meant to be
used, especially the validation and test sets.
In this chapter, let’s clear them up and see how to truly use train, validation, and
test sets.
● Train
● Validation
● Test
At this point, just assume that the test data does not even exist. Forget about it
instantly.
Begin with the train set. This is your whole world now.
● You analyze it
● You transform it
● You use it to determine features
● You fit a model on it
Based on validation performance, improve the model. Here’s how you iteratively
build your model:
Until...
You reach a point where you start overfitting the validation set.
This indicates that you have exploited (or polluted) the validation set.
No worries.
Merge it with the train set and generate a new split of train and validation.
Note: Rely on cross-validation if needed, especially when you don’t have much
data. You may still use cross-validation if you have enough data. But it can be
computationally intensive.
Now, if you are happy with the model’s performance, evaluate it on test data.
If the test performance is unsatisfactory, most people go back to improving the model and then evaluate on the same test set again. This is not allowed!
Your professor taught you in the classroom. All in-class lessons and examples are
the train set.
The professor gave you take-home assignments, which acted like validation sets.
You got some wrong and some right. Based on this, you adjusted your topic
fundamentals, i.e., improved the model.
Now, if you keep solving the same take-home assignment repeatedly, you will
eventually overfit it, won’t you?
The final exam day paper is your test set. If you do well, awesome!
But if you fail, the professor cannot give you the exact exam paper next time, can
they? This is because you know what’s inside.
Of course, by evaluating a model on the test set, the model never gets to “know”
the precise examples inside that set. But the issue is that the test set has been
exposed now.
Your previous evaluation will inevitably influence any further evaluations on that
specific test set. That is why you must always use a specific test set only ONCE.
Once you do, merge it with the train and validation set and generate an entirely
new split.
Repeat.
And that is how you use train, validation, and test sets in machine learning.
Cross-Validation Techniques
Tuning and validating machine learning models on a single validation set can be
misleading and sometimes yield overly optimistic results.
This can occur due to a lucky random split of data, which results in a model that
performs exceptionally well on the validation set but poorly on new, unseen data.
Cross validation involves repeatedly partitioning the available data into subsets,
training the model on a few subsets, and validating on the remaining subsets.
The main advantage of cross validation is that it provides a more robust and
unbiased estimate of model performance compared to the traditional validation
method.
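For a quick sense of what this looks like in code (a sketch using scikit-learn; the model and data are placeholders):

from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)   # one score per held-out fold
print(scores.mean(), scores.std())             # a more robust performance estimate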
Below are five of the most commonly used and must-know cross validation
techniques.
Leave-One-Out Cross-Validation
K-Fold Cross-Validation
Rolling Cross-Validation
Blocked Cross-Validation
Stratified Cross-Validation
What to Do After Cross-Validation?
Let me ask you a question.
But before I do that, I need to borrow your imagination for just a moment.
Imagine you are building a machine learning model. After using cross-validation to determine the optimal hyperparameters of your regressor, would you deploy the best model obtained during cross-validation, or would you retrain a new model on the entire dataset with those hyperparameters?
The Recommendation
My strong preference has almost always been “retraining a new model with
entire data.”
There are, of course, some considerations to keep in mind, which I have learned
through the models I have built and deployed. That said, in most cases, retraining
is the ideal way to proceed.
Let me explain.
h e i h o l
We would want to retrain a new model because, in a way, we are already satisfied
with the cross-validation performance, which, by its very nature, is an out-of-fold
metric.
An out-of-fold data is data that has not been seen by the model during the
training. An out-of-fold metric is the performance on that data.
In other words, we already believe that the model aligns with how we expect it to
perform on unseen data.
Thus, incorporating this unseen validation set in the training data and retraining
the model will MOST LIKELY have NO effect on its performance on unseen data
after deployment (assuming a sudden covariate shift hasn't kicked in, which is a
different issue altogether).
If, however, we were not satisfied with the cross-validation performance itself, we
wouldn’t even be thinking about finalizing a model in the first place.
It’s hard for me to recall any instance where retraining did something
disastrously bad to the overall model.
In fact, I vividly remember one instance wherein, while I was productionizing the
model (it took me a couple of days after retraining), the team had gathered some
more labeled data.
The model didn’t show any performance degradation when I evaluated it (just to
double-check). As an added benefit, this also helped ensure that I had made no
errors while productionizing my model.
Some Considerations
Here, please note that it’s not a rule that you must always retrain a new model.
The field itself and the tasks one can solve are pretty diverse, so one must be
open-minded while solving the problem at hand. One of the reasons I wouldn’t
want to retrain a new model is that it takes days or weeks to train the model.
In fact, even if we retrain a new model, there are MANY business situations in
which stakes are just too high.
Thus, one can never afford to be negligent about deploying a model without
re-evaluating it — transactional fraud, for instance.
In such cases, I have seen that while a team works on productionizing the model,
data engineers gather some more data in the meantime.
Before deploying, the team would do some final checks on that dataset.
The newly gathered data is then considered in the subsequent iterations of model
improvements.
Double Descent and the Bias-Variance Trade-off
It is well-known that as the number of model parameters increases, we typically
overfit the data more and more. For instance, consider fitting a polynomial
regression model trained on this dummy dataset below:
In your opinion, what should happen to the train and test losses of our model?
It is expected that as we’ll increase the degree (m) and train the polynomial
regression model:
This is because, with a higher degree, the model will find it easier to contort its
regression fit through each training data point, which makes sense.
Why does the test loss increase to a certain point but then decrease?
Well…what you are seeing is called the “double descent phenomenon,” which is
quite commonly observed in many ML models, especially deep learning models.
In fact, this whole idea is deeply rooted in why LLMs, although massively big (billions or even trillions of parameters), can still generalize pretty well.
And it’s hard to accept it because this phenomenon directly challenges the
traditional bias-variance trade-off we learn in any introductory ML class:
Putting it another way, training very large models, even with more parameters
than training data points, can still generalize well.
To the best of my knowledge, this is still an open question, and it isn’t entirely
clear why neural networks exhibit this behavior.
There are some theories, however, such as this one around regularization:
it could be that the model applies some sort of implicit regularization, with
which it can focus on just the right number of parameters for generalization.
Statistical Foundations
MLE vs. EM — what’s the difference?
Maximum likelihood estimation (MLE) and expectation maximization (EM) are
two popular techniques to determine the parameters of statistical models.
Due to their applicability in MANY statistical models, I have seen them come up in
plenty of data science interviews as well, especially the distinction between the
two. The following visual summarizes how they work:
Maximum likelihood estimation (MLE)
MLE starts with a labeled dataset and aims to determine the parameters of the
statistical model we are trying to fit.
This gives our parameter estimates that would have most likely generated the
given data.
But what do we do if we don’t have true labels? We still want to estimate the
parameters, don’t we?
MLE, as you may have guessed, will not be applicable. The true label (y), being
unobserved, makes it impossible to define a likelihood function like we did
earlier.
Expectation maximization (EM)
EM is an iterative optimization technique to estimate the parameters of
statistical models. It is particularly useful when we have an unobserved (or
hidden) label. One example situation could be as follows:
As depicted above, we assume that the data was generated from multiple
distributions (a mixture). However, the observed/complete data does not contain
that information. In other words, the observed dataset does not have information
about whether a specific row was generated from distribution 1 or distribution 2.
Had it contained the label (y) information, we would have already used MLE.
EM helps us with parameter estimates of such datasets. The core idea behind EM
is as follows:
● We will update the likelihood function (L) using the new posterior
probabilities.
● Again, maximizing it will give us a new estimate for the parameters (θ).
● And this process goes on and on until convergence.
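To make this concrete, here is a minimal sketch (not the book’s original code) using scikit-learn’s GaussianMixture, which runs exactly this E-step / M-step loop under the hood and recovers the mixture parameters without ever seeing the hidden labels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Observed data: a mixture of two Gaussians, but WITHOUT the source labels
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1.5, 500)]).reshape(-1, 1)

# GaussianMixture repeats the E-step / M-step loop until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

print(gmm.means_.ravel())        # estimated means of the two hidden distributions
print(gmm.predict_proba(x[:5]))  # posterior probabilities computed in the E-step
```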
Confidence interval and prediction interval
Statistical estimates always have some uncertainty.
Consider a simple example of modeling house prices just based on its area. A
prediction wouldn’t tell the true value of a house based on its area. This is
because different houses of the same size can have different prices.
Instead, what it predicts is the mean value related to the outcome at a particular
input.
Let’s understand.
Let’s fit a linear regression model using statsmodel and print a part of the
regression summary:
What does the 95% confidence interval, i.e., the [0.025, 0.975] columns in the summary, tell us?
So, if we gathered more such samples and fit an OLS to each sample, the true
coefficient (which we can only know if we had the data for the entire population)
would lie 95% of the time in this confidence interval.
The confidence interval we saw above was for the coefficient, so what does the
confidence interval represent in this case?
Similar to what we discussed above, the data is just a sample of the population.
However, if we gathered more such samples and fit an OLS to each dataset, the
true mean value (which we can only know if we had the data for the entire
population) for this specific input would lie 95% of the time in this
confidence interval.
…we notice that it is wider than the confidence interval. Why is it, and what does
this interval tell?
What we saw above with confidence interval was about estimating the true
population mean at a specific input.
What we are talking about now is obtaining an interval where the true value for
an input can lie.
Thus, this additional uncertainty appears because in our dataset, for the same
value of input x, there can be multiple different values of the outcome. This is
depicted below:
Thus, it is wider than the confidence interval. Plotting it across the entire input
range, we get the following plot:
Given that the model is predicting a mean value (as depicted below), we have to
represent the prediction uncertainty that the actual value can lie anywhere in the
prediction interval:
A 95% prediction interval tells us that we can be 95% sure that the actual value
of this observation will fall within this interval.
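In statsmodels, both intervals can be obtained from the same fitted model. A minimal sketch, reusing the hypothetical model fitted earlier in this chapter, might look like this:

```python
import numpy as np

# Query the model at a specific input, say x = 4 (column order: [intercept, x])
x_new = np.array([[1.0, 4.0]])
frame = model.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_* -> 95% confidence interval for the mean response at x = 4
# obs_ci_*  -> 95% prediction interval for an individual observation at x = 4
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```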
So to summarize:
Why is OLS called an unbiased estimator?
The OLS estimator for linear regression (shown below) is known as an unbiased
estimator.
Background
The goal of statistical modeling is to make conclusions about the whole
population.
In other words, given that we cannot observe (or collect data of) the entire
population, we cannot obtain the true parameter (β) for the population:
Thus, we must obtain parameter estimates (B̂) on samples and infer the true
parameter (β) for the population from those estimates:
And, of course, we want these sample estimates (B̂) to be reliable to determine the
actual parameter (β).
True population model
When using a linear regression model, we assume that the response variable (Y)
and features (X) for the entire population are related as follows:
Expected value of OLS estimates
The closed-form solution of OLS is given by:
What’s more, as discussed above, using OLS on different samples will result in
different parameter estimates:
Simply put, the expected value is the average value of the parameters if we run
OLS on many samples.
Here, substitute y = Xβ + ε:
See, we can do that substitution because even if we don’t know the parameter β
for the whole population, we know that the sample was drawn from the
population.
Thus, the equation in terms of the true parameters (y = Xβ + ε) still holds for
the sample.
Now, even if we were to draw samples from this population data, the true
equation y_i = x_i β + ε_i would still be valid on the sampled data points,
wouldn’t it?
Simplifying, we get:
The expected value of parameter estimates on the samples equals the true
parameter value β.
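A quick simulation makes this tangible. The sketch below (my own toy setup, not from the book) repeatedly samples data from a known population, runs OLS on each sample, and averages the estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 3.0])              # [intercept, slope] of the population

estimates = []
for _ in range(5_000):                        # many independent samples
    x = rng.uniform(0, 10, 50)
    X = np.column_stack([np.ones(50), x])
    y = X @ true_beta + rng.normal(0, 2, 50)
    estimates.append(np.linalg.inv(X.T @ X) @ X.T @ y)   # OLS closed form

print(np.mean(estimates, axis=0))             # ~[2.0, 3.0]: unbiased on average
```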
An important takeaway
Many people misinterpret unbiasedness with the idea that the parameter
estimates from a single run of OLS on a sample are equal to the true parameter
values.
And, of course, all this is based on the assumption that we have good
representative samples and that the assumptions of linear regression are not
violated.
Bhattacharyya distance
Assessing the similarity between two probability distributions is quite helpful at
times. For instance, imagine we have a labeled dataset (X, y).
The core idea is to approximate the overlap between two distributions, which
measures the “closeness” between the two distributions under consideration.
Here, we have an observed distribution (blue). Next, we measure its distance from:
● Gaussian → 0.19
● Gamma → 0.03
A high Bhattacharyya distance indicates less overlap or more dissimilarity. This
lets us conclude that the observed distribution resembles a Gamma distribution.
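A minimal, histogram-based sketch of this comparison (my own approximation of the idea; the bin count and the 1e-12 guard are arbitrary choices) is shown below:

```python
import numpy as np

def bhattacharyya_distance(p_samples, q_samples, bins=30):
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))        # Bhattacharyya coefficient (overlap)
    return -np.log(bc + 1e-12)         # less overlap -> larger distance

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=2.0, size=5_000)
print(bhattacharyya_distance(observed, rng.gamma(2.0, 2.0, 5_000)))    # small
print(bhattacharyya_distance(observed, rng.normal(4.0, 2.8, 5_000)))   # larger
```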
KL Divergence vs. Bhattacharyya distance
Now, many often get confused between KL Divergence and Bhattacharyya
distance. Effectively, both are quantitative measures to determine the “similarity”
between two distributions.
The core idea behind KL Divergence is to assess how much information is lost
when one distribution is used to approximate another distribution.
The more information is lost, the more the KL Divergence and, consequently, the
less the “similarity”. Also, approximating a distribution Q using P may not be the
same as doing the reverse — P using Q. This makes KL Divergence asymmetric
in nature.
Moving on, Bhattacharyya distance measures the overlap between two distributions.
Bhattacharyya distance has many applications, not just in machine learning but
in many other domains. For instance:
The only small caveat is that Bhattacharyya distance does not satisfy the triangle
inequality, so that’s something to keep in mind.
Why prefer Mahalanobis distance over Euclidean distance
During distance calculation, Euclidean distance assumes independent axes.
Thus, Euclidean distance will produce misleading results if your features are
correlated. For instance, consider this dummy dataset below:
Clearly, the features are correlated. Here, consider three points marked P1, P2,
and P3 in this dataset.
Yet, Euclidean distance ignores this, and P2 and P3 come out to be equidistant from
P1, as depicted below:
Mahalanobis distance, on the other hand, accounts for this correlation structure.
As a result, it can measure how far away a data point is from the distribution,
which Euclidean distance cannot.
Referring to the earlier dataset again, with Mahalanobis distance, P2 comes out
to be closer to P1 than P3.
How does it work?
In a gist, the objective is to construct a new coordinate system with independent
and orthogonal axes. The steps are:
Another use case we typically do not hear of often, but that does exist, is a variant of
kNN implemented with Mahalanobis distance instead.
Scipy implements the Mahalanobis distance, which you can check here:
https://bit.ly/3LjAymm.
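A minimal sketch of the contrast on correlated features (my own dummy data, not the book’s) could look like this:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 1_000)
X = np.column_stack([x1, 0.9 * x1 + rng.normal(0, 0.3, 1_000)])   # correlated features

VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
center = X.mean(axis=0)

p2 = np.array([1.0, 1.0])     # along the correlation direction
p3 = np.array([1.0, -1.0])    # against the correlation direction
print(euclidean(center, p2), euclidean(center, p3))              # roughly equal
print(mahalanobis(center, p2, VI), mahalanobis(center, p3, VI))  # p3 is much farther
```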
11 ways to determine data normality
Many ML models assume (or work better) under the presence of normal
distribution.
For instance:
Thus, being aware of the ways to test normality is extremely crucial for data
scientists. The visual below depicts the 11 essential ways to test normality.
1) Plotting methods
● Histogram
● KDE Plot
● Violin Plot
While plotting is often reliable, it is a subjective approach and prone to errors.
2) Statistical methods:
1) Shapiro-Wilk test:
● Finds a statistic using the correlation between the observed data and
the expected values under a normal distribution.
● The p-value indicates the likelihood of observing such a correlation
if the data were normally distributed.
● A high p-value indicates a normal distribution.
2) KS test:
● Measures the max difference between the cumulative distribution
functions (CDF) of observed and normal distribution.
● The output statistic is based on the max difference between the two
CDFs.
● A high p-value indicates a normal distribution.
3) Anderson-Darling test:
● Measures the differences between the observed data and the
expected values under a normal distribution.
● Emphasizes the differences in the tail of the distribution.
● This makes it particularly effective at detecting deviations in the
extreme values.
4) Lilliefors test:
3) Distance measures
Distance measures are another reliable and more intuitive way to test normality.
See, the problem is that a single distance value needs more context for
interpretability.
For instance, if the distance between two distributions is 5, is this large or small?
● Find the distance between the observed distribution and multiple reference
distributions.
● Select the reference distribution with the minimum distance to the
observed distribution.
1) Bhattacharyya distance:
3) KL Divergence:
● It is not entirely a "distance metric" per se, but can be used in this
case.
● Measure information lost when one distribution is approximated
using another distribution.
● The more information is lost, the more the KL Divergence.
● Choose the distribution that has the least KL divergence from the
observed distribution.
Probability vs. likelihood
In data science and statistics, many folks often use “probability” and “likelihood”
interchangeably.
While writing this chapter, I searched for their meaning in the Cambridge
Dictionary. Here’s what it says:
Anyway.
Let’s understand!
Likelihood, on the other hand, is about explaining events that have already
occurred.
Assume you have collected some 2D data and wish to fit a straight line with two
parameters — slope (m) and intercept (c).
Here, likelihood is defined as the support provided by a data point for some
particular parameter values in your model.
In maximum likelihood estimation, you have some observed data and you are
trying to determine the specific set of parameters (θ) that maximize the likelihood
of observing the data.
For instance:
To summarize…
In probability:
In likelihood:
11 key probability distributions in data science
Statistical models assume an underlying data generation process.
This is exactly what lets us formulate the generation process, using which we
define the maximum likelihood estimation (MLE) step.
Thus, when dealing with statistical models, the model performance becomes
entirely dependent on:
how well the distribution we assume aligns with the true data generation process, since that assumption is what the MLE step then optimizes.
The visual below depicts the 11 most important distributions in data science:
Normal distribution
Bernoulli distribution
Binomial distribution
Poisson distribution
Exponential distribution
Gamma distribution
Beta distribution
Uniform distribution
Log-normal distribution
Student’s t-distribution
Weibull distribution
● Models the waiting time for an event.
● O en employed to analyze time-to-failure data.
Common misinterpretation of continuous probability distributions
Consider the following probability density function of a continuous probability
distribution. Say it represents the time one may take to travel from point A to B.
Q) What is the probability that an individual will take exactly three minutes (x = 3) to reach point B?
● 1/3 (or 0.33)
● The area under the curve over the interval x = (…, …)
● The area under the curve over the interval x = (…, …)
And I intentionally kept only wrong answers here so that you never forget
something fundamentally important about continuous probability distributions.
● It should be defined for all real numbers (can be zero for some values).
Now, there are infinitely many possible values that a continuous random variable
may take.
Thus, answering our original question, the probability that one will take three
minutes to reach point B is ZERO.
More formally, the probability that a random variable will take values in the
interval [a, b] is:
From the above probability estimation over an interval, we can also verify that the
probability of obtaining a specific value is indeed zero.
By substituting a = b, we get:
● The probability density function does not depict the exact probability of
obtaining a specific value.
● Estimating the probability for a precise value of the random value makes
no sense because it is infinitesimally small.
Instead, we use the probability density function to calculate the probability over
an interval of values.
Feature Definition and Engineering
Introduction
11 types of variables in a dataset
In any tabular dataset, we typically categorize the columns as either a feature or a
target.
However, there are so many variables that one may find/define in their dataset,
which I want to discuss in this chapter. These are depicted in the image below:
1-2) Independent and dependent variables
These are the most common and fundamental to ML.
Independent variables are the features that are used as input to predict the
outcome. They are also referred to as predictors/features/explanatory variables.
The dependent variable is the outcome that is being predicted. It is also called
the target, response, or output variable.
3-4) Confounding and correlated variables
Confounding variables are typically found in a cause-and-effect study (causal
inference).
These variables are not of primary interest in the cause-and-effect equation but
can potentially lead to spurious associations.
To exemplify, say we want to measure the effect of ice cream sales on the sales of
air conditioners.
As you may have guessed, these two measurements are highly correlated.
● There is a high correlation between ice cream sales and sales of air
conditioners.
● But the sales of air conditioners (effect) are NOT caused by ice cream sales.
Also, in this case, the air conditioner and ice cream sales are correlated variables.
5) Control variables
In the above example, to measure the true effect of ice cream sales on air
conditioner sales, we must ensure that the temperature remains unchanged
throughout the study.
More formally, these are variables that are not the primary focus of the study but
are crucial to account for to ensure that the effect we intend to measure is not
biased or confounded by other factors.
6) Latent variables
A variable that is not directly observed but is inferred from other observed
variables.
For instance, we use clustering algorithms because the true labels do not exist,
and we want to infer them somehow.
7) Interaction variables
As the name suggests, these variables represent the interaction effect between
two or more variables, and are often used in regression analysis.
To summarize, the core idea is to study two or more variables together rather
than independently.
8-9) Stationary and non-stationary variables
The concept of stationarity often appears in time-series analysis.
On the flip side, if a variable’s statistical properties change over time, they are
called non-stationary variables.
That is why, typically, using direct values of the non-stationary feature (like the
absolute value of the stock price) is not recommended.
10) Lagged variables
Talking of time series, lagged variables are pretty commonly used in feature
engineering and data analytics.
As the name suggests, a lagged variable represents previous time points’ values of
a given variable, essentially shifting the data series by a specified number of
periods/rows.
For instance, when predicting next month’s sales figures, we might include the
sales figures from the previous month as a lagged variable.
11) Leaky variables
Yet again, as the name suggests, these variables (unintentionally) provide
information about the target variable that would not be available at the time of
prediction.
This leads to overly optimistic model performance during training but fails to
generalize to new data.
For example, consider a medical imaging dataset where each sample consists of multiple images (e.g., different views of the same
patient’s body part), and the model is intended to detect the severity of a disease.
In this case, randomly splitting the images into train and test sets will result in
data leakage.
This is because images of the same patient will end up in both the training and
test sets, allowing the model to “see” information from the same patient during
training and testing.
Here’s a paper that committed this mistake (and later corrected it):
To avoid this, a patient must only belong to the test or train/val set, not both.
Let’s get into more detail about the issue with random splitting below.
Cyclical feature encoding
In typical machine learning datasets, we mostly find features that progress
continuously from one value to another. For instance:
However, there is one more type of feature, which, in most cases, deserves special
feature engineering effort but is often overlooked. These are cyclical features, i.e.,
features with a recurring pattern (or cycle).
Unlike other features that progress continuously (or have no inherent order),
cyclical features exhibit periodic behavior and repeat after a specific interval. For
instance, the hour-of-the-day, the day-of-the-week, and the month-of-the-year are
all common examples of cyclical features. Talking specifically about, say, the
hour-of-the-day, its value can range between 0 to 23:
Moreover, the distance between “0” and “1” must be the same as the distance
between “23” and “0”.
However, standard representation does not fulfill these properties. Thus, the
value “23” is far from “0”. In fact, the distance property isn’t satisfied either.
Now, think about it for a second. Intuitively speaking, don’t you think this feature
deserves special feature engineering, i.e., one that preserves the inherent natural
property?
Cyclical feature encoding
One of the most common techniques to encode such a feature is using
trigonometric functions, specifically, sine and cosine. These are helpful because
sine and cosine are periodic, bounded, and defined for all real values.
Since sine and cosine are defined on a circle, the 24 hour values can be placed
uniformly around a circle whose full revolution corresponds to an angle of 2π.
The central angle (2π) represents 24 hours. Thus, the linear feature values can be
easily converted into cyclical features as follows:
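A minimal sketch of this transformation (assuming an hour-of-the-day column named "hour") is shown below:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": np.arange(24)})

# Map the 24-hour cycle onto the full circle (2*pi), then take sine and cosine
angle = 2 * np.pi * df["hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

# Hours 23 and 0 are now neighbours in the (sin, cos) plane, as they should be
print(df.loc[[0, 1, 23], ["hour", "hour_sin", "hour_cos"]])
```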
The benefit of doing this is how neatly the engineered feature satisfies the
properties we discussed earlier:
…or rather, I should say that the standard linear representation of the
hour-of-the-day feature results in an underutilization of information that the
model could otherwise benefit from. Had it been the day-of-the-week instead, the
central angle (2π) would have represented 7 days.
The same idea can be extended to all sorts of cyclical features you may find in
your dataset:
The point is that as you will inspect the dataset features, you will intuitively
know which features are cyclical and which are not.
Typically, the model will find it easier to interpret the engineered features and
utilize them in modeling the dataset accurately.
Feature discretization
During model development, one of the techniques that many don’t experiment
with is feature discretization. As the name suggests, the idea behind
discretization is to transform a continuous feature into discrete features.
Why, when, and how would you do that? Let’s understand in this chapter.
Motivation
My rationale for using feature discretization has almost always been simple: “It
just makes sense to discretize a feature.”
For instance, say we model this transaction dataset without discretization. This
would result in some coefficients for each feature, which would tell us the
influence of each feature on the final prediction.
But if you think again, in our goal of understanding spending behavior, are we
really interested in learning the correlation between exact age and spending
behavior?
It makes very little sense to do that. Instead, it makes more sense to learn the
correlation between different age groups and spending behavior.
As a result, discretizing the age feature can potentially unveil much more
valuable insights than using it as a raw feature.
Common techniques
Now that we understand the rationale, there are 2 techniques that are widely
preferred.
So, in a way, we get to use a simple linear model but still get to learn non-linear
patterns.
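As a minimal sketch, scikit-learn’s KBinsDiscretizer can bin a continuous column such as age into one-hot-encoded groups. The bin count and strategy below are arbitrary choices, not a recommendation:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=(500, 1)).astype(float)

binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
age_groups = binner.fit_transform(age)

print(binner.bin_edges_[0])   # the learned age-group boundaries
print(age_groups[:3])         # each row is now a one-hot "age group" vector
```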
Simply put, “signal” refers to the meaningful or valuable information in the data.
Binning a feature helps us mitigate the influence of minor fluctuations, which are
often mere noise.
Each bin acts as a means of “smoothing” out the noise within specific data
segments.
Of course, discretization also discards some information. To avoid this, don’t overly
discretize all features. Instead, use it when it makes intuitive sense, as we saw earlier.
Of course, its utility can vastly vary from one application to another, but at times,
I have found that:
Categorical data encoding techniques
Here are 7 ways to encode categorical features:
One-hot encoding
● Each category is represented by a binary vector of 0s and 1s.
● Each category gets its own binary feature, and only one of them is "hot"
(set to 1) at a time, indicating the presence of that category.
● Number of features = Number of unique categorical labels
Dummy encoding
● Same as one-hot encoding but with one additional step.
● After one-hot encoding, we drop one of the binary features.
● We do this to avoid the dummy variable trap (discussed in this chapter).
● Number of features = Number of unique categorical labels - 1
Effect encoding
● Similar to dummy encoding but with one additional step.
● Alter the row with all zeros to -1.
● This ensures that the resulting binary features represent not only the
presence or absence of specific categories but also the contrast between
the reference category and the absence of any category.
● Number of features = Number of unique categorical labels - 1.
Label encoding
● Assign each category a unique label.
● Label encoding introduces an inherent ordering between categories, which
may not be the case.
● Number of features = 1.
Ordinal encoding
● Similar to label encoding — assign a unique integer value to each category.
● The assigned values have an inherent order, meaning that one category is
considered greater or smaller than another.
● Number of features = 1.
Count encoding
● Also known as frequency encoding.
● Encodes categorical features based on the frequency of each category.
● Thus, instead of replacing the categories with numerical values or binary
representations, count encoding directly assigns each category with its
corresponding count.
● Number of features = 1.
Binary encoding
● Combination of one-hot encoding and ordinal encoding.
● It represents categories as binary code.
● Each category is first assigned an ordinal value, and then that value is
converted to binary code.
● The binary code is then split into separate binary features.
● Useful when dealing with high-cardinality categorical features (or a high
number of features) as it reduces the dimensionality compared to one-hot
encoding.
● Number of features ≈ log₂(n), where n is the number of unique categories.
While these are some of the most popular techniques, do note that these are not
the only techniques for encoding categorical data.
Shuffle feature importance
I often find “Shuffle Feature Importance” to be a handy and intuitive technique
to measure feature importance.
As the name suggests, it observes how shuffling a feature influences the model
performance. The visual below illustrates this technique in four simple steps:
Simply put, if we randomly shuffle just one feature and everything else stays the
same, then the performance drop will indicate how important that feature is.
● If the performance drop is low → This means the feature has a very low
influence on the model’s predictions.
● If the performance drop is high → This means that the feature has a very
high influence on the model’s predictions.
● It requires no repetitive model training. Just train the model once and
measure the feature importance.
● It is pretty simple to use and quite intuitive to interpret.
● This technique can be used for all ML models that can be evaluated.
However, there is one caveat: say two features are highly correlated, and one of them is permuted/shuffled.
In this case, the model will still have access to the feature through its correlated
feature.
One way to handle this is to cluster highly correlated features and only keep one
feature from each cluster.
The probe method for feature selection
Real-world ML development is all about achieving a sweet balance between
speed, model size, and performance.
One of the best ways to:
● improve speed,
● reduce size, and
● maintain (or minimally degrade) performance…
…is by using feature selection. The idea is to select the most useful subset of
features from the dataset.
While there are many methods for feature selection, I have often found the
“Probe Method” to be pretty reliable, practical and intuitive to use.
This can be especially useful in cases where we have plenty of features, and we
wish to discard those that don’t contribute to the model.
Of course, one shortcoming is that when using the Probe Method, we must train
multiple models:
1. Train the first model with the random feature and discard useless features.
2. Keep training new models until the random feature is ranked as the least
important feature (although, typically, it does not take many iterations to
converge).
3. Train the final model without the random feature.
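A minimal sketch of one round of the probe method is shown below; the probe is just a random noise column appended to the data, and everything else is standard scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1_000, n_features=8, n_informative=4, random_state=0)

rng = np.random.default_rng(0)
X_probe = np.column_stack([X, rng.normal(size=len(X))])   # append the random probe

model = RandomForestRegressor(random_state=0).fit(X_probe, y)
importances = model.feature_importances_

keep = [i for i in range(X.shape[1]) if importances[i] > importances[-1]]
print("features ranked above the random probe:", keep)
# Drop the rest, retrain, and repeat until the probe ranks last;
# then train the final model without the probe.
```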
Regression
Where squared error (MSE) comes from
But have you ever wondered why we specifically use the squared error?
See, many functions can potentially minimize the difference between observed
and predicted values. But of all the possible choices, what is so special about the
squared error?
Here, epsilon is an error term that captures the random noise for a specific data
point (i).
We assume the noise is drawn from a Gaussian distribution with zero mean
based on the central limit theorem:
Thus, the probability of observing the error term can be written as:
Substituting the error term from the linear regression equation, we get:
For a specific set of parameters θ, the above tells us the probability of observing a
data point (i).
We further write it as a product for individual data points because we assume all
observations are independent.
Thus, we get:
Since the log function is monotonic, we use the log-likelihood and maximize it.
This is called maximum likelihood estimation (MLE).
Simplifying, we get:
To reiterate, the objective is to find the θ that maximizes the above expression.
But the first term is independent of θ. Thus, maximizing the above expression is
equivalent to minimizing the second term. And if you notice closely, it’s precisely
the squared error.
Thus, you can maximize the log-likelihood by minimizing the squared error. And
this is the origin of least-squares in linear regression. See, there’s clear proof and
reasoning behind using squared error as a loss function in linear regression.
Sklearn linear regression has no hyperparameters
Almost all ML models we work with have some hyperparameters, such as:
● Learning rate
● Regularization
● Layer size (for neural network), etc.
But as shown in the image below, why don’t we see any hyperparameter in
Sklearn’s Linear Regression implementation?
How does OLS work?
But because X might be a non-square matrix, its inverse may not be defined.
To resolve this, first, we multiply with the transpose of X on both sides, as shown
below:
Next, we take the collective inverse of the product to get the following:
● No hyperparameters.
● No randomness. Thus, it will always return the same solution, which is also
optimal.
Of course, do note that there is a significant tradeoff between run time and
convenience when using OLS vs. gradient descent.
Thus, when we have many features, it may not be a good idea to use the
LinearRegression() class. Instead, use the SGDRegressor() class from Sklearn.
Thus, when we use OLS, we trade run-time for finding an optimal solution
without hyperparameter tuning.
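To see both sides of the trade-off, here is a minimal sketch that computes the closed-form OLS solution directly with NumPy and confirms that sklearn’s LinearRegression lands on the same answer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 0.1, 200)

# Closed-form OLS: theta = (X^T X)^-1 X^T y (with an explicit intercept column)
Xb = np.column_stack([np.ones(len(X)), X])
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

sk = LinearRegression().fit(X, y)
print(theta)                       # [intercept, coefficients...]
print(sk.intercept_, sk.coef_)     # the same solution, no hyperparameters involved
```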
Poisson regression vs. linear regression
Linear regression comes with its own set of challenges/assumptions. For
instance, after modeling, the output can be negative for some inputs.
But this may not make sense at times — predicting the number of goals scored,
number of calls received, etc. Thus, it is clear that it cannot model count (or
discrete) data.
For instance:
Thus, if the above assumptions do not hold, linear regression won’t help.
Instead, in this specific case, what you may need is Poisson regression.
It is a type of generalized linear model (GLM) that is used to model count data. It
works by estimating a Poisson distribution parameter (λ), which is directly linked
to the expected number of events in a given interval.
For instance:
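As a concrete (and entirely hypothetical) illustration, here is a minimal sketch of fitting a Poisson GLM with statsmodels on simulated count data; the coefficients 0.4 and 1.1 are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
lam = np.exp(0.4 + 1.1 * x)              # log link: log(lambda) = 0.4 + 1.1 * x
y = rng.poisson(lam)                     # count-valued target

X = sm.add_constant(x)
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print(poisson_model.params)              # roughly [0.4, 1.1]
print(poisson_model.predict(X[:3]))      # expected counts, never negative
```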
How to build linear models
In this chapter, I will help you cultivate what I think is one of the MOST
overlooked and underappreciated skills in developing linear models.
I can guarantee that harnessing this skill will give you a lot of clarity and
intuition in the modeling stages.
It’s just that, in our specific use case, the data generation process didn’t perfectly
align with what linear regression is designed to handle. In other words, earlier
when we trained a linear regression model, we inherently assumed that the data
was sampled from a normal distribution. But that was not true in this Poisson
regression case.
You’d start appreciating the importance of data generation when you realize that
literally every member of the generalized linear model family stems from altering
the data generation process.
For instance:
See…
Every linear model makes an assumption and is then derived from an underlying
data generation process.
Thus, developing a habit of holding for a second and thinking about the data
generation process will give you so much clarity in the modeling stages.
I am confident this will help you get rid of that annoying and helpless habit of
relentlessly using a specific sklearn algorithm without truly knowing why you are
using it.
Consequently, you’d know which algorithm to use and, most importantly, why.
This improves your credibility as a data scientist and allows you to approach data
science problems with intuition and clarity rather than hit-and-trial.
In fact, once you understand the data generation process, you will automatically
get to know about most of the assumptions of that specific linear model.
Dummy variable trap
This is often called the Dummy Variable Trap. It is bad because the model has
redundant features. Moreover, the regression coefficients aren’t reliable in the
presence of multicollinearity.
Visually assess linear regression performance
Linear regression assumes that the model residuals (actual - predicted) are
normally distributed. If the model is underperforming, it may be due to a
violation of this assumption.
Here, I often use a residual distribution plot to verify this and determine the
model’s performance. As the name suggests, this plot depicts the distribution of
residuals (actual - predicted), as shown below:
Thus, the more normally distributed the residual plot looks, the more confident
we can be about our model. This is especially useful when the regression line is
difficult to visualize, i.e., in a high-dimensional dataset.
Why?
Of course, this was just about validating one assumption — the normality of
residuals.
Statsmodel regression summary
Statsmodel provides one of the most comprehensive summaries for regression
analysis.
Yet, I have seen so many people struggling to interpret the critical model details
mentioned in this report. In this chapter, let’s understand the entire summary
report provided by statsmodel and why it is so important.
Section 1
The first column of the first section lists the model’s settings (or config). This
part has nothing to do with the model’s performance.
If your data has categorical features, statsmodel will one-hot encode them. But in
that process, it will drop one of the one-hot encoded features.
This is done to avoid the dummy variable trap, which we discussed in an earlier
chapter (this chapter).
○ For instance, in this case, 0.927 means that the current model
captures 92.7% of the original variability in the training data.
○ Statsmodel reports R² on the training data, so you must not overly
optimize for it. If you do, it will lead to overfitting.
● Adj. R-squared:
○ This adjusts R² for the number of features, so adding features that do
not genuinely improve the fit is penalized.
● Log-Likelihood:
○ This tells us the log-likelihood that the given data was generated by
the estimated model.
○ The higher the value, the more likely the data was generated by this
model.
● AIC and BIC:
○ Like adjusted R-squared, these are performance metrics to
determine goodness of fit while penalizing complexity.
○ Lower AIC and BIC values indicate a better fit.
Section 2
The second section provides details related to the features:
● t and P>|t|:
○ Earlier, we used F-statistic to determine the statistical significance
of the model as a whole.
○ t-statistic is more granular on that front as it determines the
significance of every individual feature.
○ P>|t| is the associated p-value with the t-statistic.
○ A small p-value (typically less than 0.05) indicates that the feature is
statistically significant.
○ For instance, the feature “X” has a p-value of ~0.6. This means that if “X”
truly had no effect on “Y”, we would still see an estimate this large about 60% of the time, so there is no evidence that “X” is significant.
○ See, the coefficients we have obtained from the model are just
estimates. They may not be absolute true coefficients of the process
that generated the data.
○ Thus, the estimated parameters are subject to uncertainty, aren’t
they?
○ Note: The interval between the [0.025, 0.975] quantiles covers 95% of the
distribution, roughly the area within two standard deviations (about ±1.96)
of the mean of a normal distribution.
○ For instance, the interval for sin_X is (0.092, 6.104). So although the
estimated coefficient is 3.09, we can be 95% confident that the true
coefficient lies in the range (0.092, 6.104).
Section 3
Details in the last section of the report test the assumptions of linear regression.
● Durbin-Watson:
○ This measures autocorrelation between residuals.
○ Autocorrelation occurs when the residuals are correlated, indicating
that the error terms are not independent.
○ But linear regression assumes that residuals are not correlated.
○ The Durbin-Watson statistic ranges between 0 and 4.
■ A value close to 2 indicates no autocorrelation.
■ Values closer to 0 indicate positive autocorrelation.
■ Values closer to 4 indicate negative autocorrelation.
● Jarque-Bera (JB) and Prob(JB):
○ They solve the same purpose as Omnibus and Prob(Omnibus) —
measuring the normality of residuals.
● Condition Number:
○ This tests multicollinearity.
○ Multicollinearity occurs when two features are correlated, or two or
more features determine the value of another feature.
○ A standalone value for Condition Number can be difficult to
interpret so here’s how I use it:
■ Add features one by one to the regression model and notice
any spikes in the Condition Number.
● The first section tells us about the model’s config, the overall performance
of the model, and its statistical significance.
● The second section tells us about the statistical significance of individual
features, the model’s confidence in finding the true coefficient, etc.
● The last section lets us validate the model’s assumptions, which are
immensely critical to linear regression’s performance.
Now you know how to interpret the entire regression summary from statsmodel.
Generalized linear models (GLMs)
A linear regression model is undeniably an extremely powerful model, in my
opinion. However, it makes some strict assumptions about the type of data it can
model, as depicted below.
We need models that relax these assumptions, and generalized linear models (GLMs)
precisely do that. They relax the assumptions of linear regression to make linear
models more adaptable to real-world datasets.
Why GLMs?
Linear regression is pretty restricted in terms of the kind of data it can model.
For instance, its assumed data generation process looks like this:
Zero-inflated regression
The target variable of typical regression datasets is somewhat evenly distributed.
But, at times, the target variable may have plenty of zeros. Such datasets are
called zero-inflated datasets.
They may raise many problems during regression modeling. This is because a
regression model can not always predict exact “zero” values when, ideally, it
should. For instance, consider simple linear regression. The regression line will
output exactly “zero” only once (if it has a non-zero slope).
This issue persists not only in higher dimensions but also in complex models like
neural nets for regression.
● First, train a binary classifier to predict whether the true target is zero or non-zero.
● Next, train a regression model only on those data points with a non-zero
true target.
During prediction:
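Putting the two stages together, a minimal sketch might look as follows. The prediction rule used here (predict zero unless the classifier says otherwise, then fall back to the regressor) is my reading of the approach described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))
y = np.where(rng.random(2_000) < 0.6, 0.0,
             np.abs(3 * X[:, 0] + rng.normal(0, 0.5, 2_000)))   # zero-inflated target

clf = GradientBoostingClassifier().fit(X, (y > 0).astype(int))  # zero vs. non-zero
reg = GradientBoostingRegressor().fit(X[y > 0], y[y > 0])       # non-zero rows only

is_nonzero = clf.predict(X).astype(bool)
y_pred = np.where(is_nonzero, reg.predict(X), 0.0)
print(y_pred[:10])
```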
Its effectiveness over the regular regression model is evident from the image
below:
Huber regression
One big problem with regression models is that they are sensitive to outliers.
Consider linear regression. Even a few outliers can significantly impact Linear
Regression performance, as shown below:
And it isn’t hard to identify the cause of this problem. Essentially, the loss
function (MSE) scales quickly with the residual term (true-predicted).
Thus, even a few data points with a large residual can impact parameter
estimation.
Huber loss (or Huber Regression) precisely addresses this problem. In a gist, it
attempts to reduce the error contribution of data points with large residuals.
One simple, intuitive, and obvious way to do this is by applying a threshold (δ) on
the residual term:
● If the residual is smaller than the threshold, use MSE (no change here).
● Otherwise, use a loss function which has a smaller output than MSE —
linear, for instance.
● For residuals smaller than the threshold (δ) → we use MSE.
● Otherwise, we use a linear loss function, which has a smaller output than MSE.
Its effectiveness is evident from the following image:
● Linear regression is affected by outliers.
● Huber regression is more robust.
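A minimal sketch with scikit-learn’s HuberRegressor (its epsilon parameter plays the role of the threshold discussed here, applied to scaled residuals):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 200)
y[:10] += 60                                     # inject a few large outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)

print(ols.coef_, huber.coef_)   # Huber's slope stays much closer to the true value (3)
```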
How do we determine the threshold (δ)?
While trial and error is one way, I often like to create a residual plot, as depicted
below. This plot is generally called a lollipop plot because of its
appearance.
One good thing is that we can create this plot for any dimensional dataset. The
objective is just to plot (true-predicted) values, which will always be 1D.
Here’s another interesting idea.
By using a linear loss function in Huber regressor, we intended to reduce the
large error contributions that would have happened otherwise by using MSE.
Thus, we can further reduce that error contribution by using, say, a square root
loss function, as shown below:
It is clear that the error contribution of the square root loss function is the lowest
for all residuals above the threshold δ.
Decision Trees and Ensemble Methods
Condense a random forest into a decision tree
There’s an interesting technique, using which, we can condense an entire random
forest model into a single decision tree.
The benefits?
Technique walkthrough
Let’s fit a decision tree model on the following dummy dataset. It produces a
decision region plot shown on the right.
In fact, we must note that, by default, a decision tree can always 100% overfit any
dataset (we will use this information shortly). This is because it is always allowed
to grow until all samples have been classified correctly.
Next, let’s fit a random forest model instead. This time, the decision region plot
suggests that we don’t have a complex decision boundary. The test accuracy has
also improved (69.5% to 74%).
We know that the random forest model has learned some rules that generalize on
unseen data.
So, how about we train a decision tree on the predictions generated by the
random forest model on the training set?
● Train a random forest model. This will learn some rules from the training
set which are expected to generalize on unseen data (due to Bagging).
● Generate predictions on X, which produces the output y'. These
predictions will capture the essence of the rules learned by the random
forest model.
● Finally, train a decision tree model on (X, y'). Here, we want to
intentionally overfit this mapping as this mapping from (X) to (y') is a proxy
for the rules learned by the random forest model.
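A minimal sketch of these three steps on a synthetic dataset (the dataset and model settings are placeholders, not the ones used in the book):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_proxy = rf.predict(X_tr)                       # the forest's rules, as labels

dt = DecisionTreeClassifier(random_state=0).fit(X_tr, y_proxy)   # overfit on purpose

print("random forest test accuracy:", rf.score(X_te, y_te))
print("condensed tree test accuracy:", dt.score(X_te, y_te))
```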
The decision region plot we get with the new decision tree is pretty similar to
what we saw with the random forest earlier:
Measuring the test accuracy of the decision tree and random forest model, we
notice them to be similar too:
In fact, this approach also significantly reduces the run-time, as depicted below:
On top of that, a single condensed tree is far easier to interpret: if we have 100
trees in a random forest, there’s no way we can interpret them all.
A departing note
I devised this very recently. I also tested this approach on a couple of datasets,
and they produced promising results.
But it won’t be fair to make any conclusions based on just two instances.
While the idea makes intuitive sense, I understand there could be some potential
flaws that are not evident right now.
So, I am not saying that you should adopt this technique right away.
Instead, I would advise testing this approach on your random forest use cases.
Consider reverting to me with what you discover.
In the next chapter, let’s understand a technique to transform a decision tree into
matrix operations, which can run on GPUs.
Transform a decision tree into matrix operations
Inference using a decision tree is an iterative process. We traverse a decision tree
by evaluating the condition at a specific node in a layer until we reach a leaf
node.
In this chapter, let’s learn a superb technique to represent inference from a
decision tree in the form of matrix operations.
As a result:
Setup
Consider a binary classification dataset with 5 features.
Let’s say we get the following tree structure after fitting a decision tree on the
above dataset:
Notation
Before proceeding ahead, let’s assume that:
Tree to matrices
Note: The matrices we define below may look a bit arbitrary at first. Follow along
patiently; everything will fall into place once we run the inference step at the end.
Derive all matrices
1) Matrix A
A maps input features to evaluation nodes, so it’s a matrix of shape
(number of features × e), where e is the number of evaluation nodes.
A specific entry is set to “1” if the corresponding node in the column evaluates
the corresponding feature in the row. For instance, in our decision tree, “Node 0”
evaluates “Feature 2”.
Thus, the corresponding entry will be “1” and all other entries will be “0.”
2) Matrix B
The entries of matrix B are the threshold values at each evaluation node. Thus, its shape is
1×e.
3) Matrix C
This is a matrix between every pair of leaf nodes and evaluation nodes, with
evaluation nodes along the rows and leaf nodes along the columns. Thus, its
dimensions are (e × number of leaf nodes).
● “1” if the corresponding leaf node in the column lies in the left sub-tree of
the corresponding evaluation node in the row.
● “-1” if the corresponding leaf node in the column lies in the right sub-tree
of the corresponding evaluation node in the row.
● “0” if the corresponding leaf node and evaluation node have no link.
For instance, in our decision tree, the “leaf node 4” lies in the left sub-tree of
both “evaluation node 0” and “evaluation node 1”. Thus, the corresponding values
will be 1.
4) Matrix (or vector) D
The entries of vector D are the sum of non-negative entries in every column of
Matrix C:
5) Matrix E
If a leaf node classifies a sample to “Class 1”, the corresponding entry will be 1,
and the other cell entry will be 0.
For instance, “leaf node 4” outputs “Class 1”, thus the corresponding entries for
the first row will be (1, 0):
We repeat this for all other leaf nodes to get the following matrix as Matrix E:
With this, we have compiled our decision tree into matrices. To recall, these are
the five matrices we have created so far:
● Matrix A captures which input feature was used at each evaluation node.
● Matrix B stores the threshold of each evaluation node.
● Matrix C captures whether a leaf node lies in the left or right sub-tree of a
specific evaluation node or has no relation to it.
● Matrix D stores the sum of non-negative entries in every column of Matrix
C.
● Finally, Matrix E maps from leaf nodes to their class labels.
Inference using matrices
Say this is our input feature vector X (5 dimensions):
The whole inference can now be done using just these matrix operations:
● XA < B gives:
The final prediction comes out to be “Class 1,” which is indeed correct! Notice
that we carried out the entire inference process using only matrix operations:
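To make the pipeline fully concrete, here is a minimal NumPy sketch on a tiny hypothetical tree (two evaluation nodes, three leaves, five features). It is not the exact tree used above, but it follows the same A, B, C, D, E construction:

```python
import numpy as np

# Hypothetical tree: node 0 tests feature 2 < 0.5 (left -> node 1, right -> leaf 2)
#                    node 1 tests feature 0 < 1.0 (left -> leaf 0, right -> leaf 1)
A = np.zeros((5, 2)); A[2, 0] = 1; A[0, 1] = 1   # feature evaluated at each node
B = np.array([0.5, 1.0])                         # thresholds of the two nodes
C = np.array([[ 1,  1, -1],                      # +1: leaf in left subtree, -1: right
              [ 1, -1,  0]])
D = (C == 1).sum(axis=0)                         # count of +1 entries per leaf
E = np.array([[0, 1],                            # leaf 0 -> class 1
              [1, 0],                            # leaf 1 -> class 0
              [1, 0]])                           # leaf 2 -> class 0

x = np.array([[0.2, 0.0, 0.3, 0.0, 0.0]])        # one input sample
Q = (x @ A < B).astype(int)                      # which split conditions hold
leaf = (Q @ C == D).astype(int)                  # exactly one leaf matches its path
print(leaf @ E)                                  # [[0, 1]] -> class 1
```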
As a result, the inference operation can largely benefit from parallelization and
GPU capabilities.
The run-time efficacy of this technique is evident from the image below:
Interactively prune a decision tree
One thing I really like about decision trees is that their learned rules are easy to
visualize and interpret. This is not always possible with other intuitive and simple
models like linear regression. But decision trees stand out in this respect.
Nonetheless, one thing I often find a bit time-consuming and somewhat
hit-and-trial-driven is pruning a decision tree.
Why prune?
The problem is that, under default conditions, decision trees ALWAYS 100%
overfit the dataset, as depicted in this image:
But the above visualisation is pretty non-elegant, tedious, messy, and static (or
non-interactive). I recommend using an interactive Sankey diagram to prune
decision trees. This is depicted below:
This instantly gives an estimate of the node’s impurity, based on which, we can
visually and interactively prune the tree in seconds. For instance, in the full
decision tree shown below, pruning the tree at a depth of two appears reasonable:
You can download the code notebook for the interactive decision tree here:
https://bit.ly/4bBwY1p. Instructions are available in the notebook.
Why decision trees must be inspected thoroughly after training
Every split in a decision tree is of the form (feature ≤ threshold), i.e., a boundary
perpendicular to one of the feature axes. In other words, every decision tree
progressively segregates the feature space using such perpendicular boundaries
to split the data.
In fact, if we plot this decision tree, we notice that it creates so many splits just to
fit this easily separable dataset, which a model like logistic regression, support
vector machine (SVM), or even a small neural network can easily handle:
It becomes more evident if we zoom into this decision tree and notice how close
the thresholds of its split conditions are:
This is a bit concerning because it clearly shows that the decision tree is
meticulously trying to mimic a diagonal decision boundary, which hints that it
might not be the best model to proceed with. To double-check this, I often do the
following:
For instance, the PCA projections on the above dataset are shown below:
This lets us determine that we might be better off using some other algorithm
instead.
Or, we can spend some time engineering better features that the decision tree
model can easily work with using its perpendicular data splits.
At this point, if you are thinking, why can’t we use the decision tree trained on the
PCA projections (X_pca)?
While nothing stops us from doing that, do note that PCA components are not
interpretable, and maintaining feature interpretability can be important at times.
Thus, whenever you train your next decision tree model, consider spending some
time inspecting what it’s doing.
A learning worth noting
My intention is not to point out flaws of decision trees here, or to suggest that
building models like random forests on top of them is a bad idea.
My point is to encourage you to watch out for the behaviour and split conditions
of decision trees, and to know when their axis-aligned splits do not model your
data well.
Decision trees ALWAYS overfit!
In addition to the above inspection, there’s one more thing you need to be careful
of when using decision trees. This is about overfitting.
The thing is that, by default, a decision tree (in sklearn’s implementation, for
instance), is allowed to grow until all leaves are pure. This happens because a
standard decision tree algorithm greedily selects the best split at each node.
This makes its nodes more and more pure as we traverse down the tree. As the
model correctly classifies ALL training instances, it leads to 100% overfitting, and
poor generalization.
Fitting a decision tree on this dataset gives us the following decision region plot:
It is pretty evident from the decision region plot, the training and test accuracy
that the model has entirely overfitted our dataset.
One way to address this is cost-complexity pruning (CCP). As depicted above, CCP
results in a much simpler and more acceptable decision region plot.
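A minimal sketch of applying cost-complexity pruning in scikit-learn is shown below. The dataset is synthetic, and the choice of alpha (the midpoint of the pruning path) is just a placeholder; in practice you would tune it with cross-validation:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1_000, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(full_tree.score(X_tr, y_tr), full_tree.score(X_te, y_te))   # ~1.0 on train

alphas = full_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2])
pruned.fit(X_tr, y_tr)
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))   # simpler, generalizes better
```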