Building Your First Dataset
Learning Objectives
Before diving into deep learning models, let's take a step back and learn the difference between scalars, vectors, and multi-dimensional arrays such as matrices. Since we'll be using tabular data to train our first model, let's draw analogies from a spreadsheet.
Scalars
A single value is called a scalar.
import torch
scalar = torch.tensor(18)
scalar
tensor(18)
Vectors
A list or one-dimensional array of values, like a single column in a spreadsheet, is called a vector.
vector = torch.tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
vector
tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
Matrices
A two-dimensional array of values, like a table in a spreadsheet, is called a matrix.
Let's create a matrix in PyTorch (we'll use just two columns, mpg and horsepower, to keep it simple):
matrix = torch.tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
matrix
tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
Of course, you'll never have to type in the values from a spreadsheet. We'll conveniently load the values directly from the file, first using pandas,
and then using PyTorch's own data pipes. We'll get back to it in the "Datasets" section.
Tensors
A three-dimensional array of values, like a collection of spreadsheets, each containing data for a given month, is called a tensor.
From then on, be it four or forty-two dimensions, a multi-dimensional array is called a tensor. So, technically speaking, if an array has three or
more dimensions, it is a tensor.
You can easily create tensors in PyTorch using the tensor() method, whether you need a scalar or a multi-dimensional tensor, as we've been doing in the examples above. Moreover, there are methods to create tensors filled with ones, zeros, or random numbers: ones(), zeros(), rand(), and randn(), to name a few.
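For instance (the shapes below are arbitrary, just to illustrate these factory functions):
torch.zeros((2, 3))    # 2x3 matrix of zeros
torch.ones(2)          # vector with two ones
torch.rand(3)          # three values drawn uniformly from [0, 1)
torch.randn((2, 2))    # 2x2 matrix drawn from a standard Normal distribution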
You can get the shape of a tensor using its shape attribute, but PyTorch also implements a size() method that accomplishes the same thing.
vector.shape, vector.size()
(torch.Size([14]), torch.Size([14]))
As expected, the shape of a scalar is empty (torch.Size([])) since scalars are dimensionless (zero dimensions).
matrix.size(), scalar.size()
(torch.Size([2, 14]), torch.Size([]))
While scalars are single numbers, thus having zero dimensions, one- and two-dimensional arrays are called vectors and matrices, respectively,
as we've seen in the examples above. But, in order to make matters simple, it is commonplace to refer to any array with one or more
dimensions as a tensor.
In summary, everything is either a scalar or a tensor. There are tensors for data, and tensors for parameters. Right now, we're dealing with the former, and we'll move on to the latter in the next chapter.
Numpy
NumPy brings the computational power of languages like C and Fortran to Python, a language much easier to learn and use. Thanks to its
performance, Numpy sits at the core of many machine and deep learning libraries such as Scikit-Learn, Scipy, Pandas, and Matplotlib. For this
reason, it is fairly common to load tabular data from other sources, such as CSV or Excel files, into a collection of Numpy arrays. Even when
dealing with images, pixel values are often stored inside Numpy arrays.
PyTorch tensors and Numpy arrays have a lot in common. You may create Numpy arrays using its identically-named methods such as zeros(),
ones(), rand(), and randn(), for example.
Moreover, you can easily switch between the two of them, arrays and tensors, using PyTorch's numpy() and as_tensor() methods. The former
converts a PyTorch tensor into a Numpy array, while the latter creates a PyTorch tensor out of a Numpy array. Let's see them in action.
Numpy array:
import numpy as np
numpy_array = vector.numpy()
numpy_array
array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
PyTorch tensor:
back_to_tensor = torch.as_tensor(numpy_array)
back_to_tensor
tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
There's one caveat, though: only "CPU" tensors can be converted into Numpy arrays. Every tensor we created thus far is, by default, a "CPU"
tensor. We'll learn about different types of tensors shortly, in the "Devices" section.
Reshaping Tensors
One of the most common operations you'll need to perform is to reshape a tensor into a different, well, shape!
One typical case, especially in computer vision, is to convert a multi-dimensional tensor representing features into a single sequence of
features. The figure below illustrates this:
There are two data points, and their corresponding features are organized in a two-by-three shape in the tensor at the top. In order to use these features to train a linear or a logistic regression, however, you'd need to have the features lined up instead. The flattened tensor at the bottom of the figure shows exactly that: the same features for each data point, lined up in a single row.
Although the operation itself is quite simple, there are a few pitfalls you need to avoid while reshaping your tensors. Let's go over a few
examples.
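The cell that created these tensors isn't shown here; assuming a two-by-three tensor of ones and a view() of it (consistent with the shapes and values printed further below), it would look like this. Since view() doesn't copy data, changes made through one tensor show up in the other, as the next cell demonstrates:
original_tensor = torch.ones((2, 3))
reshaped_tensor = original_tensor.view(3, 2)
original_tensor, reshaped_tensor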
original_tensor[0, 1] = 2
original_tensor, reshaped_tensor
Moreover, if you created your tensor from a Numpy array, the two of them, array and tensor, are also sharing the underlying data.
numpy_array[-1] = 1000
numpy_array, vector
(array([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15,
14, 15, 1000]),
tensor([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14,
15, 1000]))
In order to effectively duplicate the data and create a new, independent, tensor, you can use the clone() method instead.
cloned_tensor = original_tensor.clone()
cloned_tensor
Now, if you make changes to the original tensor, they won't be reflected in the new tensor anymore.
original_tensor[0, 1] = 3
original_tensor, cloned_tensor
Now, let's transpose the original tensor and try to flatten it into a single row using view():
transposed_tensor = original_tensor.t()
transposed_tensor.view(1, 6)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-40-13ca45c67fa7> in <cell line: 2>()
1 transposed_tensor = original_tensor.t()
----> 2 transposed_tensor.view(1, 6)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use
.reshape(...) instead.
Remember, view() never makes a copy of the data, while reshape() will copy the data whenever needed, so it always works, even if the tensors are not contiguous.
But, what does it mean to be contiguous? Simply put, it means two elements in the same row must be next to each other in memory. This is
always the case whenever a tensor is created (like our original_tensor), but once we transpose it, we're not actually changing its allocation in
memory. Transposing, in this case, means traversing it differently, that is, jumping to a different position in memory.
We can see the "rules" for moving to the next row or column by checking the tensor's stride method:
original_tensor.stride(), transposed_tensor.stride()
In the original tensor, the stride is telling us that we need to skip three positions in memory to get to the next row, while only one position for the
next column. But, in the transposed tensor, it is the other way around: we need to skip three positions to get to the next column.
If we need to skip two or more positions to get to the next column, it means our tensor is not contiguous anymore. Let's check it out:
transposed_tensor.is_contiguous(), original_tensor.is_contiguous()
(False, True)
Transposed tensor:
transposed_tensor
tensor([[1., 1.],
[3., 1.],
[1., 1.]])
Luckily, you can simply call the contiguous() method, and PyTorch will rearrange the data in memory in such a way that it can be traversed in its typical fashion (a stride of one in the last dimension). If the underlying data happens to be contiguous already, this is a zero-cost operation.
transposed_tensor.contiguous().view(1, 6)
Finally, it is also possible to use the flatten() method instead, in case you're trying to make your tensor one-dimensional.
transposed_tensor.flatten()
Don't worry much about memory allocation, though. The purpose of this section was to make you aware of and capable of addressing the error
message at the top, should you run into it by any chance.
Named Tensors
Named tensors are a long-awaited feature, even if they're still a prototype. Many, if not most, implementation bugs in deep learning models - even worse, the silent kind of bug - arise from using the wrong dimensions in a given operation.
You may be wondering how it is possible that such a serious bug may be a silent one, that is, one that does not raise an exception and crash the application.
In many cases, broadcasting is to blame. Broadcasting is both a blessing and a curse. While it makes it extremely easy to perform operations using tensors of different shapes without the need to explicitly replicate data along some dimension, it may also give you the illusion your operation is the right one, even when it's not, because you messed up the dimensions.
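The example itself isn't shown here; it was presumably something along the lines of multiplying a tensor by a plain Python number (the exact values of a and b are assumptions):
a = torch.ones((3, 3))
b = 2.0
a * b    # returns a 3x3 tensor filled with 2.0s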
You've probably done similar operations many times without giving a second thought to why it works so seamlessly. As it turns out, you have
broadcasting to thank for this behavior. Under the hood, PyTorch (or Numpy) will "stretch" the variable b so its shape matches that of variable a,
thus allowing the desired element-wise multiplication.
Moreover, it is actually more efficient to use broadcasting like that than building a tensor full of 2.0s to match the shapes!
What if we'd like to perform an element-wise multiplication between two tensors of different shapes? Broadcasting has us covered: it will "understand" that mat2 was "meant" to be 3x3 instead.
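The tensors themselves aren't shown here; based on the sizes discussed below, assume mat1 is a 3x3 tensor and mat2 a 1x3 tensor (their values are arbitrary):
mat1 = torch.ones((3, 3))
mat2 = torch.ones((1, 3)) * 2.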
mat1 * mat2
Broadcasting works by comparing dimensions of both tensors from right to left, and it will "match" them if they are equal or one of them is one
(so that particular value will be replicated along that dimension). In the example above, these are the dimensions:
mat1.size(), mat2.size()
The right-most dimension is 3 for both tensors, so it is matched. Moving to the left, the first dimension of one tensor is 1, so it is also matched.
There we go, broadcasting can work its magic! But, beware, if you were to transpose the second tensor (mat2) by mistake, broadcasting still
works!
mat2_wrong_shape = mat2.t()
mat1 * mat2_wrong_shape
What does this mean? It means that, if you transposed one of the tensors by mistake, it may still produce a valid output. If you think it's unlikely
that you'll ever get the dimensions in the wrong order, think again: when it comes to tensors representing batches of images or sequences, it
isn't so uncommon to mix dimensions up.
<ipython-input-51-607113069adf>:1: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable.
named_mat1 = torch.ones((3, 3), names=['R', 'C'])
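The cell creating its companion tensor isn't shown; presumably it was built along the same lines (its exact shape is an assumption, any broadcast-compatible shape would do):
named_mat2 = torch.ones((1, 3), names=['R', 'C'])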
named_mat1 * named_mat2
All is good and well, rows and columns are well aligned, and the result is as expected. Also, notice that the names are propagated to the
resulting tensor.
named_mat1 * named_mat2.t()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-53-7deb473f8df5> in <cell line: 1>()
----> 1 named_mat1 * named_mat2.t()
RuntimeError: Error when attempting to broadcast dims ['R', 'C'] and dims ['C', 'R']: dim 'C' and dim 'R' are at the same position from the right but do
not match.
Great, we got an error! Even though broadcasting would happily return a 3x3 matrix, the misalignment of the dimensions' names prevented that
and rightfully raised an exception warning us of our mistake.
Of course, it only works if both tensors are named. If one of them isn't named, broadcasting keeps working as expected.
named_mat1 * mat2.t()
Devices
So far, all the tensors we have created are "CPU" tensors. It means the tensor is stored in the computer's main memory and any operations
performed on these tensors are handled by its central processing unit, the CPU (e.g. an Intel Core i9 processor). The type of tensor is
designated by the device, a CPU in this case, that handles its operations.
We can easily check the device responsible for a given tensor by checking its device attribute:
device = original_tensor.device
device
device(type='cpu')
But the CPU is not the only device we can use to manipulate tensors. We can also use graphics processing units (GPUs), tensor processing
units (TPUs) or even "meta" (fake) devices. Let's take a look at them!
GPUs were originally designed to render graphics, a job that boils down to performing an enormous number of matrix multiplications in parallel. It turns out that matrix multiplication at scale is also exactly what it takes to train deep learning models. Initially, it wasn't easy to leverage their power for that purpose since programming a GPU was quite challenging. It was NVIDIA's release of CUDA (Compute Unified Device Architecture) and, later on, AMD's ROCm (Radeon Open Compute Ecosystem) that allowed deep learning frameworks such as PyTorch to more easily use them to dramatically speed up training times.
GPUs are freely available on most platforms, such as Google Colab and Kaggle, and you should always check the availability of a GPU before you start training a model. Since these platforms offer CUDA-compatible GPUs, we'll be focusing solely on them.
PyTorch makes it really easy to accomplish that: you only have to make a call to torch.cuda.is_available() and name your device accordingly.
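A common pattern for this, and a minimal sketch of what the omitted cell presumably does:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device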
Once we specify a device, we can send our tensor to it using the aptly named method to():
sent_tensor = original_tensor.to(device)
sent_tensor.device
device(type='cpu')
If a GPU is not available, nothing will happen, and calling to() comes at no cost. So, it's safe to always send your tensors (and later on, your
models) to the specified device. This way, if you share your code with someone else, or if you happen to run it in a different environment in the
future, your code will always leverage the power of a GPU, if one is available to you.
If a GPU is indeed available, the tensor's device will read cuda:0, as it now resides in the memory of the first (and in most cases, only) GPU
available. Moreover, if you check the tensor's type, it will read torch.cuda.FloatTensor.
sent_tensor.type()
'torch.FloatTensor'
If you're lucky enough to have multiple GPUs at your disposal, you can check how many are available to you, and their corresponding names
using torch.cuda.device_count() and torch.cuda.get_device_name(), respectively:
n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
    print(torch.cuda.get_device_name(i))
Once a tensor is sent to a GPU, it cannot be directly brought back to Numpy anymore.
sent_tensor.numpy()
You need to bring them back to the CPU first (either using to('cpu') or cpu()), and only then call the numpy() method.
sent_tensor.cpu().numpy()
Devices: TPU
Unlike GPUs, which were originally designed for gamers, TPUs - tensor processing units - as their name suggests, were designed by Google to
be used for training deep learning models in TensorFlow.
TPUs are available on some platforms, such as Google Colab and Kaggle. Although they have been designed to work with TensorFlow, it's also
possible to leverage their immense power in PyTorch using PyTorch/XLA, a package that connects PyTorch to Google's XLA (accelerated linear
algebra) library.
TPUs can be used to speed up training even more by using all their cores at once through multiprocessing.
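For reference, acquiring a TPU device with PyTorch/XLA looks roughly like the sketch below (it requires the torch_xla package, which comes preinstalled on TPU runtimes; we'll be sticking to CUDA GPUs here, as mentioned above):
import torch_xla.core.xla_model as xm

device = xm.xla_device()    # an XLA device backed by a TPU core
# tensors and models can then be moved there with to(device), just like with a GPU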
Devices: Meta
In order to load a model from disk, though, you need to create an instance of the untrained model first, so you have somewhere to load the model into. Meta devices allow you to create "dummy", empty models that would be too large to fit in memory, thus making it possible to work around hardware constraints by only partially loading a model.
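The cells creating the "meta" tensors aren't shown here, but building a small fake tensor on the meta device would look something like this (the shape is arbitrary):
fake_tensor = torch.empty((2, 3), device='meta')
fake_tensor    # no data is allocated; only shape and dtype are tracked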
Mission accomplished! The fake tensor does not contain any data, as expected. Now let's create a REALLY huge fake tensor.
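Any shape totaling 10 billion elements would do; for instance (again, a sketch):
huge_fake_tensor = torch.empty((100_000, 100_000), device='meta')    # 10 billion float32 values, roughly 40 GB if it were real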
The tensor above, should it be a real tensor, would have 10 BILLION 32-bit float elements. Let's see what happens if we try to create a regular
tensor of the same size.
Unless your computer has over 40 gigabytes of free RAM, you'll get an error. Fake tensors are useful to handle tensors - and models - that are
too large to fit into memory.
We won't be using these tensors in this course, but if you venture into using really large models, you're already aware of your options.
Datasets
It is time to get our hands a little dirty with some tiny, yet real, data. Let's start by loading the Auto MPG Dataset directly from the UCI Machine
Learning Repository using pandas' read_csv() method and a URL.
Its description reads: "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and
5 continuous attributes."
In this dataset, values are separated by spaces, missing values are represented by a question mark. The columns, or attributes, as stated in the
repository, are as follows:
mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
The last column, car name, is actually separated by tabs (instead of spaces), so we're considering the cars' names as comments while loading
the dataset.
Pandas
To load tabular data, such as CSV or Excel files, one of the most popular choices is the Pandas package, an open-source data analysis and
manipulation tool. Pandas' strength lies in its dataframes, a spreadsheet-like structure that contains two-dimensional data and its
corresponding labels. A dataframe is composed of a sequence of series, each series representing a column, its values stored as Numpy arrays.
You can use methods such as read_csv() and read_excel() to load your data, each method offering plenty of arguments to account for different
separators, the existence or not of column headers, comments, missing data, and more. We'll be using the former to load our dataset.
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']
df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
df
A dataframe can be easily sliced, both column- and row-wise. Retrieving values from a single column (thus resulting in a Pandas series) is as
simple as that:
df['mpg']
mpg
0 18.0
1 15.0
2 18.0
3 16.0
4 17.0
... ...
393 27.0
394 44.0
395 32.0
396 28.0
397 31.0
dtype: float64
A Pandas series works as a wrapper around the underlying Numpy array that contains its data, which you can retrieve using the values attribute:
df['mpg'].values[:5]
array([18., 15., 18., 16., 17.])
You can also slice multiple columns at once, which results in another, smaller, dataframe:
df[['mpg', 'hp']]
mpg hp
0 18.0 130.0
1 15.0 165.0
2 18.0 150.0
3 16.0 150.0
4 17.0 140.0
The dataframe itself also has its own values attribute, which will give you access to a two-dimensional Numpy array containing the whole data:
df[['mpg', 'hp']].values[:5]
array([[ 18., 130.],
       [ 15., 165.],
       [ 18., 150.],
       [ 16., 150.],
       [ 17., 140.]])
To subset the dataframe, you can use its iloc attribute, which allows for selecting rows based on their index:
df.iloc[:5]
It is also possible to use a boolean Series to conditionally subset the rows of a dataframe:
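The original example isn't shown here; a sketch of this kind of filtering (the condition is just for illustration) would be:
df[df['mpg'] > 40]    # keeps only the rows for which the condition holds True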
Train-Validation-Test Split
The purpose of the split is to simulate the arrival of new data, so you can make adjustments to your model if training doesn't go well and, once
you're happy with it, to make a final assessment before going live with it. Each split, training, validation, and test, has its own purpose, as
described below. It is also important to highlight that the split should always be the first thing you do—no preprocessing, no transformations;
nothing happens before the split.
Training Set: the data you use to train your model - you can use and abuse this data!
Validation Set: the data you should only use for hyper-parameter tuning, that is, comparing differently parameterized models trained on
the training data, to decide which parameters are best. You should use, but not abuse this data, as it is intended to provide an unbiased
evaluation of your model and, if you mess around with it too much, you'll end up incorporating knowledge about it in your model without
even noticing.
Test Set: the data you should use only once, when you are done with everything else, to check if your model is still performing well. We like
to pretend this is data from the "future" - that particular day in the future when our model is ready to give it a go in the real world! So, until
that day, we cannot know this data, as the future hasn't arrived yet.
If we take a look at the first values of the year column, we can see that the data comes with a predefined ordering:
df['year'].values[:50]
array([70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71,
71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71])
Although it may make sense if you're handling data in a spreadsheet (making it easier for you to look something up in it), a predefined ordering
is potentially an issue for training and evaluating a model. Ideally, we would like to have all our sets - training, validation, and test - containing
similar information. Taking the example of the year column, we'd like to have cars from 1970 to 1982 in all three sets. The easiest and fastest
way to accomplish that is to simply shuffle the data first.
We can use the sample() method of a Pandas dataframe to sample, that is, to draw data points from the dataframe in random order. The trick here is to draw the whole dataset using sample(frac=1), so we get our full dataframe back, but in a different order. The resulting dataframe still has its original index values, but we can easily drop them using the reset_index() method.
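The shuffling cell isn't shown here; a minimal sketch of it (the random seed is an arbitrary choice) would be:
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)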
To actually perform the split, we'll use Scikit-Learn's train_test_split() twice, once for splitting the data into train and test sets, and then to
subdivide the training data into train and validation sets.
from sklearn.model_selection import train_test_split
trainval, test = train_test_split(shuffled, test_size=0.16, shuffle=False)
train, val = train_test_split(trainval, test_size=0.2, shuffle=False)
Ensuring the quality and consistency of your data is of the utmost importance. The most basic checks you can do are looking for missing values and outliers in your data. Let's start with the latter. Outliers are values that are, literally, "off the charts": they may be produced by measurement or input (e.g. typing) errors, in which case they are not real and thus must be handled; but they may also be legitimate, sometimes indicating an anomaly, in which case they may be exactly the target of your model. Outliers of the first kind, the errors, may affect model training negatively and badly skew its predictions. There are many techniques for removing outliers, but we won't be delving into this topic here.
Unlike outliers, missing values are easy to spot: they show up as NaN (Not a Number) in a dataframe. In deep learning models, NaN values propagate like an infectious disease: any operation between an actual number and a NaN value results in another NaN value.
The process of "fixing" missing values is called imputation, that is, replacing the missing data with substituted values. There are many
techniques to accomplish this, from using the mean or median of the corresponding column in the training data, to more sophisticated
approaches using other Machine Learning algorithms to "predict" the missing value.
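For illustration only (we won't actually run this, since below we simply drop the offending rows instead), a mean-based imputation using a statistic computed on the training set would look roughly like this:
hp_mean = train['hp'].mean()                     # statistic computed on the training set only
train_imputed = train.fillna({'hp': hp_mean})    # the same training statistic would also be applied to val and test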
Any imputation is based on assumptions you make about its nature, so it will necessarily lead to slightly modifying the data distribution.
Alternatively, if you can afford to lose some data points, it's also possible to simply discard any data points containing missing values.
is_missing_attr = train.isna()
n_missing_attr = is_missing_attr.sum(axis=1)
train[n_missing_attr > 0]
There are three cars with missing horsepower information in our training set. While tree-based algorithms such as Random Forests (RF) or
Gradient-Boosted Trees (GBT) can easily handle missing data, missing values are a big no-no when it comes to neural networks.
In order to keep things simple in our small example, let's simply drop any rows that contain missing values.
train.dropna(inplace=True)
train
Then, let's do the same for our validation and test sets. If we had chosen to perform missing value imputation, we would have to apply the same
rules used in the training set for the validation and test sets as well.
val.dropna(inplace=True)
test.dropna(inplace=True)
You should never use the validation or test sets as a source for any kind of data preprocessing (such as imputing data). Using statistics computed on the validation or test sets is akin to using statistics from the future, that is, computed on the data your users will eventually send to your application or model. Obviously, you cannot know these values beforehand, and using statistics based on the validation or test sets is a serious data leakage that will make your models look great during evaluation, even though they're likely to perform poorly when actually deployed.
You can see that mpg, displacement, horsepower, weight, and acceleration are continuous attributes, that is, they may be any numeric value.
Continuous attributes are the bread and butter of deep learning models. We've already discussed that these models cannot handle missing
values and, as it turns out, they may also have issues with values spread over wildly different ranges. When it comes to deep learning models,
predictable and, better yet, zero-centered ranges for features are a must.
Let's see how the attributes (other than fuel consumption, mpg, which is the target of our prediction) fare in their own ranges of values:
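The list of continuous attributes isn't shown in this excerpt; judging by how cont_attr[1:] is used below, it was presumably defined along these lines:
cont_attr = ['mpg', 'disp', 'hp', 'weight', 'acc']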
train_features = train[cont_attr[1:]]
train_features.hist()
It doesn't look very good: not only are the ranges quite different from one another, they are also nowhere near zero-centered, as expected from real-world physical attributes such as weight. There's no such thing as a negative weight (except, maybe, for theoretical physicists!).
So, what do we do about it and, most importantly, why do we have to do something about it? Let's start with the latter: without going into much
detail, it suffices to know for now that deep learning models are more easily trained if the attributes or features used to train them display
values in symmetrical ranges, preferably in low values, such as from minus three to three. Otherwise, they may exhibit problematic behaviors
during training, failing to converge to a solution.
Therefore, it is best practice to bring all values to a more "digestible" range for the sake of the model's health. The procedure that accomplishes this is called standardization or, sometimes, normalization. It consists of subtracting the attribute's mean (thus zero-centering it) and dividing the result by its standard deviation (thus giving it unit standard deviation). The resulting attributes will exhibit similar ranges.
train_means = train_features.mean()
train_standard_deviations = train_features.std()
train_means, train_standard_deviations
(disp 195.456439
hp 105.087121
weight 2984.075758
acc 15.432955
dtype: float64,
disp 106.255830
hp 39.017837
weight 869.802063
acc 2.743941
dtype: float64)
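The standardization cell itself isn't shown here; it presumably mirrors the one applied to the validation set further below:
train_standardized_features = (train_features - train_means)/train_standard_deviations
train_standardized_features.mean(), train_standardized_features.std()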
(disp -1.059758e-16
hp -1.648513e-16
weight 7.569702e-17
acc 4.003532e-16
dtype: float64,
disp 1.0
hp 1.0
weight 1.0
acc 1.0
dtype: float64)
Their means are zero, and their standard deviations are one, so it looks good. Let's visualize them:
train_standardized_features.hist()
As you can see, standardization doesn't change the shape of the distribution, it only brings all the features to a similar footing when it comes to
their ranges.
Even though we've standardized our continuous features manually, we don't have to do it like that. Scikit-Learn offers a StandardScaler class
that can do this for us and, as we'll see later, we can also use PyTorch's own transformations to standardize values, even if they are pixel values
on images!
We used the training set to define the standardization parameters, namely, the means and standard deviations of our features. Now, we need to standardize the validation and test sets using those same parameters.
Never use the validation and test sets to compute parameters for standardization, or for any other preprocessing step!
val_features = val[cont_attr[1:]]
val_standardized_features = (val_features - train_means)/train_standard_deviations
val_standardized_features.mean(), val_standardized_features.std()
(disp -0.089282
hp -0.151349
weight -0.121501
acc 0.139288
dtype: float64,
disp 0.946465
hp 0.917051
weight 0.918898
acc 0.958077
dtype: float64)
Notice that the resulting means and standard deviations aren't quite zero and one, respectively. That's expected since the validation set should
have a similar, yet not quite exactly the same, distribution as the training set.
If you ever get a perfect zero mean and unit standard deviation on a standardized validation set, there's a good chance you're making the mistake of using statistics computed on the validation set itself.
test_features = test[cont_attr[1:]]
test_standardized_features = (test_features - train_means)/train_standard_deviations
We'll get back to the topic of standardization/normalization a couple more times. First, we'll use Scikit-Learn's StandardScaler for the task, and
then we'll learn about normalizing batches of data using PyTorch's own batch normalization.
The StandardScaler is part of Scikit-Learn, "an open source machine learning library that supports supervised and unsupervised learning. It also
provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities."
It is a convenient way of avoiding manually standardizing continuous features as we just did. All it takes is to call its fit() method on the training
set to compute the appropriate statistics (mean and standard deviation), and then apply the standardization to all datasets using its
transform() method.
The fit() method takes a feature matrix X, a Numpy array usually in the shape (n_samples, n_features). We can easily retrieve the two-
dimensional Numpy array that contains the underlying data of our dataframe and we have everything we need to have a functioning
StandardScaler.
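The fitting cell isn't shown here; a sketch of it, assuming the scaler is fit on the training features' underlying array:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_features.values)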
StandardScaler()
If we want, we can also check the computed statistics (it computes variance instead of standard deviation, though):
scaler.mean_, scaler.var_
Once it has statistics (computed on the training set only), you can apply it to all your datasets:
standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(
To streamline this process, we can write a standardize() function that:
takes a Pandas dataframe, a list of column names that are continuous attributes, and an optional scaler
creates and trains a Scikit-Learn's StandardScaler if one isn't provided as an argument
returns a PyTorch tensor containing the standardized features and an instance of Scikit-Learn's StandardScaler
def standardize(df, cont_attr, scaler=None):
    # selects the continuous columns as a (n_samples, n_features) Numpy array
    cont_X = df[cont_attr].values
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(cont_X)
    cont_X = scaler.transform(cont_X)
    cont_X = torch.as_tensor(cont_X, dtype=torch.float32)
    return cont_X, scaler
standardized_data = {}
# The training set is used to fit a scaler
standardized_data['train'], scaler = standardize(train_features, cont_attr[1:])
# The scaler is used as argument to the other datasets
standardized_data['val'], _ = standardize(val_features, cont_attr[1:], scaler)
standardized_data['test'], _ = standardize(test_features, cont_attr[1:], scaler)
The three remaining attributes, cylinders, model year, and origin, are multi-valued discrete, that is, there's a set of values each one of them may assume. The cars in the dataset may have 3, 4, 5, 6, or 8 cylinders, but no car may have 3.45 cylinders, for example. Even though the values are discrete, there's still an underlying order to them: 8 cylinders are, indeed, twice as many as 4 cylinders, but that's not always the case for discrete attributes.
Let's take a look at the origin attribute. The cars come from three different, although unnamed, countries: 1, 2, and 3. The choice of numerical
representation for countries may be misleading, since country "3" is not three times as much as country "1". It would have probably been better
to use letters or abbreviations instead just to make the categorical nature of the attribute more evident.
Sometimes, like in the case of cylinders, discrete attributes can be grouped together with continuous attributes as numeric attributes. More
often than not, though, discrete attributes are considered categorical attributes, thus requiring some extra pre-processing to be handled by deep
learning models.
These pre-processing techniques involve converting each possible value in a categorical attribute into a numerical array of a given length, but
not necessarily the same length as the number of unique values. The process may be called encoding or embedding, depending on how it's
performed.
Let's take a look at this process. Our goal here is to convert each possible value in a discrete or categorical attribute into a numerical array of a
given length (that does not need to match the number of unique values). Before converting them into arrays, though, we need to encode them
as sequential numbers first.
Let's see what this looks like for the cyl attribute of our training dataset. It has only five unique values: 3, 4, 5, 6, and 8 cylinders.
cyls = sorted(train['cyl'].unique())
cyls
[3, 4, 5, 6, 8]
year = sorted(train['year'].unique())
year
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]
origin = sorted(train['origin'].unique())
origin
[1, 2, 3]
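The cell building the mapping from unique values to sequential indices isn't shown; it presumably looked something like this:
cyls_map = {c: i for i, c in enumerate(cyls)}
cyls_map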
{3: 0, 4: 1, 5: 2, 6: 3, 8: 4}
Now imagine there's a lookup table with as many entries as unique values, each entry being a numerical array of a given length (say, eight
elements). Let's create such a lookup table filled with random values as an illustration:
n_dim = 8
lookup_table = torch.randn((len(cyls), n_dim))
lookup_table
There are five rows, each corresponding to a unique number of cylinders. Three cylinders, according to our mapping dictionary, corresponds to
the first (index zero) row. Four cylinders, to the second (index one) row, and so on, and so forth.
Let's say we'd like to retrieve the numerical array corresponding to six cylinders. We apply the mapping to find the corresponding index
(cyls_map[6]) and use the result to actually slice the corresponding row from the lookup table (lookup_table[idx]):
idx = cyls_map[6]
lookup_table[idx]
There we go! Now, any number of cylinders can easily be mapped to a sequence of eight numerical values. It is as if any given number of
cylinders, a categorical attribute, were now represented by eight numerical features instead. We have just (re)invented embeddings! The fact
that these numbers are random is not necessarily an issue: we can simply turn the whole lookup table into parameters of the model itself, so
they are also learned during training. The model will learn the best way to represent each value in a categorical attribute as a sequence of
numerical attributes! How cool is that?
PyTorch offers an Embedding class that wraps a random tensor like the one we've just created. This is actually a layer, and we'll see how layers
work in more detail in the next chapter. For now, it should suffice to know that its arguments are the same as our own: the number of unique
values, and the desired number of elements - or dimensions - in the returned numerical array.
import torch.nn as nn
emb_table = nn.Embedding(len(cyls), n_dim)
The embedding layer, like any other layer in PyTorch, is also a model. Its weights are, surprise, surprise, the lookup table itself. Besides, since it's
a model, it can be called as such and its expected input is a batch of indices. Let's try it out and see what we get out of it:
idx = cyls_map[6]
emb_table(torch.as_tensor([idx]))
There we go, you created your first embeddings! Embeddings are an important part of modern deep learning, and a fundamental piece of
natural language processing, as we'll see in later chapters. Notice that the values are actually different from our previous example because the
newly created emb_table instance initializes its own random tensor under the hood.
A special case of embedding is the one-hot encoding (OHE) approach: instead of letting the model learn it during training, the mapping is fixed.
In OHE, the numerical array has the same length as the number of unique values and it has only one nonzero element. It works as if each unique
value were a dummy variable, for example: cyl3, cyl4, cyl5, cyl6, and cyl8, and only one of those dummy variables may have a nonzero value.
ohe_table = torch.eye(len(cyls))
ohe_table
idx = cyls_map[6]
ohe_table[idx]
Even though the embeddings themselves are going to be part of the model, we still need to convert our categorical features into their
corresponding sequential indices, so we can use them to retrieve the right values from the embeddings' internal lookup table.
Instead of building dictionaries to manually encode categorical values into their sequential indices, though, we can use yet another Scikit-Learn
preprocessing utility: the OrdinalEncoder. It works in a similar fashion as the StandardScaler: you can use its fit() method so it builds the
mapping between the original values and their corresponding sequential indices, and then you can call its transform() method to actually
perform the conversion. Let's see an example of this:
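The cell that fits the encoder isn't shown here; a sketch of it follows (the disc_attr list is inferred from how it is used below):
from sklearn.preprocessing import OrdinalEncoder

disc_attr = ['cyl', 'year', 'origin']
encoder = OrdinalEncoder()
encoder.fit(train[disc_attr])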
OrdinalEncoder()
We can check the categories found for each one of the attributes (cylinders, year, and origin, in our case):
encoder.categories_
[array([ 3,  4,  5,  6,  8]),
 array([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]),
 array([1, 2, 3])]
Each value in a given list will be converted into its corresponding sequential index, and that's exactly what the transform() method does:
train_cat_features = encoder.transform(train[disc_attr])
train_cat_features[:5]
Let's take a quick look at the resulting encoding for the first row:
the first column (cylinders) is three, thus corresponding to the fourth value in the first list of categories, that is, six
the second column (year) is five, thus corresponding to the sixth value in the second list of categories, that is, 75
the third column (origin) is zero, thus corresponding to the first value in the third list of categories, that is, one
train[disc_attr].iloc[0]
cyl 6
year 75
origin 1
dtype: int64
Once again, to better streamline the process, we can write a function quite similar to the previous one:
takes a Pandas dataframe, a list of column names that are categorical attributes, and an optional encoder
creates and trains a Scikit-Learn's OrdinalEncoder if one isn't provided as an argument
returns a PyTorch tensor containing the encoded categorical features and an instance of Scikit-Learn's OrdinalEncoder
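The definition itself isn't shown here; a sketch of an encode() function matching the description above, mirroring the standardize() function (the dtype is an assumption, long integers being what embedding lookups expect):
def encode(df, disc_attr, encoder=None):
    # selects the categorical columns as a (n_samples, n_features) Numpy array
    cat_X = df[disc_attr].values
    if encoder is None:
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    cat_X = encoder.transform(cat_X)
    cat_X = torch.as_tensor(cat_X, dtype=torch.long)
    return cat_X, encoder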
cat_data = {}
cat_data['train'], encoder = encode(train, disc_attr)
cat_data['val'], _ = encode(val, disc_attr, encoder)
cat_data['test'], _ = encode(test, disc_attr, encoder)
The resulting features are nothing but indices now. Later on, for each column in the results (which corresponds to a particular categorical
attribute) we'll use its values to retrieve their embeddings. In our example with the cyl column (the first categorical attribute), it will look like this:
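The original cell isn't shown here; retrieving the embeddings for that column would look something like this sketch (emb_table being the embedding layer created earlier):
cyl_idx = cat_data['train'][:, 0]    # sequential indices for the cyl column
emb_table(cyl_idx)[:3]               # one 8-dimensional vector per data point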
In our example, we're indeed trying to predict fuel consumption (the mpg attribute), so ours is a regression task. We're starting with a simple
linear regression with a single feature, that is, we'll be using only one (continuous) attribute to predict our target, fuel consumption. Of course,
later on, we'll expand our problem into a multivariate linear regression, thus including all (continuous) attributes at first, and then add the
categorical attributes to the mix while training a non-linear model in Lab 2.
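The cell that builds these tensors isn't shown here; given the output below and the hp_idx variable used later, it presumably looked something like this:
hp_idx = cont_attr[1:].index('hp')    # position of horsepower among the standardized features
train_target_pt = torch.as_tensor(train[['mpg']].values, dtype=torch.float32)
train_single_feature_pt = standardized_data['train'][:, [hp_idx]]
train_target_pt[:5], train_single_feature_pt[:5]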
(tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]),
tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]))
import matplotlib.pyplot as plt
plt.scatter(train_single_feature_pt, train_target_pt)
plt.xlabel('Horsepower (standardized)')
plt.ylabel('Fuel Consumption - miles per gallon')
plt.title('Training Set - HP x MPG')
The relationship isn't quite linear, but there's clearly an inverse correlation between a car's power and its fuel consumption, as you'd expect. A
small 50 HP car is certainly much more fuel-efficient (hence more miles per gallon) than a high-powered 200 HP sports car.
TensorDataset
Cool, we have two tensors now, let's use them to build a TensorDataset! Tensor datasets are one of the most basic types of datasets you'll find
in PyTorch. They simply wrap a couple of tensors containing your data - feature(s) and target(s) - so you can conveniently load your data in
mini-batches at will for training your model. We'll get back to it when we discuss PyTorch's data loader in the next section.
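The construction cell isn't shown here; it presumably looked like this, with the feature tensor first and the target tensor second (matching the tuples retrieved further below):
from torch.utils.data import TensorDataset

train_ds = TensorDataset(train_single_feature_pt, train_target_pt)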
PyTorch's datasets work pretty much like Python lists. You can think of a dataset as a list of tuples, each tuple corresponding to one data point
(features, target).
You can create your own, custom, dataset by inheriting from the Dataset class. Datasets need to implement some basic methods such as __init__(self), __getitem__(self, index), and __len__(self).
If we check the source code of the TensorDataset, that's what we'll find:
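Slightly simplified (the actual source also includes type annotations and a more descriptive assertion message), its first two methods read as follows, with the __len__() method right after:
from torch.utils.data import Dataset

class TensorDataset(Dataset):
    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)
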
    def __len__(self):
        return self.tensors[0].size(0)
In the constructor (__init__()) method, it makes sure all tensors are of the same size and assigns them to its tensors attribute. In the __getitem__() method, which makes a dataset "sliceable" just like a Python list, it loops over all tensors and builds a tuple containing the index-th element of each tensor. Finally, in the __len__() method, it simply returns the first dimension of the first tensor (since it is guaranteed they're all of the same size).
Simple enough, right? Let's retrieve a few elements from our dataset:
train_ds[:5]
(tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]),
tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]))
As expected, we got a tuple back, the first element being five data points from the first (feature) tensor, the second element being the
corresponding five data points from the second (target) tensor. It really works like a list of tuples!
Tensor datasets are as simple as they can be, but PyTorch offers many other datasets, such as the ImageFolder dataset that you can use with
your own images, or many other built-in datasets. We'll see them in more detail in the second part of this course while tackling computer vision
tasks.
Let's create datasets for our validation and test sets as well. We'll be skipping some intermediate steps and creating tensor datasets directly
out of the pandas dataframes:
val_ds = TensorDataset(standardized_data['val'][:, [hp_idx]],
torch.as_tensor(val[['mpg']].values, dtype=torch.float32))
test_ds = TensorDataset(standardized_data['test'][:, [hp_idx]],
torch.as_tensor(test[['mpg']].values, dtype=torch.float32))
PyTorch offers plenty of built-in datasets in both computer vision and natural language processing areas.
There are datasets for image classification (e.g. CIFAR10, MNIST, SVHN), object detection, image segmentation, optical flow, stereo matching,
image pairs, image captioning, video classification and prediction. For a complete list of available datasets, please check the Datasets section
of Torchvision documentation.
There are also datasets for text classification (e.g. AG News, IMDb, MNLI, SST2), language modeling, machine translation, sequence tagging,
question answering, and unsupervised learning. For a complete list of available datasets for natural language processing, please check the
Datasets section of Torchtext documentation.
Perhaps you noticed that, so far, we've been handling "CPU" tensors only. That is actually by design: while building a dataset, you may want to
keep your data out of your precious, and expensive, GPU memory. Only the data that is going to be actively used for training in any given step - a
mini-batch of data - should be sent to the GPU.
Mini-Batches
A mini-batch is a subset of a dataset, usually drawn randomly from it, and the number of data points in a mini-batch is usually a power of two. Typical mini-batch sizes are 32, 64, 128, etc., but, in many cases, the mini-batch size may be limited by the size of the available memory. This is especially true for large models that take up a lot of space, where sometimes it is only feasible to load one data point at a time. In these cases, the restriction imposed by hardware may be circumvented by accumulating the results (the gradients) over several steps, thus simulating a larger mini-batch.
For now, let's draw mini-batches from our dataset using PyTorch's DataLoader!
DataLoaders
Data loaders can be used to randomly draw a given number of data points - the mini-batch size - out of a dataset. By default, they will return
different mini-batches every time until the underlying dataset runs out of available data points. At this point - pun very much intended - it will
start over.
The data loader is a rich class and it has many parameters. At first, we're focusing on a few of them only:
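namely dataset, batch_size, generator, and shuffle. The original cell isn't shown here; a minimal sketch of it (the batch size and the seed value are arbitrary choices) would be:
from torch.utils.data import DataLoader

generator = torch.Generator()
generator.manual_seed(42)

train_loader = DataLoader(dataset=train_ds, batch_size=32, generator=generator, shuffle=True)
next(iter(train_loader))    # draws the first mini-batch: a list with a feature tensor and a target tensor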
The last parameter, shuffle, is quite important. In the vast majority of cases, you should set shuffle=True for the training set, the major
exception to this rule being time series. Shuffling your data, thus ensuring there's no underlying order to it (e.g. ordered by date of creation)
makes learning more robust. Of course, in our case, we had already shuffled it at the very start before splitting our full dataset into training,
validation, and test sets, so shuffling at this point is redundant - but it surely doesn't hurt!
Moreover, we're ensuring the reproducibility of the results by explicitly assigning a random number generator to our data loader and setting its
seed using the manual_seed() method. This way we can control the data sampling during training.
Even though we did our best to ensure the reproducibility of the results, you may still find some differences in the results or in the loss curves.
PyTorch's documentation about reproducibility covers the reasons for this in more detail.