Building Your First Dataset
Learning Objectives

By the end of this chapter, you should be able to:

1. Load and manipulate tensors in PyTorch, sending them to different devices

2. Perform basic preprocessing on your data, such as standardizing continuous attributes

3. Build a dataset out of tensors

4. Split your dataset into mini-batches using data loaders

Tensors, Devices & CUDA


In Deep Learning, we see tensors everywhere. But, what is a Tensor, anyway?

Before answering this (in the context of deep learning models), let's take a step back and learn the difference between scalars, vectors, and
multi-dimensional arrays such as matrices. Since we'll be using tabular data to train our first model, let's draw analogies from a spreadsheet.

Scalars
A single value is called a scalar.

Let's create a scalar in PyTorch:

import torch
scalar = torch.tensor(18)
scalar

tensor(18)


Vectors
A list or one-dimensional array of values, like a single column in a spreadsheet, is called a vector.

Let's create a vector in PyTorch:

vector = torch.tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
vector

tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

Matrices
A two-dimensional array of values, like a table in a spreadsheet, is called a matrix.


Let's create a matrix in PyTorch (we'll use just two columns, mpg and horsepower, to keep it simple):

matrix = torch.tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
matrix

tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])

Of course, you'll never have to type in the values from a spreadsheet. We'll conveniently load the values directly from the file, first using pandas,
and then using PyTorch's own data pipes. We'll get back to it in the "Datasets" section.

Tensors
A three-dimensional array of values, like a collection of spreadsheets, each containing data for a given month, is called a tensor.


From then on, be it four or forty-two dimensions, a multi-dimensional array is called a tensor. So, technically speaking, if an array has three or
more dimensions, it is a tensor.

You can easily create tensors in PyTorch using the tensor() method to create either a scalar or a tensor, as we've been doing in the examples
provided on the previous pages. Moreover, there are methods to create tensors filled with ones, zeros, or random numbers: ones(), zeros(),
rand(), and randn() to name a few.

matrix_of_ones = torch.ones((2, 3), dtype=torch.float)


random_tensor = torch.randn((2, 3, 4), dtype=torch.float)
matrix_of_ones, random_tensor

(tensor([[1., 1., 1.],


[1., 1., 1.]]),
tensor([[[ 0.5540, -0.5200, 0.2645, 0.0977],
[ 1.1350, 1.0698, -0.9284, -0.6075],
[ 0.0084, 0.0668, 0.7904, 0.5460]],

[[-0.8818, 0.8938, 0.0853, -0.4218],


[ 0.8283, 1.0426, 1.5458, 1.3809],
[ 0.6804, -0.5616, -0.0655, -1.6407]]]))

You can get the shape of a tensor using its shape attribute, but PyTorch also implements a size() method that accomplishes the same thing.

vector.shape, vector.size()

(torch.Size([14]), torch.Size([14]))

As expected, the shape of a scalar is an empty list since scalars are dimensionless (zero dimensions).


matrix.size(), scalar.size()

(torch.Size([2, 14]), torch.Size([]))

While scalars are single numbers, thus having zero dimensions, one- and two-dimensional arrays are called vectors and matrices, respectively,
as we've seen in the examples above. But, in order to make matters simple, it is commonplace to refer to any array with one or more
dimensions as a tensor.

In summary, everything is either a scalar or a tensor. There are tensors for data, and tensors for parameters. Right now, we're dealing with the former, and we'll move on to the latter in the next chapter.

NumPy
NumPy brings the computational power of languages like C and Fortran to Python, a language much easier to learn and use. Thanks to its
performance, Numpy sits at the core of many machine and deep learning libraries such as Scikit-Learn, Scipy, Pandas, and Matplotlib. For this
reason, it is fairly common to load tabular data from other sources, such as CSV or Excel files, into a collection of Numpy arrays. Even when
dealing with images, pixel values are often stored inside Numpy arrays.

PyTorch tensors and Numpy arrays have a lot in common. You may create Numpy arrays using its identically-named methods such as zeros(),
ones(), rand(), and randn(), for example.

Moreover, you can easily switch between the two of them, arrays and tensors, using PyTorch's numpy() and as_tensor() methods. The former
converts a PyTorch tensor into a Numpy array, while the latter creates a PyTorch tensor out of a Numpy array. Let's see them in action.

Numpy array:

import numpy as np
numpy_array = vector.numpy()
numpy_array

array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

PyTorch tensor:

back_to_tensor = torch.as_tensor(numpy_array)
back_to_tensor

tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

There's one caveat, though: only "CPU" tensors can be converted into Numpy arrays. Every tensor we created thus far is, by default, a "CPU"
tensor. We'll learn about different types of tensors shortly, in the "Devices" section.

Reshaping Tensors
One of the most common operations you'll need to perform is to reshape a tensor into a different, well, shape!

One typical case, especially in computer vision, is to convert a multi-dimensional tensor representing features into a single sequence of
features. The figure below illustrates this:

There are two data points, and their corresponding features are organized in a two-by-three shape in the tensor at the top. In order to use these
features to train a linear or a logistic regression, however, you'd need to have the features lined up instead. The flattened tensor at the bottom

represents this, the flattened version of both tensors.

Although the operation itself is quite simple, there are a few pitfalls you need to avoid while reshaping your tensors. Let's go over a few
examples.

Reshaping Tensors: Avoiding Copies


In PyTorch, you can reshape a tensor using its view() or reshape() methods. The latter may or may not create a copy, so the former is preferred,
since it doesn't make copies of the data.

original_tensor = torch.ones((2, 3), dtype=torch.float)


reshaped_tensor = original_tensor.view(1, 6)
original_tensor, reshaped_tensor

(tensor([[1., 1., 1.],


[1., 1., 1.]]),
tensor([[1., 1., 1., 1., 1., 1.]]))

Reshaping Tensors: Sharing Underlying Data


The view() method only returns a tensor with the desired shape that happens to share the underlying data with the original tensor. It does not
create a new, independent, tensor. This means that, if you make changes to one of the two tensors, the original, or the reshaped one, these
changes will be reflected in both of them.


original_tensor[0, 1] = 2
original_tensor, reshaped_tensor

(tensor([[1., 2., 1.],


[1., 1., 1.]]),
tensor([[1., 2., 1., 1., 1., 1.]]))

Moreover, if you created your tensor from a Numpy array, the two of them, array and tensor, are also sharing the underlying data.

numpy_array[-1] = 1000
numpy_array, vector

(array([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15,
14, 15, 1000]),
tensor([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14,
15, 1000]))
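As a side note, the sharing behavior depends on how the tensor was created: as_tensor() shares memory with the source array whenever possible, while torch.tensor() always copies it. Here is a minimal sketch of the difference (the array and its values are made up for the illustration):

arr = np.array([1, 2, 3])
shared = torch.as_tensor(arr)   # shares memory with arr
copied = torch.tensor(arr)      # always makes an independent copy
arr[0] = 99
shared[0].item(), copied[0].item()  # (99, 1)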

In order to effectively duplicate the data and create a new, independent, tensor, you can use the clone() method instead.

cloned_tensor = original_tensor.clone()
cloned_tensor


tensor([[1., 2., 1.],


[1., 1., 1.]])

Now, if you make changes to the original tensor, they won't be reflected in the new tensor anymore.

original_tensor[0, 1] = 3
original_tensor, cloned_tensor

(tensor([[1., 3., 1.],


[1., 1., 1.]]),
tensor([[1., 2., 1.],
[1., 1., 1.]]))

Reshaping Tensors: Contiguous Tensors


The view() method is a convenient way of reshaping a tensor, but it may fail if the underlying tensor is not contiguous in memory.

transposed_tensor = original_tensor.t()
transposed_tensor.view(1, 6)

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-40-13ca45c67fa7> in <cell line: 2>()
1 transposed_tensor = original_tensor.t()
----> 2 transposed_tensor.view(1, 6)

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use
.reshape(...) instead.

Remember, view() never makes copies of the data, while reshape() copies it when necessary, so reshape() always works, even if the tensor is not contiguous.
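A quick check of that claim, reusing the transposed tensor from the error above: reshape() quietly falls back to copying, so the same reshape that failed with view() goes through.

# reshape() copies the data when the layout requires it, so this works
reshaped = transposed_tensor.reshape(1, 6)
reshaped, reshaped.is_contiguous()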

But, what does it mean to be contiguous? Simply put, it means two elements in the same row must be next to each other in memory. This is
always the case whenever a tensor is created (like our original_tensor), but once we transpose it, we're not actually changing its allocation in
memory. Transposing, in this case, means traversing it differently, that is, jumping to a different position in memory.


Contiguous vs Non-Contiguous Tensors

We can see the "rules" for moving to the next row or column by checking the tensor's stride() method:

original_tensor.stride(), transposed_tensor.stride()

((3, 1), (1, 3))

In the original tensor, the stride is telling us that we need to skip three positions in memory to get to the next row, while only one position for the
next column. But, in the transposed tensor, it is the other way around: we need to skip three positions to get to the next column.
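To make the stride idea concrete, here is a small sketch (the indices i and j are arbitrary): the element at position [i, j] lives at flat memory offset i * stride(0) + j * stride(1), which we can verify on the contiguous original_tensor.

i, j = 1, 2
offset = i * original_tensor.stride(0) + j * original_tensor.stride(1)
original_tensor[i, j] == original_tensor.view(-1)[offset]  # tensor(True)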

If we need to skip two or more positions to get to the next column, it means our tensor is not contiguous anymore. Let's check it out:

transposed_tensor.is_contiguous(), original_tensor.is_contiguous()

(False, True)

Transposed tensor:

transposed_tensor


tensor([[1., 1.],
[3., 1.],
[1., 1.]])

Luckily, you can simply call the contiguous() method, and PyTorch will rearrange the data in memory so that it can be traversed in its typical fashion (a stride of one in the last dimension). If the underlying data happens to be contiguous already, this is a zero-cost operation.

transposed_tensor.contiguous().view(1, 6)

tensor([[1., 1., 3., 1., 1., 1.]])

Making a Tensor Contiguous

Finally, it is also possible to use the flatten() method instead, in case you're trying to make your tensor one-dimensional.

transposed_tensor.flatten()

tensor([1., 1., 3., 1., 1., 1.])

Don't worry much about memory allocation, though. The purpose of this section was to make you aware of, and capable of addressing, the error message shown above, should you ever run into it.


Named Tensors

Named tensors are a long-awaited feature, even if they're still a prototype. Many, if not most, implementation bugs in deep learning models - even worse, the silent kind of bug - arise from the wrong dimensions being used in a given operation.

You may be wondering how it is possible that such a serious bug can be a silent one, that is, one that does not raise an exception and crash the application.

In many cases, broadcasting is to blame. Broadcasting is both a blessing and a curse. While it makes it extremely easy to perform operations
using tensors of different shapes without the need to explicitly replicate data along some dimension, it may also give you the illusion your
operation is the right one, even when it's not because you messed up the dimensions.

Named Tensors: Broadcasting


Broadcasting happens whenever you're trying to perform an operation on tensors of different shapes. For example, if you try to multiply a one-
dimensional tensor by a scalar (zero dimensions):

a = np.array([1.0, 2.0, 3.0])


b = 2.0
a * b

array([2., 4., 6.])

You've probably done similar operations many times without giving a second thought to why it works so seamlessly. As it turns out, you have
broadcasting to thank for this behavior. Under the hood, PyTorch (or Numpy) will "stretch" the variable b so its shape matches that of variable a,
thus allowing the desired element-wise multiplication.


Moreover, it is actually more efficient to use broadcasting like that than building a tensor full of 2.0s to match the shapes!

Let's go over an example:

mat1 = torch.ones((3, 3))


mat2 = torch.tensor([[1, 2, 3]])
mat1, mat2

(tensor([[1., 1., 1.],


[1., 1., 1.],
[1., 1., 1.]]),
tensor([[1, 2, 3]]))

What if we'd like to perform an element-wise multiplication? Broadcasting has us covered: it will "understand" that mat2 was "meant" to be 3x3 instead.

mat1 * mat2

tensor([[1., 2., 3.],


[1., 2., 3.],
[1., 2., 3.]])

Broadcasting works by comparing dimensions of both tensors from right to left, and it will "match" them if they are equal or one of them is one
(so that particular value will be replicated along that dimension). In the example above, these are the dimensions:

mat1.size(), mat2.size()

(torch.Size([3, 3]), torch.Size([1, 3]))
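If you're unsure what shape a broadcast will produce, recent PyTorch versions expose a helper that computes it without performing any operation (a small sketch, assuming torch.broadcast_shapes is available in your version):

torch.broadcast_shapes(mat1.size(), mat2.size())  # torch.Size([3, 3])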

The right-most dimension is 3 for both tensors, so it is matched. Moving to the left, the first dimension of one tensor is 1, so it is also matched.
There we go, broadcasting can work its magic! But, beware, if you were to transpose the second tensor (mat2) by mistake, broadcasting still
works!

mat2_wrong_shape = mat2.t()
mat1 * mat2_wrong_shape


tensor([[1., 1., 1.],


[2., 2., 2.],
[3., 3., 3.]])

What does this mean? It means that, if you transposed one of the tensors by mistake, it may still produce a valid output. If you think it's unlikely
that you'll ever get the dimensions in the wrong order, think again: when it comes to tensors representing batches of images or sequences, it
isn't so uncommon to mix dimensions up.

Luckily, named tensors can help us keep broadcasting eagerness in check.

named_mat1 = torch.ones((3, 3), names=['R', 'C'])


named_mat2 = torch.tensor([[1, 2, 3]], names=['R', 'C'])

<ipython-input-51-607113069adf>:1: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable.
  named_mat1 = torch.ones((3, 3), names=['R', 'C'])

Input:

named_mat1 * named_mat2

tensor([[1., 2., 3.],


[1., 2., 3.],
[1., 2., 3.]], names=('R', 'C'))

All is well and good: rows and columns are properly aligned, and the result is as expected. Also, notice that the names are propagated to the resulting tensor.

Now, what happens if we transpose (supposedly by mistake) the second matrix?

named_mat1 * named_mat2.t()

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-53-7deb473f8df5> in <cell line: 1>()
----> 1 named_mat1 * named_mat2.t()

RuntimeError: Error when attempting to broadcast dims ['R', 'C'] and dims ['C', 'R']: dim 'C' and dim 'R' are at the same position from the right but do
not match.


Great, we got an error! Even though broadcasting would happily return a 3x3 matrix, the misalignment of the dimensions' names prevented that
and rightfully raised an exception warning us of our mistake.

Of course, it only works if both tensors are named. If one of them isn't named, broadcasting keeps working as expected.


named_mat1 * mat2.t()

tensor([[1., 1., 1.],


[2., 2., 2.],
[3., 3., 3.]], names=('R', 'C'))

Devices
So far, all the tensors we have created are "CPU" tensors. It means the tensor is stored in the computer's main memory and any operations
performed on these tensors are handled by its central processing unit, the CPU (e.g. an Intel Core i9 processor). The type of tensor is
designated by the device, a CPU in this case, that handles its operations.

We can easily check the device responsible for a given tensor by checking its device attribute:

device = original_tensor.device
device

device(type='cpu')

But the CPU is not the only device we can use to manipulate tensors. We can also use graphics processing units (GPUs), tensor processing
units (TPUs) or even "meta" (fake) devices. Let's take a look at them!

Devices: GPU


Graphics processing units (GPUs) are a powerful tool in the deep learning practitioner's toolbelt. They were originally designed for gaming, and they are especially fast at performing matrix multiplication at scale, since that is the most common operation in rendering the 3D scenes of a game.


It turns out, though, that matrix multiplication at scale can also be used to train deep learning models. Initially, it wasn't easy to leverage their
power for that purpose since programming a GPU was quite challenging. It was NVIDIA's release of CUDA (Compute Unified Device
Architecture) and, later on, AMD's ROCm (Radeon Open Compute Ecosystem) that allowed deep learning frameworks such as PyTorch to more
easily use them to dramatically speed up training times.

GPUs are freely available on most platforms, such as Google Colab and Kaggle, and you should always check for the availability of a GPU before you start training a model. Since these platforms offer CUDA-compatible GPUs, we'll be focusing solely on them.

PyTorch makes it really easy to accomplish that: you only have to make a call to torch.cuda.is_available() and name your device accordingly.

device = 'cuda' if torch.cuda.is_available() else 'cpu'

Once we specify a device, we can send our tensor to it using the aptly named method to():

sent_tensor = original_tensor.to(device)
sent_tensor.device

device(type='cpu')

If a GPU is not available, nothing will happen, and calling to() comes at no cost. So, it's safe to always send your tensors (and later on, your
models) to the specified device. This way, if you share your code with someone else, or if you happen to run it in a different environment in the
future, your code will always leverage the power of a GPU, if one is available to you.

If a GPU is indeed available, the tensor's device will read cuda:0, as it now resides in the memory of the first (and in most cases, only) GPU
available. Moreover, if you check the tensor's type, it will read torch.cuda.FloatTensor.

sent_tensor.type()

'torch.FloatTensor'

If you're lucky enough to have multiple GPUs at your disposal, you can check how many are available to you, and their corresponding names
using torch.cuda.device_count() and torch.cuda.get_device_name(), respectively:

n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
    print(torch.cuda.get_device_name(i))

Once a tensor is sent to a GPU, it cannot be directly converted to a Numpy array anymore (the call below only works here because no GPU was available, so the tensor is still on the CPU).

sent_tensor.numpy()

array([[1., 3., 1.],


[1., 1., 1.]], dtype=float32)

You need to bring them back to the CPU first (either using to('cpu') or cpu()), and only then call the numpy() method.

sent_tensor.cpu().numpy()

array([[1., 3., 1.],


[1., 1., 1.]], dtype=float32)

Devices: TPU
Unlike GPUs, which were originally designed for gamers, TPUs - tensor processing units - as their name suggests, were designed by Google to
be used for training deep learning models in TensorFlow.

TPUs are available on some platforms, such as Google Colab and Kaggle. Although they have been designed to work with TensorFlow, it's also
possible to leverage their immense power in PyTorch using PyTorch/XLA, a package that connects PyTorch to Google's XLA (accelerated linear
algebra) library.

Unfortunately, TPUs aren't freely available in Google Colab anymore.

TPUs can be used to speed up training even more by using all their cores at once through multiprocessing.
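Just as a hedged sketch (not something we'll run in this course), addressing a TPU core with PyTorch/XLA looks much like addressing a GPU; the snippet below assumes the torch_xla package is installed and falls back gracefully if it isn't:

try:
    import torch_xla.core.xla_model as xm  # requires the torch_xla package
    device = xm.xla_device()               # first available TPU core
    print(torch.ones(2, 3).to(device).device)
except ImportError:
    print('torch_xla is not installed; skipping the TPU example')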

Devices: "Meta" (Fake)


The "meta" device is an elegant solution for a problem you may run into if your models grow really large. After training a model, you may save it
to disk, and later load it back as the backend of an application, for example (we'll do that later).


In order to load a model from disk, though, you need to create an instance of the untrained model first, so you have somewhere to load the model into. Meta devices allow you to create "dummy", empty models that would be too large to fit in memory, thus making it possible to work around hardware constraints by only partially loading a model.

meta_tensor = torch.zeros(2, 3, device='meta')


meta_tensor

tensor(..., device='meta', size=(2, 3))

Mission accomplished! The fake tensor does not contain any data, as expected. Now let's create a REALLY huge fake tensor.

huge_tensor = torch.zeros(100000, 100000, device='meta')


huge_tensor

tensor(..., device='meta', size=(100000, 100000))

The tensor above, should it be a real tensor, would have 10 BILLION 32-bit float elements. Let's see what happens if we try to create a regular
tensor of the same size.

#huge_tensor = torch.zeros(100000, 100000)

Unless your computer has over 40 gigabytes of free RAM, you'll get an error. Fake tensors are useful to handle tensors - and models - that are
too large to fit into memory.
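The same trick works for model layers, not just plain tensors. As a minimal sketch (assuming a recent PyTorch version that accepts the device argument in layer constructors), a gigantic linear layer can be instantiated on the meta device without allocating its 40-gigabyte weight matrix:

import torch.nn as nn

# The parameters are meta tensors: they have a shape but no actual storage
giant_layer = nn.Linear(100000, 100000, device='meta')
giant_layer.weight.shape, giant_layer.weight.is_meta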

We won't be using these tensors in this course, but if you venture into using really large models, you're already aware of your options.

Datasets


It is time to get our hands a little dirty with some tiny, yet real, data. Let's start by loading the Auto MPG Dataset directly from the UCI Machine
Learning Repository using pandas' read_csv() method and a URL.

Its description reads: "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and
5 continuous attributes."

In this dataset, values are separated by spaces, and missing values are represented by a question mark. The columns, or attributes, as stated in the repository, are as follows:

mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)

The last column, car name, is actually separated by tabs (instead of spaces), so we're considering the cars' names as comments while loading
the dataset.

Pandas

To load tabular data, such as CSV or Excel files, one of the most popular choices is the Pandas package, an open-source data analysis and
manipulation tool. Pandas' strength lies in its dataframes, a spreadsheet-like structure that contains two-dimensional data and its
corresponding labels. A dataframe is composed of a sequence of series, each series representing a column, its values stored as Numpy arrays.


You can use methods such as read_csv() and read_excel() to load your data, each method offering plenty of arguments to account for different
separators, the existence or not of column headers, comments, missing data, and more. We'll be using the former to load our dataset.

import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']
df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
df

mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

... ... ... ... ... ... ... ... ...

393 27.0 4 140.0 86.0 2790.0 15.6 82 1

394 44.0 4 97.0 52.0 2130.0 24.6 82 2

395 32.0 4 135.0 84.0 2295.0 11.6 82 1

396 28.0 4 120.0 79.0 2625.0 18.6 82 1

397 31.0 4 119.0 82.0 2720.0 19.4 82 1

398 rows × 8 columns

A dataframe can be easily sliced, both column- and row-wise. Retrieving values from a single column (thus resulting in a Pandas series) is as
simple as that:

df['mpg']


mpg

0 18.0

1 15.0

2 18.0

3 16.0

4 17.0

... ...

393 27.0

394 44.0

395 32.0

396 28.0

397 31.0

398 rows × 1 columns

dtype: float64

A Pandas series works as a wrapper around the underlying Numpy array that contains its data, which you can retrieve using the values attribute:

df['mpg'].values[:5]

array([18., 15., 18., 16., 17.])

Selecting multiple columns will return a sliced dataframe:

df[['mpg', 'hp']]


mpg hp

0 18.0 130.0

1 15.0 165.0

2 18.0 150.0

3 16.0 150.0

4 17.0 140.0

... ... ...

393 27.0 86.0

394 44.0 52.0

395 32.0 84.0

396 28.0 79.0

397 31.0 82.0

398 rows × 2 columns

The dataframe itself also has its own values attribute, which will give you access to a two-dimensional Numpy array containing the whole data:

df[['mpg', 'hp']].values[:5]

array([[ 18., 130.],


[ 15., 165.],
[ 18., 150.],
[ 16., 150.],
[ 17., 140.]])

To subset the dataframe, you can use its iloc attribute, which allows for selecting rows based on their index:

df.iloc[:5]


mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

It is also possible to use a boolean Series to conditionally subset the rows of a dataframe:

cond = (df['year'] == 70)


df[cond]


mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

5 15.0 8 429.0 198.0 4341.0 10.0 70 1

6 14.0 8 454.0 220.0 4354.0 9.0 70 1

7 14.0 8 440.0 215.0 4312.0 8.5 70 1

8 14.0 8 455.0 225.0 4425.0 10.0 70 1

9 15.0 8 390.0 190.0 3850.0 8.5 70 1

10 15.0 8 383.0 170.0 3563.0 10.0 70 1

11 14.0 8 340.0 160.0 3609.0 8.0 70 1

12 15.0 8 400.0 150.0 3761.0 9.5 70 1

13 14.0 8 455.0 225.0 3086.0 10.0 70 1

14 24.0 4 113.0 95.0 2372.0 15.0 70 3

15 22.0 6 198.0 95.0 2833.0 15.5 70 1

16 18.0 6 199.0 97.0 2774.0 15.5 70 1

17 21.0 6 200.0 85.0 2587.0 16.0 70 1

18 27.0 4 97.0 88.0 2130.0 14.5 70 3

19 26.0 4 97.0 46.0 1835.0 20.5 70 2

20 25.0 4 110.0 87.0 2672.0 17.5 70 2

21 24.0 4 107.0 90.0 2430.0 14.5 70 2

22 25.0 4 104.0 95.0 2375.0 17.5 70 2

23 26.0 4 121.0 113.0 2234.0 12.5 70 2

24 21.0 6 199.0 90.0 2648.0 15.0 70 1

25 10.0 8 360.0 215.0 4615.0 14.0 70 1

26 10.0 8 307.0 200.0 4376.0 15.0 70 1

27 11.0 8 318.0 210.0 4382.0 13.5 70 1

28 9.0 8 304.0 193.0 4732.0 18.5 70 1

Train-Validation-Test Split

The purpose of the split is to simulate the arrival of new data, so you can make adjustments to your model if training doesn't go well and, once
you're happy with it, to make a final assessment before going live with it. Each split, training, validation, and test, has its own purpose, as
described below. It is also important to highlight that the split should always be the first thing you do—no preprocessing, no transformations;
nothing happens before the split.

Training Set: the data you use to train your model - you can use and abuse this data!
Validation Set: the data you should only use for hyper-parameter tuning, that is, comparing differently parameterized models trained on
the training data, to decide which parameters are best. You should use, but not abuse this data, as it is intended to provide an unbiased
evaluation of your model and, if you mess around with it too much, you'll end up incorporating knowledge about it in your model without
even noticing.
Test Set: the data you should use only once, when you are done with everything else, to check if your model is still performing well. We like
to pretend this is data from the "future" - that particular day in the future when our model is ready to give it a go in the real world! So, until
that day, we cannot know this data, as the future hasn't arrived yet.


Train-Validation-Test Split: Shuffling


In most cases, you will need to shuffle the data - the rows in a dataframe - before the split, so your data isn't in any particular order anymore.
Perhaps you noticed that our Auto MPG dataset is ordered by year:

df['year'].values[:50]

array([70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71,
71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71])

Although it may make sense if you're handling data in a spreadsheet (making it easier for you to look something up in it), a predefined ordering
is potentially an issue for training and evaluating a model. Ideally, we would like to have all our sets - training, validation, and test - containing
similar information. Taking the example of the year column, we'd like to have cars from 1970 to 1982 in all three sets. The easiest and fastest
way to accomplish that is to simply shuffle the data first.

We can use the sample() method of a Pandas dataframe to sample, that is, to draw data points from the dataframe in random order. The trick here is to draw the whole dataset using sample(frac=1), so we get our full dataframe back, but in a different order. The resulting dataframe still keeps its original index values, but we can easily drop them using the reset_index() method.

shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

To actually perform the split, we'll use Scikit-Learn's train_test_split() twice, once for splitting the data into train and test sets, and then to
subdivide the training data into train and validation sets.

from sklearn.model_selection import train_test_split
trainval, test = train_test_split(shuffled, test_size=0.16, shuffle=False)
train, val = train_test_split(trainval, test_size=0.2, shuffle=False)

Cleaning Data

Ensuring the quality and consistency of your data is of the utmost importance. The most basic checks you can do are looking for missing values and outliers in your data. Let's start with the latter. Outliers are values that are, literally, "off the charts": they may be produced by measurement or input (e.g. typing) errors, in which case they are not real and thus must be handled; but they may also be legitimate, sometimes indicating an anomaly, in which case they may be exactly the target of your model. Outliers of the first kind, the errors, may affect model training negatively and badly skew its predictions. There are many techniques for removing outliers, but we won't be delving into this topic here.

Unlike outliers, missing values are easy to spot: they show up as NaN (Not a Number) in a dataframe. In deep learning models, NaN values propagate like an infectious disease: any operation between an actual number and a NaN value results in another NaN value.
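A tiny illustration of that contagious behavior (the values are made up): a single NaN is enough to contaminate any aggregation over the tensor.

nan_tensor = torch.tensor([1.0, 2.0, float('nan')])
nan_tensor.sum(), nan_tensor.mean()  # (tensor(nan), tensor(nan))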

The process of "fixing" missing values is called imputation, that is, replacing the missing data with substituted values. There are many
techniques to accomplish this, from using the mean or median of the corresponding column in the training data, to more sophisticated
approaches using other Machine Learning algorithms to "predict" the missing value.

Any imputation is based on assumptions you make about its nature, so it will necessarily lead to slightly modifying the data distribution.
Alternatively, if you can afford to lose some data points, it's also possible to simply discard any data points containing missing values.
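For reference, here is a minimal sketch of what imputation could look like in our case (assuming, as we'll confirm shortly, that horsepower, hp, is the column with missing values): the median is computed on the training set only and then reused for the other splits, to avoid leakage. We won't use this below, where the rows are simply dropped instead.

# Hypothetical alternative to dropping rows (not used in the rest of the chapter)
hp_median = train['hp'].median()               # statistic from the training set only
train_imputed = train.fillna({'hp': hp_median})
val_imputed = val.fillna({'hp': hp_median})
test_imputed = test.fillna({'hp': hp_median})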

Let's check our data for missing values:

is_missing_attr = train.isna()
n_missing_attr = is_missing_attr.sum(axis=1)

train[n_missing_attr > 0]

mpg cyl disp hp weight acc year origin

89 34.5 4 100.0 NaN 2320.0 15.8 81 2

208 25.0 4 98.0 NaN 2046.0 19.0 71 1

211 40.9 4 85.0 NaN 1835.0 17.3 80 2

There are three cars with missing horsepower information in our training set. While tree-based algorithms such as Random Forests (RF) or
Gradient-Boosted Trees (GBT) can easily handle missing data, missing values are a big no-no when it comes to neural networks.

In order to keep things simple in our small example, let's simply drop any rows that contain missing values.

train.dropna(inplace=True)
train

mpg cyl disp hp weight acc year origin

0 18.0 6 171.0 97.0 2984.0 14.5 75 1

1 28.1 4 141.0 80.0 3230.0 20.4 81 2

2 19.4 8 318.0 140.0 3735.0 13.2 78 1

3 20.3 5 131.0 103.0 2830.0 15.9 78 2

4 20.2 6 232.0 90.0 3265.0 18.2 79 1

... ... ... ... ... ... ... ... ...

262 26.0 4 91.0 70.0 1955.0 20.5 71 1

263 26.4 4 140.0 88.0 2870.0 18.1 80 1

264 31.9 4 89.0 71.0 1925.0 14.0 79 2

265 19.2 8 267.0 125.0 3605.0 15.0 79 1

266 33.0 4 91.0 53.0 1795.0 17.5 75 3

264 rows × 8 columns


Then, let's do the same for our validation and test sets. If we had chosen to perform missing value imputation, we would have to apply the same
rules used in the training set for the validation and test sets as well.

val.dropna(inplace=True)
test.dropna(inplace=True)

Beware of Data Leakage!

You should never use the validation or test sets as a source for any kind of data preprocessing (such as imputing data). Using statistics computed on the validation or test sets is akin to using statistics from the future, that is, computed on the data your users will eventually send to your application or model. Obviously, you cannot know these values beforehand, and using statistics based on the validation or test sets is a serious data leakage that will make your models look great during evaluation, even though they're likely to perform poorly when effectively deployed.

Continuous Attributes

You can see that mpg, displacement, horsepower, weight, and acceleration are continuous attributes, that is, they may be any numeric value.

cont_attr = ['mpg', 'disp', 'hp', 'weight', 'acc']

Continuous attributes are the bread and butter of deep learning models. We've already discussed that these models cannot handle missing
values and, as it turns out, they may also have issues with values spread over wildly different ranges. When it comes to deep learning models,
predictable and, better yet, zero-centered ranges for features are a must.


Let's see how the attributes (other than fuel consumption, mpg, which is the target of our prediction) fare in their own ranges of values:

train_features = train[cont_attr[1:]]
train_features.hist()

array([[<Axes: title={'center': 'disp'}>, <Axes: title={'center': 'hp'}>],


[<Axes: title={'center': 'weight'}>,
<Axes: title={'center': 'acc'}>]], dtype=object)

It doesn't look very good: not only are the ranges quite different from one another, but the ranges are nowhere near zero-centered, as expected
from real-world physical attributes such as weight. There's no such thing as a negative weight (except, maybe, for theoretical physicists!).

So, what do we do about it and, most importantly, why do we have to do something about it? Let's start with the latter: without going into much
detail, it suffices to know for now that deep learning models are more easily trained if the attributes or features used to train them display
values in symmetrical ranges, preferably in low values, such as from minus three to three. Otherwise, they may exhibit problematic behaviors
during training, failing to converge to a solution.

Therefore, it is best practice to bring all values to a more "digestible" range for the sake of the model's health. The procedure that accomplishes this is called standardization or, sometimes, normalization. It consists of subtracting the mean of the attribute (thus zero-centering it), and dividing the result by the standard deviation (thus turning it into unit standard deviation). The resulting attributes will exhibit similar ranges.

Let's start by computing both means and standard deviations:

train_means = train_features.mean()
train_standard_deviations = train_features.std()
train_means, train_standard_deviations

(disp 195.456439
hp 105.087121
weight 2984.075758
acc 15.432955
dtype: float64,
disp 106.255830
hp 39.017837
weight 869.802063
acc 2.743941
dtype: float64)

Then, let's standardize our features:

train_standardized_features = (train_features - train_means)/train_standard_deviations


train_standardized_features.mean(), train_standardized_features.std()

(disp -1.059758e-16
hp -1.648513e-16
weight 7.569702e-17
acc 4.003532e-16
dtype: float64,
disp 1.0
hp 1.0
weight 1.0
acc 1.0
dtype: float64)

Their means are zero, and their standard deviations are one, so it looks good. Let's visualize them:

train_standardized_features.hist()


array([[<Axes: title={'center': 'disp'}>, <Axes: title={'center': 'hp'}>],


[<Axes: title={'center': 'weight'}>,
<Axes: title={'center': 'acc'}>]], dtype=object)

As you can see, standardization doesn't change the shape of the distribution, it only brings all the features to a similar footing when it comes to
their ranges.

Even though we've standardized our continuous features manually, we don't have to do it like that. Scikit-Learn offers a StandardScaler class
that can do this for us and, as we'll see later, we can also use PyTorch's own transformations to standardize values, even if they are pixel values
on images!

We used the training set to define the standardization parameters, namely, the means and standard deviations of our features. Now, we need to standardize the validation and test sets using those same parameters.

Never use the validation and test sets to compute parameters for standardization, or for any other preprocessing step!

val_features = val[cont_attr[1:]]
val_standardized_features = (val_features - train_means)/train_standard_deviations
val_standardized_features.mean(), val_standardized_features.std()

(disp -0.089282
hp -0.151349
weight -0.121501
acc 0.139288
dtype: float64,
disp 0.946465
hp 0.917051
weight 0.918898
acc 0.958077
dtype: float64)

Notice that the resulting means and standard deviations aren't quite zero and one, respectively. That's expected since the validation set should
have a similar, yet not quite exactly the same, distribution as the training set.

If you ever get perfect zero mean and unit standard deviation on a standardized validation set, there's a good chance you're making a mistake
using statistics computed on top of the validation set itself.

Finally, let's standardize the test set as well:

test_features = test[cont_attr[1:]]
test_standardized_features = (test_features - train_means)/train_standard_deviations

We'll get back to the topic of standardization/normalization a couple more times. First, we'll use Scikit-Learn's StandardScaler for the task, and
then we'll learn about normalizing batches of data using PyTorch's own batch normalization.

The StandardScaler is part of Scikit-Learn, "an open source machine learning library that supports supervised and unsupervised learning. It also
provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities."

It is a convenient way of avoiding manually standardizing continuous features as we just did. All it takes is to call its fit() method on the training
set to compute the appropriate statistics (mean and standard deviation), and then apply the standardization to all datasets using its
transform() method.

The fit() method takes a feature matrix X, a Numpy array usually in the shape (n_samples, n_features). We can easily retrieve the two-
dimensional Numpy array that contains the underlying data of our dataframe and we have everything we need to have a functioning
StandardScaler.


from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaler.fit(train_features.values)

StandardScaler()

If we want, we can also check the computed statistics (it computes variance instead of standard deviation, though):

scaler.mean_, scaler.var_

(array([ 195.45643939, 105.08712121, 2984.07575758, 15.43295455]),


array([1.12475350e+04, 1.51662499e+03, 7.53689888e+05, 7.50069430e+00]))
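If you prefer standard deviations, they can be recovered by taking the square root of the variances. Keep in mind that StandardScaler uses the population variance (ddof=0), so the values are close to, but not exactly the same as, the pandas .std() results shown earlier (which use ddof=1).

import numpy as np

np.sqrt(scaler.var_)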

Once it has statistics (computed on the training set only), you can apply it to all your datasets:

standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)

/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
warnings.warn(

To better streamline the process, we can write a standardize() function that:

takes a Pandas dataframe, a list of column names that are continuous attributes, and an optional scaler
creates and trains a Scikit-Learn's StandardScaler if one isn't provided as an argument
returns a PyTorch tensor containing the standardized features and an instance of Scikit-Learn's StandardScaler

from sklearn.preprocessing import StandardScaler


def standardize(df, cont_attr, scaler=None):
    cont_X = df[cont_attr].values
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(cont_X)
    cont_X = scaler.transform(cont_X)
    cont_X = torch.as_tensor(cont_X, dtype=torch.float32)
    return cont_X, scaler

Using the above function, our standardization looks like this:

standardized_data = {}
# The training set is used to fit a scaler
standardized_data['train'], scaler = standardize(train_features, cont_attr[1:])
# The scaler is used as argument to the other datasets
standardized_data['val'], _ = standardize(val_features, cont_attr[1:], scaler)
standardized_data['test'], _ = standardize(test_features, cont_attr[1:], scaler)

Discrete and Categorical Attributes

The three remaining attributes, cylinders, model year, and origin, are multi-valued discrete, that is, there's a set of values each one of them may assume. The cars in the dataset may have either 4, 6, or 8 cylinders, but no car may have 3.45 cylinders, for example. Even though the values are discrete, there's still an underlying order to them: 8 cylinders are, indeed, twice as many as 4 cylinders, but that's not always the case for discrete attributes.

Let's take a look at the origin attribute. The cars come from three different, although unnamed, countries: 1, 2, and 3. The choice of a numerical representation for countries may be misleading, since country "3" is not three times as much as country "1". It would probably have been better to use letters or abbreviations instead, just to make the categorical nature of the attribute more evident.

Sometimes, like in the case of cylinders, discrete attributes can be grouped together with continuous attributes as numeric attributes. More
often than not, though, discrete attributes are considered categorical attributes, thus requiring some extra pre-processing to be handled by deep
learning models.

These pre-processing techniques involve converting each possible value in a categorical attribute into a numerical array of a given length, but
not necessarily the same length as the number of unique values. The process may be called encoding or embedding, depending on how it's
performed.

Let's take a look at this process. Our goal here is to convert each possible value in a discrete or categorical attribute into a numerical array of a
given length (that does not need to match the number of unique values). Before converting them into arrays, though, we need to encode them
as sequential numbers first.

Let's see what this looks like for the cyl attribute of our training dataset. It has only five unique values: 3, 4, 5, 6, and 8 cylinders.

cyls = sorted(train['cyl'].unique())
cyls

[3, 4, 5, 6, 8]

year = sorted(train['year'].unique())
year

[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]

origin = sorted(train['origin'].unique())
origin

[1, 2, 3]

We can easily build a dictionary to map them into sequential numbers:

cyls_map = dict((v, i) for i, v in enumerate(cyls))


cyls_map

{3: 0, 4: 1, 5: 2, 6: 3, 8: 4}


Now imagine there's a lookup table with as many entries as unique values, each entry being a numerical array of a given length (say, eight
elements). Let's create such a lookup table filled with random values as an illustration:

n_dim = 8
lookup_table = torch.randn((len(cyls), n_dim))
lookup_table

tensor([[ 0.6300, 0.4335, -0.6521, 0.1708, 1.2063, -0.7340, -0.7439, 0.0028],


[-1.0897, 1.1370, 1.7471, 0.3194, 0.4317, 0.1624, -0.0836, -0.6387],
[-1.3730, 0.5371, 0.6504, 0.4326, -0.2706, -0.3949, -0.1482, -0.9970],
[-0.3008, -0.8808, -0.0040, -1.0591, 1.2760, 2.4210, -1.0005, 0.7834],
[-0.0731, 0.7370, -0.2961, -1.6068, 0.1728, -1.2520, 0.3217, -0.1416]])

There are five rows, each corresponding to a unique number of cylinders. Three cylinders, according to our mapping dictionary, corresponds to
the first (index zero) row. Four cylinders, to the second (index one) row, and so on, and so forth.

Let's say we'd like to retrieve the numerical array corresponding to six cylinders. We apply the mapping to find the corresponding index
(cyls_map[6]) and use the result to actually slice the corresponding row from the lookup table (lookup_table[idx]):

idx = cyls_map[6]
lookup_table[idx]

tensor([-0.3008, -0.8808, -0.0040, -1.0591, 1.2760, 2.4210, -1.0005, 0.7834])

There we go! Now, any number of cylinders can easily be mapped to a sequence of eight numerical values. It is as if any given number of
cylinders, a categorical attribute, were now represented by eight numerical features instead. We have just (re)invented embeddings! The fact
that these numbers are random is not necessarily an issue: we can simply turn the whole lookup table into parameters of the model itself, so
they are also learned during training. The model will learn the best way to represent each value in a categorical attribute as a sequence of
numerical attributes! How cool is that?

PyTorch offers an Embedding class that wraps a random tensor like the one we've just created. This is actually a layer, and we'll see how layers
work in more detail in the next chapter. For now, it should suffice to know that its arguments are the same as our own: the number of unique
values, and the desired number of elements - or dimensions - in the returned numerical array.

import torch.nn as nn
emb_table = nn.Embedding(len(cyls), n_dim)


The embedding layer, like any other layer in PyTorch, is also a model. Its weights are, surprise, surprise, the lookup table itself. Besides, since it's
a model, it can be called as such and its expected input is a batch of indices. Let's try it out and see what we get out of it:

idx = cyls_map[6]
emb_table(torch.as_tensor([idx]))

tensor([[-2.4736, -1.1113, -0.0137, 0.4004, -0.0134, 0.0216, 0.0412, 0.1218]],


grad_fn=<EmbeddingBackward0>)

There we go, you created your first embeddings! Embeddings are an important part of modern deep learning, and a fundamental piece of
natural language processing, as we'll see in later chapters. Notice that the values are actually different from our previous example because the
newly created emb_table instance initializes its own random tensor under the hood.

A special case of embedding is the one-hot encoding (OHE) approach: instead of letting the model learn it during training, the mapping is fixed.
In OHE, the numerical array has the same length as the number of unique values and it has only one nonzero element. It works as if each unique
value were a dummy variable, for example: cyl3, cyl4, cyl5, cyl6, and cyl8, and only one of those dummy variables may have a nonzero value.

ohe_table = torch.eye(len(cyls))
ohe_table

tensor([[1., 0., 0., 0., 0.],


[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])

idx = cyls_map[6]
ohe_table[idx]

tensor([0., 0., 0., 1., 0.])
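PyTorch also ships a helper that builds these one-hot vectors directly from the indices, equivalent to indexing the identity matrix above (a small sketch using torch.nn.functional.one_hot):

import torch.nn.functional as F

F.one_hot(torch.tensor([idx]), num_classes=len(cyls))  # tensor([[0, 0, 0, 1, 0]])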

Even though the embeddings themselves are going to be part of the model, we still need to convert our categorical features into their
corresponding sequential indices, so we can use them to retrieve the right values from the embeddings' internal lookup table.


Instead of building dictionaries to manually encode categorical values into their sequential indices, though, we can use yet another Scikit-Learn
preprocessing utility: the OrdinalEncoder. It works in a similar fashion to the StandardScaler: you can use its fit() method so it builds the
mapping between the original values and their corresponding sequential indices, and then you can call its transform() method to actually
perform the conversion. Let's see an example of this:

from sklearn.preprocessing import OrdinalEncoder


disc_attr = ['cyl', 'year', 'origin']
encoder = OrdinalEncoder()
encoder.fit(train[disc_attr])

OrdinalEncoder()

We can check the categories found for each one of the attributes (cylinders, year, and origin, in our case):

encoder.categories_

[array([3, 4, 5, 6, 8]),
 array([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]),
 array([1, 2, 3])]

Each value in a given list will be converted into its corresponding sequential index, and that's exactly what the transform() method does:

train_cat_features = encoder.transform(train[disc_attr])
train_cat_features[:5]

array([[ 3.,  5.,  0.],
       [ 1., 11.,  1.],
       [ 4.,  8.,  0.],
       [ 2.,  8.,  1.],
       [ 3.,  9.,  0.]])

Let's take a quick look at the resulting encoding for the first row:

the first column (cylinders) is three, thus corresponding to the fourth value in the first list of categories, that is, six
the second column (year) is five, thus corresponding to the sixth value in the second list of categories, that is, 75
the third column (origin) is zero, thus corresponding to the first value in the third list of categories, that is, one

If we compare it to the original values in the first row, it's a match:

train[disc_attr].iloc[0]

cyl        6
year      75
origin     1
dtype: int64
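We can also do that check programmatically: the fitted OrdinalEncoder has an inverse_transform() method that maps the sequential indices back to the original values. A quick sketch using the arrays from above:

# Map the first five encoded rows back to their original categorical values;
# the first row should come back as [6, 75, 1], matching the dataframe above
encoder.inverse_transform(train_cat_features[:5])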

Once again, to better streamline the process, we can write a function quite similar to the previous one:

takes a Pandas dataframe, a list of column names that are categorical attributes, and an optional encoder
creates and trains a Scikit-Learn's OrdinalEncoder if one isn't provided as an argument
returns a PyTorch tensor containing the encoded categorical features and an instance of Scikit-Learn's OrdinalEncoder

def encode(df, cat_attr, encoder=None):
    cat_X = df[cat_attr].values
    if encoder is None:
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    cat_X = encoder.transform(cat_X)
    cat_X = torch.as_tensor(cat_X, dtype=torch.int)
    return cat_X, encoder

Using the above function, our encoding looks like this:

cat_data = {}
cat_data['train'], encoder = encode(train, disc_attr)
cat_data['val'], _ = encode(val, disc_attr, encoder)
cat_data['test'], _ = encode(test, disc_attr, encoder)

The resulting features are nothing but indices now. Later on, for each column in the results (which corresponds to a particular categorical
attribute) we'll use its values to retrieve their embeddings. In our example with the cyl column (the first categorical attribute), it will look like this:

emb_table(cat_data['train'][:, 0]) # cylinders is the first column (index zero)

tensor([[-2.4736, -1.1113, -0.0137,  ...,  0.0216,  0.0412,  0.1218],
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684],
        [ 0.7061,  1.5819, -0.1650,  ...,  0.6547,  0.9400,  0.2905],
        ...,
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684],
        [ 0.7061,  1.5819, -0.1650,  ...,  0.6547,  0.9400,  0.2905],
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684]],
       grad_fn=<EmbeddingBackward0>)

keyboard_arrow_down Target and Task


The target is the attribute you're trying to predict. If the target is a continuous attribute, such as fuel consumption, we're dealing with a
regression task. If the target is a categorical attribute, such as the country of origin, we're dealing with a classification task.

In our example, we're indeed trying to predict fuel consumption (the mpg attribute), so ours is a regression task. We're starting with a simple
linear regression with a single feature, that is, we'll be using only one (continuous) attribute to predict our target, fuel consumption. Of course,
later on, we'll expand our problem into a multivariate linear regression, thus including all (continuous) attributes at first, and then add the
categorical attributes to the mix while training a non-linear model in Lab 2.

For now, let's pick hp as our single feature:

# _pt stands for PyTorch, in case you're wondering :-)


hp_idx = cont_attr.index('hp')
train_target_pt = torch.as_tensor(train[['mpg']].values, dtype=torch.float32)
train_single_feature_pt = standardized_data['train'][:, [hp_idx]]
train_target_pt[:5], train_single_feature_pt[:5]

(tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]),
tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]))

import matplotlib.pyplot as plt
plt.scatter(train_single_feature_pt, train_target_pt)
plt.xlabel('Horsepower (standardized)')
plt.ylabel('Fuel Consumption - miles per gallon')
plt.title('Training Set - HP x MPG')

Text(0.5, 1.0, 'Training Set - HP x MPG')

The relationship isn't quite linear, but there's clearly an inverse correlation between a car's power and its fuel consumption, as you'd expect. A
small 50 HP car is certainly much more fuel-efficient (hence more miles per gallon) than a high-powered 200 HP sports car.
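If you'd like to put a number on that inverse relationship, you can compute the correlation between the two tensors. A small sketch, reusing the tensors created above (torch.corrcoef expects each variable as a row, so we stack the squeezed tensors):

# Build a 2 x N matrix with the feature in the first row and the target in the
# second, then compute the Pearson correlation matrix; the off-diagonal entry
# is the correlation between horsepower and mpg (expect a clearly negative value)
stacked = torch.stack([train_single_feature_pt.squeeze(),
                       train_target_pt.squeeze()])
torch.corrcoef(stacked)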

keyboard_arrow_down TensorDataset

Cool, we have two tensors now, let's use them to build a TensorDataset! Tensor datasets are one of the most basic types of datasets you'll find
in PyTorch. They simply wrap a couple of tensors containing your data - feature(s) and target(s) - so you can conveniently load your data in
mini-batches at will for training your model. We'll get back to it when we discuss PyTorch's data loader in the next section.

from torch.utils.data import TensorDataset


train_ds = TensorDataset(train_single_feature_pt, train_target_pt)

PyTorch's datasets work pretty much like Python lists. You can think of a dataset as a list of tuples, each tuple corresponding to one data point
(features, target).

You can create your own custom dataset by inheriting from the Dataset class. Datasets need to implement some basic methods, such as
__init__(self), __getitem__(self, index), and __len__(self).
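For illustration only, here is a minimal sketch of such a custom dataset, one that wraps a Pandas dataframe directly (the class and argument names are made up for this example):

from torch.utils.data import Dataset

class DataFrameDataset(Dataset):
    """Minimal sketch: a dataset backed by a Pandas dataframe."""
    def __init__(self, df, feature_cols, target_col):
        # convert to tensors once, up front, so __getitem__ stays cheap
        self.features = torch.as_tensor(df[feature_cols].values, dtype=torch.float32)
        self.target = torch.as_tensor(df[[target_col]].values, dtype=torch.float32)

    def __getitem__(self, index):
        # one data point: a (features, target) tuple, like TensorDataset returns
        return self.features[index], self.target[index]

    def __len__(self):
        return len(self.features)

# usage sketch: DataFrameDataset(train, ['hp'], 'mpg')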

If we check the source code of the TensorDataset, that's what we'll find:

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), \
            "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)


In the constructor (__init__()) method, it makes sure all tensors are of the same size, and assigns them to its tensors attribute. In the __getitem__()
method, which makes a dataset "sliceable" just like a Python list, it loops over all tensors and builds a tuple containing the index-th element of
each tensor. Finally, in the __len__() method, it simply returns the size of the first dimension of the first tensor (since it is guaranteed they're all
of the same size).

Simple enough, right? Let's retrieve a few elements from our dataset:

train_ds[:5]

(tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]),
tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]))

As expected, we got a tuple back, the first element being five data points from the first (feature) tensor, the second element being the
corresponding five data points from the second (target) tensor. It really works like a list of tuples!

Tensor datasets are as simple as they can be, but PyTorch offers many other kinds of datasets, such as the ImageFolder dataset that you can use with
your own images, as well as many built-in datasets. We'll see them in more detail in the second part of this course while tackling computer vision
tasks.

Let's create datasets for our validation and test sets as well. We'll be skipping some intermediate steps and creating tensor datasets directly
out of the pandas dataframes:

val_ds = TensorDataset(standardized_data['val'][:, [hp_idx]],
torch.as_tensor(val[['mpg']].values, dtype=torch.float32))
test_ds = TensorDataset(standardized_data['test'][:, [hp_idx]],
torch.as_tensor(test[['mpg']].values, dtype=torch.float32))

PyTorch offers plenty of built-in datasets in both computer vision and natural language processing areas.

There are datasets for image classification (e.g. CIFAR10, MNIST, SVHN), object detection, image segmentation, optical flow, stereo matching,
image pairs, image captioning, video classification and prediction. For a complete list of available datasets, please check the Datasets section
of Torchvision documentation.

There are also datasets for text classification (e.g. AG News, IMDb, MNLI, SST2), language modeling, machine translation, sequence tagging,
question answering, and unsupervised learning. For a complete list of available datasets for natural language processing, please check the
Datasets section of Torchtext documentation.

Perhaps you noticed that, so far, we've been handling "CPU" tensors only. That is actually by design: while building a dataset, you may want to
keep your data out of your precious, and expensive, GPU memory. Only the data that is going to be actively used for training in any given step - a
mini-batch of data - should be sent to the GPU.
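In practice, that means the device transfer happens inside the training loop, one mini-batch at a time. A sketch of the usual pattern (the loop is commented out because the data loader is only created in the next section):

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# typical pattern: only the current mini-batch ever leaves CPU memory
# for feature_batch, target_batch in train_loader:
#     feature_batch = feature_batch.to(device)
#     target_batch = target_batch.to(device)
#     ...  # forward pass, loss, backward pass, optimizer step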

Mini-Batches
A mini-batch is a subset of a dataset, usually drawn randomly from it, and the number of data points in a mini-batch is usually a power of two.
Typical mini-batch sizes are 32, 64, 128, etc., but, in many cases, mini-batch size may be limited by the size of the available memory. This is
especially true for large models that take up a lot of space, where sometimes it is only feasible to load one data point at a time. In these cases,
the restriction imposed by hardware may be circumvented by accumulating the results over time thus simulating a mini-batch.
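That accumulation trick is usually called gradient accumulation. Here's a minimal sketch, with hypothetical model, loader, loss_fn, and optimizer objects; dividing the loss by the number of accumulation steps keeps the summed gradients on the same scale as a genuine mini-batch:

def train_with_grad_accumulation(model, loader, loss_fn, optimizer, accum_steps=32):
    # Simulates a mini-batch of `accum_steps` data points even if `loader`
    # yields just one data point at a time (all argument names are hypothetical)
    optimizer.zero_grad()
    for i, (features, target) in enumerate(loader):
        loss = loss_fn(model(features), target) / accum_steps  # rescale the loss
        loss.backward()  # gradients keep accumulating across calls to backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()       # update using the accumulated gradients
            optimizer.zero_grad()  # reset before the next simulated mini-batch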

For now, let's draw mini-batches from our dataset using PyTorch's DataLoader!

keyboard_arrow_down DataLoaders

Data loaders can be used to randomly draw a given number of data points - the mini-batch size - out of a dataset. By default, they will return
different mini-batches every time until the underlying dataset runs out of available data points. At that point - pun very much intended - they will
start over.

The data loader is a rich class and it has many parameters. At first, we're focusing on a few of them only:

dataset: the underlying dataset it will be drawing samples from


batch_size: the number of data points in each batch returned by it
drop_last: drops the last mini-batch if there aren't batch_size data points in it
shuffle: shuffle (or not) the data

The last parameter, shuffle, is quite important. In the vast majority of cases, you should set shuffle=True for the training set, the major
exception to this rule being time series. Shuffling your data, thus ensuring there's no underlying order to it (e.g. ordered by date of creation),
makes learning more robust. Of course, in our case, we had already shuffled the data at the very start, before splitting our full dataset into training,
validation, and test sets, so shuffling at this point is redundant - but it surely doesn't hurt!

Moreover, we're ensuring the reproducibility of the results by explicitly assigning a random number generator to our data loader and setting its
seed using the manual_seed() method. This way we can control the data sampling during training.
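Putting those pieces together, a data loader for our training set could be created along these lines (a sketch; the batch size of 32 and the seed are arbitrary choices for illustration):

from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(42)  # arbitrary seed, used only to make sampling reproducible

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,
                          drop_last=True, generator=g)

# each iteration yields a tuple of (features, target) mini-batches
# next(iter(train_loader))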

Even though we did our best to ensure the reproducibility of the results, you may still find some differences in the results or in the loss curves.
PyTorch's documentation about reproducibility states the following:
