Dataset and DataLoader Class
Problems with loading the entire dataset at once
1. Memory inefficiency: the whole dataset must sit in memory even though only one batch is needed at a time.
2. Convergence: updating the model on small, shuffled mini-batches generally converges better than one update per pass over the full dataset.
Dataset and DataLoader are core abstractions in PyTorch that decouple how you
define your data from how you efficiently iterate over it in training loops.
Dataset Class
It defines:
• how many samples the dataset contains (__len__)
• how a single sample (and its label) is retrieved by index (__getitem__)
DataLoader Class
It handles batching and iteration:
• the sampler generates indices and groups them into chunks (batches)
• for each index in the chunk, data samples are fetched from the Dataset object
• the fetched samples are combined into a single batch by collate_fn
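A minimal sketch of how the two classes fit together, using a toy in-memory dataset (the tensor shapes and names here are illustrative, not from the original notes):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A toy Dataset holding features and labels in memory."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # how many samples the dataset contains
        return len(self.features)

    def __getitem__(self, idx):
        # how a single (sample, label) pair is retrieved by index
        return self.features[idx], self.labels[idx]

# 10,000 samples with 20 features each, matching the walkthrough below
features = torch.randn(10_000, 20)
labels = torch.randint(0, 2, (10_000,))

dataset = ToyDataset(features, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_data, batch_labels in dataloader:
    print(batch_data.shape, batch_labels.shape)  # torch.Size([32, 20]) torch.Size([32])
    break
```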
Imagine the entire data loading and training process for one epoch with num_workers=4:
Assumptions:
• Total samples: 10,000
• Batch size: 32
• Workers (num_workers): 4
• 10,000 / 32 = 312.5, so 312 full batches of 32 plus one final partial batch of 16 per epoch (exactly 312 batches if drop_last=True).
Workflow:
1. Sampler and Batch Creation (Main Process):
Before training starts for the epoch, the DataLoader’s sampler generates a shuffled list of all 10,000 indices. These
are then grouped into 312 batches of 32 indices each. All these batches are queued up, ready to be fetched by
workers.
2. Parallel Data Loading (Workers):
○ At the start of the training epoch, you run a training loop like:
```python
for batch_data, batch_labels in dataloader:
    # training logic
```
○ Under the hood, as soon as you start iterating over dataloader, it dispatches the first four batches of indices
to the four workers:
▪ Worker #1 loads batch 1 (indices [batch_1_indices])
▪ Worker #2 loads batch 2 (indices [batch_2_indices])
▪ Worker #3 loads batch 3 (indices [batch_3_indices])
▪ Worker #4 loads batch 4 (indices [batch_4_indices])
Each worker:
○ Fetches the corresponding samples by calling __getitem__ on the dataset for each index in that batch.
○ Applies any defined transforms and passes the samples through collate_fn to form a single batch tensor.
3. First Batch Returned to Main Process:
○ Whichever worker finishes first sends its fully prepared batch (e.g., batch 1) back to the main process.
○ As soon as the main process gets this first prepared batch, it yields it to your training loop, so your
for batch_data, batch_labels in dataloader: loop receives (batch_data, batch_labels) for the first batch.
4. Model Training on the Main Process:
○ While you are now performing the forward pass, computing loss, and doing backpropagation on the first
batch, the other three workers are still preparing their batches in parallel.
○ By the time you finish updating your model parameters for the first batch, the DataLoader likely has the
second, third, or even more batches ready to go (depending on processing speed and hardware).
5. Continuous Processing:
○ As soon as a worker finishes its batch, it grabs the next batch of indices from the queue.
○ For example, after Worker #1 finishes with batch 1, it immediately starts on batch 5. After Worker #2
finishes batch 2, it takes batch 6, and so forth.
○ This creates a pipeline effect: at any given moment, up to 4 batches are being prepared concurrently.
6. Loop Progression:
○ Your training loop simply sees:
```python
for batch_data, batch_labels in dataloader:
    # forward pass
    # loss computation
    # backward pass
    # optimizer step
```
○ Each iteration, it gets a new, ready-to-use batch without long I/O waits, because the workers have been pre-
loading and processing data in parallel.
7. End of the Epoch:
○ After all 313 iterations (312 full batches plus the final partial one), every batch has been
processed and all indices consumed, so the DataLoader has no more batches to yield.
○ The epoch ends. If shuffle=True, on the next epoch, the sampler reshuffles indices, and the whole process
repeats with workers again loading data in parallel.
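A sketch of the DataLoader configuration assumed in the walkthrough above, reusing the dataset from the earlier sketch; len(dataloader) reports the number of batches per epoch:

```python
from torch.utils.data import DataLoader

# Configuration matching the walkthrough: 10,000 samples, batch size 32, 4 workers
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
print(len(dataloader))  # 313: 312 full batches of 32 plus one final batch of 16

# With drop_last=True the incomplete final batch is discarded
loader_drop = DataLoader(dataset, batch_size=32, shuffle=True,
                         num_workers=4, drop_last=True)
print(len(loader_drop))  # 312
```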
In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from
the dataset during data loading. It controls how indices of the dataset are drawn for each
batch.
Types of Samplers
PyTorch provides several predefined samplers, and you can create custom ones:
1. SequentialSampler: draws indices in order (0, 1, 2, ...); this is what the DataLoader uses when shuffle=False.
2. RandomSampler: draws indices in a fresh random permutation each epoch; this is what shuffle=True uses under the hood.
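A short sketch of passing the built-in samplers explicitly, again reusing the dataset from the earlier sketch (note that sampler and shuffle are mutually exclusive arguments):

```python
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler

# Equivalent to shuffle=False: indices 0, 1, 2, ... in order
seq_loader = DataLoader(dataset, batch_size=32, sampler=SequentialSampler(dataset))

# Equivalent to shuffle=True: a random permutation of indices each epoch
rand_loader = DataLoader(dataset, batch_size=32, sampler=RandomSampler(dataset))

print(list(SequentialSampler(dataset))[:5])  # [0, 1, 2, 3, 4]
print(list(RandomSampler(dataset))[:5])      # e.g. [8231, 17, 9954, 402, 3368]
```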
The collate_fn in PyTorch's DataLoader is a function that specifies how to combine a list of
samples from a dataset into a single batch. By default, the DataLoader uses a simple batch
collation mechanism, but collate_fn allows you to customize how the data should be
processed and batched.
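A sketch of a custom collate_fn that pads variable-length sequences to a common length within each batch; the toy samples here are illustrative, and pad_sequence comes from torch.nn.utils.rnn:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy samples: (variable-length sequence, label) pairs
samples = [(torch.randn(n), torch.tensor(n % 2)) for n in (5, 3, 7, 2)]

def pad_collate(batch):
    # batch is a list of (sequence, label) tuples produced by __getitem__
    sequences, labels = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True)  # pad to the longest sequence
    lengths = torch.tensor([len(s) for s in sequences])
    return padded, torch.stack(labels), lengths

loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)
for padded, labels, lengths in loader:
    print(padded.shape, labels, lengths)
```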
The DataLoader class in PyTorch comes with several parameters that allow you to customize
how data is loaded, batched, and preprocessed. Some of the most commonly used and
important parameters include:
1. dataset (mandatory):
○ The Dataset from which the DataLoader will pull data.
○ Must be a subclass of torch.utils.data.Dataset that implements __getitem__ and
__len__.
2. batch_size:
○ How many samples per batch to load.
○ Default is 1.
○ Larger batch sizes can speed up training on GPUs but require more memory.
3. shuffle:
○ If True, the DataLoader will shuffle the dataset indices each epoch.
○ Helpful to avoid the model becoming too dependent on the order of samples.
4. num_workers:
○ The number of worker processes used to load data in parallel.
○ Setting num_workers > 0 can speed up data loading by leveraging multiple CPU
cores, especially if I/O or preprocessing is a bottleneck.
5. pin_memory:
○ If True, the DataLoader will copy tensors into pinned (page-locked) memory before
returning them.
○ This can improve GPU transfer speed and thus overall training throughput,
particularly on CUDA systems.
6. drop_last:
○ If True, the DataLoader will drop the last incomplete batch if the total number of
samples is not divisible by the batch size.
○ Useful when exact batch sizes are required (for example, in some batch
normalization scenarios).
7. collate_fn:
○ A callable that processes a list of samples into a batch (the default simply stacks
tensors).
○ Custom collate_fn can handle variable-length sequences, perform custom batching
logic, or handle complex data structures.
8. sampler / batch_sampler:
○ sampler defines the strategy for drawing samples (e.g., for handling imbalanced
classes, or custom sampling strategies).
○ batch_sampler works at the batch level, controlling how batches are formed.
○ Typically, you don’t need to specify these if you are using batch_size and shuffle.
However, they provide lower-level control if you have advanced requirements.
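A sketch that pulls the common parameters together in one call; the specific values below are illustrative defaults, not recommendations:

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,            # 1. a Dataset implementing __getitem__ and __len__
    batch_size=64,      # 2. samples per batch
    shuffle=True,       # 3. reshuffle indices every epoch
    num_workers=4,      # 4. parallel worker processes for loading
    pin_memory=True,    # 5. page-locked memory for faster CPU-to-GPU copies
    drop_last=True,     # 6. discard the final incomplete batch
    # collate_fn=...    # 7. custom batching logic, if needed
    # sampler=...       # 8. custom sampling strategy (mutually exclusive with shuffle)
)
```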