
Distilling Datasets Into Less Than One Image

Asaf Shul* Eliahu Horwitz* Yedid Hoshen


School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel
https://vision.huji.ac.il/podd/
{asaf.shul, eliahu.horwitz, yedid.hoshen}@mail.huji.ac.il

arXiv:2403.12040v1 [cs.CV] 18 Mar 2024

[Figure 1 panels: left, Classic Dataset Distillation (1 image-per-class, 32 × 32 images); right, Poster Dataset Distillation (PoDD) (0.4 image-per-class, a single 202 × 202 poster).]
Figure 1: Poster Dataset Distillation (PoDD): We propose PoDD, a new dataset distillation setting for a tiny, under 1 image-per-class (IPC) budget. In this example, the standard method attains an accuracy of 35.5% on CIFAR-100 with approximately 100k pixels, while PoDD achieves an accuracy of 35.7% with less than half the pixels (roughly 40k).

Abstract
Dataset distillation aims to compress a dataset into a much smaller one so that a model trained on
the distilled dataset achieves high accuracy. Current methods frame this as maximizing the distilled
classification accuracy for a budget of K distilled images-per-class, where K is a positive integer.
In this paper, we push the boundaries of dataset distillation, compressing the dataset into less than
an image-per-class. It is important to realize that the meaningful quantity is not the number of
distilled images-per-class but the number of distilled pixels-per-dataset. We therefore propose Poster
Dataset Distillation (PoDD), a new approach that distills the entire original dataset into a single poster.
The poster approach motivates new technical solutions for creating training images and learnable
labels. Our method can achieve comparable or better performance with less than an image-per-class
compared to existing methods that use one image-per-class. Specifically, our method establishes
a new state-of-the-art performance on CIFAR-10, CIFAR-100, and CUB200 using as little as 0.3
images-per-class.

Equal contribution

[Figure 2 panels, left to right: Original Dataset (all images, ~60M pixels); Subset/Coreset (K% images-per-class, e.g., K = 20% gives ~12M pixels); Dataset Distillation (K images-per-class, e.g., K = 10 gives ~1M pixels); Poster Dataset Distillation (K < 1 image-per-class, e.g., K = 0.3 gives ~30K pixels). IPC = images-per-class.]
Figure 2: Dataset Compression Scale: We show increasingly more compressed methods from left to right. The original dataset contains all of the training data and does not perform any compression. Coreset methods select a subset of the original dataset, without modifying the images. Dataset distillation methods compress an entire dataset by synthesizing K ∈ N+ images-per-class (IPC). Our method, Poster Dataset Distillation (PoDD), distills an entire dataset into a single poster that achieves the same performance as 1 IPC while using as little as 0.3 IPC.

1 Introduction

Deep-learning methods require large training datasets to achieve high accuracy. Dataset distillation [25] allows distilling
large datasets into smaller ones so that training on the distilled dataset results in high accuracy. Concretely, dataset
distillation methods synthesize the K images-per-class (IPC) that are most relevant for the classification task. Dataset
distillation has been very successful, achieving high accuracy with as little as a single image-per-class.
In this paper, we ask: “can we go lower than one image-per-class?” Existing dataset distillation methods are unable
to do this as they synthesize one or more distinct images for each class. Assuming there are n classes, such methods
would require distilling at least n images. On the other hand, using less than 1 IPC implies that several classes share the
same image, which current methods do not allow. We therefore propose Poster Dataset Distillation (PoDD), which
distills an entire dataset into a single larger image, that we call a poster. The benefit of the poster representation is the
ability to use patches that overlap between the classes. We can set the size of the poster so it has significantly fewer
pixels than n images, thereby enabling distillation with less than 1 IPC. We find that a correctly distilled poster is
sufficient for training a model with high accuracy. See fig. 2 for an overview of different dataset compression methods.
To illustrate the idea of a poster, consider CIFAR-100 [11], where each image is of size 32 × 32 pixels. Current methods
synthesize images independently and thus must use at least 1 IPC (see fig. 1 (left)). Choosing 1 IPC for CIFAR-100
entails using 100 images, each of size 32 × 32 pixels. In contrast, PoDD synthesizes a single poster shared between
all classes. During distillation, we optimize all the pixels so that a classifier trained on the resulting dataset achieves
maximal accuracy. For example, to achieve 0.4 IPC, we represent the entire dataset as a single poster of size 202 × 202
pixels (see fig. 1 (right)). This has about the same number of pixels as 40 images, each of size 32 × 32. The number of
effective IPCs is therefore directly given by the size of the poster.
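As a quick arithmetic check (an illustration, not code from the paper), the effective IPC of a poster is simply its pixel count divided by the pixel count of one image per class:

```python
# Illustrative helper (not from the paper's code): effective IPC of a poster.
def effective_ipc(poster_h, poster_w, img_h=32, img_w=32, num_classes=100):
    return (poster_h * poster_w) / (img_h * img_w * num_classes)

print(round(effective_ipc(202, 202), 2))  # 0.4 -> a 202x202 poster is ~0.4 IPC for CIFAR-100
```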
To distill a poster, we first initialize all pixels with random values. We transform the poster into a dataset in a
differentiable way, by extracting overlapping patches, each with the same size as an image in the source dataset (e.g.,
32 × 32 for CIFAR-100). During distillation, we optimize this set of overlapping poster patches and propagate the
(overlapping) gradients from the distillation algorithm back to the poster. The optimization objective is to synthesize a
poster such that a classifier trained on the dataset extracted from it will reach high classification accuracy. This process
requires a label for every patch; we therefore propose Poster Dataset Distillation Labeling (PoDDL), a method for
poster labeling that supports both fixed and learned labels.


Since classes can now share pixels, their order within the poster matters as it implies which classes share pixels with
each other. It is thus important to find the optimal ordering of classes on the poster. To this end, we propose Poster Class
Ordering (PoCO), an algorithm that uses CLIP [19] text embeddings to order the classes semantically and efficiently.
Overall, using less than 1 IPC, PoDD can match or improve the performance that previous methods achieved with at
least 1 IPC. Indeed, sometimes PoDD can outperform competing methods using as low as 0.3 IPC. Moreover, PoDD
sets a new state-of-the-art for 1 IPC on CIFAR-10, CIFAR-100, and CUB200.
To summarize, our main contributions are:

1. Proposing PoDD, a new dataset distillation setting for tiny, less than 1 IPC budgets.
2. Developing a method to perform PoDD that consists of a class ordering algorithm (PoCO) and a labeling
strategy (PoDDL).
3. Performing extensive experiments that demonstrate the effectiveness of PoDD with as low as 0.3 IPC and
achieving a new 1 IPC SoTA across multiple datasets.

2 Related Works
Dataset distillation, introduced by Wang et al. [25], aims to compress an entire dataset into a smaller, synthetic one. The
goal is that a model trained on the distilled dataset will achieve accuracy similar to a model trained on the original dataset.
As highlighted by [1, 10, 20], dataset distillation shares similarities with coreset selection. Coreset selection identifies a
representative subset of samples from the training set that can be used to train a model to the same accuracy. Unlike
coreset selection, the generated synthetic samples of dataset distillation provide flexibility and improved performance
through continuous gradient-based optimization techniques. Dataset distillation methods can be categorized into 4 main
groups: i) Meta-Model Matching [8, 15, 18, 25, 30] minimizes the discrepancy between the transferability of models trained on the distilled data and those trained on the original dataset. ii) Gradient Matching [14, 27, 29], proposed by Zhao et al. [29], performs one-step distance matching between a network trained on the target dataset and the same network trained on the distilled dataset. This avoids unrolling the inner loop of Meta-Model Matching methods. iii) Trajectory Matching [2, 4], proposed by Cazenavette et al. [2], focuses on matching the training trajectories of models trained on the distilled dataset and the original dataset. iv) Distribution Matching [13, 24, 28], introduced by Zhao and Bilen
[28], solves a proxy task via a single-level optimization, directly matching the distribution of the original dataset and
the distilled dataset. See [21] for an in-depth explanation and comparisons of the various distillation methods. Common
to all these methods is the use of at least one IPC. This paper proposes a method for distilling a dataset into less than 1
IPC.

3 Preliminaries
Many methods tackle dataset distillation as a bi-level optimization problem. In this setup, the inner loop trains a model with weights θ on a distilled dataset D_syn. The outer loop optimizes the pixels of the distilled dataset D_syn so that a model trained on D_syn attains maximal accuracy on the original dataset D. Let L_D(θ) denote the average value of the objective function of model θ on the dataset D. Formally, the optimization problem can be described as follows:

$$\underbrace{\arg\min_{D_{syn}} L_D(\theta^*)}_{\text{Outer loop}} \quad \text{s.t.} \quad \theta^* = \underbrace{\arg\min_{\theta} L_{D_{syn}}(\theta)}_{\text{Inner loop}} \tag{1}$$

We consider the case where the inner-loop optimization consists of T_end SGD steps. The most common solution is backpropagation through time (BPTT), which unrolls the inner-loop SGD optimization for T_end steps. It then uses a computationally demanding backpropagation calculation to compute the gradient of the loss with respect to the distilled dataset D_syn:

$$\underbrace{\arg\min_{D_{syn}} \mathbb{E}_{\theta_0 \sim P_\theta}\left[L_D(\theta_{T_{end}})\right]}_{\text{Outer loop}} \quad \text{s.t.} \quad \underbrace{\theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta L_{D_{syn}}(\theta_t)}_{\text{Inner loop unrolling}} \tag{2}$$

Backpropagating through T_end timesteps is infeasible for large values of T_end; therefore, many methods propose ways to reduce the computational cost. Here, we use RaT-BPTT [8], which runs the inner loop for a random number of SGD steps T_end ∼ Random[ΔT, ΔT + 1, ..., T]. It then approximates the gradient with respect to the distilled dataset by only backpropagating through the final ΔT ≪ T steps. We chose RaT-BPTT because it achieves the top performance across many dataset distillation benchmarks.
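To make the truncated unrolling concrete, the following is a minimal sketch of one RaT-BPTT-style outer step, assuming PyTorch ≥ 2.0 (for torch.func.functional_call). The function and argument names (distill_step, make_model, etc.) are ours, and the actual implementation differs in details such as batching and label handling; this is an illustration, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def distill_step(D_syn, Y_syn, X_real, Y_real, make_model, T, delta_T, inner_lr, outer_opt):
    """One outer step: unroll T_end inner SGD steps on the distilled data, keep the
    autograd graph only for the final delta_T steps, then backpropagate the
    real-data loss into the distilled pixels (and learned soft labels)."""
    T_end = random.randint(delta_T, T)            # T_end ~ Random[delta_T, ..., T]
    model = make_model()                          # freshly initialized student network
    names = [n for n, _ in model.named_parameters()]
    params = [p.detach().clone().requires_grad_(True) for p in model.parameters()]

    def forward(ps, x):                           # stateless forward with explicit params
        return torch.func.functional_call(model, dict(zip(names, ps)), (x,))

    for t in range(T_end):
        track = t >= T_end - delta_T              # only the last delta_T steps stay in the graph
        inner_loss = F.cross_entropy(forward(params, D_syn), Y_syn)
        grads = torch.autograd.grad(inner_loss, params, create_graph=track)
        params = [p - inner_lr * g for p, g in zip(params, grads)]
        if not track:                             # truncate: cut the graph before the window
            params = [p.detach().requires_grad_(True) for p in params]

    outer_loss = F.cross_entropy(forward(params, X_real), Y_real)   # L_D(theta_{T_end})
    outer_opt.zero_grad()
    outer_loss.backward()                         # gradients reach D_syn / Y_syn via the window
    outer_opt.step()
    return outer_loss.item()
```

Here outer_opt is an optimizer over the distilled data (and, for PoDD, the poster pixels and label tensor), so each call performs one outer-loop update.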


[Figure 3 panels. Distillation: (a) randomly initialized poster, (b) overlapping patches with soft labels, (c) distillation algorithm, (d) distilled poster. Inference: (e) distilled poster, (f) dataset from overlapping patches with soft labels; patch labels shown: Truck, Plane, Bird, Deer, Cat / Car, Boat, Frog, Horse, Dog.]

Figure 3: PoDD Overview: We propose PoDD, a new dataset distillation setting for under 1 image-per-class. We start by initializing a random poster (a); during distillation, we optimize overlapping patches and soft labels (b-c). The final distilled poster has fewer pixels than the combined pixels of the individual images (d). During inference, we extract overlapping patches and soft labels from the distilled poster and use them to train a downstream model (e-f). PoDD achieves comparable or better accuracy to current methods while using as little as a third of the pixels.

4 PoDD: Poster Dataset Distillation


Our goal is to perform dataset distillation with less than 1 IPC. Our main insight is that sharing pixels between classes
can be effective, as this would make better use of redundant pixels. Clearly, this requires more than one class to share an
image (the pigeonhole principle [6] states this formally). In this section, we propose Poster Dataset Distillation (PoDD)
which provides the methods to realize this idea.

4.1 A Shared Poster Representation

The key idea in this work is to distill an entire dataset into a single larger image that we call a poster. The poster can
be of arbitrary height dh and width dw , leading to a total of dh × dw pixels; posters with fewer pixels are said to be
more compact. Furthermore, we define a fixed procedure for converting a poster into a distilled dataset. Our procedure
extracts multiple overlapping patches at a fixed stride; each extracted patch is equal in size to the original images (see
fig. 3 for a schematic of the procedure). We employ the same stride for both rows and columns and denote the total
number of extracted patches by p. Consequently, this yields a dataset of images, some of which share pixels.
We initialize the poster with standard Gaussian noise and optimize its pixels end-to-end through both the dataset
expansion described above and the inner-loop bi-level optimization described in section 3. The size of the distilled
dataset is measured by the total number of pixels in the poster, allowing us to compare with previous approaches that
used one or more IPC. We provide pseudocode for PoDD in algorithm 1.
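As an illustration of this fixed expansion, here is a minimal PyTorch sketch (our own naming, not the released code) that extracts a grid of overlapping, image-sized patches from a poster in a differentiable way:

```python
import torch

def poster_to_patches(poster, patch_h, patch_w, ny, nx):
    """poster: (C, d_h, d_w) tensor with requires_grad=True.
    Returns (ny*nx, C, patch_h, patch_w); gradients flow back into the poster."""
    C, d_h, d_w = poster.shape
    # Evenly spaced top-left corners; neighboring patches overlap whenever the
    # effective stride is smaller than the patch size.
    ys = torch.linspace(0, d_h - patch_h, ny).round().long().tolist()
    xs = torch.linspace(0, d_w - patch_w, nx).round().long().tolist()
    patches = [poster[:, y:y + patch_h, x:x + patch_w] for y in ys for x in xs]
    return torch.stack(patches)

# Example: a 202x202 CIFAR-100 poster (~0.4 IPC) expanded into 20x20 = 400 patches of 32x32.
poster = torch.randn(3, 202, 202, requires_grad=True)
patches = poster_to_patches(poster, 32, 32, 20, 20)   # shape: (400, 3, 32, 32)
```

Because the patches are produced by plain slicing, gradients from the distillation loss accumulate on the pixels that are shared between overlapping patches.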


Algorithm 1 PoDD: Pseudocode using PoDDL learned labels

Input: Alg: Distillation algorithm.
Input: D: Dataset with classes C.
Input: d_h, d_w: Poster size.
Input: p: Number of overlapping patches.
1: P ∼ N(0, 1)^(d_h × d_w)          # Initialize the poster with Gaussian noise
2: O ← PoCO(C)                      # Initialize the class order
3: Y ← PoDDL(O, p)                  # Initialize the distilled labels array
4: for each distillation step do
5:     D_syn := {(patch, label) pairs: overlapping patches from P with labels from Y}
6:     P, Y ← Alg(D, D_syn)         # Distill one step
7: end for
8: return P, Y

4.2 PoCO: Poster Class Ordering

The poster representation relies on shared pixels between neighboring classes. To maximize the effectiveness of the
shared pixels, it is therefore important to establish the optimal structure of neighboring classes. We hypothesize that a
poster would be more effective when pixels are shared between semantically related classes (e.g., Man, Woman, Boy vs.
Plate, Tree, Bus).
We propose a method for approximating the optimal class neighborhood structure. First, we extract an embedding for
each class name using the CLIP [19] text-encoder E. Given the set of embeddings, we calculate the pairwise distance
matrix between all classes. Using this pairwise class distance, we design a greedy procedure for placing the classes on
a rectangular grid O of size o_x × o_y, denoting the spatial positions of the classes. Let C be the set of classes and C_p be the set of already placed classes. We traverse the grid in a Zigzag order [3], and at each step place a class as follows:

$$O_{i,j} = \min_{c \in C \setminus C_p} \sum_{m \in \{O_{i-1,j},\, O_{i,j-1}\}} \left(1 - \frac{E(m) \cdot E(c)}{\|E(m)\| \cdot \|E(c)\|}\right), \qquad i \in [o_x],\ j \in [o_y]$$

Intuitively, at each step, we place the remaining class that has the lowest distance from all its already-placed neighbors. We visualize this Zigzag traversal in fig. 4 and summarize the PoCO algorithm in algorithm 2; fig. 7 shows an example PoCO tiling for CIFAR-100.

Algorithm 2 PoCO: Pseudocode for PoCO class ordering

Input: E: CLIP text encoder.
Input: o_x, o_y: Class grid dimensions.
Input: C: Class list of the target dataset.
1: O ← 0^(o_x × o_y)
2: O[0,0] ← C[0]
3: C ← C \ {C[0]}
4: G ← Distance matrix using E embeddings
5: for i, j in o_x, o_y (Zigzag traverse) do
6:     ngbr ← {(i, j) neighboring classes in O}
7:     c ← C[arg min_ngbr(G)]
8:     O[i,j] ← c
9:     C ← C \ {c}
10: end for
11: return O

Figure 4: PoCO Tiling: (a) Initialize an empty grid O, (b) traverse O in a Zigzag order, (c) tile O in a greedy manner using CLIP text embeddings.
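For concreteness, below is a minimal sketch of the greedy tiling. It assumes a precomputed (n, d) matrix of class-name embeddings (e.g., from CLIP's text encoder) and considers all already-placed grid neighbors of a cell; the function and variable names (poco_order, etc.) are ours, not the released implementation.

```python
import numpy as np

def poco_order(class_names, emb, ox, oy):
    """Greedily tile an oy x ox grid with class indices using a zigzag traversal.
    emb: (n, d) class-name embeddings; the distance is cosine distance."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - emb @ emb.T                              # pairwise cosine distances
    grid = np.full((oy, ox), -1, dtype=int)               # -1 marks an empty cell
    remaining = set(range(len(class_names)))

    # Zigzag: left-to-right on even rows, right-to-left on odd rows.
    cells = [(i, j if i % 2 == 0 else ox - 1 - j) for i in range(oy) for j in range(ox)]
    for i, j in cells:
        nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        placed = [grid[a, b] for a, b in nbrs
                  if 0 <= a < oy and 0 <= b < ox and grid[a, b] >= 0]
        if not placed:                                    # first cell: place the first class
            c = min(remaining)
        else:                                             # class closest to its placed neighbors
            c = min(remaining, key=lambda k: sum(dist[m, k] for m in placed))
        grid[i, j] = c
        remaining.remove(c)
    return [[class_names[k] for k in row] for row in grid]
```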


[Figure 5 panels: (a) poster, (b) label array Y, (c) extracted window, (d) soft label.]
Figure 5: PoDDL Extraction: Each poster patch has a corresponding patch in the label array (a-b). We compute the poster patch label by extracting a window along the channels of the label array (c). To obtain the final soft label for a given poster patch, we pool and normalize the extracted label window, resulting in a soft label vector (d). PoDDL supports both fixed and learned labels.

4.3 PoDDL: Poster Dataset Distillation Labeling

Having initialized the poster and the class order matrix O, we now describe our labeling strategy. Previous approaches
use one or more images-per-class; hence, they can simply assign a single label per image. However, using a single label for the entire poster is not a good option; instead, we assign a soft label [22] vector to each overlapping patch. We therefore design a poster-oriented soft labeling strategy that supports both fixed and learned labels; see fig. 5 for an overview.
Fixed labels. We upsample the class order matrix O to the size of the poster dh × dw . For each overlapping patch, we
extract its corresponding class label window. We compute its majority class and use it as the one-hot label for the patch.
In the case of ties, we use a soft label with equal probabilities for the majority classes.
Learned labels. As our method extracts an arbitrary number of overlapping patches, learning a soft label for each patch
would require more parameters than previous approaches. To keep the number of parameters constant, we learn a parameter tensor of the same size as in previous works and interpolate it to each overlapping window.
Concretely, we learn a label tensor Y of size ox × oy × n and spatially upscale it to the shape of the poster using nearest
neighbor interpolation. The final size of Y is dh × dw × n. For each overlapping patch, we extract the corresponding
label window and average pool it. To achieve a valid label distribution, we L1 normalize the resulting vector. We use
this vector as the learned soft label of the window. After each gradient step, we clip negative values of Y to zero to
avoid negative probabilities. Unlike the fixed labels, the learned labels are optimized alongside the distillation process.
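A rough sketch of the learned-label extraction follows (our own naming, assuming PyTorch; the released code may differ): the label tensor is upscaled with nearest-neighbor interpolation, and each patch's window is average-pooled and L1-normalized.

```python
import torch
import torch.nn.functional as F

def patch_soft_labels(Y, d_h, d_w, patch_boxes):
    """Y: learnable label tensor of shape (o_y, o_x, n); patch_boxes: list of
    (top, left, height, width) windows, one per overlapping poster patch.
    Returns a (p, n) tensor of soft labels."""
    # Nearest-neighbor upscale to one label map per class: (n, d_h, d_w).
    Y_full = F.interpolate(Y.permute(2, 0, 1).unsqueeze(0), size=(d_h, d_w), mode="nearest")[0]
    labels = []
    for top, left, h, w in patch_boxes:
        window = Y_full[:, top:top + h, left:left + w]        # (n, h, w) label window
        pooled = window.mean(dim=(1, 2))                      # average pool over the window
        pooled = pooled.clamp(min=0)                          # safeguard; Y itself is clipped after each step
        labels.append(pooled / pooled.sum().clamp(min=1e-8))  # L1 normalize to a distribution
    return torch.stack(labels)
```

Since interpolation, pooling, and normalization are differentiable, the outer-loop gradients flow back into Y, which is optimized alongside the poster.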

5 Experiments

5.1 Experimental Setting

Datasets. We evaluate PoDD on four datasets commonly used to benchmark dataset distillation methods: i) CIFAR-10:
10 classes, 50k images of size 32 × 32 × 3 [11]. ii) CIFAR-100: 100 classes, 50k images of size 32 × 32 × 3 [11].
iii) CUB200: 200 classes, 6k images of size 32 × 32 × 3 [26]. iv) Tiny-ImageNet: 200 classes, 100k images of size
64 × 64 × 3 [12].
Distillation Method. For the dataset distillation algorithm, we use RaT-BPTT [8], a recent method that achieved SoTA
performance on CIFAR-10, CIFAR-100, and CUB200 by a wide margin. In particular, RaT-BPTT’s architecture uses
three convolutional layers for 32 × 32 datasets and four layers for 64 × 64 datasets.
Baselines. Our baselines can be divided into two groups: i) Inner-Loop: BPTT [5], empirical feature kernel (FRePO) [30], and reparameterized convex implicit gradient (RCIG) [16], and ii) Modified Objectives: gradient matching with augmentations (DSA) [27], distribution matching (DM) [28], trajectory matching (MTT) [2], flat trajectory distillation (FTD) [7], and TESLA [4].
We use the results reported by the baselines.
Evaluation. Following the protocol of [5, 27], we evaluate the distilled poster using a set of 8 different randomly
initialized models with the same ConvNet [9] architecture used by DSA, DM, MTT, RaT-BPTT, and others. The
architecture includes convolutional layers of 128 filters with kernel size 3 × 3, followed by instance normalization [23],
ReLU [17] activation, and an average pooling layer with kernel size 2 × 2 and stride 2. We report the mean and standard
deviation across the 8 random models. We evaluate two IPC regimes:
Less than one IPC. We compute results for IPC values in the range K ∈ [0.3, 0.4, ..., 1). Our criterion for success is whether PoDD with K IPC performs comparably to or better than RaT-BPTT with 1 IPC. To comply with previous baselines, we use PoDDL with the same number of label parameters as previous methods use with 1 IPC.


Table 1: Less than One Image-Per-Class: PoDD with less than one IPC (image-per-class) often outperforms state-of-the-art methods with 1 IPC, in some cases with as little as 0.3 IPC. In parentheses, the relative performance drop compared to the 1 IPC results.

Method                    IPC ↓   CIFAR-10 ↑       CIFAR-100 ↑      CUB200 ↑        T-ImageNet ↑
RaT-BPTT [8]              1.0     53.2±0.7         35.3±0.4         13.8±0.3        20.1±0.3
PoDD (Ours)               1.0     59.1±0.5         38.3±0.2         16.2±0.3        20.0±0.3
PoDD (Ours)               0.9     58.4±0.5 (1%)    37.4±0.2 (2%)    15.2±0.4 (6%)   19.5±0.2 (2%)
                          0.8     56.7±0.7 (4%)    37.3±0.1 (3%)    15.6±0.3 (4%)   19.0±0.2 (5%)
                          0.7     54.6±0.5 (8%)    37.0±0.2 (3%)    15.0±0.3 (8%)   18.6±0.1 (7%)
                          0.6     50.6±0.3 (15%)   36.6±0.3 (5%)    15.1±0.2 (7%)   18.8±0.2 (6%)
                          0.5     49.5±0.5 (16%)   36.0±0.3 (6%)    15.0±0.3 (8%)   18.7±0.1 (7%)
                          0.4     47.1±0.4 (20%)   35.7±0.2 (7%)    15.0±0.2 (7%)   18.4±0.2 (8%)
                          0.3     42.3±0.3 (28%)   34.7±0.2 (10%)   14.8±0.5 (9%)   18.4±0.1 (8%)
Full Dataset (No Dist.)   All     83.5±0.2         55.3±0.3         20.1±0.3        37.6±0.5

One IPC. To properly compare PoDD with existing methods, we also evaluate it with the same total number of pixels used by our baselines (1 IPC). Since PoDD still uses overlapping patches, this evaluates the impact of compressing inter-class redundancies in the shared pixels, which the baselines are unable to do.
Implementation Details. We use the same distillation hyper-parameters as RaT-BPTT [8], except for the batch
sizes. To fit the distillation into a single GPU (we use an NVIDIA A40), we use the maximal batch size we can fit into
memory for a given dataset (see exact breakdown below). Since the optimization is bi-level, distillation methods have
two batch types, one for the distilled data which we denote by bsd , and one for the original dataset which we denote by
bs.
In addition to the K IPC parameter, we need to control the degree of patch overlap. In other words, given a dataset with
n classes and after fixing K (i.e., once we fix the size of the poster), we need to decide on the number of overlapping
patches to divide the poster into. As the number of classes and image resolution varies between the datasets, we use a
different p for each one. Concretely, we use: i) CIFAR-10: p = 96(16×6) patches, bsd = 96, bs = 5000, 4k epochs. ii)
CIFAR-100: p = 400(20×20) patches, bsd = 50, bs = 2000, 2k epochs. iii) CUB200: p = 1800(60×30) patches, bsd =
200, bs = 3000, 8k epochs. iv) Tiny-ImageNet: p = 800(40×20) patches, bsd = 30, bs = 500, 500 epochs.
We use the learned labels variant of PoDDL for all of our experiments, except for CIFAR-10 with K ∈ [0.7, 0.8, 0.9, 1.0] IPC, where we use the fixed labels variant (the learned labels did not provide additional benefit in these cases). We
use a learning rate of 0.001 for CIFAR-10, CIFAR-100, and CUB200. To fit Tiny-ImageNet into a single GPU we use a
much smaller batch size and a learning rate of 0.0005.

5.2 Results

Less than one IPC. We now test our initial question: “can we go lower than one image-per-class?” Using PoDD, we
show that across all four datasets, we can go much lower than 1 IPC, and for 3 out of the 4 datasets we even maintain on-par performance with the SoTA baseline. As hypothesized, using a poster that shares pixels between multiple classes allows
us to reduce redundancies between classes in the distilled patches (See table 1). This effect is intensified when distilling
datasets with a large number of classes, e.g., for CIFAR-100 we can use 0.4 IPC and for CUB200 we can use as little as
0.3 IPC and still outperform the baseline method.
One IPC. Having shown the feasibility of distilling a dataset into less than one IPC, we now quantitatively evaluate the
benefit of the poster representation. To this end, we use the 1 IPC setting which allows us to decouple the pixel count
from the pixel sharing. Essentially, we are investigating whether the pixel-sharing in our poster can boost performance,
even when the number of pixels matches our baseline. Our method outperforms the state-of-the-art for CIFAR-10,
CIFAR-100, and CUB200, setting a new SoTA for 1 IPC dataset distillation (See table 2).

5.3 Ablations

Class order ablation. We ablate the impact of the class ordering on the performance of PoDD on CIFAR-10. We
first compute the distillation performance after 250 distillation steps with 0.3 IPC for 5 random class orderings. The
score of each ordering is the inverse of the sum of distances between all neighboring class pairs. The distance matrix is


Table 2: One Image-Per-Class: Performance of PoDD under the 1 image-per-class (IPC) setting compared to SoTA dataset distillation methods across 4 datasets. PoDD sets a new SoTA for CIFAR-10, CIFAR-100, and CUB200. On Tiny-ImageNet, PoDD achieves comparable results to the underlying distillation method it uses (RaT-BPTT).

                      Method          CIFAR-10 ↑   CIFAR-100 ↑   CUB200 ↑   T-ImageNet ↑   Average ↑
Inner Loop            BPTT [5]        49.1±0.6     21.3±0.6      -          -              -
                      FRePO [30]      45.6±0.1     26.3±0.1      -          16.9±0.1       -
                      RCIG [16]       49.6±1.2     35.5±0.7      -          22.4±0.3       -
                      RaT-BPTT [8]    53.2±0.7     35.3±0.4      13.8±0.3   20.1±0.3       30.6±0.4
Modified Objectives   DSA [27]        28.8±0.7     13.9±0.3      1.3±0.1    6.6±0.2        12.7±0.3
                      DM [28]         26.0±0.8     11.4±0.3      1.6±0.1    3.9±0.2        10.8±0.3
                      MTT [2]         46.3±0.8     24.3±0.3      2.2±0.1    8.8±0.3        20.4±0.4
                      FTD [7]         46.8±0.3     25.2±0.2      -          10.4±0.3       -
                      TESLA [4]       48.5±0.8     24.8±0.4      -          -              -
                      PoDD (Ours)     59.1±0.5     38.3±0.2      16.2±0.3   20.0±0.3       33.4±0.3
Full Dataset (No Distillation)        83.5±0.2     55.3±0.3      20.1±0.3   37.6±0.5       49.1±0.3

Number of Patches (o_x × o_y)    Test Accuracy (500 epochs)
10 (5×2)                         45.15%
24 (8×3)                         47.28%
40 (10×4)                        56.77%
60 (12×5)                        54.14%
96 (16×6)                        56.73%
126 (18×7)                       55.55%
160 (20×8)                       57.61%

Figure 6: Number of Patches Ablation: We ablate the effect of the patch overlap on the accuracy (CIFAR-10, 1 IPC, 500 distillation steps). Using 10 patches, which is the number of classes (i.e., no patch overlap), results in the lowest accuracy. When increasing the number of patches beyond 24, the results improve significantly.

defined in the same way as in PoCO, i.e., using the embeddings of the CLIP text encoder. We find that the class ordering
can indeed impact the performance of the distilled poster, with a correlation coefficient of 0.76 between the score of the
ordering and the accuracy the distilled poster achieves. This correlation motivates PoCO’s search for the optimal class
ordering.
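A minimal sketch of this ordering score (illustrative names, not the paper's code) over a grid of class indices and a pairwise embedding-distance matrix:

```python
import numpy as np

def ordering_score(grid, dist):
    """grid: (oy, ox) array of class indices; dist: (n, n) pairwise distance matrix.
    Returns the inverse of the summed distances between adjacent class pairs."""
    total = 0.0
    oy, ox = grid.shape
    for i in range(oy):
        for j in range(ox):
            if j + 1 < ox:
                total += dist[grid[i, j], grid[i, j + 1]]   # horizontal neighbor
            if i + 1 < oy:
                total += dist[grid[i, j], grid[i + 1, j]]   # vertical neighbor
    return 1.0 / total
```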
Patch number ablation. We ablate the role of the amount of overlap between patches of the poster (i.e., the number of patches for a given poster size and dataset). To study this, we use CIFAR-10 at 1 IPC and run PoDD for 500 steps multiple times, each with a progressively increasing number of patches. As can be seen in fig. 6, using the same number
of patches as the number of classes (i.e., no overlap between patches) results in the lowest score; this is expected as this
is exactly the RaT-BPTT baseline. When increasing the number of patches, we observe that beyond a certain patch
number threshold, the results improve drastically. This demonstrates the significance of the poster representation and
the use of overlapping patches. Since the number of patches has a direct effect on the distillation time and the training
time of downstream models, we use a small number of patches for the larger datasets and a larger number of patches for
the smaller datasets.

6 Discussion and Future Work


Beyond the exciting result of distilling a dataset into less than one image, PoDD presents a new setting and distillation
representation that opens up new and intriguing research problems.
Class ordering algorithm. As shown in section 5.3, the ordering of the classes within a poster is strongly correlated
with the performance of the distilled poster. We proposed PoCO, a greedy algorithm for choosing a class ordering.
In fig. 7 we show an example ordering for CIFAR-100; as can be seen, the classes are separated into semantically


[Figure 7 grid: PoCO ordering of the 100 CIFAR-100 classes on a 10 × 10 grid.]
Apple, Rose, Road, Sea, Shark, Dolphin, Dinosaur, Crocodile, Lizard, Leopard
Man, Woman, Bed, Bus, Whale, Elephant, Camel, Turtle, Lobster, Caterpillar
Boy, Girl, Couch, Chair, Plate, Mouse, Lion, Crab, Rabbit, Squirrel
Baby, Tank, Tractor, Table, Bowl, House, Tiger, Spider, Kangaroo, Seal
Ray, Train, Motorcycle, Bridge, Cup, Lamp, Butterfly, Orchid, Skunk, Otter
Can, Plain, Bicycle, Cattle, Clock, Rocket, Sunflower, Palm Tree, Pine Tree, Lawn Mower
Snake, Mountain, Bottle, Castle, Cloud, Trout, Tulip, Oak Tree, Willow Tree, Streetcar
Worm, Wolf, Bear, Forest, Mushroom, Poppy, Telephone, Maple Tree, Flatfish, Skyscraper
Snail, Fox, Beaver, Hamster, Shrew, Sweet Pepper, Wardrobe, Aquarium Fish, Cockroach, Orange
Beetle, Bee, Raccoon, Possum, Porcupine, Pickup Truck, Television, Keyboard, Pear, Chimpanzee

Figure 7: PoCO Ordering: An example output of PoCO for CIFAR-100. The classes are separated into semantically meaningful clusters (e.g., trees, humans, vehicles). We colored semantically related clusters manually.

[Figure 8: sketched annotations over the distilled poster, marking Car, Truck, Horses, and Dog.]
Figure 8: Distilled Poster Semantics: We illustrate some of the semantics captured by a CIFAR-10, 1 IPC poster by sketching over the distilled poster. The poster is of dimension 5 × 2, with the top row containing Truck, Plane, Bird, Deer, Cat, and the bottom row containing Car, Boat, Frog, Horse, Dog. We can see that the pixels shared between classes exhibit a smooth transition between colors.

meaningful clusters (e.g., trees, humans, vehicles). However, PoCO does not always yield a perfect ordering, e.g., the
leopard may not fit well in the top right corner. It might fit better next to the lion and the tiger (4 left and 2 down).
Indeed, other ordering strategies may be better suited for the distillation task, e.g., a photometric-based ordering that
uses the color of the images or an ordering that uses the image semantics. Further investigation of alternative ordering
methods is left for future work.
Other IPC values. In this work, we focused on ≤ 1 IPC; however, it is possible to extend PoDD to values exceeding 1 IPC. In a preliminary investigation, we found that PoDD with more than 1 IPC achieves SoTA results for some of the datasets. However, we expect that leveraging the full potential of more than 1 IPC will require new labeling and class ordering algorithms.
Global and local scale semantic results. We investigate whether PoDD can produce distilled posters that exhibit both
local and global semantics. We found that in the case of 1 IPC, both local and global semantics are present, but are hard
to detect. For example, in fig. 8 we illustrate some of the captured semantics by sketching over the poster. To further
explore this idea, we tested [4] with a CIFAR-10 variant of PoDD where we use 10 IPC and distilled a poster per class.
Each poster now represents a single class and overlapping patches are always from the same class. To enable this, we


[Figure 9 panels: per-class posters for Planes, Cars, Birds, Cats, Deer (top row) and Boats, Trucks, Horses, Dogs, Frogs (bottom row).]

Figure 9: Global and Local Semantics: We train a CIFAR-10 variant of PoDD with 10 IPC and a separate per-class poster. The local semantics are well preserved, showing multiple modalities per class, e.g., different colors of cars, poses of animals, and locations of the planes. Moreover, some of the classes demonstrate global scale semantics, e.g., the planes have sky on the top and grass on the bottom.


retrofit PoDDL by increasing the size of Y proportionally to K, allowing it to operate with K > 1. As seen in fig. 9,
the method preserves the local semantics and shows multiple modalities from each class. Moreover, some of the posters
also demonstrate global semantics, e.g., the planes have the sky on the top and the grass on the bottom.
Patch Augmentations. Throughout this work, we use the extracted patches with no modifications. However, performing
spatial augmentations (e.g., scale, rotation) on the distilled patches during the distillation process may be beneficial.
Another option is to create a cyclic poster where patches near the border of the poster are wrapped around.

7 Conclusion
In this work, we propose poster dataset distillation (PoDD), a new dataset distillation setting for tiny, less than 1
image-per-class budgets. We develop a method to perform PoDD and present a strategy for ordering the classes within
the poster. We demonstrate the effectiveness of PoDD by achieving on-par SoTA performance with as low as 0.3 IPC
and by setting a new 1 IPC SoTA performance for CIFAR-10, CIFAR-100, and CUB200.

8 Broader Impact
Our work demonstrates the potential for PoDD to reduce the environmental footprint of deep learning by achieving
higher compression rates for image-based datasets. This innovation not only lowers storage requirements but also
leads to reduced training times, fostering more sustainable deep-learning research practices. By introducing the pixels-
per-dataset approach to dataset distillation, we encourage the development of more efficient and resource-conscious
practices.
However, it’s essential to acknowledge that, like any machine learning project, our work may have various societal
consequences. While we believe these should be considered, we do not feel it necessary to delve into specific societal
impacts in this statement.


References
[1] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end
incremental learning. In Proceedings of the European conference on computer vision (ECCV), pages 233–248,
2018. 3
[2] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation
by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4750–4759, 2022. 3, 6, 8
[3] Xiuli Chai, Zhihua Gan, Yiran Chen, and Yushu Zhang. A visually secure image encryption scheme based on
compressive sensing. Signal Processing, 134:35–51, 2017. 5
[4] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant
memory. In International Conference on Machine Learning, pages 6565–6590. PMLR, 2023. 3, 6, 8, 9
[5] Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural
networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022. 6, 8
[6] P.G.L. Dirichlet and R. Dedekind. Vorlesungen über Zahlentheorie. English translation: Lectures on Number
Theory, American Mathematical Society, 1999 ISBN 0-8218-2017-6, 1863. 4
[7] Jiawei Du, Yidi Jiang, Vincent T. F. Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated
trajectory error to improve dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2023. 6, 8
[8] Yunzhen Feng, Shanmukha Ramakrishna Vedantam, and Julia Kempe. Embarrassingly simple dataset distillation.
In The Twelfth International Conference on Learning Representations, 2023. 3, 6, 7, 8
[9] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018. 6
[10] Ibrahim Jubran, Alaa Maalouf, and Dan Feldman. Introduction to coresets: Accurate coresets. arXiv preprint
arXiv:1910.08707, 2019. 3
[11] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 2, 6
[12] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015. 6
[13] Hae Beom Lee, Dong Bok Lee, and Sung Ju Hwang. Dataset condensation with latent space knowledge
factorization and sharing. arXiv preprint arXiv:2208.10494, 2022. 3
[14] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with
contrastive signals. In International Conference on Machine Learning, pages 12352–12364. PMLR, 2022. 3
[15] Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature
approximation. Advances in Neural Information Processing Systems, 35:13877–13891, 2022. 3
[16] Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit
gradients. arXiv preprint arXiv:2302.06755, 2023. 6, 8
[17] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings
of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010. 6
[18] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide
convolutional networks. Advances in Neural Information Processing Systems, 34:5186–5198, 2021. 3
[19] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language
supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5
[20] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental
classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern
Recognition, pages 2001–2010, 2017. 3


[21] Noveen Sachdeva and Julian McAuley. Data distillation: A survey. arXiv preprint arXiv:2301.04272, 2023. 3
[22] Ilia Sucholutsky and Matthias Schonlau. Soft-label dataset distillation and text dataset distillation. In 2021
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021. 6
[23] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast
stylization. arXiv preprint arXiv:1607.08022, 2016. 6
[24] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao
Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022. 3
[25] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation. arXiv preprint
arXiv:1811.10959, 2018. 2, 3
[26] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona.
Caltech-ucsd birds 200. 2010. 6
[27] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In Proceedings of the
International Conference on Machine Learning (ICML), pages 12674–12685, 2021. 3, 6, 8
[28] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision (WACV), 2023. 3, 6, 8
[29] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. arXiv preprint
arXiv:2006.05929, 2020. 3
[30] Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. In
Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022. 3, 6, 8

