Efficient Data Subset Selection To Generalize Training Across Models: Transductive and Inductive Networks
Abstract
Existing subset selection methods for efficient learning predominantly employ discrete, combinatorial, and
model-specific approaches that lack generalizability: for an unseen architecture, one cannot reuse the subset
chosen for a different model. To tackle this problem, we propose SubSelNet, a trainable subset selection
framework that generalizes across architectures. We first introduce an attention-based neural gadget
that leverages the graph structure of architectures and acts as a surrogate for trained deep neural networks,
yielding quick model predictions. We then use these predictions to build subset samplers. This naturally provides
two variants of SubSelNet. The first variant is transductive (called Transductive-SubSelNet); it
computes the subset separately for each model by solving a small optimization problem. This optimization
is still very fast, thanks to the replacement of explicit model training by the model approximator. The second
variant is inductive (called Inductive-SubSelNet); it computes the subset using a trained subset selector,
without any optimization. Our experiments show that our model outperforms several methods across several
real datasets.
1 Introduction
In the last decade, neural networks have drastically enhanced the performance of state-of-the-art ML models.
However, they often demand massive data to train, which renders them heavily contingent on the availability
of high-performance computing machinery such as GPUs and RAM. Such resources entail heavy energy
consumption, excessive CO2 emission, and maintenance cost.
Driven by this challenge, a recent body of work focuses on suitably selecting a subset of instances so that
the model can be trained quickly using lightweight computing infrastructure [4, 23, 51, 32, 54, 37, 18–21, 36].
However, these methods are not generalizable across architectures: the subset selected by such a method is
tailored to train one specific architecture and need not be optimal for training another. Hence, to select
data subsets for a new architecture, they must be run from scratch. Moreover, these methods rely heavily
on discrete combinatorial algorithms, which pose significant barriers to scaling them across
multiple unseen architectures. Appendix C contains further details about related work.
the widths of layers, learning rates or scheduler-specific hyperparameters, we can train each architecture on the
corresponding data subset obtained from our method to quickly obtain the trained model for cross-validation.
Design of a neural pipeline to eschew model training for new architectures. We initiate our investigation by
writing down a combinatorial optimization problem instance that outputs a subset specifically for one given
model architecture. Then, we gradually develop SubSelNet by building upon this setup. The key blocker
in scaling a model-specific combinatorial subset selector across architectures is that the model parameters
appear as optimization variables alongside the candidate data subset. We design the neural
pipeline of SubSelNet specifically to circumvent this blocker. This neural pipeline consists of the following
three components:
(1) GNN-guided architecture encoder: this converts the architecture into an embedded vector space. (2)
Neural model approximator: this approximates the predictions of a trained model for any given architecture,
thus providing the per-instance predictions of a new (test) model without explicitly training it. (3) Subset
sampler: this uses the predictions from the model approximator and an instance to produce a selection score
for that instance. Thanks to the architecture encoder and the neural approximator, we do not need to explicitly
train a test model to select the subset, since the model approximator directly provides the predictions the
model would make.
Transductive and Inductive SubSelNet. Depending on how the subset sampler in the final
component of our neural pipeline operates, we design two variants of our model.
Transductive-SubSelNet: The first variant is transductive in nature. For each new architecture, we use
the model approximator's predictions to replace the model-training step in the original combinatorial
subset selection problem. However, the candidate subset remains an optimization variable, so
we still solve a fresh optimization problem, with respect to the selection scores provided by the subset sampler,
every time we encounter a new architecture. Because the direct predictions from the model approximator allow
us to skip explicit model training, this strategy is extremely fast in terms of both memory and time.
Inductive-SubSelNet: In contrast to Transductive-SubSelNet, the second variant does not require solving
any optimization problem; instead, it models the selection scores using a neural network. Consequently, it
is extremely fast.
We compare our method against six state-of-the-art methods on five real-world datasets; the results show that
SubSelNet provides the best trade-off between accuracy and inference time, as well as between accuracy and
memory usage, among all the methods.
2 Preliminaries
Setting. We are given a set of training instances {(xi , yi )}i∈D , where D indexes the data. Here, xi ∈ R^dx
denotes the features, and yi ∈ Y denotes the label. In our experiments, we consider Y to be a set of categorical
labels; however, our framework can also be used for continuous labels. We use m to denote a neural architecture,
mθ for its parameterization, and M for the set of neural architectures. Given an
architecture m ∈ M, Gm = (Vm , Em ) provides the graph representation of m, where a node u ∈ Vm
represents an operation, and an edge e = (u, v) ∈ Em indicates that the output of the operation
represented by node u is fed as an operand to the operation represented by node v. Finally, we use
H(·) to denote the entropy of a probability distribution and ℓ(mθ (x), y) to denote the cross-entropy loss hereafter.
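To make the graph representation concrete, here is a minimal sketch (not the authors' code) of an architecture encoded as a DAG of operations, together with a BFS-style topological ordering of its nodes. The operation names and the toy cell are illustrative, loosely in the spirit of NAS-Bench-101 cells.

```python
# Sketch: an architecture m as a DAG G_m = (V_m, E_m) whose nodes are
# operations; edge (u, v) means the output of node u feeds node v.
from collections import defaultdict

class ArchGraph:
    def __init__(self):
        self.ops = {}                  # node id -> operation label
        self.succ = defaultdict(list)  # u -> [v, ...]: output of u feeds v

    def add_node(self, u, op):
        self.ops[u] = op

    def add_edge(self, u, v):
        self.succ[u].append(v)

    def topo_order(self):
        """Topological order via Kahn's algorithm (BFS over zero-indegree nodes)."""
        indeg = {u: 0 for u in self.ops}
        for u in self.succ:
            for v in self.succ[u]:
                indeg[v] += 1
        frontier = [u for u, d in indeg.items() if d == 0]
        order = []
        while frontier:
            u = frontier.pop(0)
            order.append(u)
            for v in self.succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    frontier.append(v)
        return order

# A toy cell: input -> conv3x3 -> maxpool -> output, plus a skip edge.
g = ArchGraph()
for u, op in [(0, "input"), (1, "conv3x3"), (2, "maxpool3x3"), (3, "output")]:
    g.add_node(u, op)
for u, v in [(0, 1), (1, 2), (2, 3), (0, 3)]:
    g.add_edge(u, v)
print(g.topo_order())  # node 0 first, node 3 last
```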
Figure 1: Illustration of SubSelNet. (a) Overview: given a model architecture m ∈ M, SubSelNet takes its
graph Gm as input to the architecture encoder GNNα to compute the architecture embedding. This, together
with x, is fed into the model approximator gβ , which predicts the output of the trained model mθ∗ (x). This
prediction is then fed into the subset sampler π to obtain the training subset S. (b) Neural architecture of the
components: GNNα consists of recursive message-passing layers. The model approximator gβ performs a BFS
ordering on the embeddings Hm = {hu } and feeds them into a transformer. The subset sampler optimizes π
either directly (Transductive) or via a neural network πψ (Inductive).
architectures. Moreover, the involvement of both combinatorial and continuous optimization variables prevents
the underlying solver from scaling across multiple architectures.
We address these challenges by designing a neural surrogate of the objective (1), which leads to subset
selection that generalizes across different architectures.
3 Overview of SubSelNet
Here, we outline our proposed model SubSelNet, which substitutes the optimization (1) with
a neural surrogate, enabling us to compute the optimal subset S for an unseen model, once trained
on a set of model architectures.
3.1 Components
At the outset, SubSelNet consists of three key components: (i) the architecture encoder, (ii) the neural
approximator of the trained model, and (iii) the subset sampler. Figure 1 illustrates our model.
GNN-guided encoder for neural architectures. Generalizing any task across different architectures requires
embedding the architectures in a vector space. Since a neural architecture is essentially a graph over
multiple operations, we use a graph neural network (GNN) [59] to achieve this goal. Given a model architecture
m ∈ M, we first feed the underlying DAG Gm into a GNN (GNNα ) with parameters α, which outputs the node
representations for Gm , i.e., Hm = {hu }u∈Vm .
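As an illustration of the encoder's role, the following toy routine (pure Python, not the paper's trained GNNα ) shows how per-node operation features on Gm can be turned into node representations Hm = {hu } by repeatedly aggregating messages from predecessors; the fixed averaging here stands in for learned weights.

```python
# Illustrative message passing over G_m: each round, every node averages the
# states of its predecessors, combines them with its own state, and applies a
# ReLU. A real encoder would use learned transformations instead.
def message_passing(node_feats, edges, rounds=2):
    n, d = len(node_feats), len(node_feats[0])
    h = [list(v) for v in node_feats]
    for _ in range(rounds):
        nxt = []
        for u in range(n):
            neigh = [h[v] for v, w in edges if w == u]  # messages from predecessors of u
            if neigh:
                agg = [sum(vals) / len(neigh) for vals in zip(*neigh)]
            else:
                agg = [0.0] * d
            # combine self state with aggregated messages (average + ReLU)
            nxt.append([max(0.0, 0.5 * (h[u][k] + agg[k])) for k in range(d)])
        h = nxt
    return h

# Toy chain 0 -> 1 -> 2 with 2-dim operation features
H = message_passing([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [(0, 1), (1, 2)])
print(len(H), len(H[0]))  # 3 nodes, 2-dim embeddings
```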
Approximator of the trained model mθ∗ . To tackle the lack of generalizability of the optimization (1), we design a
neural model approximator gβ that approximates the predictions of the trained model for any given architecture
m. To this end, gβ takes Hm and an instance xi as input and computes gβ (Hm , xi ) ≈ mθ∗ (xi ). Here, θ∗ is the
set of parameters of the model mθ learned on dataset D.
Subset sampler. We design a subset sampler using a probabilistic model Prπ (•). Given a budget b, it sequentially
draws instances S = {s1 , ..., sb } from a softmax distribution over the logit vector π ∈ R^|D| , where π(xi , yi )
is a score for the element (xi , yi ). We would like to highlight that S is an ordered set of elements,
selected sequentially. However, this order does not affect the trained model, which is inherently
invariant to permutations of the training data; it only affects the choice of S. Depending on how we compute
π at test time, we have two variants of SubSelNet: Transductive-SubSelNet and Inductive-SubSelNet.
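One concrete reading of this sampler, sketched below under the assumption that draws are made without replacement with the softmax renormalized after each pick, is:

```python
# Hedged sketch of Pr_pi: draw an ordered subset S = (s_1, ..., s_b) from a
# softmax over the scores pi, removing each picked index and renormalizing.
import math, random

def sample_subset(pi, b, seed=0):
    rng = random.Random(seed)
    remaining = list(range(len(pi)))
    S = []
    for _ in range(b):
        logits = [pi[i] for i in remaining]
        mx = max(logits)
        weights = [math.exp(l - mx) for l in logits]  # numerically stable softmax
        z = sum(weights)
        probs = [w / z for w in weights]
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        S.append(remaining.pop(pick))
    return S

S = sample_subset([2.0, -1.0, 0.5, 1.5, 0.0], b=3)
print(S)  # three distinct indices, biased toward high-score items
```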
Transductive-SubSelNet: At test time, since we have already trained the architecture encoder GNNα and
the model approximator gβ , we do not have to perform any training when we select a subset for an unseen
architecture m′ : the trained model can simply be replaced with gβ (Hm′ , xi ). Thus, the key bottleneck
of solving the combinatorial optimization (1), training the model while simultaneously searching for S, is
removed. We can now optimize over π afresh for each new architecture; since no model training is involved,
this explicit optimization is fast and memory efficient. Because of this per-architecture optimization, the
approach is transductive in nature.
Inductive-SubSelNet: Here, we introduce a neural network to approximate π, trained together
with GNNα and gβ . This allows us to select the subset S directly, without explicitly optimizing for π as in
Transductive-SubSelNet.
3.2 Training and inference
Training objective. Using the approximation gβ (Hm , xi ) ≈ mθ∗ (xi ), we replace the combinatorial optimization
problem in Eq. (1) with a continuous optimization problem across different model architectures m ∈ M.
To that end, we define
Λ(S; m; π, gβ , GNNα ) = ∑i∈S ℓ(gβ (Hm , xi ), yi ) − λH(Prπ (•)), with Hm = GNNα (Gm )    (2)
and seek to solve the following problem:
minπ,α,β ∑m∈M ES∼Prπ [ Λ(S; m; π, gβ , GNNα ) + γ ∑i∈S KL(gβ (Hm , xi ), mθ∗ (xi )) ]    (3)
Here, the entropy H(Prπ (•)) of the subset sampler models the diversity of samples in the selected subset.
We call our neural pipeline, consisting of the architecture encoder GNNα , the model approximator gβ , and the
subset selector π, SubSelNet. In the above, γ penalizes the difference between the output of the model
approximator and the prediction made by the trained model, which allows us to generalize the training of
different models m ∈ M through the model gβ (Hm , xi ).
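To make Eq. (2) concrete, the toy computation below evaluates Λ for a hand-picked subset: a cross-entropy sum over the selected instances minus the entropy bonus λH(Prπ (•)). All predictions, labels, and scores are made-up values.

```python
# Toy evaluation of Lambda(S; m; pi, ...) from Eq. (2): summed cross-entropy
# of the approximator's predictions on S, minus lambda times the entropy of
# the sampler's softmax distribution (which rewards diverse samplers).
import math

def softmax(scores):
    mx = max(scores)
    e = [math.exp(s - mx) for s in scores]
    z = sum(e)
    return [v / z for v in e]

def cross_entropy(pred_probs, y):
    return -math.log(max(pred_probs[y], 1e-12))

def entropy(probs):
    return -sum(p * math.log(max(p, 1e-12)) for p in probs)

def Lambda(subset, preds, labels, pi, lam=0.1):
    sel_probs = softmax(pi)  # sampler distribution Pr_pi
    ce = sum(cross_entropy(preds[i], labels[i]) for i in subset)
    return ce - lam * entropy(sel_probs)

preds = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]  # stand-ins for g_beta(H_m, x_i)
labels = [0, 1, 0]
val = Lambda(subset=[0, 2], preds=preds, labels=labels, pi=[1.0, 0.0, 0.5])
print(val)
```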
4 Design of SubSelNet
Bottlenecks of end-to-end training and the proposed multi-stage approach. End-to-end optimization of the
above problem is difficult for the following reasons. (i) Our architecture representation Hm represents only
the architectures and thus should be independent of the architecture parameters θ and the instances x;
end-to-end training can make it sensitive to these quantities. (ii) For the model approximator gβ to
accurately fit the output of the trained model mθ , we need explicit training of β against the target mθ .
In our multi-stage training method, we first train the architecture encoder GNNα , then the model approximator
gβ , and finally our subset sampler Prπ (resp. Prπψ ) for the transductive (resp. inductive) model. In the
following, we describe the design and training of these components in detail.
σ(zu⊤ zv )]. Here, σ is a parameterized sigmoid function. Finally, we estimate α, µ, Σ, and σ by maximizing the
evidence lower bound (ELBO): maxα,µ,Σ,σ EZ∼q(• | Gm ) [log p(Gm | Z)] − KL(q(• | Gm ) || ppr (•)).
problem explicitly with respect to π every time we wish to select a data subset for a new architecture.
Given the trained models GNNα̂ and gβ̂ and a new architecture m′ ∈ M, we solve the following optimization
problem at inference time to find the subset sampler Prπ :
minπ ES∼Prπ (•) Λ(S; m′ ; π, gβ̂ , GNNα̂ )    (9)
Such an optimization still consumes time during inference, but it is significantly faster than the
combinatorial methods [20, 19, 37, 47], thanks to sidestepping explicit model training via the model
approximator.
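The following sketch mimics this inference-time optimization on toy numbers: with the approximator's per-instance losses fixed, we descend on the score vector π alone, here through a smooth relaxation (probability-weighted loss minus an entropy term) and finite-difference gradients rather than the paper's exact estimator.

```python
# Transductive inference sketch: with GNN_alpha and g_beta frozen, Eq. (9)
# reduces to a small optimization over pi only. We minimize a smooth proxy,
# softmax(pi)-weighted loss minus a small entropy bonus, by gradient descent.
import math

def softmax(s):
    mx = max(s)
    e = [math.exp(v - mx) for v in s]
    z = sum(e)
    return [v / z for v in e]

def objective(pi, losses, lam=0.05):
    p = softmax(pi)
    exp_loss = sum(pi_i * li for pi_i, li in zip(p, losses))
    ent = -sum(q * math.log(max(q, 1e-12)) for q in p)
    return exp_loss - lam * ent

def optimize_pi(losses, steps=200, lr=0.5, eps=1e-4):
    pi = [0.0] * len(losses)
    for _ in range(steps):
        grad = []
        for i in range(len(pi)):           # central finite differences
            pi[i] += eps
            up = objective(pi, losses)
            pi[i] -= 2 * eps
            dn = objective(pi, losses)
            pi[i] += eps
            grad.append((up - dn) / (2 * eps))
        pi = [v - lr * g for v, g in zip(pi, grad)]
    return pi

# Instances with low approximator loss should end up with higher scores.
losses = [0.2, 1.5, 0.4, 2.0]
pi = optimize_pi(losses)
print(pi.index(max(pi)))  # index 0 has the smallest loss
```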
Inductive-SubSelNet. In contrast to the transductive model, the inductive model does not require explicit
optimization of π for a new architecture. To that end, we approximate π using a neural network
πψ , which takes two signals as input, the dataset D and the outputs of the model approximator for the
instances {gβ̂ (Hm , xi ) | i ∈ D}, and outputs a score πψ (xi , yi ) for each instance. The training of
πψ follows from the optimization (3):
minψ ∑m∈M ES∼Prπψ Λ(S; m; πψ , gβ̂ , GNNα̂ )    (10)
Such an inductive model can select an optimal distribution over subsets for efficiently training
any model mθ , without explicitly training θ or searching for the underlying subset.
Architecture of πψ for Inductive-SubSelNet. We approximate π with a neural network πψ that takes
three inputs: (xj , yj ); the corresponding output of the model approximator, i.e., om,xj = gβ (GNNα (Gm ), xj )
from Eq. (6); and the node representation matrix Hm . It provides a positive selection score πψ (Hm , xj , yj , om,xj ).
In practice, πψ is a three-layer feed-forward network with Leaky-ReLU activations on the first
two layers and a sigmoid activation on the last layer.
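A direct, dependency-free rendering of this network is sketched below; the input dimension, hidden widths, and initialization are illustrative, and the real input concatenates (xj , yj ), om,xj , and a summary of Hm .

```python
# Sketch of pi_psi as described above: three feed-forward layers, Leaky-ReLU
# on the first two, sigmoid on the last, so the selection score is in (0, 1).
import math, random

def leaky_relu(v, slope=0.01):
    return [x if x > 0 else slope * x for x in v]

def linear(v, W, b):
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(W, b)]

def make_layer(n_in, n_out, rng):
    s = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def pi_psi(feat, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h = leaky_relu(linear(feat, W1, b1))
    h = leaky_relu(linear(h, W2, b2))
    out = linear(h, W3, b3)[0]
    return 1.0 / (1.0 + math.exp(-out))  # sigmoid -> positive score

rng = random.Random(0)
d_in = 8  # stand-in for the concatenation of x_j, y_j, o_m,xj, pooled H_m
params = [make_layer(d_in, 16, rng), make_layer(16, 16, rng), make_layer(16, 1, rng)]
score = pi_psi([0.1] * d_in, params)
print(0.0 < score < 1.0)  # True: the sigmoid keeps the score in (0, 1)
```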
5 Experiments
In this section, we provide a comprehensive evaluation of SubSelNet against several strong baselines on five
real-world datasets. Appendix E presents additional results. Our code is at https://github.com/
structlearning/subselnet.
5.1 Experimental setup
Datasets. We use FMNIST [56], CIFAR10 [26], CIFAR100 [25], Tiny-Imagenet-200 [27] and Caltech-256 [13]
(Cal-256). Cal-256 has an imbalanced class distribution; the rest are balanced. We transform each input image
Xi into a vector xi of dimension 2048 by feeding it to a pre-trained ResNet50 v1.5 model [16] and using the
output of the penultimate layer as the image representation.
Model architectures and baselines. We use model architectures from NAS-Bench-101 [61] in our experi-
ments. We compare Transductive-SubSelNet and Inductive-SubSelNet against three non-adaptive subset
selection methods – (i) Facility Location [11, 17], where we maximize FL(S) = ∑j∈D maxi∈S xi⊤ xj to find S,
(ii) Pruning [48], and (iii) Selection-via-Proxy [5] – and four adaptive subset selection methods – (i) Glister [20],
(ii) Grad-Match [19], (iii) EL2N [42], and (iv) GraNd [42]. The non-adaptive subset selectors choose the subset
before training begins and thus never access the rest of the training set again during the training iterations.
The adaptive subset selectors, on the other hand, refine their choice of subset during training and thus
need to access the full training set at each training iteration. Appendix D contains additional details about
the baselines, and Appendix E contains experiments with more baselines.
Evaluation protocol. We split the model architectures M into 70% training (Mtr ), 10% validation (Mval ) and
20% test (Mtest ) folds. Training the model approximator requires supervision from the pre-trained models
mθ∗ , and pre-training a large number of models can be expensive. Therefore, we limit the number of pre-trained
models to a diverse set of size 250, which ensures good coverage of both low-parameter and high-parameter
regimes; using more showed no visible advantage. We report the parameter statistics in Appendix D. For the
architecture encoder, however, we use the entire set Mtr for GNN training. We split the dataset D into Dtr , Dval
and Dtest with similar 70:10:20 folds. We present Mtr , Mval , Dtr and Dval to our method and estimate α̂, β̂ and ψ̂
(for the Inductive-SubSelNet model). None of the baseline methods supports a generalizable learning protocol
across architectures and thus cannot leverage the training architectures at test time. Given an architecture
m′ ∈ Mtest , we select the subset S from Dtr using our subset sampler (Prπ for Transductive-SubSelNet or
Prπψ̂ for Inductive-SubSelNet). Similarly, all the non-adaptive subset selectors select S ⊂ Dtr using their own
algorithms.
Once S is selected, we train the test models m′ ∈ Mtest on S. We perform our experiments with different
|S| = b ∈ (0.005|D|, 0.9|D|) and compare the performance between different methods using three quantities:
(1) Relative Accuracy Reduction (RAR), computed as the drop in test accuracy on training with a chosen subset
as compared to training with the entire dataset, i.e., RAR(S, D) = (1/|Mtest|) Σ_{m′∈Mtest} (1 − Acc(m′ | S)/Acc(m′ | D)),
where Acc(m′ | X) denotes the test accuracy when m′ is trained on the set X. Lower RAR indicates better
performance. (2) Computational efficiency, i.e., the speedup achieved with respect to training with the full dataset,
measured as Tf /T. Here, Tf is the time taken for training with the full dataset, and T is the
time taken for the entire inference task, i.e., the average time for selecting subsets across the test models
m′ ∈ Mtest plus the average training time of these test models on the respective selected subsets. (3) Resource
efficiency, in terms of the amount of memory consumed during the entire inference task described in item (2),
which is measured as ∫₀ᵀ memory(t) dt in units of GB-min, where memory(t) is the amount of memory consumed
at timestamp t.
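The three evaluation quantities above can be sketched in a few lines. This is a minimal illustration, not the paper's actual measurement harness; the accuracies, timings, and memory trace below are hypothetical inputs.

```python
def rar(acc_subset, acc_full):
    """Relative Accuracy Reduction, averaged over the test models.

    acc_subset[k] / acc_full[k]: test accuracy of the k-th test model when
    trained on the selected subset S / on the full dataset D.
    """
    drops = [1.0 - s / f for s, f in zip(acc_subset, acc_full)]
    return sum(drops) / len(drops)

def speedup(t_full, t_select, t_train_subset):
    """Computational efficiency T_f / T, where T is the average subset-selection
    time plus the average subset-training time of the test models."""
    return t_full / (t_select + t_train_subset)

def memory_gb_min(t_min, mem_gb):
    """Resource efficiency: trapezoidal approximation of the integral of
    memory(t) dt over the inference task, in GB-min."""
    return sum(0.5 * (mem_gb[i] + mem_gb[i - 1]) * (t_min[i] - t_min[i - 1])
               for i in range(1, len(t_min)))

# Hypothetical numbers, for illustration only.
print(rar([0.85, 0.80], [0.90, 0.88]))                               # lower is better
print(speedup(t_full=7000.0, t_select=50.0, t_train_subset=450.0))   # 14.0
print(memory_gb_min([0.0, 1.0, 2.0], [4.0, 6.0, 6.0]))               # 11.0
```

Note that the RAR normalizes each model's accuracy drop by its own full-data accuracy, so models of very different capacities contribute comparably to the average.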
5.2 Results
Comparison with baselines. Here, we compare different methods in terms of the trade-off between Relative
accuracy reduction RAR (lower is better) and computational efficiency as well as RAR and resource efficiency.
In Figures 2 and 3, we probe the variation between these quantities by varying the size of the selected subset
|S| = b ∈ (0.005|D|, 0.9|D|) for non-adaptive and adaptive baselines, respectively. We make the following
observations.
(1) Our methods trade off accuracy vs. computational efficiency, as well as accuracy vs. resource
efficiency, more effectively than all the other methods, including the adaptive methods, which refine their choice of
subset as the model training progresses. (2) On FMNIST, our method achieves 10% RAR at ∼4.4 times the
speed-up and using 77% less memory than EL2N, the best baseline (Table 1; tables for other datasets are in
Appendix E). (3) There is no consistent winner across baselines. However, Glister and Grad-Match mostly remain
among the top three baselines. In particular, they outperform the others on Tiny-Imagenet and
Cal-256 in the high-accuracy (low-RAR) regime.
Hybrid-SubSelNet. On FMNIST, CIFAR10 and CIFAR100, we observe that Transductive-SubSelNet offers
a better trade-off than Inductive-SubSelNet.
Here, we design a hybrid version of our model, called Hybrid-SubSelNet, and evaluate it on a regime
[Figure 2: Trade-off between RAR (lower is better) and speedup (top row), and RAR and memory consumption
in GB-min (bottom row), for the non-adaptive methods – Facility location [11, 17], Pruning [48] and Selection-via-
Proxy [5] – on all five datasets: FMNIST, CIFAR10, CIFAR100, Tiny-ImageNet and Caltech-256. In all cases, we
vary |S| = b ∈ (0.005|D|, 0.9|D|).]

[Figure 3: Trade-off between RAR (lower is better) and speedup (top row), and RAR and memory consumption
in GB-min (bottom row), for the adaptive methods – Glister [20], Grad-Match [19], EL2N [42] and GraNd [42] –
on all five datasets: FMNIST, CIFAR10, CIFAR100, Tiny-ImageNet and Caltech-256. In all cases, we vary
|S| = b ∈ (0.005|D|, 0.9|D|).]
where the gap between Transductive- and Inductive-SubSelNet is significant. One such regime is the
part of the trade-off plot for CIFAR100 where the speed-up Tf /T ≥ 28.09 (Figures 2 and 3). Here, given
the subset budget b, we first choose B > b instances using Inductive-SubSelNet and then select the final b
instances by running the explicit optimization routine of Transductive-SubSelNet. Figure 4 shows the results
for B ∈ {25K, 30K, 35K, 45K, 50K}. We observe that Hybrid-SubSelNet allows us to smoothly trade
off between Inductive-SubSelNet and Transductive-SubSelNet by tuning B. It also lets us work effectively
in a resource-constrained setup with limited GPU memory, wherein the larger subset of size B can be selected using
Inductive-SubSelNet on a CPU, and the smaller refined subset of size b can then be selected by solving the transductive
variant on a GPU.
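The two-stage hybrid scheme can be summarized in a short sketch. The callables `inductive_select` and `transductive_select` are stand-ins for the paper's two samplers, which we do not reimplement here; the score-based top-k selectors in the usage example are toy substitutes.

```python
def hybrid_select(data_indices, b, B, inductive_select, transductive_select):
    """Hybrid-SubSelNet sketch.

    Stage 1: the cheap inductive sampler (no per-model optimization; can run
    on a CPU) prunes the pool down to B > b candidate instances.
    Stage 2: the transductive sampler refines the B candidates down to the
    final budget b by solving its small optimization problem (e.g., on a GPU),
    which is now much cheaper because it only sees B instances.
    """
    assert b < B <= len(data_indices)
    coarse = inductive_select(data_indices, B)   # cheap, architecture-conditioned
    return transductive_select(coarse, b)        # expensive, but on a small pool

# Toy stand-ins: top-k by a hypothetical per-instance score.
scores = {i: (i * 37) % 100 for i in range(1000)}
top_k = lambda idx, k: sorted(idx, key=lambda i: -scores[i])[:k]

subset = hybrid_select(list(range(1000)), b=50, B=200,
                       inductive_select=top_k, transductive_select=top_k)
print(len(subset))  # 50
```

Tuning B interpolates between the two variants: B = |D| recovers the pure transductive cost, while B = b makes stage 2 a no-op and recovers the inductive sampler.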
Ablation study. Here, we experiment with three candidates for the model approximator gβ (Feedforward, LSTM
and our proposed attention-based approximator) and three different subset samplers π (uncertainty-based, loss-based
and our proposed subset sampler). Thus, we have nine different combinations of model approximator
and subset selection strategy. In the uncertainty- and loss-based subset samplers, we take the top-b instances
ranked by uncertainty and loss, respectively. We measure uncertainty using the entropy of the predicted distribution
over the target classes. We compare the performance in terms of the test RAR on the test architectures. Moreover,
we also evaluate the model approximator gβ alone — without the presence of the subset sampler —
using the KL divergence between the gold model outputs and the predicted model outputs on the training instances,
Method          Speed up (Tf/T)         Memory (GB-min)
                10% RAR   20% RAR       10% RAR   20% RAR
GLISTER         5.64      7.85          116.36    98.51
GradMatch       4.17      5.24          243.75    136.40
EL2N            6.50      16.42         139.89    77.63
Inductive       28.64     69.24         22.73     8.24
Transductive    28.63     68.36         21.25     8.24
Table 1: Speedup and memory (GB-min) in reaching 10% and 20% RAR on FMNIST.

[Figure 4: Hybrid-SubSelNet — RAR vs. speedup (Tf/T) and RAR vs. memory (∫ memory(t) dt) for
B ∈ {25000, 30000, 35000, 45000}, compared against Transductive (B = 50000) and Inductive (B = NA).]
Design choice of gβ + π                      RAR                                  KL-div
                                     b=0.03|D|  b=0.05|D|  b=0.1|D|   (does not depend on b)
Feedforward (gβ) + Uncertainty (π)    0.657      0.655      0.547
Feedforward (gβ) + Loss (π)           0.692      0.577      0.523      0.171
Feedforward (gβ) + Inductive (our) (π) 0.451     0.434      0.397
LSTM (gβ) + Uncertainty (π)           0.566      0.465      0.438
LSTM (gβ) + Loss (π)                  0.705      0.541      0.455      0.102
LSTM (gβ) + Inductive (our) (π)       0.452      0.412      0.386
Attn. (our) (gβ) + Uncertainty (π)    0.794      0.746      0.679
Attn. (our) (gβ) + Loss (π)           0.781      0.527      0.407      0.089
Attn. (our) (gβ) + Inductive (our) (π) 0.429     0.310      0.260
Table 3: RAR and KL-divergence for different gβ + π on CIFAR10 for 3%, 5% and 10% subset sizes. The
KL-div column depends only on gβ.

i.e., (1/(|Dtr||Mtest|)) Σ_{i∈Dtr, m∈Mtest} KL(mθ∗(xi) || gβ(Hm, xi)). Table 3 summarizes the results for 3%, 5% and 10%
subsets for CIFAR10. We make the following observations: (1) The complete design of our method, i.e., our
model approximator (Transformer) + our subset sampler (SubSelNet), performs best in terms of RAR. (2) Our
neural model approximator mimics the trained model output better than the LSTM and Feedforward
architectures.
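The entropy-based uncertainty baseline and the KL-divergence metric used in this ablation can be written down directly. This is a self-contained sketch; the `probs` arrays below are hypothetical model outputs, not the paper's data.

```python
import math

def entropy(p):
    """Shannon entropy of a predicted class distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def top_b_by_uncertainty(probs, b):
    """Uncertainty sampler: pick the b instances whose predicted class
    distribution has the highest entropy."""
    order = sorted(range(len(probs)), key=lambda i: -entropy(probs[i]))
    return order[:b]

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def approximator_kl(gold_probs, approx_probs):
    """Mean KL(m_theta*(x_i) || g_beta(H_m, x_i)) over training instances."""
    return sum(kl(p, q) for p, q in zip(gold_probs, approx_probs)) / len(gold_probs)

# Hypothetical predicted distributions over 3 classes for 4 instances.
probs = [[0.90, 0.05, 0.05],
         [0.40, 0.30, 0.30],
         [0.60, 0.20, 0.20],
         [0.34, 0.33, 0.33]]
print(top_b_by_uncertainty(probs, b=2))   # [3, 1]: the two most uncertain rows
print(approximator_kl(probs, probs))      # 0.0 for a perfect approximator
```

In the paper's setting the KL is additionally averaged over the test architectures m ∈ Mtest; the per-instance average above is the inner loop of that computation.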
Can the model approximator substitute our subset selector pipeline? The task of the model approximator gβ is to
predict the accuracy of an unseen architecture. A natural question, then, is whether one can use the model approximator
to directly predict the accuracy of the unseen architecture m′, instead of using such a long pipeline of selecting a subset
S followed by training on S. However, as discussed at the end of Section 4.2, the model approximator gβ is
required to generalize across unseen architectures but not unseen instances, as its task is to help select the training
subset.
Table 3 already showed that gβ closely mimics the output of the trained model for the unseen architecture m′ ∈
Mtest and on training instances x (KL div. column). Here, we investigate the performance of gβ on the test
instances and test architectures. Table 2 shows that the performance of gβ on the test instances is significantly
poorer than that of our method. This is intuitive, as generalizing across both the model space and the instance
space is extremely challenging, and we also do not need this in general.

b (in %)                      90%      70%      20%
RAR(our | S) − RAR(gβ)       −0.487   −0.447   −0.327
Table 2: RAR using gβ on CIFAR10.
Using SubSelNet in AutoML. AutoML-related tasks can be significantly sped up when we replace the
entire dataset with a representative subset. Here, we apply SubSelNet to two AutoML applications: Neural
Architecture Search (NAS) and Hyperparameter Optimization (HPO).
Neural Architecture Search: We apply our method on the DARTS architecture space to search for an architecture
using subsets. Traditionally, at each iteration of this search, the underlying network is trained
on the entire dataset. In contrast, we train the underlying network on the subset returned by our method for
this architecture. Following Na et al. [39], we report the test misclassification error of the architecture selected
by the corresponding subset-selector-guided NAS method, i.e., our method (transductive), random
subset selection (averaged over 5 runs) and proxy-data [39]. Table 4 shows that our method performs better than
the baselines.
Hyperparameter Optimization: Finding the best set of hyperparameters from their search space for a model
is computationally intensive. We speed up the tuning process by searching for the hyperparameters
while training the model on a small representative subset S instead of D. Following Killamsetty et al. [22], we
consider optimizer- and scheduler-specific hyperparameters and report the average test misclassification error across
b (in %)      10%     20%     40%
Full          2.78    (independent of b)
Random        3.02    2.88    2.96
Proxy [39]    2.92    2.87    2.88
Our           2.82    2.76    2.68
Table 4: Test Error (%) on the architecture given by NAS on CIFAR10.

              b = 5%            b = 10%
Method        TE      S/U      TE      S/U
Full          2.48    1        2.48    1
Random        5.4     16.66    3.72    11.29
AUTOMATA      5.26    0.51     3.39    0.20
Our           4.11    16.11    2.70    10.96
Table 5: Test Error (%) (TE) and Speed-up (S/U) for the hyperparameters selected by HPO on CIFAR10.

              # Test Architectures
Method        200     300     400
Full          7111    7111    7111
GLISTER       3419    3419    3419
GRAD-MATCH    2909    2909    2909
Our           1844    1635    1496
Table 6: Amortization cost (seconds) after querying test architectures on CIFAR10.
the models trained on the optimal hyperparameter choice returned by our method (transductive), random subset
selection (averaged over 5 runs) and AUTOMATA [22]. Table 5 shows that we outperform the baselines
in terms of the accuracy-speedup trade-off. Appendix D contains more details about the implementation.
Amortization Analysis. Figures 2 and 3 show that our method is substantially faster than the baselines during
inference, once our neural pipeline is trained. Such inference-time speedup is the focus of many other
applications; e.g., complex models like LLMs are difficult and computationally intensive to train, but once trained,
their inference is fast across many queries. However, we recognize that there is a computational overhead in training
our model, arising from the pre-training of the models mθ∗ used for supervision. Since this prior training is
only a one-time overhead, the overall cost is amortized by querying multiple architectures for their subsets. We
measure the amortized cost Ttotal /Mtotal (time in seconds), where Ttotal is the total time from the beginning of the
pipeline to the reporting of the final accuracy on the test architectures, and Mtotal is the total number of training and
test architectures. Table 6 shows the results for the 10% subset against the top baselines on CIFAR10; the
training overhead of our method (transductive) quickly diminishes with the number of test architectures.
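The amortization argument can be made concrete with a tiny sketch: a fixed one-time pre-training cost plus a per-architecture query cost, averaged over the number of queried architectures. The numbers below are hypothetical, chosen only to mirror the decreasing trend of Table 6, not the paper's measurements.

```python
def amortized_cost(t_pretrain, t_per_query, n_architectures):
    """T_total / M_total: a one-time pre-training overhead plus the
    per-architecture selection and subset-training cost, averaged over
    all queried architectures."""
    return (t_pretrain + n_architectures * t_per_query) / n_architectures

# Hypothetical costs (seconds): large one-time overhead, cheap per-query cost.
for n in (200, 300, 400):
    print(n, amortized_cost(t_pretrain=200_000, t_per_query=800, n_architectures=n))
# As n grows, the amortized cost approaches t_per_query from above.
```

This is the same reasoning the paper applies: the t_pretrain/n term vanishes as more test architectures query the trained pipeline.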
6 Conclusion
In this work, we develop SubSelNet, a subset selection framework that can be trained on a set of model
architectures to predict a suitable training subset for an unseen architecture before training it.
To do so, we first design a neural architecture encoder and a model approximator, which predicts the output of a
new candidate architecture without explicitly training it. We use that output to design the transductive and inductive
variants of our model.
Limitations. The SubSelNet pipeline offers quick inference-time subset selection, but a key limitation of our
method is that it entails a pre-training overhead, although this overhead vanishes as we query more architectures.
Such expensive training can be reduced by efficient training methods [65]. In the future, it would be interesting to
incorporate signals from different epochs with a sequence encoder to train the subset selector. Apart from this,
our work does not account for a distribution shift of architectures from training to test. If the architectures vary
significantly from training to test, then there is significant room for performance improvement.
Acknowledgements. Abir De acknowledges the Google Research gift funding.
References
[1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures
using reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https:
//openreview.net/forum?id=S1c2cvqee.
[2] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter opti-
mization. In Proceedings of the 24th International Conference on Neural Information Processing Systems,
NIPS’11, page 2546–2554, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993.
[3] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures. In Sanjoy Dasgupta and David McAllester,
editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings
of Machine Learning Research, pages 115–123, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL
https://proceedings.mlr.press/v28/bergstra13.html.
[4] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares
regression. IEEE transactions on information theory, 59(10):6880–6892, 2013.
[5] Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang,
Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv
preprint arXiv:1906.11829, 2019.
[6] Abir De, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human
assistance. AAAI, 2020.
[7] Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez-Rodriguez. Classification under human
assistance. AAAI, 2021.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255,
2009. doi: 10.1109/CVPR.2009.5206848.
[9] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search,
2020. URL https://arxiv.org/abs/2001.00326.
[10] Dan Feldman. Core-sets: Updated survey. In Sampling Techniques for Supervised or Unsupervised Tasks,
pages 23–44. Springer, 2020.
[11] Satoru Fujishige. Submodular functions and optimization. Elsevier, 2005.
[12] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message
passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR,
2017.
[13] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
[14] Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in
deep learning. In Database and Expert Systems Applications: 33rd International Conference, DEXA 2022,
Vienna, Austria, August 22–24, 2022, Proceedings, Part I, pages 181–195. Springer, 2022.
[15] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings
of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300, 2004.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.
[17] Rishabh Krishnan Iyer. Submodular optimization and machine learning: Theoretical results, unifying and
scalable algorithms, and applications. PhD thesis, 2015.
[18] Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakr-
ishnan. Learning from less data: A unified data subset selection and active learning framework for computer
vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1289–1299.
IEEE, 2019.
[19] Krishnateja Killamsetty, Durga Sivasubramanian, Baharan Mirzasoleiman, Ganesh Ramakrishnan, Abir
De, and Rishabh Iyer. Grad-match: A gradient matching based data subset selection for efficient learning.
arXiv preprint arXiv:2103.00123, 2021.
[20] Krishnateja Killamsetty, Durga Subramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: A
generalization based data selection framework for efficient and robust learning. In AAAI, 2021.
[21] Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer. Retrieve: Coreset selection for
efficient and robust semi-supervised learning. Advances in Neural Information Processing Systems, 34:
14488–14501, 2021.
[22] Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh
Ramakrishnan, and Rishabh Iyer. Automata: Gradient based data subset selection for compute-efficient
hyper-parameter tuning, 2022.
[23] Katrin Kirchhoff and Jeff Bilmes. Submodularity for data selection in machine translation. In Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 131–141,
2014.
[24] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization
of machine learning hyperparameters on large datasets, 2017.
[25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[26] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset. In online:
https://www.cs.toronto.edu/∼kriz/cifar.html, 2014.
[27] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
[28] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and
Ameet Talwalkar. A system for massively parallel hyperparameter tuning, 2020.
[29] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. Tune: A
research platform for distributed model selection and training, 2018.
[30] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan
Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search, 2017. URL https:
//arxiv.org/abs/1712.00559.
[31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search, 2018. URL
https://arxiv.org/abs/1806.09055.
[32] Yuzong Liu, Rishabh Iyer, Katrin Kirchhoff, and Jeff Bilmes. Svitchboard ii and fisver i: High-quality
limited-complexity corpora of conversational english speech. In Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[33] Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. Training gaussian mixture models at
scale via coresets. The Journal of Machine Learning Research, 18(1):5885–5909, 2017.
[34] Jovita Lukasik, David Friede, Arber Zela, Frank Hutter, and Margret Keuper. Smooth variational graph
embeddings for efficient neural architecture search. In International Joint Conference on Neural Networks,
IJCNN 2021, Shenzhen, China, July 18-22, 2021, 2021.
[35] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture op-
timization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran
Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/
933670f1ac8ba969f32989c312faba75-Paper.pdf.
[36] Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie
Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal. Prioritized
training on points that are learnable, worth learning, and not yet learnt. In Kamalika Chaudhuri, Stefanie
Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th
International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research,
pages 15630–15649. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/
mindermann22a.html.
[37] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine
learning models. In Proc. ICML, 2020.
[38] Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan,
and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings
of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications
of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial
Intelligence, AAAI’19/IAAI’19/EAAI’19. AAAI Press, 2019. ISBN 978-1-57735-809-1. doi: 10.1609/
aaai.v33i01.33014602. URL https://doi.org/10.1609/aaai.v33i01.33014602.
[39] Byunggook Na, Jisoo Mok, Hyeokjun Choe, and Sungroh Yoon. Accelerating neural architecture search
via proxy data. CoRR, abs/2106.04784, 2021. URL https://arxiv.org/abs/2106.04784.
[40] Xuefei Ning, Changcheng Tang, Wenshuo Li, Songyi Yang, Tianchen Zhao, Niansong Zhang, Tianyi Lu,
Shuang Liang, Huazhong Yang, and Yu Wang. Awnas: A modularized and extensible nas framework, 2020.
[41] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters, 2019.
URL https://arxiv.org/abs/1905.09550.
[42] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding
important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607,
2021.
[43] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via
parameters sharing. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4095–
4104. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/pham18a.html.
[44] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le,
and Alex Kurakin. Large-scale evolution of image classifiers, 2017. URL https://arxiv.org/abs/
1703.01041.
[45] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier
architecture search, 2018. URL https://arxiv.org/abs/1802.01548.
[46] Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational
autoencoders. CoRR, abs/1802.03480, 2018. URL http://arxiv.org/abs/1802.03480.
[47] Durga Sivasubramanian, Rishabh K. Iyer, Ganesh Ramakrishnan, and Abir De. Training data subset
selection for regression with controlled generalization error. In ICML, 2021.
[48] Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S Morcos. Beyond neural scaling
laws: beating power law scaling via data pruning. arXiv preprint arXiv:2206.14486, 2022.
[49] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le.
Mnasnet: Platform-aware neural architecture search for mobile. 2018. doi: 10.48550/ARXIV.1807.11626.
URL https://arxiv.org/abs/1807.11626.
[50] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and
Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv
preprint arXiv:1812.05159, 2018.
[51] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. In International
conference on machine learning, pages 1494–1502. PMLR, 2014.
[52] Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. Submodular subset selection for
large-scale speech training data. In 2014 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3311–3315. IEEE, 2014.
[53] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. Unsupervised submodular subset selection for
speech data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 4107–4111. IEEE, 2014.
[54] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In
International Conference on Machine Learning, pages 1954–1963, 2015.
[55] Colin White, Willie Neiswanger, Sam Nolen, and Yash Savani. A study on encodings for neu-
ral architecture search. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, ed-
itors, Advances in Neural Information Processing Systems, volume 33, pages 20309–20319. Cur-
ran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/
ea4eb49329550caaa1d2044105223721-Paper.pdf.
[56] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[57] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?,
2018. URL https://arxiv.org/abs/1810.00826.
[58] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?
arXiv preprint arXiv:1810.00826, 2018.
[59] Shen Yan, Yu Zheng, Wei Ao, Xiao Zeng, and Mi Zhang. Does unsupervised architecture representation
learning help neural architecture search? In NeurIPS, 2020.
[60] Shen Yan, Kaiqiang Song, Fei Liu, and Mi Zhang. Cate: Computation-aware neural architecture encoding
with transformers. In ICML, 2021.
[61] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-
bench-101: Towards reproducible neural architecture search. In Kamalika Chaudhuri and Ruslan
Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, vol-
ume 97 of Proceedings of Machine Learning Research, pages 7105–7114. PMLR, 09–15 Jun 2019. URL
https://proceedings.mlr.press/v97/ying19a.html.
[62] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn: A deep generative
model for graphs. CoRR, abs/1802.08773, 2018. URL http://arxiv.org/abs/1802.08773.
[63] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are
transformers universal approximators of sequence-to-sequence functions?, 2020.
[64] Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, and Yixin Chen. D-vae: A variational autoen-
coder for directed acyclic graphs. pages 1586–1598, 2019.
[65] Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Curriculum learning by optimizing learning dynamics. In
Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages
433–441. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/zhou21a.
html.
[66] Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Curriculum learning by optimizing learning dynamics. In
International Conference on Artificial Intelligence and Statistics, pages 433–441. PMLR, 2021.
[67] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning, 2016. URL
https://arxiv.org/abs/1611.01578.
[68] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for
scalable image recognition, 2017. URL https://arxiv.org/abs/1707.07012.
Efficient Data Subset Selection to Generalize Training Across Models:
Transductive and Inductive Networks
(Appendix)
A Limitations
While our work outperforms several existing subset selection methods, it suffers from three key limitations.
(1) We acknowledge that there is indeed a computational cost for pre-training the model approximator.
However, as mentioned in the amortization analysis in Section 5.2, this one-time overhead is quickly offset by the
speed and effectiveness of the subsequent selection pipeline for unseen architectures. Since
the prior training is only a one-time overhead, the overall cost is amortized by the number of unseen architectures
queried at inference time. In practice, when our method is used to predict the subset for a large number of unseen
architectures, the effect of the training overhead quickly vanishes. As demonstrated by our experiments (Figures 2
and 3, Table 6), once the pipeline of all the neural networks is set up, the selection procedure is remarkably fast
and can easily be applied to unseen architectures.
To give an analogy, premier search engines invest substantial resources in making inference fast rather than in
training. They build complex models that are difficult and computationally intensive to train, but once trained, their
inference is fast across many queries. Thus, the cost is amortized by the large number of queries. Another
example is locality-sensitive hashing (LSH). Researchers design trainable models for LSH whose purpose is to make
fast predictions. Training LSH models can take a lot of time, but again this cost is amortized by the number of
unseen queries.
Finally, we would like to highlight the many efficient model training methods that avoid complete training (e.g., running
a few epochs via curriculum learning [66]), which one can easily explore and plug into our method for a larger
dataset like Imagenet-1K.
(2) Our space of neural architectures comprises only CNNs. We did not experiment with
sequence models such as RNNs or transformers. However, we believe that our work can be extended to RNN-
or transformer-based architectures.
(3) If the distribution of network architectures varies widely from training to test, then there is significant
room for performance improvement. In this context, one can develop domain adaptation methods for graphs to
tackle different out-of-distribution architectures more effectively.
B Broader Impact
Our trainable subset selection method can provide significant compute efficiency. It can save much of the time
and power that ML models often demand. Specifically, it can be
used in the following applications in the context of AutoML.
Fast tuning of hyperparameters related to the optimizer/training. Consider the case where we need to tune
non-network hyperparameters, such as the learning rate, momentum, and weight decay. Given the architecture,
we can use the subset obtained by our method to train the underlying model parameters for different
hyperparameters, which can then be used for cross-validation. Note that we would use the same subset throughout,
since the underlying model architecture is fixed and we obtain a subset for the given architecture
independently of the non-network hyperparameters. We have shown the utility of our method in our
experiments in Section 5.2.
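The reuse of one architecture-specific subset across hyperparameter configurations can be sketched as below. The `train` and `validate` callables are placeholders (not part of SubSelNet), and the toy stand-ins in the usage example exist only to make the loop runnable.

```python
from itertools import product

def tune_on_subset(subset, grid, train, validate):
    """Tune non-network hyperparameters (learning rate, momentum, weight decay)
    on one fixed subset: since the architecture is fixed, the same subset S is
    reused for every configuration instead of re-selecting or using all of D."""
    best_cfg, best_score = None, float("-inf")
    for lr, momentum, wd in product(*grid.values()):
        model = train(subset, lr=lr, momentum=momentum, weight_decay=wd)
        score = validate(model)
        if score > best_score:
            best_cfg, best_score = (lr, momentum, wd), score
    return best_cfg, best_score

# Toy stand-ins: "training" just returns the config; "validation" scores it.
grid = {"lr": [0.01, 0.1], "momentum": [0.9], "weight_decay": [0.0, 1e-4]}
train = lambda subset, **cfg: cfg
validate = lambda cfg: -abs(cfg["lr"] - 0.1) - cfg["weight_decay"]
best = tune_on_subset(subset=list(range(100)), grid=grid, train=train, validate=validate)
print(best[0])  # (0.1, 0.9, 0.0)
```

The key design point is that subset selection runs once per architecture, so its cost is shared across the whole hyperparameter grid.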
Fast tuning of model-related hyperparameters. Consider the case where we need to tune network-related
hyperparameters, such as the number of layers, activation functions, and the width of intermediate layers. Instead
of training each instance of these models on the entire data, we can train them on the subset of data obtained
from our method to quickly obtain the trained model, which can then be used for cross-validation.
Network architecture search. As shown in our experiments, our method can provide a speedup in network
architecture search. Here, instead of training the network on the entire dataset during architecture exploration, we
can restrict the training to a subset of the data, which can provide a significant speedup.
Note that the key goal of our method is to design a trainable subset selection method that generalizes across
architectures. As we observed in our experiments, such a method can be useful in the above applications;
however, our method is a generic framework and is not tailored to any one of them. Therefore, our
method may need application-specific modifications before being directly deployed in practice. Nevertheless, it
can serve as a base model for a practitioner who intends to speed up one of the above applications.
We do not foresee any negative social impact of our work.
D Additional details about experimental setup
D.1 Dataset
Datasets (D). Table 7 reports, for each dataset, the number of classes, whether it is imbalanced, the train-test split, the input shape, and the transformations applied during training.
Table 7: A brief description of the datasets used along with the transformations applied during training
Architectures (M). We leverage the NASBench-101 search space as an architecture pool. It consists of 423,624 unique architectures with the following constraints: (1) the number of nodes in each cell is at most 7, (2) the number of edges in each cell is at most 9, (3) barring the input and output, there are three unique operations, namely 1 × 1 convolution, 3 × 3 convolution and 3 × 3 max-pool. We utilize the architectures from the search space in generating the sequence of embeddings, along with sampling architectures for the training and testing of the encoder, and datasets for the subset selector. As mentioned in the experimental setup, pre-training a large number of models can be expensive. Therefore, we choose a diverse subset of 250 architectures from Mtr that ensures efficient representation over both the low-parameter and high-parameter regimes. The distributions of the true and sampled architectures are given in Figure 5.
Figure 5: Distribution of parameters of architectures in Mtr when |Mtr| = 423k (blue), and Mtr with the sampled set of 250 architectures (orange).
Note that the pre-training can be made faster by efficient model training methods that avoid complete training (e.g., running a few epochs via curriculum learning [66]), which can be easily plugged into our method.
Online-Glister approximates the objective function with a Taylor series expansion up to an arbitrary number of terms to speed up the process; we used 15 terms in our experiments.
Grad-Match applies the orthogonal matching pursuit (OMP) algorithm to the data points of each mini-batch to match the gradient of a subset to that of the entire training/validation set. Here, the learning rate is set to 0.01, the regularization constant in OMP is 1.0, and the algorithm optimizes the objective function within an error margin of 10−4.
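For intuition, the greedy OMP loop underlying gradient matching can be sketched as below. This is an illustration of the idea, not the CORDS implementation: per-example gradients are stacked as columns of `G`, and the function name `omp` and its signature are our own choices.

```python
import numpy as np

def omp(G, g_target, k, tol=1e-4):
    """Greedy orthogonal matching pursuit: pick up to k per-example
    gradients (columns of G) whose linear combination approximates the
    full-batch gradient g_target within tolerance tol."""
    residual = g_target.astype(float).copy()
    selected = []
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break
        corr = G.T @ residual          # correlation of each column with residual
        corr[selected] = -np.inf       # never re-pick a selected column
        selected.append(int(np.argmax(corr)))
        # least-squares refit of the target on the selected columns
        w, *_ = np.linalg.lstsq(G[:, selected], g_target, rcond=None)
        residual = g_target - G[:, selected] @ w
    return selected
```

With an orthogonal gradient matrix, the loop recovers exactly the examples whose gradients compose the target.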
GraNd. This is an adaptive subset selection strategy in which the norm of the gradient of the loss function is used as a score to rank a data point. The gradient scores are computed after the model has trained on the full dataset for the first few epochs. For the remaining epochs, the model is trained only on the top-k data points selected using the gradient scores. In our implementation, we let the model train on the full dataset for the first 5 epochs, and computed the gradient of the loss only with respect to the last fully connected layer.
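When the gradient is restricted to the last linear layer (logits z = W h), the cross-entropy gradient for one example is (p − y)h⊤, whose Frobenius norm factors as ‖p − y‖₂ · ‖h‖₂. The sketch below (plain NumPy; the function names are ours, for illustration only) computes these scores and the resulting top-k selection:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grand_scores(feats, logits, labels):
    """GraNd-style score restricted to the last linear layer:
    ||p - y||_2 * ||h||_2 per example."""
    p = softmax(logits)
    y = np.eye(logits.shape[1])[labels]    # one-hot labels
    return np.linalg.norm(p - y, axis=1) * np.linalg.norm(feats, axis=1)

def top_k_indices(scores, k):
    # Keep the k highest-scoring (hardest) examples.
    return np.argsort(scores)[::-1][:k]
```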
EL2N. When the loss function used to compute the GraNd scores is the cross entropy loss, the norm of the
gradient for a data point x can be approximated by E||p(x) − y||2 , where p(x) is the discrete probability
distribution over the classes, computed by taking softmax of the logits, and y is the one-hot encoded true label
corresponding to the data point x. Similar to our implementation of GraNd, we computed the EL2N scores after
letting the models train on the full data for the first 5 epochs.
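The EL2N score itself is a few lines of NumPy (an illustrative sketch, not our exact training code):

```python
import numpy as np

def el2n_scores(logits, labels):
    """EL2N: the L2 norm ||p(x) - y||_2 between the softmax prediction
    and the one-hot label, a cheap proxy for the gradient-norm score."""
    z = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    y = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(p - y, axis=1)
```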
Figure 6: Kullback-Leibler divergence values KL(mθ∗ (xi ) || gβ (Hm , xi )) computed during the training of the model encoder gβ over 80 epochs.
Model Approximator gβ. The model approximator gβ is essentially a single-head attention block that acts on a sequence of node embeddings Hm = {hu | u ∈ Vm}. The Query, Key and Value are three linear networks with parameters Wquery, Wkey, Wvalue ∈ R16×8, and WC ∈ R8×16 is the matrix appearing in Eq. (5). As described in Section 4.2, for each node u, we use a feedforward network, preceded and succeeded by layer normalization operations, given by the following equations (where LN denotes the Layer-Normalization operation):
ζu,1 = LN(Attu + hu ; γ1 , γ2 ),
ζu,2 = W2⊤ ReLU(W1⊤ ζu,1 ),
ζu,3 = LN(ζu,1 + ζu,2 ; γ3 , γ4 ).
The fully connected network acting on ζu,1 consists of matrices W1 ∈ R16×64 and W2 ∈ R64×16 . All the
trainable matrices along with the layer normalizations were implemented using the Linear and LayerNorm
functions in Pytorch. The last item of the output sequence ζu,3 is concatenated with the data embedding xi and
fed to another 2-layer fully-connected network with hidden dimension 256 and dropout probability of 0.3. The
model approximator is trained by minimizing the KL-divergence between gβ (Hm , xi ) and mθ∗ (xi ). We used
an AdamW optimizer with learning rate of 10−3 , ϵ = 10−8 , betas = (0.9, 0.999) and weight decay of 0.005.
We also used Cosine Annealing to decay the learning rate, and used gradient clipping with maximum norm set to
5. Figure 6 shows the convergence of the outputs of the model approximator gβ (Hm , xi ) with the outputs of the
model mθ∗ (xi ).
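The attention block above can be sketched in NumPy with the stated dimensions (16 → 8 for queries/keys/values, WC ∈ R8×16, feedforward 16 → 64 → 16). For brevity we omit the learnable LayerNorm parameters γ and use plain scaled dot-product attention, so this illustrates the equations rather than reproducing our exact PyTorch implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without the learnable scale/shift (gamma) parameters.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attention_block(H, Wq, Wk, Wv, Wc, W1, W2):
    """Single-head attention over node embeddings H (n x 16), following
    zeta_1 = LN(Att + H), zeta_2 = ReLU(zeta_1 W1) W2,
    zeta_3 = LN(zeta_1 + zeta_2)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv            # (n, 8) each
    scores = Q @ K.T / np.sqrt(K.shape[1])      # scaled dot-product
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)           # row-wise softmax
    att = (A @ V) @ Wc                          # project back to (n, 16)
    z1 = layer_norm(att + H)                    # residual + LN
    z2 = np.maximum(z1 @ W1, 0.0) @ W2          # 16 -> 64 -> 16 feedforward
    return layer_norm(z1 + z2)
```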
Neural Network πψ. The inductive model is a three-layer fully-connected neural network with two Leaky ReLU activations and a sigmoid activation after the last layer. The input to πψ is the concatenation (Hm ; om,i ; xi ; yi ). The hidden dimensions of the two intermediary layers are 64 and 16, and the final layer is a single neuron that outputs the score corresponding to a data point xi. While training πψ, we add a regularization term λ′ (Σi∈D πψ (Hm , om,i , xi , yi ) − |S|) to ensure that nearly |S| samples have high scores out of the entire dataset D. Both the regularization constants λ (in Eq. (3)) and λ′ are set to 0.1. We train the model weights using an Adam optimizer with a learning rate of 0.001. During training, at each iteration, we sort the data points based on their probability values; during each computation step, we use one instance of the ranked list to compute the unbiased estimate of the objective in Eq. (3).
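A simplified NumPy sketch of πψ and the cardinality regularizer follows. The class name, the Gaussian initialization, and the absolute value in the penalty (the text writes the penalty without it) are our illustrative choices, not details of the released implementation:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60, 60)))

class ScorerMLP:
    """Sketch of pi_psi: input -> 64 -> 16 -> 1 with Leaky ReLU activations
    and a final sigmoid producing a selection score in (0, 1). The input
    would be the concatenation (H_m; o_{m,i}; x_i; y_i)."""
    def __init__(self, d_in, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, 64))
        self.W2 = rng.normal(0.0, 1.0 / 8.0, (64, 16))
        self.W3 = rng.normal(0.0, 1.0 / 4.0, (16, 1))

    def __call__(self, X):
        h = leaky_relu(X @ self.W1)
        h = leaky_relu(h @ self.W2)
        return sigmoid(h @ self.W3).squeeze(-1)

def cardinality_penalty(scores, budget, lam=0.1):
    """lambda' * |sum_i pi(...) - |S||: encourages roughly |S| points
    to receive high scores."""
    return lam * abs(float(scores.sum()) - budget)
```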
E Additional experiments
E.1 Comparison with additional baselines
Here, we compare the performance of SubSelNet against two additional baselines, which are two variants of our method: Bottom-b-loss and Bottom-b-loss+gumbel.
In Bottom-b-loss, we sort the data instances based on their predicted loss ℓ(gβ (Hm , x), y) and consider those
points with the bottom b values.
In Bottom-b-loss+gumbel, we add noise sampled from the Gumbel distribution with µgumbel = 0 and βgumbel = 0.025, and sort the instances based on these noisy loss values, i.e., ℓ(gβ (Hm , x), y) + Gumbel(0, βgumbel = 0.025).
Figure 7 compares the performance of the variants of SubSelNet, Bottom-b-loss, and Bottom-b-loss+gumbel. We observe that Bottom-b-loss and Bottom-b-loss+gumbel do not perform that well in spite of being efficient in terms of time and memory.
(Figure 7: tradeoff panels plotting relative accuracy reduction (RAR) against running time (RT) for Transductive-SubSelNet, Inductive-SubSelNet, Bottom-b-loss and Bottom-b-loss+gumbel; further panels compare Transductive-SubSelNet and Inductive-SubSelNet with Pruning and Selection-via-Proxy.)
E.3 Analysis of compute efficiency in high accuracy regime
We analyze the compute efficiency of all the methods, given an allowance of reaching 20% and 10% relative accuracy reduction (RAR), i.e., 80% and 90% of the accuracy achieved by training on the full data. We make the following observations:
1. Transductive-SubSelNet achieves the best speedup and consumes the least memory, followed by Inductive-SubSelNet.
2. For CIFAR100, Tiny-Imagenet and Caltech-256, Bottom-b-loss and Bottom-b-loss+gumbel achieve better performance than those baselines that are able to reach the desired RAR milestones (10% or 20%).
3. For CIFAR100, Tiny-Imagenet and Caltech-256, most baselines could not reach either the 10% or even the 20% RAR milestone relative to training on the full data.
FMNIST CIFAR10 CIFAR100
Speedup Memory Speedup Memory Speedup Memory
Method 10% 20% 10% 20% 10% 20% 10% 20% 10% 20% 10% 20%
GLISTER 5.64 7.85 98.51 116.36 1.52 2.12 515.96 365.05 0.54 1.02 1427.77 758.55
GradMatch 4.17 5.24 136.40 243.75 1.69 2.20 457.67 362.47 — 0.84 — 917.04
EL2N 6.50 16.42 77.63 139.89 1.93 4.78 413.90 170.03 — — — —
GraNd — 1.18 — 450.73 — — — — — — — —
FacLoc 0.82 2.37 652.67 81.01 — 0.80 — 558.56 — — — —
Pruning 3.12 4.68 559.44 19.55 3.54 5.53 221.10 139.41 — 1.71 — 452.09
Selection-via-Proxy 3.65 18.09 168.20 35.27 1.95 1.03 819.22 410.26 — 1.02 — 765.05
Tiny-Imagenet Caltech-256
Speedup Memory Speedup Memory
Method 10% 20% 10% 20% 10% 20% 10% 20%
GLISTER 1.65 2.18 26705.5 22687.6 1.31 1.76 16904.5 13921.4
GradMatch 1.53 2.08 28249.9 25530.4 1.45 2.07 16499.3 12507.9
EL2N — 2.30 — 21811.1 — — — —
GraNd — 2.16 — 24822.6 — — — —
FacLoc — — — — — — — —
Pruning — — — — — — — —
Selection-via-Proxy — 1.04 — 30717.2 — — — —
Table 8: Speedup and memory in reaching 10% and 20% RAR (90% and 80% of the maximum accuracy of Full selection) in the tradeoff curves in Figures 2 and 3 for all datasets. In the table, "—" denotes that, under the current experimental setup, i.e., the range of subsets considered, the method could not attain 20% or 10% RAR. Note that Bottom-b-loss and Bottom-b-loss+gumbel are variants/ablations of our method.
coefficient (Kτ) along with the Jaccard coefficient |A ∩ A∗ |/|A ∪ A∗ |.
Figure 9 summarizes the results for three non-adaptive subset selectors, namely Transductive-SubSelNet, Inductive-SubSelNet and FL, in terms of accuracy. We make the following observations: (1) One of our variants outperforms FL in most of the cases on CIFAR10 and CIFAR100. (2) There is no consistent winner between Transductive-SubSelNet and Inductive-SubSelNet, although Inductive-SubSelNet consistently outperforms both Transductive-SubSelNet and FL on CIFAR100 in terms of the Jaccard coefficient.
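For reference, the two ranking metrics can be computed as below (a straightforward O(n²) Kendall's τ over two rankings of the same items, without tie handling; function names are ours):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kendall_tau(r1, r2):
    """Kendall rank correlation between two rankings of the same items
    (lists of item ids, best first): (concordant - discordant) / C(n, 2)."""
    pos1 = {v: i for i, v in enumerate(r1)}
    pos2 = {v: i for i, v in enumerate(r2)}
    items = list(pos1)
    n = len(items)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pos1[items[i]] - pos1[items[j]]) * (pos2[items[i]] - pos2[items[j]])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```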
Figure 9: Comparison of the three non-adaptive subset selectors (Transductive-SubSelNet, Inductive-SubSelNet and FL) on ranking and choosing the top-15 architectures on the basis of the Jaccard coefficient and the Kendall tau rank correlation coefficient (Kτ).
Inductive-SubSelNet, although it remains extremely small as compared to the final training on the inferred subset; and, (3) the selection time of FL is large, as high as 323% of the training time.
Table 11: Variation of accuracy with subset size of both the variants of SubSelNet on the training, validation and test sets of CIFAR10
Since the amount of training data is small, there is a possibility of overfitting. However, the coefficient λ of the entropy regularizer λH(Prπ) can be increased to draw instances from different regions of the feature space, which in turn can reduce overfitting. In practice, we tuned λ on the validation set to control such overfitting.
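For concreteness, the entropy of a selection distribution can be computed as below; increasing λ rewards distributions closer to uniform (whose entropy attains the maximum, log n), which spreads the selected points over the feature space. The function name is ours, for illustration:

```python
import numpy as np

def selection_entropy(p, eps=1e-12):
    """Entropy H(Pr_pi) of a selection distribution over training points.
    Higher values mean selection mass is spread more evenly."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize to a distribution
    return -np.sum(p * np.log(p + eps))
```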
We present the accuracies on (training, validation, test) folds for both Transductive-S UB S EL N ET and
Inductive-S UB S EL N ET in Table 11. We make the following observations:
1. From training to test, in most cases, the decrease in accuracy is ∼ 7%.
2. This small accuracy gap is further reduced from validation to test. Here, in most cases, the decrease in
accuracy is ∼ 4%.
Figure 10: Normal [left] and Reduction [right] cells found by SubSelNet during Neural Architecture Search using a 40% subset of the CIFAR10 dataset.
Standard error results on NAS and HPO. Here, we present the mean and standard error over runs of the Test Error (%) on NAS and HPO, which were presented in the main draft (Section 5.2). We observe that our method exhibits lower deviation across runs. Moreover, we found that the gain offered by our method is statistically significant with p ≈ 0.05.
Method b = 0.1|D| b = 0.2|D| b = 0.4|D|
Table 12: Mean and standard error of Test Error (%) on architectures given by NAS on CIFAR10
Method b = 0.05|D| b = 0.1|D|
Table 13: Mean and standard deviation of Test Error (%) for hyperparameters selected by HPO on CIFAR10
F Pros and cons of using GNNs
We have used a GNN in our model encoder to encode the architecture representations into an embedding. We chose a GNN for this task for the following reasons:
1. Message passing between the nodes (which may be the input, output, or any of the operations) allows us to generate embeddings that capture the contextual structural information of a node, i.e., the embedding of each node captures not only the operation of that node but also, to a large extent, the operations preceding it. To better illustrate the impact of the GNN, we compared it with a baseline where we directly fed the graph structure to the model approximator using the adjacency matrix, in lieu of the GNN-derived node embeddings. This alteration resulted in a notable performance decline of 5-6% RAR at a subset size of 10% of CIFAR10.
Table 14: RAR and KL-divergence for different embeddings (A: Adjacency Matrix, H: GNN embedding) fed to the model approximator, for subset sizes b = 0.05|D| and b = 0.1|D|
2. It has been shown by [38] and [57] that GNNs are as powerful as the Weisfeiler-Lehman algorithm and thus give a powerful representation of the graph. Hence, we obtain smooth embeddings of the nodes/edges that can effectively distill information from their neighborhood without significant compression.
3. GNNs embed a model architecture into representations independent of the underlying dataset and the model parameters, since they operate only on the nodes and edges, i.e., the structure of the architecture, and do not use the parameter values or the input data.
However, the GNN faces the following drawbacks:
1. A GNN uses a symmetric aggregator for message passing over node neighbors to ensure that the representation of any node is invariant to a permutation of its neighbors. Such a symmetric aggregator renders the GNN a low-pass filter, as shown in [41], which attenuates important high-frequency signals.
2. We are training one GNN using several architectures. This can lead to insensitivity of the embedding to changes in the architecture. In the context of model architecture, if we change the operation of one node in the architecture (by removing, adding, or changing the operation), the model's output can change significantly. However, the GNN embedding may become immune to such changes, since the GNN is being trained over many architectures.
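Drawback 1 can be made concrete: a mean aggregator yields the same node representation no matter how its neighbors are ordered, so relabeling the graph merely permutes the output rows. A one-step sketch (assuming an unweighted adjacency matrix; the function name is ours):

```python
import numpy as np

def mean_aggregate(H, adj):
    """One message-passing step with a symmetric (mean) aggregator:
    each node's new embedding is the average of its neighbors' embeddings."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # isolated nodes keep a zero message
    return (adj @ H) / deg
```

Because the aggregation is symmetric, permuting the node labels (P) commutes with the step: aggregate(PH, P A Pᵀ) = P · aggregate(H, A).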
G Licensing Information
The NAS-Bench-101 dataset and DARTS Neural Architecture Search are publicly available under the Apache
License. The baselines GRAD-MATCH and GLISTER, publicly available via CORDS, and the apricot library
used for Facility Location, are under the MIT License.