Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

Trivedi, Aashka; Udagawa, Takuma; Merler, Michele; Panda, Rameswar; El-Kurdi, Yousef; Bhattacharjee, Bishwaranjan

arXiv:2303.09639 (cs)

[Submitted on 16 Mar 2023 (v1), last revised 13 Oct 2023 (this version, v2)]

Abstract: Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. However, KD can be ineffective when the student is manually selected from a set of existing options, since that choice can be sub-optimal within the space of all possible student architectures. We develop multilingual KD-NAS, which uses Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher. In each episode of the search process, a NAS controller predicts a reward based on the distillation loss and inference latency. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full training corpus. KD-NAS can automatically trade off efficiency and effectiveness, and recommends architectures suited to various latency budgets. Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-Roberta Base teacher while maintaining 90% performance, and has been deployed in 3 software offerings requiring large throughput, low latency and deployment on CPU.
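
The search loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the search space, the reward formula (negative loss minus a weighted latency), the nearest-neighbour "controller", and the toy proxy distillation function are all assumptions invented here to show the shape of the procedure (predict rewards, distill the top candidates on a proxy set, keep the best).

```python
import itertools
import random

# Hypothetical search space of (num_layers, hidden_size); not the paper's space.
SEARCH_SPACE = list(itertools.product([3, 4, 6], [256, 384, 512]))


def toy_proxy_distillation(arch):
    """Stand-in for distilling a student on a small proxy set.

    Returns (distillation_loss, latency). Both formulas are invented
    for illustration and carry no empirical meaning.
    """
    layers, hidden = arch
    loss = 256.0 / (layers * hidden)        # larger students fit the teacher better
    latency = layers * (hidden / 256) ** 2  # larger students are slower
    return loss, latency


def reward(loss, latency, latency_weight=0.05):
    """Assumed reward: lower loss and lower latency are both better."""
    return -loss - latency_weight * latency


def kd_nas_sketch(episodes=2, top_k=3, seed=0):
    rng = random.Random(seed)
    observed = {}  # arch -> reward measured after proxy distillation

    def predict(arch):
        # Toy controller: reuse the reward of the nearest evaluated
        # architecture, plus a little noise; guess randomly at the start.
        if not observed:
            return rng.random()
        nearest = min(
            observed,
            key=lambda a: abs(a[0] - arch[0]) + abs(a[1] - arch[1]) / 128,
        )
        return observed[nearest] + rng.gauss(0, 0.01)

    for _ in range(episodes):
        # Prefer architectures that have not been distilled yet.
        candidates = [a for a in SEARCH_SPACE if a not in observed] or SEARCH_SPACE
        ranked = sorted(candidates, key=predict, reverse=True)
        for arch in ranked[:top_k]:
            loss, latency = toy_proxy_distillation(arch)
            observed[arch] = reward(loss, latency)

    # The highest-reward architecture would then be distilled on the full corpus.
    best = max(observed, key=observed.get)
    return best, observed


if __name__ == "__main__":
    best_arch, history = kd_nas_sketch()
    print("selected architecture:", best_arch)
    print("architectures evaluated:", len(history))
```

Raising `latency_weight` in this sketch pushes the selection toward smaller, faster students, which mirrors the efficiency/effectiveness trade-off the abstract mentions, though the actual weighting used by KD-NAS is not stated on this page.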

Comments:	11 pages, 5 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2303.09639 [cs.CL]
	(or arXiv:2303.09639v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.09639

Submission history

From: Aashka Trivedi
[v1] Thu, 16 Mar 2023 20:39:44 UTC (870 KB)
[v2] Fri, 13 Oct 2023 21:34:39 UTC (1,548 KB)
