EL-Attention: Memory Efficient Lossless Attention for Generation

Yan, Yu; Chen, Jiusheng; Qi, Weizhen; Bhendawade, Nikhil; Gong, Yeyun; Duan, Nan; Zhang, Ruofei

arXiv:2105.04779 (cs)

[Submitted on 11 May 2021 (v1), last revised 11 Jun 2021 (this version, v2)]

Abstract: Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging larger batch sizes for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations for building multi-head keys and values, so no cache for them is needed. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
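The abstract does not spell out the mechanics. One way to see how "expanding the query while keeping key and value shared" can give exactly the same output as multi-head attention is through matrix associativity. The NumPy sketch below illustrates that idea and is not the authors' implementation. The function names, the weight layout (`Wq`, `Wk`, `Wv`, `Wo` stacked per head), the single-query decoding step, and the omission of bias terms are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(q, x, Wq, Wk, Wv, Wo):
    """Standard multi-head attention for one query vector (no biases).

    q:  (d_model,)    current decoder state
    x:  (n, d_model)  hidden states being attended to
    Wq, Wk, Wv: (h, d_model, d_head) per-head projections
    Wo: (h, d_head, d_model) output projection, split per head
    """
    h, _, d_head = Wq.shape
    out = np.zeros(q.shape[0])
    for i in range(h):
        qi = q @ Wq[i]   # (d_head,)
        Ki = x @ Wk[i]   # (n, d_head) per-head keys, normally cached
        Vi = x @ Wv[i]   # (n, d_head) per-head values, normally cached
        p = softmax(qi @ Ki.T / np.sqrt(d_head))  # (n,)
        out += (p @ Vi) @ Wo[i]
    return out


def shared_kv_attention(q, x, Wq, Wk, Wv, Wo):
    """Same result, but x itself serves as the shared key and value.

    The key projection is folded into the query, and the value
    projection is applied after the weighted sum over x, so no
    per-head keys or values are ever built.
    """
    h, _, d_head = Wq.shape
    out = np.zeros(q.shape[0])
    for i in range(h):
        q_exp = (q @ Wq[i]) @ Wk[i].T              # (d_model,) expanded query
        p = softmax(q_exp @ x.T / np.sqrt(d_head)) # (n,) same weights as above
        out += ((p @ x) @ Wv[i]) @ Wo[i]           # project after attending
    return out
```

In this sketch the only tensor shared across heads and decoding steps is `x`, so a cache would hold one `(n, d_model)` array rather than `h` separate key and value arrays. The trade-off is that each head's query grows from `d_head` to `d_model` dimensions.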

Comments:	ICML 2021. Version 2: add pseudocode
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2105.04779 [cs.CL]
	(or arXiv:2105.04779v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2105.04779

Submission history

From: Yu Yan
[v1] Tue, 11 May 2021 04:37:52 UTC (345 KB)
[v2] Fri, 11 Jun 2021 21:18:59 UTC (351 KB)
