Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell, Ananya Ganesh, Andrew McCallum
model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatt-hours of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) Time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.

2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

Consumer         Renew.   Gas    Coal   Nuc.
China            22%      3%     65%    4%
Germany          40%      7%     38%    13%
United States    17%      35%    27%    19%
Amazon-AWS       17%      24%    30%    26%
Google           56%      14%    15%    10%
Microsoft        32%      23%    31%    10%

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total energy p_t (in kWh) consumed during training is given by:

    p_t = 1.58 t (p_c + p_r + g p_g) / 1000        (1)

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert energy to estimated CO2 emissions:

    CO2e = 0.954 p_t        (2)

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States.

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
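To make the measurement procedure and equations (1) and (2) concrete, the sketch below shows one possible implementation. It is an illustrative reconstruction, not the code used in the paper: the GPU query relies on nvidia-smi's power.draw field, and the CPU and DRAM draws (hard-coded here) would in practice come from RAPL as described above.

```python
import subprocess
import time

def sample_gpu_power_watts() -> float:
    """Average instantaneous power draw (W) across visible GPUs, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        encoding="utf-8",
    )
    draws = [float(line) for line in out.strip().splitlines()]
    return sum(draws) / len(draws)

def estimate_energy_and_co2(hours: float, p_c: float, p_r: float, p_g: float, g: int,
                            pue: float = 1.58, lbs_co2_per_kwh: float = 0.954):
    """Equation (1): energy in kWh including PUE; Equation (2): CO2e in lbs (U.S. mix).

    p_c: average CPU socket draw (W); p_r: average DRAM draw (W);
    p_g: average draw of one GPU (W); g: number of GPUs; hours: training time t.
    """
    kwh = pue * hours * (p_c + p_r + g * p_g) / 1000.0   # Eq. (1)
    co2e = lbs_co2_per_kwh * kwh                          # Eq. (2)
    return kwh, co2e

if __name__ == "__main__":
    samples = []
    for _ in range(3):                 # in practice, poll for the duration of training
        samples.append(sample_gpu_power_watts())
        time.sleep(1)
    avg_gpu_w = sum(samples) / len(samples)
    # CPU/DRAM draws are placeholders here; the paper samples them with Intel RAPL.
    print(estimate_energy_and_co2(hours=84, p_c=100.0, p_r=20.0, p_g=avg_gpu_w, g=8))
```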
Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used.

2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer (T2T) model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (T2Tbase; 65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (T2Tbig; 213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.
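As a rough sanity check (our arithmetic, not the authors'), these totals follow from the per-run figures reported above; the small gap on the TPU number is presumably rounding in the original report.

```python
# Reproducing the NAS totals quoted above from the reported per-run figures.
total_search_steps = 979e6      # total training steps in the full architecture search
steps_per_run      = 300e3      # steps per training run used as the unit of comparison
runs = total_search_steps / steps_per_run            # ~3,263 equivalent runs

tpu_hours = runs * 10           # 10 hours per 300k steps on one TPUv2 core
gpu_hours = runs * 8 * 12       # T2Tbase: 12 hours on 8 P100 GPUs per 300k steps

print(f"{tpu_hours:,.0f} TPU hours")    # ~32,633 (the paper reports 32,623)
print(f"{gpu_hours:,.0f} GPU hours")    # 274,120, matching the text
```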
ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (BERTbase; 110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.[6]

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.

[6] Via the authors on Reddit.
[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43–$0.74/hr; upper bound uses on-demand U.S. resources priced at $1.46–$2.48/hr. We similarly use pre-emptible ($1.46/hr–$2.40/hr) and on-demand ($4.50/hr–$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
Model      Hardware    Power (W)   Hours     kWh·PUE   CO2e (lbs)   Cloud compute cost
T2Tbase    P100x8      1415.78     12        27        26           $41–$140
T2Tbig     P100x8      1515.43     84        201       192          $289–$981
ELMo       P100x3      517.66      336       275       262          $433–$1472
BERTbase   V100x64     12,041.51   79        1507      1438         $3751–$12,571
BERTbase   TPUv2x16    —           96        —         —            $2074–$6912
NAS        P100x8      1515.43     274,120   656,347   626,155      $942,973–$3,201,722
NAS        TPUv2x1     —           32,623    —         —            $44,055–$146,848
GPT-2      TPUv3x32    —           168       —         —            $12,902–$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).[7] Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.
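To make the cloud compute cost column concrete, the sketch below reconstructs the GPU cost bounds from the pricing in footnote 7. It is our illustration rather than the authors' code, and it assumes the lower price in each quoted range applies to the P100 and the higher to the V100; under that assumption it reproduces, e.g., the $41–$140 range for T2Tbase and the $3751–$12,571 range for BERTbase.

```python
# Illustrative reconstruction of the GPU cost bounds in Table 3 (footnote 7).
# Prices are the per-GPU-hour figures quoted in footnote 7: pre-emptible (lower
# bound) and on-demand (upper bound) U.S. cloud pricing.
PRICE_PER_GPU_HOUR = {
    "P100": (0.43, 1.46),   # (pre-emptible, on-demand) USD/hr
    "V100": (0.74, 2.48),
}

def cloud_cost_bounds(gpu_type: str, num_gpus: int, hours: float) -> tuple[float, float]:
    """Return (lower, upper) cloud compute cost in USD for a training run."""
    low, high = PRICE_PER_GPU_HOUR[gpu_type]
    gpu_hours = num_gpus * hours
    return gpu_hours * low, gpu_hours * high

print(cloud_cost_bounds("P100", 8, 12))      # T2Tbase:  (~$41, ~$140)
print(cloud_cost_bounds("V100", 64, 79.2))   # BERTbase: (~$3,751, ~$12,571)
```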
are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.

Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute.

Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for non-profit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.
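The hardware figure above follows from simple arithmetic; the sketch below reproduces it. The cloud-side comparison uses an assumed on-demand price taken from footnote 7 (the case-study GPUs were 1080 Tis, so this is only a rough proxy).

```python
# Rough arithmetic (ours) behind the in-house hardware estimate in the text.
GPUS_NEEDED  = 58          # approximate concurrent GPUs used in the case study
DAYS         = 172
SERVER_GPUS  = 8           # GPUs per off-the-shelf server
SERVER_PRICE = 20_000      # USD, 8x 1080 Ti server plus supporting hardware

servers_needed = GPUS_NEEDED / SERVER_GPUS       # 7.25 servers
hardware_cost  = servers_needed * SERVER_PRICE   # ~$145,000, matching the text

# For comparison: the same GPU-hours rented on demand, assuming an hourly price
# from the on-demand range in footnote 7 (an assumption; 1080 Ti pricing differs).
gpu_hours  = GPUS_NEEDED * DAYS * 24             # ~239,000 GPU-hours
cloud_cost = gpu_hours * 1.46                    # ~$350,000 at $1.46/GPU-hour

print(f"hardware: ${hardware_cost:,.0f}, on-demand cloud: ~${cloud_cost:,.0f}")
```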
Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,[10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning models in NLP.

[10] For example, the Hyperopt Python library.
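As a minimal illustration of the alternatives mentioned above (a sketch with a toy objective and a hypothetical search space, not a recommendation of specific settings), random search covers the same space as exhaustive grid search with a fraction of the training runs; libraries such as Hyperopt replace the random sampler with Bayesian/TPE proposals behind a similar interface.

```python
import itertools
import random

# Hypothetical search space; in practice objective() would train and evaluate
# an NLP model with the given hyperparameters and return its dev-set score.
SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "dropout":       [0.1, 0.2, 0.3, 0.4, 0.5],
    "hidden_size":   [128, 256, 512, 1024],
}

def objective(cfg):
    # Stand-in for "train a model and return dev-set score".
    return -(cfg["learning_rate"] - 1e-3) ** 2 - (cfg["dropout"] - 0.3) ** 2

# Exhaustive grid search: one (expensive) training run per grid point.
grid = [dict(zip(SPACE, vals)) for vals in itertools.product(*SPACE.values())]
best_grid = max(grid, key=objective)                  # 4 * 5 * 4 = 80 runs

# Random search: a fixed, much smaller budget of runs over the same space.
random.seed(0)
samples = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(16)]
best_random = max(samples, key=objective)             # 16 runs

print(len(grid), best_grid)
print(len(samples), best_random)
```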
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References

Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477–484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).