Representation Learning and NLP
1.1 Motivation
To build an effective machine learning system, we first transform useful information from raw data into internal representations such as feature vectors. Then, by designing appropriate objective functions, we can employ optimization algorithms to find the optimal parameter settings for the system.
Data representation determines how much useful information can be extracted from raw data for further classification or prediction. The more useful information is preserved when transforming raw data into feature representations, the better the performance of classification or prediction tends to be. Hence, data representation is a crucial component of effective machine learning.
Conventional machine learning systems adopt careful feature engineering as preprocessing to build feature representations from raw data. Feature engineering requires careful design and considerable expertise, and a specific task usually demands customized feature engineering algorithms, which makes feature engineering labor-intensive, time-consuming, and inflexible.
Representation learning aims to learn informative representations of objects from raw data automatically. The learned representations can then be fed as input to machine learning systems for prediction or classification. In this way, machine learning algorithms become more flexible and effective when handling large-scale, noisy, unstructured data such as speech, images, videos, time series, and text.
Deep learning [9] is a typical approach to representation learning, and it has recently achieved great success in speech recognition, computer vision, and natural language processing. Deep learning has two distinguishing features: (1) distributed representation, in which each object is encoded as a dense, low-dimensional, real-valued vector rather than a discrete symbol; and (2) deep architecture, in which multiple neural layers are stacked so that increasingly abstract features can be learned layer by layer.
Currently, the improvements brought by deep learning to NLP may not be as significant as those in speech and vision. However, deep learning for NLP has substantially reduced the effort of feature engineering while improving performance. Hence, many researchers are devoting themselves to developing efficient algorithms for representation learning (especially deep learning) for NLP.
In this chapter, we will first discuss why representation learning is important for
NLP and introduce the basic ideas of representation learning. Afterward, we will
briefly review the development history of representation learning for NLP, introduce
typical approaches of contemporary representation learning, and summarize existing
and potential applications of representation learning. Finally, we will introduce the
general organization of this book.
Fig. 1.2 Distributed representation can provide a unified semantic space for multi-grained language entries and for multiple NLP tasks
Fig. 1.3 The timeline for the development of representation learning in NLP. With the growing
computing power and large-scale text data, distributed representation trained with neural networks
and large corpora has become the mainstream
Fig. 1.4 This figure shows how word embeddings and pre-trained language models work in NLP pipelines. They both learn distributed representations for language entries (e.g., words) through pre-training objectives and transfer them to target tasks. Furthermore, pre-trained language models can also transfer model parameters
Although it is hard to tell what each element of a word embedding actually means, the vectors indeed encode semantic information about the words, as verified by the performance of the neural probabilistic language model (NPLM) [1].
Inspired by NPLM, many methods emerged that embed words into distributed representations and use the language modeling objective to optimize them as model parameters. Famous examples include word2vec [12], GloVe [13], and fastText [3]. Though differing in detail, these methods are all very efficient to train, utilize large-scale corpora, and have been widely adopted as word embeddings in many NLP models. Word embeddings in the NLP pipeline map discrete words into informative low-dimensional vectors, and help neural networks to compute and understand natural language. This makes representation learning a critical part of natural language processing.
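To make this concrete, here is a minimal sketch of training skip-gram word embeddings with the gensim library (a tool assumed here for illustration, not one prescribed by this chapter); the toy corpus is invented, and the parameter names follow gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real word embeddings are trained on large-scale corpora.
sentences = [
    ["natural", "language", "processing", "needs", "good", "representations"],
    ["word", "embeddings", "map", "words", "into", "low-dimensional", "vectors"],
    ["distributed", "representations", "encode", "semantic", "information"],
]

# sg=1 selects the skip-gram objective; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vector = model.wv["representations"]                       # a dense 50-d vector
neighbors = model.wv.most_similar("representations", topn=3)
print(vector[:5], neighbors)
```

On such a tiny corpus the nearest neighbors are not meaningful; the point is only the pipeline: optimize embeddings with a language-modeling-style objective, then query the resulting vectors as inputs for other models.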
The research on representation learning in NLP took a big leap when ELMo [14] and BERT [4] came out. Besides using larger corpora, more parameters, and more computing resources than word2vec, they also take the complicated context of the text into consideration. Instead of assigning each word a fixed vector, ELMo and BERT use multilayer neural networks to compute dynamic representations for words based on their context, which is especially useful for words with multiple meanings. Moreover, BERT popularized (though did not originate) the pretraining fine-tuning pipeline. Previously, word embeddings were simply adopted as input representations; after BERT, it has become common practice to keep the same neural network structure, such as BERT, in both pretraining and fine-tuning, that is, taking the parameters of BERT for initialization and fine-tuning the model on downstream tasks (Fig. 1.4).
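As an illustration of such contextual representations, the following is a minimal sketch using the Hugging Face transformers library (an assumed tool, not one prescribed by this chapter): it loads a pre-trained BERT checkpoint and shows that the same word "bank" receives different vectors in different contexts.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT checkpoint; the same parameters could later be
# fine-tuned on a downstream task instead of only producing features.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "He sat on the bank of the river.",
    "She deposited the check at the bank.",
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state has shape (1, sequence_length, 768): every token
        # gets a context-dependent vector rather than a fixed embedding.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_vector = outputs.last_hidden_state[0, tokens.index("bank")]
        print(text, bank_vector[:5])
```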
Though not a big theoretical breakthrough, BERT-like models (also known as Pre-trained Language Models (PLMs), since they are pretrained with language modeling objectives on large corpora) have attracted wide attention in the NLP and machine learning communities, for they have been remarkably successful and have achieved state-of-the-art results on almost every NLP benchmark. These models show what large-scale data and computing power can lead to, and new research works on PLMs emerge rapidly. Probing experiments demonstrate that PLMs implicitly encode a variety of linguistic knowledge and patterns inside their multilayer network parameters [8, 10]. All these significant performances and interesting analyses suggest that there are still many open problems to explore in PLMs, as the future of representation learning for NLP.
Based on the distributional hypothesis, representation learning for NLP has evolved from symbol-based representation to distributed representation. Starting from word2vec, word embeddings trained on large corpora have shown significant power in most NLP tasks. Recently, emerging PLMs (like BERT) take complicated context into account in word representation and have started a new trend of the pretraining fine-tuning pipeline, bringing NLP to a new level. What will be the next big change in representation learning for NLP? We hope the contents of this book can give you some inspiration.
People have developed various effective and efficient approaches to learn semantic
representations for NLP. Here we list some typical approaches.
Statistical Features: As introduced before, semantic representations for NLP in the early stage often came from statistics, instead of emerging from an optimization process. For example, in n-gram or bag-of-words models, elements in the representation are usually frequencies or numbers of occurrences of the corresponding entries counted in large-scale corpora.
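For instance, a bag-of-words representation can be computed directly from counts; the snippet below is a self-contained sketch over an invented two-document corpus:

```python
from collections import Counter

# Invented toy corpus; in practice the counts come from large-scale corpora.
corpus = [
    "representation learning helps nlp",
    "nlp models learn representation from large data",
]

# Vocabulary = all distinct words; each document becomes a vector of counts.
vocab = sorted({word for document in corpus for word in document.split()})
for document in corpus:
    counts = Counter(document.split())
    bag_of_words = [counts[word] for word in vocab]
    print(bag_of_words)
```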
Hand-crafted Features: In certain NLP tasks, syntactic and semantic features are useful for solving the problem, for example, the types of words and entities, semantic roles, and parse trees. These linguistic features may be provided with the tasks or can be extracted by specific NLP systems. For a long period before the wide use of distributed representation, researchers devoted a lot of effort to designing useful features and combining them as the inputs of NLP models.
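As a concrete example of obtaining such features from an off-the-shelf NLP system, the sketch below uses spaCy (an assumed choice, not one made in this chapter) to extract part-of-speech tags and named-entity types; it assumes the en_core_web_sm model has been downloaded.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook is the CEO of Apple Inc.")

# Part-of-speech tags: one hand-crafted feature per token.
pos_features = [(token.text, token.pos_) for token in doc]
# Named-entity types: another commonly used linguistic feature.
entity_features = [(ent.text, ent.label_) for ent in doc.ents]

print(pos_features)
print(entity_features)
```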
Supervised Learning: Distributed representations emerge from the optimization process of neural networks under supervised learning. In the hidden layers of neural networks, the different activation patterns of neurons represent different entities or attributes. With a training objective (usually a loss function for the target task) and supervised signals (usually the gold-standard labels for training instances of the target task), the networks can learn better parameters via optimization (e.g., gradient descent). With proper training, the hidden states become informative and generalized, serving as good semantic representations of natural languages.
For example, to train a neural network for a sentiment classification task, the loss
function is usually set as the cross-entropy of the model predictions with respect to
the gold-standard sentiment labels as supervision. As the objective is optimized, the loss gets smaller, and the model performance gets better. In the meantime, the
hidden states of the model gradually form good sentence representations by encoding
the necessary information for sentiment classification inside the continuous hidden
space.
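The following is a minimal self-contained PyTorch sketch of this setup (the tiny dataset, vocabulary size, and model dimensions are all invented for illustration): a small classifier is trained with cross-entropy against gold sentiment labels, and its mean-pooled hidden state plays the role of the sentence representation.

```python
import torch
import torch.nn as nn

# Hypothetical tiny dataset: token-id sequences (padded with 0) and labels
# (0 = negative, 1 = positive). Real systems use a tokenizer and far more data.
x = torch.tensor([[5, 8, 2, 0], [7, 3, 9, 4], [5, 3, 2, 0], [7, 8, 9, 0]])
y = torch.tensor([1, 0, 1, 0])

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size=50, embed_dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ids):
        # The mean-pooled hidden state serves as the sentence representation.
        hidden = self.embed(ids).mean(dim=1)
        return self.fc(hidden)

model = SentimentClassifier()
criterion = nn.CrossEntropyLoss()          # cross-entropy against gold labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)          # the loss shrinks as training proceeds
    loss.backward()
    optimizer.step()
```

After training, the mean-pooled embeddings inside the model encode the information needed for the sentiment decision, which is exactly the sense in which hidden states become semantic representations.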
In general, there are two kinds of applications of representation learning for NLP. In one case, the semantic representation is trained in a pretraining task (or designed by human experts) and is transferred to the model of the target task. Word embedding is an example of this kind of application: it is trained with a language modeling objective and is taken as input by other downstream NLP models. In this book, we will
References
1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic
language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
2. Leonard Bloomfield. A set of postulates for the science of language. Language, 2(3):153–164,
1926.
3. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors
with subword information. Transactions of the Association for Computational Linguistics,
5:135–146, 2017.
4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2019.
5. Pedro Domingos. A few useful things to know about machine learning. Communications of
the ACM, 55(10):78–87, 2012.
6. John R Firth. A synopsis of linguistic theory, 1930–1955. 1957.
7. Zellig S Harris. Distributional structure. Word, 10(2–3):146–162, 1954.
8. John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT, 2019.
9. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
10. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, 2019.
11. James L. McClelland, David E. Rumelhart, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2. MIT Press, 1986.
12. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.
13. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
14. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-
HLT, pages 2227–2237, 2018.
15. Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.