We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
Abstract
Deep learning has become the most popular approach in machine learning in
recent years. The reason lies in the considerably high accuracies obtained by deep
learning methods in many tasks, especially with textual and visual data. In fact,
natural language processing (NLP) and computer vision are the two research areas
in which deep learning has demonstrated the greatest impact. This chapter will
first summarize the historical evolution of deep neural networks and their
fundamental working principles. After briefly introducing the natural language
processing and computer vision research areas, it will explain how deep
learning is used to solve the problems in these two areas. Several examples
of common tasks in these research areas, together with some discussion, are also
provided.
1. Introduction
Data Mining - Methods, Applications and Systems
representation learning, where the main impact of deep learning approaches has
been observed. Good dense representations are learned for words, senses,
sentences, paragraphs, and documents. These embeddings have proved useful in
capturing both syntactic and semantic features. Recent works are able to compute
contextual embeddings, which provide different representations for the same
word in different contexts. Consequently, state-of-the-art embedding
methods will be presented along with their applications in different NLP tasks,
since the use of these pre-trained embeddings in various downstream NLP tasks
introduced a substantial performance improvement.
The second path concentrates on visual data. It will introduce the use of deep
learning in the computer vision research area. To this end, it will first cover the
principles of convolutional neural networks (CNNs), the fundamental structure
for working with images and videos. On a typical CNN architecture, it will explain
the main components such as convolutional, pooling, and classification layers.
Then, it will go over one of the main tasks of computer vision, namely, image
classification. Using several examples of image classification, it will explain several
concepts related to training CNNs (regularization, dropout, and data augmentation).
Lastly, it will provide a discussion on visualizing and understanding the features
learned by a CNN. Based on this discussion, it will go through the principles of how
and when transfer learning should be applied, with a concrete example of a real-world
four-class classification problem.
Deep neural networks currently provide the best solutions to many problems in
computer vision and natural language processing. Although their successes have
made the news only in recent years, artificial neural networks are not a new research
area. In 1943, McCulloch and Pitts [1] built a neuron model that sums binary inputs
and outputs 1 if the sum exceeds a certain threshold value, and 0 otherwise.
They demonstrated that such a neuron can model the basic OR/AND/NOT
Figure 1.
A neuron that mimics the behavior of the logical AND operator. It multiplies each input ($x_1$ and $x_2$) and the bias
unit (+1) with a weight and thresholds the sum of these to output 1 if the sum is big enough (similar to our
neurons that either fire or not).
Deep Learning: Exemplar Studies in Natural Language Processing and Computer Vision
DOI: http://dx.doi.org/10.5772/intechopen.91813
functions (Figure 1). Such structures are called neurons due to the biological
inspiration: inputs ($x_i$) represent activations from nearby neurons, weights ($w_i$)
represent the synapse strength to nearby neurons, and the activation function ($f_w$) is
the cell body; if the function output is strong enough, it will be sensed by the synapses
of nearby neurons.
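To make the threshold neuron concrete, the following is a minimal Python sketch of a McCulloch-Pitts style unit realizing AND, OR, and NOT. The function names and the particular weights and biases are illustrative assumptions, not values taken from the chapter or from Figure 1; many other choices yield the same truth tables.

```python
def threshold_neuron(inputs, weights, bias):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


# Hypothetical weights and biases chosen by hand for illustration.
def logical_and(x1, x2):
    # Fires only when both inputs are 1 (1 + 1 - 1.5 > 0).
    return threshold_neuron([x1, x2], [1.0, 1.0], bias=-1.5)


def logical_or(x1, x2):
    # Fires when at least one input is 1 (1 - 0.5 > 0).
    return threshold_neuron([x1, x2], [1.0, 1.0], bias=-0.5)


def logical_not(x1):
    # Fires only when the input is 0 (0.5 > 0, but -1 + 0.5 < 0).
    return threshold_neuron([x1], [-1.0], bias=0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

The only difference between the AND and OR units is the bias, which plays the role of the firing threshold.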
In 1957, Rosenblatt introduced perceptrons [2]. The idea was not different from
the neuron of McCulloch and Pitts, but Rosenblatt came up with a way to make
such artificial neurons learn. Given a training set of input-output pairs, weights are
increased or decreased depending on the comparison between the perceptron’s output
and the correct output. Rosenblatt also implemented the idea of the perceptron in
custom hardware and showed it could learn to classify simple shapes correctly with
20 × 20 pixel-like inputs (Figure 2).
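The learning rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the learning rate, the number of epochs, the zero initialization, and the toy OR dataset are all choices made for the example, not details from Rosenblatt's work.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum plus bias is positive."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust weights up or down according to the error on each sample."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            # Weights grow when the output was too low and shrink when too high.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy, linearly separable dataset (logical OR), assumed for illustration.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_data)
print([predict(weights, bias, x) for x, _ in or_data])
```

Because the OR data are linearly separable, the updates stop once every sample is classified correctly; the same loop would never settle on a problem such as XOR, which a single layer cannot represent.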
Marvin Minsky, the founder of the MIT AI Lab, and Seymour Papert together
wrote a book analyzing the limitations of perceptrons [4].
In this book, perceptrons were considered a dead end as an approach to AI. A
single layer of neurons was not enough to solve complicated problems, and
Rosenblatt’s learning algorithm did not work for multiple layers. This conclusion
led to a period of declining funding and publications on AI, which is usually
referred to as the “AI winter.”
Paul Werbos proposed that backpropagation can be used in neural networks [5].
He showed how to train multilayer perceptrons in his PhD thesis (1974), but due to
the AI winter, it took a decade for researchers to take up work in this area. In 1986, this
approach became popular with “Learning representations by back-propagating
errors” by Rumelhart et al. [6]. In 1989, it was applied for the first time to a computer
vision task, handwritten digit classification [7], where it demonstrated excellent
performance. However, after a short while, researchers started to
face problems with the backpropagation algorithm. Deep (multilayer) neural
networks trained with backpropagation did not work very well and, in particular, did not
work as well as networks with fewer layers. It turned out that the magnitudes of
Figure 2.
Mark I Perceptron at the Cornell Aeronautical Laboratory, hardware implementation of the first perceptron
(source: Cornell University Library [3]).
backpropagated errors shrink very rapidly, which prevents earlier layers from learning;
this is today called “the vanishing gradient problem.” Again, it took more than a
decade for computers to handle more complex tasks. Some people refer to
this period as the second AI winter.
Later, it was discovered that the initialization of weights is critically
important for training and that, with a better choice of nonlinear activation function, we can
avoid the vanishing gradient problem. In the meantime, our computers got faster
(especially thanks to GPUs), and huge amounts of data became available for many
tasks. G. Hinton and two of his graduate students demonstrated the effectiveness of
deep networks at a challenging AI task: speech recognition. They managed to
improve on a decade-old performance record on a standard speech recognition
dataset. In 2012, a CNN (again by G. Hinton and students) won against other machine
learning approaches at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
image classification task for the first time.
Technically any neural network with two or more hidden layers is “deep.”
However, in papers of recent years, deep networks correspond to the ones with
many more layers. We show a simple network in Figure 3, where the first layer is
the input layer, the last layer is the output layer, and the ones in between are the
hidden layers.
In Figure 3, $a_j^{(i)}$ denotes the value after the activation function is applied to the
inputs of the jth neuron of the ith layer. If the predicted output of the network, which is
$a_1^{(4)}$ in this example, is close to the actual output, then the “loss” is low. The previously
mentioned backpropagation algorithm uses derivatives to carry the loss to the
previous layers. $\partial L / \partial a_1^{(4)}$ represents the derivative of the loss with respect to
$a_1^{(4)}$, whereas $\partial L / \partial a_1^{(2)}$ represents the derivative of the loss with respect to a
second-layer neuron $a_1^{(2)}$. The derivative of the loss with respect to $a_1^{(2)}$ indicates how
much of the final error (loss) neuron $a_1^{(2)}$ is responsible for.
The activation function is the element that gives a neural network its nonlinear
representation capacity. Therefore, we always choose a nonlinear function. If the
activation function were chosen to be linear, each layer would perform a
linear mapping of the input to the output. Thus, no matter how many layers there
were, since linear functions are closed under composition, this would be equivalent
to having a single (linear) layer.
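The closure of linear functions under composition can be checked numerically. The sketch below, using made-up weight matrices and omitting biases for brevity (both are assumptions for illustration), passes an input through two linear layers and compares the result with a single layer whose weight matrix is the product of the two.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs, no nonlinearity.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))      # layer by layer
single_layer = matvec(matmul(W2, W1), x)    # one equivalent layer

print(two_layers, single_layer)
```

Both computations give the same vector, so the extra layer adds no representational power unless a nonlinear activation is applied between the two multiplications.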
Figure 3.
A simple neural network with two hidden layers. Entities plotted with thicker lines are the ones included in
Eq. (1), which will be used to explain the vanishing gradient problem.
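$$\frac{\partial L}{\partial a_1^{(2)}} = w^{(2)} \, \sigma'\!\left(z^{(3)}\right) \, w^{(3)} \, \sigma'\!\left(z_1^{(4)}\right) \frac{\partial L}{\partial a_1^{(4)}} \qquad (1)$$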
Eq. (1) shows how the error in the final layer is backpropagated to a neuron in
the first hidden layer, where $w^{(i)}$ denotes the weights in layer $i$ and $z_j^{(i)}$ denotes the
weighted input to the jth neuron in layer $i$. Here, let’s assume the sigmoid is used as the
activation function. Then, $a_j^{(i)}$ denotes the value after the activation function is
applied to $z_j^{(i)}$, i.e., $a_j^{(i)} = \sigma(z_j^{(i)})$. Finally, let $\sigma'$ denote the derivative of the sigmoid
function. Entities in Eq. (1) are plotted with thicker lines in Figure 3.
Figure 4 shows the derivative of the sigmoid, where we observe that the derivative
peaks at 0.25, and most of the time it is much less. Thus, provided the weights
are not larger than 1 in magnitude, $|w^{(j)} \sigma'(z^{(j+1)})| \le 0.25$ at each layer in Eq. (1). As a result,
the products decrease exponentially with the number of layers:
$\partial L / \partial a_1^{(2)}$ becomes 16 (or more) times smaller than $\partial L / \partial a_1^{(4)}$.
Thus, gradients become very small (vanish), updates on the weights of earlier layers get smaller,
and these layers begin to “learn” very slowly. A detailed explanation of the vanishing
gradient problem can be found in [8].
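The shrinking factor in Eq. (1) can be reproduced with a short numerical sketch. It assumes weights equal to 1 and weighted inputs equal to 0, which is the most favorable case for the sigmoid derivative; the function names are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value, 0.25, is reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor(weights, weighted_inputs):
    """Product of w * sigma'(z) terms that scales the gradient, as in Eq. (1)."""
    factor = 1.0
    for w, z in zip(weights, weighted_inputs):
        factor *= w * sigmoid_prime(z)
    return factor


# Best case for two layers (w = 1, z = 0): the gradient shrinks by a factor of 16.
best_case = backprop_factor([1.0, 1.0], [0.0, 0.0])
# The same best case across ten layers: the gradient is about a million times smaller.
ten_layers = backprop_factor([1.0] * 10, [0.0] * 10)

print(best_case, ten_layers)
```

With weighted inputs away from zero the derivative is smaller than 0.25, so in practice the gradient decays even faster than in this best case.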
Figure 4.
Derivative of the sigmoid function.
Figure 5.
Plots for some activation functions. Sigmoid is on the left, rectified linear unit is in the middle, and leaky
rectified linear unit is on the right.
Deep learning transformed the field of natural language processing (NLP). This
transformation can be described by better representation learning through newly
proposed neural language models and novel neural network architectures that are
fine-tuned with respect to an NLP task.
Deep learning paved the way for neural language models, and these models
introduced a substantial performance improvement over n-gram language models.
More importantly, neural language models are able to learn good representations in
their hidden layers. These representations are shown to capture both semantic and
syntactic regularities that are useful for various downstream tasks.
Figure 6.
CBOW architecture.
$$p(w_t \mid w_c) = \frac{\exp(w_c \cdot w_t)}{\sum_{j \in V} \exp(w_j \cdot w_t)} \qquad (2)$$
In Skip-gram, the system predicts the most probable context words for a given
input word. In terms of a language model, while CBOW predicts an individual
word’s probability, Skip-gram outputs the probabilities of a set of words, defined by
a given context size. Due to the high dimensionality of the output layer (all vocabulary
words have to be considered), Skip-gram has higher computational complexity than
CBOW (Figure 7). To deal with this issue, rather than traversing the whole vocabulary in
the output layer, Skip-gram with negative sampling (SGNS) [13] formulates the
problem as a binary classification, where one class represents the current context’s
occurrence probability, whereas the other covers the occurrence of all other vocabulary
terms in the present context. In the latter probability calculation, a sampling approach is
incorporated. As vocabulary terms are not distributed uniformly in contexts, sampling is
performed from a distribution that takes the frequency of vocabulary
words in the corpus into consideration. SGNS incorporates this sampling idea
by replacing Skip-gram’s objective function. The new objective function
(Eq. (3)) maximizes $P(D = 1 \mid w, c)$, where $(w, c)$ is a word-context
pair. This is the probability that $(w, c)$ comes from the corpus data.
Additionally, $P(D = 0 \mid u_i, c)$ should be maximized if the pair $(u_i, c)$ is not included in the
corpus data. Such $(u_i, c)$ pairs are, as the name suggests, negatively
sampled $k$ times.
$$\sum_{w, c} \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{i=1}^{k} \log \sigma(-\vec{u}_i \cdot \vec{c}) \qquad (3)$$
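As an illustration of the negative sampling objective, the sketch below evaluates the contribution of a single word-context pair together with its k negative samples. It assumes the usual identity P(D = 0 | u, c) = 1 - σ(u · c) = σ(-u · c); the function names and the toy vectors are hypothetical, and drawing the negative samples from a frequency-based distribution is left out.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_pair_objective(word_vec, context_vec, negative_vecs):
    """Objective for one (w, c) pair and its k negative samples u_1..u_k."""
    # Observed pair: push sigma(w . c), i.e. P(D = 1 | w, c), towards 1.
    positive = log_sigmoid(dot(word_vec, context_vec))
    # Sampled pairs: push sigma(-u_i . c), i.e. P(D = 0 | u_i, c), towards 1.
    negative = sum(log_sigmoid(-dot(u, context_vec)) for u in negative_vecs)
    return positive + negative


# Toy two-dimensional vectors, assumed for illustration.
good = sgns_pair_objective([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = sgns_pair_objective([-1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
print(good, bad)
```

The objective is higher when the observed word is aligned with its context and the sampled words are not, which is the direction in which training moves the embeddings. Only k + 1 dot products are needed per pair instead of one per vocabulary word.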
Both word2vec variants produce word embeddings that can capture multiple
degrees of similarity, including both syntactic and semantic regularities.
A natural extension of the word2vec model was doc2vec [14], where the main goal is
to create a representation for different document levels, e.g., sentence and
Figure 7.
Skip-gram architecture.
paragraph. The architecture is quite similar to word2vec except for the
extension with a document vector. A vector is generated for each document and each word.
The system takes the document vector and the vectors of its words as input. Thus, the
document vectors are adjusted with regard to all the words in the document. At the
end, the system provides both document and word vectors. Two architectures
are proposed, known as the distributed memory model of paragraph vectors
(DM) and the distributed bag-of-words model of paragraph vectors (DBOW).
DM: In this architecture, the inputs are the document and the words in a context
except for the last word, and the output is the last word of the context. The word
vectors and the document vector are concatenated when they are fed into the system.
DBOW: The input of the architecture is a document vector. The model predicts
words randomly sampled from the document.
An important extension to word2vec and its variants is fastText [15], which
uses characters together with words to learn better word representations.
In the fastText language model, the score between a context word and the
middle word is computed based on all character n-grams of the word as well as the
word itself. Here, n-grams are contiguous sequences of n letters, such as a unigram for a
single letter, a bigram for two consecutive letters, a trigram for three letters in
succession, etc. In Eq. (4), $v_c$ represents a context vector, $z_g$ is a vector associated with
each n-gram, and $G_w$ is the set of all character n-grams of the word $w$ together with
the word itself.
$$s(w, c) = \sum_{g \in G_w} z_g^{T} v_c \qquad (4)$$
The idea of using the smallest syntactic units in the representation of words
introduced an improvement in morphologically rich languages and makes it possible to
compute a representation for out-of-vocabulary words.
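The scoring function of Eq. (4) can be sketched as follows. The word boundary markers `<` and `>` and the default n-gram lengths of 3 to 6 follow the convention of the fastText paper rather than anything stated in this chapter, and the lookup table `ngram_vectors` with its toy values is a hypothetical stand-in for learned parameters.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return G_w: the character n-grams of the word plus the word itself."""
    marked = "<" + word + ">"   # boundary markers distinguish prefixes and suffixes
    grams = {marked}            # the whole word is also a member of G_w
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def score(word, context_vec, ngram_vectors):
    """Eq. (4): sum of dot products between each n-gram vector z_g and v_c."""
    total = 0.0
    for g in char_ngrams(word):
        z_g = ngram_vectors.get(g)
        if z_g is not None:     # n-grams without a vector contribute nothing here
            total += sum(a * b for a, b in zip(z_g, context_vec))
    return total


# Toy n-gram vectors, assumed for illustration.
ngram_vectors = {"<wh": [1.0, 0.0], "her": [0.0, 2.0]}
v_c = [1.0, 1.0]
print(sorted(char_ngrams("where", 3, 3)))
print(score("where", v_c, ngram_vectors), score("whereas", v_c, ngram_vectors))
```

Even if "whereas" never appeared in training, it still receives a score because it shares n-grams with "where", which is how out-of-vocabulary words obtain a representation.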
A recent development in representation learning is the introduction of contextual
representations. Early word embeddings have some problems. Although they can
learn syntactic and semantic regularities, they are not as good at capturing a mixture
of them. For example, they can capture the syntactic pattern look-looks-looked.
In a similar way, the words hard, difficult, and tough are embedded into nearby points
in the space. To address both syntactic and semantic features, Kim et al. [16] used
a mixture of character- and word-level features. In their model, at the lowest level
of the hierarchy, character-level features are processed by a CNN; after transferring
these features over a highway network, high-level features are learned by
a long short-term memory (LSTM) network. Thus, the resulting embeddings showed
good syntactic and semantic patterns. For instance, the closest words to the word
richard are returned as eduard, gerard, edward, and carl, all of which are
person names and have syntactic similarity to the query word. Due to
character-aware processing, their models are able to produce good representations for
out-of-vocabulary words.
The idea of capturing syntactic features at a low level of hierarchy and the
semantic ones at higher levels was realized ultimately by the Embeddings from
Language Models (ELMo) [17]. ELMo proposes a deep bidirectional language model
to learn complex features. Once these features are learned, the pre-trained model is
used as an external knowledge source to the fine-tuned model that is trained using
task-specific data. Thus, in addition to static embeddings from the pre-trained
model, contextual embeddings can be taken from the fine-tuned one.
Another drawback of previous word embeddings is that they unite all the senses of a
word into one representation. Thus, different contextual meanings cannot be
addressed. The more recent ELMo and Bidirectional Encoder Representations from
Transformers (BERT) [18] models resolve this issue by providing different
representations for every occurrence of a word. BERT uses a bidirectional Transformer
language model integrated with a masked language model to provide a fine-tuned
language model that is able to provide different representations with respect to
different contexts.
In NLP, different neural network solutions have been used in various
downstream tasks.
Language data are temporal in nature, so recurrent neural networks (RNNs)
seem a good fit for the task in general. RNNs have been used to learn long-range
dependencies. However, because each computation depends on the previous time steps,
they have efficiency problems. Furthermore, as sequences
get longer, information is lost due to the vanishing gradient
problem.
Long short-term memory architectures were proposed to tackle the problem of
information loss in the case of long sequences. Gated recurrent units (GRUs) are
an alternative to LSTMs. They use a gate mechanism to learn how much of the
past information to preserve at the next time step and how much to erase.
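A single GRU time step can be sketched with scalar inputs and states to show how the gates decide what to keep and what to erase. The parameter names in the dictionary `p` are hypothetical, and the update convention used here, h = (1 - z) * h_prev + z * candidate, is one of two equivalent conventions found in the literature (the other swaps the roles of z and 1 - z).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    # Candidate state: the reset gate controls how much of the past is used.
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Update gate near 0 preserves the old state; near 1 overwrites it.
    return (1.0 - z) * h_prev + z * candidate


base = {"wz": 0.0, "uz": 0.0, "bz": 0.0,
        "wr": 0.0, "ur": 0.0, "br": 0.0,
        "wh": 1.0, "uh": 0.0, "bh": 0.0}

keep = dict(base, bz=-50.0)      # update gate saturated at 0: state is preserved
overwrite = dict(base, bz=50.0)  # update gate saturated at 1: state is replaced

print(gru_step(1.0, 0.7, keep), gru_step(1.0, 0.7, overwrite))
```

In a trained network the gate values are learned functions of the input and the previous state, so the unit decides at every time step how much past information to carry forward.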
Convolutional neural networks have been used to capture short-range
dependencies, such as learning a word representation over characters and a sentence
representation over its n-grams. Compared to RNNs, they are quite efficient due to
the independent processing of features. Moreover, through the use of different
convolution filter sizes (overlapping localities) followed by concatenation, their learning
regions can be extended.
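The idea of combining several filter sizes can be sketched as follows: each filter slides over windows of consecutive token vectors, and the features produced by filters of different widths are concatenated. Taking the maximum response of each filter over all positions (max-over-time pooling) is an assumption added for this example, as are the toy token vectors and filters.

```python
def conv1d(sequence, kernel):
    """Slide a kernel of width len(kernel) over a sequence of token vectors."""
    width = len(kernel)
    responses = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        responses.append(
            sum(w * x
                for w_vec, x_vec in zip(kernel, window)
                for w, x in zip(w_vec, x_vec))
        )
    return responses


def sentence_features(sequence, kernels):
    """Concatenate the strongest response of each kernel (one value per kernel)."""
    return [max(conv1d(sequence, kernel)) for kernel in kernels]


# Four tokens with two-dimensional embeddings, assumed for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]                 # looks at 2 tokens
trigram_filter = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]    # looks at 3 tokens

print(sentence_features(tokens, [bigram_filter, trigram_filter]))
```

Each window is processed independently of the others, which is why such filters can be evaluated in parallel, unlike the step-by-step computation of an RNN.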
Machine translation is a core NLP task that has witnessed innovative neural
network solutions that later gained wide application. Neural machine
translation aims to translate sequences from a source language into a target language
using neural network architectures. Theoretically, it is a conditional language model
where the next word depends on both the previous words in the target
sequence and the source sentence. In traditional language
modeling, the next word’s probability is computed based solely on the previous set
of words. Thus, in conditional language modeling, conditional means conditioned
on the source sequence’s representation. In machine translation, the part of the model
that processes the source sequence is termed the encoder, whereas the part that performs
next word prediction in the target language is called the decoder. In probabilistic terms, machine
translation aims to maximize the probability of the target sequence $y$ given the
source sequence $x$ as follows.
Figure 8.
Sequence-to-sequence attention.
from the current hidden state of the decoder (each QUERY in Figure 8), $\alpha_i$
(WEIGHTS in Figure 8) are the attention weights, and $K$ (OUTPUT in Figure 8) is the
attention output that is combined with the last hidden state of the decoder to make
the next word prediction in translation.
$$\alpha_i = \frac{\exp(h_i \cdot w_i)}{\sum_j \exp(h_j \cdot w_j)}, \qquad o_i = \alpha_i h_i, \qquad K = \sum_i o_i \qquad (8)$$
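The three steps of Eq. (8), scoring, normalizing, and summing, can be sketched as below. The sketch assumes that the encoder hidden states are scored against a single query vector derived from the decoder's current hidden state, which is one reading of the equation; the function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_states, query):
    """Return the attention weights alpha_i and the attention output K."""
    # Score each encoder hidden state h_i against the decoder query.
    scores = [sum(h * q for h, q in zip(state, query)) for state in encoder_states]
    weights = softmax(scores)
    # K is the sum of the weighted hidden states o_i = alpha_i * h_i.
    dim = len(encoder_states[0])
    output = [
        sum(a * state[d] for a, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, output


# Two toy encoder states, assumed for illustration.
states = [[1.0, 0.0], [0.0, 1.0]]
print(attend(states, [0.0, 0.0]))   # uniform weights
print(attend(states, [10.0, 0.0]))  # almost all weight on the first state
```

A query that matches one encoder state much better than the others concentrates nearly all of the weight on it, so the output is close to that state.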
different images of the same class have quite different instances and varying
viewpoints, illumination, deformation, occlusion, etc.
All competitors in ILSVRC train their models on the ImageNet [22] dataset. The
ImageNet 2012 dataset contains 1.2 million images and 1000 classes. The classification
performances of the proposed methods were compared according to two different
evaluation criteria, the top-1 and top-5 scores. In the top-5 criterion, the top five
guesses of the algorithm are considered for each image. If the actual image category is one of
these five labels, then the image is counted as correctly classified. The proportion of
images classified incorrectly in this sense is called the top-5 error.
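The top-k criterion can be computed directly from the class scores. In the sketch below the error is reported as a fraction of the evaluated images; the score rows and labels are made up for illustration and use three classes instead of 1000.

```python
def top_k_error(score_rows, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    errors = 0
    for scores, label in zip(score_rows, labels):
        # Class indices ordered from the highest score to the lowest.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Toy scores for three images and three classes, assumed for illustration.
score_rows = [
    [0.10, 0.50, 0.40],
    [0.70, 0.20, 0.10],
    [0.35, 0.25, 0.40],
]
labels = [2, 1, 0]

print(top_k_error(score_rows, labels, k=1))
print(top_k_error(score_rows, labels, k=2))
```

In this toy case every first guess is wrong, yet each true label is the second guess, so the error drops from 1.0 to 0.0 when two guesses are allowed. The same effect explains why top-5 errors are much lower than top-1 errors.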
An outstanding performance was achieved by a CNN (convolutional neural
network) in 2012. AlexNet [23] took first place in the classification task, achieving a
16.4% error rate. There was a huge difference between the first (16.4%) and second
place (26.1%). In ILSVRC 2014, GoogLeNet [24] took first place, achieving a 6.67%
error rate. The positive effect of network depth was observed. One year later, ResNet
took first place, achieving a 3.6% error rate [25] with a CNN of 152 layers. In the
following years, even lower error rates were achieved with several modifications.
Please note that the human performance on the image classification task was
reported to be 5.1% error [22].
CNNs are the fundamental structures for working with images and videos.
A typical CNN is composed of several layers interleaved with each other.
Figure 9.
Convolution process. (a) First convolution operation applied with filter W1. Computation gives us the
top-left member of an activation map in the next layer. (b) Second convolution operation, again applied
with filter W1.
Figure 10.
Formation of a convolution layer by applying n number of learnable filters on the previous layer. Each
activation map is formed by convolving a different filter on the whole input. In this example input to the
convolution is the RGB image itself (depth = 3). For every further layer, input is its previous layer. After
convolution, width and height of the next layer may or may not decrease.
Figure 11.
Max pooling.
Figure 12.
A typical CNN for image classification task.
Figure 13.
An example of softmax classification loss calculation. Computed loss, Li, is only for the ith sample in the dataset.
What we have in the last fully connected layer of a classification network is the
output scores for each class. Selecting the class with the highest
score is enough to make a decision; however, we need to define a loss to be able to train the
network. The loss is defined according to the scores obtained for the classes. A common
practice is to use the softmax function, which first converts the class scores into
normalized probabilities (Eq. (10)):

$$p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \qquad (10)$$

where the sum over $k$ runs over all classes, $o_j$ are the output neurons (scores), and $p_j$ are
the normalized probabilities. The softmax loss is equal to the negative log of the normalized
probability of the correct class. An example calculation of softmax loss with three
classes is given in Figure 13.
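A small sketch of the softmax probabilities of Eq. (10) and the resulting loss for one sample follows. It uses the negative log-probability of the correct class as the loss, which is the usual convention (the loss is then zero for a perfect prediction and grows as the correct class becomes less probable); the example scores are made up and are not those of Figure 13.

```python
import math


def softmax(scores):
    """Eq. (10): convert class scores o_j into normalized probabilities p_j."""
    top = max(scores)                       # shifting by the max avoids overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, correct_class):
    """Loss L_i for one sample: negative log-probability of the correct class."""
    return -math.log(softmax(scores)[correct_class])


# Three-class example with made-up scores.
confident_and_right = softmax_loss([5.0, 1.0, 1.0], correct_class=0)
confident_and_wrong = softmax_loss([1.0, 5.0, 1.0], correct_class=0)
print(softmax([5.0, 1.0, 1.0]))
print(confident_and_right, confident_and_wrong)
```

The loss is small when the correct class receives the highest score and large when a wrong class dominates, which gives the training procedure a signal that a plain "pick the highest score" decision rule does not provide.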
The ability of a model to make correct predictions for new samples after being trained
on the training set is defined as generalization. Thus, we would like to train a CNN
with a high generalization capacity. Its high accuracy should not be limited to the training
samples. In general, we should increase the size and variety of the training data, and
we should avoid training an excessively complex model that fits the training data too
closely (a problem called overfitting).
Since it is not always easy to obtain more training data and to pick the best
complexity for our model, let’s discuss a few popular techniques to increase the
generalization capacity.
Regularization loss is a term, $R(W)$, added to the data loss with a coefficient ($\lambda$) called
the regularization strength (Eq. (11)). It can be the sum of the L1 or L2
norm of the weights. The interpretation of $R(W)$ is that we want smaller weights in order to
achieve smoother models for better generalization. It means that no input
dimension can have a very large influence on the scores all by itself.
$$L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W) \qquad (11)$$
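The total loss of Eq. (11) can be sketched with an L2 penalty as R(W). The per-sample losses, the weights, and the regularization strength below are made-up values, and the function names are illustrative.

```python
def l2_penalty(weights):
    """R(W) as the sum of squared weights (L2 regularization)."""
    return sum(w * w for row in weights for w in row)


def total_loss(sample_losses, weights, reg_strength):
    """Eq. (11): mean data loss plus lambda times the regularization term."""
    data_loss = sum(sample_losses) / len(sample_losses)
    return data_loss + reg_strength * l2_penalty(weights)


print(total_loss([1.0, 3.0], [[1.0, -2.0], [0.0, 0.5]], reg_strength=0.1))

# Two weight vectors that give the same score for the input x = [1, 1, 1, 1]:
peaked = [[1.0, 0.0, 0.0, 0.0]]       # relies on a single input dimension
spread = [[0.25, 0.25, 0.25, 0.25]]   # uses all input dimensions a little
print(l2_penalty(peaked), l2_penalty(spread))
```

Although both weight vectors produce the same score on that input, the L2 penalty is four times lower for the spread-out one, so regularization favors models in which no single input dimension dominates.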
4.2.2 Dropout
The more training samples a model has, the more successful it will tend to be.
However, it is rarely possible to obtain large datasets, either because it is hard to
collect more samples or because it is expensive to annotate a large number of samples.
Therefore, to increase the size of existing raw data, producing synthetic data is sometimes
preferred. For visual data, the data size can be increased by applying random
rotations, translations, crops, and flips or by altering brightness
and contrast [27].
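A few of these augmentations can be sketched on a grayscale image stored as a list of pixel rows. The flip probability, the crop size, and the brightness range are arbitrary choices for this example, and rotations and contrast changes are omitted to keep it short.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def adjust_brightness(image, delta):
    """Shift every pixel by delta, clipping to the valid range 0..255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def augment(image, rng):
    """Produce one synthetic variant of the image."""
    out = horizontal_flip(image) if rng.random() < 0.5 else image
    out = random_crop(out, len(image) - 1, len(image[0]) - 1, rng)
    return adjust_brightness(out, rng.randint(-20, 20))


rng = random.Random(0)   # fixed seed so the example is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]
variants = [augment(image, rng) for _ in range(3)]
print(len(variants), len(variants[0]), len(variants[0][0]))
```

Each call yields a slightly different image with the same label, so a single annotated sample can contribute many training examples.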
Shortly after people realized that CNNs are very powerful nonlinear models for
computer vision problems, they started to seek insight into why these models
Figure 14.
Applying dropout in a neural net.
Figure 15.
Image patches corresponding to the highest activations in a random subset of feature maps. First layer’s high
activations occur at patches of distinct low-level features such as edges (a) and lines (b); further layers’ neurons
learn to fire at more complex structures such as geometric shapes (c) or patterns on an animal (d). Since
activations in the first layer correspond to small areas on images, resolution of patches in (a) and (b) is low.
Table 1.
Strategies of transfer learning according to the size of the new dataset and its similarity to the one used in the
pre-trained model.
Figure 16.
Example images for each class used in the experiment of transfer learning for animal classification.
Figure 17.
Training and validation set accuracies obtained (a) with transfer learning and (b) without transfer learning.
5. Conclusions
Deep learning has become the dominant machine learning approach due to the
availability of vast amounts of data and improved computational resources. The
main transformation was observed in text and image analysis.
In NLP, the change can be described along two major lines. The first line is learning
better representations through ever-improving neural language models. Currently,
the self-attention-based Transformer language model is the state of the art, and the learned
representations are capable of capturing a mix of syntactic and semantic features and
are context-dependent. The second line is related to neural network solutions in
different NLP tasks. Although LSTMs proved useful in capturing long-term
dependencies in temporal data, the recent trend has been to transfer the
pre-trained language models’ knowledge into fine-tuned task-specific models. The
self-attention mechanism has become the dominant scheme in
pre-trained language models. This transfer learning solution outperformed existing
approaches significantly.
In the field of computer vision, CNNs are the best performing solutions. There
are very deep CNN architectures that are fine-tuned, thanks to huge amounts of
training data. The use of pre-trained models in different vision tasks is a common
methodology as well.
One common disadvantage of deep learning solutions is the lack of insight that results
from learning implicitly. Thus, the attention mechanism together with visualization seems
promising in both NLP and vision tasks. Both fields are in search of more
explainable solutions.
One final remark is on the rise of multimodal solutions. Until now, question
answering has been an intersection point. Future work is expected to be devoted
to multimodal solutions.
Author details
© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[3] Images from the Rare Book and Manuscript Collections. Cornell University Library. Available from: https://digital.library.cornell.edu/catalog/ss:550351
[12] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: Workshop Proceedings of ICLR. 2013