0% found this document useful (0 votes)

31 views

Learning Deep Learning

Uploaded by

Thato Sekonyane

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

31 views

Learning Deep Learning

Uploaded by

Thato Sekonyane

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Revista Brasileira de Ensino de Física, vol.

44, e20220101 (2022) Didactic Resources

www.scielo.br/rbef cb
DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101 Licença Creative Commons

Learning Deep Learning

Henrique F. de Arruda*1,2 , Alexandre Benatti1 , César Henrique Comin3 ,
Luciano da F. Costa1
1
Universidade de São Paulo, Instituto de Física de São Carlos, São Carlos, SP, Brasil.
2
ISI Foundation, Via Chisola 5, 10126, Turin, Italy.
3
Universidade Federal de São Carlos, Departamento de Ciência da Computação, São Carlos, SP, Brasil.

Received on April 01, 2022. Revised on July 21, 2022. Accepted on July 22, 2022.
As a consequence of its capability of creating high level abstractions from data, deep learning has been effectively
employed in a wide range of applications, including physics. Though deep learning can be, at first and simplistically
understood in terms of very large neural networks, it also encompasses new concepts and methods. In order to
understand and apply deep learning, it is important to become familiarized with the respective basic concepts. In
this text, after briefly revising some works relating to physics and deep learning, we introduce and discuss some
of the main principles of deep learning as well as some of its principal models. More specifically, we describe the
main elements, their use, as well as several of the possible network architectures. A companion tutorial in Python
has been prepared in order to complement our approach.
Keywords: Deep learning, Tutorial, Classification.

1. Introduction This is the case with several categories that are par-
ticularly important to humans, such as faces, plant and
In order for humans to interact with their environment, animals, as well as actions, among others. In these cases,
which includes other humans, it is necessary to develop subcategories are created, leading to increasing levels
models of the entities in the world (e.g. [1]). These of information and detail. However, because of limited
models allow not only the recognition of important memory and processing, the problem becomes increas-
objects/actions, but also provide subsidies for making ingly complex (e.g. [4]), and we need to stop this sub-
predictions that can have great impact on our lives. categorization at a point that is viable given our needs.
As a consequence of our restricted cognitive abilities, As a consequence of the fundamental importance of
the developed models of world entities need to have pattern recognition for humans, and also of our limita-
some level of abstraction, so as to allow a more effective tions, interest was progressively invested in developing
handling and association of concepts, and also as a automated means for performing this ability, leading to
means to obtain some degree of generalization in the areas such as automated pattern recognition, machine
representations [2]. learning, and computer vision (e.g. [5]).
Interestingly, the ability of abstraction is required from Artificial approaches to pattern recognition typically
humans, as a consequence of the need to prevent a level involve two main steps: (a) feature extraction; and
of detail that would otherwise surpasses our memory (b) classification based on these features (e.g. [1, 6]).
and/or processing capacity [3]. So, when we derive a Figure 1 illustrates these two stages. While the former
category of a real object such as a pear, we leave out was initially human-assisted, efforts were focused on
a large amount of detailed information (e.g. color varia- the classification itself. From the very beginning, neural
tions, small shape variations, etc.) so as to accommodate networks were understood to provide a motivation and
the almost unlimited instances of this fruit that can be reference, as a consequence of the impressive power of
found. Provided that we chose an effective set of features biological nervous systems (e.g. [7]).
to describe the pear, we will be able to recognize almost Interestingly, the subject of artificial neural networks
any instance of this fruit as being a pear, while generally (e.g. [8]) received successive waves of attention from the
not being able to distinguish between the pears in a tree. scientific-technologic community (for instance, respec-
Ideally, it would be interesting that the recognition tive to the Perceptron (e.g. [9]), and Hopfield networks
operated at varying levels of details so that, after (e.g. [10, 11]). These waves, frequently observed in
recognizing the general type of object, we could process scientific development, are characterized by a surge
to subsequent levels of increased detail and information, of development induced by one or more important
therefore achieving a more powerful performance. advances, until a kind of saturation is reached as a
consequence of the necessity of new major conceptual
* Correspondence email address: [email protected] and/or technological progresses. After saturation arises,

Copyright by Sociedade Brasileira de Física. Printed in Brazil.

e20220101-2 Learning Deep Learning

Supervised
including but not limited to mathematics, computer
Data science, biology, and physics. In the case of deep learning,
size
important concepts of physics [14, 15] have often been
weight
→ employed. Here, we brief and non-exhaustively review
width f Classifier Category
some of the concepts of physics that have found their

...
...

way to deep learning, and vice versa [16].

sweetness Among the various artificial neural networks devel-
opments, we have the concept of Boltzmann Machine
Stage 1: Feature extraction Stage 2: Classification
(BM) [17], which is also known as stochastic Hopfield
due to shared characteristics. The BM is based on the
Figure 1: Scheme of classification that starts from the feature concepts of statistical physics, and its name refers to the
extraction step, in which information from the object (e.g. a original study Boltzmann in the areas of thermodynam-
pear) is measured, to the classification where the respective ics and statistics [8]. For more details, see Section 4.5.4.
category is assigned. In this example, we show a supervised Variations of this type of neural network have been
classifier. So, a database (set of training samples) is used. important to developments and applications on deep
learning (e.g., [18, 19]), mainly regarding computer
new subsequent waves of development are induced, inte- vision [18, 20, 21]. For instance, the Deep Boltzmann
grating and complementing the previous developments Machine has been applied to handwritten digit recogni-
through new approaches and perspectives. tion (e.g. [18]).
The great current interest in deep learning stems Another interesting approach is the Deep Lagrangian
from the impressive performance that has been often Networks (DeLaN) [22], which considers Lagrangian
obtained. These achievements are a consequence mechanics in order to learn the dynamics of a system.
of several interrelated factors. First, we have that As an example of application, [22] used Deep Lagrangian
computing hardware has developed to unprecedented Networks for robot tracking control. Another work
levels of performance, with emphasis on GPUs (Graphics considering Lagrangian dynamics was developed aimed
Processing Units), paving the way to implementing and at applications to fiber-optic communication [23]. More
performing large neural algorithms [12]. Thus, many specifically, a deep learning technique was used to solve a
layers are now possible, each incorporating impressive problem related to communication, in which the signal
numbers of neurons. propagation is described by the nonlinear Schrödinger
Another important basis of deep learning has been equation.
the ever increasing availability of good quality large In addition to deep learning techniques based on
databases [13]. Added to these, we also have conceptual physics, deep learning has also been applied to the
and methodological advances such as the possibility to analysis of physics-related phenomena [24, 25]. For
automatic define and extract features, as well as the instance, the authors of [26] proposed a deep learning
incorporation of novel activation functions [13]. model capable of learning nonlinear partial differential
Aiming at creating a guideline for learning deep equations of physical dynamics. General machine learn-
learning, the present text provides a brief and hopefully ing techniques, with a focus on deep learning, have been
accessible conceptual introduction to some of the main often employed to the analysis of high-energy physics
respective aspects and developments. We present and experiments. An interesting review [27] has indicated the
discuss the main concepts and illustrate the respective use of deep learning in the Large Hadron Collider (LHC)
applications. We also developed a tutorial that com- collision experiments, such as the use of neural networks
prises examples of all presented deep learning variations, to assist simulations of related dynamics [28, 29]. Other
which can be accessed from the following link: https: examples of application of deep learning in physics
//github.com/hfarruda/deeplearningtutorial. include: analysis of satellite data to forecast oceanic
This paper is organized as follows. Section 2 describes phenomena [30], understanding the physics of extensive
the use of physics to develop deep learning as well as its air showers [31], among other possibilities [24, 32].
applications to solve physics-related problems. Section 3
describes traditional neural networks. In Section 4, we 3. Traditional Neural Networks
cover the main characteristics of deep learning networks,
which include novel activation functions, computer The idea of creating a classifier based on neurons starts
architectures, as well as some of the main deep learning from a single unity (see Fig. 2a) whose dendrites can
architectures. receive the input information that is weighted by the
weight matrix W = [wi,k ], the cell body sums up
2. Deep Learning and Physics the data, and the axons give the classification activa-
tion, which is controlled by a given function (called
The development of machine learning and artificial activation function). More information regarding the
intelligence has relied on concepts from diverse areas, activation functions are provided in Section 4.3. In order

Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022 DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101
Arruda et al. e20220101-3

• Hidden layer: receives information from a previ-

ous and, after a sum and activation operation,
transmits the data to the next layer. In the same
network, it is possible to have as much hidden
layers as necessary;
• Output layer: gives the classifier answer.
In MLP, all the neurons from a previous layer can
be connected to the next, which is called dense. The
training step consists of the method of back propaga-
(a) tion [8] being similar to the approach employed to a
x1 single neuron. More specifically, the training data is
Neuron k
x2 fed into the network, and the weights are optimized
x3 w 1,k Activation to decrease the error function from the output to the
w 2,k function
w 3,k input layer. Interestingly, it was found that one hidden
layer, for a sufficient large number of neurons, is capable
zk

xi
w i,k Ʃ yk of learning any possible shape [33], which is called
universal aproximation theorem. So, theoretically, this
w N,k number of layers is enough to any problem. However,
usually at least two hidden layers are used because it
was found to decrease the learning time and improve
xN
the accuracy [34].
(b)
4. The Deep Learning Framework
Figure 2: (a) A highly simplified biological neuron. The main
parts of a neuron include: dendrites and synapses; the cellular One of the main points of deep-learning is the capability
body; the implantation cone (represented as the dashed region),
of the network to extract features directly from the
in which the integration occurs; and the axons that transmit the
signal to the next neuron(s). (b) A possible model of a neuron
data. While feature extraction and classification are
k, k = 1, 2, . . . , K: The input data xi , i = 1, 2, . . . , N , come performed in standard machine learning methods, in
from the input layer at the left-hand side, and each of these deep-learning the network can learn the features from
values is multiplied by the respective weight wi,k . These values the raw data. Figure 3 illustrates the similarities and
are then summed up, yielding zk , which is fed into the activation differences between a typical neural network and a
function, producing the output yk . convolutional deep learning network.

4.1. Optimization
to assign the correct class to a given input data, it
Optimization is one of the key points of deep learning.
is necessary to associate the more appropriate weights,
This step consists of minimizing the loss function during
by using a given training method. One possibility is to
neural network training. The loss function, which mea-
optimize W according to an error function, where the
sures the quality of the neural network in modeling the
error is updated as follows
data, is used in order to optimize the network weights
wi,k (n) = wi,k (n − 1) + α xi ǫ, (1) (W ). There are several functions that can be used as
loss functions, some example include: Mean Square Error
where wi,k ∈ W , wi,k (n − 1) are the current weights, (MSE), Mean Absolute Error (MAE), Mean Bias Error
wi,k (n) are the updated weights, α is the learning rate, (MBE), SVM Loss, and Cross entropy loss. The chosen
xi is the input data, and ǫ is the error. Interestingly, this function depends on the deep learning network type and
simple methodology can classify data from two classes the performed task.
that are linearly separated. Figure 2 presents a highly The method used for optimization is called Optimizer.
simplified biological neuron (a) as well as a possible This optimization allows the classifier to learn its weights
respective model (b). W with respect to the training data. Because it is not
In order to represent more general regions, sets of possible to know the location of the global minimum
neurons have been considered, which are organized as of this function, several methods have been considered
a network [8]. The more straightforward manner is the including gradient descent, stochastic gradient descent,
use of the Multilayer Perceptron (MLP) [8]. In this case, and Adam [35]. The latter, one of the most often
the neurons are organized into three types of layers: adopted methods, consists of an efficient method for
stochastic optimization. As in the stochastic gradient
• Input layer: the first layer of the network (data descent, Adam also employs random sub-samples, called
input); minibatches. For each of the optimized parameters,

DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101 Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022
e20220101-4 Learning Deep Learning

Data

Input layer Hidden layer Output layer

(a)

k'N
kN

Data ...

...

...
...

...
k'3
k'2
k1 k'1
k0 k'0

Convolution Pooling Convolution Pooling Flattening

Input layer Feature extraction Classifier

(b)

Figure 3: A simple multilayer perceptron (a), and a convolutional deep learning network (b). A convolutional network involves layers
dedicated to convolution, pooling, and flattening. Each matrix of the convolution layer is associated with a given kernel ki . Often
a feedforward network is adopted for the classifier component.

one individual adaptive learning rate is used, and the CPU GPGPU
parameters are estimated from the first and second Memory
moments of the gradients. This method is indicated
for problems involving a significant number of parame- Memory Control
Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core

ters [35]. Another advantage of Adam is that it requires Core Core Core Core Core Core Core Core

relatively few memory. Control

Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core
Control
4.2. GPUs Control Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core

The Graphics Processing Unit (GPU) was created to Core Core Control
Core Core Core Core Core Core Core Core

deal with graphical applications, such as games. By

considering the high processing power of these GPUs,
Figure 4: A simplified comparison between CPU and GPGPU.
the manufacturers created a novel type of boards, called The CPU cores are more powerful and can be employed to
General Purpose Graphics Processing Unit (GPGPU) execute complex tasks in parallel, while the GPGPU have lots
that can be applied to a wider range of applications. One of cores that are more specific to performing massive processing
of the advantages of the GPGPUs, when comparing with of data.
GPUs, is that programs of GPGPUs can be implemented
in a simpler way. Consequently, many libraries have been
developed, which include more efficient methods of linear 4.3. Activation functions
algebra, computer graphics, image processing, and deep
learning. In biological neurons, the cell body sums up the input
By comparing Central Processing Units (CPUs) and stimulus, and the output is controlled by a respective
GPGPUs, typically the GPGPUs have a considerably activation function. In Figure 5 we show some of the
higher number of cores (see Figure 4). However, only main types of activation functions. In the case of the step
some specific applications can be executed in a GPGPU, function [36], if the integrated stimulus intensity is lower
which are characterized by data parallelism tasks. For than zero, the neuron is considered unactivated, yielding
instance, GPGPUs are efficient when a given operation zero as result. Otherwise, the neuron returns one and is
is computed over many elements of an array. Some considered activated (see Fig. 5a). The step function is
disadvantages of using GPGPUs include the high cost typically employed for classification of linearly separable
of the data transfer between the CPU RAM (Random- data (e.g. [8]).
Access Memory) and the board memory, and the limited Other activation functions can also be employed. For
memory size, among others. GPGPUs typically cannot example, we have the sigmoid function [13] (shown in
replace CPUs, but many novel approaches have become Figure 5(b)), which can be interpreted as a probability,
achievable with this technology, including deep learning. as it returns values between zero and one. This function

Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022 DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101
Arruda et al. e20220101-5

Figure 5: Examples of activation functions, used in specific neural networks-based solutions.

is defined as The softmax function [13] can be used in the last layer
1 to deal with classification problems in which there are
f (z) = , (2) many distinct classes. This function is defined for each
1 + e−z
neuron k, k = 1, 2, . . . , K (see Fig. 2) as follows
where z is the value of the cell body (at the implan-
tation cone) sum. Another alternative is the hyperbolic ezk
f (zk ) = PK , (6)
tangent [37], which is defined as ezi
i=1
ez − e−z where zi is the ith input to the respective activation
f (z) = . (3)
ez + e−z function and K is the number of inputs to that function.
In this case, the function returns positive or negative Because the sum of the exponential values normalizes
values when the input is positive or negative, respec- this function, it can be understood as a probability
tively, as shown in Figure 5(c). Due to this characteristic, distribution.
the hyperbolic tangent is typically employed in tasks
involving many negative inputs. 4.4. Deep learning main aspects
Another possibility is the identity function [13], also
This subsection briefly describes the characteristic
called linear function. In this case, the input and
aspects of deep learning.
output are exactly the same, as can be observed in
Figure 5(d). This function is typically employed for
regression tasks. In the case of convolutional neural 4.4.1. Bias
networks (see Section 4.5), the most common activation The concept of bias consists of incorporating a fixed
function is the Rectified Linear Unit (ReLU) [13], which value, b, as input to the neural layer [40]. This value
is a function defined as allows the activation function to adapt in order to
f (z) = max(0, z). (4) better fit the data. Biasing can be mathematically
represented as
This functions is shown in Figure 5(e). In the case
of image data, the value of the pixels are given by yk = f (X T · Wk + b), (7)
positive numbers. So the input layer does not have where X = [xi ] is the input column vector, Wk = [wi,k ]
negative values. This function is understood as being is the column vector k derived from the weight matrix
ease to optimize and to preserve properties having good W , f (·) is a given activation function, and yk is the
generalization potential. output of the neuron k.
Alternatively, the Leaky Rectified Linear Units (Leaky
ReLU) [38, 39] function can be employed instead of
4.4.2. One hot encoding
ReLu. The difference is that the Leaky ReLU returns
output values different from zero when the inputs are One possibility to deal with categorical features (e.g.,
negative. In some situations, the Leaky ReLU was found car brands, nationalities, and fruits) is to map the
to reduce the training time. This function is defined as categories into integer numbers [41]. However, the order
of the numbers is arbitrary and can be interpreted by
f (z) = max(αz, z), (5)
the classifier as a ranking. Another solution consists
where α is the parameter that controls the negative part of assigning a separated variable to each category. An
of the function. An example of this function is illustrated example regarding fruits can be found in Figure 6. This
in the Figure 5(f). approach is called one hot encoding.

DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101 Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022
e20220101-6 Learning Deep Learning

Categorical
Encodings y
features

Fruit

Blueberry 1 0 0
Pear 0 1 0
x
Apple 0 0 1 (a)
Apple 0 0 1
Pear 0 1 0 y y

Figure 6: Example of One Hot Encoding, where the fruits are

converted into a sparse matrix.

x x
(b) (c)
0 3 6 1 2 0
9 7 5 6 5 2 Figure 8: Example of overfitting while classifying samples from
9 6 5 two classes, represented by blue circles and yellow squares, in
4 8 8 5 9 0 the presence of noisee. The dashed lines indicate the proper
8 8 9
1 3 5 4 2 3 separation between regions, and the black lines indicate the
7 8 8 separation found by a classifier. This classification problem can
7 6 7 8 8 4
be described by the regions as in (a), but different samplings
1 3 1 2 0 7 from this problem can lead to rather different classification
curves, as illustrated in (b) and (c), since the curve adheres
Figure 7: Example of max pooling, in which the highest number too much to each of the noisy sampled data sets.
of each window is selected and assigned to the new, reduced
matrix.
4.4.5. Overfitting
Overfitting (e.g. [43]) happens when the model fits, in
4.4.3. Pooling presence of noise or original category error, many details
This process is used in convolutional neural networks of the training data at the expense of undermining
(CNNs), typically after the convolution, for reducing the its generality for classifying different data. Figure 8
dimensionality of a given matrix, by first partitioning illustrates an example of this behavior. Some of the
each matrix in an intermediate layer and then mapping possible approaches to address overfitting are discussed
each partition into a single value [42]. There are many in the following.
possibilities of pooling. For example, the max pooling
selects the maximum value from each window; the min 4.4.6. Dropout
pooling considers the minimum value instead, among
many other possibilities. See an example in Figure 7. Dropout was proposed in order to minimize the problem
of overfitting [44]. More specifically, the objective of
this approach is to avoid excessive detail by replacing a
4.4.4. Flattening
percentage of the values of a given layer with zeros. The
This technique is employed in CNNs to convert a 2D percentage of zeros is the parameter of this technique.
matrix (or a set of matrices) into a 1D vector, as The success of Dropout derives from the fact that the
  neurons do not learn too much detail of the instances of
x1,1
x1,2
the training set. Alternatively, it can be considered that
Dropout generates many different versions of a neural
 
 .. 
. network, and each version has to fit the data with good
 
   
x1,1 x1,2 ··· x1,M 
 x1,M 
 accuracy.
 x2,1 x2,2 ··· x2,M   x2,1 
 . .. ..  Flattening  , (8)
 
 .. ..  x2,2
. . . −−−−−−−→ 
..
 4.4.7. Batch normalization
 
xN,1 xN,2 ··· xN,M  . 

 x2,M

 Batch normalization [45] is based on the idea of normal-
 ..  izing the input of each layer to zero mean and unit stan-
.
 
dard deviation for a given batch of data. The advantages
xN,M
of applying this technique include the reduction of the
where N ×M is the dimension of the input matrix X. By number of training steps and, consequently, a decrease
considering a matrix set, the resultant vector represents of the learning time. This effect occurs because it allows
the concatenation of the vectors respectively to all of the the neurons from any layer to learn separately from
matrices. those in the other layers. Another advantage is that, in

Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022 DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101
Arruda et al. e20220101-7

some cases, batch normalization can mitigate overfitting.

In these cases, the use of Dropout can be unnecessary.

4.4.8. Weight regularization

Typically, overfitting leads to large values for the weights
W . Thus, this effect can be reduced by constraining the
weights to have small values, that is, by regularizing the
values of the weights. This technique consists of adding
penalties to the loss function during the optimization
step. Some possibilities of weight regularization were
proposed [13], such as L1 , L2 (also called weight decay), Figure 9: Example of a feedforward network that can be
and L1 L2 . L1 and L2 are defined as the sum of the employed in binary classifications.
absolute weights and the sum of the squared weights,
respectively, and L1 L2 employs the sum of both regular-
the activation function. In addition to the illustrated
izations, which are defined as
classification task, the feedforward networks are also
X employed to regression tasks.
L1 = l1 |wi,k |, (9)
i
X 4.5.2. Convolutional neural network
2
L2 = l2 wi,k , (10)
i
A convolutional neural network (CNN) [13] is often
applied for visual analysis. CNNs can be particularly
L1 L2 = L1 + L2 , (11) efficient in tasks such as objects detection, classification,
and face detection. This is to a great extent allowed by
where l1 and l2 set the amount of regularization and wi,k the fact that these networks can automatically learn
are elements of the matrix W . effective features. An example of CNN is shown in
Figure 3.
4.5. Types of deep learning networks The first layer of a CNN is a matrix corresponding
to an image. The hidden layers are associated to spe-
In order to deal with a variety of problems, many
cific convolutions, followed by respective pooling layers.
neural network topologies have been proposed in the
These two types of layers are repeated many times,
literature [13]. Here, we describe some of the most
alternately. The matrices are then converted into an 1D
used deep learning topologies, and comment on their
vector by using the process called flattening. Finally,
respective applications.
the vector is sent to a classifier, e.g., a feedforward
network. Many variations of this network can be found
4.5.1. Feedforward in the literature. Usually, the dropout technique does not
Feedforward is one of the first artificial neural net- tend to be particularly effective when applied to CNNs
work topologies proposed in the literature [46]. In spite because a very large number of nullifying operations
of its simplicity, this network is still used nowadays, would need to be applied in order to counteract the large
including deep learning. In a feedforward structure, the redundancy commonly found in visual data.
information moves in a single direction (from the input In Deep Learning Tutorial – 2,2 we present a tutorial
nodes to the output, through the hidden layers), without regarding classification, using the well-known CIFAR10,
loops. Figure 9 shows a typical example of a feedforward which consist of a dataset of colored digits images with
network for binary classification, having a single neuron 10 classes.
in the output layer.
In Deep Learning Tutorial – 1,1 two main examples of 4.5.3. Recurrent neural network
the use of this network can be found. In the first example, Recurrent neural networks (RNN) [47] consists of a class
we employ the feedforward network in binary classifi- of neural networks with recurrent layers, as illustrated
cation, more specifically to distinguish wine from two in Figure 10. In these layers, for each neuron, the output
cultivars, by using a wine dataset containing chemical is re-inserted as an input. The hyperbolic tangent is
information about different wines. The second consists of employed as the activation function in hidden layers.
an analysis of the same dataset, which can be classified This recurrent behavior can be understood as a type of
into three classes, according to their cultivars. In this memory, which gives rise to different ways of processing
case, we show an example of the use of softmax as sequences or vectors. This type of network has been

1 https://github.com/hfarruda/deeplearningtutorial/blob/mast 2 https://github.com/hfarruda/deeplearningtutorial/blob/mast

er/deepLearning_feedforward.ipynb er/deepLearning_CNN.ipynb

DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101 Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022
e20220101-8 Learning Deep Learning

(a)

Figure 10: Example of RNN, which is similar to the structure

of the feedfoward network, but with recurrent neurons.

employed to solve various problems including speech

recognition, text translation, language modeling, and
image captioning, among others.
One problem that can affect RNNs is called vanishing
gradient, which consists in the gradient of loss functions
vanishing to zero during the training stage [48]. In order
to address this problem, it is possible to use the Long (b)
Short-Term Memory (LSTM)3 [49]. For each LSTM
neuron, there is a different set of rules that involves some Figure 11: (a) Example of Boltzmann Machine network and (b)
activation functions (sigmoids and hyperbolic tangents). example of Restricted Boltzmann Machine network, in which
These functions control the flow of incoming information the green and blue nodes represent visible and hidden neurons,
so as to boost the output, consequently reducing the respectively.
effect of the vanishing gradient.
In Deep Learning Tutorial – 3,4 we present a deep difference is the employed training step of RBM, which
LSTM learning model able to predict Bitcoin prices can be the contrastive divergence (CD) algorithm [52].
along time by using the input as a temporal series. Among the many possible applications, we can list
dimensionality reduction, classification, collaborative fil-
4.5.4. Boltzmann machine tering, feature learning, topic modeling, many-body
quantum mechanics, and recommendation systems.
The Boltzmann machine is a stochastic neural network In Deep Learning Tutorial – 4,5 we provide an example
where all neurons are non-directly interconnected [50] of RBM for a recommendation system of CDs and
(see Fig. 11a). The types of neurons used in this model Vinyls.
are divided into visible and hidden, the information being
input into the former type of neurons, which can have
4.5.5. Autoencoders
their values modified. Simulated annealing [51] is used
in the training step instead of gradient descent. One Autoencoders consist of a deep learning model that
disadvantage of this network in its basic configuration generates a coding representation from a given data [53].
is its relatively high computational cost due to the high One example of autoencoder is shown in Figure 12.
number of connections increasing exponentially with the The first part of the network is used to create the code
number of neurons. (encoder), and the second is responsible for recovering
One possible solution to the problem of high com- the original data (decoder). The quality of training
putational cost is the restricted Boltzmann machine is measured by considering the differences between
(RBM) [52], which is a variant of the Boltzmann machine the input layer and the output layer. After training,
(see Fig. 11b). In this case, the nodes are divided into the decoder and the output layer are removed, and the
two layers, that represent visible and hidden neurons. values produced by the encoder become the output of the
The neurons from one layer are connected with all the network. There are many applications of autoencoders,
neurons of the other layer. Furthermore, there are no which include dimensionality reduction, information
connections between nodes in the same layer. Another retrieval, as well as several computer vision tasks. In the
latter, the encoder and decoder layers are typically
3 More information about the LSTM can be found in http://cola convolutional.
h.github.io/posts/2015-08-Understanding-LSTMs/.
4 https://github.com/hfarruda/deeplearningtutorial/blob/mast 5 https://github.com/hfarruda/deeplearningtutorial/blob/mast
er/deepLearning_LSTM.ipynb er/deepLearning_RBM.ipynb

Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022 DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101
Arruda et al. e20220101-9

Figure 12: Example of an autoencoder network. After training,

the code layer becomes the output of the network.

Figure 13: Example of Adversarial networks, which are divided

In order to illustrate an autoencoder application, we into two parts, the generator and discriminator.
use the Fashion MNIST dataset, which comprises dif-
ferent types of clothes. By using the resulting codes, we
project the data using a Uniform Manifold Approxima- consists of partitioning the data into k groups with the
tion and Projection (UMAP) [54]. The code is available same size. For each step, a group is selected as the test
in Deep Learning Tutorial – 5.6 group, while the remainder are considered the training
data. This process is repeated k times for all possible
4.5.6. Generative adversarial networks test groups, and then an accuracy index is computed.
The average and standard deviation of this accuracy
Generative Adversarial Networks (GANs) [55] are super-
are calculated. A high accuracy average means that the
vised deep learning models capable of generating,
method reached a high performance and a high standard
through learning, patterns from data and noise. These
deviation suggests that the classifier has overfitted the
networks consist of two parts, the generator, and the
data (ex [58]).
discriminator (see Figure 13). For instance, for a GAN
For the cases in which a massive combination of
that creates characters, the generator is responsible for
parameters needs to be tested, another approach can
creating the desired character, and the discriminator
be employed to avoid overfitting. In this case, the
monitors the quality of the generated character. The
data is divided into three sets, namely train, validation
training step is repeated many times, and the discrim-
and test. The training set is used for optimizing the
inator is progressively made more strict regarding the
model weights. The validation set is used for estimating
training dataset. Applications adopting GANs include
the model performance and for tuning hyperparameters
the generation of images from texts, videos from images,
(e.g., number of layers, dropout rate, number of training
and text from videos, among others. Also, GANs are
epochs). After a final model is obtained, the test set is
known to be sensitive to changes in network parame-
used for estimating the model performance on previously
ters [56, 57].
unseen data.
In Deep Learning Tutorial – 6,7 we present an example
regarding handwritten character generation, using the
MNIST (Modified National Institute of Standards and 6. Concluding Remarks
Technology) dataset to train a GAN. As a result, we cre-
ate a network that automatically generates handwritten Deep learning has been used to effectively solve many
characters. problems involving classification and clustering. As such,
these networks have been incorporated into the most
diverse applications, ranging from automated movies
5. Performance Evaluation
subtitles to self-driving cars. In principle, deep learning
structures consist of large neural networks involving
Performance evaluation methods can be applied to each
many neurons and considering very large datasets,
deep learning configuration, and are often decisive for
as well as GPGPUs. In addition, some new concepts
enhancing performance and better understanding the
and methods have been incorporated in this approach,
obtained results given the adopted configurations. In the
including autoencoding, diverse activation functions, as
case of supervised methods, one of the most common
well as overfitting prevention.
evaluation approach is the k-fold [40]. This approach
In the present work, after briefly reviewing some
6 https://github.com/hfarruda/deeplearningtutorial/blob/mast
interdisciplinary works involving physics and deep learn-
ing, we briefly introduced the main elements underlying
er/deepLearning_autoencoder.ipynb
7 https://github.com/hfarruda/deeplearningtutorial/blob/mast deep learning, including its motivation, basic concepts,
er/deepLearning_GAN.ipynb and some of the main models. These elements are

DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101 Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022
e20220101-10 Learning Deep Learning

Table 1: Comparison among the models considered in this work. *RBMs are normally employed as a part of the deep belief networks.
†
NLP means Natural Language Processing. The last column presents links to tutorial elaborated for each model.
Models Learning Main Applications Information Flow Tutorials
Feedforwrd Supervised Classiﬁcation and Regression. Single Direction Tutorial – 1
CNN Supervised Computer Vision. Single Direction Tutorial – 2
RNN Supervised Temporal Series. With Loops Tutorial – 3
RBM* Unsupervised Computer Vision, Recommender Systems, Undirected Tutorial – 4
Information Retrieval, and
Data compression, etc.
Autoencoder Unsupervised Information Retrieval and Single Direction Tutorial – 5
Data compression.
GAN Semi-supervised Generation of Images, Audio Synthesis, Single Direction Tutorial – 6
NLP† , and Temporal Series.
Full tutorial available at https://github.com/hfarruda/deeplearningtutorial

particularly important for understanding a wide range of [5] L.F. Costa and R.M. Cesar Jr, Shape analysis and
deep learning-related methods. Table 1 summarizes the classification: theory and practice (CRC Press, Inc.,
revised models and some of their respective characteris- Boca Raton, 2000).
tics. A tutorial in Python has been prepared to serve as a [6] R.O. Duda, P.E. Hart and D.G. Stork, Pattern classifi-
companion to this work, illustrating and complementing cation (John Wiley & Sons, Hoboken, 2012).
the covered material (https://github.com/hfarruda/de [7] F.R. Monte Ferreira, M.I. Nogueira and J. DeFelipe,
eplearningtutorial). It is hoped that the reader will be Frontiers in neuroanatomy 8, 1 (2014).
motivated to probe further into the related literature. [8] S. Haykin, in: Neural networks and learning machines
(Pearson Education, India, 2009), 3 ed., v. 10.
Acknowledgments [9] I. Stephen, IEEE Transactions on neural networks 50,
179 (1990).
[10] J.D. Keeler, Cognitive Science 12, 299 (1988).
Henrique F. de Arruda acknowledges FAPESP for spon-
[11] B. Xu, X. Liu and X. Liao, Computers & Mathematics
sorship (grant no. 2018/10489-0, from 1st February
with Applications 45, 1729 (2003).
2019 until 31st May 2021). H. F. de Arruda also
[12] S. Dutta, Wiley Interdisciplinary Reviews: Data Mining
thanks Soremartec S.A. and Soremartec Italia, Ferrero and Knowledge Discovery 8, e1257 (2018).
Group, for partial financial support (from 1st July 2021). [13] I. Goodfellow, Y. Bengio and A. Courville, Deep learning
His funders had no role in study design, data collec- (MIT press, Cambridge, 2016).
tion, and analysis, decision to publish, or manuscript [14] N. Thuerey, P. Holl, M. Mueller, P. Schnell, F. Trost and
preparation. Alexandre Benatti thanks Coordenação K. Um, arXiv:2109.05237 (2021).
de Aperfeiçoamento de Pessoal de Nível Superior – [15] A. Tanaka, A. Tomiya and K. Hashimoto, Deep Learning
Brasil (CAPES) – Finance Code 001. Luciano da F. and Physics (Springer, Singapore, 2021).
Costa thanks CNPq (grant no. 307085/2018-0) and [16] L. Zdeborva, Nature Physics 16, 602 (2020).
FAPESP (proc. 15/22308-2) for sponsorship. César H. [17] D.E. Rumelhart, G.E. Hinton, J.L. McClelland, in:
Comin thanks FAPESP (Grant Nos. 2018/09125-4 and Parallel distributed processing: Explorations in the
2021/12354-8) for financial support. This work has microstructure of cognition, edited by D.E. Rumelhart
been supported also by FAPESP grants 11/50761-2 and and J.L. McClelland (MIT Press, Cambridge, 1986).
15/22308-2. [18] R. Salakhutdinov and H. Larochelle, in: Proceedings
of the thirteenth international conference on artificial
References intelligence and statistics (Sardinia, 2010).
[19] M.H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and
[1] L.F. Costa, Modeling: The human approach to sci- R. Melko, Physical Review X 8, 021050 (2018).
ence (cdt-8), available in: https://www.researchgate.net [20] N. Srivastava and R.R. Salakhutdinov, Advances in
/publication/333389500_Modeling_The_Human_App neural information processing systems 25 (2012).
roach_to_Science_CDT-8, accessed in 06/06/2019. [21] I. Goodfellow, M. Mirza, A. Courville and Y. Bengio, in:
[2] E.B. Goldstein and J. Brockmole, Sensation and percep- Proceedings of Advances in Neural Information Process-
tion (Cengage Learning, Belmont, 2016). ing Systems 26 (Lake Tahoe, 2013).
[3] R.G. Cook and J.D. Smith, Psychological Science 17, [22] M. Lutter, C. Ritter and J. Peters, arXiv:1907.04490
1059 (2006). (2019).
[4] L.F. Costa, Quantifying complexity (cdt-6), available in: [23] C. Häger and H.D. Pﬁster, IEEE Journal on Selected
https://www.researchgate.net/publication/332877069_ Areas in Communications 39, 280 (2020).
Quantifying_Complexity_CDT-6, accessed in 06/06/ [24] P. Sadowski and P. Baldi, in: Braverman Readings in
2019. Machine Learning. Key Ideas from Inception to Current

Revista Brasileira de Ensino de Física, vol. 44, e20220101, 2022 DOI: https://doi.org/10.1590/1806-9126-RBEF-2022-0101
Arruda et al. e20220101-11

State, edited by L. Rozonoer, B. Mirkin and I. Muchnik [48] Y. Bengio, P. Simard and P. Frasconi, IEEE transactions
(Springer, Boston, 2018). on neural networks 5, 157 (1994).
[25] M. Erdmann, J. Glombitza, G. Kasieczka and U. [49] S. Hochreiter and J. Schmidhuber, Neural computation
Klemradt, Deep Learning for Physics Research (World 9, 1735 (1997).
Scientiﬁc, Singapore, 2021). [50] G.E. Hinton, S. Osindero and Y.W. Teh, Neural compu-
[26] M. Raissi, The Journal of Machine Learning Research tation 18, 1527 (2006).
19, 932 (2018). [51] E. Aarts, J. Korst, Simulated annealing and Boltzmann
[27] D. Guest, K. Cranmer and D. Whiteson, Annual Review machines: a stochastic approach to combinatorial opti-
of Nuclear and Particle Science 68, 161 (2018). mization and neural computing (John Wiley & Sons,
[28] T.A. Le, A.G. Baydin and F. Wood, in: Proceedings of Inc., Hoboken, 1989).
the 20th International Conference on Artificial Intelli- [52] G.E. Hinton, Neural computation 14, 1771 (2002).
gence and Statistics PMLR 54 (Fort Lauderdale, 2017). [53] P. Baldi, in: Proceedings of ICML workshop on unsuper-
[29] A.G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. Mead- vised and transfer learning, PMLR 27 (Bellevue, 2012).
ows, J. Liu, A. Munk, S. Naderiparizi, B. Gram-Hansen, [54] L. McInnes, J. Healy and J. Melville, arXiv:1802.03426
G. Louppe, et al., in: Proceedings of the international (2018).
conference for high performance computing, networking, [55] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.
storage and analysis (Denver, 2019). Warde-Farley, S. Ozair, A. Courville and Y. Bengio, in:
[30] G. Zheng, X. Li, R.H. Zhang and B. Liu, Science Advances in neural information processing systems 27,
advances 6, eaba1482 (2020). edited by Z. Ghahramani, M. Welling, C. Cortes, N.
[31] A. Guillen, A. Bueno, J. Carceller, J. Martınez- Lawrence and K.Q. Weinberger (NeurIPS Proceedings,
Velazquez, G. Rubio, C.T. Peixoto and P. Sanchez- Montreal, 2014).
Lucas, Astroparticle Physics 111, 12 (2019). [56] K. Roth, A. Lucchi, S. Nowozin and T. Hofmann, in:
[32] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Advances in Neural Information Processing Systems 30,
Denzler, N. Carvalhais and Prabhat, Nature 566, 195 edited by I. Guyon, U. Von Luxburg, S. Bengio, H.
(2019). Wallach, R. Fergus, S. Vishwanathan and R. Garnett
[33] R. Hecht-Nielsen, in: Proceedings of the international (NeurIPS Proceedings, Montreal, 2014).
conference on Neural Networks (New York, 1987). [57] L. Metz, B. Poole, D. Pfau and J. Sohl-Dickstein, in:
[34] M.M. Poulton, in: Handbook of Geophysical Exploration: Proceedings of International Conference on Learning
Seismic Exploration (Elsevier, Amsterdam, 2001), v. 30. Representations (San Juan, 2016).
[35] D.P. Kingma and J. Ba, arXiv:1412.6980 (2014). [58] G.C. Cawley and N.L. Talbot, Journal of Machine
[36] A.K. Jain, J. Mao and K. Mohiuddin, Computer 29, 31 Learning Research 11, 2079 (2010).
(1996).
[37] P. Sibi, S.A. Jones and P. Siddarth, Journal of The-
oretical and Applied Information Technology 47, 1264
(2013).
[38] A.L. Maas, A.Y. Hannun and A.Y. Ng, in: Pro-
ceeding International Conference on Machine Learning
(Atlanta, 2013).
[39] B. Xu, N. Wang, T. Chen and M. Li, arXiv:1505.00853
(2015).
[40] C.M. Bishop and N.M. Nasrabadi, Pattern recognition
and machine learning (Springer, Berlim, 2006), v. 4.
[41] A. Deshpande and M. Kumar, Artificial intelligence
for big data: Complete guide to automating big data
solutions using artificial intelligence techniques (Packt
Publishing Ltd, Birmingham, 2018).
[42] M. Cheung, J. Shi, O. Wright, L.Y. Jiang, X. Liu and
J.M. Moura, IEEE Signal Processing Magazine 37, 139
(2020).
[43] X. Ying, in: 2018 International Conference on Computer
Information Science and Application Technology – v.
1168 (Daqing, 2019).
[44] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever
and R.R. Salakhutdinov, arXiv:1207.0580 (2012).
[45] S. Ioﬀe and C. Szegedy, in: Proceedings of the 32nd Inter-
national Conference on Machine Learning (Mountain
View, 2015).
[46] J. Schmidhuber, Neural networks 61, 85 (2015).
[47] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Nature
323, 533 (1986).