Yang et al. 2023 Deep Learning and Reinforcement Learning
Contributors
Ning Weng, Prashant Baral, Ning Yang, Bhanu K.N. Prakash, Arvind Channarayapatna Srinivasa, Ling Yun
Yeow, Wen Xiang Chen, Audrey Jing Ping Yeo, Wee Shiong Lim, Cher Heng Tan, Hany Mohamed Nabil Helmy,
Sherif El Diasty, Hazem Shatila, Yuan Wang, Zekun Li, Zhenyu Deng, Huiling Song, Jucheng Yang, Fateme
Fathinezhad, Jocelyn Chanussot, Peyman Adibi, Bijan Shoushtarian, Ramzi Mahmoudi, Narjes Ben Ameur
Individual chapters of this publication are distributed under the terms of the Creative Commons
Attribution 3.0 Unported License which permits commercial use, distribution and reproduction of
the individual chapters, provided the original author(s) and source publication are appropriately
acknowledged. If so indicated, certain images may not be included under the Creative Commons
license. In such cases users will need to obtain permission from the license holder to reproduce
the material. More details and guidelines concerning content reuse and adaptation can be found at
http://www.intechopen.com/copyright-policy.html.
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not
necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of
information contained in the published chapters. The publisher assumes no responsibility for any
damage or injury to persons or property arising out of the use of any materials, instructions, methods
or ideas contained in the book.
Contents

Preface

Section 1
Theory and Algorithms of Deep Learning and Reinforcement Learning

Chapter 1
Utilized System Model Using Channel State Information Network with Gated Recurrent Units (CsiNet-GRUs)
by Hany Helmy, Sherif El Diasty and Hazem Shatila

Chapter 2
Graph Neural Networks and Reinforcement Learning: A Survey
by Fatemeh Fathinezhad, Peyman Adibi, Bijan Shoushtarian and Jocelyn Chanussot

Section 2
Applications of Deep Learning and Reinforcement Learning

Chapter 3
IoT Device Identification Using Device Fingerprint and Deep Learning
by Prashant Baral, Ning Yang and Ning Weng

Chapter 4
MultiRes Attention Deep Learning Approach for Abdominal Fat Compartment Segmentation and Quantification
by Bhanu K.N. Prakash, Arvind Channarayapatna Srinivasa, Ling Yun Yeow, Wen Xiang Chen, Audrey Jing Ping Yeo, Wee Shiong Lim and Cher Heng Tan

Chapter 5
Deep Learning for Natural Language Processing
by Yuan Wang, Zekun Li, Zhenyu Deng, Huiling Song and Jucheng Yang

Chapter 6
Deep Learning in Medical Imaging
by Narjes Benameur and Ramzi Mahmoudi
Preface
Nowadays, deep learning and reinforcement learning have become some of the hottest research directions in computer science. They can solve complex problems such as
natural language processing, computer vision, medical image analysis, and more by
training powerful neural networks. The deep learning algorithm has become one of
the most important and promising technologies in the field of artificial intelligence.
In addition, reinforcement learning can autonomously learn and adjust to maximize
rewards, which is expected to solve complex sequential decision tasks, such as intelligent games and robot control.
In recent years, the rapid development and widespread application of deep learning
and reinforcement learning have created enormous commercial and social value. This
book introduces the latest advances in the fields of deep learning and reinforcement
learning, covering a variety of key areas like natural language processing, medical image analysis, and Internet of Things (IoT) device recognition.
This book consists of two sections: “Theory and Algorithms of Deep Learning and
Reinforcement Learning” and “Applications of Deep Learning and Reinforcement
Learning.” Sections I and II contain two and four chapters, respectively. Section I discusses new network structures and algorithms for deep learning and reinforcement learning. Section II explores new deep learning and reinforcement learning solutions to the challenges faced by the fields of natural language processing, medical image analysis, and IoT device recognition.
I would like to express my sincerest gratitude to the editors, authors, and reviewers
who have contributed to this book.
Thank you!
Jucheng Yang, Yarui Chen, Tingting Zhao, Yuan Wang and Xuran Pan
College of Artificial Intelligence,
Tianjin University of Science and Technology,
Tianjin, China
Section 1
Theory and Algorithms of Deep Learning and Reinforcement Learning

Chapter 1
Utilized System Model Using Channel State Information Network with Gated Recurrent Units (CsiNet-GRUs)
by Hany Helmy, Sherif El Diasty and Hazem Shatila
Abstract
1. Introduction
Figure 1.
Enhanced multiple-access for mmWave massive MIMO [2].
User equipment encodes channel matrices into codewords using the encoder; after
the codewords are returned to the BS, it uses the decoder to reconstruct the original
channel matrices. The technique can be applied as a feedback protocol in FDD MIMO
systems. The autoencoder [3] in deep learning, which is used to learn an encoding of a dataset for dimensionality reduction, is closely related to CsiNet. To
recreate accurate models from CS data, several deep learning (DL) architectures have
recently been designed and introduced in [4–6].
DL shows state-of-the-art performance in natural-image reconstruction, but wireless channel reconstruction is more difficult than image reconstruction, so whether this capability carries over is unclear. A DL-based CSI reduction and recovery strategy is introduced in the current work. The most significant prior research appears to be [7], in which a closed-loop MIMO system implements DL-based CSI encoding. The present work differs from previous research, which did not consider CSI recovery, by demonstrating that CSI can be recovered by DL with significantly higher reconstruction quality than current CS-based methods.
$$y_n = \tilde{h}_n^H v_n x_n + z_n \tag{1}$$

where $\tilde{h}_n \in \mathbb{C}^{N_t \times 1}$ and $v_n \in \mathbb{C}^{N_t \times 1}$ are the channel frequency response vector and the precoding vector at the $n$th subcarrier, respectively, $x_n$ represents the transmitted information symbol, $z_n$ is the additive noise or interference, and $(\cdot)^H$ is the conjugate transpose. In the FDD system, to improve the feedback links between UE and BS, we focus on a feedback scheme that allows autoencoder processing and assume that $\tilde{H} = \left[\tilde{h}_1 \ldots \tilde{h}_{\tilde{N}_c}\right]^H \in \mathbb{C}^{\tilde{N}_c \times N_t}$ is the CSI stacked in the spatial-frequency domain. This means the UE should return $\tilde{H}$ to the BS through the feedback links, and the total number of feedback parameters is $N_t \tilde{N}_c$. Using a 2D discrete Fourier transform (DFT), $\tilde{H}$ can be converted to the angular-delay domain to reduce feedback overhead:
$$H = F_d \tilde{H} F_a^H \tag{2}$$

where $F_d$ and $F_a$ are $\tilde{N}_c \times \tilde{N}_c$ and $N_t \times N_t$ DFT matrices, respectively. So, considering the COST 2100 channel model [9], as shown in Figure 2, and depending on a uniform linear array (ULA), $H$ has only a small fraction of significant components. In the delay domain, only the first $N_c$ rows of $H$ contain significant values, so we retain the first $N_c$ rows of $H$ and remove the remaining rows. In a massive MIMO system, the total number of feedback parameters can thus be reduced to $N = N_c N_t$. So, we design the encoder

$$S = f_{en}(H) \tag{3}$$

which converts $H$ into an $M$-dimensional codeword vector, where $M < N$, and we design the decoder as the inverse transformation from the codeword back to the original channel:

$$\hat{H} = f_{de}(S) \tag{4}$$
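To make this preprocessing concrete, the following is a minimal NumPy sketch of the 2D-DFT transform of Eq. (2) and the delay-domain truncation step; the unitary DFT normalization and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def angular_delay_truncate(H_sf, Nc_keep):
    """Sketch of Eq. (2) plus truncation: transform the spatial-frequency
    CSI matrix to the angular-delay domain with 2D DFTs and keep only the
    first Nc_keep delay-domain rows (assumed unitary normalization)."""
    Nc_tilde, Nt = H_sf.shape
    Fd = np.fft.fft(np.eye(Nc_tilde)) / np.sqrt(Nc_tilde)  # delay-domain DFT matrix
    Fa = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)              # angular-domain DFT matrix
    H_ad = Fd @ H_sf @ Fa.conj().T                         # H = F_d H~ F_a^H
    return H_ad[:Nc_keep, :]                               # retain significant rows only
```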
The European Cooperation in Science and Technology COST 2100 channel model is a geometry-based stochastic channel model (GSCM) that can reproduce the stochastic properties of massive MIMO channels.

Figure 2.
A plot of the strength of H ∈ ℂ^(32×32) [8].
$$z_t = \sigma_g\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{5}$$

$$r_t = \sigma_g\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{6}$$

where $x_t$ is the input vector, $h_t$ the output vector, $z_t$ the update gate vector, and $r_t$ the reset gate vector; $W$, $U$, and $b$ denote weight matrices and bias vectors, respectively. Regarding activation functions, $\sigma_g$ is originally the sigmoid and $\phi_h$ the hyperbolic tangent; alternative activation functions can be used, provided that $\sigma_g(x) \in [0, 1]$. It is possible to construct alternative forms by modifying $z_t$ and $r_t$.
The GRU's ability to hold on to long-term dependencies, or memory, stems from the computations within the gated recurrent unit cell that produce the hidden state. While LSTMs pass two different states between cells, the cell state and the hidden state, which carry the long-term and short-term memory, respectively, GRUs have only one hidden state transferred between time steps. This hidden state can hold both long-term and short-term dependencies at the same time due to the gating mechanisms and computations that the hidden state and input data go through.

Figure 3.
Gated recurrent unit, fully gated version [12].
The GRU cell contains only two gates: the Update gate and the Reset gate. Like the gates in LSTMs, the GRU gates are trained to selectively filter out any irrelevant information while keeping what's useful. These gates are essentially vectors containing values between 0 and 1 that are multiplied with the input data or hidden state.
A zero (0) value in the gate vectors indicates that the input or hidden state’s
corresponding data is unimportant and will, therefore, return as a zero.
On the other hand, a one (1) value in the gate vector means that the corresponding
data is essential and will be used. Reset gate: In the first step, we’ll create the Reset
gate; this gate is derived and calculated using both the hidden state from the previous
time step and the input data at the current time step (Figure 4).
Figure 4.
Reset gate flow [13].
When the entire network is trained through back-propagation, the weights in the
equation will be updated such that the vector will learn to retain only the valuable
features. The previous hidden state will first be multiplied by a trainable weight and
will then undergo an element-wise multiplication (Hadamard product) with the reset
vector. This operation will decide which information will be kept from the previous
time steps and the new inputs.
Simultaneously, the current input will also be multiplied by a trainable weight
before being summed with the reset vector’s product and the previous hidden state
above. Finally, a non-linear activation tanh function will be applied to the result to
obtain r in the equation below.
$$r = \tanh\left(gate_{reset} \odot \left(W_{h1} \cdot h_{t-1}\right) + W_{x1} \cdot x_t\right) \tag{9}$$
Update gate: Next, we'll create the Update gate. Like the Reset gate, it is computed using the previous hidden state and current input data (Figure 5). Both the Update and Reset gate vectors are created using the same formula, but the weights multiplied with the input and hidden state are unique to each gate, which means that each gate's final vector is different; this allows the gates to serve their specific purposes.
Figure 5.
Update gate flow [13].
$$gate_{update} = \sigma\left(W_{input_{update}} \cdot x_t + W_{hidden_{update}} \cdot h_{t-1}\right) \tag{10}$$
The Update vector will undergo element-wise multiplication with the previous hidden state to obtain $u$ in the equation below, which will be used to compute our final output later:

$$u = gate_{update} \odot h_{t-1} \tag{11}$$

The Update vector will also be used in another operation later when obtaining our final output.
The purpose of the Update gate here is to help the model determine how much of
the past information stored in the previous hidden state needs to be retained for the
future. Combining the outputs: In the last step, we will be reusing the Update gate and
obtaining the updated hidden state (Figure 6).
This time, we will take the element-wise inverse of the same Update vector (1 − Update gate) and perform an element-wise multiplication with our output from the Reset gate, r. This operation's purpose is for the Update gate to determine
what portion of the new information should be stored in the hidden state. Lastly, the
result of the above operations will be summed with our output from the Update gate
in the previous step, u.
This will give us our new and updated hidden state; we can use this new hidden
state as our output for that time step by passing it through a linear activation layer.
$$h_t = r \odot \left(1 - gate_{update}\right) + u \tag{12}$$
Figure 6.
Final output computations [13].

The Reset gate determines which parts of the previous hidden state are to be combined with the current input to propose a new hidden state, and the Update gate determines how much of the previous hidden state is to be retained and what part of the new proposed hidden state derived from the Reset gate is to be added to the final hidden state. This solves the vanishing/exploding gradient problem. When the Update gate is first multiplied with the previous hidden state, the network chooses which components of that state to keep in memory while discarding the rest. When it then uses the inverse of the Update gate to filter the proposed new hidden state from the Reset gate, it fills in the information that was previously missing. As a result, the network can maintain long-term dependencies. If the Update vector values are close to 1, the Update gate may decide to keep most of the previous memories in the hidden state rather than recalculating or altering the hidden state entirely.
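To tie the gating equations together, here is a minimal NumPy sketch of one time step of the GRU variant described by Eqs. (9)-(12); the weight names are illustrative and bias terms are omitted, so this is a sketch rather than the chapter's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W):
    """One time step of the GRU variant of Eqs. (9)-(12).
    W is a dict of trainable weight matrices (names illustrative)."""
    gate_reset = sigmoid(W["xr"] @ x + W["hr"] @ h_prev)        # reset gate in [0, 1]
    r = np.tanh(gate_reset * (W["h1"] @ h_prev) + W["x1"] @ x)  # eq. (9): proposed state
    gate_update = sigmoid(W["xu"] @ x + W["hu"] @ h_prev)       # eq. (10): update gate
    u = gate_update * h_prev                                    # eq. (11): retained memory
    return r * (1.0 - gate_update) + u                          # eq. (12): new hidden state
```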
When training a recurrent neural network (RNN), the vanishing or exploding gradient problem can occur, especially if the RNN is processing lengthy sequences or has multiple layers. The network's weights are updated in the right direction and by the right amount using the error gradient calculated during training. However, this gradient is determined using the chain rule, beginning at the end of the network. As a result, for long sequences, the gradients undergo continuous matrix multiplications and either vanish or explode exponentially.
A gradient that is too small will prevent the model from effectively updating its
weights, whereas a gradient that is too large will make the model unstable.
Due to the additive nature of the Update gates, the long short-term memory
(LSTM) and gated recurrent units (GRUs) can keep most of the existing hidden state
while adding new content on top of it, unlike traditional RNNs that always replace the
entire hidden state content at each time step.
This prevents the additional operations from causing the error gradients to vanish
or explode too quickly during backpropagation. Utilizing alternative activation functions, like ReLU, which does not result in a small derivative, is the simplest solution.
Another option is residual networks, which offer residual connections directly to
earlier layers. In a feedforward network (FFN), the backpropagated error signal
typically decreases (or increases) exponentially as a function of the distance from the
final layer. This phenomenon is referred to as the vanishing gradient.
$$\Delta t = \frac{c}{2 v f_0} \tag{13}$$

where $f_0$ is the carrier frequency, $c$ is the speed of light, and $v$ is the UE velocity. The CSI within $\Delta t$ is considered correlated with one another. Therefore, instead of independently recovering
CSI, the BS can combine the feedback and previous channel information to enhance
the subsequent reconstruction.
We set the feedback time interval as δt and place T adjacent instantaneous
angular-delay domain channel matrices into a channel group, i.e.,
$$\left\{H''_t\right\}_{t=1}^{T} = \left\{H''_1, \ldots, H''_t, \ldots, H''_T\right\} \tag{14}$$
Figure 7.
The structure of the proposed CsiNet-GRUs using dropout technique [14].
trade-off. We also introduce the multi-CR strategy to implement variable CRs on different channel matrices. The proposed CsiNet-GRU is illustrated in Figure 7 together with CsiNet. Our model includes two stages: angular-delay domain feature extraction, followed by correlation representation and final reconstruction. Each GRU has an inherent memory unit that can hold previously extracted information for a long time for future prediction. To facilitate comparison with the results of the CsiNet structure given in [8], we use 3 × 3 convolutional layers and an M-unit dense layer for sensing, a dense layer with 2N′cNt units, and two RefineNet decoders for reconstruction, as shown in Figure 7; each RefineNet comprises four 3 × 3 convolutional layers with different channel sizes.
The CsiNet decoder's output generates a sequence of length T, which is then fed into a three-layer GRU. All low-CR CsiNets shown in Figure 7 share the same network parameters, i.e., weights and biases, because they perform the same work, which dramatically reduces parameter overhead. Furthermore, the architecture can easily be rescaled to operate on channel groups with a different T if the value of T changes to adapt to the channel-changing speed and feedback frequency; in practice, a low-CR CsiNet is reused (T − 1) times instead of making (T − 1) copies. The gray blocks in Figure 7 load parameters from the original CsiNet as pre-training before end-to-end training of the entire architecture. This method can alleviate vanishing gradient problems due to the long paths from the CsiNets to the GRUs. We use GRUs, which have inherent memory cells and can keep previously extracted information for a long period for later prediction, to extend the CsiNet decoders for time-correlation extraction and final reconstruction. In particular, the CsiNet decoders' outputs form a sequence of length T before
being fed into the three-layer GRUs. Each GRU has 2N′cNt hidden units, the same size as its output. The final output is then reshaped into two matrices as the final recovered Ĥ″t. The first channel matrix is compressed by the high-CR encoder into an M1 × 1 codeword, while the remaining T − 1 channel matrices are compressed into M2 × 1 codewords (M1 > M2), because less information is required due to channel correlation. The (T − 1) codewords are each concatenated with the first codeword M1 × 1 before being fed into the low-CR CsiNet decoder to fully utilize the feedback information. Each CsiNet outputs two matrices of size (N′c × Nt) as features extracted from the angular-delay domain, forming the final recovered Ĥ″t. The spatial-frequency domain CSI can then be obtained via inverse 2D-DFT. At each time step, the GRUs implicitly learn time correlation from the previous inputs and merge it with the current inputs to increase low-CR recovery quality.
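As an illustration of how the three-layer GRU stage described above can be assembled, the following is a hedged Keras sketch; the dropout rates and function name are assumptions, while the layer count and the 2N′cNt unit size follow the text.

```python
from tensorflow.keras import layers, models

def build_gru_stage(T, Nc, Nt, rate=0.2):
    """Sketch of the three-layer GRU refinement stage: a length-T sequence
    of decoder outputs, each with 2*Nc*Nt features, passes through three
    stacked GRUs whose hidden size equals the output size.
    The dropout rates here are illustrative, not the chapter's values."""
    units = 2 * Nc * Nt
    seq = layers.Input(shape=(T, units))
    x = seq
    for _ in range(3):  # three stacked GRU layers
        x = layers.GRU(units, return_sequences=True,
                       dropout=rate, recurrent_dropout=rate)(x)
    return models.Model(seq, x, name="csinet_gru_stage")
```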
Dropout: During training, randomly selected neurons are ignored, or “dropped out.” This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass [15]. Dropout can be implemented on any hidden layer as well as the visible or input layer; the term “dropout” refers to dropping out units (hidden and visible) in a neural network. Dropout is a regularization method used when training the network, as illustrated in Figure 8. Depending on the framework, the input and loop connections to the gated recurrent units (GRUs) in Figure 7 are not necessarily excluded from activation and weight updates. This dropout regularization approach is used to reduce overfitting and improve the efficiency of the CsiNet-GRU structure.

Training phase: ignore (zero out) a random fraction p of nodes (and the corresponding activations) for each hidden layer, training sample, and iteration. Testing phase: use all activations, but reduce them by a factor of p to account for the activations missing during training.

Figure 8.
Neural network with dropout architecture [15].
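The training/testing behavior just described can be sketched in a few lines of NumPy; here p is interpreted as the retain probability of each unit, matching the test-time scaling by p, and the function name is illustrative.

```python
import numpy as np

def dropout_forward(a, p, training):
    """Dropout as described above [15]: randomly zero units during
    training, use all units scaled by p at test time.
    Here p is the probability of retaining each unit."""
    if training:
        mask = np.random.random(a.shape) < p  # keep each unit with probability p
        return a * mask                       # dropped units contribute zero downstream
    return a * p                              # test phase: all activations, scaled by p
```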
Some observations: Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Dropout roughly doubles the number of iterations needed to converge; however, each epoch's training time is shorter. During the testing phase, the entire network is used, with each activation reduced by a factor of p. When training the network in the proposed structure, the input and recurrent connections to the GRU units may not be excluded from activation and weight updates.
There are two dropout parameters in RNN layers: dropout, applied to the first operation on the inputs, and recurrent dropout, applied to the other operation on the recurrent inputs. It is worth mentioning that we are interested in designing an encoder that can transform the channel matrix into an M-dimensional vector (codeword), where M < N; thus, we define the data compression ratio γ as γ = M/(2NtNc).

The encoder first extracts CSI features via a convolutional layer with two 3 × 3 filters, followed by an activation layer. A fully connected (FC) layer with M neurons is then used to compress the CSI features to lower dimensions. The compression ratio (CR) of this encoder can be expressed as CR = 1/γ. The final reconstruction of the CSI is performed by three 2N′cNt-unit GRUs with dropout techniques.
Moreover, adopting depth-wise separable convolutions in feature recovery reduces the model's size and exchanges information between channels. We introduce the parameter set θ used in the encoder and decoder, i.e., θ = {θen, θde}. It is worth mentioning that the H″t are standardized, with all components scaled into [0, 1]; this standardization is required for CsiNet.
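A hedged Keras sketch of the encoder just described (a convolutional layer with two 3 × 3 filters, an activation layer, and an M-neuron FC layer) is shown below; the real/imaginary input layout and the LeakyReLU choice are assumptions, not the authors' exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(Nc, Nt, M):
    """Encoder sketch: extract CSI features with two 3x3 conv filters,
    then compress to an M-dimensional codeword with a dense layer,
    so the compression ratio is CR = 1/gamma = 2*Nt*Nc / M."""
    h = layers.Input(shape=(Nc, Nt, 2))              # assumed: Re/Im as 2 channels
    x = layers.Conv2D(2, (3, 3), padding="same")(h)  # conv layer with two 3x3 filters
    x = layers.LeakyReLU()(x)                        # activation layer (assumed type)
    x = layers.Flatten()(x)
    s = layers.Dense(M)(x)                           # codeword of dimension M < N
    return tf.keras.Model(h, s, name="csi_encoder")
```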
Multiple-input multiple-output (MIMO) is a technology that enables faster and more reliable transmissions over wireless channels.

The COST 2100 model simulates MIMO channels and generates training samples; we set the MIMO-OFDM system to work on a 20 MHz bandwidth using a uniform linear array (ULA). The parameters utilized in the indoor and outdoor channel scenarios are given in Table 1. Datasets are generated by randomly setting different starting positions for the indoor and outdoor scenarios and performing the simulations at the listed CR values, with the first channel H″1 compressed at CR = 1/4. Table 1 also covers the training, validation, and testing sets; some parameters are preloaded from CsiNet for initialization (epochs from 500 to 1000, learning rate of 0.001, and batch size of 200).
We compare the proposed architecture's performance with previous approaches to modeling channel state information (CSI) with different deep learning techniques, namely Conv-LSTM CsiNet, LASSO, TVAL3 [16], and CsiNet, utilizing the default setup in the open-source code of each technique for reproduction.

TVAL3 uses a minimum total-variation method that provides remarkable recovery quality and high computational efficiency, while LASSO uses simple sparse priors to achieve good performance. Conv-LSTM CsiNet uses an RNN and depth-wise separable convolutions in its feature extraction and recovery modules.

The term “training” refers to the process of determining which parameters to use for a given dataset. We run the CsiNet-GRUs model on Google Colaboratory (Python), chosen for its zero configuration, free access to GPUs, and easy sharing; training and testing of CsiNet, Conv-LSTM CsiNet, and CsiNet-GRUs are all performed in the Colab editor.
H: 32 × 32
Nt: 32 antennas
Nc: 1024 subcarriers
Epochs: 500–1000
δt: 0.04 s
T: 10 s–100 s
CR: 4, 16, 32, 64

Table 1.
COST 2100 model datasets and system parameters.
Comparisons are made using the normalized mean square error, cosine similarity, accuracy, and run time in the indoor and outdoor channels, with complexity also factored in. The normalized mean square error reflects the mean relative scatter.
The normalization of the MSE assures that the metric will not be biased when the
model overestimates or underestimates the predictions. So, the normalized mean
square error (NMSE) utilized for comparisons quantifies the difference between the
input $\left\{H_t\right\}_{t=1}^{T}$ and the output $\left\{\hat{H}_t\right\}_{t=1}^{T}$ in the proposed CsiNet-GRUs and is given by:

$$\mathrm{NMSE} = \frac{1}{T}\sum_{t=1}^{T}\left\|H''_t - \hat{H}''_t\right\|_2^2 \Big/ \left\|H''_t\right\|_2^2 \tag{15}$$
where $\hat{h}_{n,t}$ denotes the reconstructed channel vector of the $n$th subcarrier at time $t$, and the cosine similarity is

$$\rho = \frac{1}{T N_c}\sum_{t=1}^{T}\sum_{n=1}^{N_c}\frac{\left|\hat{h}_{n,t}^{H} h_{n,t}\right|}{\left\|\hat{h}_{n,t}\right\|_2 \left\|h_{n,t}\right\|_2} \tag{16}$$

$\rho$ can measure the quality of the beamforming vector when the beamforming vector is set as $v_{n,t} = \hat{h}_{n,t}/\left\|\hat{h}_{n,t}\right\|_2$, since the UE will then achieve the equivalent channel $h_{n,t}^{H}\hat{h}_{n,t}/\left\|\hat{h}_{n,t}\right\|_2$.
We introduce a new parameter for comparison, called accuracy, defined as the ratio of the number of correct predictions to the total number of input samples; here, this means the ratio of the recovered channel vectors to the original channel vectors $\left\{H''_t\right\}_{t=1}^{T}$, so the accuracy in CsiNet-GRUs is defined as:

$$\mathrm{Accuracy} = \frac{1}{T}\frac{1}{N_c}\sum_{t=1}^{T}\sum_{n=1}^{N_c}\frac{\left|h_{n,t}^{H}\,\hat{h}_{n,t}\right|}{\left\|h_{n,t}\right\|_2} \tag{17}$$
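For reference, the NMSE of Eq. (15) and the cosine similarity ρ of Eq. (16) can be computed with a short NumPy sketch; the array shapes are assumptions.

```python
import numpy as np

def nmse_db(H, H_hat):
    """NMSE of Eq. (15) in dB, for complex arrays of shape (T, Nc', Nt)."""
    num = np.sum(np.abs(H - H_hat) ** 2, axis=(1, 2))
    den = np.sum(np.abs(H) ** 2, axis=(1, 2))
    return 10.0 * np.log10(np.mean(num / den))

def cosine_similarity(h, h_hat):
    """rho of Eq. (16): per-subcarrier channel vectors of shape (T, Nc, Nt)."""
    inner = np.abs(np.sum(np.conj(h_hat) * h, axis=-1))
    norms = np.linalg.norm(h_hat, axis=-1) * np.linalg.norm(h, axis=-1)
    return np.mean(inner / norms)
```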
Figures 9 and 10 show the relationship between CR and NMSE for all structures in
indoor and outdoor scenarios. Figure 9 shows that the proposed CsiNet-GRUs have
the lowest NMSE, whereas Figure 10 shows that it has the lowest NMSE among others
except for Conv-LSTM CsiNet at CR > 20. Figures 11 and 12 show the relationship
between the CR and accuracy for all structures in indoor and outdoor scenarios.
Figure 9.
NMSE (dB) performance comparison between CS methods INDOOR scenario.
Figure 10.
NMSE (dB) performance comparison between CS methods OUTDOOR scenario.
Figure 11.
Accuracy performance comparison between CS methods INDOOR scenario.
Figure 12.
Accuracy performance comparison between CS methods OUTDOOR scenario.
The CsiNet-GRUs outperform the other structures, with higher accuracy observed at lower CR values. Figures 13 and 14 illustrate the relation between the cosine similarity (ρ) and CR in indoor and outdoor scenarios for all structures. Again, the proposed CsiNet-GRUs outperform the other structures and, in addition, exhibit near-linear performance with the lowest slope.
Figure 13.
ρ Performance comparison between CS methods INDOOR scenario.
Figure 14.
ρ Performance comparison between CS methods OUTDOOR scenario.
Table 2.
Comparison of results between the proposed framework and other available ones (epoch = 1000 iterations in the proposed technique and the previous techniques).
Table 2 reports results at epoch = 1000 (1000 iterations) in terms of correlation and accuracy for the proposed CsiNet-GRUs technique. In terms of the NMSE, the CsiNet-GRUs achieve the lowest values at all compression ratios (CRs), particularly when CR is low.
CsiNet-GRUs have very short run times compared to the LASSO and TVAL3 techniques. However, compared to the other CsiNet-based techniques, CsiNet-GRUs lose time efficiency slightly. It is worth noting that,
despite the addition of significant complexity as a result of the GRU layers, the run
time is still comparable to that of the CsiNet.
Figure 15 compares the reconstruction results of the proposed technique with the other modeling techniques, namely LASSO, TVAL3, CsiNet, and Conv-LSTM CsiNet, in an indoor picocellular scenario. The figure represents the average performance at different CRs, as reflected in the images reconstructed by each technique.
Figure 15.
Reconstruction images for CR in CS algorithms in an indoor scenario.
7. Conclusion
Author details
© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References

[1] Zhang T, Ge A, Beaulieu NC, Hu Z, Loo J. A limited feedback scheme for massive MIMO systems based on principal component analysis. EURASIP Journal on Advances in Signal Processing. 2016;2016. DOI: 10.1186/s13634-016-0364-9

[2] Busari A, Huq KMS, Mumtaz S, Dai L, Rodriguez J. Millimeter-wave massive MIMO communication for future wireless systems: A survey. IEEE Communications Surveys & Tutorials. 2018;20(2):836-869

[3] Tao J, Chen J, Xing J, Fu S, Xie J. Autoencoder neural network based intelligent hybrid beamforming design for mmWave massive MIMO systems. IEEE Transactions on Cognitive Communications and Networking. 2020. DOI: 10.1109/TCCN.2020.2991878

[4] Zhai J, Zhang S, Chen J, He Q. Autoencoder and its various variants. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan. 2018. pp. 415-419. DOI: 10.1109/SMC.2018.00080

[5] Karanov B, Lavery D, Bayvel P, Schmalen L. End-to-end optimized transmission over dispersive intensity-modulated channels using bidirectional recurrent neural networks. Optics Express. 2019;27:19650-19663

[6] Sohrabi F, Cheng HV, Yu W. Robust symbol-level precoding via autoencoder-based deep learning. 2020. pp. 8951-8955. DOI: 10.1109/ICASSP40776.2020.9054488

[7] Liu Z, del Rosario M, Liang X, Zhang L, Ding Z. Spherical normalization for learned compressive feedback in massive MIMO CSI acquisition. 2020. pp. 1-6. DOI: 10.1109/ICCWorkshops49005.2020.9145171

[8] Wen C, Shih W, Jin S. Deep learning for massive MIMO CSI feedback. IEEE Wireless Communications Letters. 2018;7(5):748-751

[9] Liu L, Oestges C, Poutanen J, Haneda K. The COST 2100 MIMO channel model. IEEE Wireless Communications. 2012;19(6):92-99

[10] Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 1998;6(2):107-116

[11] Aleem S, Huda N, Amin R, Khalid S, Alshamrani SS, Alshehri A. Machine learning algorithms for depression: Diagnosis, insights, and research directions. Electronics. 2022;11(7):1111. DOI: 10.3390/electronics11071111

[12] Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. DOI: 10.3115/v1/D14-1179

[13] Dey R, Salem FM. Gate-variants of gated recurrent unit (GRU) neural networks. 2017. pp. 1597-1600. DOI: 10.1109/MWSCAS.2017.8053243

[14] Helmy HMN, Daysti SE, Shatila H, Aboul-Dahab M. Performance enhancement of massive MIMO using deep learning-based channel estimation. IOP Conference Series: Materials Science and Engineering. 2021;1051(1):012029

[15] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;15(1):1929-1958
Chapter 2
Graph Neural Networks and Reinforcement Learning: A Survey
by Fatemeh Fathinezhad, Peyman Adibi, Bijan Shoushtarian and Jocelyn Chanussot
Abstract
Graph neural network (GNN) is an emerging field of research that tries to generalize deep learning architectures to work with non-Euclidean data. Nowadays, combining deep reinforcement learning (DRL) with GNN for graph-structured problems, especially in multi-agent environments, is a powerful technique in modern deep learning. From the computational point of view, multi-agent environments are inherently complex, because future rewards depend on the joint actions of multiple agents. This chapter examines different ways of applying GNN and DRL techniques to the most common representations of multi-agent problems and their challenges. In general, the fusion of GNN and DRL can be addressed from two different points of view. First, GNN is used to influence the DRL performance and improve its formulation. Here, GNN is applied in relational DRL structures such as multi-agent and multi-task DRL. Second, DRL is used to improve the application of GNN. From this viewpoint, DRL can be used for a variety of purposes including neural architecture search and improving the explanatory power of GNN predictions.
1. Introduction
Building an intelligent system that can extract high-level representation from data
is necessary for many issues related to artificial intelligence. Theoretical and biological
arguments show that to build such systems, deep architecture models are needed that
include many layers of non-linear processing units. Before the emergence of deep
learning [1], traditional machine learning approaches depended on representations obtained from the data by feature selection or extraction.
These methods required an expert in the domain of the subject to extract the
features manually. However, this hand-crafted feature extraction is a time-consuming
and sub-optimal process. The emergence of deep learning could quickly replace these
traditional methods because it could automatically extract the features according to
each problem. In recent years, deep learning has become the main motivation for
innovative solutions to artificial intelligence problems. This issue has been made
Recently, GNNs have been offered to model and operate on graphs to reach
combinatorial generalization and relational reasoning. Indeed, GNNs simplify the
learning of relations between entities in a graph and the rules for composing them. A
combination of DRL and GNN can work and optimize problems while generalizing to
unseen topologies. Specifically, the GNN used by the DRL agent is inspired by
message-passing neural networks [8].
Robotics, pattern recognition, recommendation systems, and games are some of
the subjects in which DRL has presented acceptable performance. On the other hand,
GNNs exhibit excellent efficiency in supervised learning for graph-structured data
[9]. DRLs utilize the ability of DNNs to solve sequential decision problems with RL,
and on the other hand, GNNs are new architectures that are suitable for organizing
graph-structured data in this field.
In this survey, an overview of the concepts of GNNs is prepared, and then their
relationship with reinforcement learning (RL) is explained. The rest of this chapter is
structured as follows. A short review of graph neural networks is given in Section 2.
The technical backgrounds of deep reinforcement learning concepts and multi-agent
reinforcement learning are presented in Section 3. The relation between RL and GNN
is presented in Section 4. Finally, the conclusion is provided in the last section.
2. Graph neural networks

Nowadays, many learning problems need to use graph representation to present the
complex relationships between data [10, 11]. Recently, studies on graph models have received more attention due to their great expressive power in social science (social networks) [12–14] and biological science (predicting protein interfaces, bioinformatics analysis, knowledge graphs, modeling physics systems, and classifying diseases) [15–17].
Pairwise message passing is one of the main elements in the structure of GNNs: each node in the graph repeatedly updates its representation by exchanging information with its neighbors until a stable equilibrium is reached. A graph neural network usually contains two parts: the message-passing part, which extracts local structural features around the nodes, and the readout phase, an aggregation part that summarizes the node-level features into a graph-level feature vector.
Representing data as a graph has several advantages, such as a simplified representation of complex problems, systematic modeling of relationships, etc. On the
other hand, working with data with a graph structure using common DNN-based
methods has its own challenges. The variable size of the unordered nodes, the uneven
structure of the graph, and the dynamic neighborhood composition make it difficult
to implement basic mathematical methods such as convolution on the graph. Graph neural networks (GNNs), whose general structure is shown in Figure 1, overcome this defect with the help of new DNN methods on graph-structured datasets. GNN architectures can model structural information and node features. In the following, several well-known graph neural network models are introduced.
For the first time in [18], spectral networks and local deep networks were connected in a graph convolutional network (GCN), a method for semi-supervised learning on graph-structured data.

Figure 1.
Graph neural networks (GNN) framework.

As the number of neighbors that must be calculated grows, the computational cost as well as the amount of occupied memory increases rapidly.
2.3 GraphSAGE
In graph theory, there is a concept called node embedding, which means mapping nodes to an embedding space of lower dimension than the actual dimension of the data defined on the nodes of the graph, in which similar nodes are embedded close to each other in the resulting latent space.
GraphSAGE [22] is an inductive learning technique that exploits node features to learn an embedding function for dynamic graphs. This inductive learning approach is
scalable across graphs of different sizes as well as subgraphs within a given graph. A
new node can be embedded without retraining by the GraphSAGE approach. It uses
aggregator functions to induce new node embeddings based on node features and
neighborhoods.
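To illustrate the aggregator idea, here is a minimal NumPy sketch of one GraphSAGE-style layer with a mean aggregator; the weight names, tanh nonlinearity, and l2 normalization are illustrative choices, not the exact model of [22].

```python
import numpy as np

def graphsage_mean_layer(H, neighbors, W_self, W_neigh):
    """One GraphSAGE-style layer: each node embedding is updated from
    its own features and the mean of its (sampled) neighborhood."""
    out = []
    for v in range(len(H)):
        if neighbors[v]:
            agg = H[neighbors[v]].mean(axis=0)       # mean aggregator over neighbors
        else:
            agg = np.zeros_like(H[v])
        h = np.tanh(W_self @ H[v] + W_neigh @ agg)   # combine self and neighborhood
        out.append(h / (np.linalg.norm(h) + 1e-8))   # l2-normalize the embedding
    return np.stack(out)
```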
In [23], a method for data-driven neighborhood subsampling is defined by a non-linear regressor based on the real-valued importance of each node and its neighborhood. This subsampling helps to embed nodes in the graph using a small set of
neighboring nodes with high importance. The regressor is learned using value-based
reinforcement learning. Here, the negative classification loss output of GraphSAGE is
used to extract this importance.
GraphSAGE-D3QN [24] presents a graph DRL method for emergency control of an undervoltage load-shedding model. Feature extraction of states in this model is performed by a GraphSAGE-based method with topology variation in the training step, and online emergency control is then achieved.
Link prediction [25], node classification [26], clustering [27], etc., are considered graph analysis objectives. In the following, several common GNN goals are described:
Node classification: training models to classify nodes by determining the labels of samples that are represented as nodes. Usually, these problems are solved in a semi-supervised way, with only a part of the graph being labeled.
Graph classification: Graph classification is a task with real applications in social
network analysis, categorizing documents in natural language processing, and classifying proteins in bioinformatics. Graph classification obtains a graph feature that helps discriminate between graphs of different classes.
Graph Visualization: Visual representation of data structures and anomalies with
the help of geometric graph theory and information visualization that helps the user
understand graphs.
Link prediction: predicting the relationship between two nodes and whether two nodes in a network are likely to have a link. An application of this approach is to detect social interactions or suggest potential friends to users on social networks. It has also been used in predicting criminal associations and in recommender system problems.
Graph clustering: clustering on graphs is performed in two ways: either the nodes are clustered into distinct, connected groups based on edge distances and weights, or the graphs themselves are treated as the objects to be clustered and are grouped based on similarity.
3. Deep reinforcement learning

Using DNNs to solve sequential decision issues in the framework of RL led to the
emergence of deep reinforcement learning (DRL) in high-dimensional problems (see
Figure 2). Nowadays, different applications of artificial intelligence have been
enhanced with the help of DRL which includes natural language processing [28],
transportation [29], finance [30], healthcare [31], robotics [32], recommendation
systems [33], and gaming [34]. DRL can be defined as a system that maximizes the
long-term reward in a reinforcement learning problem using representations that are
themselves learned by the deep network. The outstanding success of DRL can be
considered due to the ability of this method to deal with complex problems and
provide efficient, scalable, and flexible computational methods. Also, DRLs have a
high ability to understand the dynamics of the environment and produce optimal
actions according to their interactions with the environment. When dealing with
various high-dimensional problems or continuous states, reinforcement learning suffers from the problem of inefficient feature representation. Therefore, learning time is
slow and techniques should be designed to speed up the learning process. The most
significant feature of deep learning is that DNNs can discover compact representations of high-dimensional data automatically.
Combining DNNs with RL has become more attractive in recent years and has gradually shifted the focus from single-agent environments to multi-agent ones. Working
Figure 2.
Total structure of the combination of GNNs and DRL.
with multiple agents is inherently more complex because future rewards depend on
the joint actions of several agents and the computational complexity of the function
increases. Single-agent environments such as Atari [6] and navigation robots [35], and multi-agent settings such as traffic light control [36], financial market trading [37], and strategy games such as Go, StarCraft, and Dota are some examples developed with DRL.
In DRL, unstructured input data from the state space are given to the network.
This input, such as pixels rendered on the screen in a video game, images from a camera, or the raw sensor stream from a robot, can be very large and high-dimensional.
In the output, the value of an action is determined for the agent to decide what actions
must be performed in the environment to maximize the expected rewards. Since RL methods suffer from the curse of dimensionality, DNNs can find low-dimensional representations (and features) of high-dimensional data automatically. In the following, DRL in the specific scope of multi-agent reinforcement learning is discussed in detail.
3.1.2 Non-stationary
In the most recent research, many MARL methods use GNNs to provide information interactions between agents to complete collaborative tasks and coordinate
actions. In general, not extracting enough useful information from neighboring agents
is one of the problems of simple aggregation in GNNs, which is due to ignoring the
topological relationships in the graph.
To solve this problem, Ding et al. [47] presented a method to extract useful
information from neighboring agents as much as possible in the graph structure,
which has the ability to provide feature representation to complete the cooperation
task. For this purpose, mutual information (MI) is applied to measure the agents' topological relationships and feature information, maximizing the correlation between the input feature information of neighboring agents and the output high-level hidden feature representations.
A GNN architecture for training decentralized agent policies on the perimeter of a
unit circle has been proposed in continuous action spaces [48]. In this approach,
multi-agent perimeter defense problems are solved by learning decentralized strategies with GNNs. Local perceptions of the defenders are considered as inputs in the
learning framework and finally, the model is trained by an expert policy based on the
maximum matching algorithm and returns actions to maximize the number of
captures for the defender team.
The proposed framework [49] used GNNs for value function factorization in
multi-agent deep reinforcement learning. A complete directed graph is designed
by the team of agents as the nodes of the graph, and edge weights are
This section describes different methods for calculating the Q-value function in multi-agent environments. In MARL problems, each agent has a local and private observation of its surrounding space on which it bases its actions. One problem the agent may face is the locality of observation, i.e., not having complete information about the environment. Another problem is the non-stationarity of the environment, because all agents in the environment are learning and show different behaviors during training.
After training completion, only the local actors are used in the execution phase, acting in a decentralized manner.

COMA is a multi-agent policy-gradient-based method for cooperative multi-agent systems that uses a centralized critic to estimate the Q-function and decentralized actors to optimize agent policies. This method also addresses the credit-assignment problem using a counterfactual baseline. Unlike COMA, which uses a centralized critic for all agents, MADDPG has a centralized critic for each agent, allowing different reward functions in competitive environments.
Recent works have built on MADDPG. R-MADDPG [62] extends the MADDPG algorithm to partially observable environments by preserving the history of previous observations in the critic module and by having a recurrent actor. M3DDPG [63] includes minimax optimization for robust policy learning against agents with changing strategies. Actor-critic with mean field [64] factorizes the Q-value function using only interactions with neighboring agents based on mean-field theory, and the idea of dropout can be extended to MADDPG for managing large input spaces [65].
Figure 3.
Schematic structure of deep reinforcement learning agent.
A network is usually used to define the edge weights as the strength of the connection in the coordination graph between each agent and its neighbors. In the next step, the graph convolution layer is applied to perform message passing and information integration across all agents. Finally, the deep Q-network is used to approximate the Q-value function; by taking the maximum output of the Q-network, the next action for the agents is determined.
The embedding layer contains an encoder for the $n$ observations $\{o_1, o_2, \ldots, o_n\}$ of the $n$ agents. The outputs of the encoder are embedding vectors $E_i$ for $i = 1 \ldots n$, as follows:

$$E_i = \mathrm{Encoder}\left(o_i, \theta_E\right) \tag{1}$$
In the local attention layer, the attention weight for two agents $i$ and $j$ in the graph is calculated using the embedding vectors as:

$$At_{ij} = \frac{\exp\left(\mathrm{Attention}\left(E_i, E_j, W_a\right)\right)}{\sum_{k=1}^{n}\exp\left(\mathrm{Attention}\left(E_i, E_k, W_a\right)\right)} \tag{2}$$
$$L_\theta = \frac{1}{b}\sum_{t}\left(y_t - \hat{Q}\left(s_t, a_t, \theta_{predict}\right)\right)^2 \tag{4}$$

where $b$ is the batch size, and $y_t = r_{t+1} + \gamma \max_{a_{t+1}} Q\left(s_{t+1}, a_{t+1}, \theta_{target}\right)$ at time step $t$ is the target of the Q-value function for state $s$ and action $a$ with reward $r$.
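A small NumPy sketch of the attention normalization of Eq. (2) and the Q-learning target used in Eq. (4) is given below; the concatenation-based score standing in for Attention(·,·,Wa) is an assumption, since the chapter does not spell that function out.

```python
import numpy as np

def attention_weights(E, Wa):
    """Eq. (2): softmax-normalized attention between agent embeddings.
    A simple concatenation-based score stands in for Attention(.,.,Wa)."""
    n = len(E)
    scores = np.array([[np.tanh(np.concatenate([E[i], E[j]])) @ Wa
                        for j in range(n)] for i in range(n)])
    ex = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    return ex / ex.sum(axis=1, keepdims=True)

def q_targets(r_next, q_next, gamma):
    """Targets y_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a; theta_target), Eq. (4)."""
    return r_next + gamma * q_next.max(axis=1)
```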
4. The relation between RL and GNN

In general, the combination of GNN and DRL can be addressed from two different
points of view. From one perspective, GNN is used to advance the formulation and
performance of DRL and specifically, when GNN has been used for relational DRL
problems. The successful modeling for this relationship can be defined among (1)
different agents in a multi-agent deep reinforcement learning (MADRL) framework,
and (2) different tasks in a multi-task deep reinforcement learning (MTDRL) framework [70].
From another perspective, DRL can be used to improve the performance of GNN. DRL is used to improve the explanatory power of GNN predictions, for neural architecture search (NAS) [71], and to design adversarial examples for GNNs. NAS is the process
of automatically searching for the optimal architecture of a particular neural network
to solve a problem, which includes finding the number of layers, the number of nodes
in the layer, etc. In GraphNAS [72], the RL algorithm helps to search in the graph
neural architectures. GraphNAS represents a search space for covering sampling
Inspired by this idea [80], a model is presented in [81] that controls connected autonomous vehicles (CAVs) as multi-agents using GNN and RL for cooperation between them. Local information reaches a connected autonomous vehicle through the onboard sensing of nearby human-driven vehicles (HDVs), while global information is obtained from other connected autonomous vehicles via connectivity channels. This information helps to define the graph structure. Within the local network, information passes from HDVs to CAVs. Through the global network, all the CAVs can share knowledge, including locally sensed information and their own state. Here, the environment contains a variable number of agents and produces a variable-length output that matches the CAVs' driving operations. Due to the variable number of agents, it is difficult to jointly train each agent with its own distinct Q-network. Joint training is also not scalable, because as the number of agents increases, the number of parameters of the distinct Q-networks grows exponentially. One efficient method for solving these challenges is to apply a shared centralized Q-network that determines the actions of all agents. Using the combination of a GCN and a deep Q-network enables collaborative and safe control of lane-changing decisions in different traffic conditions.
5. Conclusion
In this survey, we tried to summarize GNNs and RL and their relations. We gave an overview of the challenges inherent in graph neural networks and multi-agent environments. Learning in collaborative multi-agent environments with dynamic, non-deterministic, and large state spaces has become a very important challenge in applications. Among these challenges, we can mention the effect of the size of the state space on the duration of learning, as well as inefficient cooperation and the lack of proper coordination in decision-making between the agents. Also, when using reinforcement learning algorithms with the graph structure, the models will face challenges such as the difficulty of determining the appropriate learning goal and the long convergence time caused by trial-and-error-based learning. The integration of these methods leads to more realistic scenarios and more effective solutions to real-world problems. Researchers in this field have a significant impact on the progress of the combination of GNNs and DRL by providing newer models and architectures.
Author details
© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[9] Hwang D, Yang S, Kwon Y, Lee KH, Lee G, Jo H, et al. Comprehensive study

[16] Veselkov K et al. Hyperfoods: Machine intelligent mapping of cancer-

Section 2
Applications of Deep Learning and Reinforcement Learning
Chapter 3
IoT Device Identification Using Device Fingerprint and Deep Learning
by Prashant Baral, Ning Yang and Ning Weng
Abstract
1. Introduction
addresses got spoofing [1]. The gravity of the impact a breach in IoT has on varied fields is substantial, and we need to come up with an appropriate security mechanism to reduce the risk of data being compromised by IoT device forgery.
Different metrics can be used for device identification such as IP address, MAC
address, IMEI address, and other network parameters such as transmission time,
transmission rate, inter-arrival time, and medium access time. Comparisons of different metrics for device identification are given in Table 1. Parameters such as MAC address and IP address are easier to spoof, so studies have focused on finding the important parameters that can distinguish devices. In [2], transmission time,
transmission rate, inter-arrival time, and medium access time have been compared.
IAT and transmission time outperform the other parameters in device identification.
In this paper, we worked on the deep learning approach for device identification.
A device fingerprint is created from the parameters extracted during the communication of a device with the router. This device fingerprint is used to train the deep learning
model and device identification.
We fingerprint a device using IAT, round-trip time (RTT), and their outliers and feed them to deep learning models for device identification. These parameters are easy to extract and are not easily spoofed once the device fingerprint has been created from them. Timestamps (from which IAT and transmission time are extracted) are generated at the receiver side, which makes them harder to sniff and spoof: adversaries would need to change their own behavior to manipulate these parameters. IAT and RTT vary across devices due to differences in CPU configuration and clock frequency; they depend on cache configuration, data cache, instruction cache, clock frequency, buses, and the NIC card. These hardware configurations affect packet transfer rates. Attackers might try to emulate a signature using techniques such as (1) introducing delays in packets, (2) changing the data rate, and (3) building a customized operating system. Even with such techniques, an attacker cannot successfully emulate the device, since the attacker must spoof a signature while also hiding its own original signature.
We use deep learning to extract knowledge from the data; it allows us to better understand and simulate the system model. CNNs learn the semantics and patterns in image graphs, and LSTM is recognized as a good algorithm for the classification of time-series data, so we use these two deep learning algorithms for the classification of devices. In earlier research, statistical tools such as the Mann-Whitney U-test were used, but these methods require considerable time and effort.
Table 1.
Comparison of parameters for fingerprinting. (Only one row of the table survived extraction: header generation occurs at the sender wireless card for the first two parameters and at the receiver wireless card for the last two.)
• different parameters (IAT and RTT) for creating unique signatures, which are used separately for training deep learning models.
• a comparison of how well the deep learning algorithms classified devices, using different metrics.
2. Related work
The use of IP address, MAC address, and IMEI number for device identification brings a significant risk that critical information, and the device itself, will be compromised. This has prompted researchers to develop flexible and effective techniques for device identification [1, 6, 7]. For example, a new identity stack for IoT [1] has been proposed, since IoT identity differs from the traditional identity of network devices, along with a survey of attribute-based authentication for the identity of IoT devices.
Neumann et al. [2] survey different MAC-layer features, such as transmission rate, transmission time, and inter-arrival time, and evaluate them on two criteria for effectiveness: similarity of a device's fingerprint at different times and dissimilarity of the fingerprints of two different devices. In [2], the authors use the IATs of packets from wireless devices to create digital fingerprints, building a histogram where each bin gives the frequency of IATs in a specified range. The histogram is the fingerprint used for classifying a device and for identifying known and unknown devices in the database. The authors tested a scenario in which a malicious user tries to emulate a known device by introducing delays into packets, and concluded that differences in software and hardware make such emulation difficult. While [2] uses a passive approach to fingerprinting, Radhakrishnan et al. [5] extended the work using an active approach. In the passive approach, we simply observe the wireless communication to/from the device and use the important features of packets; in the active approach, we inject a signal to elicit a response from the device and extract useful features. Sandhya et al. [8] used a CNN but considered all types of packets flowing from devices to the access point (AP) for device classification. This may be practical, but a lower accuracy of 86% can be problematic from a security point of view.
In [5], the authors used a ping application to communicate with devices on campus. In [9], the authors used the IAT of probe requests to fingerprint devices and applied the Mann-Whitney U-test to determine whether two samples come from the same distribution. Miettinen et al. [10] used 23 features, such as ICMP, TCP, HTTP, and packet size, drawn from different layers (data link, network, transport, and application). The work collects these 23 features for 12 packets and uses a random forest algorithm for classification. An accuracy of 95% was obtained for 17 of 27 devices and about 50% for the remaining 10.
Robyns et al. [11] introduce the idea of noncooperative MAC-layer fingerprinting, which does not require cooperation with the device: adversary nodes at a monitoring station capture and monitor the bits of MAC frames without the user's permission. This hampers the privacy of the user but provides security against outside attacks. The accuracy for classification of 50 to 100 devices was between 67% and 80%, but it decreased rapidly to between 33% and 15% as the number of devices increased.
Kohno et al. [12] used clock skew for fingerprinting devices, measuring it from the time differences of timestamps in Tcpdump traces, and considered a scenario in which IP addresses changed during data collection. Maurice et al. [13] used probe requests and responses for fingerprinting, but the results were not promising for similar devices. Cunche et al. [14] used a probe request from an AP to obtain, in response, the list of wireless networks a device had connected to, and exploited this vulnerability to identify people from that list. Francois et al. [15] used behavioral fingerprinting to automatically disconnect a device showing suspicious activity and ask it to reconnect based on its behavioral fingerprint. Sun et al. [16] used fingerprinting for indoor and outdoor localization of devices connected to a Wi-Fi AP.
Xu [1] studied the challenges and opportunities in digital fingerprinting for both wired and wireless devices, extracting features from the physical and MAC layers such as clock skew, IAT, transmission time, SSID, and frequency. The work concluded, based on accuracy, that IAT and transmission time are good parameters for device classification.
Kulin et al. [17] used algorithms such as k-NN, decision tree, logistic regression, and neural networks for device classification on publicly available datasets. The performance of k-NN, decision tree, and logistic regression was good, but the neural networks performed worse than the other classification algorithms, with an average precision of 0.47 and recall of 0.46. It is commonly expected that neural networks should outperform the others, but that was not the case in this work.
3. Device fingerprinting
We set up the devices in the lab to extract information about them. First, we set up a Raspberry Pi as a router. Next, we use a Samsung A20 and a Samsung J5 Prime as edge devices (target IoT devices). Wireless communication between the edge devices and the router was recorded: Wireshark, running on the Raspberry Pi, captures incoming and outgoing packets. These captured packets are used to calculate the IATs/RTTs of packets and to plot IAT, RTT, and IAT outlier graphs. These graphs are used as datasets to train and test the model. A Python program is used to plot, label, and split the dataset; the training split trains the deep learning model, and the testing split validates it. Our overall methodology is depicted in Figure 1 and explained in detail in the subsections of this section and in Section 4.
Our setup has the Raspberry Pi as a router and the phones as edge devices. The Raspberry Pi (acting as a router) broadcasts an access point. The packets sent from the edge devices are captured at the router, where the packet sniffing tool is installed. Wireshark, installed on the Raspberry Pi, inspects, deciphers, and keeps track of all packets coming into and going out of it. As many packets may arrive at the router, we use a filter to find the required packets. We collected the data in two ways: (1) probe request and response and (2) ping request and response.
Probe requests are packets broadcast by wireless devices, containing their supported data rates and capabilities. The access point receives these requests and responds with packets containing the SSID, supported data rates, encryption type, etc. We used Wireshark to passively sniff these packets at the router and used them to build IAT graphs.
Ping sends an ICMP echo request packet to a device on the network and waits for the response from the target device. In our setup, we ping the edge device, and the edge device responds to the router. This ICMP packet exchange is passively observed and recorded by Wireshark, and the data is used for building RTT graphs.
The data collected by the sniffing tool must be processed to obtain IAT and RTT. The data obtained with the sniffing tool on the Raspberry Pi consists of timestamps of incoming and outgoing packets; we process these timestamps to calculate the IAT and RTT of packets.
Figure 1.
Methodology.
After we obtain the IAT and RTT values of packets, we write a Python program to plot and save the graphs. IAT and RTT graphs are plotted as line graphs of 100 IATs/RTTs. The plots of IAT and RTT are shown in Figures 2–4. We use IAT and RTT separately for device identification.
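The chapter does not reproduce its plotting program, so the following is a minimal sketch, assuming timestamps exported from Wireshark as a plain list of arrival times in seconds, of how the IATs can be computed and rendered as 100-point line graphs (RTT is handled analogously from request/response timestamp pairs). The function and file names here are illustrative, not the authors' code.

import matplotlib.pyplot as plt

def inter_arrival_times(timestamps):
    # IAT = difference between consecutive packet arrival timestamps
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def plot_iat_graphs(timestamps, device_label, points_per_graph=100):
    # Render consecutive windows of 100 IATs as separate line-graph images;
    # the class label (0 = Samsung A20, 1 = Samsung J5 Prime) is encoded
    # in the file name for later labeling and splitting.
    iats = inter_arrival_times(timestamps)
    for start in range(0, len(iats) - points_per_graph + 1, points_per_graph):
        window = iats[start:start + points_per_graph]
        plt.figure()
        plt.plot(window)
        plt.xlabel("packet index")
        plt.ylabel("IAT (s)")
        plt.savefig(f"{device_label}_iat_{start // points_per_graph}.png")
        plt.close()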
The images obtained by plotting the graphs must be labeled before the data is used for training and testing the model. We label the data using Python: for the two phones, 0 represents the Samsung A20 and 1 represents the Samsung J5 Prime. We split the images into training and testing data. For each of IAT and RTT, we use 75 images for training and 30 for testing per device (150 for training and 60 for testing in total). After creating and labeling the images, we apply the CNN and CNN + LSTM algorithms for image classification.
Figure 2.
IAT graph from our setup.
Figure 3.
IAT graph from verification dataset.
We use the IAT dataset from CRAWDAD developed by Uluagac et al. [4]. The dataset is a collection of IATs of different devices; we use four devices, two iPads and two Dell notebooks, for the verification of the models. First, we use ICMP packets for generating the IAT graphs. Since we compare classification using a single packet type, multiple packet types, and outliers, we also use TCP, UDP, and ICMP packets for generating IAT graphs and outliers. We plot each graph using 100 IATs.
Figure 4.
RTT graph from our setup.
As in our setup, we label zero for Dell notebook 1, one for Dell notebook 2, two for iPad 1, and three for iPad 2, and split the data into training and testing datasets.
The created images are colored, but for this classification problem, we convert them to grayscale and reduce the image size from the initial 800 × 800 to 256 × 256. We then split the labeled data into training and testing datasets and use the training set to train the CNN model. The first convolution layer of our CNN has 32 filters and a kernel size of 5 × 5, with input size 256 × 256 × 1. Next, we use max-pooling with stride 2, which reduces the number of parameters by selecting the maximum of four values (2 in the x-direction and 2 in the y-direction). The next convolution layer has 64 filters and a kernel size of 3 × 3; its input shape is inferred by Keras. We again use max-pooling with stride 2. The third convolution layer has 128 filters and a kernel size of 2 × 2, again followed by max-pooling with stride 2. All convolution layers use the rectified linear unit (ReLU) activation function. Next, we use a flatten layer and two dense layers with 128 and 64 nodes, followed by a dense layer with four nodes and softmax activation. Figure 5 shows the model summary of the CNN. The model is compiled using categorical cross-entropy as the loss and Adam as the optimizer. We use both the IAT and RTT data to train the CNN model and evaluate its classification with different metrics. Furthermore, we use IAT outliers for classification. The number of nodes and epochs is adjusted when training on different datasets.
Figure 5.
CNN model summary.
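Reconstructed from the description above, a sketch of this CNN in Keras looks as follows; details the text does not state, such as padding, are left at Keras defaults and are our assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (5, 5), activation="relu", input_shape=(256, 256, 1)),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Conv2D(64, (3, 3), activation="relu"),   # input shape inferred by Keras
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Conv2D(128, (2, 2), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),          # 4 classes for the verification dataset
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])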
Figure 6.
CNN + LSTM model.
the same filter to the n images. We use the same three identical CNN layers, but wrapped in TimeDistributed, as illustrated in Figure 6. The input to the first layer is n × 256 × 256 × 1; the other input shapes are managed by Keras. This model has an additional LSTM layer with 32 nodes after the CNN layers. The output of MaxPool2D is flattened to obtain a single vector, which is fed to the LSTM and a dense layer. Figure 7 shows the model summary of CNN + LSTM.
Figure 7.
CNN + LSTM model summary.
LSTM makes use of chronological data and previous-frame data to find what is useful for prediction. The model is compiled using categorical cross-entropy as the loss and Adam as the optimizer. We use a combination of CNN and LSTM and observe how good a prediction the model can make. The number of nodes and epochs is adjusted when training on different datasets.
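A sketch of the CNN + LSTM model as described, again reconstructed from the text; n, the number of consecutive graph images per input sequence, is a hypothetical value the chapter does not state.

import tensorflow as tf
from tensorflow.keras import layers, models

n = 5  # hypothetical number of graph images per input sequence

cnn = models.Sequential([                           # same three CNN layers as before
    layers.Conv2D(32, (5, 5), activation="relu", input_shape=(256, 256, 1)),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Conv2D(128, (2, 2), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Flatten(),                               # one feature vector per image
])

model = models.Sequential([
    layers.TimeDistributed(cnn, input_shape=(n, 256, 256, 1)),  # same filters applied to all n images
    layers.LSTM(32),                                # 32-node LSTM over the image sequence
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])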
Evaluation of the model is an important task in data science. We need to make sure our model is not overfitted. Overfitting is a modeling error in statistics that occurs when the model aligns too closely with a limited set of data points. There are different techniques to prevent overfitting; the ones we use are reducing the learning rate and a dropout layer. While training the model, we can monitor the validation loss and, if it does not improve for a certain number of epochs, reduce the learning rate by a certain factor. Below is the snippet for reducing the learning rate, where we monitor the validation loss and reduce the learning rate by a factor of 0.1 when it has not improved for 3 consecutive epochs:
tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, verbose=0, min_lr=1e-6)
Similarly, a dropout rate can be specified for a layer as the probability of setting each input to the layer to zero. Below is the code for adding a dropout layer; the rate is set to 0.3, which randomly drops 30% of the input units:
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.3))
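For completeness, a hypothetical training call wiring the learning-rate callback into fitting; x_train, y_train, x_val, and y_val are placeholders for the labeled graph images and are not from the chapter.

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=3, verbose=0, min_lr=1e-6)

model.fit(x_train, y_train,
          epochs=10,
          validation_data=(x_val, y_val),
          callbacks=[reduce_lr])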
The most common metric used for the evaluation of the algorithm is classification
accuracy. Classification accuracy is equal to the number of correct predictions made
divided by the total number of predictions made.
In our case, we use categorical cross-entropy as the loss, which uses the predicted probability of belonging to each class:

$$\text{Classification loss} = -\sum_{i=1}^{\text{output size}} y_i \log f(s)_i \tag{1}$$
where $y_i$ is the class label and $f(s)_i$ is the predicted probability of belonging to that class. We also need to control the number of passes over the training data, called epochs. Too much training can cause the network to overfit the training data: if, over a number of epochs, the validation error increases while the training loss decreases or remains constant, we can conclude that our model is overfitting, as shown in Figure 8.
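As a worked instance of Eq. (1), with an illustrative one-hot label and softmax output (not values from the chapter):

import numpy as np

y_true = np.array([0, 0, 1, 0])               # one-hot label: class 2
y_pred = np.array([0.05, 0.05, 0.85, 0.05])   # softmax probabilities
loss = -np.sum(y_true * np.log(y_pred))       # = -log(0.85), approximately 0.1625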
5. Results
Our setup has the Samsung A20 and Samsung J5 Prime phones communicating with the Raspberry Pi. As described in Section 3.2, we created IAT graphs using probe requests and responses from these devices to the Raspberry Pi, prepared the data for feeding to the CNN, and evaluated the model. We trained the CNN model as in Section 4.A for 10 epochs and obtained an accuracy of 1.00 and a loss of 0.0021 on the training data; accuracy on the validation dataset was likewise 1.00 with a loss of 0.0021. Using the IAT graphs with the CNN + LSTM model and running for 30 epochs, the accuracy and loss were 1 and 0.0015 on the training dataset and 1 and 0.0011 on the validation dataset. Similarly, we created RTT graphs using ping as in Section 3.2 and trained for 10 epochs with the CNN and 40 epochs with the CNN + LSTM, achieving 100% classification accuracy in both.
Figure 8.
Model loss.
We used the IAT dataset from CRAWDAD developed by Uluagac et al. [4] for verification, with ICMP packets from two Dell notebooks and two iPads communicating in a local area network. Using the CNN for classification and running for 10 epochs, we achieved an accuracy of 1 and a loss of 1.4 × 10⁻⁴ on the training dataset, and an accuracy of 0.97 and a loss of 0.1326 on the validation dataset.
Figure 9.
Accuracy using IAT(ICMP) as parameter from verification dataset using CNN.
Figure 10.
Loss using IAT(ICMP) as parameter from verification dataset using CNN.
Figures 9 and 10 show the learning curves of the CNN model. Using CNN + LSTM for classification and running for 35 epochs, we achieved an accuracy of 0.9463 and a loss of 0.1906 on the training dataset, and an accuracy of 0.9060 and a loss of 0.3115 on the validation dataset. Figures 11 and 12 show the learning curves of the CNN + LSTM model.
Figure 11.
Accuracy using IAT(ICMP) as parameter from verification dataset using CNN + LSTM.
Figure 12.
Loss using IAT(ICMP) as parameter from verification dataset using CNN + LSTM.
After analyzing the IAT graphs, we found a regular pattern of outliers and considered whether the outliers in the IAT graph could better classify a device with these deep learning algorithms. We used the IAT outliers of the verification dataset for four devices: two Dell notebooks and two iPads. There is inter-burst latency between packets, and we utilize these outliers for classification. We plotted outlier graphs for the four devices, using a separate threshold for each, and applied the CNN and CNN + LSTM algorithms for classification. We used the same CNN configuration (convolution layers, input size, activation function, number of layers, etc.) for classification with the IAT outlier graphs. We ran the model for 10 epochs and achieved an accuracy and loss of 0.9981 and 0.0079, and a validation accuracy and loss of 0.9648 and 0.1397, respectively. Figures 13 and 14 show the learning curves of the CNN model trained on the outlier dataset. We also used the same CNN + LSTM configuration (convolution layers, LSTM layer, activation function, number of layers, etc.) for classification with the IAT outlier graphs. We ran the model for 15 epochs and achieved an accuracy and loss of 0.9870 and 0.0520, and a validation accuracy and loss of 0.9574 and 0.1422, respectively. Figures 15 and 16 show the learning curves of the CNN + LSTM model trained on the outlier dataset.
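The chapter does not give the outlier-selection code; a minimal sketch under the stated assumption of a per-device threshold might look as follows. The threshold values and device names are hypothetical.

def iat_outliers(iats, threshold):
    # keep only inter-burst latencies above the device-specific threshold
    return [v for v in iats if v > threshold]

# hypothetical per-device thresholds in seconds; iats would be the list of
# IATs for one device, computed as in Section 3
thresholds = {"dell_nb1": 0.5, "dell_nb2": 0.5, "ipad1": 0.8, "ipad2": 0.8}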
To validate the improvement in classification obtained with a single packet type (ICMP/probe request) in our work, we also classified the devices using TCP, UDP, and ICMP packet types from the same CRAWDAD IAT dataset, as in [8]. The IAT graphs generated for these packet types were used together for training and testing the model. We trained the CNN model for 16 epochs and placed a dropout layer after the flatten layer to prevent overfitting. We used 18,000 training images and 6000 testing images and obtained an accuracy of 0.9656 and a loss of 0.0894; the validation accuracy and validation loss were 0.9290 and 0.3073, respectively.
Figure 13.
Accuracy using IAT (ICMP) outlier graph from verification dataset using CNN.
Figure 14.
Loss using IAT (ICMP) outlier graph from verification dataset using CNN.
Figures 17 and 18 show the learning curves of the CNN model using IAT image graphs generated from TCP, UDP, and ICMP packet types in the verification dataset.
Again, for these packet types, we considered the outliers and classified the devices using the IAT outliers. We trained the CNN model for 20 epochs and put the
Figure 15.
Accuracy using IAT (ICMP) outlier graph from verification dataset using CNN + LSTM.
Figure 16.
Loss using IAT (ICMP) outlier graph from verification dataset using CNN + LSTM.
dropout layer after the flatten layer to prevent overfitting. We used 5440 training images and 1700 testing images and obtained an accuracy of 0.8888 and a loss of 0.2704; the validation accuracy and validation loss were 0.8504 and 0.4344, respectively.
Figure 17.
Accuracy using IAT (TCP, UDP, ICMP) as parameter from verification dataset using CNN.
Figure 18.
Loss using IAT (TCP, UDP, and ICMP) as parameter from verification dataset using CNN.
Figures 19 and 20 show the learning curves of the CNN model using IAT outlier image graphs generated from TCP, UDP, and ICMP packet types in the verification dataset.
Figure 19.
Accuracy using IAT(TCP, UDP, and ICMP) outlier graph from verification dataset using CNN.
Figure 20.
Loss using IAT(TCP, UDP, and ICMP) outlier graph from verification dataset using CNN.
5.1 Comparison of models and parameters for IAT outlier graphs and IAT graphs
from verification dataset
The summary of the models and parameters is shown in Table 2.

Table 2.
Performance of models in terms of validation accuracy and validation loss using the verification dataset (entries the text does not report are marked "-").

Model      | IAT graphs (ICMP)    | IAT outlier graphs (ICMP) | IAT graphs (TCP, UDP, ICMP) | IAT outlier graphs (TCP, UDP, ICMP)
           | val. Acc / val. Loss | val. Acc / val. Loss      | val. Acc / val. Loss        | val. Acc / val. Loss
CNN        | 0.97 / 0.1326        | 0.9648 / 0.1397           | 0.9290 / 0.3073             | 0.8504 / 0.4344
CNN + LSTM | 0.9060 / 0.3115      | 0.9574 / 0.1422           | - / -                       | - / -

When we used IAT graphs, the validation accuracy is 0.97 for CNN, which is better than CNN + LSTM, where the validation accuracy is 0.9060. When we used the IAT outlier graphs, the validation accuracy is 0.9648 for CNN and 0.9574 for CNN + LSTM. We observe that classification accuracy is similar for CNN whether the IAT graph or the IAT outlier graph is used, but for CNN + LSTM, accuracy is lower when using the IAT graph than when using the IAT outlier graph.
We noticed that the combination of CNN and LSTM cannot outperform the CNN-only model. The first reason is that the input to the LSTM is a flattened version of the CNN's output rather than a true time series, so the time dependence captured by the LSTM may not reflect the relationship among the input images. The second reason is that the LSTM layer used in the experiments has a small output size, so some valuable information may be lost.
6. Conclusion
In this work, we classified devices using two parameters, inter-arrival time (IAT) and round-trip time (RTT), and two deep learning algorithms, a CNN and a combination of CNN and LSTM. We used IAT and RTT image graphs as device fingerprints and modeled them with the two deep learning algorithms. We captured packets with a packet sniffing tool on the Raspberry Pi (router) in two different setups; IAT and RTT were recorded for each device by the sniffing tool in real time. The security threat posed by adversaries who forge an IoT device makes device identification a fundamental problem. The dynamic parameters we used depend on hardware and software (CPU cache, data cache, clock frequency, etc.), which makes it harder for intruders to recreate the fingerprint of a device. We used deep learning to extract knowledge from the data: the widespread recognition of CNNs as good algorithms for image classification encouraged us to use one, and since LSTM is known for classifying time-series data, we combined CNN and LSTM because we were training on image graphs of time-series data. Our approach can be used to detect a malicious user if we store the fingerprints and match the fingerprint of a device trying to connect to the network before allowing it to connect. It offers an alternative to using IMEI, IP, and MAC addresses, cryptographic security, and digital certificates for device identification, which are prone to spoofing.
We used two different parameters and obtained good accuracy in our real setup. We also verified our models using the publicly available dataset for a single ICMP packet type and achieved validation accuracies of 0.97 for CNN and 0.9060 for CNN + LSTM. We compared the two deep learning algorithms for device identification: both models performed well on the dataset generated from our setup, but on the CRAWDAD dataset, CNN was more accurate than CNN + LSTM. We further used IAT outlier graphs for classification and achieved validation accuracies of 0.9648 for CNN and 0.9574 for CNN + LSTM. To validate the improvement in classification accuracy from using the ICMP packet type, we also classified the devices using TCP, UDP, and ICMP packet types from the verification dataset, and achieved better accuracy using the single ICMP packet type.
We collected RTT data in our setup and achieved good classification accuracy. In the future, we can collect RTT data in a real scenario with many devices and use it for classification.
Acknowledgements
This work is supported in part by the US National Science Foundation under Grant CC-2018919. Besides the NSF grant support, Dr. Yang's work is also supported in part by the new-hire startup fund from Southern Illinois University Carbondale.
Conflict of interests
The authors declare that there are no conflicts of interest regarding the publication
of this article.
References
[4] Uluagac AS. CRAWDAD dataset gatech/fingerprinting (v. 2014-06-09). 2014. Available from: https://crawdad.org/gatech/fingerprinting/20140609
[5] Uluagac AS, Radhakrishnan SV, Corbett C, Baca A, Beyah R. A passive technique for fingerprinting wireless devices with wired-side observations. In: 2013 IEEE Conference on Communications and Network Security (CNS). Washington, D.C., USA: IEEE; 2013. pp. 305-313
[6] Hamad SA, Zhang WE, Sheng QZ, Nepal S. IoT device identification via network-flow based fingerprinting and learning. In: 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). Rotorua, New Zealand: IEEE; 2019. pp. 103-111
[7] Mazhar N, Salleh R, Zeeshan M, Hameed MM. Role of device
[10] Miettinen M, Marchal S, Hafeez I, Asokan N, Sadeghi A-R, Tarkoma S. IoT Sentinel: Automated device-type identification for security enforcement in IoT. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). Atlanta, USA: IEEE; 2017. pp. 2177-2184
[11] Robyns P, Bonné B, Quax P, Lamotte W. Noncooperative 802.11 MAC layer fingerprinting and tracking of mobile devices. Security and Communication Networks. 2017;2017:1-21
[12] Kohno T, Broido A, Claffy KC. Remote physical device fingerprinting. IEEE Transactions on Dependable and Secure Computing. 2005;2(2):93-108
[13] Maurice C, Onno S, Neumann C, Heen O, Francillon A. Improving 802.11 fingerprinting of similar devices by cooperative fingerprinting. In: 2013 International Conference on Security and Cryptography (SECRYPT). Reykjavik, Iceland: IEEE; 2013. pp. 1-8
Chapter 4
1. Introduction
Obesity is a growing global epidemic: more than 2 billion adults (aged 18 years and above) are overweight, of whom 650 million are obese [1]. Anthropometric measurements such as waist-to-hip ratio, body mass index (BMI), and waist circumference do not explicitly distinguish fat mass or the quantity of fat present in the visceral and subcutaneous compartments. The literature highlights that accumulation of fat leads to insulin resistance and to oncologic and cardiovascular diseases [2–4], affecting quality of life. Hence, body composition analysis to determine the amount of adipose and muscle tissue is of medical importance for obesity risk analysis. Magnetic resonance imaging (MRI) and computed tomography (CT) can characterize fat and non-fat tissues [5]; among the imaging modalities, MR is more efficient than CT in tissue characterization for quantifying body fat volume [6, 7]. By quantifying different fat compartments from imaging scans, we can perform body composition analysis. Manual quantification of fat and muscle volumes from imaging scans is tedious and time-consuming, leading to a loss of clinical man-hours.
Anatomically, the subcutaneous adipose tissue compartments (superficial: SSAT; deep: DSAT) are separated by a thin fascia, whereas the visceral adipose tissue (VAT) is found between the internal and external abdominal boundaries. VAT surrounds the internal organs and is discontinuous, whereas SAT (SSAT + DSAT) is continuous. Fat depots are irregular in shape, lack texture, and vary across the abdominal profile, as demonstrated in Figure 1, making this a challenging medical image segmentation task. Several semi-automated methodologies have been developed to reduce time and bias [8–12], but they are less reliable and offer low accuracy, as they depend on expert knowledge to fine-tune image parameters.
Deep learning for image segmentation [13] has found many applications in medical image analysis, one of which is abdominal fat compartment segmentation. Several fat quantification studies use single-contrast DIXON MR scans and 2D/3D U-Net architectures [14, 15] for SAT and VAT segmentation. Enhanced versions of the standard U-Net, such as the Competitive Dense Fully Convolutional Network (CDFNet), nnU-Net, and Dense Convolutional Network (DCNet), which can handle complex image features, have been used for adipose tissue segmentation [16–18]. The attention gate (AG) model in 2D and 3D U-Net [19] has gained popularity in adipose tissue segmentation, as AGs focus on target structures of varying shapes and sizes by suppressing irrelevant regions and highlighting useful salient features [20, 21].
Figure 1.
Illustration of fat depots of SSAT (red), DSAT (green), and VAT (blue) varying in shape and size across the abdominal profile.
Ibtehaz et al. proposed a MultiRes block to address multiscale issues and a ResPath to reduce the adverse learning of features through the U-Net skip connections, which can lead to false predictions [22].
In our previous work on adipose fat depot segmentation, we proposed a patch-based 3D-ResUNet Attention model [23]. The patch-based framework failed to (i) handle different body compositions, such as lean and moderately obese, due to fixed patch sizes, and (ii) generalize to segmentation of unseen abdominal regions due to catastrophic forgetting, anatomical differences, and class imbalance. Figure 2 illustrates a few failed cases from our previous work. To overcome these drawbacks, we focused on enhancing MultiResUNet [23] by proposing a MultiRes-Attention U-Net architecture, with
ii. attention gates for focused learning and improved prediction accuracy.
Data sets of 190 elderly Asians (aged >50 years, residing within the community) who participated in a study characterizing early sarcopenia to assess functional decline were used in our study [24].
Figure 2.
Illustration of failed cases of our previous work on patch-based 3D-ResUNet attention vs. the proposed architecture.
The MR abdominal scans were acquired using a 3D modified breath-hold T1-weighted Dixon sequence, with subjects advised to hold their breath for 20 s during the scans. The scans were performed on a 3T Siemens Magnetom Trio MRI scanner with TR/TE/FA/bandwidth of 6.62 ms, 1.225 ms, 100, and 849 Hz/pixel, respectively. The study group was of mainly Chinese (91.6%) ethnicity, with mean age 67.85 ± 7.90 years, BMI 23.75 ± 3.65 kg/m², and predominantly female (69.5%) subjects. As the study subjects were elderly, many had common comorbidities such as hypertension, diabetes, and hyperlipidemia. The National Healthcare board reviewed the cohort study, with written consent from all subjects.
The data set can be considered heterogeneous, as it included (i) subjects of different ages, (ii) scans covering different anatomical regions (thoracic, lumbar, and sacral), (iii) variations in fat accumulation in different compartments based on body composition, and (iv) acquisition variations such as image dimensions, slice thickness, and breathing/motion artifacts.
Manual ground truths (by radiology experts) were generated for 26 of the 190 data sets, covering the L1–L5 regions. The data with ground truths were subjected to MR-acquisition-based data augmentation to scale the number from 26 to 130 and create the training data sets.
2.3 Preprocessing
All training/testing data were subjected to a quality check to assess motion artifacts originating from breathing, as well as fat-water swaps. An auto-check was developed to ensure that training dataset slices match the marked ground-truth slices. Arm-region artifacts were removed automatically using the projection method [21]. Four different data augmentations were performed once before training to increase the total number of datasets: (i) random noise, (ii) random ghosting, (iii) random bias field, and (iv) blur augmentation [23]; see the sketch after this paragraph. Finally, the 3D MR scans were converted to 2D slices for training/testing the proposed deep learning architecture.
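The chapter does not name the augmentation library; TorchIO is one Python library that implements all four MR-specific transforms, so a sketch under that assumption could be:

import torchio as tio

augment = tio.Compose([
    tio.RandomNoise(),      # (i) random noise
    tio.RandomGhosting(),   # (ii) random MR ghosting artifacts
    tio.RandomBiasField(),  # (iii) random bias-field inhomogeneity
    tio.RandomBlur(),       # (iv) blur augmentation
])

scan = tio.ScalarImage("fat_only_dixon.nii.gz")  # hypothetical file name
augmented = augment(scan)                        # one augmented copy of the 3D scan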
In a standard deep convolutional network, the input data goes through multiple convolutions to obtain salient spatial features, which can lead to the vanishing gradient problem. Architectures like ResNet [25] adopt summation connections over all preceding feature maps, leading to memory-demanding networks. DenseNet [26] introduces "dense connections," where each layer in the network is connected to every other layer instead of only to the previous layer as in standard architectures, but it fails to handle the multi-scale issue. We therefore need a memory-efficient design that handles the multi-scale nature of fat depots, which vary in shape and size, and improves semantic segmentation.
Two sequential convolutional layers at each level of the U-Net [24] are substituted with the proposed MultiRes block (similar to the dense block in DenseNet [26]) with a residual path (as in ResNet [25]), as shown in Figure 3. The MultiRes block contains Inception-like modules with parallel convolution filters of 3 × 3, 5 × 5, and 7 × 7 to capture spatial features at different scales. However, such parallel filters are not memory efficient; to reduce memory, we factorize the large filters into a sequence of 3 × 3 filters with a gradual increase in the number of filters at each layer, as shown in Figure 3.
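A minimal Keras sketch of such a MultiRes block, assuming the common factorization in which three chained 3 × 3 convolutions emulate 3 × 3, 5 × 5, and 7 × 7 receptive fields and a 1 × 1 convolution forms the residual path; the exact filter split is our assumption, not taken from the chapter.

from tensorflow.keras import layers

def multires_block(x, filters):
    # gradual increase in filter count along the chain (assumed split)
    f1, f2, f3 = filters // 4, filters // 4, filters // 2
    c1 = layers.Conv2D(f1, 3, padding="same", activation="relu")(x)   # ~3x3 receptive field
    c2 = layers.Conv2D(f2, 3, padding="same", activation="relu")(c1)  # ~5x5 receptive field
    c3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(c2)  # ~7x7 receptive field
    merged = layers.Concatenate()([c1, c2, c3])
    shortcut = layers.Conv2D(f1 + f2 + f3, 1, padding="same")(x)      # residual path
    return layers.Activation("relu")(layers.Add()([merged, shortcut]))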
2.6 ResAtt-path
Figure 3.
Proposed MultiRes-attention U-Net architecture with MultiRes Block, ResAtt-path and attention gate block at the
decoder to aggregate attention features.
Figure 4.
Description of (a) MultiRes block, (b) ResAtt-path and (c) attention gated block of MultiRes-attention U-Net
architecture.
The ResAtt path connects the U-Net encoder at each level to the attention modules in
the decoding section of U-Net.
2.7 Self-attention
Soft attention gates (AGs), proposed by Oktay et al. [20], help the model focus on regions of interest by suppressing irrelevant location-based feature activations. AGs ensure that only salient spatial information is carried across the skip connection, which improves the network's performance in reducing false positives. An AG, shown in Figure 3(c) and given by Eq. (1), takes two inputs: (i) $I_p$, the lower-level block input, and (ii) $I_R$, the ResAtt-path from the proposed skip connection layer. The $I_p$ input is fed into a 1 × 1 convolution filter and upsampled to match the dimensions of the other input, as in Eq. (2). The dimension-matched inputs are combined and passed through ReLU and sigmoid activation functions to yield coefficients with values between 0 and 1. Finally, these coefficients are upsampled by trilinear interpolation to generate the soft attention feature map, which is then multiplied by the ResAtt-path's skip connection to produce the final output, as in Eq. (3):
$$x_{\text{attention}} = \text{SoftAttention}(I_p, I_R) \tag{1}$$
$$x_{\text{upsampled}} = \text{Upsample}(I_p) \tag{2}$$
$$\text{output} = \text{ConvBlock}(\text{concat}(x_{\text{attention}}, x_{\text{upsampled}})) \tag{3}$$
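A 2D Keras sketch of Eqs. (1)–(3); bilinear upsampling stands in for the trilinear interpolation mentioned above, since training here uses 2D slices, and the intermediate channel counts are assumptions. The sketch assumes i_p has exactly half the spatial resolution of the skip input i_r.

from tensorflow.keras import layers

def attention_gate(i_p, i_r, inter_channels):
    # Eq. (2): upsample the lower-level input I_p to the skip resolution
    x_upsampled = layers.UpSampling2D(interpolation="bilinear")(i_p)
    # project both inputs with 1x1 convolutions into a common space
    g = layers.Conv2D(inter_channels, 1)(x_upsampled)
    s = layers.Conv2D(inter_channels, 1)(i_r)
    # ReLU then sigmoid yield attention coefficients in [0, 1]
    coeff = layers.Activation("sigmoid")(
        layers.Conv2D(1, 1)(layers.Activation("relu")(layers.Add()([g, s]))))
    x_attention = layers.Multiply()([i_r, coeff])                 # Eq. (1)
    out = layers.Concatenate()([x_attention, x_upsampled])        # Eq. (3)
    return layers.Conv2D(inter_channels, 3, padding="same",
                         activation="relu")(out)                  # ConvBlock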
where $TP_{ssat}$, $TP_{dsat}$, and $TP_{vat}$ correspond to the predicted voxel counts of the SSAT, DSAT, and VAT classes, and $I_r$ corresponds to each subject's voxel resolution. The sub-region volume percentage is computed using Eq. (7), where $TP_i$ is the true positive volume of class $i$ and $\sum TP_v$ is the total volume of the fat region:

$$\%V_c = \frac{TP_i}{\sum TP_v} \times 100 \tag{7}$$
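A small numerical sketch of the volume computation and Eq. (7); the voxel counts and resolution below are made-up illustrative values.

tp = {"ssat": 120_000, "dsat": 95_000, "vat": 60_000}  # predicted voxel counts
i_r = 0.0012                                           # voxel volume (mL per voxel), hypothetical

volumes = {c: n * i_r for c, n in tp.items()}          # per-class fat volumes
total = sum(volumes.values())                          # total fat volume
percent = {c: 100.0 * v / total for c, v in volumes.items()}  # Eq. (7)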
Single-contrast, fat-only 3D MR Dixon scans were converted to 2D slices for training (approximately 8000 2D slices). Training was conducted on an Ubuntu 18.04 LTS operating system with an NVIDIA Titan X GPU, with code written in the TensorFlow framework [28]; the hyperparameters of MultiRes-Attention U-Net are shown in Table 1.
Multiclass Dice ratio (DR) and Hausdorff distance were the two performance metrics used to evaluate the segmentation of the fat subregions, comprising the SSAT, DSAT, and VAT regions.
The similarity between the predicted and ground truth segmentations is assessed by measuring their overlap using the multiclass Dice score, as in Eq. (8):

$$DSI_k = \frac{2 \sum \left( I_{pred}[I_{gt} == k] == k \right)}{\sum \left( I_{pred}[I_{pred} == k] == k \right) + \sum \left( I_{gt}[I_{gt} == k] == k \right)} \tag{8}$$

where $DSI_k$ is the per-class DSI value ranging between 0 and 1 (1 means complete overlap of the subregion), $I_{pred}$ is the predicted output, $I_{gt}$ is the ground truth, and $k$ is the class index.
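Eq. (8) translates directly into NumPy over integer label maps (in this sketch, 0 = background, 1 = SSAT, 2 = DSAT, 3 = VAT):

import numpy as np

def dice_per_class(pred, gt, k):
    # overlap where both prediction and ground truth equal class k
    intersection = np.sum((pred == k) & (gt == k))
    return 2.0 * intersection / (np.sum(pred == k) + np.sum(gt == k))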
The Hausdorff distance (HD) measures the distance between two compact non-empty subsets of a metric space [30]. To assess the similarity between the predicted region (Pred) and the ground truth (GT), the HD between two closed and bounded subsets of a given metric space M is defined as

$$HD(Pred, GT) = \max_{\alpha \in Pred} \, dist(\alpha, GT), \qquad dist(\alpha, GT) = \min_{\beta \in GT} \mu(\alpha, \beta)$$

where HD(Pred, GT) is the direct distance between the predicted region and the ground truth, dist(α, GT) is the distance from a point α of Pred to the region GT, and μ(α, β) is the point distance in the metric space. A smaller HD(Pred, GT) indicates better segmentation accuracy, i.e., a smaller mismatch area.

Table 1.
Hyperparameter values used in training MultiRes-Attention U-Net.
Epochs: 150
Dropout: 0.05
Patience: 15
3. Results
Accurate fat depot segmentation plays a significant role in evaluating fat distribution, which can be used as a biomarker to assess metabolic syndrome and obesity. Table 2 gives the training and testing Dice statistical index (DSI) (mean ± SD) for the 3-class segmentation accuracies (class 1: superficial fat, class 2: deep subcutaneous fat, class 3: visceral fat) of MultiRes-Attention U-Net, MultiResUNet, and the standard U-Net, trained with focal dice loss functions. The Dice scores (Table 2) indicate that all models show improved segmentation accuracy when trained with the focal dice loss function.
4. Discussion
Table 2.
Performance comparison of models: DSI scores for training (focal dice loss) on SSAT, DSAT, and VAT.
features such as the fascia boundary and smaller VAT components around the spine, preventing the network from learning false positive information. The focal dice loss function was found to be more appropriate for improving the overall segmentation results than cross-entropy (CE) loss and dice loss. Experimental results showed that the focal dice loss function could handle the inherent class imbalance (the amount of SSAT/DSAT/VAT in different slices) where cross-entropy and dice loss functions failed. The mean focal-dice-loss DSI for the test dataset was about 97.81% for SSAT, 97.18% for DSAT, and 97.11% for VAT, improvements of 7%, 11%, and 23%, respectively, over the standard U-Net results. The AHD of the proposed architecture is slightly better than that of MultiResUNet and, compared to the standard U-Net, significantly better for all 3 classes (SSAT, DSAT, and VAT). In addition, the model was able to separate SAT into SSAT and DSAT in lean subjects (broken or invisible fascia) and obese subjects (multiple fasciae). As shown in Figure 5, the model was also able to differentiate between VAT and bones, especially in the spine and pelvic regions. Further, MultiRes-Attention U-Net was tested on 190 new data sets (unseen during training; upper and lower abdomen scans with different resolutions), as illustrated in Figure 6, and yielded accurate results for SSAT and DSAT but a few false positives for VAT in the sacrum region.
Figure 5.
Comparison of predicted results of U-Net, MultiResUNet, and MultiRes-Attention U-Net (loss function: focal dice) on low-, medium-, and high-fat subjects.
Figure 6.
Illustration of predicted results of MultiRes-Attention U-Net on a few selected samples of the 190 new data sets (unseen during training; upper and lower abdomen scans with different resolutions).
5. Conclusion
In this study, we propose MultiRes-Attention U-Net with a hybrid loss function for segmentation of superficial and deep subcutaneous adipose tissue (SSAT and DSAT) and visceral adipose tissue (VAT) from abdominal MR scans. The MultiRes block, ResAtt-path, and attention gates can handle the shape, scale, and heterogeneity of the abdominal data. Model performance also depends on the loss function, especially when there is data imbalance; in this work, the focal dice loss function was found to be more appropriate for improving the overall segmentation results than cross-entropy (CE) loss and dice loss. The proposed pipeline comprises pre-processing, data augmentation, automatic segmentation of fat compartments, and fat quantification. The proposed algorithm takes less than 5 s for segmentation and quantification of the 3 fat compartments and provides more generalizable results: the model was able to separate SAT into SSAT and DSAT in lean subjects (broken or invisible fascia) and in obese subjects (multiple fasciae), and to differentiate small VAT tissue from bones, making it feasible for use in large clinical trials and clinical routine.
Chapter 5
Deep Learning for Natural Language Processing
1. Introduction
Deep learning has become increasingly important due to the fast growth of internet content and the urgent needs of big data in natural language processing (NLP).
The text classification task is one of the most fundamental scenarios in natural
language processing (NLP), where the user enters the text and the model divides the
input text into defined categories. Text classification tasks can be divided into multi-
class text classification, multi-label text classification, hierarchical text classification
and extreme multi-label text classification. In the multi-class text classification set-
tings, there are two or more label categories in the label set, and each sample has only
one relevant label. In the multi-label text classification (MLTC) settings, a sample may
have one or more relevant labels. Hierarchical text classification is a special multi-class or multi-label task in which the labels have a hierarchical relationship between them. The extreme multi-label text classification task (XMTC) is annotating
the most relevant labels for the text from a large label set with millions, or even
billions, of labels. It is a limitation of traditional models that words are treated as
independent features out of context. Deep learning methods have had great success in
other related fields by automatically extracting context-sensitive features from raw
text. Text classification techniques can be applied to problem classification [1], topic classification [2], and emotion classification [3]. Depending on the target domain, text classification applications span the recommendation system domain, the legal domain, and the ad placement domain. In recommendation systems, the task is predicting how much a user prefers a particular item; in the legal field, MLTC is used to predict the final outcome of bills; and in ad placement, personalized ads are tailored to users by inferring their characteristics and personal interests from social media.
Sentiment analysis refers to mining people's opinions and emotional attitudes toward various matters through modal information such as text and images. In the early days, sentiment analysis was mainly used to analyze user reviews of products sold online and thereby confirm user preferences for purchasing products. With the popularity of self-publishing today, sentiment analysis is more often used to identify the sentiment of topic participants, to mine the value of topics, and to analyze related public opinion. Sentiment analysis has important application value for both society and individuals.
Dialog systems rely on deep learning technology to act as assistants that talk or chat with people. Task-oriented dialog systems are used to solve specific problems in specific fields, such as movie ticket or restaurant table reservations. Because of their huge commercial value, they have attracted more and more attention.
This chapter is organized as follows: Section 2 discusses advances in text classification, Section 3 outlines sentiment analysis, Section 4 presents task-oriented dialog systems, and Section 5 concludes the chapter.
There are three problems in MLTC settings: the process of obtaining comprehensive supervisory information is time-consuming and labor-intensive; deep learning lacks theoretical support for interpretability; and modeling label dependencies is a major difficulty (Figure 1).
Multi-label text classification includes text pre-processing, text representation using feature engineering, and a classifier. Text pre-processing is a series of operations on the original text, including word segmentation, cleaning, normalization, and so on. Text representation turns words into vectors or matrices so that computers can process them. Feature engineering is divided into heuristic and machine learning approaches.
Figure 1.
Deep learning in multi-label text classification.
CNN is used to extract local contextual information of the text, with multiple stacked convolutional and pooling layers capturing deeper textual information. In detail, the
input layer obtains low-dimensional word vectors, the convolution layer extracts local information from the text, and the pooling layer reduces the feature dimension and prevents overfitting. Finally, the text and label dimensions are unified by a fully connected layer, and a softmax layer normalizes the output to obtain probabilities. RNNs use time-series memory of history information, accepting text sequences of arbitrary length and generating a fixed-length vector as the representation of the text content. However, gradient vanishing or explosion prevents RNNs from effectively learning long-term dependencies and correlations. To solve this long-term dependency problem, LSTM adds forget, input, and output gate units to the RNN to avoid gradient vanishing or explosion. The methods above assign the same weight to all words and cannot distinguish their importance. Inspired by human attention, the attention mechanism is introduced to focus on key information and content, making it easy for models to attend to the weighted parts and improving classification accuracy. Attention mechanisms are usually divided into three categories: local attention, global attention, and self-attention. Global attention considers all words of the text, assigning weights between 0 and 1 to obtain the text representation. Local attention assigns a weight of either 0 or 1 to each word, directly discarding some irrelevant items. Self-attention assigns weights based on the interaction of input words, which has the advantage of parallel computation in long-text classification. A minimal model sketch follows.
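As a small illustration of the building blocks just described, a Keras CNN for multi-label text classification might look as follows; the vocabulary size, sequence length, and label count are illustrative assumptions, and sigmoid outputs with binary cross-entropy allow several labels per sample.

from tensorflow.keras import layers, models

vocab_size, seq_len, num_labels = 20_000, 200, 50  # assumed values

model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=seq_len),  # low-dimensional word vectors
    layers.Conv1D(128, 5, activation="relu"),  # extract local contextual features
    layers.GlobalMaxPooling1D(),               # reduce the feature dimension
    layers.Dense(64, activation="relu"),
    layers.Dense(num_labels, activation="sigmoid"),  # one independent probability per label
])
# binary cross-entropy treats each label as an independent yes/no decision
model.compile(optimizer="adam", loss="binary_crossentropy")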
In conclusion, both word vector models and neural network models are important
components of deep learning-based text representation techniques, and they each
have their own advantages and can be selected according to the needs of specific tasks.
Word vector models focus more on the static representation of words, while neural
network models are better able to capture the dynamic information of the context.
Word vector models are relatively fast to train, while neural network models usually
require larger computational resources and longer training time. Neural network
models may perform better on some complex tasks, but for some simple tasks, word
vector models are effective enough.
Extreme multi-label text classification learns a classifier that tags a document with the most relevant subset of labels from a very large label set. The main challenge is the millions of labels, features, and training points. Current research architectures in extreme multi-label text classification can be divided into four main categories: one-vs-all models, embedding-based models, tree-based models, and deep learning models. Due to the high computational costs brought by large-scale label sets, existing MLTC techniques have difficulty solving the XMTC problem. It
can be seen that the extreme multi-label text classification task is trapped in a large label space and feature space, leading to two pressing problems. The first is the power-law distribution: long-tailed labels have very little data associated with them, making it difficult to obtain dependencies between labels and creating data sparsity and scalability issues in extreme text classification. The second is that computation is expensive, though the same results can be obtained at lower cost using data augmentation techniques. One-vs-all models train a separate classifier for each label on the entire dataset; they usually classify well and with high accuracy, but they assume that the individual labels are independent and uncorrelated, resulting in a cost that grows linearly with the number of labels.
Embedding-based models typically use the relationships between labels to map labels from a high-dimensional space to a low-dimensional space via a linear matrix mapping, reducing the total number of model parameters and the training time required. The limitation of the embedding approach is that it ignores the correlation between input and output, resulting in unaligned embeddings of the two. Tree-structured models are trained to produce instance or label trees to make predictions, such as decision trees, random forests, and Huffman trees. Traditional tree-based approaches can harm performance due to large tree heights and large cluster sizes.
All three types of models mentioned above are based on bag-of-words representa-
tions of text, where words are treated as independent features out of context and
cannot capture deep semantic information. In contrast, deep learning models can
automatically extract implicit contextual features from raw text for extreme multi-
label text classification.
Typical work such as XML-CNN [14] first explored the application of deep learning to XMTC, proposing a series of CNN models that use convolutional neural networks with dynamic max-pooling layers to extract semantic features of the text, and introducing a hidden bottleneck layer to reduce model parameters and accelerate training. However, XML-CNN [14] cannot capture the most important subtext for each label. AttentionXML [15] solves this problem with two techniques: first, a multi-label attention mechanism is introduced to capture the parts of the text most relevant to each label; second, a shallow and wide probabilistic label tree is built to handle millions of labels. LightXML [16] adopts BERT as a text encoder and obtains a better text representation; it is a state-of-the-art extreme multi-label text classification model. DeepXML [17] designed a framework that decomposes XMTC into four subtasks; by optimizing these subtasks with different components, it generates a series of algorithms, including Astec [17], DECAF [18], GalaXC [19], and ECLARE [20]. Astec [17] uses label clustering to obtain intermediate feature representations. DECAF [18] jointly learns model parameters and feature representations to exploit label metadata. GalaXC [19] introduces a label attention mechanism to make more accurate predictions based on the multi-resolution embeddings of graph nodes. ECLARE [20] enables collaborative learning using label-label correlations.
In summary, one-vs-all models are simple and intuitive and can be used flexibly
with a variety of binary classification algorithms but ignore the correlation between
labels, which may lead to inaccurate classification. Embedding-based models capture
semantic information but do not directly model the correlation between labels. Tree-
based models are able to handle high-dimensional and nonlinear data and can capture
correlations between nested features and labels. Deep learning models are capable of
learning complex feature representations and contextual correlations and are suitable
for large-scale data and complex tasks.
This section introduces aspect-based sentiment analysis (ABSA) and multimodal sentiment analysis, classical tasks in the field of natural language processing. We mainly cover deep learning techniques for sentiment analysis, since they outperform earlier machine learning methods and are the mainstream methods in the field of sentiment analysis.
The concept of ABSA was first introduced in 2010 by Thet et al. [21], and Liu [22] gave a definition of opinion in 2012: sentiment analysis and opinion mining is the field of research that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. From 2014 to 2016, SemEval, an international semantic evaluation conference, included the ABSA task as one of its subtasks and provided a series of manually annotated benchmark datasets [23, 24]. In recent years, the aspect-based sentiment analysis task has received attention from many scholars, especially after the rapid adoption of deep learning and related technologies in data mining, information retrieval, and intelligent question answering. Research on deep learning-based aspect-based sentiment analysis has therefore continued to achieve breakthroughs [25–29], and the ABSA task has gradually become one of the popular research topics in the field of NLP (Figure 2).
The main advantage of aspect-based sentiment analysis is that it is fine-grained. Coarse-grained sentiment analysis can often capture only a one-sided, single sentiment tendency and cannot analyze details at the level of each attribute. A review text often contains sentiment toward several evaluation objects, for example, "the service of this restaurant is good, but the taste is bad." This review evaluates the two aspects "service" and "taste" separately, and document-level or sentence-level sentiment analysis cannot mine each aspect individually. Aspect-based sentiment analysis is therefore needed for review texts that cover multiple aspects [30, 31].
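The contrast between coarse-grained and aspect-based output can be sketched in a few lines of code. The `absa_model` function below is a hypothetical stand-in for any trained ABSA model; it simply returns the gold annotations for the running restaurant example.

```python
# A hypothetical ABSA interface illustrating the fine-grained output format.
review = "The service of this restaurant is good, but the taste is bad."

def absa_model(text: str) -> list[tuple[str, str]]:
    # Placeholder returning gold annotations for the running example;
    # a real model would extract aspects and classify each one.
    return [("service", "positive"), ("taste", "negative")]

for aspect, polarity in absa_model(review):
    print(f"{aspect}: {polarity}")
# A sentence-level classifier would collapse this into a single, less
# informative label such as "mixed" or "neutral".
```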
Sentiment analysis methods based on deep learning can be divided into four main types: sentiment analysis with a single neural network, sentiment analysis with a hybrid neural network, sentiment analysis with attention mechanisms, and sentiment analysis with pre-trained models.
Methods based on a single neural network introduce one of a series of neural network models [32, 33] (e.g., CNN or RNN). CNN is mainly used to extract local features of text data and abstract them into low-dimensional vector representations.
Figure 2.
The working effect of ABSA.
With the rapid development of information and network technology and the widespread use of mobile terminals, the content people publish is becoming increasingly diverse. Messages about different events and topics are no longer limited to a single text form; users now tend to publish multimodal content that combines text and images to express their feelings and opinions.
This situation and trend have attracted academic attention to multimodal sentiment
analysis research, and by analyzing the sentiment tendency implied by these multi-
modal data, it has great application value in box office prediction, product marketing,
political election, product recommendation, mental health analysis, etc. Therefore,
multimodal sentiment analysis has become a hot research topic in recent years
[42, 43]. Multimodal sentiment analysis is the process of combining documents that
describe the same thing in different forms (e.g., sound, image, text, etc.) to enrich our
perception of the thing and analyze the sentiment it expresses. The term modality is
generally associated in academic research with the sensory modalities that represent
our primary communication and sensory channels, and when a research question or
data set contains multiple modalities, it is characterized as a multimodal task or
multimodal data set. In general, academics have focused on (but not limited to) three
modalities: (1) natural language, both spoken and textual, (2) visual signals, often
represented by images or videos, and (3) acoustic signals, such as intonation and
audio. Multimodal learning is a dynamic multidisciplinary field that is breaking new
ground in many tasks such as multimodal sentiment analysis, cross-modal retrieval, image captioning, audiovisual speech recognition, and visual question answering (Figure 3).
Figure 3.
The working effect of MSA.
Multimodal sentiment analysis makes full use of data from different modalities for
accurate sentiment prediction. In 2016, a cross-modality consistent regression (CCR)
model was proposed in the literature [44]. Its authors assumed that the overall sentiment of the text and image unimodal inputs and of the multimodal input is consistent; the text modality included descriptions and captions of images, and visual features were learned using CNNs, which outperformed the unimodal models. In the same year, work [45] proposed a tree-structured recursive neural network (TreeLSTM) that uses a tree structure and incorporates visual attention
mechanisms. The system builds a structured representation based on sentence parsing, aimed at aligning text words and image regions for accurate analysis, and incorporates LSTM and attention mechanisms to learn a robust joint visual-textual representation, achieving the best results of its time. In addition, image-text mismatch and defects in social media data, such as colloquial wording, misspellings, and missing punctuation, pose challenges to sentiment analysis of multimodal data,
and to address these challenges, in 2017, Xu et al. constructed different multimodal sentiment analysis networks, such as the hierarchical semantic attentional network (HSAN) [46] and the multimodal deep semantic network (MultiSentiNet) [47]. HSAN focuses on image captions, proposing a hierarchical semantic network that uses captions to extract visual semantic features as additional information for the text. MultiSentiNet, on the other hand, extracts image features from both objects and scenes and proposes a visual-feature-guided attentional long short-term memory network that extracts the words contributing to text sentiment and aggregates them with the visual semantic features of objects and scenes. In 2018, work
[48] proposed a novel co-memory network (CoMN) that models the interdependence between vision and text through memory networks, fully considering the interrelationship of multimodal data. In 2020, the multi-view attentional network (MVAN) [49] utilized a continuously updated memory network to obtain deep semantic features of images and texts. Its authors found that existing multimodal sentiment analysis datasets generally label only positive, negative, and neutral polarities and lack image-text multimodal datasets for more detailed sentiment classification, so they constructed a large-scale image-text multimodal dataset (TumEmo) from social media data. Cheema et al. proposed a simple and effective multimodal neural network (Sentiment Multi-Layer Neural Network, Se-MLNN) [50] that uses RoBERTa to extract contextual text features along with multiple high-level image features from multiple perspectives, fusing the features to accurately predict the overall sentiment.
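A minimal late-fusion sketch illustrates the common pattern behind many of these systems: encode each modality separately, concatenate the features, and classify the fused vector. The feature dimensions (768 for a RoBERTa-style text encoder, 2048 for a ResNet-style image backbone) are assumptions for illustration, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate unimodal features and classify the fused representation."""
    def __init__(self, text_dim=768, image_dim=2048, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),  # positive / negative / neutral
        )

    def forward(self, text_feat, image_feat):
        return self.fuse(torch.cat([text_feat, image_feat], dim=-1))

# text_feat from a text encoder, image_feat from a vision backbone (assumed).
logits = LateFusionClassifier()(torch.randn(4, 768), torch.randn(4, 2048))
```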
This section introduces the task-oriented dialog system, including the pipeline mode and the end-to-end mode (Figure 4).
Figure 4.
Task-oriented dialog system.
A task-oriented dialog system aims to process user messages accurately and places fairly strict constraints on its responses. A pipeline method is therefore used to generate responses in a controllable way. It is mainly divided into four parts: natural language understanding, dialog state tracking, dialog policy learning, and natural language generation. The natural language understanding module converts raw user messages into semantic slots and classifies the domain and user intent. The dialog state tracking module iteratively calibrates the dialog state based on the current input and the dialog history; the dialog state includes relevant user actions and slot-value pairs. The dialog policy learning module takes the calibrated dialog state and decides the next action of the dialog agent. Finally, the natural language generation module converts the selected dialog action into natural language feedback to the user. For example, in a movie-ticket reservation task, the agent interacts with a movie knowledge base to retrieve movie information under specific constraints [51], such as movie name, time, and cinema; in the hotel reservation domain, slots may include check-in date, check-out date, location, and room type.
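The four pipeline modules can be summarized as a dataflow, sketched below with stubbed functions; the intents, slots, and actions are invented for a movie-booking example and do not come from the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    # Slot-value pairs accumulated over the conversation.
    slots: dict = field(default_factory=dict)

def nlu(utterance: str) -> dict:
    """Natural language understanding: intent + slot extraction (stubbed)."""
    return {"intent": "book_movie", "slots": {"movie": "Oppenheimer"}}

def dst(state: DialogState, semantics: dict) -> DialogState:
    """Dialog state tracking: merge new slot values into the state."""
    state.slots.update(semantics["slots"])
    return state

def policy(state: DialogState) -> str:
    """Dialog policy: pick the next system action from the tracked state."""
    return "request(time)" if "time" not in state.slots else "confirm_booking"

def nlg(action: str) -> str:
    """Natural language generation: realize the action as text."""
    return {"request(time)": "What time would you like?"}.get(action, "Done!")

state = dst(DialogState(), nlu("Book a ticket for Oppenheimer"))
print(nlg(policy(state)))  # -> "What time would you like?"
```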
Domain classification and intent detection are both classification tasks, and the problem of classifying the domain and intent of a dialog has been tackled with deep learning. One line of work builds a deep convex network [52] that combines the prediction of a prior network with the current dialog as the overall input of the current network. To overcome the difficulty of training deep neural networks to predict domains and intents, some scholars used restricted Boltzmann machines and deep belief networks to initialize the parameters of the deep neural networks [53]. To exploit the advantages of recurrent neural networks (RNNs) in sequence processing, other work used RNNs as dialog encoders to predict intent and domain categories [54]. Some scholars proposed a short-text intent classification model: because a single conversation turn carries little information, the intent of short phrases is hard to identify, so an RNN or CNN structure fuses the dialog history and provides the resulting context as additional input for the current turn [55]. This model achieved good performance on intent classification tasks. Recently, a BERT model pre-trained on task-oriented dialog has achieved high accuracy on intent detection tasks, and the proposed method can effectively alleviate the problem of data shortage in specific domains.
Slot filling, also known as semantic tagging, is a sequence classification problem in which the model must predict multiple targets at the same time. Deep belief networks show good ability in learning deep structure, and some scholars built a sequence tagger based on them; in addition to the named-entity recognition features used in traditional taggers, they combined part-of-speech and syntactic features as part of the input. Recurrent structures benefit sequence tagging tasks because they can track information along past time steps and thus maximize the use of sequence information. Some scholars first proposed that RNN language models can be applied to sequence tagging rather than simply predicting words [56]: at the output end of the RNN, the targets are the sequence labels corresponding to the input words rather than ordinary words. Others further studied the impact of different recurrent structures on slot filling tasks and found that all RNN models outperform a simple conditional random field baseline [57]. Because the shallow output representation of traditional semantic annotation cannot express structured dialog information, the slot filling task has also been treated as a template-based tree-decoding process that iteratively generates and fills templates [58].
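As a concrete illustration of slot filling as sequence tagging, the sketch below shows a minimal bidirectional GRU tagger that emits one tag distribution per token (for example, BIO-style slot tags); the sizes and the tag scheme are illustrative assumptions, not any cited system.

```python
import torch
import torch.nn as nn

# BIO tagging example for "book a flight to Paris on Friday":
# tags might read O O O O B-destination O B-date (scheme illustrative).

class GRUSlotTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)  # one tag logit set per token

    def forward(self, token_ids):
        states, _ = self.rnn(self.emb(token_ids))
        return self.out(states)  # (batch, seq_len, num_tags)

tagger = GRUSlotTagger(vocab_size=1000, num_tags=7)
logits = tagger(torch.randint(0, 1000, (2, 8)))  # 2 utterances, 8 tokens each
```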
Dialog state tracking (DST) is the first module of the dialog manager. Based on the entire dialog history, it tracks the user's goals and relevant details at each turn, providing the policy learning module with the information needed for decision-making. Natural language understanding and dialog state tracking are closely related, as both fill slots with dialog information [59]. However, they actually play two different roles: the natural language understanding module classifies the current user message, for example recognizing intent and domain and determining the slot to which each message token belongs.
The first line of work can be considered a multi-class classification task: the tracker predicts the correct class among multiple candidate values. Some scholars used an RNN as a neural tracker to obtain a perception of the dialog context [60]; the tracker finally makes a binary prediction about the current slot-value pair based on the dialog history. The second line of work, neural trackers with unfixed slot names and values, attracts more attention because it not only reduces the model and time complexity of DST tasks but also helps to train task-oriented dialog systems end to end. Some scholars proposed the belief span, that is, the span of dialog context corresponding to a specific slot [61]. They built a two-stage CopyNet to copy slot values from the dialog into a history store in preparation for the neural response. The belief span promotes end-to-end training of the dialog system and improves tracking accuracy on out-of-vocabulary values. Building on this, some scholars proposed the minimal belief span, since generating belief-state domains from scratch does not scale when the system interacts with APIs from different sources [62]. Some scholars proposed the TRADE model, which also applies the copy mechanism and uses a soft-gated pointer generator to generate slot values from the encoded dialog context conditioned on domain-slot pairs [63].
Natural language generation is the last module in the pipeline mode of a task-oriented dialog system. It converts the dialog actions produced by the dialog manager into the final natural language representation. The standard flow of the natural language generation module is composed of four components, the core ones being content determination, sentence planning, and surface realization.
Deep learning has been applied to further enhance NLG performance, folding the pipeline into a single module. End-to-end natural language generation has made gratifying progress and is the most popular way to implement NLG. Some scholars argued that natural language generation should be completely data-driven and not rely on any expert rules [64]. They proposed an RNN-based statistical language model that uses semantic constraints and syntax trees to learn response generation, and additionally used a CNN reranker to select better answers. Similarly, some scholars used an LSTM model to learn sentence planning and surface realization at the same time. Some scholars used GRUs to further improve generation quality across multiple domains [65]; the proposed generator consistently produces high-quality responses in multiple domains. To improve the domain adaptability of recurrent models, some scholars proposed first training a recurrent language model on data synthesized from out-of-domain datasets and then fine-tuning it on relatively small in-domain datasets. This training strategy has proved effective in human evaluation [66].
Recent works often do not apply modules in a pipeline manner when building end-to-end systems. Instead, they use complex neural models to represent the key functions implicitly and integrate the modules into one. Research on task-oriented end-to-end neural models focuses on training methods and model architectures, which are key to the correctness and quality of responses. Some scholars proposed an incremental learning framework to train their end-to-end task-oriented system [61]. The main idea is an uncertainty evaluation module that estimates the confidence that a generated response is appropriate: if the confidence score exceeds a threshold, the response is accepted; if the score is very low, the agent can instead learn online from human responses. Some scholars use model-agnostic meta-learning (MAML) to jointly improve adaptability and reliability [68], since real-life online service tasks offer only a few training samples. Similarly, some scholars also used MAML to train the end-to-end neural model for domain adaptation, training the model first on resource-rich tasks and then on the limited data of new tasks [59]. Other scholars trained an inconsistent-order detection module in an unsupervised manner [63]; by detecting out-of-order utterances, the module helps the system generate more coherent responses.
5. Conclusions
Most existing shallow and deep learning models, including ensemble approaches, have structures that can be used for text classification. BERT learns a form of linguistic representation that can be fine-tuned for many downstream NLP tasks. The main approaches to better results are adding data, increasing computational power, and designing training programs; the trade-off between data and computational resources on the one hand and predictive performance on the other is worth investigating. Because data with full supervisory information cannot always be collected, MLTC is gradually turning to the problem of classification with limited supervised information. Since the excellent performance of AlexNet in 2012, deep learning has shown great potential; how to leverage its powerful learning capabilities to better capture label dependencies is key to solving MLTC tasks.
With the application of deep learning technology to sentiment analysis tasks, the performance of sentiment analysis has improved greatly. However, some tasks and scenarios still need richer datasets to evaluate models more accurately.
Although deep learning has achieved remarkable results in dialog systems, accurate and fast access to user intent remains an industry demand in the pipeline mode, and controllability and interpretability need further study in the end-to-end mode.
Author details
Yuan Wang*, Zekun Li, Zhenyu Deng, Huiling Song and Jucheng Yang
College of Artificial Intelligence, Tianjin University of Science and Technology, China
© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Graves A. Long short-term memory. In: Supervised Sequence Labelling with Recurrent Neural Networks. Berlin: Springer; 2012. pp. 37-45
[2] Sakai Y, Matsuoka Y, Goto M. Purchasing behavior analysis model that considers the relationship between topic hierarchy and item categories. In: International Conference on Human-Computer Interaction. Cham: Springer; 2022. pp. 344-358
[3] Chen Z, Qian T. Transfer capsule network for aspect level sentiment classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Washington: ACL; 2019. pp. 547-556
[8] … (EMNLP). Toronto: ACL; 2014. pp. 1532-1543
[9] Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research. 2021;304:114135
[10] Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018
[11] Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. 2014
[37] Li Z, Li L, Zhou A, Hongbin L. JTSG: A joint term-sentiment generator for aspect-based sentiment analysis. Neurocomputing. 2021;459:1-9
[38] Qiannan X, Zhu L, Dai T, Yan C. Aspect-based sentiment classification …
[45] You Q, Cao L, Jin H, Luo J. Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks. In: Proceedings of the 24th ACM International Conference on Multimedia. New York: ACM; 2016. pp. 1008-1017
[46] Nan X. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). Beijing, China: IEEE; 2017. pp. 152-154
[47] Xu N, Mao W. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management. New York: ACM; 2017. pp. 2399-2402
[48] Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM; 2018. pp. 929-932
[49] Yang X, Feng S, Wang D, Zhang Y. Image-text multimodal emotion classification via multi-view attentional network. IEEE Transactions on Multimedia. 2020;23:4014-4026
[50] Cheema GS, Hakimov S, Müller-Budack E, Ewerth R. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods. In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. New York: ACM; 2021. pp. 37-45
[51] Masi I, Tran AT, Leksut JT, Hassner T, Medioni G. Do we really need to collect millions of faces for effective face recognition? In: Computer Vision. Cham: Springer; 2016. pp. 579-596
[52] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations. New York: ACM; 2014. pp. 46-57
[53] Campagna G, Foryciarz A, Moradshahi M, Lam MS. Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. 2020
[54] Chen J, Zhang R, Mao Y, Xu J. Parallel interactive networks for multi-domain dialogue state generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Toronto: ACL; 2020. pp. 17-26
[55] Chen H, Liu X, Yin D, Tang J. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter. 2017;19(2):25-35
[56] Gliwa B, Mochol I, Biesek M, Wawer A. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In: Proceedings of the 2nd Workshop on New Frontiers in Summarization. New York: ACM; 2019. pp. 38-49
[57] Wen TH, Gasic M, Kim D, Mrksic N, Su PH, Vandyke D, et al. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Toronto: ACL; 2015. pp. 275-284
[58] Wen TH, Gasic M, Mrksic N, Rojas-Barahona LM, Su PH, Ultes S, et al. Conditional generation and snapshot learning in neural dialogue systems. 2016
[59] Wen TH, Vandyke D, Mrksic N, Gasic M, Rojas-Barahona LM, Su PH, et al. A network-based end-to-end …
Chapter 6
Deep Learning in Medical Imaging
Abstract
Medical image processing tools play an important role in clinical routine in helping
doctors to establish whether a patient has or does not have a certain disease. To
validate the diagnosis results, various clinical parameters must be defined. In this con-
text, several algorithms and mathematical tools have been developed in the last two
decades to extract accurate information from medical images or signals. Traditionally, feature extraction from medical data using image processing is time-consuming and requires human interaction and expert validation. The segmentation
of medical images, the classification of medical images, and the significance of deep
learning-based algorithms in disease detection are all topics covered in this chapter.
1. Introduction
Deep learning algorithms were used in many medical applications to solve prob-
lems with segmentation, image classification, and pathology diagnosis. The manual
segmentation process is time-consuming for radiologists because it is typically done slice by slice. Furthermore, segmentation results are susceptible to intra- and interobserver variability. To address these limitations, several approaches based on active
contour, level set, and statistical shape modeling [1–3] have been proposed to segment
the extent of various pathologies or anatomical geometries. All of the methods men-
tioned above, however, are still semi-automated and require human interaction [4].
With the advent of DL, fully automated segmentation of serial medical images has become possible in a few seconds. Several studies in the literature reported that
segmentation algorithms based on AI outperformed the other classical models [5, 6].
Convolutional neural networks (CNNs) are the most widely used architecture for segmenting medical images. A CNN reduces the spatial dimensionality of the original image data through a series of network layers that perform convolution and pooling operations. Other DL architectures have also been proposed for this task, such as the deep neural network (DNN), artificial neural network (ANN), fully convolutional network (FCN), ResNet-50, and VGGNet-16 [7–10]. Figure 1 describes the tasks involved in segmenting cardiac images for various imaging modalities.
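The convolution-and-pooling idea behind CNN segmentation can be sketched as a toy encoder-decoder in PyTorch: convolutions and pooling shrink the spatial resolution, and transposed convolutions restore it to produce per-pixel class logits. This is an illustrative miniature under our own assumptions, not any of the cited architectures.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder: conv + pooling to shrink, transposed conv to restore."""
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # halve spatial size
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),  # per-pixel logits
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

mask_logits = TinySegNet()(torch.randn(1, 1, 128, 128))  # -> (1, 2, 128, 128)
```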
The success of DL-based medical image segmentation inspired other studies to
reevaluate the traditional approaches to image segmentation and incorporate DL
models into their work. Many factors have facilitated the increased use of DL. Among
them, we can note the availability of medical data and the evolution of graphics
processors’ performances.
Each year, large annotated datasets are published online. These data were collected during many challenges, such as the Medical Segmentation Decathlon and Medical Image Computing and Computer Assisted Interventions (MICCAI). Table 1 summarizes the largest medical image datasets available online.
Segmentation based on DL has been applied in different fields of medical imaging [12–14]. In cardiac MRI, several DL models have been used to delineate the contours of the myocardium, which represents a crucial step in computing useful clinical parameters for the evaluation of cardiac function [15]. DL has also been applied to the segmentation of different types and stages of cancer. For breast cancer, the data include mammography, ultrasound, and MRI images [16–18]. Other DL architectures have also been proposed in the literature to segment cervical cancer based on magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET) scan data [19].
Figure 1.
Overview of cardiac image segmentation tasks for different imaging modalities [11].
Dataset | Diseases | Imaging data | Source
NIH Image Gallery | Various diseases | X-rays, MRI, CT, PET | National Institutes of Health (NIH): https://www.flickr.com/photos/nihgov/
UCI Machine Learning Repository | Various diseases | X-rays, MRI, CT, PET, echography | The National Science Foundation: https://archive.ics.uci.edu/ml/index.php
Stanford Medical ImageNet | Various diseases | X-ray, MRI, and CT scans | Stanford University: https://aimi.stanford.edu/medical-imagenet
Open Images Dataset | Various diseases | All medical imaging techniques | Google in collaboration with CMU and Cornell Universities: https://storage.googleapis.com/openimages/web/index.html
Alzheimer's Disease Neuroimaging Initiative | Alzheimer's disease | Brain scans and related data from MRI | Foundation for the National Institutes of Health: https://adni.loni.usc.edu/
Table 1.
Medical images datasets available online.
Zhao et al. [20] proposed a new DL model that combined U-net with
progressive growing of U-net+ (PGU-net) for automated segmentation of cervical
nuclei. In their study, they reported a segmentation accuracy of 92.5%. Similarly,
Liu et al. [21] applied a modified U-net model on CT images for clinical target vol-
ume delineation in cervical cancer. In their proposed architecture, the encoder and
decoder components were replaced with dual path network (DPN) components. The model achieved a mean Dice similarity coefficient (DSC) of 0.88 and a Hausdorff distance (HD) of 3.46 mm.
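For reference, the Dice similarity coefficient used above is defined as DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B; the short function below computes it for binary masks.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 2.0 * np.logical_and(pred, target).sum() / denom if denom else 1.0

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(a, b), 3))  # 0.667
```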
Although image segmentation based on DL facilitates the detection, charac-
terization, and analysis of different lesions in medical images, it still suffers from
several limitations. First, the problem of missing border regions in medical images
should be considered [22]. Furthermore, the imbalanced data available online can significantly affect segmentation performance. In medical imaging, the collection of balanced data is challenging since images related to healthy controls largely outnumber pathological cases.
Figure 2.
Deep learning for the screening of breast cancer [37].
Early and precise diagnosis is crucial for the treatment of different diseases and
for the estimation of a severity grade. The use of DL for the diagnosis of diseases is a
dynamic research area that attracts several researchers worldwide. In fact, DL archi-
tectures have been applied to some specific pathologies such as cancer, heart disease,
diabetes, and Alzheimer's disease [34, 35]. The increasing number of medical imaging datasets has led different researchers to use deep learning models for the diagnosis of
DL algorithms have proven their performance in the prediction and diagnosis of cancer. The availability of images derived from MRI, CT, mammography, and biopsy has helped several researchers use these data for early cancer detection. The analysis of cancer images includes the detection of the tumor area, the classification of different cancer stages, and the extraction of different tumor characteristics [36].
Recently, Shen et al. [37] used a modified version of CNNs for the screening of breast cancer using mammography data. The outcomes of their study showed an AUC of 0.95 and a specificity of 96.1%. A CNN was also applied to the classification of different kinds of cancer and the detection of carcinoma. Figure 2 depicts the entire image categorization process for breast cancer screening using a DL architecture.
Alanazi et al. [38] applied a transfer DL model to detect brain tumors at an early stage using various types of tumor data. Furthermore, another study used a 3D deep CNN to assess glioma grade (low- or high-grade glioma) and reported an accuracy of 96.49% [39]. Compared with classical algorithms, these studies prove the efficiency of DL in the prediction and analysis of cancer. However, bigger medical datasets available online are needed for more adequate validation.
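Metrics such as the reported AUC and specificity can be computed from a model's scores with scikit-learn; the labels and scores below are toy values used only to show the calculation, not data from the cited studies.

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

# Toy labels/scores standing in for a screening model's outputs.
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9, 0.4, 0.6]

auc = roc_auc_score(y_true, y_score)
tn, fp, fn, tp = confusion_matrix(y_true, [s >= 0.5 for s in y_score]).ravel()
specificity = tn / (tn + fp)   # true-negative rate
print(f"AUC={auc:.2f}, specificity={specificity:.2%}")
```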
5. Conclusion
As has been shown, using medical image processing techniques in clinical practice
is crucial for determining if a patient has a particular disease or not. The field of
medical imaging has been transformed by AI and DL, which enable more precise and
automatic feature extraction from medical data. DL has been used to address a variety
of healthcare issues, including image segmentation and classification, disease detec-
tion, computer-aided diagnosis, and the learning of complex features without human
interaction. Despite the advances made, many challenges still exist in medical healthcare, including the privacy and heterogeneity of datasets.
Conflict of interest
Author details
© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Pohl KM, Fisher J, Kikinis R, Grimson WEL, Wells WM. Shape based segmentation of anatomical structures in magnetic resonance images. Computer Vision for Biomedical Image Applications. 2005;3765:489-498. DOI: 10.1007/11569541_49
[2] Chen X, Williams BM, Vallabhaneni SR, Czanner G, Williams R, Zheng Y. Learning active contour models for medical image segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA; 2019. pp. 11624-11632. DOI: 10.1109/CVPR.2019.01190
[3] Swierczynski P, Papież BW, Schnabel JA, Macdonald C. A level-set approach to joint image segmentation and registration with application to CT lung imaging. Computerized Medical Imaging and Graphics. 2018;65:58-68
[4] Gao Y, Tannenbaum A. Combining atlas and active contour for automatic 3D medical image segmentation. Proceedings of the IEEE International Symposium on Biomedical Imaging. 2011;2011:1401-1404
[5] Kim M, Yun J, Cho Y, Shin K, Jang R, Bae HJ, et al. Deep learning in medical imaging. Neurospine. 2019;16(4):657-668
[6] Vaidyanathan A, van der Lubbe MFJA, Leijenaar RTH, van Hoof M, Zerka F, Miraglio B, et al. Deep learning for the fully automated segmentation of the inner ear on MRI. Scientific Reports. 2021;11(1):2885
[7] Zadeh Shirazi A, McDonnell MD, Fornaciari E, Bagherian NS, Scheer KG, Samuel MS, Yaghoobi M, Ormsby RJ, Poonnoose S, Tumes DJ, Gomez GA. A deep convolutional neural network for segmentation of whole-slide pathology images identifies novel tumour cell-perivascular niche interactions that are associated with poor survival in glioblastoma.
[8] Cai L, Gao J, Zhao D. A review of the application of deep learning in medical image classification and segmentation. Annals of Translational Medicine. 2020;8(11):713. DOI: 10.21037/atm.2020.02.44
[9] Malhotra P, Gupta S, Koundal D, Zaguia A, Enbeyle W. Deep neural networks for medical image segmentation. Journal of Healthcare Engineering. 2022;2022:9580991. DOI: 10.1155/2022/9580991
[10] Alsubai S, Khan HU, Alqahtani A, Sha M, Abbas S, Mohammad UG. Ensemble deep learning for brain tumor detection. Frontiers in Computational Neuroscience. 2022;16:1005617. DOI: 10.3389/fncom.2022.1005617
[11] Chen C, Qin C, Qiu H, Tarroni G, Duan J, Bai W, et al. Deep learning for cardiac image segmentation: A review. Frontiers in Cardiovascular Medicine. 2020;7:25. DOI: 10.3389/fcvm.2020.00025
[12] Hesamian MH, Jia W, He X, Kennedy P. Deep learning techniques for medical image segmentation: Achievements and challenges. Journal of Digital Imaging. 2019;32(4):582-596. DOI: 10.1007/s10278-019-00227-x
[13] Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. A review of deep learning based methods for medical image multi-organ segmentation. Physica Medica. 2021;85:107-122
[14] Bangalore Yogananda CG, Shah BR, Vejdani-Jahromi M, Nalawade SS, Murugesan GK, Yu FF, et al. A fully automated deep learning network for brain tumor segmentation. Tomography. 2020;6(2):186-193
[15] Wang Y, Zhang Y, Wen Z, Tian B, Kao E, Liu X, et al. Deep learning based fully automatic segmentation of the left ventricular endocardium and epicardium from cardiac cine MRI. Quantitative Imaging in Medicine and Surgery. 2021;11(4):1600-1612
[16] Abdelrahman A, Viriri S. Kidney tumor semantic segmentation using deep learning: A survey of state-of-the-art. Journal of Imaging. 2022;8(3):55
[17] Yue W, Zhang H, Zhou J, Li G, Tang Z, Sun Z, et al. Deep learning-based automatic segmentation for size and volumetric measurement of breast cancer on magnetic resonance imaging. Frontiers in Oncology. 2022;12:984626
[18] Caballo M, Pangallo DR, Mann RM, Sechopoulos I. Deep learning-based segmentation of breast masses in dedicated breast CT imaging: Radiomic feature stability between radiologists and artificial intelligence. Computers in Biology and Medicine. 2020;118:103629
[19] Yang C, Qin LH, Xie YE, Liao JY. Deep learning in CT image segmentation of cervical cancer: A systematic review and meta-analysis. Radiation Oncology. 2022;17(1):175
[20] Zhao Y, Rhee DJ, Cardenas C, Court LE, Yang J. Training deep-learning segmentation models from severely limited data. Medical Physics. 2021;48(4):1697-1706
[21] Liu Z, Liu X, Guan H, Zhen H, Sun Y, Chen Q, et al. Development and validation of a deep learning algorithm for auto-delineation of clinical target volume and organs at risk in cervical cancer radiotherapy. Radiotherapy and Oncology. 2020;153:172-179
[22] Zambrano-Vizuete M, Botto-Tobar M, Huerta-Suárez C, Paredes-Parada W, Patiño Pérez D, Ahanger TA, et al. Segmentation of medical image using novel dilated ghost deep learning model. Computational Intelligence and Neuroscience. 2022;2022:6872045
[23] Gondara L. Medical image denoising using convolutional denoising autoencoders. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). Barcelona, Spain; 2016. pp. 241-246. DOI: 10.1109/ICDMW.2016.0041
[24] Gulakala R, Markert B, Stoffel M. Generative adversarial network based data augmentation for CNN based detection of Covid-19. Scientific Reports. 2022;12:19186
[25] Shukla P, Verma A, Verma S, Kumar M. Interpreting SVM for medical images using Quadtree. Multimedia Tools and Applications. 2020;79(39-40):29353-29373
[26] Tchito Tchapga C, Mih TA, Tchagna Kouanou A, Fozin Fonzin T, Kuetche Fogang P, Mezatio BA, et al. Biomedical image classification in a big data architecture using machine learning algorithms. Journal of Healthcare Engineering. 2021;2021:9998819
[27] Rashed BM, Popescu N. Critical analysis of the current medical image-based processing techniques for automatic disease evaluation: Systematic literature review. Sensors (Basel). 2022;22(18):7065
[28] Puttagunta M, Ravi S. Medical image analysis based on deep learning …
Deep learning and reinforcement learning are some of the most important and exciting
research fields today. With the emergence of new network structures and algorithms
such as convolutional neural networks, recurrent neural networks, and self-attention
models, these technologies have gained widespread attention and applications in
fields such as natural language processing, medical image analysis, and Internet of
Things (IoT) device recognition. This book, Deep Learning and Reinforcement Learning, examines the latest research achievements of these technologies and provides a
reference for researchers, engineers, students, and other interested readers. It helps
readers understand the opportunities and challenges faced by deep learning and
reinforcement learning and how to address them, thus improving the research and
application capabilities of these technologies in related fields.
ISSN 2633-1403
ISBN 978-1-80356-950-5
978-1-80356-952-9
Published in London, UK
© 2023 IntechOpen
© your_photo / iStock