Journal of Sensors
Volume 2017, Article ID 3296874, 13 pages
https://doi.org/10.1155/2017/3296874
Review Article
A Review of Deep Learning Methods and Applications for
Unmanned Aerial Vehicles
Copyright © 2017 Adrian Carrio et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Deep learning is recently showing outstanding results for solving a wide variety of robotic tasks in the areas of perception, planning,
localization, and control. Its excellent capabilities for learning representations from the complex data acquired in real environments
make it extremely suitable for many kinds of autonomous robotic applications. In parallel, Unmanned Aerial Vehicles (UAVs) are currently being extensively applied to several types of civilian tasks, in applications ranging from security, surveillance, and disaster rescue to parcel delivery and warehouse management. In this paper, a thorough review has been performed on recently reported uses
and applications of deep learning for UAVs, including the most relevant developments as well as their performances and limitations.
In addition, a detailed explanation of the main deep learning techniques is provided. We conclude with a description of the main
challenges for the application of deep learning for UAV-based solutions.
Figure 1: Aerostack architecture, consisting of a layered structure, corresponding to the different abstraction levels in an unmanned aerial
robotic system. The architecture has been applied here to systematically classify deep learning-based algorithms available in the state of the
art which have been deployed for applications with Unmanned Aerial Vehicles.
related to perception, guidance, navigation, and control of unmanned rotorcraft systems. The purpose of referring to this architecture, depicted in Figure 1, is to achieve a better understanding of the nature of the components of the aerial robotic systems analyzed. Using this taxonomy also helps identify the components in which deep learning has not been applied yet. According to Aerostack, the components constituting an unmanned aerial robotic system can be classified into the following systems and interfaces:

(i) Hardware interfaces: this category includes interfaces with both sensors and actuators.

(ii) Motor system: the components of a motor system are motion controllers, which typically receive commands of desired values for a variable (position, orientation, or speed). These desired values are translated into low-level commands that are sent to the actuators.

(iii) Feature extraction system: feature extraction here refers to the extraction of useful features or representations from sensor data. The task of most deep learning algorithms is to learn data representations, so feature extraction systems are somewhat inherent to deep learning algorithms.

(iv) Situational awareness system: this system includes components that compile sensor information into state variables regarding the robot and its environment, pursuing environment understanding. An example component within the situational awareness system is a SLAM algorithm.

(v) Executive system: this system receives high-level symbolic actions and generates detailed behaviour sequences.

(vi) Planning system: this type of system generates global solutions to complex tasks by means of planning (e.g., path planning and mission planning).

(vii) Supervision system: components in the supervision system simulate self-awareness in the sense of the ability to supervise other integrated systems. We can exemplify this type of component with an algorithm that checks whether the robot is actually making progress towards its goal and reacts in the presence of problems (unexpected obstacles, faults, etc.) with recovery actions.

(viii) Communication system: the components in the communication system are responsible for establishing adequate communication with human operators and/or other robots.

The remainder of this paper is as follows: firstly, Section 2 covers a description of the currently relevant and prominent deep learning algorithms. For the sake of completeness, deep learning algorithms have been included regardless of their direct use in UAV applications. Section 3 presents the state of the art in deep learning for feature extraction in UAV applications. Section 4 surveys UAV applications of deep learning for the development of components of planning and situational awareness systems. Reported applications of deep learning for motion control in UAVs are presented in Section 5. Finally, a discussion of the main challenges for the application of deep learning for UAVs is covered in Section 6.
Figure 2: A generic example of a Convolutional Neural Network model. The usual architecture alternates convolution and subsampling
layers. Fully connected neurons are used in the last layers.
2. Deep Learning in the Context of Machine Learning

Machine Learning is a capability enabling Artificial Intelligence (AI) systems to learn from data. A good definition of what learning involves is the following: "a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" [15]. The nature of this experience E is typically considered for classifying Machine Learning algorithms into the following three categories: supervised, unsupervised, and reinforcement learning:

(i) In supervised learning, algorithms are presented with a dataset containing a collection of features. Additionally, labels or target values are provided for each sample. This mapping of features to labels or target values is where the knowledge is encoded. Once it has learned, the algorithm is expected to find the mapping from the features of unseen samples to their correct labels or target values.

(ii) The purpose in unsupervised learning is to extract meaningful representations and explain key features of the data. No labels or target values are necessary in this case in order to learn from the data.

(iii) In reinforcement learning algorithms, an AI agent interacts with a real or simulated environment. This interaction provides feedback between the learning system and the interaction experience, which is useful to improve performance in the task being learned.

2.1. Supervised Learning. The most widely used models nowadays in supervised learning are the following: Feedforward Neural Networks, a popular variation of these called Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and a variation of RNNs called Long Short-Term Memory (LSTM) models.

Feedforward Neural Networks, also known as Multilayer Perceptrons (MLPs), are the most common supervised learning models. Their purpose is to work as function approximators: given a sample vector x with n features, a trained algorithm is expected to produce an output value or classification category y that is consistent with the mapping of inputs and outputs provided in the training set. The approximated function is usually built by stacking together several hidden layers that are activated in chain to obtain the desired output. The number of hidden layers is usually referred to as the depth of the model, which explains the origin of the term deep learning: learning using models with several layers. These layers are made up of neurons or units whose activation given an input vector $x \in \mathbb{R}^n$ is given by the following equation:

$$a_\theta(x) = g\left(\theta^T x\right), \qquad (1)$$

where $\theta$ is a vector of $n$ weights and $g$ is an activation function that is usually chosen to be nonlinear. The activation of unit $k$ in layer $m$ given its $n$ inputs (the outputs of the previous layer $m-1$) is given by the following equation:

$$a_k^m = g\left(\Theta_{k0}^{m-1} a_0^{m-1} + \Theta_{k1}^{m-1} a_1^{m-1} + \cdots + \Theta_{kn}^{m-1} a_n^{m-1}\right). \qquad (2)$$
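As a purely illustrative reading of (1)-(2), the following NumPy sketch evaluates a two-layer MLP by chaining the per-layer weighted sum and nonlinearity; the layer sizes are arbitrary and bias terms are omitted for brevity.

```python
import numpy as np

def layer_activation(theta, a_prev, g=np.tanh):
    """Activations of one fully connected layer, as in (2): g(Theta @ a_prev)."""
    return g(theta @ a_prev)

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # input vector with n = 4 features
theta1 = rng.normal(size=(8, 4))    # hidden-layer weights
theta2 = rng.normal(size=(1, 8))    # output-layer weights
y = layer_activation(theta2, layer_activation(theta1, x))  # two stacked layers
print(y)
```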
In CNNs, the convolution of an input image $I$ with a kernel $K$ is given by the following equation:

$$C(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n). \qquad (3)$$
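The following NumPy sketch (illustrative only, not code from the reviewed works) implements the discrete convolution of (3) over the valid region of a single-channel image; in a CNN the kernels are learned rather than hand-designed, and the operation is applied across many channels and feature maps.

```python
import numpy as np

def conv2d(image, kernel):
    """Discrete 2D convolution as in (3), equivalent to cross-correlation
    with a flipped kernel, computed over the 'valid' region only."""
    kf = np.flip(kernel)                     # flip the kernel in both axes
    kh, kw = kf.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kf)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
edge = np.array([[1.0, 0.0, -1.0]])          # simple horizontal gradient kernel
print(conv2d(img, edge))
```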
For RNNs, the activation of the hidden state is given by the following equation:

$$h_t = g\left[W x_t + U h_{t-1}\right], \qquad (4)$$

where $h_t$ represents the hidden state at time step $t$. The weight matrices $W$ (input-to-hidden) and $U$ (hidden-to-hidden) determine the importance given to the current input and to the previous state, respectively. The activation is computed with a third weight matrix $V$ (hidden-to-output) as indicated by the following equation:

$$a_t = V h_t. \qquad (5)$$

RNNs are usually trained using Backpropagation Through Time (BPTT), an extension of backpropagation which takes into account temporality in order to compute the gradients. Using this method with long temporal sequences can lead to several issues. Gradients accumulated over a long sequence can become immeasurably large or extremely small. These problems are referred to as exploding gradients and vanishing gradients, respectively. Exploding gradients are easier to solve, as they can be truncated or squashed, whereas vanishing gradients can become too small for [...]

The cell state vector activation is given by the following equation:

$$c_t = f_t \circ c_{t-1} + i_t \circ g\left(W_c x_t + U_c h_{t-1}\right), \qquad (7)$$

where $\circ$ represents the Hadamard product. Finally, the output gate vector activation is given by the following equation:

$$h_t = o_t \circ g\left(c_t\right). \qquad (8)$$

As it has already been stated, LSTM gated cells in RNNs have internal recurrence, besides the outer recurrence of RNNs. Cells store an internal state, which can be written to and read from. There are gates controlling how data enter, leave, and are deleted from this cell state. Those gates act on the signals they receive and, similar to a standard neural network, they block or pass on information based on its strength and importance using their own sets of weights. Those weights, as the weights that modulate input and hidden states, are adjusted via the recurrent network's learning process. The cells learn when to allow data to enter, leave, or be deleted through the iterative process of making guesses, backpropagating error, and adjusting weights via gradient descent. This type of model architecture allows successful learning from long sequences, helping to capture diverse time scales and remote dependencies. Practical aspects on the use of LSTMs and other deep learning architectures can be found in [18].
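For illustration, a single LSTM time step following (7)-(8) can be sketched as below. The gate definitions (presumably the missing equation (6)) did not survive extraction here, so the standard sigmoid-gate formulation is assumed; this is a sketch, not code from the reviewed works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; standard sigmoid gates assumed, then (7)-(8)."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)   # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)   # input gate
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)  # eq. (7)
    h_t = o_t * np.tanh(c_t)                                              # eq. (8)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
p = {k: rng.normal(scale=0.1, size=(n_hid, n_in if k.startswith("W") else n_hid))
     for k in ["Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc"]}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):        # unroll over a short input sequence
    h, c = lstm_step(x, h, c, p)
print(h)
```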
2.2. Unsupervised Learning. Unsupervised learning aims towards the development of models that are capable of extracting meaningful and high-level representations from high-dimensional sensory unlabeled data. This functionality is inspired by the visual cortex, which requires a very small amount of labeled data.

Deep Generative Models such as Deep Belief Networks (DBNs) [19, 20] allow the learning of several layers of nonlinear features in an unsupervised manner. DBNs are built by stacking several Restricted Boltzmann Machines (RBMs) [21, 22], resulting in a hybrid model in which the top two layers form an RBM and the bottom layers act as a directed graph constituting a Sigmoid Belief Network (SBN). The learning algorithm proposed in [19] is considered one of the first efficient ways of learning DBNs, introducing a greedy layer-by-layer training in order to obtain a deep hierarchical model. In this greedy learning procedure, the hidden activity patterns obtained in the current layer are used as the "visible" data for training the RBM of the next layer. Once the stacked RBMs have been learned and combined to form a DBN, a fine-tuning procedure using a contrastive version of the wake-sleep algorithm [23] is applied.

For a better understanding, the theoretical details of RBMs are provided in the following equations. The energy of a joint configuration $\{\mathbf{v}, \mathbf{h}\}$ can be calculated as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{i \in \text{vis}} v_i b_i - \sum_{j \in \text{hid}} h_j a_j - \sum_{i,j} W_{ij} v_i h_j, \qquad (9)$$

where $\theta = \{W, b, a\}$ represents the model parameters. $\mathbf{v} \in \{0, 1\}$ are the "visible" stochastic binary units, which are connected to the "hidden" stochastic binary units $\mathbf{h} \in \{0, 1\}$. The bias terms are denoted by $b_i$ for the visible units and $a_j$ for the hidden units.

The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration and is given by (10), where $Z(\theta)$ represents the partition function (see (11)):

$$P(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{Z(\theta)} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right), \qquad (10)$$

$$Z(\theta) = \sum_{\mathbf{v}} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right). \qquad (11)$$

The probability assigned by the model to a visible vector $\mathbf{v}$ can be computed as expressed in the following equation:

$$P(\mathbf{v}; \theta) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right). \qquad (12)$$

The conditional distributions over hidden variables $\mathbf{h}$ and visible variables $\mathbf{v}$ can be extracted using (13). Once a training sample is presented to the model, the binary states of the hidden variables are set to 1 with the probability given by (14). Analogously, once the binary states of the hidden variables are computed, the binary states of the visible units are set to 1 with the probability given by (15):

$$P(\mathbf{h} \mid \mathbf{v}; \theta) = \prod_{j} p(h_j \mid \mathbf{v}), \qquad P(\mathbf{v} \mid \mathbf{h}; \theta) = \prod_{i} p(v_i \mid \mathbf{h}), \qquad (13)$$

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\left(\sum_{i} W_{ij} v_i + a_j\right), \qquad (14)$$

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\left(\sum_{j} W_{ij} h_j + b_i\right), \qquad (15)$$

where $\sigma(z) = 1/(1 + \exp(-z))$ is the logistic function.

For training the RBM model, the learning is conducted by applying the Contrastive Divergence algorithm [22], in which the update rule applied to the model parameters is given by the following equation:

$$\Delta W_{ij} = \epsilon\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recons}}\right), \qquad (16)$$

where $\epsilon$ is the learning rate, $\langle v_i h_j \rangle_{\text{data}}$ represents the expected value of the product of visible and hidden states at thermal equilibrium when training data is presented to the model, and $\langle v_i h_j \rangle_{\text{recons}}$ is the expected value of the product of visible and hidden states after running a Gibbs chain.
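A minimal NumPy sketch of one Contrastive Divergence (CD-1) update following (13)-(16) is given below; it uses toy binary data and a single Gibbs step, and is illustrative only rather than code from the reviewed works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, a, b, v0, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update of an RBM.
    W: (n_visible, n_hidden) weights, a: hidden biases, b: visible biases,
    v0: (batch, n_visible) binary training batch."""
    ph0 = sigmoid(v0 @ W + a)                          # hidden probabilities, eq. (14)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                        # reconstruct visibles, eq. (15)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)                          # hidden probabilities of reconstruction
    # Parameter update, eq. (16): <v h>_data - <v h>_recons, averaged over the batch.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    a += lr * (ph0 - ph1).mean(axis=0)
    b += lr * (v0 - v1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(1)
v = (rng.random((16, 6)) < 0.5).astype(float)          # toy binary data
W = rng.normal(scale=0.01, size=(6, 4))
a, b = np.zeros(4), np.zeros(6)
W, a, b = cd1_update(W, a, b, v)
```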
Deep neural networks can also be utilized for dimensionality reduction of the input data. For this purpose, deep "autoencoders" [24, 25] have been shown to provide successful results in a wide variety of applications such as document retrieval [26] and image retrieval [27]. An autoencoder (see Figure 4) is an unsupervised neural network in which the target values are set to be equal to the inputs. Autoencoders are mainly composed of an "encoder" network, which transforms the input data into a low-dimensional code, and a "decoder" network, which reconstructs the data from the code. Training these deep models involves minimizing the error between the original data and its reconstruction. In this process, the weight initialization is critical to avoid reaching a bad local optimum; thus some authors have proposed a pretraining stage based on stacked RBMs and a fine-tuning stage using backpropagation [24, 27]. In addition, the encoder part of the autoencoder can serve as a good unsupervised nonlinear feature extractor. In this field, the use of Stacked Denoising Autoencoders (SDAE) [25] has been proven to be an effective unsupervised feature extractor in different classification problems. The experiments presented in [25] showed that training denoising autoencoders with higher noise levels forced the model to extract more distinctive and less local features.
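As a toy illustration of the reconstruction objective described above, the sketch below trains a linear autoencoder with a single code layer by gradient descent on the mean squared reconstruction error; real deep autoencoders stack several nonlinear layers and, as noted above, often rely on RBM-based pretraining. The data and sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # toy data: 200 samples, 8 features
n_code = 2                             # dimensionality of the code layer

W_enc = rng.normal(scale=0.1, size=(8, n_code))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_code, 8))   # decoder weights

lr = 0.01
for step in range(500):
    code = X @ W_enc                   # encoder: project data to the code layer
    X_hat = code @ W_dec               # decoder: reconstruct the input
    err = X_hat - X                    # reconstruction error to be minimized
    # Gradient steps on the mean squared reconstruction loss for both matrices.
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print(np.mean((X - (X @ W_enc) @ W_dec) ** 2))    # final reconstruction MSE
```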
2.3. Deep Reinforcement Learning. In reinforcement learning, an agent is defined to interact with an environment, seeking to find the best action for each state at any step in time (see [...]
Figure 4: Deep autoencoder. An autoencoder consists of an encoder network, which transforms the original input data into a low-
dimensional code, and a decoder network, which reconstructs the data from the code.
[...] optimization of the action-value function $Q$, based on the Bellman Optimality Equation [29] for $Q$ (see (21)):

$$\pi^* = \arg\max_{a_t} Q^*(s_t, a_t), \qquad (20)$$

$$Q^*(s_t, a_t) = \mathbb{E}\left[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right]. \qquad (21)$$

The Deep Q-Network (DQN) [30, 31] method estimates the action-value function (see (22)) by means of a CNN model with a set of weights $\theta$, such that $Q^*(s, a) \approx Q(s, a; \theta)$. [...]

The DDPG method learns with, on average, a factor of 20 times fewer experience steps than DQN [33]. Both DDPG and DQN require large sample datasets, since they are model-free algorithms. Regarding the DNN-based Guided Policy Search (DNN-based GPS) method [34], it learns to map from the tuple of raw visual information and joint states directly to joint torques. Compared to the previous works, it managed to perform high-dimensional control, even from imperfect sensor data. DNN-based GPS has been widely applied to robotic control, from manipulation to navigation tasks [35, 36].
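For reference, the bootstrapped target implied by the Bellman backup in (21), as used when regressing $Q(s, a; \theta)$ in DQN-style methods, can be sketched as follows. This is an illustrative NumPy fragment only; the target-network and experience-replay machinery of [30, 31] is omitted, and the loss itself (the missing equation (22)) is not reproduced here.

```python
import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q(s', a'), following the backup in (21).
    q_next : (batch, n_actions) Q-values of next states from a (target) network
    rewards: (batch,) immediate rewards
    dones  : (batch,) 1.0 if the episode terminated at this step, else 0.0"""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

q_next = np.array([[0.2, 1.5, -0.3],
                   [0.0, 0.1, 0.4]])
print(dqn_targets(q_next, rewards=np.array([1.0, 0.0]), dones=np.array([0.0, 1.0])))
```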
[...] are mainly utilized for plant counting and identification. Several applications have used deep learning techniques for this purpose [12, 49, 50, 55, 56], providing robust systems for monitoring the state of the crops in order to maximize their productivity. In [55], a sparse autoencoder was utilized for unsupervised feature learning in order to perform weed classification from images taken by a multirotor UAV. In [56], a hybrid neural network for crop classification amongst 23 classes was proposed. The hybrid network consisted of the combination of a Feedforward Neural Network for histogram information management and a CNN. In [49], the well-known AlexNet CNN architecture proposed in [69] was utilized in combination with a sliding window object proposal technique for palm tree detection and counting. Other similar approaches have focused on weed scouting using a CNN model for weed species classification [12].

Deep learning techniques applied on images taken from UAVs have also gained a lot of importance in monitoring and search and rescue applications, such as jellyfish monitoring [70], road traffic monitoring from UAVs [71], assisting avalanche search and rescue operations with UAV imagery [72], and terrorist identification [73]. In [72, 73], the use of pretrained CNN models for feature extraction is worth noting again. In both cases, the well-known Inception model [74] was used. In [72], the Inception model was utilized with a Support Vector Machine (SVM) classifier for detecting possible survivors, while in [73], a transfer-learning technique was used to fine-tune the Inception network in order to detect possible terrorists.
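The pretrained-CNN-features-plus-SVM pattern mentioned for [72] can be sketched as follows, assuming TensorFlow/Keras and scikit-learn are available and using InceptionV3 as a stand-in for the Inception family; the data is a random placeholder and this is not the authors' pipeline.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Pretrained Inception backbone as a fixed feature extractor (ImageNet weights),
# global-average-pooled to one feature vector per image.
backbone = tf.keras.applications.InceptionV3(weights="imagenet",
                                             include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (N, 299, 299, 3) with values in [0, 255]."""
    x = tf.keras.applications.inception_v3.preprocess_input(images.copy())
    return backbone.predict(x, verbose=0)

# Toy stand-in data; in practice these would be labeled UAV image crops.
rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 299, 299, 3)).astype("float32")
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

clf = SVC(kernel="linear").fit(extract_features(images), labels)
print(clf.predict(extract_features(images[:2])))
```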
Most of the presented approaches, especially in the field of object recognition, require the use of GPUs for dealing with real-time constraints. In this sense, the state-of-the-art object recognition systems are based on the approaches presented in [46, 47], in which the object recognizer is able to run at rates from 40 to 90 frames per second on an Nvidia GeForce GTX Titan X.

Despite the good results provided by the aforementioned systems, UAV constraints such as endurance, weight, and payload require the development of specific hardware and software solutions for being embedded on board a UAV. Taking these limitations into account, only a few systems in the literature have embedded feature extraction algorithms using deep learning processed by GPU technology on board a UAV. In [75], the problem of automatic detection, localization, and classification (ADLC) of plywood targets was addressed. The solution consisted of a cascade of classifiers based on CNN models trained on an Nvidia Titan X and applied over 24-megapixel RGB images processed by an Nvidia Jetson TK1 mounted on board a fixed-wing UAV. The ADLC algorithm was processed by combining the CPU cores for the detection stage, allowing the GPU to focus on the classification tasks.

3.2. With Other Sensors. Most of the presented work using deep learning in the literature has been applied to data captured by image sensors, due to the consolidated results obtained using CNN models. However, deep learning techniques cover a wide range of applications and can be used in conjunction with sensors other than cameras, such as acoustic, radar, and laser sensors.

Deep learning techniques for UAVs have been utilized for acoustic data recognition [64, 65]. In [64], a Partially Shared Deep Neural Network (PS-DNN) was proposed to deal with the problem of sound source separation and identification using partially annotated data. For this purpose, the PS-DNN is composed of two partially overlapped subnetworks: one regression network for sound source separation and one classification network responsible for the sound identification. The objective of the regression network for sound source separation is to improve the network training for sound source classification by providing a cleaner sound signal. Results showed that the PS-DNN model worked reasonably well for identifying people's voices in disaster situations. The data was collected using a microphone array on board a Parrot Bebop UAV.

In [65], the problem of UAV identification based on their specific sound was addressed by using a bidirectional LSTM-RNN with 3 layers and 300 LSTM blocks. This model exhibited the best performance amongst the other 2 preselected models, namely, Gaussian Mixture Models (GMM) and a CNN.

Concerning radar technology, and despite the fact that radar data has not been widely addressed using deep learning techniques for UAVs in the literature, the recent advances presented in [62] are worth mentioning. In this paper, the spectral correlation function (SCF) was captured using a 2.4 GHz Doppler radar sensor and utilized in order to detect and classify micro-UAVs amongst 3 predefined classes. The model utilized for this purpose was based on a semisupervised DBN trained with the SCF data.

Regarding laser technology, in [66], a novel strategy for detecting safe landing areas based on the point clouds captured from a LIDAR sensor mounted on a helicopter was proposed. In this paper, subvolumes of 1 m³ from a volumetric density map constructed from the original point cloud were used as input to a 3D CNN, which was trained to predict the probability of the evaluated area being a safe landing zone. Several CNN models consisting of one or two convolutional layers were evaluated over synthetic and semisynthetic datasets, showing in both cases good results when using a 3D CNN model with two convolutional layers.

4. Deep Learning for Planning and Situational Awareness

Several deep learning developments have been reported for tasks related to UAV planning and situational awareness. Planning tasks refer to the generation of solutions for complex problems without having to hand-code the environment model or the robot's skills or strategies into a reactive controller. Planning is required in the presence of unstructured, dynamic environments or when there is diversity in the scope and/or the robot's tasks. Typical tasks include path, motion, navigation, or manipulation planning. Situational awareness tasks allow robots to have knowledge about their own state and their environment's state. Some examples of this kind of task are robot state estimation, self-localization, and mapping.
4.1. Planning. Path planning for collaborative search and rescue missions with deep learning-based exploration is presented in [57]. This work, where a UAV explores and maps the environment trying to find a traversable path for a ground robot, focuses on minimizing the overall deployment time (i.e., both exploration and path traversal). In order to map the terrain and find a traversable path, a CNN is proposed for terrain classification. Instead of using a pretrained CNN, training is done on the spot, allowing the classifier to be trained on demand with the terrain present at the disaster site [58]. However, the model takes around 15 minutes to train.

4.2. Situational Awareness. Cross-view localization of images is achieved with the help of deep learning in [59]. Although the work is presented as a solution for UAV localization, no UAVs were used for image collection and the experiments were based on ground-level images only. The approach is based on mining a library of raw image data to find nearest neighbor visual features (i.e., landmarks), which are then matched with the features extracted from an input query image. A pretrained CNN is used to extract features for matching verification purposes, and although the approach is said to have low computational complexity, the authors do not provide details about retrieval time.

Ground-level query images are matched to a reference database of aerial images in [60]. Deep learning is applied here to reduce the wide baseline and appearance variations between ground-level and aerial images. A pair-based network structure is proposed to learn deep representations from data for distinguishing matched and unmatched cross-view image pairs. Even though the training procedure in the reported experiments took 4 days, the use of fast algorithms such as locality-sensitive hashing allowed for real-time cross-view matching at city scale. The main limitation of their approach is the need to estimate scale, orientation, and dominant depth at test time for ground-level queries.

In [61], a CNN is proposed to generate control actions (the permitted turns for a UAV) given an image captured on board and a global motion plan. This global motion plan indicates the actions to take given a position on the map by means of a potential function. The purpose of the CNN is to learn the mapping from images to position-dependent actions. The process would be equivalent to performing image registration and then generating the control actions given the global motion plan, but this behaviour is here learnt to be efficiently encoded in a CNN, demonstrating superior results to classical image registration techniques. However, no tests on a real UAV were carried out and no information is provided about execution time, which might complicate the deployment for a real UAV application.

As seen from the presented works, developments in planning and situational awareness with deep learning for UAVs are still quite rudimentary. The path planning approach presented is limited to small-scale disaster sites, and the different localization and mapping approaches are still slow and have little accuracy for real UAV applications.

5. Deep Learning for Motion Control

Deep learning techniques for motion control have recently been involved in several scientific works. Classic control has solved diverse robotic control problems in a precise and analytic manner, allowing robots to perform complex maneuvers. Nevertheless, standard control theory only solves the problem for a specific case and for an approximated robot model, and it is not able to easily adapt to changes in the robot model and/or to hostile environments (e.g., a damaged propeller on a UAV, wind gusts, and rain). In this context, learning from experience is a matter of importance which can overcome numerous stated limitations.

As a key advantage, deep learning methods are able to properly generalize with certain sets of labelled input data. Deep learning allows inferring a pattern from raw inputs, such as images and LIDAR sensor data, which can lead to proper behaviour even in unknown situations. Concerning the UAV indoor navigation task, recent advances have led to a successful application of CNNs in order to map images to high-level behaviour directives (e.g., turn left, turn right, rotate left, and rotate right) [38, 39]. In [38], the Q function is estimated through a CNN, which is trained in simulation and successfully tested in real experiments. In [39], actions are directly mapped from raw images. In all the stated methods, the learned model is run off board, usually taking advantage of a GPU in an external laptop.

With regard to UAV navigation in unstructured environments, some studies have focused on cluttered natural scenarios, such as dense forests or trails [40]. In [40], a DNN model was trained to map images to action probabilities (turn left, go straight, or turn right) with a final softmax layer and was tested on board by means of an ODROID-U3 processor. The performance of two automated methods, an SVM and the method proposed in [76], is then compared to that of two human observers.
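A hypothetical Keras sketch of this image-to-action pattern, a small CNN ending in a three-way softmax over turn left / go straight / turn right, is shown below; the layer sizes and input resolution are illustrative and do not reproduce the architecture of [40].

```python
import tensorflow as tf

# Illustrative image-to-action classifier: a camera frame is mapped to probabilities
# over three high-level steering commands.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(101, 101, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # P(left), P(straight), P(right)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use labeled trail frames, e.g. model.fit(frames, action_labels);
# at flight time: action = model.predict(frame[None]).argmax()
```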
In [37], navigable areas are predicted from a disparity image in the form of up to three bounding boxes. The center of the biggest bounding box found is selected as the next waypoint. Using this strategy, UAV flights are successfully performed. The main drawback is the requirement to send the disparity images to a host device where all computations are made. The whole pipeline for the UAV horizontal translation, disparity map generation, and waypoint selection takes about 1.3 seconds, which makes navigation still quite slow for real applications. On the other hand, low-level motion control is challenging, since dealing with continuous and multivariable action spaces can become an intractable problem. Nevertheless, recent works have proposed novel methods to learn low-level control policies from imperfect sensor data in simulation [41, 63]. In [63], a Model Predictive Controller (MPC) was used to generate data at training time in order to train a DNN policy, which was allowed to access only raw observations from the UAV onboard sensors. At testing time, the UAV was able to follow an obstacle-free trajectory even in unknown situations. In [41], the well-known Inception v3 model (a pretrained CNN) was adapted in order to enable the final layer to provide six action nodes (three transitions and
Table 1: Deep learning-based UAV applications grouped by learning algorithms and application fields.
three orientations). After retraining, the UAV managed to cross a room filled with a few obstacles in random locations.

Deep learning techniques for robotic motion control can provide increasing benefits in order to infer complex behaviours from raw observation data. Deep learning approaches have the potential of generalization, although current methods still have to overcome the difficulties of continuous state and action spaces, as well as issues related to sample efficiency. Furthermore, novel deep learning models require the usage of GPUs in order to work in real time. In this context, onboard GPUs, Field Programmable Gate Arrays (FPGAs), or Application-Specific Integrated Circuits (ASICs) are a matter of importance which hardware manufacturers should take into consideration.

6. Discussion

Deep learning has arisen as a promising set of technologies for the current demands for highly autonomous UAV operations, due to its excellent capabilities for learning high-level representations from raw sensor data. Multiple success cases have been reported (Tables 1 and 2) in a wide variety of applications.

A straightforward conclusion from the surveyed articles is that images acquired from UAVs are currently the prevailing type of information being exploited by deep learning, mainly due to the low cost, low weight, and low power consumption of image sensors. This noticeable fact explains the dominance of CNNs among the deep learning algorithms used in UAV applications, given the excellent capabilities of CNNs in extracting useful information from images.

However, deep learning techniques, UAV technology, and the combined use of both still present several challenges, which are preventing faster and further advances in this field.

Challenges in Deep Learning. Deep learning techniques are still facing several challenges, beginning with their own theoretical understanding. An example of this is the lack of knowledge about the geometry of the objective function in deep neural networks, or why certain architectures work better than others. Furthermore, a lot of effort is currently being put into finding efficient ways to do unsupervised learning, since collecting large amounts of unlabeled data is nowadays becoming economically and technologically less expensive. Success in this objective will allow algorithms to learn how the world works by simply observing it, as we humans do.

Additionally, as mentioned in Section 2.3, real-world problems that usually involve high-dimensional continuous state spaces (large numbers of states and/or actions) can turn the problem intractable with current approaches, severely limiting the development of real applications. An efficient way of coping with these types of problems remains an unsolved challenge.

Challenges in UAV Autonomy. UAV autonomous operations, enabling safe navigation with little or no human supervision, are currently key for the development of several civilian and military applications. However, UAV platforms still have important flight endurance limitations, restricting the size, weight, and power consumption of the payload. These limitations arise mainly from the current state of sensor and battery technology and limit the required capabilities for autonomous operations. Undoubtedly, we will see developments in these areas in the forthcoming years.

Furthermore, onboard processing is desired for many UAV operations, especially those where communications can compromise performance, such as when large amounts of data have to be transmitted and/or when there is limited bandwidth available. Today, the design of powerful miniaturized computing devices with low power consumption, [...]
Table 2: Deep learning-based UAV applications grouped by the type of system within an unmanned aerial systems architecture, the sensor technologies, and the type of learning algorithm: supervised (S), unsupervised (U), and reinforcement (R).
[12] …, "machine learning algorithm," in Proceedings of the 2016 ASABE Annual International Meeting, American Society of Agricultural and Biological Engineers, p. 1, 2016.
[13] J. L. Sanchez-Lopez, M. Molina, H. Bavle et al., "A multi-layered component-based approach for the development of aerial robotic systems: the Aerostack framework," Journal of Intelligent & Robotic Systems, pp. 1–27, 2017.
[14] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint, https://arxiv.org/abs/1308.0850.
[15] T. M. Mitchell, Machine Learning, vol. 45 (37), McGraw Hill, Burr Ridge, Ill, USA, 1997.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
[17] S. Hochreiter and J. Schmidhuber, "LSTM can solve hard long time lag problems," in Proceedings of the 10th Annual Conference on Neural Information Processing Systems, NIPS 1996, pp. 473–479, December 1996.
[18] A. Gibson and J. Patterson, Deep Learning, O'Reilly, 2016.
[19] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[20] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, pp. 153–160, 2007.
[21] P. Smolensky, "Information processing in dynamical systems: foundations of harmony theory," Tech. Rep., DTIC Document, 1986.
[22] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[23] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The "wake-sleep" algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
[24] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[25] P. Vincent, H. Larochelle, I. Lajoie, and P. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[26] R. Salakhutdinov and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[27] A. Krizhevsky and G. E. Hinton, "Using very deep autoencoders for content-based image retrieval," in Proceedings of the 19th European Symposium on Artificial Neural Networks (ESANN '11), Bruges, Belgium, April 2011.
[28] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[29] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, MIT Press, Cambridge, UK, 1998.
[30] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," arXiv preprint, https://arxiv.org/abs/1312.5602.
[31] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[32] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in Proceedings of the 33rd International Conference on Machine Learning, vol. 48, pp. 2829–2838, New York, NY, USA, June 2016, preprint https://arxiv.org/abs/1603.00748.
[33] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," preprint, https://arxiv.org/abs/1509.02971.
[34] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016, preprint https://arxiv.org/abs/1504.00702.
[35] M. Zhang, Z. McCarthy, C. Finn, S. Levine, and P. Abbeel, "Learning deep neural network policies with continuous memory states," in Proceedings of the 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, pp. 520–527, May 2016.
[36] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," in Proceedings of the 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, pp. 528–535, May 2016.
[37] U. Shah, R. Khawad, and K. M. Krishna, "DeepFly: towards complete autonomous navigation of MAVs with monocular camera," in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP '16, pp. 59:1–59:8, New York, NY, USA, 2016.
[38] F. Sadeghi and S. Levine, "Real single-image flight without a single real image," preprint, https://arxiv.org/pdf/1611.04201.pdf.
[39] D. K. Kim and T. Chen, "Deep neural network for real-time autonomous indoor navigation," preprint, https://arxiv.org/abs/1511.04668.
[40] A. Giusti, J. Guzzi, D. C. Ciresan et al., "A machine learning approach to visual perception of forest trails for mobile robots," IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–667, 2016.
[41] K. Kelchtermans and T. Tuytelaars, "How hard is it to cross the room? – Training (recurrent) neural networks to steer a UAV," preprint, https://arxiv.org/abs/1702.07600.
[42] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 580–587, Columbus, Ohio, USA, June 2014.
[43] R. Girshick, "Fast R-CNN," in Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV '15), pp. 1440–1448, December 2015.
[44] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, vol. 28, pp. 91–99, 2015.
[45] J. Lee, J. Wang, D. Crandall, S. Šabanovic, and G. Fox, "Real-time, cloud-based object detection for unmanned aerial vehicles," in Proceedings of the 1st IEEE International Conference on Robotic Computing (IRC), pp. 36–43, Taichung, Taiwan, April 2017.
[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016, preprint https://arxiv.org/abs/1506.02640.
[47] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," preprint, https://arxiv.org/abs/1612.08242.
[48] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, 2016.
[49] W. Li, H. Fu, L. Yu, and A. Cracknell, "Deep learning based oil palm tree detection and counting for high-resolution remote sensing images," Remote Sensing, vol. 9, no. 1, p. 22, 2017.
[50] S. W. Chen, S. S. Shivakumar, S. Dcunha et al., "Counting apples and oranges with deep learning: a data-driven approach," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 781–788, 2017.
[51] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Proceedings of the 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014, pp. 487–495, December 2014.
[52] O. A. B. Penatti, K. Nogueira, and J. A. Dos Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2015, pp. 44–51, June 2015.
[53] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery," Remote Sensing, vol. 7, no. 11, pp. 14680–14707, 2015.
[54] A. Gangopadhyay, S. M. Tripathi, I. Jindal, and S. Raman, "SA-CNN: dynamic scene classification using convolutional neural networks," preprint, https://arxiv.org/abs/1502.05243.
[55] C. Hung, Z. Xu, and S. Sukkarieh, "Feature learning based approach for weed classification using high resolution aerial images from a digital camera mounted on a UAV," Remote Sensing, vol. 6, no. 12, pp. 12037–12054, 2014.
[56] J. Rebetez, H. F. Satizábal, M. Mota et al., "Augmenting a convolutional neural network with local histograms – a case study in crop classification from high-resolution UAV imagery," in Proceedings of the European Symposium on Artificial Neural Networks, 2016.
[57] J. Delmerico, E. Mueggler, J. Nitsch, and D. Scaramuzza, "Active autonomous aerial exploration for ground robot path planning," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 664–671, 2017.
[58] J. Delmerico, A. Giusti, E. Mueggler, L. M. Gambardella, and D. Scaramuzza, ""On-the-spot training" for terrain classification in autonomous air-ground collaborative teams," in Proceedings of the International Symposium on Experimental Robotics (ISER), EPFL-CONF-221506, 2016.
[59] T. Taisho, L. Enfu, T. Kanji, and S. Naotoshi, "Mining visual experience for fast cross-view UAV localization," in Proceedings of the 8th Annual IEEE/SICE International Symposium on System Integration, SII 2015, pp. 375–380, December 2015.
[60] T.-Y. Lin, Y. Cui, S. Belongie, and J. Hays, "Learning deep representations for ground-to-aerial geolocalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 5007–5015, June 2015.
[61] F. Aznar, M. Pujol, and R. Rizo, "Visual navigation for UAV with map references using ConvNets," in Advances in Artificial Intelligence, vol. 9868 of Lecture Notes in Computer Science, pp. 13–22, Springer, 2016.
[62] G. J. Mendis, T. Randeny, J. Wei, and A. Madanayake, "Deep learning based Doppler radar for micro UAS detection and classification," in Proceedings of the MILCOM 2016 – 2016 IEEE Military Communications Conference (MILCOM), pp. 924–929, Baltimore, Md, USA, November 2016.
[63] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," in Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 528–535, Stockholm, Sweden, May 2016.
[64] T. Morito, O. Sugiyama, R. Kojima, and K. Nakadai, "Partially shared deep neural network in sound source separation and identification using a UAV-embedded microphone array," in Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, pp. 1299–1304, October 2016.
[65] S. Jeon, J.-W. Shin, Y.-J. Lee, W.-H. Kim, Y. Kwon, and H.-Y. Yang, "Empirical study of drone sound detection in real-life environment with deep neural networks," preprint, https://arxiv.org/abs/1701.05779.
[66] D. Maturana and S. Scherer, "3D convolutional neural networks for landing zone detection from LiDAR," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '15), pp. 3471–3478, IEEE, Washington, DC, USA, May 2015.
[67] Y. LeCun, B. E. Boser, J. S. Denker et al., "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed., vol. 2, pp. 396–404, 1990.
[68] A. Ghaderi and V. Athitsos, "Selective unsupervised feature learning with convolutional neural network (S-CNN)," in Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2486–2490, December 2016.
[69] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.
[70] H. Kim, D. Kim, S. Jung, J. Koo, J.-U. Shin, and H. Myung, "Development of a UAV-type jellyfish monitoring system using deep learning," in Proceedings of the 12th International Conference on Ubiquitous Robots and Ambient Intelligence, URAI 2015, pp. 495–497, October 2015.
[71] N. V. Kim and M. A. Chervonenkis, "Situation control of unmanned aerial vehicles for road traffic monitoring," Modern Applied Science, vol. 9, no. 5, pp. 1–13, 2015.
[72] M. Bejiga, A. Zeggada, A. Nouffidj, and F. Melgani, "A convolutional neural network approach for assisting avalanche search and rescue operations with UAV imagery," Remote Sensing, vol. 9, no. 2, p. 100, 2017.
[73] A. Sawarkar, V. Chaudhari, R. Chavan, V. Zope, A. Budale, and F. Kazi, "HMD vision-based teleoperating UGV and UAV for hostile environment using deep learning," CoRR abs/1609.04147, http://arxiv.org/abs/1609.04147.
[74] C. Szegedy, W. Liu, Y. Jia et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 1–9, Boston, Mass, USA, June 2015.
[75] The Technion – Israel Institute of Technology, "Technion aerial systems 2016," in Journal Paper for AUVSI Student UAS Competition, 2016.
[76] P. Santana, L. Correia, R. Mendonça, N. Alves, and J. Barata, "Tracking natural trails with swarm-based visual saliency," Journal of Field Robotics, vol. 30, no. 1, pp. 64–86, 2013.