
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325062

Deep Reinforcement Learning Tf-agent-based Object Tracking with Virtual Autonomous Drone in a Game Engine
Khurshedjon Farkhodov 1, Suk-Hwan Lee 2, Ing. Jan Platos 3, Ki-Ryong Kwon 1

1 Department of AI Convergence, Pukyong National University, Busan, 48513, South Korea
2 Department of Computer Engineering, Donga University, Busan, 49315, South Korea
3 Department of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Czech Republic

Corresponding author: Ki-Ryong Kwon (e-mail: [email protected]).


This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-2020-0-
01797) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and the MSIT (Ministry of Science and ICT), Korea, under the
ICT Consilience Creative program (IITP-2023-2016-0-00318) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

ABSTRACT The recent development of object-tracking frameworks has affected the performance of many manufacturing and service industries, such as product delivery, autonomous driving systems, security systems, military and transportation, retailing, smart cities, healthcare systems, agriculture, etc. Achieving accurate tracking results in physical environments and conditions is much more challenging. However, the process can be experimented with using simulation techniques or platforms to evaluate and check the model's performance under different simulation conditions and weather changes. This paper presents a target tracking approach based on reinforcement learning integrated with a tf-agent (TensorFlow-Agent) to accomplish the tracking process in the Unreal Game Engine simulation platform, Blocks. The productivity of these platforms can be seen while experimenting in virtual-reality conditions with virtual drone agents and performing fine-tuning to achieve the best or desired performance. In this proposal, the tf-agent drone learns how to track an object by integrating a deep reinforcement learning process to control the actions, states, and tracking while receiving sequential frames from a simple Blocks environment. The tf-agent is trained in the Blocks environment to adapt to the environment and existing objects in the simulation for further testing and evaluation regarding tracking accuracy and speed. We have tested and compared two algorithm methods, DQN- and PPO-based trackers, integrated with the simulation process regarding stability, rewards, and numerical performance.

INDEX TERMS Object Tracking, Object Detection, Reinforcement Learning, AirSim, Virtual
Environment, Virtual Simulation, tf-agent, Unreal Game Engine

I. INTRODUCTION
Unmanned Aerial Vehicles are widely utilized in every field of manufacturing and daily service procedures [1], where drones help make the process easy by automating and decreasing the consumption time with safer service activities. For example, in recent years, drones have become a top priority for technical assistance in the agricultural field for the medication and fertilization processes; additionally, drones can provide proper observation of cultivated field mapping [2] with important information about any irregular changes in the grower's field. In most cases, it can be beneficial for increasing the productivity of large plantations with congenerous growers to determine drainage patterns and wet-dry spots of field elevation that allow more efficient watering techniques. Moreover, there are several fields in which drones are being applied, such as surveillance and controlling large unreachable areas [3], drones for medical purposes such as Automated Emergency Drones (AED) [4], autonomous drones for search and rescue operations [5], UAV-based road traffic monitoring applications [6], drones for weather sensing in an urban environment application [7], emergency drone system applications for firefighter processes [8], and drones for entertainment: photography and cinematography [9]. Many recent innovations and technologies related to UAV frameworks include surveillance and target tracking units for several reasons, including security maintenance, control, military assistance, and traffic monitoring, besides manufacturing optimization and control. The current

development of innovative technologies is taking novel ideas from modeling systems such as virtual reality environment simulators. Creating a virtual version of the physical objects and process simulations can provide proper service optimization and allow for free experimentation with conditions for any activity.

In recent years, most of the UAV-based research community has paid attention to improving the performance of visual tracking techniques with several neural network-based architectures, such as CNN [10], DNN [11], LSTM [12], RL [13], and others. However, these research works present training and testing results on physical-environment datasets, such as image collections taken from different cameras in public areas and image and video sets taken by drone cameras. These object-tracking applications work with certain object classes where the drone tracks dynamic objects in an exact pathway and localizes objects with specific methods as an additional task for the target-tracking framework. Still, there are challenges while tracking moving and static objects with apparently identical aspects to recognize which one is the actual tracking target. Recent object-tracking research development has begun with intensive learning on virtual reality integration platforms [14], [15], so scientists can integrate their technique or algorithm with a virtual reality platform to test their proposals with various fine-tuning parameters and conditions. It gives scholars more opportunities to explore their methods better and more deeply in a hardware-free environment at zero cost, and to optimize them as much as possible. Additionally, there are some techniques motivated by a robust extension of integral schemes for mismatched uncertain nonlinear systems proposed to support asymptotic tracking [16]. Asymptotic tracking means ensuring that the system's output tracks a desired reference trajectory over time with negligible tracking error. The main goal is to design a tracking control system that guarantees the output converges to the reference trajectory as time approaches infinity in uncertain environments. Another model is the output feedback adaptive RISE control technique [17] used for uncertain nonlinear systems to achieve accurate tracking of desired trajectories. The term "adaptive" indicates that the controller parameters are updated online based on the system's behavior and the tracking error. A deep Q-learning-based approach [18] has been suggested for firefighting situations, which is obtainable in some agent robots or drones for finding or planning paths and navigating through fire environments. In such complex and hazardous cases, it is required to be more careful to control the situation with concrete plans and actions for rescuing injured people or victims of the incident by coordinating situational awareness with other rescuers, which is an urgent task. When the framework is installed and applied to real drones or robots, it can ensure firefighters or rescuers make the right decision in extreme, panicky, and disorienting conditions.

Several relevant research studies have been published that integrate virtual simulation platforms with proposed algorithms. Kalidas A. P. et al. [19] presented vision-based navigation of UAVs based simply on image data by employing deep reinforcement learning to avoid stationary and movable obstacles autonomously in discrete and continuous action space. W. Zhao et al. [20] also proposed perception-based hierarchical active tracking control for UAVs deploying a high-level controller and action orders in a V-REP-based environment. A pretrained PPO algorithm [21] with reward shaping for aircraft guidance to a moving destination in a three-dimensional continuous space model was suggested, with agent-specific target guidance in virtual state space using a novel reward calculation. A PPO-based DRL algorithm [22] was suggested for UAV tracking with the assistance of another UAV, introducing a generalized distributed deep reinforcement learning platform, which provides solutions to overcome various problems such as tracking, controlling, and mission coordination of UAVs. Moreover, M. A. B. Abdelkader et al. [23] propose RL-based drone elevation control on a Python-Unity integrated simulation framework to achieve a stable user datagram protocol (UDP) connection with the suggested algorithm. E. Çetin [24] proposes countering a drone in a 3D space, with several DRL methods presented to counter drones with another drone in the environment provided by an AirSim simulator.

In this study, we developed an algorithm based on tf-agent drone tracking in a Blocks environment, where the tf-agent actively makes decisions to track the target object in the runtime environment. This proposal includes different reward techniques to boost the learning, tracking, and decision-making processes via a TF-agent-based drone in a simulation platform. There is some computational consideration for correctly applying parameter values to achieve a higher accuracy rate, and the state representation was formulated to clear out unnecessary losses and constraints for the training and testing processes.

The following illustrations show our work's primary contributions:
• We introduce a virtual environmental simulation-based object-tracking algorithm model that receives input images directly from a realistic virtual platform.
• Direct access to the network-feedable source images from the simulation environment makes the framework more advantageous to learn and test when it comes to unknown environmental conditions.
• The experiment is implemented in an AirSim-based basic Blocks environment with a random, particular walking person to track by a virtual drone agent.
• Two different methods were adopted and integrated with the virtual simulation platform to demonstrate the performance of the models.

FIGURE 1. Proposed DRL-based TF-agent object tracking baseline integration with the Game Engine.

II. RELATED WORKS
The recent development of object tracking via reinforcement learning has improved by integrating it with many target tracking techniques, which produce better performance with decision-making in tracking procedures. Although most visual tracking concepts based on DRL could perform better in the case of the representation model with adopted manners for locating the target object within a search region, the final estimated target coordinates are ideally centered.

A. OBJECT TRACKING VIA DEEP REINFORCEMENT LEARNING
The advancement of object tracking via reinforcement learning is a comparatively novel idea, where object localization and tracking integration with a decision-making model [13], [25]-[30] are applied to the learning and tracking process as well. Several studies have discovered that combining deep learning and RL [31] in various settings confers many advantages. Visual object tracking [32], localizing temporal activity [33], identifying object classes [26], object recognition through video sequences [34], and segmentation [35] are just a few of the computer vision problems that have used DRL. Notably, visual object tracking via DRL framework studies have increased in recent years, where DRL was associated with several techniques to strengthen the training and decision-making ability while targeting the object location. The agent must estimate the target position (bounding box) in every sequence frame in the most typical use of DRL on visual object tracking by repeatedly selecting ultimate fitting actions to get accurate tracking results.

Accordingly, the state representation is the fulfillment status of the general frame states within a targeted bounding box. In general, actions are the transformation result of the bounding box while tracking, which can be shift, scale, and turn actions depending on how the network learned and adapted to the environment at training time. In DRL-based object tracking, accuracy (precision) is emphasized as a reward value, showing the difference between the targeted action bounding box and ground truth values. In general, it is called intersection-over-union (IoU). Reward values will change according to the action value difference with the ground truth output, which shows tracking accuracy.

B. VIRTUAL SIMULATION-BASED OBJECT TRACKING VIA DEEP REINFORCEMENT LEARNING
In the last few years, most research topics have interacted with innovative trends in virtual simulation world environments that allow the simulation of any action, object, or process, enabling experimentation with complex conditions to manage and optimize results. Algorithm integration with simulation platforms makes it challenging to conduct testing and experimentation while taking advantage of simulation behavior closely related to real-world models with dynamic and inactive action modes. Several simulation platforms have flexible functionality to connect with software algorithms for experimentation. The most widely used open-source platforms currently are AirSim [14] and Unity [15], which intend to bridge the gap between the virtual and real worlds to support the development of autonomous control and a realistic replica of the actual world. Both platforms are advancing their technical abilities with high intensity to positively influence the development and testing of data-driven machine intelligence techniques such as reinforcement learning and deep learning. W. Luo et al. suggested an active object tracking technique [29] via deep reinforcement learning, in which a drone agent adopted the ConvNet-LSTM function approximator for predicting the target movement using a frame-to-action strategy. Besides, they perform additional (ViZDoom and Unreal Engine simulation) environment augmentation techniques and a customized reward function to boost the training process to achieve better target tracking performance.


FIGURE 2. The flow chart of the DRL-based tf-agent object tracking model in Game Engine.

Another virtual simulation-based approach [36] uses a monocular onboard camera via a DRL model to follow the detected target object. They state that this technique is a more accurate and cost-efficient strategy for adopting an algorithm in a virtual environment by using multiple sensor data points from the pre-calculated trajectory. The proposed model combines one of the object detection models, called MobileNet [37], to get the bounding box information from the image input of the learning process. The model includes convergence-based exploration and exploitation for adaptively aligning algorithms with the network.

Moreover, J. Schulman et al. suggested a reinforcement learning-based drone follow-me behavior object tracking framework [38] using the Deep Q-Learning (DQN) model to control RL agents with adaptive and flexible behavior. In this object-tracking model, stacked image frames and depth information are integrated as input frames to the learning and testing process. The proposed model has been experimented with in different-level environments with several reasonable structural changes. Experimental output with several specific conditions showed that the RL-based drone following technique succeeded in its adaptive and generalizing behavior.

In our recent research, we proposed virtual simulation-based visual object tracking via a deep reinforcement learning algorithm [25], which the AirSim drone agent uses to track the targeted object class in a runtime virtual simulation environment by utilizing sequential frames directly from it. Additionally, the suggested model has been tested with a public dataset to evaluate the performance of recent research outputs. The main advantage of a virtual simulation platform is that researchers can conduct experimentation several times with different fine-tuning techniques at no cost until they improve their proposal with high accuracy. Accordingly, generating new, fake, or augmented data, or collecting or reusing data from public sets to reinforce model learning and boost localization exactness while decreasing estimation time and human effort, is unnecessary.

III. PROPOSED METHOD
In this technique, we created an algorithm implementation that includes several components of the object tracking framework, including training and tracking, and we evaluated it in a virtual simulation environment. The framework's fundamental idea is to learn the action space using direct input from a virtual simulation platform. Platforms allow training the network with little effort spent on actively learning and tracking operations. However, the method must be correctly linked with the Q-learning network model to get the required object feature information to study the environment and aid in making continuous action decisions in each frame of the tracking sequence.

A. ALGORITHM BASELINE
Figure 1 shows the baseline illustration of our proposed method. Firstly, the AirSim simulation platform must be installed and set with the required characteristic parameters to integrate the designed algorithm model. We manually insert an object into the simulation platform with a defined walking route around the particular location specified in the virtual environment part of the pipeline (Figure 1). The virtual simulation platform provides essential input frame sequences with feature information, such as ordinary, segmented, and gray-scale (negative) depth images, that could proceed through tf-agent DRL network layers to learn and take action for target tracking measures. We use image depth to identify the object location and targeted class while experimenting through the network for adaptation to unknown conditions.

B. TF-AGENT-BASED DRL OBJECT TRACKING MODEL
Tracking objects on a virtual simulation platform differs from typical state-of-the-art target-tracking framework approaches. The target object moves automatically across the simulation platform area, occluded by obstacles such as high walls, several different-shaped objects, etc. In this proposal, a random walking pedestrian was set into a simulation environment to create learning and tracking conditions for a virtual AirSim drone agent. As shown in Figure 2, we integrated the simulation platform with the suggested alternative algorithm model to jointly optimize representatives by experimenting in different conditions. Firstly, we request the environment simulation platform to get the typical depth images and the segmentation map to get the pixels with a target. In the next step, frames will be gray-scaled and normalized for further recognition of an object in the virtual simulation model. After getting the pixels with the target, the bounding box points are concatenated and transformed into network-readable values for the following process.
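As a concrete illustration of this acquisition and preprocessing step, the sketch below uses the AirSim Python client to request scene, segmentation, and depth frames and to derive a rough bounding box from the target's segmentation pixels. The camera name, the segmentation ID, and the single-channel ID comparison are simplifying assumptions for illustration, not the exact configuration used in this work.

```python
import numpy as np
import airsim  # AirSim Python client for the Unreal/Blocks simulation

client = airsim.MultirotorClient()
client.confirmConnection()

def get_observation(target_seg_id=42):
    """Request scene, segmentation, and depth frames from the simulator and
    derive a rough bounding box for the target (camera/ID are assumptions)."""
    responses = client.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),
        airsim.ImageRequest("0", airsim.ImageType.Segmentation, False, False),
        airsim.ImageRequest("0", airsim.ImageType.DepthPerspective, True),
    ])
    scene_raw, seg_raw, depth_raw = responses

    scene = np.frombuffer(scene_raw.image_data_uint8, dtype=np.uint8)
    scene = scene.reshape(scene_raw.height, scene_raw.width, -1)[:, :, :3]
    seg = np.frombuffer(seg_raw.image_data_uint8, dtype=np.uint8)
    seg = seg.reshape(seg_raw.height, seg_raw.width, -1)[:, :, :3]
    depth = np.array(depth_raw.image_data_float, dtype=np.float32)
    depth = depth.reshape(depth_raw.height, depth_raw.width)

    # Gray-scale and normalize the scene frame for the network input.
    gray = scene.mean(axis=2) / 255.0

    # Pixels belonging to the target; a single-channel ID match is shown for
    # brevity (AirSim actually maps segmentation IDs to palette colors).
    ys, xs = np.where(seg[:, :, 0] == target_seg_id)
    bbox = None
    if xs.size > 0:
        bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return gray, depth, bbox
```

The returned grayscale frame, depth map, and bounding box would then be flattened or stacked into the network-readable observation described above.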


1) DQN-BASED TF-AGENT
The DQN agent is suitable for any environmental condition possessed by a discrete action space, formulated deterministically for simplicity and with expectations over stochastic environmental transitions. The main goal of the DQN agent in this model is to train a policy to maximize the discounted cumulative reward (1).

R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t    (1)

This is also known as the return value R_{t_0}. In most RL-based networks, the discount factor \gamma should be a constant value between 0 and 1 so that the sum converges. It allows our agent to gain better reward value results by avoiding uncertain environment feature information and identifying what is less relevant than a fairly confident one. The goal of Q^{*}: \text{State} \times \text{Action} \to \mathbb{R} is to achieve an affordable reward or return value: when the action is taken in a given state, the return results from a constructed policy that achieves maximized rewards (2).

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)    (2)

In our virtual reality world simulation model, we will have access to the state and action space information related to the Q^{*} value function to create and train the Q-network. Most of the Q functions in the case of policy-required conditions obey Bellman's equation (3) [39].

Q^{\pi}(s, a) = R + \gamma Q^{\pi}(s', \pi(s'))    (3)

The difference between the initial and learned value calculations follows the equality equation, also known as the Q-value update [39] or the temporal difference error (4).

\delta = Q^{new}(s, a) = (1 - \alpha)\,\underbrace{Q(s, a)}_{\text{old value}} + \alpha\,\underbrace{\left(R_{t+1} + \gamma \max_{a'} Q(s', a')\right)}_{\text{learned value}}    (4)

Equation (4) above calculates the updated Q-value for the state-action pair (s, a) at time step t. It is assumed to be equal to a weighted sum of old and learned values, where the initial old value would be 0 since the agent is experiencing this particular state-action pair for the first time. The old value is multiplied by (1 - \alpha). The learning rate is denoted \alpha and set as \alpha = 0.001 for our default training network. Instead of overwriting with the newly calculated Q-value, the learning rate \alpha determines how much of the previously computed Q-value is kept for the initial state-action pair. To retain the recently obtained Q-value later, we give a higher learning rate for the equal state-action match to adapt the drone agent quickly to the computed Q-value. However, the learning rate should be balanced to keep the trade-off between the previous and new Q-values for the further training process. The learned value is the reward R_{t+1} that the drone agent receives moving randomly from the starting state point, plus the discounted estimation \gamma of the optimal future Q-value for the new state-action match (s', a') one time step later. The learned value is multiplied by the learning rate \alpha to get the optimal policy value update. The Q-learning update process is illustrated in Figure 3.

As illustrated in Figure 3, there can be several actions through the training or learning process where an agent chooses the seemingly optimal actions Q^{\pi}(s_t, a_t) and receives a reward for the agent's performance through steps in a virtual environment. For further learning, the agent should choose an action from the S_{t+1} state to continuously learn and analyze the environment with more profound feature results. Here, the epsilon-greedy option is a straightforward strategy for balancing exploration and exploitation by randomly selecting between the two. The method, where epsilon is the likelihood of selecting to explore or exploit, determines whether it proceeds to explore the environment with a slight chance. The mathematical formulation of action selection with the epsilon-greedy method is given in equation (5).

a_t = \begin{cases} \arg\max_{a} Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{any action } a & \text{with probability } \epsilon \end{cases}    (5)

The action selection method for further learning can be detailed thoroughly: with probability 1 - \epsilon, the agent uses exploitation to take advantage of prior knowledge, which is the best-estimated reward; otherwise, with probability \epsilon, it uses exploration to look for new optimal options.

The value of each action must be specified for our agent to choose the one that will result in the best reward. The action-value estimation function (6) uses probability theory to define these values. The predicted reward received when choosing an action from a list of all potential actions is referred to as the "value of that action". So, we utilize the "sample-average" approach to estimate the value of taking an action, since the agent does not know the value of choosing a particular action.

Q_t(a) = \frac{\text{Sum of rewards when action } a \text{ taken before time } t}{\text{Number of times action } a \text{ taken before time } t} = \frac{\sum_{i=1}^{t-1} R_i}{t - 1}    (6)

The agent will then select the action with the most outstanding estimated value, referred to as a greedy action, once the value Q(s', a') has its peak rate.

FIGURE 3. The process of Q-learning update.
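The tabular form of the update in equation (4) and the epsilon-greedy rule in equation (5) can be written as the short sketch below; the TF-Agents DQN agent used in this work applies the deep, batched analogue of this rule internally, so the state and action counts here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 64, 7          # illustrative sizes, not the paper's setup
alpha, gamma, epsilon = 0.001, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def select_action(state):
    """Epsilon-greedy action selection, equation (5)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore with probability epsilon
    return int(np.argmax(Q[state]))           # exploit with probability 1 - epsilon

def q_update(s, a, r, s_next):
    """Weighted old/learned value update, equation (4)."""
    learned = r + gamma * np.max(Q[s_next])   # R_{t+1} + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * learned
```

In the deep setting, the table is replaced by the Q-network and the update becomes a gradient step on the temporal-difference error.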


2) PPO-BASED TF-AGENT
Proximal Policy Optimization (PPO) is a straightforward policy gradient approach for RL-based optimization problems that alternates between optimizing a "surrogate" objective function using stochastic gradient ascent and sampling data through interaction with the environment [30]. The main difference between the primary and novel policy gradient methods is that a multiple-update minibatch objective function is applied, while the standard model updates the gradient per data sample in a single epoch.

A policy gradient technique known as the Proximal Policy Optimization (PPO) algorithm is applied to improve the policy of a reinforcement learning agent. PPO is a set of algorithms that includes PPO1 and PPO2. In this proposal, we will focus on the PPO1 algorithm. The Clipped Surrogate Goal is a drop-in substitute for the policy gradient objective to increase training stability by restricting the policy change at each step. To address these and other difficulties, we may limit the amount, alter the policy, and ensure it constantly improves. Furthermore, implementing this model helps to integrate with a complete processing algorithm to achieve efficient samples from input images and minimize hyperparameter tuning indicators. It achieves the same performance improvements while avoiding complexity by optimizing the basics of the Clipped Surrogate Objective (7).

L^{CLIP}_t(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right]    (7)

Here, r_t(\theta)\hat{A}_t identifies the same objective as before, inside the minimization; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t is the same objective, but with r_t(\theta) clipped between (1 - \epsilon, 1 + \epsilon); and the complete \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t) term takes the minimum of the original objective and the clipped one. The main objective of clipping surrogates is a region clipping process that prevents the algorithm from getting too greedy and trying to update too much at once while training and learning, so as to stay in the region with good samples for estimation and summarizing. PPO enables us to conduct many gradient ascent epochs on our data stream without triggering harmfully massive policy modifications. Conducting these processes helps get more out of collected or streamed data while decreasing sample inefficiency.

Moreover, the running PPO policy uses N parallel actors that individually collect data. The data is collected into mini-batches and then trained for K epochs using the Clipped Surrogate Objective function.

The Clipped Surrogate Objective will affect and optimize every action the agent takes. Updating should be stopped if the action is better (positive, A > 0) and more probable while taking the end of gradient steps. Otherwise, when the drone action directs in the wrong direction but the action is good, it can be redirected or undone from the initial state. In the case of a lousy action (negative, A < 0) and less probable outcomes gained from the agent's actions, the agent needs to take short steps, or it does not need to go far in the action space. When it comes to the normalized level of updating, it can be controlled in the balanced area to get the optimal probability ratio r for the agents. The illustration of the L^{CLIP} surrogate function probability can be seen below in the summarizing Figure 4.

Figure 4 illustrates the clipped surrogate objective function optimization parameters for the running learning period with the probability ratio. Effectively, this technique can be encouraged using significant policy changes across learning environments or input data for better probability optimization with a model agent. PPO with the clipped objective technique shows the difference between two maintained policy networks, the current \pi_{\theta}(a_t|s_t) and the last used policy \pi_{\theta_k}(a_t|s_t) applied to collect samples. A new policy evaluation comes from importance sampling, which involves collecting old policy samples to improve efficiency.

The final loss function for the PPO actor-critic style is given in equation (8) below, a combination of the Clipped Surrogate Objective function, the Value Loss Function, and an Entropy bonus.

L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_{\theta}](s_t)\right]    (8)

The given equation above includes several parts that can be complex to understand, yet it gives more priority to achieving more accurate results while applying them to the experimentation process. The first part is the Clipped Surrogate Objective function given in equation (7). In (8), c_1 and c_2 are coefficients of the value of the related parameter for the calculation. L^{VF}_t(\theta) identifies a squared-error value loss: (V_{\theta}(s_t) - V_t^{targ})^2. To ensure sufficient exploration of unknown and complex scenarios in virtual environments, an entropy bonus S[\pi_{\theta}](s_t) is added.
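A minimal TensorFlow sketch of equations (7) and (8) is shown below. The clipping range and the c1, c2 coefficients are common defaults rather than the values used in our experiments, and the function assumes log-probabilities, advantages, value predictions, returns, and entropies are supplied by the surrounding training loop.

```python
import tensorflow as tf

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate objective (7) combined with the value loss and
    entropy bonus as in (8); coefficients here are illustrative defaults."""
    ratio = tf.exp(new_logp - old_logp)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    l_clip = tf.reduce_mean(tf.minimum(unclipped, clipped))    # surrogate term
    l_vf = tf.reduce_mean(tf.square(values - returns))         # squared-error value loss
    l_s = tf.reduce_mean(entropy)                              # entropy bonus
    # Equation (8) is an objective to maximize; return its negative as a loss.
    return -(l_clip - c1 * l_vf + c2 * l_s)
```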


FIGURE 4. Probability ratio r of the surrogate function L^{CLIP} with positive (A > 0) and negative (A < 0) advantages. The red circle on each plot shows the starting point for the optimization, i.e., r = 1. The sum of the surrogate function L^{CLIP} will be performed over many such terms [40].

IV. EXPERIMENTAL RESULTS
One of the main objectives and focuses was to get the most advantage from the simulation platform to perform experiments in different conditions with parametric changes. Many related researchers used several simulation platforms to test and evaluate their algorithms in several evaluation studies. There are different methods to get an advantage from the realistic virtual platform. In most cases, platforms are applied for experimenting purposes only. However, they can also be prioritized widely in learning, training, and testing. The current development of simulation platforms like Unity, Unreal Engine, and Cesium gives great opportunities and advantages to process and experiment with state-of-the-art models in multiple and impractical circumstances. One of the prime features of the simulators is the interconnection between programming languages (JavaScript, Python, Go, Java, Kotlin, PHP, C#, Swift, etc.) and frameworks (Angular, jQuery, React, Ruby on Rails, Vue, ASP.NET Core, Django, Express, etc.). However, building or setting up this type of architecture and framework is quite tricky, and it could only be successful in some cases due to third-party programs' and libraries' conflicts and disproportionality. In this research work, we conducted experiments with different parametric changes and fine-tuning, as explained in the following sections.

A. TRAINING RESULTS
We have trained our proposed model in a simple Blocks environment by inserting randomly moving objects to learn the environmental space and to create a model for future testing and evaluation purposes. We applied two types of tf-agent models, DQN- and PPO-based tf-agents, to achieve more comparable output results with a 0.001 learning rate configuration.

Figure 5 illustrates the minimum reward outputs of the trained models in a typical Blocks environment, where the DQN-based tf-agent and the PPO-based tf-agent models are marked with blue and pink, respectively. The minimum reward is the smallest value that the agent can receive as a reward during the training process. The minimum reward is typically negative, since most problems involve a penalty for making suboptimal decisions; the training epochs and reward were set at 2000 and 50, respectively.

FIGURE 5. The minimum received reward output for two training models: DQN-based TF-AGENT and PPO-based TF-AGENT.

Furthermore, the background was set with a plot tab color for each method to show the overall performance of the training agents. Each agent model initially gained different rewards, whereas the DQN-based agent performed better. Nevertheless, at the end of the training epochs, the PPO-based agent receives better results than the DQN-based model agent. The whole training reward performance illustration in Figure 6 is likewise set to 2000 training epochs and a reward of 50. The maximum point is the highest value that the agent can receive in the training process. The maximum reward is typically a positive value, since most problems involve a reward for making optimal decisions. The DQN-based TF-AGENT model initially gains a higher reward value in this graph. However, the PPO-based model performs better after 400 epochs until the end of the training steps. Understanding the range of possible rewards can help set the hyperparameters of the models, such as the learning rate or the discount factor. It can also help assess the performance of the trained agent, as the rewards obtained by the agent are compared against the minimum and maximum possible values.
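For reference, the sketch below shows how a DQN-based tf-agent with the 0.001 learning rate mentioned above could be constructed with the TF-Agents library. The network layer sizes, discount factor, and exploration setting are assumptions for illustration, and `train_env` stands for a TFPyEnvironment wrapping the AirSim Blocks environment from Section III; a PPO counterpart can be built analogously with `tf_agents.agents.ppo`.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.utils import common

def build_dqn_agent(train_env):
    """Build a DQN tf-agent for the wrapped simulation environment.
    `train_env` is a tf_agents TFPyEnvironment supplied by the caller."""
    q_net = q_network.QNetwork(
        train_env.observation_spec(),
        train_env.action_spec(),
        fc_layer_params=(256, 128))                    # assumed layer sizes

    agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # 0.001 as in the text
        epsilon_greedy=0.1,                            # assumed exploration rate
        td_errors_loss_fn=common.element_wise_squared_loss,
        gamma=0.99)                                    # assumed discount factor
    agent.initialize()
    return agent
```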


FIGURE 6. The received maximum reward output for two types of training models: DQN-based TF-AGENT and PPO-based TF-AGENT.

The average reward given below refers to the mean value of the rewards received by the agents during their interactions with the environment while training using the DQN- and PPO-based model algorithms. The average reward is essential for evaluating the agents' performance during training. During training, agents try to learn an optimal policy that maximizes the cumulative reward obtained over time. The average evaluation reward is calculated by dividing the sum of the rewards received during all episodes by the total number of episodes. The estimated calculation is the average reward the agent will receive when interacting with the environment using the learned policy.

FIGURE 7. The received average rewards outcome of training for DQN and PPO-based TF-AGENTS.

By evaluating the average reward received, we can see the difference between the DQN- and PPO-based models' performance in varied configurations. However, in some scenarios, the average reward may not be the most suitable metric for evaluating the agent's performance.

B. TESTING RESULTS
We have tested our proposed DQN- and PPO-based model agents with the same environmental condition but different unseen test episodes to explore the ability of the models and compare their performance. As mentioned above, the DRL-based algorithm's performance evaluation differs from other state-of-the-art algorithms in the case of performance metrics evaluations and comparison techniques. The agent-based models' precision can be seen or taken as a received reward value. As mentioned earlier, the average reward obtained by the agent during training can help create a model and apply this model to the testing process as a performance metric. This metric measures the agent's ability to navigate the environment and obtain the expected output tracking. The diagram below (Figure 8) represents the DQN-based TF-AGENT model's output with the reward percentage received from unseen testing scenarios. In the testing session, the received reward percentage was set to 100 over the 50 steps of every episode. The overall reward received in every step is marked with a column, and the red line illustrates the smoothed value of the DQN-based TF-AGENT testing results, trained and tested with the standard reward, in Figure 8 (a).

Figure 8 (b) shows the PPO-based TF-AGENT's received percentage reward testing results in the 50 steps of the episode, along with the smoothed red line output. The DQN and PPO models received different rewards in every testing step. As we can see, the testing results show that both models give high accuracy and precise learning performance in every testing step output, with an elevated conclusion.
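A standard TF-Agents-style evaluation loop consistent with the average-reward description above is sketched below (the sum of episode rewards divided by the number of episodes); it assumes a batched TFPyEnvironment with batch size 1 and is illustrative rather than the exact evaluation code used here.

```python
def compute_avg_return(environment, policy, num_episodes=10):
    """Average per-episode return of `policy` in `environment`
    (a tf_agents TFPyEnvironment with batch size 1 assumed)."""
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return float(avg_return)
```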


FIGURE 8. The testing reward distribution of the DQN-based TF-AGENT (a) and the PPO-based TF-AGENT (b) in 50 steps of the episode.
V. CONCLUSION
In this research work, we have presented a DQN- and PPO-based TF-AGENT object tracking framework integrated with a simple Blocks environment to experiment with and evaluate the performance of the proposed algorithm. It has been integrated with the simulation platform to highlight the algorithm's overall performance. The simulation platform provides three types of essential input images to experiment with and evaluate the overall status. While testing in a virtual-reality scenario with virtual drone agents and fine-tuning to reach the best or desired results, the productivity and eligibility of these platforms are vital. The DQN- and PPO-based virtual tf-agent drones learn how to detect and track an object inserted in this platform by obtaining consecutive frames from a primary Blocks environment and using a DRL network to manage the actions, states, and tracking pipeline. Both tf-agents are trained in a Blocks environment to adapt to the surroundings and existing objects in a simulation condition for additional testing, tracking accuracy, and speed assessment. In the training process, both models showed presentable results: minimum rewards of 49 (PPO) and 48 (DQN) in 2000 epochs; maximum rewards of 49 (DQN) and 49 (PPO) in 2000 epochs; and an average reward of 49 for both (PPO and DQN) models. In the case of testing, both models' performance was contrasted over a 50-step, one-episode testing set, where the PPO-based tf-agent gets its peak reward value of 97% in step 23 and the DQN-based agent receives its maximum value of 86% in the 17th step, respectively. However, the overall performance of the received percentage reward graph (Figure 8, a and b) indicates that the DQN-based model's sequential results are better than the PPO-based model's. Regarding stability, reward contribution, and numeric graphical performance, we examined and compared the algorithm techniques with various established hyperparametric changes, with reinforcement learning-based network control incorporated into the simulation process. In future work, we are going to integrate our model with several state-of-the-art tracking techniques to improve the performance of the target tracking framework by testing it in more complex virtual simulation environments.

REFERENCES
1. The Manufacturer, "The benefits of drones in manufacturing", The Manufacturer, 29 Jun. 2022. Available: https://www.themanufacturer.com/articles/the-benefits-of-drones-in-manufacturing/
2. Croptracker, "Drone Technology in Agriculture", Dragonfly IT, 26 April 2022. Available: https://www.croptracker.com/blog/drone-technology-in-agriculture.html
3. Gregory McNeal, "Drones and aerial surveillance: Considerations for legislatures", November 2014. Available: https://www.brookings.edu/research/drones-and-aerial-surveillance-considerations-for-legislatures/
4. Purahong, B., Anuwongpinit, T., Juhong, A., Kanjanasurat, I., and Pintaviooj, C., "Medical Drone Managing System for Automated External Defibrillator Delivery Service", Drones 2022, 6(4), 93. https://doi.org/10.3390/drones6040093
5. Norbert Tusnio and Wojciech Wroblewski, "The efficiency of drones usage for safety and rescue operations in an open area: A case from Poland", Sustainability 2022, 14, 327. https://doi.org/10.3390/su14010327
6. Mouna Elloumi, Riadh Dhaou, Benoit Escrig, and Hanen Idoudi, "Monitoring road traffic with a UAV-based system", IEEE Wireless Communications and Networking Conference, 11 June 2018, doi: 10.1109/WCNC.2018.8377077.
7. Chodorek, A., Chodorek, R.R., and Yastrebov, A., "Weather Sensing in an Urban Environment with the Use of a UAV and WebRTC-Based Platform: A Pilot Study", Sensors 2021, 21, 7113. https://doi.org/10.3390/s21217113


8. Kajal Jewani, Mehak Katra, Dhiren Motwani, and Gaurav Jethwani, "Firefighter drone", Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering, and Technology, ICASISET 2020.
9. C. Huang et al., "Learning to Film From Professional Human Motion Videos," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4239-4248, doi: 10.1109/CVPR.2019.00437.
10. A.-H. A. El-Shafie, M. Zaki, and S. E. D. Habib, "Fast CNN-Based Object Tracking Using Localization Layers and Deep Features Interpolation," 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), 2019, pp. 1476-1481, doi: 10.1109/IWCMC.2019.8766466.
11. R. Ravindran, M. J. Santora, and M. M. Jamali, "Multi-Object Detection and Tracking, Based on DNN, for Autonomous Vehicles: A Review," IEEE Sensors Journal, vol. 21, no. 5, pp. 5668-5677, 1 March 2021, doi: 10.1109/JSEN.2020.3041615.
12. X. Farhodov, K.-S. Moon, S.-H. Lee, and K.-R. Kwon, "LSTM Network with Tracking Association for Multi-Object Tracking," Journal of Korea Multimedia Society, vol. 23, no. 10, pp. 1236-1249, Oct. 2020.
13. D. Gözen and S. Ozer, "Visual Object Tracking in Drone Images with Deep Reinforcement Learning", 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 10082-10089, doi: 10.1109/ICPR48806.2021.9413316.
14. Shah, Shital, Dey, Debadeepta, Lovett, Chris, and Ashish Kapoor, "AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles," arXiv, 2017. https://doi.org/10.48550/arXiv.1705.05065
15. Juliani, Arthur, Berges, Vincent, Teng, Ervin, Cohen, Andrew, Harper, Jonathan, Elion, Chris, Goy, Chris, et al., "Unity: A General Platform for Intelligent Agents," arXiv, 2018. https://doi.org/10.48550/arXiv.1809.02627
16. Guichao Yang, "Asymptotic tracking with novel integral robust schemes for mismatched uncertain nonlinear systems," IJRNC, February 2023, vol. 33, issue 3, pp. 1988-2002.
17. Guichao Yang, Tao Zhu, Fengbo Yang, Longfei Cui, and Hua Wang, "Output feedback adaptive RISE control for uncertain nonlinear systems," ACA, January 2023, vol. 25, issue 1, pp. 433-442.
18. Bhattarai, M. and Martínez-Ramón, M., "A Deep Q-learning based Path Planning and Navigation System for Firefighting Environments", in Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021), Volume 2, pp. 267-277, doi: 10.5220/0010267102670277.
19. A. P. Kalidas, C. J. Joshua, A. Q. Md, S. Basheer, S. Mohan, and S. Sakri, "Deep Reinforcement Learning for Vision-Based Navigation of UAVs in Avoiding Stationary and Mobile Obstacles," Drones, vol. 7, no. 4, p. 245, Apr. 2023, doi: 10.3390/drones7040245.
20. W. Zhao, Z. Meng, K. Wang, J. Zhang, and S. Lu, "Hierarchical Active Tracking Control for UAVs via Deep Reinforcement Learning," Applied Sciences, vol. 11, no. 22, p. 10595, Nov. 2021, doi: 10.3390/app112210595.
21. Wang, Z., Li, H., Wu, Z., and Wu, H., "A pretrained proximal policy optimization algorithm with reward shaping for aircraft guidance to a moving destination in three-dimensional continuous space," International Journal of Advanced Robotic Systems, 2021, 18(1), doi: 10.1177/1729881421989546.
22. Tan, Ziya and Karakose, Mehmet, "A New Approach for Drone Tracking with Drone Using Proximal Policy Optimization Based Distributed Deep Reinforcement Learning," Available at SSRN: http://dx.doi.org/10.2139/ssrn.4393763
23. M. A. B. Abbass and H.-S. Kang, "Drone Elevation Control Based on Python-Unity Integrated Framework for Reinforcement Learning Applications," Drones, Mar. 2023, vol. 7, no. 4, p. 225, doi: 10.3390/drones7040225.
24. E. Çetin, C. Barrado, and E. Pastor, "Countering a Drone in a 3D Space: Analyzing Deep Reinforcement Learning Methods," Sensors, Nov. 2022, vol. 22, no. 22, p. 8863, doi: 10.3390/s22228863.
25. Park, J.-H., Farkhodov, K., Lee, S.-H., and Kwon, K.-R., "Deep Reinforcement Learning-Based DQN Agent Algorithm for Visual Object Tracking in a Virtual Environmental Simulation," Applied Sciences, 2022, 12, 3220. https://doi.org/10.3390/app12073220
26. J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2488-2496.
27. B. Li and Y. Wu, "Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning," IEEE Access, vol. 8, pp. 29064-29074, 2020, doi: 10.1109/ACCESS.2020.2971780.
28. C.-C. Chang, J. Tsai, P.-C. Lu, and C.-A. Lai, "Accuracy improvement of autonomous straight take-off, flying forward, and landing of a drone with deep reinforcement learning," International Journal of Computational Intelligence Systems, 2020, vol. 13, no. 1, pp. 914-919.
29. W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, "End-to-end active object tracking and its real-world deployment via reinforcement learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, vol. 42, no. 6, pp. 1317-1332.
30. D. Zhang, Z. Zheng, R. Jia, and M. Li, "Visual Tracking via Hierarchical Deep Reinforcement Learning," AAAI, vol. 35, no. 4, pp. 3315-3323, May 2021.
31. Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Martin Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv, 2013. https://doi.org/10.48550/arXiv.1312.5602
32. J. Supancic, III, and D. Ramanan, "Tracking as online decision-making: Learning a policy from streaming videos with reinforcement learning," in Proc. Int. Conf. Comput. Vis., 2017, pp. 322-331.
33. W. Wang, Y. Huang, and L. Wang, "Language-driven temporal activity localization: A semantic matching reinforcement learning model," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 334-343.
34. W. Wu, D. He, X. Tan, S. Chen, and S. Wen, "Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6222-6231.
35. J. Han, L. Yang, D. Zhang, X. Chang, and X. Liang, "Reinforcement cutting-agent learning for video object segmentation," in Proc. Comput. Vis. Pattern Recognit., 2018, pp. 9080-9089.
36. Kyungtae Ko, "Visual object tracking for UAVs using deep reinforcement learning", Iowa State University, ProQuest Dissertations Publishing, 2020, 27836817. Available: https://doi.org/10.31274/etd-20200624-169
37. Howard, Andrew G., Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand, Tobias, Andreetto, Marco, and Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv, 2017. https://doi.org/10.48550/arXiv.1704.04861
38. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal Policy Optimization Algorithms," CoRR, 2017, vol. abs/1707.06347.
39. Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd ed., Cambridge, Massachusetts; London, England: The MIT Press, November 13, 2018.
40. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal Policy Optimization Algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1707.html#SchulmanWDRK17


FARKHODOV KHURSHEDJON received a B.S. degree in the Department of Computer Engineering at Tashkent University of Information Technologies, Uzbekistan (2013-2017), and an M.S. degree in the Department of AI Convergence and Application Engineering at Pukyong National University, South Korea (2018-2021). He is currently a Ph.D. student at Pukyong National University in the Department of AI Convergence and Application Engineering. His research interests include Digital Image Processing, Computer Vision, and Machine Learning.

SUK-HWAN LEE received a B.S., an M.S., and a Ph.D. degree in Electrical Engineering from Kyungpook National University, Korea, in 1999, 2001, and 2004, respectively. He is currently a professor in the Department of Computer Engineering at Donga University. His research interests include multimedia security, digital image processing, and computer graphics.

JAN PLATOS received a Ph.D. degree in Computer Science in 2010. He became a Full Professor in 2021 at the Department of Computer Science. Since 2021, he has been Dean of the Faculty of Electrical Engineering and Computer Science, VSB-TUO. He has co-authored more than 240 scientific articles published in proceedings and journals. His citation report consists of 849 citations and an H-index of 13 on the Web of Science, 1407 citations and an H-index of 16 on Scopus, and 2078 citations and an H-index of 21 on Google Scholar. His primary fields of interest are machine learning, artificial intelligence, industrial data processing, text processing, data compression, bioinspired algorithms, information retrieval, data mining, data structures, and data prediction.

KI-RYONG KWON received B.S., M.S., and Ph.D. degrees in electronics engineering from Kyungpook National University in 1986, 1990, and 1994, respectively. He worked at Hyundai Motor Company from 1986 to 1988 and at Pusan University of Foreign Language from 1996 to 2006. He is currently a professor in the Dept. of IT Convergence & Application Engineering at Pukyong National University. He was a Post-Doc researcher at the University of Minnesota, USA, from 2000 to 2002, and a visiting professor at Colorado State University from 2011 to 2012. He was the General President of the Korea Multimedia Society from 2015 to 2016. He is also a director of the IEEE R10 Changwon section. His research interests are in the area of digital image processing, multimedia security and watermarking, bioinformatics, and weather radar information processing.
