
Received February 16, 2021, accepted March 2, 2021, date of publication March 12, 2021, date of current version March 26, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3065710

A Hybrid Multi-Task Learning Approach for Optimizing Deep Reinforcement Learning Agents
NELSON VITHAYATHIL VARGHESE AND QUSAY H. MAHMOUD, (Senior Member, IEEE)
Department of Electrical, Computer and Software Engineering, Ontario Tech University, Oshawa, ON L1G 0C5, Canada
Corresponding author: Nelson Vithayathil Varghese ([email protected])
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

ABSTRACT Driven by recent technological advancements within the field of artificial intelligence (AI), deep learning (DL) has emerged as a promising representation learning technique across different machine learning (ML) classes, especially within the reinforcement learning (RL) arena. This new direction has given rise to the evolution of a new technological domain named deep reinforcement learning (DRL), which combines the high representational learning capabilities of DL with existing RL methods. The performance optimization achieved by RL-based intelligent agents designed with model-free approaches was largely limited to systems whose RL algorithms focused on learning a single task. This approach was found to be quite data inefficient whenever DRL agents needed to interact with more complex, data-rich environments, primarily because of the limited applicability of DRL algorithms to many scenarios across related tasks from the same distribution. One possible approach to mitigate this issue is to adopt multi-task learning. The objective of this paper is to present a hybrid multi-task learning-oriented approach for the optimization of DRL agents operating within different but semantically similar environments with related tasks. The proposed framework is built with multiple, individual actor-critic models functioning within independent environments and transferring knowledge among themselves through a global network to optimize performance. The empirical results obtained by the hybrid multi-task learning model on the OpenAI Gym-based Atari 2600 video gaming environment demonstrate that the proposed model enhances the performance of the DRL agent by a margin of roughly 15% to 20%.

INDEX TERMS Machine learning, deep reinforcement learning, neural networks, transfer learning, actor-critic, multi-task worker.

I. INTRODUCTION
Over the last few decades, the reinforcement learning domain has well established its position as a vital topic within technological areas such as robotics and intelligent agents [1]. The core objective of RL is to address the problem of how intelligent agents should explore their operating environment optimally, and thereby learn to take optimal actions to achieve the highest possible reward while in a given state [2]. Supported by recent advancements within the field of ML, RL has cemented its position as one of the major machine learning paradigms that deal with an agent's behavior patterns while in an environment. In comparison to the performance of ML systems built in contexts such as supervised learning and unsupervised learning, the relative performance level of current RL agents was not optimal. This was majorly due to the limitations related to deriving the optimum policy out of the large state-action space linked with the environment of RL problems. At the same time, the inception of DL, with its very high level of representational learning capability, has given a new dimension to the field of reinforcement learning and led to the evolution of deep reinforcement learning. As a result of these advancements, DRL agents have been applied to various areas such as continuous action control, 3D first-person environments, and gaming. Especially in the field of gaming, DRL agents have proven to be extremely successful and could surpass human-level performance on classic video games like Atari as well as board games such as chess and Go [3].
Despite the impressive results achieved with a single-task-based methodology, the RL agent is observed to be less efficient within operating environments that are more complex and richer in data, such as 3-dimensional environments.


One of the possible directions to enhance the efficiency of the RL agent in such an environment is the application of multi-task-based learning. With multi-task learning, a set of closely related tasks from the operating environment will be learned simultaneously by individual agents with the help of a DRL algorithm such as A3C (Asynchronous Advantage Actor-Critic) [4]. By the application of this approach, at regular intervals, the neural network parameters of each of the individual agents will be shared with a global network. By combining the learning parameters of all the individual agents, the global network derives a new set of parameters, which will be shared back with all of the individual agents. The major aim of this methodology is to optimize the overall performance of the RL agent by transferring the learning, shared knowledge, among multiple related tasks running within the same environment. One of the most widely known and accepted multi-task learning methodologies within RL is parallel-based multi-task learning, in which a single RL agent masters a group of diverse tasks [5]. The core idea behind this approach mainly relies on the architecture used by the deep reinforcement learning model, based on a single learner, often known as a critic, combined with different actors. Each of the individual actors generates its learning trajectories, which are a set of parameters, and further on sends them to the learner module, also called a critic module, either synchronously or asynchronously. Subsequently, each of the actors retrieves the latest set of policy parameters from the learner before initiating the next learning trajectory. With this approach, learnings from each of the individual tasks will be exchanged with every other agent within the environment, which in turn improves the overall learning momentum of the RL intelligent agent.

A. MOTIVATIONS AND CONTRIBUTIONS
The major motivation behind the proposed hybrid multi-task approach is to address some of the major challenges associated with the existing multi-task deep reinforcement learning (MTDRL) paradigm, especially key challenges such as partial observability, effective exploration, and lastly the amount of training data and training time required to achieve an acceptable level of performance. To this end, the contributions of this paper are:
1) Design and development of a hybrid multi-task learning model to optimize the performance of DRL agents.
2) Evaluation of the DRL agent's performance with the hybrid multi-task learning model within the context of the aforementioned challenges.
3) An empirical analysis of the fluctuations in the DRL agent's performance as the degree of semantic similarity between the tasks trained together from multiple game environments (Atari 2600) varies.
The rest of this paper is organized as follows. Section II presents a brief background of reinforcement learning concepts, and Section III explains the various existing approaches that are attempted on the multi-tasking front of DRL. Section IV discusses the various related work done within the same arena, with special focus given to three state-of-the-art solutions, namely Distral, IMPALA, and PopArt. Section V details the proposed hybrid multi-task model architecture while implementation details are covered in Section VI. Experiments conducted and results obtained with the hybrid multi-task model are presented in Section VII. Analysis of test results is discussed in Section VIII. Finally, Section IX concludes the paper and offers ideas for future work.

II. BACKGROUND
This section provides a brief background of the main aspects related to reinforcement learning. Table 1 indicates the notations used within this paper to explain the concepts and equations.

FIGURE 1. The ecosystem of reinforcement learning.

Reinforcement learning is one of the ML paradigms related to sequential decision-making which deals with mapping situations to actions in a way that maximizes the associated reward. Within RL ecosystems, the learner, which is also known as an agent, is not explicitly instructed on which actions to take at each time step t; instead, the RL agent must follow a trial-and-error method to identify which actions generate the most reward. A standard reinforcement learning setup consists of an agent situated within an environment E, where the agent interacts with the environment in discrete timesteps. At each of these timesteps t, the agent will be in a state St (St ∈ S) and will perform a chosen action At (At ∈ A) within the environment E. Further on, the environment responds by updating the current state St to a follow-up state St+1 with a new timestep t+1 and also gives a reward r(St, At) ∈ R to the agent, indicating the reward value of performing an action in the preceding state St [1]. Fig. 1 represents the standard ecosystem for a reinforcement learning environment at any given timestep t. By performing multiple actions in a sequential learning manner in a sequence of associated states s, with related actions a, respective follow-up states s' and rewards r, several episodes of tuples <s, a, s', r> are generated. At any given state St, the goal of the agent is to determine a policy π that can create a state-to-action mapping that maximizes the accumulated reward over the lifetime of the agent for that particular state [6].
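To make the interaction loop above concrete, the following minimal Python sketch steps an agent through one episode using the classic OpenAI Gym API on which the paper's experiments are based; the environment id and the random action (standing in for a learned policy π) are illustrative assumptions, not part of the original text.

```python
import gym

# Minimal sketch of the agent-environment loop described above (classic Gym API).
env = gym.make("Pong-v0")          # hypothetical Atari 2600 task id
state = env.reset()                # initial state S_0
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()                 # placeholder for pi(a | state)
    next_state, reward, done, info = env.step(action)  # environment returns s', r(S_t, A_t)
    episode_return += reward                           # accumulate reward over the episode
    state = next_state                                 # move to the follow-up state S_{t+1}
env.close()
print("episode return:", episode_return)
```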


TABLE 1. List of notations.

At any point in time t, the goal of the RL agent is to select the actions in such a way that it maximizes its expected return. The reward returned at any given time step t is the quantity that can be represented as

Rt = Σ_{τ=0}^{∞} γ^τ r(St+τ, At+τ)

where γ ∈ (0, 1) is the discount factor that multiplies the future expected reward and varies in the range of [0, 1]. At any moment, the goal of the DRL agent is to maximize the expected return from each state St. The action value indicated by Qπ(s, a) = E[Rt | St = s, a] is the expected return for taking an action a in state s by following a policy π. Similarly, the optimal value function indicated by Q∗(s, a) = maxπ Qπ(s, a) is the maximum action value for action a and state s that is achievable by any policy. Likewise, the value of any state s under policy π is defined by Vπ(s) = E[Rt | St = s], which is simply the expected return for following the policy π from state s. Q(s, a) is often used as a measure of the value of the agent being in that particular state s and taking an action a. The well-known Bellman equation below is used as a reference to calculate Q(s, a) for every action in every state, which helps an agent make decisions about its future moves.

Q'(s, a) = Q(s, a) + α[R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)]   (1)

where Q(s, a), α, R(s, a), γ and max_{a'} Q(s', a') represent the current Q value, the learning rate, the reward for taking that action a in state s, the discount factor, and the maximum expected future reward given the new state s' over all possible actions from that state.
In the case of value-based model-free reinforcement learning methods, the action-value function Q(s, a) is often represented by using a function approximation method, such as a neural network. In such a case an approximate action-value function parameterized with θ is represented as Q(s, a; θ). The updates for the parameters are decided with the help of a suitable RL algorithm. In contrast to the aforementioned value-based methods, policy-based model-free RL methods directly parameterize the policy π(a|s; θ) and update the parameters by performing, typically approximate, gradient ascent on E[Rt].
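The following short Python sketch shows how the update of Eq. (1) is typically realized for a small tabular problem; the state and action counts, learning rate, and discount factor are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular update following Eq. (1):
    Q'(s,a) = Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])   # R(s,a) + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q

# toy usage: 5 states, 2 actions
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```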

III. MULTI-TASK DEEP REINFORCEMENT LEARNING
With the enormous growth that has happened within the AI and DL domains, DRL has positioned itself as the state-of-the-art and de-facto choice for solving many benchmark tasks and real-world issues. As a direct result of this, methods for the optimization of DRL have attracted a great level of interest and attention. The subsequent sections cover the information presented in related research efforts on methodologies and approaches designed to attain multi-task DRL.

A. TRANSFER LEARNING ORIENTED APPROACH
Before the inception of DL into the domain of RL, transfer learning was used as a major means of guiding research towards developing multi-task learning algorithms within the RL domain. The major aspect of transfer learning (TL) is based on the concept of knowledge transfer happening between different source and target tasks that are closely related to each other. Subsequently, this knowledge transfer is intended to enhance the performance of the ML system's algorithm that would be employed to learn the target task. Within the context of RL, the notion of transfer is majorly concerned with coming up with different approaches and techniques to enable knowledge transfer originating from a group of source tasks to a target task. This methodology is proven to deliver impressive results whenever there is a high amount of similarity between the source tasks as well as the target tasks [7].


Especially when the level of similarity between the two ends is high, the process of knowledge transfer is observed to happen quite smoothly, such that the knowledge transferred in this manner plays an important role in assisting the target task's algorithm to solve the tasks with a high level of efficiency. The major reason behind the success of the above scenario is the advantage provided by positive knowledge transfer. This in turn enables the target task's algorithm to optimize its performance level more easily than spending more time for it to gain the same amount of knowledge by using a larger amount of the target task's data samples. Different TL mechanisms oriented on the aforementioned approach have been deployed within various single-agent RL algorithms [8].
Similar kinds of approaches have also been experimented with in multi-agent systems, where agents operating within the same environment interact among themselves as a means of exchanging the knowledge gained from their respective actions [9]. Typically, the design of multi-agent systems uses a joint policy that is learned by agents from the source task, and further on the same knowledge will be used to generate the initial policy for the agents to use within the context of the target task [10]. Knowledge transfer is made possible among the source and target tasks by adopting different transfer techniques like instance transfer, representation transfer, or parameter transfer. In all these approaches, the algorithms that handle the knowledge transfer rely majorly on the prior knowledge gathered while attempting to solve the same kind of source tasks, and then leverage that as a reference to bias the approach or learning to be followed on a new task [7].
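As an illustration of the parameter-transfer technique mentioned above, the hedged PyTorch sketch below copies every source-task weight whose name and shape match the target-task network and leaves the rest (for example a different action head) freshly initialized. The network sizes and helper function are hypothetical and are not taken from the cited works.

```python
import torch.nn as nn

# Hypothetical policy networks: a source-task network and a target-task network
# that share the same torso but have different action heads.
def make_policy(n_actions):
    return nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, n_actions))

source = make_policy(n_actions=6)   # assume already trained on the source task
target = make_policy(n_actions=9)   # target task with a different action set

# Parameter transfer: reuse every tensor whose name and shape match,
# leaving the mismatched output head randomly initialized for the new task.
src_state = source.state_dict()
tgt_state = target.state_dict()
transferable = {k: v for k, v in src_state.items()
                if k in tgt_state and v.shape == tgt_state[k].shape}
tgt_state.update(transferable)
target.load_state_dict(tgt_state)
print(f"transferred {len(transferable)} of {len(tgt_state)} tensors")
```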

B. LEARNING SHARED REPRESENTATIONS
Learning shared representations for value functions is a technique that is quite analogous to the transfer learning-oriented methodology. The base of this method lies in the neural network's function approximation capability and how it could be applied to the domain of RL [3]. The key success factor behind the usage of DNNs with RL lies in DL algorithms' capability to distill data representations in a quite meaningful manner from the operating environment's high-dimensional input state space [11]. Due to this capability, RL's effective applicability to a wide spectrum of complex real-life problems with high-dimensional input state spaces was made possible. Before the adoption of DL into the RL space, such an attempt always demanded the application of feature engineering at a great level of effort [3]. The major success factors behind the application of DL to RL problems greatly demand two conditions: firstly, the capability to derive a high-quality abstraction of the agent's operating environment, and secondly, an effective role of the agent within its operating environment [12]. The major driving factor behind learning shared representations lies in the idea that the group of different tasks that the agent needs to encounter in the operating environment is highly likely to have a great degree of shared structure as well as an in-built redundancy [7]. Having a technique in place with the design and the ability to abstract the aforementioned factors would play a pivotal role in accelerating the whole learning process by an agent. Learning shared representations provides the capability to reach this milestone by learning robust, transferable abstractions of the environment. This is majorly because those elements possess the ability to generalize over a group of tasks that an agent needs to deal with while operating within its environment [13].
The idea of the value function plays an important role within the space of RL as it is extensively used in conjunction with the function approximation methodology to generalize over the large-sized state-action space of the operating environment within which an agent needs to function [14]. The significance of the value function lies in its ability to determine the quality of a specific state that an agent needs to be in while in an environment. Value functions often have the ability to demonstrate the compositional structure concerning the state space and goal states [15]. Empirically, prior research efforts have proven that value functions could effectively capture as well as represent knowledge beyond their current goal, and this could be efficiently leveraged or re-used for future learning purposes [14]. It is possible to learn optimal value functions by efficient usage of the space of state-action values that are exchanged among the tasks that an RL agent encounters and deals with during its operation within the operating environment. This could be made possible by having the ability to accommodate the aforementioned common structure into the value-iteration and policy-iteration procedures, namely fitted Q-iteration and approximate policy iteration respectively [16].

C. PROGRESSIVE NEURAL NETWORKS
The design of this approach is based on leveraging the function approximation ability of the neural network and has a high degree of similarity with the transfer learning methodology. The challenges related to achieving optimal performance on multi-tasking in DRL were about the effective application of TL and eliminating catastrophic forgetting. As an answer to these issues, extensive research efforts were carried out, and as a direct result of that, an approach named progressive neural networks was proposed. The technique possessed the capability to mitigate the impacts of catastrophic forgetting and derived a way to utilize prior accumulated knowledge by using lateral connections to features that are learned already. The progressive neural network method was proposed by DeepMind as a multi-tasking technique by adopting the idea of lateral feature transfer with neural networks [17]. The major aspect of the model proposed by this methodology is having not only the capability to learn new tasks but also to maintain the prior knowledge gathered by using neural networks. The main objective of having a series of progressive neural networks is to conduct the knowledge transfer across a group of tasks quite efficiently. Theoretically, the design of the progressive neural networks is for achieving two key objectives. Firstly, to establish a system having the capability to efficiently incorporate the previously gathered knowledge while conducting the learning at each layer of the whole feature hierarchy.


Secondly, to come up with a system having enough immunity to handle the impacts of catastrophic forgetting [7].
The major benefit of adopting this methodology is that progressive networks possess the capability to retain a set of pre-trained models throughout the whole training cycle [17]. Along with this, the model can utilize the pre-trained columns to learn lateral connections and thereby derive useful features needed for new tasks in the future. Having an approach with such abilities brings the combination of both richer compositionality and easy room for the integration of previously acquired knowledge at every layer of the feature hierarchy. The continual nature of learning associated with this approach facilitates the agents both to learn a group of tasks that are encountered in a sequence and to perform knowledge transfer from previous tasks to enhance the overall convergence speed [18]. Progressive networks combine both these aspects into a model architecture where catastrophic forgetting is eliminated by instantiating a new neural network for each of the tasks that are being solved during an agent's lifetime within its operating environment. In addition to this, transfer of knowledge is made possible through the lateral connections to the features from the previously learned neural network columns [17]. At any given time step t, when a new task is learned, the model appends a new column of knowledge into its current framework as a new neural network unit. Subsequently, the newly added unit would be used while learning the successive tasks. Each of the new columns or neural network units created would be trained to solve a specific Markov decision process (MDP) [17]. One of the plausible drawbacks related to this approach is that it could lead to a computationally expensive model, as its size could grow along with the progress of the learning cycle.
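A minimal, hypothetical PyTorch sketch of one progressive-network column is given below: the new column keeps its own weights and additionally consumes laterally transferred features from a frozen, previously trained column, which is the mechanism described above. Layer sizes and names are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """Sketch of one new column: its second layer receives lateral input from a
    frozen previous column, so earlier knowledge is reused without being overwritten."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.h1 = nn.Linear(obs_dim, hidden)
        self.h2 = nn.Linear(hidden, hidden)
        self.lateral = nn.Linear(hidden, hidden)   # adapter from the old column's features
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, x, prev_h1):
        h1 = torch.relu(self.h1(x))
        # combine own features with laterally transferred features from the frozen column
        h2 = torch.relu(self.h2(h1) + self.lateral(prev_h1))
        return self.out(h2)

# usage: prev_h1 would come from the frozen, previously trained column
obs = torch.randn(4, 64)
prev_h1 = torch.randn(4, 128).detach()             # frozen activations (no gradient)
logits = ProgressiveColumn(64, 6)(obs, prev_h1)
```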
Policy distillation (PD) and actor-mimic (AM) are the two
D. PathNet
PathNet is another multi-task RL methodology designed for the purpose of achieving artificial general intelligence (AGI) by joining together the aspects of transfer learning, continual learning, and multitask learning [18]. The core design aspect of PathNet lies with a neural network-oriented algorithm that utilizes multiple agents which are deployed within the neural network. The major role of each agent is to identify which parts of the network could be re-used while learning new tasks [19]. All agents are treated as pathways (known by the name genotypes) within the ecosystem of the neural network to decide the subset of parameters that could be utilized during the learning process [20]. All of these parameters used within the forward propagation of the learning cycle undergo updates at the backpropagation phase of the PathNet algorithm. During the learning process, a tournament selection genetic algorithm would be deployed for selecting the pathways through the use of a neural network. During the operation, the various actions carried out by the agents inside the neural network build a knowledge database on the efficient re-use of environment parameters for new tasks or actions. All of the agents are designed to function in a parallel manner along with all the other agents present within the system, which would be involved in learning other tasks as well as sharing parameters among themselves for facilitating the positive transfer of knowledge [20].
Generally, the PathNet architecture is made up of a DNN having L layers, with each of them having M modules. Each one of these modules itself would be another neural network. Consolidated outputs of modules belonging to each layer would then be sent into the active modules residing in the subsequent layer [7]. For every individual layer, there would be a limit on the maximum number of modules that could be supported for each of the pathways, and often this number lies between 3 and 4 [20]. The final layer within each of the neural networks for each of the tasks that are being learned will always be unique. More importantly, this would not be shared with any of the remaining tasks running within the operating environment. One major advantage of the PathNet approach lies within the re-usability factor, by which neural networks could leverage and learn from existing knowledge databases, and thereby save time by avoiding learning from scratch for the new tasks. The impact of this would be more prevalent within the context of RL, as there could be more interrelated tasks within the wide action space associated with the operating environment. The PathNet approach has shown impressive results for positive transfer of knowledge for various datasets like binary MNIST (Modified National Institute of Standards and Technology), CIFAR-100 (Canadian Institute For Advanced Research), and SVHN (The Street View House Numbers) supervised learning classification tasks. The same kind of results has been obtained with several Atari 2600 gaming and Labyrinth RL tasks.

E. POLICY DISTILLATION
Policy distillation (PD) and actor-mimic (AM) are the two approaches that are based on the concept of distillation intended to achieve multi-task DRL. The core objective of distillation lies in minimizing the computational costs associated with ensemble methods [21]. The idea of an ensemble can be viewed as a group of models wherein the prediction outputs of this group are joined with the help of either a weighted-average or a voting technique [22]. Studies on ensemble methods were one of the prominent research fields within the past decade. The most famous ensemble-oriented methods are namely bagging, boosting, random forests, Bayesian averaging, and stacking [22]. Two major drawbacks related to ensemble-based methods are their huge memory requirements for operation as well as the amount of time needed for execution during runtime. The need for relatively high execution time makes them slow in terms of generating the output. To mitigate these issues, a distillation methodology was suggested, which is designed based on the model compression approach. The main objective of this approach is to compress the function learned by the complex model, which is an ensemble, into a much scaled-down and faster model that has a relatively comparable level of performance with the original ensemble [22].

Subsequently, this same approach was mapped into the domain of neural networks [23].
By leveraging model compression, PD is considered an approach that could be used for the purpose of extracting the policy of an RL agent. Following this, the same policy could be utilized towards training a new network at an optimum level with a relatively smaller size and a higher level of efficiency. An intelligent agent could then utilize the same approach for the consolidation of multiple task-oriented policies into one policy. Earlier research efforts conducted on the PD front were mostly carried out with an RL algorithm namely DQN (deep Q-network). With this, the PD method could be utilized in an effective way to transfer one or more active policies from a DQN to another network that is untrained [7]. DQN is a famous, state-of-the-art model-free technique deployed in RL with the help of deep neural networks (DNN). This model functions within an environment having a discrete set of action choices. DQN was proven to exhibit a performance level that outperforms human-level scores on a collection of multiple Atari 2600 games [3]. In this context, the distillation approach could be applied both at a single-task level, as single-game policy distillation, and at a multi-task level, as a knowledge transfer technique from a teacher model T to a student model S. Within single-task-based policy distillation, the responsibility of data generation would be handled by the teacher network, which is a trained DQN agent, and following this a supervised training would be performed by the student network. To achieve multi-task PD, n different DQN-based single-game experts (agents) would be trained independently [22]. Later on, all of these individual agents generate both the inputs and target data, which are stored inside memory buffers. Subsequently, the distillation agent uses all of these n different data stores sequentially for learning.
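A hedged sketch of the supervised distillation step described above is shown below: the student is trained to match the teacher's action distribution, with the teacher's outputs sharpened by a temperature, using a KL-divergence loss. The tensor shapes and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, tau=0.01):
    """KL(teacher || student) over action distributions, with the teacher's
    outputs sharpened by a temperature tau, as commonly used for DQN policy distillation."""
    teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    kl = torch.sum(teacher * (torch.log(teacher + 1e-8) - log_student), dim=-1)
    return kl.mean()

# toy batch: Q-value outputs of a trained single-game teacher and an untrained student
teacher_q = torch.randn(32, 6)
student_q = torch.randn(32, 6, requires_grad=True)
loss = distillation_loss(teacher_q, student_q)
loss.backward()   # gradients flow into the student only
```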
offers them an independent and unique way of learning. The
F. ACTOR-MIMIC
The major design aspect and most desired characteristic of an intelligent agent lies with its ability to act under different operating environments, accumulate information, and then subsequently transfer the knowledge from those past experiences to new situations. The idea behind actor-mimic is based on the aforementioned methodology with a special focus on aspects such as multi-task learning and transfer learning. Having these two abilities would enable an intelligent agent to learn efficiently how to handle and act concurrently on multiple tasks, and subsequently generalize the knowledge gathered or accumulated to new domains [24]. Typically, actor-mimic could be perceived as a technique to train a single deep policy network with the help of a set of related source tasks. Empirically, it is shown that impressive levels of performance could be achieved by using models that are trained with this approach on several games. Specifically, with a high amount of similarity between the source as well as target tasks, features that are learned while training source tasks could be quite efficiently used for the generalization of the target tasks' training [25].
The actor-mimic methodology utilizes the power of both DRL as well as model compression techniques to train a single policy network. The key intention behind the usage of such a training method is to enable the network to gain knowledge on how to act within a group of distinct tasks under the guidance of multiple expert teachers [7]. Subsequently, the representational knowledge gathered by the DRL policy network could be leveraged towards generalizing to new tasks without having any sort of anterior expert guidance. Validation of this technique was majorly carried out within the arcade learning environment (ALE) [26]. Generally, actor-mimic is considered part of the larger imitation learning class of methods. These methods are generally rooted in the idea of adopting expert guidance to train an agent on how to act within a particular operating environment. Under the imitation learning methodology, a policy would be directly trained to mimic an expert's behavior while sampling the set of actions from the mimic agent's space [24].

G. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC
Google DeepMind proposed the idea of a simultaneous, parallel learning-based training approach towards multi-task learning, and formulated an algorithm by the name A3C (asynchronous advantage actor-critic). According to this, multiple intelligent agents, which are also called workers, will be running simultaneously in a parallel fashion within different instances of the same operating environment [4]. All the workers that are running within the environment will be in charge of updating a global value function asynchronously. The key essence of this approach lies in the fact that at the time of training each of these individual agents, at any given time stamp t, each of these agents would be undergoing different states within the environment. This property offers them an independent and unique way of learning. The impact of this unique A3C algorithm is that it will be in a position to deliver each of the agents a very highly efficient learning trajectory within the vast state space of the operating environment [27]. A3C is designed as an enhancement of the original actor-critic methodology, which has two different, independent neural network units, one for the actor module and the other one for the critic module, with its own loss functions [7]. At a basic level, an actor module could be treated as a function approximator unit that governs the agent at each state, in a quite similar way as in RL methods like Q-learning or REINFORCE. In these two approaches, a neural network calculates either a function that leads to the calculation of the policy or the policy itself directly [28]. When it comes to the critic module, it acts as more of a judging unit which effectively evaluates the effectiveness of the policy created by the actor and then provides feedback on the same, which helps further enhancement of the future policy calculations [4].


IV. RELATED WORK
This section provides details on the related work done on the multi-task DRL front. Before the inception of deep reinforcement learning, most of the multi-task-oriented algorithms relied on transfer learning to realize proper control over different tasks. Besides, some research efforts were carried out to investigate the joint training of multiple value functions or policy functions over a set of tasks [29], [30]. However, the functionalities of all of these algorithms were limited by handcrafted features. Even though a huge amount of work has been done to improve DRL algorithms over single tasks, relatively little work has been done for multi-task scenarios. Some of those research attempts either focused on exploration and generative models or explored learning universal abstractions of state-action pairs or feature successors, which are quite similar to the transfer learning methodology [31]. DiGrad (Differential Policy Gradient) is an approach developed for simultaneous training of multiple tasks sharing a set of common actions in continuous action spaces. The proposed framework is based on differential policy gradients and can accommodate multi-task learning in a single actor-critic network. This framework was designed predominantly for efficient multi-task learning in complex robotic systems and tested on 8-link planar manipulators and a 27 degrees of freedom (DoF) Humanoid for learning multi-goal reachability tasks for 3 and 2 end effectors respectively [32]. Another research work related to multi-task learning was based on a model-based approach to deep reinforcement learning used to solve different tasks simultaneously. This model was developed with a recurrent neural network inspired by residual networks that decouples memory from computation, allowing it to model complex environments that do not require a lot of memory [5]. Another relevant work on the multi-task front mainly attempted to address the partial observability issue of RL with the help of the deep decentralized multi-task multi-agent reinforcement learning method [33]. It was based on a decentralized single-task learning approach that is robust to concurrent interactions of teammates and presented an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity [34]. The diffusion-based Distributed Actor-Critic (Diff-DAC) is a deep neural network-oriented distributed actor-critic algorithm designed for single-task and for average multitask reinforcement learning (MRL). In this method, each agent has access to data from its local task only, and during the learning process agents share their value-policy parameters with neighbors to converge to a common policy but without having a central node [35]. For the remainder of this section, we will mainly focus on comparing and contrasting the three state-of-the-art approaches namely Distral, IMPALA, and PopArt.

A. DISTRAL
Distral (DIStill and TRAnsfer Learning) is one of the well-known approaches developed by Google DeepMind for the purpose of multi-task training. It is a prototype made for concurrent RL with more than one task [36]. The key design objective was to build a generic model to distill the centroid policy first and, following this, transfer the commonality details and behavior patterns of the multiple workers operating within a multi-task RL context. Rather than following a parameter-sharing-based policy among the multiple worker agents operating in the environment, Distral's design methodology mainly focuses on distributing a distilled policy to individual workers, and this policy should conceive the commonality in behavior across multiple related tasks. Once the distilled policy is derived, the same can be used to govern the task-specific policies by adopting regularization with the help of the Kullback-Leibler (KL) divergence [24]. By this method, initially, knowledge gathered from one task would be distilled in the form of a shared policy; subsequently, the same knowledge could be transferred to other related tasks operating in the environment. By adopting this methodology, each of the individual workers would be trained independently to solve its own task, in such a way that each of the workers stays more in line with the shared policy. Training for this policy will be conducted with the help of the distillation process that serves as the centroid for all the individual task policies [36]. This approach is found to present impressive results in terms of the transfer of knowledge within complex 3-dimensional operating environments for RL problems.
Empirically it has been observed that the Distral approach often outperforms, by a significant margin, the traditional methods that are oriented on the parameter-sharing policy of neural networks towards achieving multitasking or transfer learning. The two key reasons behind this are mentioned below. Firstly, it is due to the level of impact distillation has on the process of optimization. It is more prevalent while adopting KL divergences as a prime method to regularize the task models' outputs in deriving the distilled model extracted from each of the policies of individual tasks. Secondly, it is due to the application of the distilled model itself as a means to regularize the training of the individual task models within the environment. More importantly, the application of the distilled model as a method to regularize comes with the notion of regularizing the collection of individual workers in a much more impactful manner by stressing the task policies by a larger margin than at the level of parameters [20].
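The sketch below illustrates, under simplifying assumptions, the kind of KL regularizer Distral adds to each task's objective to keep a worker's policy close to the shared distilled policy; the full Distral objective also contains entropy and distillation terms that are omitted here, so this is only a partial, illustrative view.

```python
import torch
import torch.nn.functional as F

def distral_kl_regularizer(task_logits, distilled_logits, beta=1.0):
    """KL(task policy || distilled policy): the pull toward the shared policy that
    is added to each worker's usual RL loss (simplified view of Distral)."""
    log_task = F.log_softmax(task_logits, dim=-1)
    log_distilled = F.log_softmax(distilled_logits, dim=-1)
    kl = torch.sum(log_task.exp() * (log_task - log_distilled), dim=-1)
    return beta * kl.mean()

# per-task loss sketch (pg_loss would be the worker's ordinary policy-gradient loss):
# total_loss = pg_loss + distral_kl_regularizer(task_logits, distilled_logits)
```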
B. IMPALA
Google DeepMind came up with another well-known multi-task learning approach by the name IMPALA (Importance Weighted Actor-Learner Architecture). It is based on the idea of having a distributed agent architecture that is designed by adopting the model of a single RL agent with only one set of parameters. The core design characteristic of the IMPALA model is operating-environment flexibility. This model is designed with the ability not only to make efficient use of resources within a single-machine-oriented training environment, but also to be scaled to operate with multiple machines without the need to sacrifice either data efficiency or resource utilization.


By following a novel off-policy correction method by the name V-trace, IMPALA is capable of gaining quite a stable learning trajectory with a very high throughput level by combining both decoupled acting as well as learning [37]. In general, the DRL model's architecture follows the notion of a single learner (also known as a critic) clubbed with many actors. Under this ecosystem, initially, each of the individual actors creates its learning parameters, called trajectories, and subsequently shares that knowledge with the learner (critic) by following a queue mechanism. The learner subsequently accumulates the same kind of knowledge trajectories from all of the multiple actors operating within the environment, which eventually acts as a source of information to prepare the central policy. Before starting the next learning cycle (trajectory), all of the individual actors operating within the environment gather the updated policy parameter details from the learner (critic module) [37]. This approach is quite analogous to the popular RL algorithm named A3C, and the architecture of IMPALA was inspired hugely by the same algorithm. The RL algorithm model used within IMPALA follows the topology of a system having a group of actors and learners that build knowledge through collaboration.
The design of IMPALA leverages an actor-critic-based model to derive a policy π and a baseline value function named Vπ. The major units of the IMPALA system consist of a group of actors that generate trajectories of experience in a continuous manner. In addition to this, there could be one or more learners that leverage the generated trajectories shared by the individual actors to learn the policy π, which is an off-policy. At the beginning of every individual trajectory, an actor initially updates its local policy µ to the latest learner policy π. Subsequently, each actor would adopt and run that policy for n steps in its operating environment [37]. Upon completion of these n steps, each of the individual actors sends another set of information consisting of the trajectory of states, actions, and rewards together with the related policy distributions to the learner. In this manner, the learner will have the opportunity to continuously update its policy π each time the actors share their trajectory information from the environment. In this fashion, the IMPALA architecture gathers experiences from the different individual actors within the environment, which are further passed to a central learner module. Following this, the central learner calculates the gradients and then generates a model having a framework of independent actors as well as learners. One of the major characteristics of the IMPALA architecture is its operational flexibility, which allows the actors to be present either on the same machine or to be evenly distributed across numerous machines.

C. PopArt
A third approach by the name PopArt, proposed by Google DeepMind, came out as a solution to mitigate the issues associated with the existing IMPALA model. PopArt was aimed at addressing the reasons behind the suboptimal performance factors and thereby enhancing RL in multi-task-oriented environments. The core design objective of PopArt is to reduce the impacts of the distraction dilemma problem associated with the IMPALA model and thereby stabilize the learning process in a better way to facilitate the adoption of multi-task RL techniques [38]. The term distraction dilemma refers to the probability of learning algorithms getting distracted by only a fraction of the tasks from the large pool of multiple tasks to be solved. This scenario in turn leads to challenges related to resource contention. It is about establishing the right balance between the necessities of multiple tasks operating within the same environment competing for a limited number of resources offered by a single learning system. The design methodology of the PopArt model is based on the original IMPALA architecture model, adding multiple CNN layers combined with other techniques like word embeddings with the help of a recurrent neural network of type long short-term memory (LSTM) [38].
The PopArt model functions by gathering the trajectories from each of the individual tasks into the RL agent's updates. In this manner, the PopArt model makes sure that every agent within the environment will have its role and, subsequently, a proportional impact on the dynamics of overall learning. The key design aspect of the PopArt model relies on the fact that modifying the weights of the neural network will be based on the output of all tasks operating within the environment. During the first stage of operation, PopArt estimates both the mean as well as the spread of the ultimate targets, such as the score of a game, across all tasks under consideration. Following this, PopArt capitalizes on these estimated values to normalize the targets before making an update on the network's weights. This approach in turn makes the whole learning process more stable and robust. With the set of various experiments conducted with the popular Atari games' environment, PopArt has demonstrated its capabilities and improvements over other multi-task RL architectures [38].
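The following sketch shows, in simplified form, the PopArt-style normalization idea described above: running statistics of the value targets are tracked, and the last layer of the value head is rescaled so that its outputs are preserved when the statistics change. Variable names and the update rate are assumptions, not the authors' code.

```python
import numpy as np

class PopArtNormalizer:
    """Sketch of PopArt-style target normalization: track mean/std of value targets and
    rescale the value head so that its outputs are preserved when the statistics change."""
    def __init__(self, beta=3e-4):
        self.mu, self.nu, self.beta = 0.0, 1.0, beta   # running first and second moments

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def update(self, targets, w, b):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(targets)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(targets ** 2)
        # preserve outputs: rescale the last linear layer (weights w, bias b) of the value head
        w *= old_sigma / self.sigma
        b[:] = (old_sigma * b + old_mu - self.mu) / self.sigma
        return (targets - self.mu) / self.sigma         # normalized targets for the loss

# toy usage with a 16-unit value head
pop = PopArtNormalizer()
w, b = np.ones(16), np.zeros(1)
norm_targets = pop.update(np.array([10.0, 12.0, 9.0]), w, b)
```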
V. HYBRID MULTI-TASK LEARNING MODEL
The major motivation behind the proposed hybrid multi-task approach is to address and mitigate some of the key challenges associated with DRL multi-tasking which are not fully covered by the state of the art. In this paper, we extend our approach in [39] to address the DRL agent's performance optimization bottlenecks by adopting the hybrid multi-task learning-based approach in complex operating environments having a higher number of distinct DRL agents. Besides, this work also examines the impact of the semantic dissimilarity of DRL agents' tasks on the overall momentum of performance optimization. Challenges such as partial observability, the amount of training time as well as training data samples required, and effective exploration often act as the bottlenecks in the performance optimization of a DRL agent.


The proposed approach, named the hybrid A3C model, is an attempt to address most of these aspects by extending the basic actor-critic model to two different environments with a high level of semantic similarity. The key aspect of the A3C algorithm is its ability to learn multiple instantiations of a single target task simultaneously, and also its ability to improve the model's performance by transferring the knowledge between multiple instantiations [4]. The proposed hybrid A3C approach leverages this key aspect and attempts to achieve this objective across two different but semantically similar environments with related tasks. The hybrid approach relies heavily on the applicability of the multi-threaded capability of the A3C algorithm across semantically related tasks running in two different environments. The proposed approach could be treated as a model running two threads of the A3C algorithm, wherein each thread will be managing the multiple instantiations of the tasks running in each environment. Each of these individual threads would consider itself a subtask, such as A and B, with each of them sharing its learning with the learner in an asynchronous manner. Further on, the learner (global network) will converge the knowledge from both of these threads and deduce a new policy, which will be applied back on the threads. The key aspect of the hybrid approach is to enhance the performance of the RL agent through joint learning via a multi-task learning approach using deep reinforcement learning. Fig. 2 shows a high-level architecture model of the proposed hybrid multi-task learning approach.

FIGURE 2. The architecture of the hybrid parallel multi-task model.

The hybrid A3C model deploys multi-threaded asynchronous variants of the advantage actor-critic algorithm. The major objective behind designing this model is to find a methodology that can train deep neural network policies reliably and without large resource requirements. During the construction of the hybrid A3C model, we initially conducted its validation on a desktop-based environment having a dual-core CPU on a single machine. Under this environment, we conducted basic-level testing with a pair of actor-learner worker threads. With this, one (actor-learner) worker thread was assigned to run the task from each game's environment. Throughout the execution, this model asynchronously attempts to derive and optimize the global policy based on the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. At the individual actor-learner module level, it is possible to have different exploration policies in each module to maximize this diversity. In this way, by having different exploration policies in different threads of the actor-learner module, the overall changes being made to the global network parameters by these different actor-learners applying asynchronous updates in parallel are likely to be less correlated. This model is designed to run on a single machine with a standard multi-core CPU and applied to a variety of Atari 2600 domain games for testing.
The semantic similarity aspect of the related tasks running in the two different gaming environments is the most vital factor to achieve the above-mentioned objectives, which otherwise gives challenges in terms of negative knowledge transfer. Negative transfer is considered to be one of the key challenges while dealing with the multi-tasking aspect within the reinforcement learning domain. The main idea of knowledge transfer learning in a multi-task context is that transferring knowledge accumulated from learning on a set of source samples under one agent may improve the performance of another task agent while learning on the target task [24]. However, this knowledge transfer could impact the overall learning progress and performance of the agent in either way, positively or negatively. If there is a considerable difference between the source tasks and target tasks, then the transferred knowledge could create a negative impact.
Having multiple environments with a high level of semantic similarity would indirectly improve the partial observability by exchanging the learning across the agent's operating environment [40]. Similarly, having multiple actor-critic models operating simultaneously across two semantically similar environments would mitigate the RL agent's issues associated with effective exploration, a sufficient amount of training samples, and the training time required to reach an optimized performance level.
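The sketch below illustrates only the generic asynchronous parameter-sharing mechanism this section describes, with placeholder gradients instead of real actor-critic losses; it is not the authors' implementation of the hybrid A3C model, and the sizes, learning rate, and thread count are assumptions.

```python
import threading
import numpy as np

# Two workers (one per environment in the hybrid setup) apply small asynchronous
# updates to one set of global parameters and copy the merged parameters back.
global_params = np.zeros(10)            # stand-in for the global network's weights
lock = threading.Lock()

def worker(env_id, steps=100, lr=0.01):
    local_params = global_params.copy()          # start from the latest global knowledge
    for _ in range(steps):
        grad = np.random.randn(10)                # placeholder for an actor-critic gradient
        with lock:                                # asynchronous, independent updates
            global_params[:] -= lr * grad
            local_params = global_params.copy()   # sync the merged parameters back

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print("global parameters after training:", global_params)
```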
A. ACTOR-CRITIC METHODOLOGY
Unlike some simpler techniques which are based on either value-iteration (Q-learning) methods or policy-gradient (PG) methods, the actor-critic (AC) methodology combines the best parts of both methods: these are the algorithms that predict both the value function V(s) as well as the optimal policy function π(s). In other words, actor-critic methods consist of two models, namely an actor module and a critic module. Thereby AC attempts to combine the aspects of both the policy gradient and the value gradient into a single model. Fig. 3 shows the diagram of the actor-critic methodology.
The actor acts as a policy network that decides, for a given state s, which action a is to be taken at each given time step t.


FIGURE 3. Actor-Critic model.

FIGURE 4. A single thread of actor-critic worker execution.

The critic consists of a value network Vπ(s, a) that tells how promising an action is under the current state s. In its role, the critic outputs an evaluation value V(s, a) for the actor, which indirectly helps the actor to adjust its policy for better results. At the same time, both the actor and critic networks update themselves according to the knowledge gathered by their respective neural networks from the environment. This internally helps the agent to converge its policy to the optimal policy πθ∗. In summary, the critic module updates the value function parameters w, and depending on the algorithm it could be either the action-value Qw(a|s) or the state-value Vw(s), whereas the actor module updates the policy parameters θ for πθ(a|s), in the direction suggested by the critic.
Fig. 4 shows the single actor-critic worker agent flowchart [41]. The learning agent uses the value from the value function calculated by the critic module to update the optimal policy function of the actor module. Note that here the policy function means the probabilistic distribution over the action space. To be exact, the learning agent determines the conditional probability P(a|s; θ), which means the parametrized probability that the agent chooses the action a when in state s. The policy is often modeled as a function πθ(a|s) that is parameterized by θ. The value of the DRL agent's reward function depends on this policy, and the algorithms are used to optimize θ. The reward function is defined as below, wherein the dπ(s) notation refers to the stationary distribution of the Markov chain for πθ (the on-policy state distribution under π).

J(θ) = Σ_{s∈S} dπ(s) Vπ(s)   (2)
     = Σ_{s∈S} dπ(s) Σ_{a∈A} πθ(a|s) Qπ(s, a)   (3)
Within the AC model, the critic is in charge of updating the value function parameters w, and based on the DRL algorithm this could be either an action-value function Qw(a|s) or a state-value function Vw(s). Based on the details of the value function shared by the critic, the actor updates the policy parameter θ for πθ(a|s). The execution of an actor-critic algorithm can be explained by the steps below [42].
1) Initialize s, θ, w at random, and sample a ∼ πθ(a|s).
2) For t = 1 . . . T:
   a) Sample the reward Rt ∼ R(s, a) and the next state s' ∼ P(s'|s, a);
   b) Then sample the next action a' ∼ πθ(a'|s');
   c) Update the policy parameters: θ ← θ + αθ Qw(s, a) ∇θ ln πθ(a|s);
   d) Compute the correction (TD error) at time t for the action-value:
      i) δt = rt + γ Qw(s', a') − Qw(s, a)
      ii) Use it to update the parameters of the action-value function: w ← w + αw δt ∇w Qw(s, a)
   e) Update a ← a' and s ← s'.
Both learning rates, αθ and αw, are predefined for the policy and value function parameter updates respectively.

44690 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 5. The ecosystem of single A3C worker thread with Atari 2600.

the environment simultaneously. The agents, which are also


known as workers, are trained in parallel and update peri- FIGURE 6. The architecture of the worker agent thread in A3C.
odically a global network, which holds shared parameters.
The updates are not happening simultaneously and that’s a new-found insight to the agent into the environment and thus
where the asynchronous comes from. The unique exploration the learning process is better. The advantage metric is given
experience offered by each of the global actor-critic networks. by the following expression
With such multiple threads sharing the experience with a
global network in an asynchronous fashion, A3C eliminates Advantage : A = Q(s, a) − V (s) (4)
the bias of continuous experience trajectory by feeding only
a small batch of experience tuple (s, a,r,s‘) at any time. After where Q refers to the Q value calculated by the critic module
each update, the agents reset their parameters to those of the based on the actual reward and TD error following an actor’s
global network and continue their independent exploration policy-based chosen action. The Advantage function named
and training for n steps until they update themselves again. A(St at ; θ, θv ) is calculated that needs to be discounted future
With this approach, the information flows not only from the rewards accumulated to tmax or at the terminal state.
agents to the global network but also between agents as each Xk−1
A(St at ; θ, θv ) = γ i rt+1 + γ k V (st+k ; θv ) − V (st ; θv )
agent resets its weights by the global network, which has i=0
the information of all the other agents. Fig. 5 shows the (5)
ecosystem of a single actor-critic worker.
A3C uses a deep neural network to model both a policy Gradients associated with both policy and value networks are
network π(at |st ; θ) and a value networkV (st ; θ). For a given denoted by the following equations (6) and (7) respectively,
state St , the policy network (which is the ‘‘actor’’) predicts which are calculated by summing over all the states in the
the optimal action to take at St while the value network past tmax local iterations of each worker agent thread’s exe-
(which is the ‘‘critic’’) approximates the future reward from cution [41].
taking the optimal action at St . By theory, these two networks
∇θ ‘ logπ(at |st ; θ)A(st , at ; θ, θv ) (6)
are separate, but in practice, we use the same convolutional  2
layers for both the policy and value networks with separate dθ = dθ + ∂(R − V st ; θV‘ ) /∂θ ‘ (7)
output layers at the end. The Asynchronous nature of A3C
means that multiple actor-critic threads are running at the The pseudocode of the A3C algorithm for each worker agent
same time, each with its environment. Each thread steps thread within the hybrid multi-task model is given by the
through its environment with its own local CNN, periodi- algorithm mentioned below [4].
cally updating a globally shared CNN wherein all networks
have an identical architecture. For each thread, at every tmax VI. IMPLEMENTATION OF HYBRID MULTI-TASK SYSTEM
local steps or when a terminal state is reached, that thread PROTOTYPE
syncs its local parameters with the global parameters, com- This section details the methodology adopted towards the
putes gradients, and applies them upstream to the global prototype implementation of the proposed hybrid multi-task
network [41]. model which is based on the A3C algorithm. Throughout the
A3C follows online learning by adopting a policy gradient implementation, the prototype was tested with various games
method, directly from the states as they are processed by under the Atari 2600 environment provided within the Ope-
each worker agent thread. The policy is developed naturally nAI Gym [44]. The Gym library is a toolkit made by OpenAI
as each thread runs within its stochastic Atari 2600 based for developing and comparing RL algorithms. The first stage
gaming environment and updates to the global parame- of the hybrid multi-task model was constructed by adopting
ters. Fig. 6 indicates the worker agent architecture with the A3C algorithm for the gaming environment Breakout-
CNN. v0. The high-level architecture of the model is based on the
This methodology suggests that A3C does not overfit to actor-critic methodology. In our context, the actor is a neural
any particular state trajectory of a specific worker thread. network that parameterizes the policy π (a | s) and critic is
The notion of Advantage A is used to measure the difference another neural network that parameterizes the value function
between the expected reward and estimated reward. By using V (s). The policy network outputs the policy (π), based on
the value of advantage instead, the agent also learns how which the actor chooses an action within the environment,
much better the rewards were than its expectation. This gives and the value network outputs the value function V (s). Each

VOLUME 9, 2021 44691


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

Algorithm A3C Algorithm – Pseudocode for Each


Actor-Leaner Thread
// Assume global shared parameter vectors θ and θv and
global shared counter T = 0
// Assume thread-specific parameter vectors θ 1 and θv‘
Initialize thread step counter t ← 1
repeat
Reset gradients dθ ← 0 and dθv ← 0
Synchronize thread-specific parameters θ ‘ = θ and
θv‘ = θv
tstart = t
FIGURE 7. A3C multi-task worker agent model.
Get state st
repeat
perform at according to the policy π(at |st ; θ ‘ )
Receive reward rt and new state st
t ←t +1
T ←T +1
until terminal st or t − tstart == tmax
R = {0  terminal st
R = V (st , θv‘ ) for terminal st //Bootstrap from last
state
for i ∈ {t − 1 . . . tstart do
R ← ri + γ R
Accumulate gradients wrt
θ ‘ : dθ ← dθ + ∇θ ‘ logπ(ai |si ; θ ‘ )(R − V (si ; θv‘ ))
Accumulate gradients wrt
2
θv‘ : dθv + ∂(R − V (si ; θv‘ )) /∂θv
end for FIGURE 8. Training workflow of worker agent thread.
Perform asynchronous update of θ using dθ and of
θ using dθv individual blocks is made up of a pair of CNN networks,
until T > Tmax each for the actor(policy) and critic (value function) modules.
In other words, A3C utilizes N worker agents attacking the
same game environment while being initialized differently.
of these networks has its respective weights which are often This indirectly points out that each of these agents starts at a
represented by notations such as θp and θv. different point in their environment so they will go through
the same environment in different ways to solve the same
π(a|s, θp) = Neural Network (input : s, weights : θp) problem.
(8) Fig. 8 shows the training workflow of each worker agent.
V (s, θv) = Neural Network (inputs, weights : θv) (9) Within the A3C-based multi-task worker agent environ-
ment, each of the individual worker agents is managed by
A more graphic intense Atari 2600 game environment the global network directly. Under this scheme, initially, each
named- Breakout-v0 is being relatively treated as a complex of the workers is reset with parameter values shared by the
environment as we will be having an infinite number of global network, later on, the worker interacts with its copy of
state-action spaces to deal with. To accommodate and handle the environment. Even though each of the worker agents is
this environment, the neural network-based model was used operating within the same game environment, they are being
for the validation. At the root level, this environment will initialized differently. This allows each of these agents to start
employ a pair of CNN models to implement both actor and at a different point in their environment. During its operation,
critic modules for a single worker. There will be multiple each worker agent plays a fixed number of game episodes
instances of the CNN class objects to implement the multiple and calculates the value and respective policy loss. As these
worker threads used within the multi-task model. Similarly, modules, both actor and critic are implemented using the
the global network was also deployed as a pair of CNN to neural network, gradient values are calculated from the losses
support the implementation of actor-critic modules at the incurred during its operation. These gradient values will be
global network level. shared with the global network after the work agent finishes
Fig. 7 shows the high-level architecture view of the a fixed number of game episodes. The algorithm behind the
multi-task model having N worker threads of execution coor- operation of the A3C multi-task worker agents’ model is
dinated and managed by a global network. Each of these mentioned below.

44692 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

Algorithm Algorithm of A3C-Based Multi-Task Model


Worker Agent
while not done:
a = sample an action a ∼π θ (a|s)
s‘, r, done = Perform action a
−env.step(a)
G = r + γ V (s‘ ) 
Lp = −(G − V (s))log(π a|s,θ p )
Lv = (G − V (s))∧2 FIGURE 10. Gradient update by worker agents with the global network.
θ p =θ p − α ∗ d Lp /d θ p
θ v = θ v − α ∗ d Lv /d θ v Now every so often, this global network is going to send
its weights to a set of worker agents each with their copy
of policy and value network. Further on each of these indi-
vidual worker agents will be playing a few episodes of
the game under its environment using its network weights
from its own experience. From its own experience, each
worker agent can calculate its policy gradient updates and
value updates. Knowledge of these updates will be limited
to only these individual worker agents. Eventually, worker
agents send their gradient values to the global network so
that the global network can update their weights accordingly.
FIGURE 9. The CNN-based architecture of a single A3C worker agent. Every so often the global network gives its new updated
parameters back to its working agents so worker agents are
During the operation, each of the worker agents loops always working with a relatively recent copy of the global
through each step of the game, and samples the action, and network. In this working model, worker threads play episodes
updates the weights of both the neural networks- actor and of games under their respective environments, find the errors,
critic. The algorithm runs until a preset number of episodes and calculate the update gradients which will be shared with
of the game are played, wherein initially action is sampled the global network regularly. Fig. 10 shows the sharing of
from the actor (policy network). Further on, upon comple- gradient updates by individual worker agents with the global
tion of that action respective reward (r) and a new state (s0 ) network.
are calculated. Based on the new state reached, the total
discounted future return (G) is calculated by applying the VII. EXPERIMENTS AND RESULTS
discount factor (gamma). Based on this each of the individual A3C provides a multi-threaded and asynchronous approach
neural networks calculates its policy loss (Lp), and value to deep reinforcement learning [43]. This algorithm gives
loss (Vp) [45]. Further on, the neural network uses gradient the capability to have a model to be trained with multiple,
descent to update the respective network weights (θp – policy different explorations of a single target task, providing data
network weight and θv – value network weight) to minimize sparsity, and avoiding the use of memory replay. Given the
the loss. multi-threading characteristics, the proposed hybrid model
At the root level, this environment will employ a pair of attempts to leverage A3C’s ability to perform multi-task
convolutional neural network (CNN) models to implement learning without modifications when applied to different,
both actor and critic modules for a single worker. There will but semantically related tasks. To do so, we simultaneously
be multiple instances of the CNN class objects to imple- train multiple tasks using a single A3C model, allowing the
ment the multiple-worker threads used within the multi-task network to asynchronously share knowledge obtained from
model. Similarly, the global network is also deployed as a and to all tasks. The hybrid A3C model attempts to learn
pair of CNN-based actor-critic modules at the global net- two different tasks and then combine the learning to accel-
work level. These neural network models act as a function erate the performance. Evaluation of the proposed hybrid
approximator by processing each screenshot of the game as multi-task model will be conducted on a prototype based on
its input. We have used RMSpromp optimizer with this imple- the A3C model and trained with the Atari 2600 environment
mentation. During the first stage of experiments, the eval- provided in the OpenAI Gym. The Gym library is a toolkit
uation of the multi-task learning model was performed on for developing and comparing reinforcement learning algo-
a machine having two cores (dual-core). Under this ini- rithms [44]. It makes no assumptions about the structure of
tial test-setup, each worker agent will be running on each your agent and is compatible with numerical computation
core, and hence both worker agents are executed in parallel. libraries, such as TensorFlow or Theano. A3C algorithm used
Fig. 9 is a diagrammatic representation of the CNN-based for the experiments will be based on Google DeepMind’s
model used to implement each of the individual worker agent paper titled-asynchronous methods for deep reinforcement
threads. learning.

VOLUME 9, 2021 44693


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 11. Single-agent actor – average rewards.


FIGURE 13. Single-agent actor- merged results.

FIGURE 12. Single-agent actor- total rewards.


FIGURE 14. Single-agent critic-average rewards.

As a preliminary step towards the development of the


proposed system model, the initial set of experiments are con-
ducted with the game of CartPole-v0 which is having a finite
set of action and state space. The methodology followed was
to individually develop the single-agent actor which is based
on policy gradient, and similarly a single agent critic which
is a value-based network to measure the performance. Both
these networks were developed as the standard feedforward
neural networks and experiments are conducted for the finite
number of episodes. As an outcome of the experiment perfor-
mance of both single-agent actor and critic are measured.
Fig. 11 to Fig. 13 represent the statistics generated for the
single-agent actor feedforward neural networks.
Fig. 14 to Fig. 16 represent the statistics generated for the
single-agent critic feedforward neural networks.
It is evident from the statistics that policy gradient-based
actors can increase the rewards over the episodes gradually. FIGURE 15. Single-agent critic-total rewards.
At the same time, the value-based critic module can show
the increment in performance in the early episodes, with a Pong-v0, Breakout-v0, SpaceInvaders-v0, DemonAttack-
small dip in the mid episodes with a fluctuating result for the v0, and Pheonix-v0. During the first stage of evaluation,
forthcoming episodes. the performance of the reinforcement learning agent will
Following OpenAI Atari 2600 gaming environments be measured individually on each of these gaming environ-
will be used for the evaluation of the proposed model, ments to generate the initial test statistics. The test results

44694 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 16. Single-agent critic-merged results.

FIGURE 18. Breakout-v0 multi-task workers model-average rewards.

FIGURE 17. Breakout-v0 multi-task workers environment.

generated by the A3C model were trained within the OpenAI


Atari 2600 environments provided in the OpenAI Gym [10]. FIGURE 19. Breakout-v0 multi-task workers model-total rewards.
In the next step towards the evaluation of a proposed hybrid
multi-task model, the A3C algorithm based on a multi-worker
agent-based environment is created for a more graphic intense
Atari 2600 game environment Breakout-v0. From the per-
spective of a DRL agent, this environment is being treated
as a complex one as we will be having an infinite number of
state-action spaces to deal with. To accommodate and han-
dle this environment, a convolutional neural network (CNN)
based model was used for the validation. This configura-
tion was tested under a desktop-based environment by using
a multi-task environment having four worker threads that
combinedly executed 500,000 steps of the game. Each of
the individual threads is having its copy of the environment
but different from one another in terms of the view of the
gaming environment. Fig. 17 shows the multi-task worker
based environment for Breakout-v0
Fig. 18 to Fig. 20 show the test results captured for the A3C
algorithm based on the multi-task worker model for the Atari FIGURE 20. Breakout-v0 multi-task workers model –merged results.
2600 gaming environment named Breakout-v0. This testing
was carried out by using 4-worker agents or worker threads As a further attempt towards the evaluation of the pro-
based A3C model to generate the initial set of results of a posed hybrid multi-task model, the A3C algorithm-based
desktop-based test environment. multi-worker agent environment is also created for one more

VOLUME 9, 2021 44695


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 21. Snapshot of Breakout-V0 and Pong-v0 environment.

graphic intense Atari 2600 game environment named- Pong-


v0. The decision to choose Pong-V0 was after the careful
FIGURE 22. Pong-v0 multi-task workers model-average rewards.
examination of the high level of similarity level among these
two games, Breakout-V0 and Pong-V0. Having a reasonable
level of similarity could act as an accelerator during the
validation of the proposed hybrid multi-task learning model
execution. Similar to the way how Breakout-v0 was tested
earlier under a multi-task worker environment, the Pong-
v0 game was also tested under a desktop-based environment
by using a multi-task environment having four worker threads
that combinedly executed 5 million steps of the game.
Each of the individual threads is having its copy of the envi-
ronment but different from one another in terms of the view
of the gaming environment. This environment will be having
an infinite number of state-action spaces to deal with during
the optimization of the DRL agent. To accommodate and han-
dle the Pong-v0 gaming environment, a similar CNN-based
model was used during the validation of the multi-task learn- FIGURE 23. Pong-v0 multi-task workers model-total rewards.
ing model. In both Pong and Breakout, a player must control
a paddle to hit a ball. For Pong, the player must attempt to
make an opponent miss the ball, while for Breakout the goal
is to break as many bricks as possible. Fig. 21 shows the
graphical representation for the Atari-2600 Breakout-v0 and
Pong-v0 gaming environments.
Fig. 22 to Fig. 24 show the test results captured for the A3C
algorithm based on the multi-task worker model for the Atari
2600 gaming environment named Pong-v0.
As part of the detailed and exclusive evaluation of the
proposed hybrid multi-task model, we decided to pick one
more pair of Atari 2600 games namely Space Invaders-v0 and
DemonAttack-v0 from the Gym library. The decision to
choose these two games as the second test pair was after the
examination of the high level of semantic similarity between
their pattern play. Both these games are based on the theme
of shooting wherein the player should be able to control a
moving ship with the capability of shooting and hitting the FIGURE 24. Pong-v0 multi-task workers model-merged results.
enemies. In terms of complexity, Space Invaders is relatively
less complex as the enemies in this game move more in a
regular fashion than in the other game. Whereas in Demon has its reward structure that is in-built by the Gym library.
Attack, there are a wide variety of enemies who moves more In other words, even though there is some level of semantic
randomly with the capability to shoot back, which makes similarity between the games chosen within each test pair,
the gameplay more complex from the perspective of the RL the scoring and reward structure followed within each game
agent. More importantly, every game used in this experiment is unique.

44696 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 25. SpaceInvaders-v0 and DemonAttack-V0 environment.

FIGURE 27. SpaceInvaders-v0 multi-task workers model-total rewards.

FIGURE 26. SpaceInvaders-v0 multi-task workers model-average rewards.

Fig. 25 shows the graphical representation for Atari-


2600 based SpaceInvaders-v0 and DemonAttack-v0 gaming
environments.
Similar to the way how previous two games from the FIGURE 28. SpaceInvaders-v0 multi-task workers model- merged results.
first pair were tested, the SpaceInvaders-v0 game was also
tested under a desktop-based test environment by using a
multi-task worker model having four worker threads that that combinedly executed 500,000 steps of the game. Each of
combinedly executed about 500,000 steps of the game. the individual threads is having its copy of the environment
Fig. 26 to Fig. 28 show the test results captured for the A3C but different from one another in terms of the view of the
algorithm based on the multi-task worker model for the Atari gaming environment. This environment will be having an
2600 gaming environment named Space Invaders-v0. infinite number of state-action spaces to deal with during the
Each of the individual threads is having its copy of the optimization of the DRL agent.
environment but different from one another in terms of the Fig. 29 to Fig. 31 show the test results captured for the A3C
view of the gaming environment. This environment will be algorithm based on the multi-task worker model for the Atari
having an infinite number of state-action spaces to deal with 2600 gaming environment named Demon Attack-v0.
during the optimization of the DRL agent. To accommodate To test and generate better results with a higher number
and handle the Space Invader-v0 gaming environment, a sim- of episodes of gameplay for each game under the proposed
ilar CNN based model was used during the validation of the hybrid multi-task model, we decided to test the proposed
multi-task learning model model under a cloud-based test environment. As part of this,
This testing was carried out by using 4-worker agents or we opted to move our testing to machines with GPU with
worker threads-based A3C model to generate the initial set CUDA cores support under the cloud environment hosted by
of results of a desktop-based test environment. Paperspace. This allowed us to rent a server in the cloud with
Similar to the way how Space Invader-v0 was tested earlier much higher throughput than that of our local machine.
under a multi-task worker environment, the DemonAttack- Paperspace server used has up to 8GB of graphic mem-
v0 game was also tested under a desktop-based environment ory and 32 GB of RAM and equipped with NVIDIA
by using a multi-task environment having four worker threads GPU - Quadro P5000 having CUDA support (with

VOLUME 9, 2021 44697


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 31. DemonAttack-v0 multi-task workers model-merged results.


FIGURE 29. Demon Attack-v0 multi-task workers model-average rewards.

FIGURE 32. Test environment of Paperspace cloud server machine.

FIGURE 30. DemonAttack-v0 multi-task workers model-total rewards.

2560 CUDA cores) to facilitate the parallel computing for


deep learning applications.
During this process, we configured a couple of Win-
dows OS-based virtual test machines namely Gen 2 (P4000)
having NVIDIA GPU supported with CUDA cores in the
cloud environment. Each of the Atari2600 games was tested
with 8 worker agents for a higher number of global steps. FIGURE 33. Breakout-v0 standalone test result with 8 multi-task workers.
To capture the test results, a tensor board visualization tool
was employed which uses the event file captured during the represent the rewards (game score). The same convention
test execution to generate the test execution results. Fig. 32 applies to Figures 38 to 48.
depicts the working environment of the cloud server machine. Now, as the next step in the verification of our proposed
Fig. 33 to Fig. 36 show the test results captured for the A3C hybrid multi-task model, we have tested the model by running
algorithm based on the multi-task worker model for the Atari two semantically similar games simultaneously.
2600 gaming environments under the virtual test machines At the end of testing, the individual test score for each
under the cloud environment. Note that these figures were game was captured. Since we have chosen two pairs of games
generated within TensorBoard (TensorFlow’s visualization with semantic similarity, we created a separate test setup for
toolkit), the numbers on the x-axis represent the global steps each pair. Fig. 37 shows the diagrammatic representation for
in millions (taken by the agent), and the numbers on the y-axis each pair under the hybrid multi-task model. To maintain

44698 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 34. Pong-v0 standalone test result with 8-multi-task workers.

FIGURE 37. Hybrid multi-task model of Breakout-v0 and Pong-v0.

FIGURE 35. DemonAttack-v0 with 8-multi-task workers.

FIGURE 38. Breakout-v0 test results with the hybrid multi-task model.

FIGURE 36. SpaceInvaders-v0 with 8-multi-task workers.

the uniformity of testing, each of the individual games was


tested with 8 worker agents which totals to 16 worker threads
altogether within the test environment. We have the RMSprop
optimizer for the testing with related hyperparameters such as
learning rate, decay, momentum, epsilon, clip norm parame-
ter. We also have other hyperparameters such as the discount
FIGURE 39. Pong-v0 test results with the hybrid multi-task model.
rate factor for rewards, maximum global steps, and worker
threads having CNNs with 2 hidden layers with a ReLU
activation function. Fig. 38 and Fig. 39 respectively show Similarly, we created the joint test environment for the sec-
the test execution results captured for breakout-v0 and Pong- ond test pair consisting of Atari2600 gaming environments,
v0 under the joint test environment. SpaceInvader-v0, and DemonAttack-v0. To maintain the
These test results are generated based on the experiments uniformity of testing, each of the individual games was
conducted with the Paperspace cloud server machines having tested with 8 worker agents which totals to 16 worker
the Nvidia GPU supported by CUDA cores. This environment threads altogether within the test environment. Fig. 40 and
facilitates the large-scale testing for the hybrid multi-task Fig. 41 show the diagrammatic representation for each
model having a CNN-based feature extraction module. pair.

VOLUME 9, 2021 44699


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

FIGURE 40. DemonAttack-v0 test results with the hybrid multi-task


model.

FIGURE 43. Pong-v0 results for the semantic dissimilar test.

FIGURE 41. SpaceInvaders-v0 test results with the hybrid multi-task


model.

FIGURE 44. SpaceInvaders-v0 results for the semantic dissimilar test.

FIGURE 42. DemonAttack-v0 results for the semantic dissimilar test.


FIGURE 45. Breakout-v0 results for the semantic dissimilar test.

To measure the impact of the DRL agent’s performance


with the hybrid multi-task model while testing with environ-
ments with high semantic dissimilarity, we have also con-
ducted two pairs of testing. In this testing first pair of testing
was done using DemonAttack-v0 and Pong-v0, which are
having a high level of semantic dis-similarity level. Under
this test environment, the performance of each of the indi-
vidual games will be measured to see the impact of negative
knowledge (gradient transfer). A similar test setup will be
made ready for the second pair consisting of Atari2600 gam-
ing environments namely, SpaceInvader-v0 and Breakout-v0. FIGURE 46. DemonAttack-v0 test results with hybrid multi-task model for
3 semantically similar environments.
Results shown from Fig. 42 to Fig. 45 show the test results for
each test pair.
During our test efforts, we also conducted experiments similarity factor, at the same time, each of them is having its
to measure the impact of individual game scores when the reward structure Fig. 46 to Fig. 48 show the respective test
hybrid multi-task model is tested with three semantic sim- results captured with the hybrid multi-task model for the three
ilar games namely SpaceInvader-v0, DemonAttack-v0, and OpenAI Atari 2600 gaming environments with the high level
Pheonix-v0. Even though each of these games has a semantic of semantic similarity.

44700 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

In the second stage of testing, we experimented with our


proposed hybrid multi-task model approach, wherein we
trained two games, but with high-level semantic similarity,
simultaneously. In contrast to the first stage testing, where
gradients shared to the global network by worker agents are
all of the same types, in the hybrid environment we have two
different types of worker threads. As it is anticipated, the per-
formance of individual games under the hybrid environment
was not on par with standalone performance results obtained
with the first stage of testing. As and when the progress of
the game, we could see the impact of positive knowledge
FIGURE 47. Pheonix-v0 test results with hybrid multi-task model for sharing among these two tasks that are trained jointly. Due
3 semantically similar environments.
to the semantic similarity among them, updates shared by
the global network could mitigate some of the key challenges
associated with partial observability in comparison to a single
game-based environment. Based on the test results obtained
with each of the sets that we mentioned earlier, we could see
that each of the games under each set could boost its perfor-
mance throughout the training. By this, we can establish that
our hybrid multi-task model can learn multiple similar gam-
ing tasks simultaneously without degradation in performance
for any one of the individual gaming tasks. In comparison to
the state-of-the-art methods discussed which are based on the
distillation methodology, the hybrid multi-task model adopts
FIGURE 48. SpaceInvaders-v0 test results with hybrid multi-task model to train and learn the method for a multi-task actor-critic
for 3 semantically similar environments.
network from the scratch. Along with this, the hybrid multi-
task approach also measures the impact amount of positive
VIII. DISCUSSION OF TEST RESULTS knowledge transfer done through parameter sharing. As we
This section analyzes the test results obtained with the hybrid have adopted a model-free-based approach, it is relatively
multi-task model tested with various Atari 2600 gaming envi- less computationally intensive compared to a model-based
ronments. In the first stage of the testing, we conducted a approach.
standalone kind of testing with each of the individual gaming In the next stage of testing with the hybrid multi-task
environments individually. To conduct this testing, we have model, we conducted experiments by testing the hybrid
created the A3C algorithm-based multi-thread model wherein multi-task model with two different pairs of games with a
each of the games is tested by using 8 worker threads. high level of semantic dissimilarity. As we could see from
To maintain the uniformity of the testing throughout this the test results obtained, negative knowledge transfer or the
experiment, we have kept the count of worker threads as gradients shared by two semantically dissimilar worker train-
8 for all the gaming environments. We tested our model by ing threads had a huge impact on the individual games’
adding the final LSTM layer after the feedforward network score. As the test results indicate, all the individual games
to obtain the best performance of the A3C algorithm as a ‘performance was hugely affected due to negative knowledge
whole. We have extensively used NVIDIA GPU - Quadro transfer. Finally, we also tested our model to see the impact
P5000 having CUDA support (with 2560 CUDA cores) to on the positive knowledge transfer by training more than two
facilitate parallel computing as it involves the use of CNN semantically similar tasks with the same number of workers
to process game screen images. More importantly, in the allocated to each game. The test results obtained indicate that
first stage of testing, we choose two sets of games, with set as the number of worker threads increases, updates shared
1 consisting of Breakout-v0 and Pong-v0, then set 2 consist- by the global network deteriorates in comparison to a hybrid
ing of games SpaceInvaders-v0 and DemonAttack-v0. The multi-task model with two semantically similar tasks. This
decision to choose these games to form two sets was after situation possibly requires more tuning on the hyperparame-
the clear examination of semantic similarity factor among ter front as well as catastrophic forgetting of the neural net-
them. As anticipated, the base A3C-based multi-thread model works of the gaming environments, which will be addressed
was able to achieve performance enhancement on all of these in the future work planned.
games during the testing due to the parallel multi-task learn- The objective behind the proposed hybrid multi-task learn-
ing aspect of A3C. We have conducted the testing for 25 mil- ing model is to leverage multi-task learning capabilities
lion to 30 million global steps for each of these individual offered by the core actor-critic methodology by using the
games to have convincing test results for comparison with A3C algorithm to optimize the DRL’s performance. By hav-
future state tastings planned. ing a hybrid multi-task-based learning environment, wherein

VOLUME 9, 2021 44701


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

agents belonging to different but semantically similar games, accumulated gradients transfer or knowledge transfer among
we aimed at addressing some of the key challenges associated workers always governed by the global network. Addition-
with the existing multi-task DRL. To showcase, the extent to ally, the current implementation of the hybrid multi-task
which our model could address those issues, we would like to model mandates that all the workers be present on the same
have a case study based on the test results obtained. For this machine, where the IMPALA model supports distributed
purpose, we are using both the standalone and hybrid model system-based working environment for the workers. The
test results obtained for the Breakout-v0 game as indicated by PopArt model is being considered as an extension of the
Fig. 33 and Fig. 38 respectively. To have a fair comparison IMPALA model itself and designed to address key issues such
and derive a convincing conclusion, we have ensured that the as distraction dilemma and thereby stabilize the process of
same amount of resources have been allotted in both test sce- multi-task learning.
narios in terms of the number of worker threads, test configu-
rations, and the number of global steps parameter. By having IX. CONCLUSION AND FUTURE WORK
a comparison of these two test results, it is quite evident that In this research work, we propose a hybrid multi-task
in terms of the training time needed, the hybrid model could model-that follows a parallel, multi-tasking approach for
surpass the performance of the standalone model much ahead optimizing the performance of deep reinforcement learning
of time. After running the Breakout-v0 under a standalone agents. We present how to combine the multi-task learnings
model for 2.5+e5 (25 Million) global steps, the highest score from two different deep reinforcement learning agents oper-
it could achieve was a little over the range of 12, whereas the ating within two different by semantically similar environ-
hybrid model could surpass the same level in almost half of ments running with related tasks. Initial stage experiments
its execution time. In continuation to this, it is reasonable to are conducted by applying the DRL algorithm A3C to Atari
conclude that hybrid multi-task learning by having a group of 2600 gaming environment to draw the results. During the
different but semantically similar environments with similar experiments, we can establish that our hybrid multi-task
tasks could reduce the impact of partial observability which model can learn multiple similar gaming tasks, at least two,
restricts a DRL agent from choosing the optimal action while simultaneously without making changes in the algorithm and
in a state. Due to the impacts of the positive knowledge degradation in performance for any one of the individual
transfer facilitated by the gradient transfers from the second gaming tasks. The semantic similarity aspect of the related
environment’s agents, the actor module within each worker is tasks running two different environments is the most vital
having a better policy to choose the optimal action while in a factor to reduce the challenges posed in terms of the possible
step. Having said this, by possessing better policy parameters negative knowledge transfer.
actor module is in a better position to explore the environment For future work, we plan to conduct the experiments of the
in a much effective way and choose the optimal action in hybrid multi-task model with more complex gaming envi-
each state. This in turn is expected to improve throughout the ronments having a higher number of worker threads under
DRL agents’ execution as more positive knowledge transfer GPU cloud server-based machine environment to draw strong
is anticipated to happen with more global steps of gameplay. conclusions on parallel multi-task learning. Along with this,
The same kind of comparison case study could be applied to we also would like to investigate the steps to mitigate the
other game test pairs from the experiment. Seen in the light of impacts of negative knowledge transfer and catastrophic for-
these observations, it is reasonable to conclude that the hybrid getting in deep reinforcement multi-task learning.
multi-task learning model can address the objectives, it was
ACKNOWLEDGMENT
aiming for, to a great extent.
The authors acknowledge the support of the Natural Sciences
Finally, we also would like to have a comparison of
and Engineering Research Council of Canada (NSERC).
the proposed hybrid multi-task learning model against the
three state-of-the-art techniques that were mentioned under REFERENCES
the related work. In comparison to the hybrid multi-task [1] R. S. Sutton, ‘‘Generalization in reinforcement learning: Successful exam-
model which relies on the idea of sharing the network learn- ples using sparse coarse coding,’’ in Proc. Adv. Neural Inf. Process. Syst.,
1996, pp. 1038–1044.
ing parameters by a global network to individual workers, [2] C. J. C. H. Watkins and P. Dayan, ‘‘Q-learning,’’ Mach. Learn., vol. 8,
the Distral model works on the idea of sharing a distilled nos. 3–4, pp. 279–292, 1992.
centroid policy that would regularize the workers running in [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G. Bellemare,
the environment. When it comes to the comparison with the A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, ‘‘Human-
level control through deep reinforcement learning,’’ Nature, vol. 518,
IMPALA model, its design approach is having similarity to pp. 529–533, Feb. 2015.
the hybrid multi-task learning model in terms of the actor- [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver,
critic methodology as it follows the topology of a set of actors and K. Kavukcuoglu, ‘‘Asynchronous methods for deep reinforcement
learning,’’ in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
with either a single learner or multiple learners. Within the [5] A. Mujika, ‘‘Multi-task learning with deep model based
IMPALA model, the learner’s role is to create a central policy reinforcement learning,’’ 2016, arXiv:1611.01457. [Online]. Available:
to be shared with the actors. Along with these learners have http://arxiv.org/abs/1611.01457
[6] R. Glatt and A. H. R. Costa, ‘‘Improving deep reinforcement learning
the flexibility to communicate among themselves for shar- with knowledge transfer,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017,
ing the gradients. In the hybrid multi-task model, workers’ pp. 5036–5037.

44702 VOLUME 9, 2021


N. V. Varghese, Q. H. Mahmoud: Hybrid Multi-Task Learning Approach for Optimizing DRL Agents

[7] N. Vithayathil Varghese and Q. H. Mahmoud, ‘‘A survey of multi-task deep [31] Z. Yang, K. E. Merrick, H. A. Abbass, and L. Jin, ‘‘Multi-task deep
reinforcement learning,’’ Electronics, vol. 9, no. 9, p. 1363, Aug. 2020. reinforcement learning for continuous action control,’’ in Proc. IJCAI,
[8] G. Boutsioukis, I. Partalas, and I. Vlahavas, ‘‘Transfer learning in multi- 2017, pp. 3301–3307.
agent reinforcement learning domains,’’ in Proc. Eur. Workshop Reinforce- [32] P. Dewangan, S. Phaniteja, K. M. Krishna, A. Sarkar, and B. Ravindran,
ment Learn. Berlin, Germany: Springer, 2011, pp. 249–260. ‘‘DiGrad: Multi-task reinforcement learning with shared actions,’’ 2018,
[9] G. Weiss, Multiagent Systems: A Modern Approach to Distributed Artifi- arXiv:1802.10463. [Online]. Available: http://arxiv.org/abs/1802.10463
cial Intelligence. Cambridge, MA, USA: MIT Press, 1999. [33] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, ‘‘Deep reinforce-
[10] N. D. Nguyen, T. T. Nguyen, D. Creighton, and S. Nahavandi, ‘‘A visual ment learning for multiagent systems: A review of challenges, solutions,
communication map for multi-agent deep reinforcement learning,’’ 2020, and applications,’’ IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839,
arXiv:2002.11882. [Online]. Available: http://arxiv.org/abs/2002.11882 Sep. 2020.
[34] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, ‘‘Deep
[11] Y. Bengio, Learning Deep Architectures for AI. Boston, MA, USA: Now,
decentralized multi-task multi-agent reinforcement learning under
2009.
partial observability,’’ 2017, arXiv:1703.06182. [Online]. Available:
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, http://arxiv.org/abs/1703.06182
G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, [35] S. V. Macua, A. Tukiainen, D. G.-O. Hernández, D. Baldazo,
M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, E. M. de Cote, and S. Zazo, ‘‘Diff-DAC: Distributed actor-critic for
I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, average multitask deep reinforcement learning,’’ 2017, arXiv:1710.10363.
and D. Hassabis, ‘‘Mastering the game of go with deep neural networks [Online]. Available: http://arxiv.org/abs/1710.10363
and tree search,’’ Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016. [36] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell,
[13] D. Borsa, T. Graepel, and J. Shawe-Taylor, ‘‘Learning shared represen- N. Heess, and R. Pascanu, ‘‘Distral: Robust multitask reinforcement learn-
tations in multi-task reinforcement learning,’’ 2016, arXiv:1603.02041. ing,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4496–4506.
[Online]. Available: http://arxiv.org/abs/1603.02041 [37] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
[14] R. S. Sutton, and A. G. Barto, Reinforcement Learning: An Introduction. Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu,
Cambridge, MA, USA: MIT Press, 2018. ‘‘IMPALA: Scalable distributed deep-RL with importance weighted
[15] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, ‘‘Prioritized actor-learner architectures,’’ 2018, arXiv:1802.01561. [Online]. Available:
experience replay,’’ 2015, arXiv:1511.05952. [Online]. Available: http://arxiv.org/abs/1802.01561
http://arxiv.org/abs/1511.05952 [38] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and
[16] T.-L. Vuong, D.-V. Nguyen, T.-L. Nguyen, C.-M. Bui, H.-D. Kieu, H. van Hasselt, ‘‘Multi-task deep reinforcement learning with popart,’’ in
V.-C. Ta, Q.-L. Tran, and T.-H. Le, ‘‘Sharing experience in multitask rein- Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 3796–3803.
forcement learning,’’ in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, [39] N. V. Varghese and Q. H. Mahmoud, ‘‘Optimization of deep reinforcement
pp. 3642–3648. learning with hybrid multi-task learning,’’ in Proc. IEEE Int. Syst. Conf.
(SysCon), Vancouver, BC, Canada, 2021, pp. 1–8.
[17] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick,
[40] D. S. Chaplot, L. Lee, R. Salakhutdinov, D. Parikh, and D. Batra, ‘‘Embod-
K. Kavukcuoglu, R. Pascanu, and R. Hadsell, ‘‘Progressive
ied multimodal multitask learning,’’ 2019, arXiv:1902.01385. [Online].
neural networks,’’ 2016, arXiv:1606.04671. [Online]. Available:
Available: http://arxiv.org/abs/1902.01385
http://arxiv.org/abs/1606.04671
[41] T. Chesebro and A. Kamko. (Dec. 15, 2016). Learning Atari: An
[18] M. E. Taylor and P. Stone, ‘‘An introduction to intertask transfer for Exploration of the A3C Reinforcement Learning Method. Accessed:
reinforcement learning,’’ AI Mag., vol. 32, no. 1, p. 15, Mar. 2011. Oct. 17, 2020. [Online]. Available: https://bcourses.berkeley.edu/files/
[19] R. Caruana, ‘‘Machine learning,’’ Mach. Learn., vol. 28, no. 1, pp. 41–75, 70573736/download?download_frd=1
1997. [42] L. Weng. (Apr. 18, 2018). Policy Gradient Algorithms. Accessed:
[20] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, Dec. 28, 2020. [Online]. Available: https://lilianweng.github.io/
A. Pritzel, and D. Wierstra, ‘‘PathNet: Evolution channels gradient descent and https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-
in super neural networks,’’ 2017, arXiv:1701.08734. [Online]. Available: algorithms.html#actor-critic
http://arxiv.org/abs/1701.08734 [43] Z. Gu, Z. Jia, and H. Choset, ‘‘Adversary A3C for robust rein-
[21] A. A. Rusu, S. Gomez Colmenarejo, C. Gulcehre, G. Desjardins, forcement learning,’’ 2019, arXiv:1912.00330. [Online]. Available:
J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, http://arxiv.org/abs/1912.00330
‘‘Policy distillation,’’ 2015, arXiv:1511.06295. [Online]. Available: [44] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang,
http://arxiv.org/abs/1511.06295 and W. Zaremba, ‘‘OpenAI gym,’’ 2016, arXiv:1606.01540. [Online].
[22] C. Bucilu, R. Caruana, and A. Niculescu-Mizil, ‘‘Model compression,’’ in Available: http://arxiv.org/abs/1606.01540
Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, [45] Lazy Programmer. (Aug. 24, 2020). Deep Reinforcement Learning in
pp. 535–541. Python. Accessed: Oct. 21, 2020. [Online]. Available: https://github.com/
[23] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge and https://github.com/lazyprogrammer
in a neural network,’’ 2015, arXiv:1503.02531. [Online]. Available:
NELSON VITHAYATHIL VARGHESE received
http://arxiv.org/abs/1503.02531
the Bachelor of Technology degree in computer
[24] E. Parisotto, J. L. Ba, and R. Salakhutdinov, ‘‘Actor-mimic: Deep multitask
and transfer reinforcement learning,’’ 2015, arXiv:1511.06342. [Online]. engineering from the Cochin University of Sci-
Available: http://arxiv.org/abs/1511.06342 ence and Technology, Cochin, India, and the
[25] M. S. Akhtar, D. S. Chauhan, and A. Ekbal, ‘‘A deep multi-task contextual M.A.Sc. degree in electrical and computer engi-
attention framework for multi-modal affect analysis,’’ ACM Trans. Knowl. neering from Ontario Tech University, Canada. His
Discovery Data, vol. 14, no. 3, pp. 1–27, May 2020. research interests include machine learning, neural
[26] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, ‘‘The arcade networks, and data science.
learning environment: An evaluation platform for general agents,’’ J. Artif.
Intell. Res., vol. 47, pp. 253–279, Jun. 2013.
[27] Y. Wang, J. Stokes, and M. Marinescu, ‘‘Actor critic deep reinforcement
learning for neural malware control,’’ in Proc. AAAI Conf. Artif. Intell., QUSAY H. MAHMOUD (Senior Member, IEEE)
vol. 34, 2020, pp. 1005–1012. was the Founding Chair of the Department of
[28] J. Zou, T. Hao, C. Yu, and H. Jin, ‘‘A3C-DO: A regional resource schedul- Electrical, Computer and Software Engineering,
ing framework based on deep reinforcement learning in edge scenario,’’ Ontario Tech University, Canada. He has worked
IEEE Trans. Comput., vol. 70, no. 2, pp. 228–239, Feb. 2021. as an Associate Dean of the Faculty of Engineering
[29] A. Lazaric and M. Ghavamzadeh, ‘‘Bayesian multitask reinforcement and Applied Science, Ontario Tech University. He
learning,’’ in Proc. Int. Conf. Mach. Learn., 2010, pp. 599–606. is currently a Professor of Software Engineering.
[30] C. Dimitrakakis and C. A. Rothkopf, ‘‘Bayesian multitask inverse rein- His research interests include intelligent software
forcement learning,’’ in Proc. Eur. Workshop Reinforcement Learn. Berlin, systems and cybersecurity.
Germany: Springer, 2011, pp. 273–284.

VOLUME 9, 2021 44703

You might also like