A Hybrid Multi-Task Learning Approach For Optimizing Deep Reinforcement Learning Agents
ABSTRACT Driven by recent technological advancements within the field of artificial intelligence (AI), deep learning (DL) has emerged as a promising representation learning technique across several machine learning (ML) paradigms, especially within the reinforcement learning (RL) arena. This direction has given rise to a new technological domain, deep reinforcement learning (DRL), which combines the high representational learning capability of DL with existing RL methods. The performance optimization achieved by RL-based intelligent agents built on model-free approaches has largely been limited to systems whose RL algorithms focus on learning a single task. This single-task approach proves quite data inefficient whenever DRL agents need to interact with more complex, data-rich environments, primarily because DRL algorithms generalize poorly across related tasks drawn from the same distribution. One possible approach to mitigate this issue is multi-task learning. The objective of this paper is to present a hybrid multi-task learning-oriented approach for the optimization of DRL agents operating within different but semantically similar environments with related tasks. The proposed framework is built with multiple, individual actor-critic models functioning within independent environments and transferring knowledge among themselves through a global network to optimize performance. The empirical results obtained by the hybrid multi-task learning model on the OpenAI Gym based Atari 2600 video gaming environment demonstrate that the proposed model enhances the performance of the DRL agent by a relative margin of roughly 15% to 20%.
INDEX TERMS Machine learning, deep reinforcement learning, neural networks, transfer learning, actor-critic, multi-task worker.
I. INTRODUCTION
Over the last few decades, the reinforcement learning domain has established its position as a vital topic within technological areas such as robotics and intelligent agents [1]. The core objective of RL is to address the problem of how intelligent agents should explore their operating environment optimally, and thereby learn to take optimal actions to achieve the highest possible reward while in a given state [2]. Supported by recent advancements within the field of ML, RL has cemented its position as one of the major machine learning paradigms dealing with an agent's behavior patterns while in an environment. In comparison to the performance of ML systems based on paradigms such as supervised learning and unsupervised learning, the relative performance level of current RL agents was not optimal. This was majorly due to the limitations related to deriving the optimum policy out of the large state-action space linked with the environment of RL problems. At the same time, the inception of DL, with its very high level of representational learning capability, has given a new dimension to the field of reinforcement learning and led to the evolution of deep reinforcement learning. As a result of these advancements, DRL agents have been applied to various areas such as continuous action control, 3D first-person environments, and gaming. Especially in the field of gaming, DRL agents have proven to be extremely successful and could surpass human-level performance on classic video games such as Atari as well as board games such as chess and Go [3].
Despite the impressive results achieved with a single-task-based methodology, the RL agent is observed to be less efficient within operating environments that are more complex and data-rich.
The optimal action-value function Q*(s, a) gives the maximum expected return from state s that is achievable by any policy. Similarly, the value of any state s under policy π is defined by Vπ(s) = E[Rt | St = s], which is simply the expected return for following the policy π from state s. Q(s, a) is often used as a measure of the value of the agent being in a particular state s and taking an action a from that state. The well-known Bellman equation given below is used to calculate Q(s, a) for every action in every state, which helps an agent make decisions about its future moves:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_a' Q(s', a') − Q(s, a) ]

where Q(s, a), α, R(s, a), γ, and max_a' Q(s', a') represent the current Q value, the learning rate, the reward for taking action a in state s, the discount factor, and the maximum expected future reward given the new state s' over all possible actions a' from that state.
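As a concrete illustration of this update rule, the following is a minimal tabular Q-learning sketch in Python (not the implementation used in this work); the state-action space sizes, hyperparameter values, and the epsilon-greedy exploration are illustrative assumptions.

```python
import numpy as np

# Minimal tabular Q-learning sketch of the update rule above.
# n_states, n_actions, and the hyperparameters are illustrative assumptions.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def select_action(s):
    """Epsilon-greedy action selection over the current Q estimates."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```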
In the case of value-based model-free reinforcement learning methods, the action-value function Q(s, a) is often represented by a function approximation method, such as a neural network. In such a case, an approximate action-value function parameterized by θ is represented as Q(s, a; θ), and the updates to the parameters are determined by a suitable RL algorithm. In contrast to the aforementioned value-based methods, policy-based model-free RL methods directly parameterize the policy π(a|s; θ) and update the parameters by performing, typically approximate, gradient ascent on E[Rt].
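The short sketch below makes this distinction concrete by parameterizing π(a|s; θ) with a linear-softmax form and performing a REINFORCE-style gradient-ascent step on E[Rt]; the parameterization and hyperparameters are illustrative assumptions rather than the networks used later in this paper.

```python
import numpy as np

def softmax_policy(theta, s_feat):
    """pi(a|s; theta) with a linear-softmax parameterization (an assumption)."""
    logits = theta @ s_feat              # one logit per action
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episode, lr=0.01, gamma=0.99):
    """Gradient ascent on E[R_t]; episode is a list of (features, action, reward)."""
    G = 0.0
    for s_feat, a, r in reversed(episode):
        G = r + gamma * G                    # return from this time step onward
        p = softmax_policy(theta, s_feat)
        grad_log_pi = -np.outer(p, s_feat)   # d log pi(a|s) / d theta, all rows
        grad_log_pi[a] += s_feat             # row of the action actually taken
        theta += lr * G * grad_log_pi        # ascend the expected return
    return theta
```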
Transfer learning involves the transfer of knowledge between source tasks as well as target tasks [7]. Especially when the level of similarity between the two ends is high, the process of knowledge transfer is observed to happen quite smoothly, such that the transferred knowledge plays an important role in assisting the target task's algorithm to solve its tasks with a high level of efficiency. The major reason behind the success of this scenario is the advantage provided by positive knowledge transfer. This in turn enables the target task's algorithm to optimize its performance more easily than by spending additional time to gain the same amount of knowledge from a larger number of the target task's data samples. Different TL mechanisms oriented on the aforementioned approach have been deployed within various single-agent RL algorithms [8].

Similar kinds of approaches have also been experimented with in multi-agent systems, where agents operating within the same environment interact among themselves as a means of exchanging the knowledge gained from their respective actions [9]. Typically, the design of multi-agent systems uses a joint policy that is learned by agents from the source task, and the same knowledge is then used to generate the initial policy for the agents to use within the context of the target task [10]. Knowledge transfer is made possible between the source and target tasks by adopting different transfer techniques such as instance transfer, representation transfer, or parameter transfer. In all these approaches, the algorithms that handle the knowledge transfer rely mainly on the prior knowledge gathered while attempting to solve the same kind of source tasks, and then leverage that as a reference to bias the approach or learning to be followed on a new task [7].

Having such a technique in place within the design, together with the ability to abstract the aforementioned factors, plays a pivotal role in accelerating the whole learning process of an agent. Learning shared representations provides the capability to reach this milestone by learning robust, transferable abstractions of the environment. This is mainly because those elements possess the ability to generalize over the group of tasks that an agent needs to deal with while operating within its environment [13].

The idea of the value function plays an important role within the space of RL, as it is extensively used in conjunction with function approximation methodology to generalize over the large state-action space of the operating environment within which an agent needs to function [14]. The significance of the value function lies in its ability to determine the quality of a specific state that an agent needs to be in while in an environment. Value functions often have the ability to demonstrate compositional structure with respect to the state space and goal states [15]. Empirically, prior research efforts have proven that value functions can effectively capture as well as represent knowledge beyond their current goal, and this can be efficiently leveraged or re-used for future learning purposes [14]. It is possible to learn optimal value functions by efficient use of the space of state-action values that are exchanged among the tasks that an RL agent encounters and deals with during its operation within the operating environment. This can be made possible by accommodating the aforementioned common structure into value-iteration and policy-iteration procedures, namely fitted Q-iteration and approximate policy iteration, respectively [16].
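As a simple illustration of the parameter-transfer technique mentioned above, the sketch below copies the weights of shared layers from a source-task network in order to initialize a target-task network; the dictionary-of-arrays representation, layer names, and shapes are hypothetical.

```python
import numpy as np

def transfer_parameters(source_params, target_params, shared_layers):
    """Copy shared-layer weights from the source task to bias learning on the target task."""
    for name in shared_layers:
        target_params[name] = source_params[name].copy()
    return target_params

# Hypothetical networks: a shared convolutional feature layer, task-specific output heads.
source = {"conv1": np.random.randn(32, 4, 8, 8), "head": np.random.randn(4, 256)}
target = {"conv1": np.zeros((32, 4, 8, 8)),      "head": np.zeros((6, 256))}
target = transfer_parameters(source, target, shared_layers=["conv1"])
```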
Progressive networks aim, first, to efficiently incorporate the previously gathered knowledge while conducting the learning at each layer of the whole feature hierarchy and, second, to come up with a system having enough immunity to handle the impacts of catastrophic forgetting [7].

The major benefit of adopting this methodology is that progressive networks possess the capability to retain a set of pre-trained models throughout the whole training cycle [17]. Along with this, they can utilize the pre-trained models to learn lateral connections and thereby derive useful features needed for new tasks in the future. An approach with such abilities brings the combination of both richer compositionality and easy room for the integration of previously acquired knowledge at every layer of the feature hierarchy. The continual nature of learning associated with this approach allows agents both to learn a group of tasks encountered in a sequence and to perform knowledge transfer from previous tasks to enhance the overall convergence speed [18]. Progressive networks combine both these aspects into the model architecture, where catastrophic forgetting is eliminated by instantiating a new neural network for each of the tasks being solved during an agent's lifetime within its operating environment. In addition to this, transfer of knowledge is made possible through lateral connections to the features of the previously learned neural network columns [17]. At any given time step t, when a new task is learned, the model appends a new column of knowledge into its current framework as a new neural network unit. Subsequently, the newly added unit is used while learning the successive tasks. Each of the new columns, or neural network units, created is trained to solve a specific Markov decision process (MDP) [17]. One of the plausible drawbacks of this approach is that it can lead to a computationally expensive model, as its size keeps growing with the progress of the learning cycle.

D. PathNet
PathNet is another multi-task RL methodology designed for the purpose of achieving artificial general intelligence (AGI) by joining together the aspects of transfer learning, continual learning, and multitask learning [18]. The core design aspect of PathNet lies with a neural network-oriented algorithm that utilizes multiple agents deployed within the neural network. The major role of each agent is to identify which parts of the network can be re-used while learning new tasks [19]. All agents are treated as pathways (known by the name genotypes) within the ecosystem of the neural network to decide the subset of parameters that can be utilized during the learning process [20]. All of the parameters used within the forward propagation of the learning cycle undergo updates during the backpropagation phase of the PathNet algorithm. During the learning process, a tournament selection genetic algorithm is deployed for selecting the pathways through the use of a neural network. During the operation, the various actions carried out by the agents inside the neural network build a knowledge database on the efficient re-use of environment parameters for new tasks or actions. All of the agents are designed to function in a parallel manner along with all the other agents present within the system, which are involved in learning other tasks as well as sharing parameters among themselves to facilitate the positive transfer of knowledge [20].

Generally, the PathNet architecture is made up of a DNN having L layers, with each of them having M modules. Each one of these modules is itself another neural network. The consolidated outputs of the modules belonging to each layer are then sent into the active modules residing in the subsequent layer [7]. For every individual layer, there is a limit on the maximum number of modules that can be supported for each of the pathways, and often this number lies between 3 and 4 [20]. The final layer within each of the neural networks, for each of the tasks being learned, is always unique. More importantly, it is not shared with any of the remaining tasks running within the operating environment. One major advantage of the PathNet approach lies in the re-usability factor, by which neural networks can leverage and learn from existing knowledge databases and thereby save time by avoiding learning from scratch for new tasks. The impact of this is more prevalent within the context of RL, as there can be many interrelated tasks within the wide action space associated with the operating environment. The PathNet approach has shown impressive results for positive transfer of knowledge on supervised learning classification tasks such as binary MNIST (Modified National Institute of Standards and Technology), CIFAR-100 (Canadian Institute For Advanced Research), and SVHN (The Street View House Numbers). The same kind of results has been obtained with several Atari 2600 gaming and Labyrinth RL tasks.

E. POLICY DISTILLATION
Policy distillation (PD) and actor-mimic (AM) are the two approaches based on the concept of distillation that are intended to achieve multi-task DRL. The core objective of distillation lies in minimizing the computational costs associated with ensemble methods [21]. An ensemble can be viewed as a group of models wherein the prediction outputs of the group are joined with the help of either a weighted-average or a voting technique [22]. The study of ensemble methods was one of the prominent research fields of the past decade, and the most famous ensemble-oriented methods are bagging, boosting, random forests, Bayesian averaging, and stacking [22]. Two major drawbacks of ensemble-based methods are their huge memory requirements for operation as well as the amount of time needed for execution during run-time. The need for relatively high execution time makes them slow in terms of generating the output. To mitigate these issues, a distillation methodology was suggested, which is designed based on the model compression approach. The main objective of this approach is to compress the function learned by the complex model, which is an ensemble, into a much scaled-down and faster model that has a relatively comparable level of performance with the original ensemble [22]. Subsequently, this same approach was mapped into the domain of neural networks [23].

By leveraging model compression, PD is considered an approach that can be used for the purpose of extracting the policy of an RL agent. Following this, the same policy can be utilized towards training a new network at an optimum level, with a relatively smaller size and a higher level of efficiency. Likewise, an intelligent agent can utilize the same approach for the consolidation of multiple task-oriented policies into one policy. Earlier research efforts conducted on the PD front were mostly carried out with the RL algorithm DQN (deep Q-network). With this, the PD method can be utilized in an effective way to transfer one or more active policies from a DQN to another network that is untrained [7]. DQN is a famous, state-of-the-art model-free technique deployed in RL with the help of deep neural networks (DNN). This model functions within an environment having a discrete set of action choices. DQN was proven to exhibit a performance level that outperforms human-level scores on a collection of multiple Atari 2600 games [3]. In this context, the distillation approach can be applied both at a single-task level, as single-game policy distillation, and at a multi-task level, as a knowledge transfer technique from a teacher model T to a student model S. Within single-task-based policy distillation, the responsibility of data generation is handled by the teacher network, which is a trained DQN agent, and following this a supervised training is performed by the student network. To achieve multi-task PD, n different DQN-based single-game experts (agents) are trained independently [22]. Later on, all of these individual agents generate both the inputs and the target data, which are stored inside memory buffers. Subsequently, the distillation agent uses all of these n different data stores sequentially for learning.

F. ACTOR-MIMIC
The major design aspect and most desired characteristic of an intelligent agent lies in its ability to act under different operating environments, accumulate information, and subsequently transfer the knowledge gathered from those past experiences to new situations. The idea behind actor-mimic is based on the aforementioned methodology, with a special focus on aspects such as multi-task learning and transfer learning. Having these two abilities enables an intelligent agent to learn efficiently how to handle and act concurrently on multiple tasks, and subsequently to generalize the knowledge gathered or accumulated to new domains [24]. Typically, actor-mimic can be perceived as a technique to train a single deep policy network with the help of a set of related source tasks. Empirically, it has been shown that impressive levels of performance can be achieved by models that are trained with this approach on several games. Specifically, with a high level of similarity between the source and target tasks, features learned while training the source tasks can be quite efficiently used for the generalization of the target tasks' training [25].

The actor-mimic methodology utilizes the power of both DRL and model compression techniques to train a single policy network. The key intention behind the usage of such a training method is to enable the network to gain knowledge on how to act within a group of distinct tasks under the guidance of multiple expert teachers [7]. Subsequently, the representational knowledge gathered by the DRL policy network can be leveraged towards generalizing to new tasks without any sort of prior expert guidance. Validation of this technique was majorly carried out within the arcade learning environment (ALE) [26]. Generally, actor-mimic is considered part of the larger imitation learning class of methods. These methods are generally rooted in the idea of adopting expert guidance to train an agent on how to act within a particular operating environment. Under the imitation learning methodology, a policy is directly trained to mimic an expert's behavior while sampling the set of actions from the mimic agent's space [24].

G. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC
Google DeepMind proposed the idea of a simultaneous, parallel learning-based training approach towards multi-task learning and formulated an algorithm by the name A3C (asynchronous advantage actor-critic). According to this, multiple intelligent agents, also called workers, run simultaneously in a parallel fashion within different instances of the same operating environment [4]. All the workers running within the environment are in charge of updating a global value function asynchronously. The key essence of this approach lies in the fact that, at the time of training, at any given time step t, each of these individual agents is undergoing different states within the environment. This property offers them an independent and unique way of learning, and its impact is that the A3C algorithm is in a position to deliver to each of the agents a very efficient learning trajectory within the vast state space of the operating environment [27]. A3C is designed as an enhancement of the original actor-critic methodology, which has two different, independent neural network units: one for the actor module and the other for the critic module, each with its own loss function [7]. At the basic level, an actor module can be treated as a function approximator unit that governs how the agent acts at each state, in a quite similar way as in RL methods like Q-learning or REINFORCE. In these two approaches, a neural network calculates either a function that leads to the calculation of the policy or the policy itself directly [28]. When it comes to the critic module, it acts as more of a judging unit, which effectively evaluates the effectiveness of the policy created by the actor.
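As a rough illustration of the actor-critic split described above, the following sketch computes an advantage-weighted policy-gradient update for the actor and a value-regression update for the critic from a single worker's n-step rollout; the linear parameterization and the omitted bootstrap value are simplifying assumptions, not the architecture used in this work.

```python
import numpy as np

def a3c_style_update(theta_pi, theta_v, rollout, gamma=0.99, lr=1e-3):
    """One simplified actor-critic update from an n-step rollout.
    rollout: list of (state_features, action, reward); linear actor and critic."""
    grad_pi = np.zeros_like(theta_pi)
    grad_v = np.zeros_like(theta_v)
    R = 0.0                                   # bootstrap value omitted for brevity
    for s_feat, a, r in reversed(rollout):
        R = r + gamma * R                     # n-step return
        value = float(theta_v @ s_feat)       # critic estimate V(s)
        advantage = R - value                 # how much better the action did than expected
        # actor: gradient of log softmax policy, scaled by the advantage
        logits = theta_pi @ s_feat
        p = np.exp(logits - logits.max()); p /= p.sum()
        g = -np.outer(p, s_feat); g[a] += s_feat
        grad_pi += advantage * g
        # critic: move V(s) toward the observed return
        grad_v += (R - value) * s_feat
    theta_pi += lr * grad_pi                  # in A3C these accumulated gradients would be
    theta_v += lr * grad_v                    # applied asynchronously to the shared global network
    return theta_pi, theta_v
```

In the actual A3C algorithm, each worker accumulates such gradients over its own rollout and applies them asynchronously to the shared global parameters, as described above.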
IMPALA is designed with the ability not only to make efficient use of resources within a single-machine-oriented training environment, but also to be scaled to operate across multiple machines without the need to sacrifice either data efficiency or resource utilization. By following a novel off-policy correction method by the name V-trace, IMPALA is capable of achieving quite a stable learning trajectory with a very high throughput level by combining both decoupled acting and learning [37]. In general, this DRL model's architecture follows the notion of a single learner (also known as a critic) coupled with a number of actors. Under this ecosystem, each of the individual actors initially creates its learning parameters, called trajectories, and subsequently shares that knowledge with the learner (critic) by following a queue mechanism. The learner subsequently accumulates the same kind of knowledge trajectories from all of the multiple actors operating within the environment, which eventually acts as a source of information to prepare the central policy. Before starting the next learning cycle (trajectory), all of the individual actors operating within the environment gather the updated policy parameters from the learner (critic module) [37]. This approach is quite analogous to the popular RL algorithm A3C, and the architecture of IMPALA was inspired hugely by that algorithm. The RL algorithm model used within IMPALA follows the topology of a system having a group of actors and learners that build knowledge through collaboration.

The design of IMPALA leverages an actor-critic-based model to derive a policy π and a baseline value function Vπ. The major units of the IMPALA system consist of a group of actors that generate trajectories of experience in a continuous manner. In addition to this, there can be one or more learners that leverage the generated trajectories shared by the individual actors to learn the policy π off-policy. At the beginning of every individual trajectory, an actor first updates its local policy µ to the latest learner policy π. Subsequently, each actor adopts and runs that policy for n steps in its operating environment [37]. Upon completion of these n steps, each of the individual actors sends a set of information consisting of the trajectory of states, actions, and rewards, together with the related policy distributions, to the learner. In this manner, the learner has the opportunity to continuously update its policy π each time the actors share their trajectory information from the environment. In this fashion, the IMPALA architecture gathers experiences from the different individual actors within the environment, which are further passed to a central learner module. Following this, the central learner calculates the gradients, resulting in a model having a framework of independent actors and learners. One of the major characteristics of the IMPALA architecture is its operational flexibility, which allows the actors to be present either on the same machine or to be distributed across numerous machines.

C. PopArt
A third approach, by the name PopArt, proposed by Google DeepMind, came out as a solution to mitigate the issues associated with the existing IMPALA model. PopArt aims to address the reasons behind the suboptimal performance factors and thereby enhance RL in multi-task-oriented environments. The core design objective of PopArt is to reduce the impact of the distraction dilemma problem associated with the IMPALA model and thereby stabilize the learning process in a better way, to facilitate the adoption of multi-task RL techniques [38]. The term distraction dilemma refers to the probability of learning algorithms getting distracted by only a fraction of the tasks from the large pool of multiple tasks to be solved. This scenario in turn leads to challenges related to resource contention: establishing the right balance between the necessities of multiple tasks operating within the same environment while competing for the limited resources offered by a single learning system. The design methodology of the PopArt model is based on the original IMPALA architecture, adding multiple CNN layers combined with other techniques such as word embeddings with the help of a recurrent neural network of the long short-term memory (LSTM) type [38].

The PopArt model functions by gathering the trajectories from each of the individual tasks into the RL agent's updates. In this manner, the PopArt model makes sure that every agent within the environment has its own role and, subsequently, a proportional impact on the dynamics of the overall learning. The key design aspect of the PopArt model relies on the fact that the modification of the neural network's weights is based on the output of all tasks operating within the environment. During the first stage of operation, PopArt estimates both the mean and the spread of the ultimate targets, such as the score of a game, across all tasks under consideration. Following this, PopArt capitalizes on these estimates to normalize the targets before making an update on the network's weights. This approach in turn makes the whole learning process more stable and robust. With the set of various experiments conducted in the popular Atari games' environment, PopArt has demonstrated its capabilities and improvements over other multi-task RL architectures [38].

V. HYBRID MULTI-TASK LEARNING MODEL
The major motivation behind the proposed hybrid multi-task approach is to address and mitigate some of the key challenges associated with DRL multi-tasking, which are not fully covered by the state of the art. In this paper, we extend our approach in [39] to address the DRL agent's performance optimization bottlenecks by adopting the hybrid multi-task learning-based approach in complex operating environments having a higher number of distinct DRL agents. Besides, this work also examines the impact of semantic dissimilarity of DRL agents' tasks on the overall momentum of performance optimization. Challenges such as partial observability, the amount of training time and training data samples required, and effective exploration often
FIGURE 5. The ecosystem of single A3C worker thread with Atari 2600.
FIGURE 38. Breakout-v0 test results with the hybrid multi-task model.
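Before the case study below, the following schematic sketch shows the gradient-sharing pattern that the discussion refers to: several worker threads, each attached to its own game, asynchronously apply their accumulated gradients to a single global parameter store and then re-synchronize their local copies. The random gradient stand-in, the lock-based update, and the listed game names are illustrative assumptions and do not reproduce the authors' actual implementation.

```python
import threading
import numpy as np

# Schematic sketch of several multi-task workers sharing one global parameter set.
# The random "gradient" is a stand-in for the accumulated actor-critic gradients a
# real worker would compute from its own game; the game names are illustrative.
global_net = {"params": np.zeros(8)}
lock = threading.Lock()

def worker(task_name, steps=100, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    local_params = global_net["params"].copy()            # start from the shared parameters
    for _ in range(steps):
        # stand-in for: play `task_name`, accumulate policy and value gradients
        grad = rng.normal(size=local_params.shape)
        with lock:                                         # push the update to the global network
            global_net["params"] = global_net["params"] - lr * grad
            local_params = global_net["params"].copy()     # re-sync before the next rollout

threads = [threading.Thread(target=worker, args=(name,), kwargs={"seed": i})
           for i, name in enumerate(["Breakout-v0", "Pong-v0"])]
for t in threads: t.start()
for t in threads: t.join()
```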
By training agents belonging to different but semantically similar games, we aimed at addressing some of the key challenges associated with existing multi-task DRL. To showcase the extent to which our model could address those issues, we present a case study based on the test results obtained. For this purpose, we use both the standalone and hybrid model test results obtained for the Breakout-v0 game, as indicated by Fig. 33 and Fig. 38, respectively. To have a fair comparison and derive a convincing conclusion, we ensured that the same amount of resources was allotted in both test scenarios in terms of the number of worker threads, the test configurations, and the number of global steps. By comparing these two test results, it is quite evident that, in terms of the training time needed, the hybrid model surpasses the performance of the standalone model much ahead of time. After running Breakout-v0 under the standalone model for 2.5e+7 (25 million) global steps, the highest score it could achieve was a little over 12, whereas the hybrid model surpassed the same level in almost half of its execution time. In continuation of this, it is reasonable to conclude that hybrid multi-task learning, by having a group of different but semantically similar environments with similar tasks, could reduce the impact of partial observability, which restricts a DRL agent from choosing the optimal action while in a state. Due to the positive knowledge transfer facilitated by the gradient transfers from the second environment's agents, the actor module within each worker has a better policy with which to choose the optimal action at each step. Having said this, by possessing better policy parameters, the actor module is in a better position to explore the environment in a much more effective way and choose the optimal action in each state. This in turn is expected to improve throughout the DRL agents' execution, as more positive knowledge transfer is anticipated to happen with more global steps of gameplay. The same kind of comparison case study could be applied to other game test pairs from the experiment. Seen in the light of these observations, it is reasonable to conclude that the hybrid multi-task learning model can address the objectives it was aiming for to a great extent.

Finally, we also present a comparison of the proposed hybrid multi-task learning model against the three state-of-the-art techniques that were mentioned under the related work. In comparison to the hybrid multi-task model, which relies on the idea of sharing the network learning parameters through a global network to individual workers, the Distral model works on the idea of sharing a distilled centroid policy that regularizes the workers running in the environment. When it comes to the comparison with the IMPALA model, its design approach has similarity to the hybrid multi-task learning model in terms of the actor-critic methodology, as it follows the topology of a set of actors with either a single learner or multiple learners. Within the IMPALA model, the learner's role is to create a central policy to be shared with the actors. Along with this, the learners have the flexibility to communicate among themselves to share the gradients. In the hybrid multi-task model, the workers' accumulated gradient transfer, or knowledge transfer among workers, is always governed by the global network. Additionally, the current implementation of the hybrid multi-task model mandates that all the workers be present on the same machine, whereas the IMPALA model supports a distributed system-based working environment for the workers. The PopArt model is considered an extension of the IMPALA model itself and is designed to address key issues such as the distraction dilemma and thereby stabilize the process of multi-task learning.

IX. CONCLUSION AND FUTURE WORK
In this research work, we propose a hybrid multi-task model that follows a parallel, multi-tasking approach for optimizing the performance of deep reinforcement learning agents. We present how to combine the multi-task learning from two different deep reinforcement learning agents operating within two different but semantically similar environments running related tasks. Initial-stage experiments are conducted by applying the DRL algorithm A3C to the Atari 2600 gaming environment to draw the results. During the experiments, we establish that our hybrid multi-task model can learn multiple similar gaming tasks, at least two, simultaneously, without making changes to the algorithm and without degradation in performance for any one of the individual gaming tasks. The semantic similarity of the related tasks running in the two different environments is the most vital factor in reducing the challenges posed by possible negative knowledge transfer.

For future work, we plan to conduct experiments with the hybrid multi-task model on more complex gaming environments having a higher number of worker threads under a GPU cloud server-based machine environment, to draw strong conclusions on parallel multi-task learning. Along with this, we also would like to investigate steps to mitigate the impacts of negative knowledge transfer and catastrophic forgetting in deep reinforcement multi-task learning.

ACKNOWLEDGMENT
The authors acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES
[1] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 1038-1044.
[2] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3-4, pp. 279-292, 1992.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, Feb. 2015.
[4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928-1937.
[5] A. Mujika, "Multi-task learning with deep model based reinforcement learning," 2016, arXiv:1611.01457. [Online]. Available: http://arxiv.org/abs/1611.01457
[6] R. Glatt and A. H. R. Costa, "Improving deep reinforcement learning with knowledge transfer," in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 5036-5037.
[7] N. Vithayathil Varghese and Q. H. Mahmoud, "A survey of multi-task deep reinforcement learning," Electronics, vol. 9, no. 9, p. 1363, Aug. 2020.
[8] G. Boutsioukis, I. Partalas, and I. Vlahavas, "Transfer learning in multi-agent reinforcement learning domains," in Proc. Eur. Workshop Reinforcement Learn. Berlin, Germany: Springer, 2011, pp. 249-260.
[9] G. Weiss, Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1999.
[10] N. D. Nguyen, T. T. Nguyen, D. Creighton, and S. Nahavandi, "A visual communication map for multi-agent deep reinforcement learning," 2020, arXiv:2002.11882. [Online]. Available: http://arxiv.org/abs/2002.11882
[11] Y. Bengio, Learning Deep Architectures for AI. Boston, MA, USA: Now, 2009.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
[13] D. Borsa, T. Graepel, and J. Shawe-Taylor, "Learning shared representations in multi-task reinforcement learning," 2016, arXiv:1603.02041. [Online]. Available: http://arxiv.org/abs/1603.02041
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[15] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952. [Online]. Available: http://arxiv.org/abs/1511.05952
[16] T.-L. Vuong, D.-V. Nguyen, T.-L. Nguyen, C.-M. Bui, H.-D. Kieu, V.-C. Ta, Q.-L. Tran, and T.-H. Le, "Sharing experience in multitask reinforcement learning," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 3642-3648.
[17] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," 2016, arXiv:1606.04671. [Online]. Available: http://arxiv.org/abs/1606.04671
[18] M. E. Taylor and P. Stone, "An introduction to intertask transfer for reinforcement learning," AI Mag., vol. 32, no. 1, p. 15, Mar. 2011.
[19] R. Caruana, "Machine learning," Mach. Learn., vol. 28, no. 1, pp. 41-75, 1997.
[20] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, "PathNet: Evolution channels gradient descent in super neural networks," 2017, arXiv:1701.08734. [Online]. Available: http://arxiv.org/abs/1701.08734
[21] A. A. Rusu, S. Gomez Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, "Policy distillation," 2015, arXiv:1511.06295. [Online]. Available: http://arxiv.org/abs/1511.06295
[22] C. Bucilu, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 535-541.
[23] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531. [Online]. Available: http://arxiv.org/abs/1503.02531
[24] E. Parisotto, J. L. Ba, and R. Salakhutdinov, "Actor-mimic: Deep multitask and transfer reinforcement learning," 2015, arXiv:1511.06342. [Online]. Available: http://arxiv.org/abs/1511.06342
[25] M. S. Akhtar, D. S. Chauhan, and A. Ekbal, "A deep multi-task contextual attention framework for multi-modal affect analysis," ACM Trans. Knowl. Discovery Data, vol. 14, no. 3, pp. 1-27, May 2020.
[26] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," J. Artif. Intell. Res., vol. 47, pp. 253-279, Jun. 2013.
[27] Y. Wang, J. Stokes, and M. Marinescu, "Actor critic deep reinforcement learning for neural malware control," in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 1005-1012.
[28] J. Zou, T. Hao, C. Yu, and H. Jin, "A3C-DO: A regional resource scheduling framework based on deep reinforcement learning in edge scenario," IEEE Trans. Comput., vol. 70, no. 2, pp. 228-239, Feb. 2021.
[29] A. Lazaric and M. Ghavamzadeh, "Bayesian multitask reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2010, pp. 599-606.
[30] C. Dimitrakakis and C. A. Rothkopf, "Bayesian multitask inverse reinforcement learning," in Proc. Eur. Workshop Reinforcement Learn. Berlin, Germany: Springer, 2011, pp. 273-284.
[31] Z. Yang, K. E. Merrick, H. A. Abbass, and L. Jin, "Multi-task deep reinforcement learning for continuous action control," in Proc. IJCAI, 2017, pp. 3301-3307.
[32] P. Dewangan, S. Phaniteja, K. M. Krishna, A. Sarkar, and B. Ravindran, "DiGrad: Multi-task reinforcement learning with shared actions," 2018, arXiv:1802.10463. [Online]. Available: http://arxiv.org/abs/1802.10463
[33] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications," IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826-3839, Sep. 2020.
[34] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep decentralized multi-task multi-agent reinforcement learning under partial observability," 2017, arXiv:1703.06182. [Online]. Available: http://arxiv.org/abs/1703.06182
[35] S. V. Macua, A. Tukiainen, D. G.-O. Hernández, D. Baldazo, E. M. de Cote, and S. Zazo, "Diff-DAC: Distributed actor-critic for average multitask deep reinforcement learning," 2017, arXiv:1710.10363. [Online]. Available: http://arxiv.org/abs/1710.10363
[36] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, "Distral: Robust multitask reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4496-4506.
[37] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," 2018, arXiv:1802.01561. [Online]. Available: http://arxiv.org/abs/1802.01561
[38] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, "Multi-task deep reinforcement learning with popart," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 3796-3803.
[39] N. V. Varghese and Q. H. Mahmoud, "Optimization of deep reinforcement learning with hybrid multi-task learning," in Proc. IEEE Int. Syst. Conf. (SysCon), Vancouver, BC, Canada, 2021, pp. 1-8.
[40] D. S. Chaplot, L. Lee, R. Salakhutdinov, D. Parikh, and D. Batra, "Embodied multimodal multitask learning," 2019, arXiv:1902.01385. [Online]. Available: http://arxiv.org/abs/1902.01385
[41] T. Chesebro and A. Kamko. (Dec. 15, 2016). Learning Atari: An Exploration of the A3C Reinforcement Learning Method. Accessed: Oct. 17, 2020. [Online]. Available: https://bcourses.berkeley.edu/files/70573736/download?download_frd=1
[42] L. Weng. (Apr. 18, 2018). Policy Gradient Algorithms. Accessed: Dec. 28, 2020. [Online]. Available: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#actor-critic
[43] Z. Gu, Z. Jia, and H. Choset, "Adversary A3C for robust reinforcement learning," 2019, arXiv:1912.00330. [Online]. Available: http://arxiv.org/abs/1912.00330
[44] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI gym," 2016, arXiv:1606.01540. [Online]. Available: http://arxiv.org/abs/1606.01540
[45] Lazy Programmer. (Aug. 24, 2020). Deep Reinforcement Learning in Python. Accessed: Oct. 21, 2020. [Online]. Available: https://github.com/lazyprogrammer

NELSON VITHAYATHIL VARGHESE received the Bachelor of Technology degree in computer engineering from the Cochin University of Science and Technology, Cochin, India, and the M.A.Sc. degree in electrical and computer engineering from Ontario Tech University, Canada. His research interests include machine learning, neural networks, and data science.

QUSAY H. MAHMOUD (Senior Member, IEEE) was the Founding Chair of the Department of Electrical, Computer and Software Engineering, Ontario Tech University, Canada. He has worked as an Associate Dean of the Faculty of Engineering and Applied Science, Ontario Tech University. He is currently a Professor of Software Engineering. His research interests include intelligent software systems and cybersecurity.