Deep Learning Feature Modeling

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

Using Deep Learning to Automate Feature Modeling in Learning by

Observation: A Preliminary Study


Michael W. Floyd1, JT Turner1, and David W. Aha2
1Knexus Research Corporation; Springfield, Virginia; USA
2Navy Center for Applied Research in AI; Naval Research Laboratory (Code 5514); Washington, DC; USA
{michael.floyd, jt.turner}@knexusresearch.com | [email protected]

Abstract Bengio, and Hinton 2015) to learn a feature representation


A primary advantage of learning by observation is that it from the agent’s raw visual inputs. Our approach trains two
allows non-technical experts to transfer their skills to an DL models: one uses the agent’s complete visual inputs (i.e.,
agent. However, this requires a general-purpose learning everything it can currently observe) while the other uses
agent that is not biased to any specific expert, domain, or
close-range visuals. The output of the two models are used
behavior. Existing domain-independent learning by
observation agents generalize a significant portion of to select actions to perform in response to novel visual input
learning but still require some human intervention, namely, (i.e., what the agent can see as it attempts to replicate the
modeling the agent’s inputs and outputs. We describe a expert’s behavior).
preliminary evaluation of using convolutional neural Our preliminary evaluation examines the feasibility of
networks to train a learning by observation agent without
our approach under common learning by observation
explicitly defining the input features. Our approach uses the
agent’s raw visual inputs at two levels of granularity to conditions. More specifically, these conditions include
automatically learn input features using limited training data. limited observations (i.e., due to limited expert availability),
We describe an initial evaluation with scenarios drawn from noisy or erroneous observations (e.g., errors by the expert or
a simulated soccer domain. incorrect observations by the agent), and partial
observability in the environment. We discuss related
research in Section 2, followed by a description of our
1. Introduction
approach in Section 3. We evaluate our approach using
Learning by observation (LbO) agents are trained to perform scenarios defined in a simulated soccer domain in Section 4,
specific behaviors by observing an expert demonstrate the and conclude with a discussion of future work in Section 5.
behaviors. Whereas traditional methods for training an agent
may involve computer programming or knowledge
modeling competency, LbO only requires the expert to be 2. Related Work
able to perform the behavior. By shifting the knowledge- Learning by observation has been used in a variety of
acquisition task from the expert to the agent itself, the agent domains, including poker (Rubin and Watson 2010), Tetris
is provided with the opportunity to learn from a variety of (Romdhane and Lamontagne 2008), first-person shooter
non-technical experts (e.g., healthcare professionals, games (Thurau, Bauckhage, and Sagerer 2003), helicopter
military commanders). However, for an agent to learn an control (Coates, Abbeel, and Ng 2008), robotic soccer
unknown behavior without any prior knowledge of the (Grollman and Jenkins 2007), simulated soccer (Floyd,
expert or domain, it should learn in a general, non-biased Esfandiari, and Lam 2008; Young and Hawes 2015), and
manner. real-time strategy games (Ontañón et al. 2007). However,
We describe our preliminary approach to overcome the most of these approaches were designed to learn in a single
limitations of existing general-purpose learning by domain, so the agents cannot be directly transferred to new
observation agents. Specifically, we remove the need for environments. Two domain-independent approaches for
input features to be manually modeled for each domain. LbO have been proposed (Gómez-Martín et al. 2010; Floyd
Instead, we use deep learning (DL) techniques (LeCun,

Copyright © 2017, Association for the Advancement of Artificial


Intelligence (www.aaai.org). All rights reserved.
and Esfandiari 2011), both of which separate the agent’s reinforcement learning approaches are beneficial in that
learning and reasoning from how it interacts with the they do not require labeled training data, they require
environment. This is advantageous because the observation, explicitly encoding reward functions which may bias the
learning, and reasoning components are general-purpose agents to learning specific behaviors.
and are not biased to any specific expert, behavior, or
domain. However, they both require the inputs (i.e., what
objects the agent can observe) and outputs (i.e., the actions 3. System Design
the agent can perform) to be modeled. Although the In real-time computer games, agents typically receive
modeling only needs to be performed once (i.e., before the sensory inputs in the form of periodic messages from the
agent is deployed in a new environment), it still requires game. These messages can include information about the
some human intervention. Floyd, Bicakci, and Esfandiari state of the game (e.g., elapsed time, score), the agent’s
(2012) use a robot architecture that allows sensors to be properties (e.g., player number, team name, resource levels),
dynamically added or removed, with each change modifying and observable objects. The observable objects are
how the LbO agent represents inputs. While this does not particularly important for an agent’s decision making
require human intervention before deployment in a new because they provide information about the physical state of
domain, it does require human intervention for each new the environment. For example, in a soccer game the
type of sensor. Our approach differs in that it does not observable objects would include the location of the ball,
require any human intervention to model the environment; other players, goal nets, and boundary markers. While most
the only requirement is that the domain provides a visual games explicitly define the set of observable objects in the
representation of the environment. game (e.g., in a user manual), deploying an agent in a new
Deep learning by observation is used for initial training game still requires some level of knowledge engineering to
of AlphaGo (Silver et al. 2016). However, their learning model these objects (i.e., converting the object definition
methodology has several limitations that may make it into a format that is understandable by the agent).
unsuitable for some LbO tasks. First, they trained their To remove the need for modeling the observable objects,
system with over 30 million observations. Large datasets our approach uses the raw visual representation of the
may be available for established games like Go, but less environment. For example, Figure 1 shows a player’s view
popular games or novel behaviors may not have any existing of the field in a soccer game. The left side of Figure 1 shows
observation logs. Second, such a large dataset requires the player’s entire field of vision, which we will refer to as
months of training using datacenters composed of state-of- the full visual representation, whereas the right side shows
the-art hardware. If models need to be trained rapidly with an enlarged view of the objects close to the player (i.e., a
limited computational resources, alternative learning fixed-sized region surrounding the player), which we refer
approaches are necessary. Finally, LbO is performed using to as the zoomed visual representation. Both representations
images of a turn-based board game. This minimizes the contain only a partial view of the environment (i.e., what is
influence of object occlusion (i.e., each Go piece is on its currently within the player’s field of vision, not the entire
own square), observation error (e.g., due to erroneous or field), with the full representation giving a larger view of the
delayed responses by the expert), and provides the learning field than the zoomed representation. The agent is not
agent with full observability. We instead examine the explicitly given information about what is contained in the
feasibility of using DL for LbO tasks with limited images (e.g., it does not know that the white circle is the
observations and limited training time in complex, real-time soccer ball). Each of the visual representations is stored as a
domains. 256 × 256 RGB image.
Our feature learning method is inspired by the deep
reinforcement learning work of Mnih et al. (2015). They use
raw visual inputs to learn to play a variety of Atari 2600
games. A primary difference from our work, in addition to
the amount of training time required to train their agents, is
they use reinforcement learning rather than LbO.
Reinforcement learning requires a reward function to be
defined for each domain (e.g., based on the game score),
thereby adding additional knowledge engineering before an
agent can be deployed in a new environment. Deep
reinforcement learning has also been used in simulated
soccer (Hausknecht and Stone 2016), with the reward
Figure 1: The full visual representation (left) and zoomed visual
functions partially encoding the desired behavior (e.g., move
representation (right) in a simulated soccer game
to ball reward and kick to goal reward). Although
During observation, the learning agent records the maximum of the six confidence values is selected and its
expert’s current visual inputs, both the full version 𝑉𝑓𝑢𝑙𝑙 and associated action is used by the agent (i.e., the agent
zoomed version 𝑉𝑧𝑜𝑜𝑚𝑒𝑑 , as well as the action 𝐴 performed performs the action in the environment). By using this
by the expert. Each input-action pair is stored in the combined approach, the agent leverages the strengths of
corresponding observation set, 𝒪𝑓𝑢𝑙𝑙 or 𝒪𝑧𝑜𝑜𝑚𝑒𝑑 (𝒪𝑓𝑢𝑙𝑙 ← each individual model during action selection. For example,
𝒪𝑓𝑢𝑙𝑙 ∪ 〈𝑉𝑓𝑢𝑙𝑙 , 𝐴〉 and 𝒪𝑧𝑜𝑜𝑚𝑒𝑑 ← 𝒪𝑧𝑜𝑜𝑚𝑒𝑑 ∪ 〈𝑉𝑧𝑜𝑜𝑚𝑒𝑑 , 𝐴〉). we would expect the zoomed model to perform better when
Learning is performed using two convolutional neural important objects are near the agent, whereas the full model
networks (CNN) (Krizhevsky, Sutskever, and Hinton 2012), should perform better when information from the entire field
with one trained on the full observations (i.e., 𝒪𝑓𝑢𝑙𝑙 ) and a of vision is necessary. The primary goal of deployment is
second trained on the zoomed observations (i.e., 𝒪𝑧𝑜𝑜𝑚𝑒𝑑 ). for the agent to select similar actions to the expert when
These models represent the environment at two levels of presented with similar sensory inputs.
granularity and are used in combination to overcome limited
training data. For example, a nearby ball would be easier to
detect in the zoomed image because objects appear larger, 4. Evaluation
whereas the full image would be necessary to detect a goal To evaluate the performance of our DL LbO system we
net on the other side of the field. collected data from the RoboCup Simulation League
We use a modification of the CaffeNet architecture (Jia et (RoboCup 2016). The matches were 5 vs 5 soccer games
al. 2014): an input layer, five convolution layers, five with each player controlled by a scripted AI agent. The
pooling layers, two fully connected layers, and one softmax specific agent used, Krislet, performs simple soccer
loss layer. The network takes as input the pixel values using behaviors that involve locating the ball, running towards the
all three color channels (i.e., red, green, and blue), resulting ball, and kicking the ball towards the opponent’s goal. In
in 256 × 256 × 3 inputs. The outputs of the network each match, a single player was used as the expert (i.e., its
represent the confidence in each of the possible actions (i.e., inputs and actions were recorded). The learning agent
the confidence that each action should be selected in observed 10 full soccer matches, with each game being 10
response to the input image). In the soccer example, three minutes in length. In total, this resulted in approximately
actions1 are used: kick, dash (i.e., move), and turn. 40,000 observations for both the full and zoomed
Rather than training the entire network, our approach uses observation sets. However, the dataset is highly imbalanced
several layers that are pretrained on other data sources. The (73% dash, 26% turn, 1% kick), so a balanced training set
convolution and pooling layers are extracted from an was created such that each action was equally represented
existing network trained on ImageNet data (Jia et al. 2014), (1617 total observations in each observation set). A
whereas the fully connected layers and softmax loss layer balanced test set of 1029 observations was created by
are trained using observation data. This approach has two observing additional soccer matches.
primary advantages. First, the pretrained ImageNet layers The CNNs were trained using a base learning rate of 0.01,
can identify many visual features already (e.g., lines, curves, polynomial rate decay with a power of 3, and 13,000
shapes, objects). This removes the need to relearn these training iterations. Table 1 shows the F1 score (i.e., harmonic
common features. Second, the limited number of mean of precision and recall, with 1.0 being the maximum
observations makes it impractical to train the entire network. possible performance) when the test set was used to evaluate
Instead, the network learns how to use existing features to the trained models. In addition to our combined approach,
classify the observation data. Although some layers are we also evaluated performance when only the full or
pretrained, they do not bias the learning to any particular zoomed model was used for action prediction.
domain or task since the ImageNet dataset contains millions
of images across a variety of topics (i.e., they are not soccer- Table 1: Results of trained CNNs on RoboCup test data
specific images). During learning, both the full and zoomed Model F1 Kick F1 Dash F1 Turn F1 Overall
models use an identical architecture but are trained
Full 0.84 0.56 0.59 0.67
independently.
Zoomed 0.93 0.57 0.57 0.69
During deployment, the learning agent attempts to
replicate the expert’s behavior and uses its own visual input Combined 0.92 0.61 0.61 0.71
as input to the CNNs. For each input the agent receives, the
CNNs output six confidence outputs (i.e., both networks These results, while preliminary, show that the agent can
output confidence values for all three actions). The learn a suitable model for action selection. While both the

1
Soccer actions can also be parameterized (e.g., how hard to kick, turn
direction) but for simplicity our initial evaluation only examines action
classification.
full and zoomed models perform reasonably well, the best representation. In Proceedings of the 25th International Florida
performance was achieved when the Combined model was Artificial Intelligence Research Society Conference, 323-328.
Marco Island, USA: AAAI Press.
used. This demonstrates that using multiple representations
Floyd, M. W., and Esfandiari, B. 2011. A case-based reasoning
of the visual data is preferable since these models have
framework for developing agents using learning by observation. In
varying strengths and weaknesses. Proceedings of the 23rd IEEE International Conference on Tools
with Artificial Intelligence, 531-538. Boca Raton, USA: IEEE
Computer Society Press.
5. Conclusions and Future Work Floyd, M. W., Esfandiari, B., and Lam, K. 2008. A case-based
reasoning approach to imitating RoboCup players. In Proceedings
We described a preliminary study of how well a learning by
of the 21st International Florida Artificial Intelligence Research
observation agent can learn without explicitly modeling the Society Conference, 251-256. Coconut Grove, USA: AAAI Press.
objects it observes. Our approach uses an expert’s raw visual Gómez-Martín, P. P., Llansó, D., Gómez-Martín, M. A., Ontañón,
inputs at two levels of granularity to train a pair of CNNs. S., and Ram, A. 2010. MMPM: A generic platform for case-based
In our study, the agent reproduced the expert’s action planning research. In Proceedings of the International Conference
selection decisions reasonably well in tasks drawn from a on Case-Based Reasoning Workshops, 45-54. Alessandria, Italy.
simulated soccer domain. This indicates that even with Grollman, D. H., and Jenkins, O. C. 2007. Learning robot soccer
limited training observations, noisy observations, and partial skills from demonstration. In Proceedings of the IEEE
International Conference on Development and Learning, 276-281.
observability, it is possible to create an agent that can learn
London, UK: IEEE Press.
an expert’s behavior without being provided an explicit
Hausknecht, M., and Stone, P. (2016) Deep reinforcement learning
object model. in parameterized action space. In Proceedings of the International
Although our approach removes the need to model Conference on Learning Representations. San Juan, Puerto Rico.
observable objects, it still requires modeling the possible Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick,
actions. An area of future work will be to identify methods R. B., Guadarrama, S., and Darrell, T. 2014. Caffe: Convolutional
for learning the actions an expert performs based on architecture for fast feature embedding. In Proceedings of the ACM
observations. Additionally, we have only examined a single International Conference on Multimedia, 675-678. Orlando, USA:
ACM.
two-model architecture (i.e., selecting the most confident
prediction from two CNNs). In future work we will examine LeCun, Y., Bengio, Y. and Hinton, G. E. 2015. Deep learning.
Nature, 521, 436-444.
if added benefit can be achieved by training additional
Krizhevsky, A., Sutskever, I., and Hinton, G. E. 2012.
models (e.g., other levels of granularity) or by modifying
Classification with deep convolutional neural networks. In
how the model outputs are combined (e.g., inducing a Proceedings of the 26th Annual Conference on Neural Information
decision tree from their output). Our preliminary evaluation Processing Systems, 1106-1114. Lake Tahoe, USA.
has only measured the performance from a single Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J.,
experiment from a single expert in a single domain. We plan Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K.,
to perform a more thorough evaluation of the learning Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I.,
performance involving numerous experimental trails. This King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D.
2015. Human-level control through deep reinforcement learning.
will not only allow us to show the benefit of our approach, Nature, 518, 529-533.
but it will also allow for a thorough comparison with other
Ontañón, S., Mishra, K., Sugandh, N., and Ram, A. 2007. Case-
LbO agents that learn in RoboCup (Floyd, Esfandiari, and based planning and execution for real-time strategy games. In
Lam 2008; Young and Hawes 2015). To determine whether Proceedings of the 7th International Conference on Case-Based
our approach is truly domain-independent, we plan to Reasoning, 164-178. Belfast, UK: Springer.
conduct additional studies with different experts in different RoboCup. 2016. RoboCup Official Site. Retrieved from
environments. Finally, we plan to examine how this [http://www.robocup.org]
approach can be extended to learn from state-based experts Romdhane, H., and Lamontagne, L. 2008. Forgetting reinforced
since the RoboCup expert we examined is purely reactive cases. In Proceedings of the 9th European Conference on Case-
Based Reasoning, 474-486. Trier, Germany: Springer.
(i.e., the expert’s action is based entirely on its current visual
inputs). Rubin, J., and Watson, I. 2010. Similarity-based retrieval and
solution re-use policies in the game of Texas Hold’em. In
Proceedings of the 18th International Conference on Case-Based
Reasoning, 465-479. Alessandria, Italy: Springer.
References
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Coates, A., Abbeel, P., and Ng, A. Y. 2008. Learning for control Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam,
from multiple demonstrations. In Proceedings of the 25th V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner,
International Conference on Machine Learning, 144-151. N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K.,
Helsinki, Finland: ACM. Graepel, T., and Hassabis, D. 2016. Mastering the game of Go with
Floyd, M. W., Bicakci, M. V. and Esfandiari, B. 2012. Case-based deep neural networks and tree search. Nature, 529, 484-503.
learning by observation in robotics using a dynamic case
Thurau, C., Bauckhage, C., and Sagerer, G. 2003. Combining self
organizing maps and multilayer perceptrons to learn bot-behaviour
for a commercial game. In Proceedings of the 4th International
Conference on Intelligent Games and Simulation, 119-123.
London, UK: EUROSIS.
Young, J., and Hawes, N. 2015. Learning by observation using
qualitative spatial relations. In Proceedings of the International
Conference on Autonomous Agents and Multiagent Systems, 745-
751. Istanbul, Turkey: ACM.

You might also like