Vision and Memory
1 Image Synthesis Group, Department of Computer Science, Trinity College, Dublin 2, Republic of Ireland
Abstract
A memory model based on “stage theory”, an influential concept of memory from the field of cognitive psychology,
is presented for application to autonomous virtual humans. The virtual human senses external stimuli through
a synthetic vision system. The vision system incorporates multiple modes of vision in order to accommodate a
perceptual attention approach. The memory model is used to store perceived and attended object information at
different stages in a filtering process. The methods outlined in this paper have applications in any area where
simulation-based agents are used: training, entertainment, ergonomics and military simulations to name but a
few.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Virtual reality
module. Section 8 details a more elaborate system for controlling the filtering process and driving the vision system, based on a model of human attention. The final section is entitled ‘conclusions and future work’.

2. Background

Human character animation is an intriguing and challenging aspect of computer animation. It is also undoubtedly a highly important one: human characters are widely used in productions ranging from video games to animated films. The challenge is two-fold: first, to provide a way to control a highly complex structure comprising many joints, subject to kinematic or dynamic constraints; second, to produce body poses and motions that appear natural. This is made considerably more difficult by the fact that the observers, as human beings, could be considered experts when viewing the motions of other humans. Although we may not always be able to articulate a precise problem with a motion, we may be left with a feeling that something is wrong. Subtleties are often critical in such situations.

There are a number of methods available for animating human characters, and the use of a particular method is limited by its intended application. Generally speaking, these methods differ in the degree of autonomy they provide and the degree of interactivity they allow. In terms of autonomy, some characters are completely user controlled and require a human operator to specify all joint motions by hand. While providing plenty of control, such an approach can prove tedious for complicated hierarchical characters. At the other extreme are characters that are completely autonomous and do not require the intervention of a user at all: all motions are controlled entirely by a software program. Here, the drawback is that it may be difficult to get a character to act in a certain way.

Interactivity is also important. Interactivity can be thought of as the time taken for a character to react to an action. Highly-interactive characters may be interacted with in real time, and calculations for such characters have strict execution rates. In contrast, non-interactive characters often have calculations that are conducted off-line. In this paper, we are interested in the highly-interactive animation of autonomous characters; that is, characters should be able to plan their own motions, and the time expended on the calculation of these motions should not be excessive.

We categorise our approach to the problem of human character animation as behavioural animation. In behavioural animation, a character determines its own actions to a certain extent. This gives the character the ability to deal with dynamic environments or unforeseen circumstances. Most importantly, it seeks to free the animator from the need to specify every detail of a character’s animation. In order for a software program to control a character, it must have some sort of model of behaviour for that character. As noted in Renault et al13, behaviour is often defined as “the way that animals and human beings act” and is also often reduced to reaction to the environment. We agree with their statement that a better definition should also include:

“the flow of information by which the environment acts on the living creature as well as the ways the creature codes and uses this information”

It is further noted by Gillies7 that in order for a simulation of human behaviour to be effective “it must include the characters’ interaction with their environment and to do this it must simulate the characters’ perception of the environment”. As will be seen in section 3, the behavioural animation approach has been very successful at providing animations for simpler organisms such as schools of fish and flocks of birds. Unfortunately, human behaviour is not easy to model effectively. The human being is exceptionally complex, and many mental processes are not well understood.

3. Related Work

A number of researchers have studied the endowment of agents with internal sensory and storage mechanisms for the purposes of animation. Early research proved particularly successful at animating animal behaviour.

Reynolds16 presents a distributed behavioural model for flocks of birds and herds of animals. The method is based on the insight that elements of the real system (birds) do not have complete and perfect information about the world, and that these imperfections have a major impact on the final behaviour of the system. The system is based on simulated birds, or boids, which are similar in nature to the individual particles in a particle system. Each boid is implemented as an independent actor that navigates according to, among other things, its local perception of the dynamic environment. The boid model does not attempt to directly simulate the senses used by the real animals; rather, it attempts to make the same final information available to the behavioural model that the real animal would receive as an end result of its perceptual and cognitive processing. Each boid has a spherical zone of sensitivity centred at its local origin, and the behaviours that comprise the flocking model are stated in terms of nearby flock-mates. A key issue is raised here regarding behavioural animation: how to analyse the success of the model. As Reynolds notes, it is difficult to objectively measure how valid such simulations are. However, the flocks built from the model seem to correspond to the observer’s intuitive notion of ‘flock-like motion’. An interesting result of the experiments with the system is that the ‘flocking’ behaviour we intuitively recognise is not only improved by a limited, localised view of the world, but is dependent on it.

Tu et al17 present a framework for animation featuring the realistic appearance, movement and behaviour of individuals and schools of fish with minimal input from the animator.
Their repertoire of behaviours relies on their perception of the dynamic environment. Individual fish have motivations as well as simple reactive behaviours. At each time step, habit, mental state and sensory information are used to produce an intention. Behaviour routines are then executed based on this intention, and motor controllers provide motions that fulfil these behaviours. Habits are represented as numerical variables determining individual tendencies towards brightness, darkness, cold, warmth, and schooling. Each fish also has three mental state variables for hunger, libido and fear. Behaviour patterns may be interrupted by reactions to more pressing environmental stimuli (for example, a predator). The artificial fish has two on-board sensors with which to perceive its environment: a vision sensor and a temperature sensor. The vision sensor is cyclopean, covering a 300-degree spherical angle. An object is seen if any part of it enters this view volume and is not fully occluded by another object. The vision sensor has access to the geometry, material property and illumination information that is available to the graphics pipeline, and also to the object database and physical simulator for information such as the identification and velocities of objects. It is noted that as their holistic computational model exceeds a certain level of physical, motor, perceptual and behavioural sophistication, the agent’s range of functionality broadens due to emergent behaviours.

Renault et al13 introduce a synthetic vision system for the high-level animation of actors. The goal of the vision system in this case is to allow the actor to move along a corridor avoiding objects and other synthetic actors. For the vision system, the scene is rendered from the point of view of the actor and the output is stored in a 2D array. Objects in the scene are not rendered using their usual colours, but are rendered using unique colours for each object. Each element in the 2D array consists of a vector containing the pixel at that point, the distance from the actor’s eye to the pixel and an object identifier of any object at that position. The size of the array is chosen so as to provide acceptable accuracy without consuming too much CPU time; a view resolution of 30x30 was selected for the corridor problem.

Noser et al12 extend previous work13 by adding memory and learning mechanisms. They consider the navigation problem as being comprised of two parts: global navigation and local navigation. Global navigation uses a simplified map to perform high-level path-planning. This map is somewhat simplified, however, and may not reflect recent changes. In order to deal with this, the local navigation algorithm uses direct input from the environment to reach the goals and sub-goals given by the global navigation system and to avoid unexpected obstacles. This local navigation algorithm has no model of the environment and does not know the position of the actor in the world. The scene is rendered as before and global distances to objects are extracted for use by the navigation system. An octree data structure for the 3D environment is constructed from the 2D image and the depth information. This data structure represents an actor’s long-term visual memory of the 3D environment, and can handle static and dynamic objects. Using this long-term memory, an actor can find 3D paths through the environment, avoiding impasses.

Kuffner et al10 present a perception-based navigation system for animated characters. Of particular interest to this study is an algorithm for simulating the visual perception and memory of a character. The visual system provides a feedback loop to the overall navigation strategy. The approach taken builds on previous approaches12. An unlit model of the scene is rendered from the character’s point of view using a unique colour assigned to each object or object part. These objects and their locations are added to the character’s internal model of the environment. A record of perceived objects and their locations is kept as the character explores an unknown virtual environment, thus providing a kind of spatial memory for each character. Unlike previous approaches, this method relies on the object geometry stored in the environment and a list of object IDs and positions. This provides a relatively compact and fast representation of each character’s internal world. Previously unobserved objects are added to the character’s list of known objects, and other visible objects are updated with their current transformation. Objects that were previously visible but are no longer in view retain their most recently observed transformations.

Blumberg3 presents an ethologically inspired approach to real-time obstacle avoidance and navigation. Again, a creature renders the scene from its own viewpoint. This rendering is used to recover a gross measure of motion energy as well as other features of the environment, which are then used to guide movement. An approximate measure of motion energy is calculated for each half of the image, which is then used to provide corridor following and obstacle avoidance.

Gillies7 uses a psychological approach to design a visual algorithm. The methods do not try to simulate the image on the retina of the actor or allow the actor to perceive features such as colour and shape. Rather, they work at a higher level, using basic features such as velocity and position to calculate object features. Object features are rather abstract and represent complex reasons as to why an object might be looked at; interest would be an example of an object feature. The system does not attempt to provide meaning for object properties, and actors show more interest in some properties than in others. This means that actors will have different reactions to an object.

Chopra et al5 propose a framework for generating visual attention behaviour in a simulated human agent based on observations from psychology, human factors and computer vision. A number of behaviours are described, including eye behaviours for locomotion, monitoring, reaching, visual search and free viewing.

Hill8 provides a model of perceptual attention in order
to create plausible virtual human pilots for military simulations. Objects are grouped according to various criteria, such as object type. The granularity of object perception is then based on the attention level and goals of the pilot.

4. Synthetic Vision

Synthetic senses provide the means for actors to perceive their environment indirectly. We regard such methods as indirect since, ordinarily, actors are unrestricted when interrogating the environment’s state directly in the database. Although this direct access method is simple and fast, it suffers from scalability and realism problems. Because of this, research has focused on providing agents with their own methods of perceiving the environment. In some cases, actors are provided with a sensory sphere that may deform in a direction depending on the velocity in that direction16. Although this approximation is adequate to produce realistic group behaviour, it is mentioned that individuals would be better at path planning if they could see their environment. Indeed, it has been noted that most characters do not have omni-directional perception; sensory information from the environment flows from a primary direction, such as the cone of vision for a human character9.

We focus on the visual modality of sensing in this paper. Vision is regarded as the most important of all the senses for humans. Research on synthetic sensors for other modalities has also been conducted14.

It should be noted that the aim of the vision approach described here is not necessarily to imitate the human visual system as accurately as possible. Instead, it is to provide a reasonable estimate of what the visual system senses without incurring the costs associated with simulating the highly complicated mechanisms of the eye. For example, our system is monocular, since object depth information may be obtained during the rendering process. Noser et al14 differentiate between synthetic vision and artificial vision. While synthetic vision is simulated vision for a digital actor, artificial vision is the process of recognising the image of a real environment captured by a camera. Since artificial vision must obtain all of its information from the vision sensor, the task becomes more difficult, involving time-consuming tasks such as image segmentation, recognition and interpretation.

There are a number of reasons for adopting a computer vision technique. First of all, it may be the simplest and fastest way to extract useful information from the environment3. Underlying hardware can be taken advantage of and, since the object visibility calculation is fundamentally a rendering operation, all of the techniques that have been developed to speed up the rendering of large scenes can be adopted. These include scene-graph management and caching, hierarchical level-of-detail (LOD) and frame-to-frame coherency9. Secondly, synthetic vision may scale better than other techniques in complex environments: by its very nature, the visual system provides a controllable filtering mechanism, so as not to overwhelm our limited cognitive abilities. Thirdly, the approach makes the actor less dependent on the underlying implementation of the environment, because it is not necessary to rely directly on the scene database. Finally, as Blumberg3 notes, “... believable behaviour begins with believable perception”. There are also some beneficial side effects to this method: occluded objects are implicitly handled in a static scene, and the method may be extended to dynamic scenes through the use of a memory model.

Our synthetic vision module is based on the model described by Noser et al12. This model uses false-colouring and dynamic octrees to represent the visual memory of the character. We adopt a system similar to that of Kuffner et al10 by removing the octree structure; instead, scene description information is encoded with a vector that contains object observation information.

The process is as follows. Each object in the scene is assigned a single, false colour. The rendering hardware is then used to render the scene from the perspective of each agent; the frequency of this rendering may be varied. In this mode, objects are rendered with flat shading in the chosen false-colour. No textures or other effects are applied. The agent’s viewpoint does not need to be rendered into a particularly large area: our current implementation uses 128x128 renderings (see Figure 2). The false-coloured rendering is then scanned, and the object false-colours are extracted.

We extend the synthetic vision module by providing multiple vision modes. Each mode uses a different palette for false-colouring the objects. The differing vision modes are useful for capturing varying levels of detail of information about the environment. The two main vision modes are referred to as distinct mode and grouped mode.

In the distinct vision mode, each object is false-coloured with a unique colour. The unique colours of objects in the viewpoint rendering may then be used to look up each object’s globally unique identifier in the scene database. This identifier is then passed to the memory model. This mode is useful when a specific object is being attended to (Figure 2).

The other primary vision mode is called grouped vision mode. In this mode, objects are false-coloured with group colours, rather than individual colours. Objects may be grouped according to a number of different criteria; some examples of possible groupings are brightness, luminance, shape, proximity and type. The grouped vision mode is useful for lower-detail scene perception (Figure 2). Note that the grouped vision mode only provides information about potentially visible objects: it is entirely possible that a group will be marked as being in view when only one of the objects in the group is actually in view. For example, consider a group consisting of a table with numerous glasses and a large bottle on it. Although the table and glasses may be out of view, the bottle may still be seen, causing the whole group to be reported as in view.
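The core of the process above, palette assignment and the scan of the false-coloured rendering, can be sketched in a few lines. This is a simplified illustration rather than our implementation: the dictionary-based scene description, the integer colour encoding and the flat pixel buffer are stand-in assumptions, and the rendering itself is left to the graphics hardware.

```python
def build_palette(objects, mode):
    """Map each object ID to a false colour. In distinct mode every
    object gets a unique colour; in grouped mode all objects sharing a
    group (e.g. by type or proximity) share one colour."""
    palette = {}
    key_of = {}
    next_colour = 1                      # 0 is reserved for background
    for obj_id, group in objects.items():
        key = obj_id if mode == "distinct" else group
        if key not in key_of:
            key_of[key] = next_colour
            next_colour += 1
        palette[obj_id] = key_of[key]
    return palette

def scan_rendering(pixels, palette):
    """Scan a false-coloured rendering (e.g. a flat 128x128 buffer) and
    recover the object IDs that are potentially visible."""
    seen_colours = set(pixels) - {0}
    reverse = {}
    for obj_id, colour in palette.items():
        reverse.setdefault(colour, set()).add(obj_id)
    visible = set()
    for colour in seen_colours:
        visible |= reverse.get(colour, set())
    return visible
```

Note that in grouped mode the scan reports every member of a seen group, which is precisely the "potentially visible" over-approximation discussed above for the table, glasses and bottle.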
5. Memory

Some form of memory is crucial for agents that are disconnected from the environment database in some way. As with living creatures, autonomous agents rely on their memory to differentiate between what they have and have not observed. An agent that automatically knows the location of every object in the scene will destroy its plausibility with respect to a human viewer, while an agent that has no memory of its surroundings will appear stupid when conducting tasks. There are many instances where memory plays a large role.

Over the past 40 years, a number of different structural analyses of memory have been performed. These have been conducted on normal individuals as well as on individuals suffering from brain damage and disease. We base our system of memory on what is referred to as stage theory2. Atkinson and Shiffrin2 propose a model where information is processed and stored in three stages: sensory memory (STSS), short-term memory (STM) and long-term memory (LTM). See Figure 1 for a schematic of the model. This model provides a useful structure.
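A minimal sketch of how such a three-stage store might be organised in code is given below. The capacity limit, the eviction rule and the promotion of rehearsed STM items into the LTM are our own illustrative assumptions rather than claims of stage theory itself; the default STM capacity of seven echoes Miller's11 'magical number seven'.

```python
class StageMemory:
    """Illustrative three-stage store: percepts enter the sensory
    store (STSS), attended items move into the STM, and items that
    are rehearsed (attended again while still in the STM) are
    consolidated into the LTM."""

    def __init__(self, stm_capacity=7):
        self.stss = []          # raw percepts; overwritten every frame
        self.stm = {}           # object id -> latest observation
        self.ltm = {}           # object id -> consolidated observation
        self.stm_capacity = stm_capacity

    def sense(self, percepts):
        """The STSS holds only the most recent frame of percepts."""
        self.stss = list(percepts)

    def attend(self, obj_id, observation):
        """Filter an attended percept into the STM, consolidating it
        into the LTM when the attendance counts as a rehearsal."""
        if obj_id in self.stm:
            self.ltm[obj_id] = observation      # rehearsal
        self.stm[obj_id] = observation
        while len(self.stm) > self.stm_capacity:
            oldest = next(iter(self.stm))       # dicts keep insertion order
            self.stm.pop(oldest)
```

The filtering step between the STSS and the STM is deliberately left as an explicit `attend` call, since that is where an attention mechanism would plug in.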
A goal command is given to the virtual human. This goal command contains the globally unique identifier of the object that attention is to be directed towards. If the object is already memorised in the STM or the LTM, then the observation information is extracted, and the virtual human will become attentive towards (look at) the object and update its perception of the object using the distinct vision mode. If the object was memorised in the STM, this procedure is regarded as a rehearsal.

If the object is not in the STM or the LTM, then the agent’s perception of the environment will be rendered using the group-by-proximity vision mode (currently, agents do not initiate an active search of their surroundings; they only search the groups in their view at the time the task is issued). They then go through the groups in the STM one by one, and render them using the group-by-type vision mode. If an object of the same type as the requested object is there, then they become attentive towards that object and check to see if it is the goal object. If it is not, the search continues through other objects of similar type in the group and, when there are no more, proceeds to other groups. If it is the goal object, the perceived state of the object is entered in the STM.

The memory model outlined above was implemented on the ALOHA animation system, an animation system for the real-time rendering of characters6. This system uses the OpenGL API on a Windows platform. Figure 3 shows some sample screenshots from the ALOHA system.

… between speed and functionality, while at the same time showing parallels to the real-life system.

In terms of the memory model, work is currently focusing on a more elaborate long-term memory system. Long-term memory poses a particular problem to research involving agents. Perhaps the most interesting question is “when do we forget things?” This is also a very important topic in the field of cognitive research. Research indicates that there are two main reasons why we may be unable to recall what we have committed to memory: the first is that the memory has disappeared, and the second is that the memory is still there but we cannot retrieve it. This can be problematic, since it is not always easy for researchers to distinguish between the two possibilities. The most interesting possibility is that we never really lose our memories; all memories are still there, but we cannot retrieve them1. Of course, for a system involving agents, we must assume that under some circumstances memories are forgotten. The goal of the long-term memory should be to store only information that is important to the agent. This seems to suggest both a filtering process that only allows important information in, and a forgetting process that keeps only the most important information. It is likely that the filtering process will be linked to the attention mechanism discussed in section 8. In terms of the forgetting process, Anderson1 offers some insight:

“Speed and probability of accessing a memory is determined by its level of activation, which in turn is determined by how frequently and how recently we used the memory.”
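Anderson's observation is formalised in his ACT-R theory as a base-level learning equation, which suggests one possible forgetting mechanism for the long-term memory. The sketch below is our illustrative reading of it, not something the present system implements; the decay rate of 0.5 is the conventional ACT-R default.

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """Activation of a memory item given the times at which it was
    previously accessed: frequent and recent use yields a higher
    activation, and hence faster, more probable retrieval."""
    total = sum((now - t) ** (-decay) for t in access_times if now > t)
    return math.log(total) if total > 0 else float("-inf")
```

An agent could forget LTM entries whose activation falls below a retrieval threshold, realising a forgetting process that keeps only frequently and recently used information.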
… mechanism for deciding what items from the STSS are entered into the STM. The closest parallel to this mechanism in the real human is referred to as attention, and it is an elusive concept. Pashler15 goes as far as to assume that:

“... no one knows what attention is, and that there may not even be an “it” there to be known about.”

Nonetheless, from our point of view, the concept of an attentional construct is useful. It fits neatly with the idea of a series of filtering mechanisms for extracting and storing important information from the environment. A more proactive attentional mechanism may also prove useful for providing low-level behavioural animation and establishing a sense of presence; simple orienting behaviours towards important stimuli are a glaring omission in contemporary autonomous characters.

The output from our attention model will essentially be a list of the current objects in the scene that the agent is aware of (not necessarily all those objects that are visible), and a ranking of these objects based on how interesting they are to the agent. In such a case, specifying an agent’s interest becomes a difficult problem. We approach this problem by viewing interest as a combination of bottom-up, attention-grabbing processes and top-down, task-related processes. Both are necessary for human survival: for example, while carrying out task-related attention (looking at your watch to find out the time), you may become aware of a looming object in your periphery of vision (a speeding car). In this case, an attention control process interrupts the task at hand and switches attention towards the more immediate threat.

The above results will be used to provide an object of interest at any one time for the agent, which will invoke higher-level behaviours (for example, an orienting behaviour towards an interesting object). Overt attention may be especially important in providing basic low-level attention behaviours to agents, in order to increase their realism and the viewer’s sense of presence.

This article has considered the first steps towards providing attention-driven behavioural animation. These are all internal, however. It is envisaged that the results of the work above will be used to create low-level behaviours. Example behaviours that we are interested in include directing a character’s eye-gaze in stimulus-driven, task-dependent situations, and using memory and vision to affect search and prehension behaviours.

It should be noted that the grouping mechanism for the grouped vision mode is currently pre-processed; essentially, objects are assigned group IDs by human intervention at the start of the program. Such a method will not suit dynamic scenes. A structure for grouping objects according to varying criteria (proximity and type) in a dynamic scene would be complementary to the current system. This also ignores a more fundamental question: how do humans group objects? More elaborate research on perceptual grouping and how it relates to task requirements would certainly be interesting.

References

1. Anderson, J.R. Cognitive Psychology and Its Implications. 4th edn, New York: Freeman, 1995.

2. Atkinson, R. and R. Shiffrin. Human memory: a proposed system and its control processes. In K. Spence and J. Spence, editors, The Psychology of Learning and Motivation: Advances in Research and Theory, Vol. 2. New York: Academic Press, 1968.

3. Blumberg, B. Old Tricks, New Dogs: Ethology and Interactive Creatures. PhD Dissertation, MIT Media Lab, 1996.

4. Bruce, V. and P.R. Green. Visual Perception: Physiology, Psychology and Ecology. 2nd edn, Lawrence Erlbaum Associates Ltd., Hove, U.K., 1990.

5. Chopra, S. and N. Badler. Where to look? Automating attending behaviours of virtual human characters. Autonomous Agents and Multi-Agent Systems, 4(1/2):9-23, 2001.

6. Giang, T., R. Mooney, C. Peters, and C. O’Sullivan. ALOHA: adaptive level of detail for human animation. Eurographics 2000, Short Presentations, 2000.

7. Gillies, M. Practical Behavioural Animation Based on Vision and Attention. University of Cambridge Computer Laboratory, Technical Report TR522, 2001.

8. Hill, R.W. Perceptual attention in virtual humans: towards realistic and believable gaze behaviours. Simulating Human Agents, Fall Symposium, 2000.

9. Kuffner, J. Autonomous Agents for Real-Time Animation. PhD Dissertation, Stanford University, 1999.

10. Kuffner, J. and J.C. Latombe. Perception-based navigation for animated characters in real-time virtual environments. The Visual Computer: Real-Time Virtual Worlds, 1999.

11. Miller, G.A. The magic number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63, pages 81-97, 1956.

12. Noser, H., O. Renault, D. Thalmann, and N.M. Thalmann. Navigation for digital actors based on synthetic vision, memory and learning. Computers and Graphics, Vol. 19, pages 7-19, 1995.

13. Renault, O., D. Thalmann, and N.M. Thalmann. A vision-based approach to behavioural animation. Visualization and Computer Animation, Vol. 1, pages 18-21, 1990.
Figure 2: Sample object views as seen from the perspective of the agent with (a) no false colouring applied, (b) false colouring according to object ID, (c) false colouring according to object type, and (d) false colouring according to object proximity.

Figure 3: The system in action. The left inset shows the scene rendered from the viewpoint of the agent. The right inset depicts the same view false-coloured according to object proximity.