A Person Following Algorithm for Use with a
Single Forward Facing RGB-D Camera on a
Mobile Robot
A Thesis
Presented to
the Graduate School of
Clemson University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
Electrical Engineering
by
Sean Michael Ficht
May 2012
Accepted by:
Dr. Stanley Birchfield, Committee Chair
Dr. Robert Schalkoff
Dr. Adam Hoover
Abstract
particle filter that will serve as an integrating structure for the tracker, and a simple robot control architecture.

In the following chapters I will discuss the motivation behind this work, previous research done in this area, the methods used in this thesis, and the theory behind them. Experimental results will then be analyzed, and a discussion concerning the results and possible improvements to the system will be presented.
Acknowledgements
I would like to thank Ninad Pradhan for his contributions to this work. We
worked alongside one another on this project and his help is greatly appreciated. I
would also like to thank Dr. Stan Birchfield for all his input on this research work.
Table of Contents
Title Page
Abstract
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Thesis Outline
2 Related Work
2.1 Person Following
2.2 Person Detection and Tracking
3 Background Theory
3.1 RGB-D Sensor
3.2 Homographies
3.3 Histogram of Oriented Gradients
3.4 Particle Filter
3.5 Color Histogram
4 Methods
4.1 Overall Procedure
4.2 Particle Filter Tracking
4.3 Generic Detector
4.4 Specific Appearance Model
4.5 Robot Control Architecture
5.1 Types of Images
5.2 No Occlusion
5.3 Brief Occlusions
5.4 Long Occlusions
5.5 Overall Performance
6 Conclusions
6.1 Future Work
References
List of Tables
List of Figures
4.1 Initialization
4.2 Algorithm flow
4.3 Propagation rays
4.4 Generic HOG detections
4.5 Determining pixels to be used in color histogram
4.6 Initial template for the color histogram
4.7 Expected position color histogram templates from frames in the sequence
4.8 Robot control architecture
Chapter 1
Introduction
In this chapter, the motivation behind designing a robotic person follower is discussed, and a general outline is given to guide the reader through the thesis.
1.1 Motivation
Every day we see more and more robot/human interaction in society. Robotic
intelligence can be seen in many facets of life from car sensors that tell a person
when they are too close to an object to robotic vacuums. The intention behind this study is to design a system that will enable a robot to reliably follow a person in
an indoor environment. The motivation behind designing this kind of system comes
from the desire to have a robot that will carry items for you. For example, in a
hospital setting, nurses and doctors must visit countless patients every day. To tend
to a patient, a doctor needs multiple items. Some of these items include medication,
charts, a computer to update patient history, and various other medical supplies. The
problems are that every patient has a different medication and a different chart, there is not an accessible computer in every patient's room, and the various types of medical supplies such as bandages and shots can add up to be a heavy load. This means that while making rounds a doctor will have to take multiple trips to a supply closet, a
computer, and file cabinets. If a system existed that was able to carry all of these
items and follow the doctor around, the doctor would be able to visit more patients
in a shorter amount of time. This is just one situation in which the proposed person
following system could be of use. One can imagine other situations in which it would
be beneficial to not have to make multiple trips from one destination to another simply
because a person can only carry so much. This is the motivation behind designing a
person following robot.
This thesis outlines the design of a system with the ability to detect, track,
and follow a specific human target through varied environments in order to serve as a
solution to the motivating example. The thesis will progress from pertinent previous
work through design, testing, and conclusions.
Chapter 2 will outline existing methods for both person following and person
detection. Chapter 3 will cover the theory behind the methods taken in the design
of this person following system. Chapter 4 will cover an in depth discussion of the
methods used in order to design the system. Chapter 5 will present, catalog, and
analyze how well the system performed on the different test scenarios that must be
taken into account when designing a person following robot. Finally, Chapter 6 will
present conclusions about the system and postulate future improvements that can be
made to the system.
Chapter 2
Related Work
Related work is broken into two sections. The first section covers person following and considers algorithms that include tracking and following with a mobile robot. The second section considers person detection
and tracking without a mobile robot. A significant amount of research has also been
done in the area of crowd navigation and navigation planning [9],[10],[25]. However,
since crowd navigation is not in the scope of this thesis these methods will not be
discussed in depth.
2.1 Person Following
Chen and Birchfield [4] devised a method based on matching sparse Lucas-
Kanade [20][24] features in a binocular stereo system. The algorithm, called Binocular
Sparse Feature Segmentation (BSFS), detects and matches feature points between a
pair of stereo images as well as between images in the video sequence. The BSFS
algorithm can be thought of as a two part algorithm. There is a detection mode and
a tracking mode.
In the detection mode, a pair of images is fetched from the stereo cameras.
Using the Lucas-Kanade approach, feature points are matched between the set of
images and the disparity between them is computed. A left-right consistency check
method is used. After matching features, a foreground segmentation is performed
to remove features that do not belong to the person being tracked. This is done
using the known disparity of the person in the previous frame, the estimated motion
of the background, and the computed motion of the person. The Viola-Jones [27]
face detector is also used in the detection step. It is used to initialize the person
being tracked as well as to increase robustness in detection mode. The results of the
matched and segmented features are combined with the results of the face detector
to determine if a person has been found.
Once a person has been found, the algorithm goes into tracking mode. The
person is tracked using Lucas-Kanade from frame to frame. As features are lost over time, the algorithm determines whether or not the person has been lost and, if so, returns to detection mode.
The BSFS algorithm performs well and does not require the person to wear a
different color from the background. It will also work in environments with clutter.
It can, however, be distracted by other objects or people with similar motion and
disparity to the person being tracked. The algorithm is not able to handle complete
occlusions nor can the algorithm handle a disappearance of the target person.
A second method of person following relevant to this thesis was proposed by Brookshire [3]. In his paper he proposes using Histogram of Oriented
Gradient (HOG) features combined with a particle filter to follow a target person.
The algorithm breaks down into two areas, detection and tracking.
Detection is performed using video from a single camera of a stereo pair of cameras. The algorithm uses HOG features trained using linear
Support Vector Machines (SVMs). The linear SVMs are trained off-line on positive
and negative training images. To calculate the HOG features in an image a 64x128
pixel detection window is used. Scanning for HOG features is done at multiple scales.
However, to speed up the process of scanning at multiple scales three steps are taken.
The algorithm first calculates an integral histogram [27], [29], then scales the integral histogram as opposed to the image, and finally calculates the HOG features. This speeds up the calculation because the integral histogram is only calculated once and then simply indexed to find values at multiple scales.
Once detection has been done, tracking is performed using a particle filter. The
particle filter is implemented to curb the effect of missed detections and false positives.
The filtering is performed using a particle filter in which each particle is processed using a simple Kalman filter. The state of each particle is $(x\ y\ z\ \dot{x}\ \dot{y}\ \dot{z})^T$, where z is determined from the stereo pair of images. A constant velocity model is assumed. To compensate for the platform motion, the system uses readings from an IMU to allow for updating of the pedestrian's state relative to the robot.
Results showed the algorithm to be robust to changes in pose, but no results
were shown for full or partial occlusions. It should be noted that since a pair of stereo images was used to calculate depth, as opposed to IR sensors, the robot is able to function in an outdoor environment.
2.1.1 Appearance Based Person Following
Schlegel et al. [18] propose a combination of a fast color-based approach with a robust contour-based approach. The color-based portion of the algorithm uses a color histogram to provide regions of interest to the contour-based portion of the algorithm. The contour-based portion then searches for the target by matching extracted contours against an adaptive contour template. This algorithm is more robust to situations in which similarly clothed people are present, but does not discuss occlusion handling.
Sidenbladh et al. [21] present an algorithm that looks for a person's head. The system locates the head of a person by using skin color detection. Then a control loop is used to keep the person centrally located. This is done by adjusting the pan and tilt of the camera as well as controlling the wheels of the mobile robot that the camera is mounted on. The algorithm has trouble when objects or background regions similar to the target's skin tone are present, and it does not present a method for occlusion handling. The target must also be facing the camera.
Another method that has been used to implement a person following algorithm
deals with using optical flow information. These algorithms are more robust to similar
colors between the target and other objects or backgrounds, but are sensitive to
camera vibration and subject to drift.
Piaggio et al. [16] propose one such optical flow based following method. The algorithm first calculates the optical flow. The optical flow image is then thresholded in order to segment the person from the background. The system applies a low-pass filter to discard image pixels that do not belong to a person. The width of the target person is then extracted in order to calculate the person's distance from the robot, and the mobile robot is moved accordingly.
Chivilo et al. [5] use the optical flow calculated in the center area of the image. The optical flow in this algorithm is calculated using the Horn and Schunck algorithm [11]. The basic idea of the system is to measure the relative velocity of the target person with respect to the mobile robot and to keep this relative velocity as small as possible.
As seen earlier in the work by Chen and Birchfield, stereo vision is also used in person following algorithms.
Another example of a stereo based person follower comes from work done by
Beymer and Konolige [2]. The algorithm first creates an orthographic floor-plane
representation of the three dimensional stereo camera information. The system then
subtracts the background in one of two ways. (1) If the robot is not moving, average
background subtraction is used. (2) If the robot is moving, the system uses odometry
data to estimate background motion which is then subtracted. Once the background
has been subtracted, the system finds the target by modeling people as Gaussian
blobs. The target is then assigned a state vector and a Kalman filter is applied to
maintain the location of the person. No method of full occlusion handling is discussed
in this paper.
Other methods of person following that have been attempted include laser
based person following [13], video and RFID based person tracking [7], and tracking
based on active contours [26].
2.2 Person Detection and Tracking
Person detection and tracking differs from person following in that the algo-
rithms are not designed for use with a mobile robot. Cameras used are typically
stationary which means the background scene is constant.
One successful method for person tracking in RGB-D data was proposed by
Luber et al. [15]. The algorithm combines a multi-cue person detector for RGB-D
data with an online detector that learns individual target models.
The person detector for RGB-D data is made up of a HOG detector [6] per-
formed on the color image and a HOD (Histogram of Oriented Depths) detector
[22] performed on the depth image. This approach is called Combined Histogram of
Oriented Depths and Gradients (Combo-HOD). The HOG and HOD descriptors are
computed at multiple scales. This calculation is sped up by using integral tensors, which are an extension of the integral image. The descriptors are combined using a weighted mean of the probabilities obtained by a sigmoid fit to the SVM outputs. The outputs of the detector are the positions and sizes of all the targets in 3D space.
The online detector builds upon the online-boosting method for object detection proposed by Grabner et al. [8], who propose applying on-line boosting to 'selectors' rather than directly to weak classifiers, where the 'selector' selects the weak classifier with the lowest error. Luber et al. [15] compute three types of features that correspond to the weak classifiers: Haar-like features in the intensity and depth images and illumination-agnostic features in the color image. The features are computed in rectangular regions with random positions and sizes within a target's bounding box. This is done once when the target is initially found and
then fixed for the duration of the target’s presence. On-line boosting is then used
to continuously update the target model. For consistency, the region for feature detection is kept the same size as the bounding box of the previous detection. By sweeping
the bounding box around a local neighborhood, a confidence map is created yielding
the new tracking position at its maximum.
The on-line detector is then incorporated into a Kalman filter based multi-hypothesis tracking framework [17], differing in that the on-line classifier adds an appearance likelihood that measures how well the observed target appearance matches the learned model.
Finally, at each iteration the tracker produces assignments of measurements
to tracks as well as interprets the measurements as being new tracks or false alarms.
It also decides if a track has been occluded or deleted.
The algorithm seems to perform quite well. However, it is designed for detection and tracking of multiple targets as opposed to tracking one specific target, and occlusion is not handled: once a target has been determined to be occluded, the tracking history of that target stops, and if that target reappears it is determined to be a new target as opposed to a reacquisition of the original target.
In another approach proposed by Bansal et al. [1], a method of pedestrian
detection based on structure and appearance classification is implemented.
First, a depth map of a given scene is produced by a set of stereo cameras. Pre-
computed templates at three separate depth intervals are used at runtime to search for
prospective candidates. Using template matching based on the depth interval that is
being searched, a correlation score map is produced. Using non-maximal suppression,
the peaks of the correlation score map are selected, and initial regions of interest for prospective candidates are the result. This initial set is reduced by first considering
overlapping regions of interest. If a region of interest overlaps an existing detection
by more than 70 percent it is discarded. Next, Canny edges are found for each region of interest, and a vertical projection of the binary mask of the edges gives a one-dimensional histogram. Peaks of this histogram are then detected using mean-shift, with each peak corresponding to a possible pedestrian. A new region of interest is then set at each peak, with overlapping regions once again being discarded. At the same time, the algorithm labels pixels, based on height, as belonging to either the ground plane, tall vertical structures, overhangs, or pedestrian candidates.
An image based classifier further evaluates the candidates. A HOG feature detector is combined with contour segments of body parts in this classifier. To combine
the two methods, the algorithm uses chamfer matched templates on the regions of
interest to create a foreground mask. The foreground mask is then used globally on
the image to suppress the background giving enhanced gradient values on pedestrian
contours during the HOG calculation while suppressing others that are potentially
from the background.
This algorithm also performs quite well, but once again it is best suited for pure pedestrian detection from a moving car and does not employ any long-term person-specific descriptors that would be needed to track the same target for long periods of time.
A widely used method in person detection algorithms is Dalal and Triggs’
Histogram of Oriented Gradients for Human Detection method [6] which will be
discussed in detail later in this thesis. A multitude of algorithms have implemented some variation of HOG or expanded upon it; [28], [19], and [22] are just a few examples of such algorithms.
Chapter 3
Background Theory
3.1 RGB-D Sensor

First, we need to look at how the RGB-D sensor works in order to validate it as a useful tool for the purposes of tracking. The sensor used is the Xbox Kinect camera. It is made up of two main parts: an infrared laser projector and an IR VGA camera. The projector casts an infrared laser pattern across the entire field of view of the camera. The IR camera then picks up the projected pattern, and from the observed pattern the sensor computes a measurement at every pixel that varies depending on how close that pixel position is to the camera. This is how the depth image is created and obtained. The RGB image is obtained as it would be in any other camera. So, essentially, the RGB-D sensor provides us with a color image and a depth image that can be accessed as often as they are needed. It works at 30 fps.
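For reference, the following is a minimal sketch of how the color and depth images might be retrieved through ROS, as is done later in this work. The topic names and encodings shown are assumptions about the particular camera driver configuration, not values taken from this thesis.

# Minimal sketch: subscribing to Kinect color and depth streams through ROS.
# Topic names and encodings are assumptions; adjust for the actual driver.
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()
latest = {"rgb": None, "depth": None}

def rgb_callback(msg):
    # 8-bit BGR color image
    latest["rgb"] = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")

def depth_callback(msg):
    # 32-bit float depth image, one value per pixel (meters)
    latest["depth"] = bridge.imgmsg_to_cv2(msg, desired_encoding="32FC1")

rospy.init_node("person_follower_input")
rospy.Subscriber("/camera/rgb/image_color", Image, rgb_callback)
rospy.Subscriber("/camera/depth/image", Image, depth_callback)
rospy.spin()  # both images update at roughly 30 fps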
3.2 Homographies
Figure 3.2: 3D mapping to image plane
\begin{pmatrix} f x \\ f y \\ z \end{pmatrix} =
\begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
The matrix describing the mapping is called the camera projection matrix P .
This equation can be written more simply as:
z\,m = P M

where $M = (x, y, z, 1)^T$ are the homogeneous coordinates of the 3D point and $m = (\frac{fx}{z}, \frac{fy}{z}, 1)^T$ are the homogeneous coordinates of the image point. Since it only contains information about the focal distance f, the projection matrix P represents the simplest possible case.
This formulation assumes a special choice of the world coordinate system
and the image coordinate system. This can, however, be generalized by introduc-
ing changes of the coordinate systems. Changing coordinates in space can be done by
multiplying the projection matrix P by a 4x4 matrix composed of a rotation matrix
R and a translation vector t. It describes the position and orientation of the camera
with respect to the world coordinate system.
G = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}
The rows of the rotation matrix R are unit vectors that, together with the
optical center, define the camera reference frame, expressed in world coordinates.
Changing the coordinates in the image plane is equivalent to multiplying the matrix
P on the left by a 3x3 camera calibration matrix K.
K = \begin{pmatrix} f_x & -f_x \cot\theta & u_0 \\ 0 & \dfrac{f_y}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{pmatrix}
P = K[I|0]G = K[R|t]
Given a world plane and the camera matrix we are now able to formulate
our homography matrix. Referring to Figure 3.3, the formulation of the homography
matrix is as follows:
Figure 3.3: Mapping world coordinates to image coordinates
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \propto
\begin{pmatrix} f_x & -f_x \cot\theta & u_0 \\ 0 & \dfrac{f_y}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{pmatrix}
\left[ R_{3\times 3} \;\; t_{3\times 1} \right]
\begin{pmatrix} x \\ y \\ 0 \\ 1 \end{pmatrix}
= K \left[ r_1 \;\; r_2 \;\; t \right] \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
In this equation, $K[r_1\ r_2\ t]$ represents the 3x3 homography matrix, $H_{3\times 3}$, that will be used to map real-world coordinates to image coordinates; it will be used in our robotic follower to measure the target's movement distance between frames.
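As an illustration of how such a homography can be estimated and applied in practice, the sketch below uses OpenCV to compute H from four image/ground-plane point correspondences (for example, the clicked corners of a calibration target of known size, as described in Chapter 4) and then maps image points to world coordinates. The specific point values are hypothetical.

# Sketch: estimate a 3x3 homography from four correspondences and use it to
# map an image point onto the ground plane. Coordinate values are made up.
import numpy as np
import cv2

# Image coordinates (pixels) of four clicked corners of a calibration target.
image_pts = np.float32([[312, 410], [402, 408], [398, 465], [308, 468]])
# Corresponding ground-plane coordinates (meters) measured on the floor.
world_pts = np.float32([[0.0, 2.0], [0.5, 2.0], [0.5, 1.5], [0.0, 1.5]])

# Homography that maps image points to ground-plane points.
H, _ = cv2.findHomography(image_pts, world_pts)

def image_to_ground(u, v):
    """Apply H to a pixel (u, v) and dehomogenize to get a point on the floor."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]

# Distance (in meters) the target moved on the ground plane between two frames.
prev_xy = image_to_ground(320, 430)
curr_xy = image_to_ground(335, 441)
print(np.linalg.norm(curr_xy - prev_xy))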
3.3 Histogram of Oriented Gradients
Person detection using Histogram of Oriented Gradients (HOG) was first pro-
posed by Dalal and Triggs [6]. It has become the standard for person detection in
RGB and grayscale images. The underlying assumption in HOG is that gradient in-
formation is discriminative enough to detect people in an image. We will now describe
the OpenCV implementation of Dalal and Triggs' HOG person detector.
The first step in computing HOG features involves computing the gradient
of the image. The gradients are computed using simple $[-1\ 0\ 1]$ and $[-1\ 0\ 1]^T$ kernels. For color images, separate gradients are calculated for each color channel, and the one with the largest norm is chosen as the pixel's gradient vector.
Once the gradient has been calculated, a binning procedure takes place. In
this step we are calculating a cell histogram. A cell is comprised of an 8x8 pixel area.
Each pixel within a cell contributes a weighted vote towards an edge orientation.
Orientations are separated into 9 bins between 0 degrees and 360 degrees based on the angle of the gradient at the current pixel, as seen below in Table 3.1.
Table 3.1: Gradient orientation binning

Degrees      Bin
0 - 39        1
40 - 79       2
80 - 119      3
120 - 159     4
160 - 199     5
200 - 239     6
240 - 279     7
280 - 319     8
320 - 359     9
The weighted vote that goes into the bin is based on the magnitude of the
gradient at the current pixel. The votes of all the pixels contained within a cell make
up the cell histogram.
Once cell histograms have been computed, the cells are combined to form
blocks. A block is a 16x16 pixel space, and the blocks overlap each other by 8 pixels
as seen below in Figure 3.4. Hence, most cells will contribute to four blocks.
The reason for merging the cells into blocks is to reduce the effects of illumination variation as well as to reduce the effects of foreground-background contrast. Once
the cells are grouped into a block the block is normalized using the L2-Hys norm.
The L2-Hys norm is simply the L2 norm followed by clipping, limiting the maximum
values of the non-normalized block descriptor vector, v, to 0.2 and re-normalizing.
The equation for the L2 norm is as follows:
f = \frac{v}{\sqrt{\|v\|_2^2 + e^2}}
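To make the binning and normalization steps concrete, the following is a simplified sketch that computes the 9-bin orientation histogram for a single 8x8 cell and applies L2-Hys normalization to a block descriptor, following the description above. It is an illustration of the technique, not the OpenCV implementation, and the epsilon and clipping constants are assumed values.

# Sketch: orientation binning for one 8x8 cell and L2-Hys block normalization,
# following the description in this section (9 bins spanning 0-360 degrees).
import numpy as np

def cell_histogram(cell_gx, cell_gy, n_bins=9):
    """cell_gx, cell_gy: 8x8 arrays of x/y gradients from [-1 0 1] kernels."""
    magnitude = np.hypot(cell_gx, cell_gy)
    angle = np.degrees(np.arctan2(cell_gy, cell_gx)) % 360.0
    hist = np.zeros(n_bins)
    bin_width = 360.0 / n_bins                       # 40 degrees per bin
    for mag, ang in zip(magnitude.ravel(), angle.ravel()):
        hist[int(ang // bin_width)] += mag           # magnitude-weighted vote
    return hist

def l2_hys(block_vector, clip=0.2, eps=1e-3):
    """L2 normalize, clip values at 0.2, then renormalize (L2-Hys)."""
    v = block_vector / np.sqrt(np.sum(block_vector ** 2) + eps ** 2)
    v = np.minimum(v, clip)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)

# A block is 2x2 cells, so its descriptor concatenates four 9-bin histograms.
gx, gy = np.random.randn(8, 8), np.random.randn(8, 8)   # stand-in gradients
block = np.concatenate([cell_histogram(gx, gy) for _ in range(4)])
print(l2_hys(block))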
3.4 Particle Filter

The particle filter maintains the posterior distribution over the target state by recursively applying a motion model and an observation model:

p(x_t \mid \Phi_{0:t}) \propto p(\Phi_t \mid x_t) \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid \Phi_{0:t-1}) \, dx_{t-1}

where $x_t$ is the state at time t, and $\Phi_t$ is the measurement obtained at time t. The motion model is given by $p(x_t \mid x_{t-1})$, and the observation model is given by $p(\Phi_t \mid x_t)$. The factor $p(x_{t-1} \mid \Phi_{0:t-1})$ is the posterior from the previous time step. The particle filter assumes that the process is Markov, i.e., $p(\Phi_{0:t} \mid x_{0:t}) = \prod_{i=0}^{t} p(\Phi_i \mid x_i)$.
The state estimate is then given by

x_t^* = \arg\max_{x_t^{(i)}} p\!\left(x_t^{(i)} \mid \Phi_{0:t-1}\right),

for $i = 1, \ldots, M$, where M is the number of particles.
To implement the particle filter, two initialization steps must first take place.
\chi = \{x^m, w^m\}_{m=1}^{M} = \left\{0, \tfrac{1}{M}\right\}_{m=1}^{M}
Once this has been done, the following algorithm is run repeatedly until the entire
program has finished.
1. Propagate the particles according to the motion model:

\{x_t^m\}_{m=1}^{M} = \{f(x_{t-1}^m, \mu_t)\}_{m=1}^{M}

2. Update each particle weight using the observation model:

w_t^m = w_{t-1}^m \, p(\Phi_t \mid x_t^m)

3. Normalize the weights:

w^m = \frac{w^m}{\sum_{m=1}^{M} w^m} \quad \text{for } m = 1, \ldots, M

4. Compute the expected state:

E[x] = \sum_{m=1}^{M} x^m \cdot w^m
5. If necessary re-sample.
The need for re-sampling arises because particles wander as they are iteratively propagated by the motion model. As the algorithm progresses, some particles will drift away from the target observation, sending their weights closer and closer to zero. As more particles drift, we are left with only a small number of particles that carry any weight and, hence, are the only particles contributing to the calculation of the expected state. This creates a problem of reliability. Therefore a re-sampling criterion is set such that:
CV = \frac{\mathrm{VAR}(w^m)}{E^2[w^m]}
   = \frac{\frac{1}{M}\sum_{m=1}^{M}\left(w^m - \frac{1}{M}\sum_{m=1}^{M} w^m\right)^2}{\left(\frac{1}{M}\sum_{m=1}^{M} w^m\right)^2}
   = \frac{1}{M}\sum_{m=1}^{M}\left(M \cdot w^m - 1\right)^2
where CV is known as the coefficient of variation and is used to calculate the effective
sample size:
ESS = \frac{M}{1 + CV}
which lets us know how many particles still have valid weights. If the ESS is less than
some predetermined ratio of M then it is determined that a re-sample is necessary.
In this thesis the method of select with replacement is used to re-sample.
Select with replacement works by taking all the values of the normalized
weights, which are valued from 0 to 1, and indexing them by particle number. The
weight can be thought of as the y-axis of a graph and the indices of the particles
from 1 to M can be thought of as the x-axis of a graph. Next, we want to create
a cumulative weight graph. This means that the value plotted for particle 2 is equal to the weight of particle 1 plus the weight of particle 2. This is done all the way through particle M, and particle M's cumulative value is guaranteed to be 1. Finally, for each of the M new particles, a value between 0 and 1 is chosen at random, and the index of the old particle that holds that value is assigned to the new particle, meaning the old particle's state becomes the state of the new particle. This ensures that particles with lower weights are less likely to be copied into the new particle list and particles with higher weights will most likely be copied multiple times into the new particle list.
The reason that particles with higher weights will be copied more often stems from
the cumulative distribution that was set up. It is more likely that the number chosen
at random will fall within the range of a particle with a large weight range than it
will a particle with a small weight range. A graphical representation of this process
can be seen below in Figure 3.5.
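A compact sketch of the re-sampling criterion and the select-with-replacement procedure described above is given below. The re-sample ratio, the weight reset after re-sampling, and the variable names are illustrative assumptions rather than values taken from this thesis.

# Sketch: effective-sample-size test and select-with-replacement re-sampling.
# Weights are assumed to already be normalized; the 0.5 ratio is a placeholder.
import numpy as np

def needs_resample(weights, ratio=0.5):
    M = len(weights)
    cv = np.sum((M * weights - 1.0) ** 2) / M        # coefficient of variation
    ess = M / (1.0 + cv)                             # effective sample size
    return ess < ratio * M

def select_with_replacement(states, weights):
    """Draw M new particles; heavier particles are copied more often."""
    M = len(weights)
    cumulative = np.cumsum(weights)                  # cumulative weight "graph"
    draws = np.random.uniform(0.0, 1.0, size=M)      # one random value per particle
    indices = np.minimum(np.searchsorted(cumulative, draws), M - 1)
    new_states = states[indices].copy()              # copy the owning old particles
    new_weights = np.full(M, 1.0 / M)                # equal weights after re-sampling
    return new_states, new_weights

states = np.random.randn(100, 3)                     # M = 100 particles, state (x, y, z)
weights = np.random.dirichlet(np.ones(100))          # normalized weights
if needs_resample(weights):
    states, weights = select_with_replacement(states, weights)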
3.5 Color Histogram
A color histogram counts how many pixels in a region fall into each combination of per-channel bins. With two bins per channel, for example, bin 1 means that a pixel has a value between 0 and 127 on that channel and bin 2 means it has a value between 128 and 255. It can be seen here that when dealing with a three-channel color histogram, the number of entries in the color histogram is determined by the number of bins used to separate the ranges. The number of entries in a color histogram can be calculated as:

(number of bins)^3
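As a concrete illustration, the sketch below builds a three-channel color histogram with a configurable number of bins per channel, matching the bin-counting scheme described above; the example patch is random data.

# Sketch: a three-channel color histogram with n_bins bins per channel,
# giving n_bins**3 entries, as described above.
import numpy as np

def color_histogram(image_bgr, n_bins=2):
    """image_bgr: HxWx3 uint8 image; returns a normalized histogram of length n_bins**3."""
    bin_width = 256.0 / n_bins
    # Map each channel value to its bin index (0 .. n_bins-1).
    bins = np.minimum((image_bgr / bin_width).astype(int), n_bins - 1)
    # Combine the three per-channel bin indices into a single histogram index.
    idx = bins[..., 0] * n_bins * n_bins + bins[..., 1] * n_bins + bins[..., 2]
    hist = np.bincount(idx.ravel(), minlength=n_bins ** 3).astype(float)
    return hist / hist.sum()

# With 2 bins per channel there are 8 entries; values 0-127 fall in the first
# bin and 128-255 in the second bin on each channel.
patch = np.random.randint(0, 256, size=(64, 32, 3), dtype=np.uint8)
print(color_histogram(patch, n_bins=2))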
Chapter 4
Methods
4.1 Overall Procedure
Color data from the sensor is used to run a generic person detector. The
generic person detector used in this thesis is Dalal and Triggs’ HOG person detector.
It generates gradient images and uses them to define the HOG descriptors which are
classified using a linear SVM classifier to provide generic person detections.
The person specific appearance model uses the color data in order to create
an appearance model. The purpose of the appearance model is to make it possible
to detect occlusions of the target person. If the target person is occluded by another
non-target detected person, the generic HOG detector and particle filter alone would
not detect the occlusion and would assume that the occluding party is the target
person. This more often than not will shift the system into thinking that it is now
following the occluding party, thereby deviating from following the actual target. By
employing a person specific appearance model, we can avoid this confusion. The
person specific appearance model must be robust to minor changes in the appearance
of the target due to pose change, but at the same time sensitive to major changes
that will set off an occlusion indicator regardless of whether or not a person has been
detected in the same area.
Scaled image coordinates and depth information create a hybrid state space
for the particle filter, which is used for tracking. The person detections generated by the generic HOG detector are compared to the propagated particle locations in
the hybrid state space. The weight of the particle is then updated using both the
indication of any person in that area and the specific appearance model to determine
the likelihood that this is the target person.
The appearance model is then used once again on the tail end of the algorithm
to conduct a final check on the target person. If the particle filter expected position
matches the specific appearance model within a given threshold then the system logs
the expected position as a positive detection. If the particle filter expected position
does not match the specific appearance model within the given threshold then it is
determined that an occlusion has occurred and the target position is not updated.
Finally, all the detection and tracking information is sent to the robot control
architecture which will determine, based on distance to the target and target position
within the frame, how to move in order to follow the target. The overall algorithm
procedure can be seen below in Figure 4.2.
4.1.1 Overall Methods Pseudocode
1. Initialize
2. Loop:
4.2 Particle Filter Tracking

The state of each particle is $(x\ y\ z)^T$.
The incoming data gives both an RGB image and a depth image so the coordinates are
in a hybrid state space with x and y in image coordinates and z in depth coordinates.
These are the state values that will be propagated for each particle and will identify
the location of the target person.
The depth coordinate z comes directly from the depth data obtained from the
RGB-D sensor. The coordinates x and y are scaled according to the depth coordinate
z. The purpose for this is to equalize the weight contribution from each state element.
4.2.2 Motion model
The motion model operates in one of two ways. If the target has been consistently found, then the motion model operates in its normal mode. In this mode, the
particles are propagated in random directions according to a ray architecture that is
set up. At each iteration, 12 separate ray directions are defined in the x and z plane
and then using a random number generator, one of the ray directions is chosen for
each particle. If you imagine the ray spread in Figure 4.3 as being top down in the
world plane then this is how the particle states x and z are propagated.
Once a ray direction has been chosen, the propagation of each particle in x
and z is then determined by adding the corresponding ray coordinate to the previous
x or z state and then adding random Gaussian noise. The y value of each particle
is simply propagated by adding random Gaussian noise to the previous state. The equations governing this propagation can be seen below in equation 4.1:

x_t = x_{t-1} + r_x + N(0, \sigma_x)
y_t = y_{t-1} + N(0, \sigma_y) \qquad (4.1)
z_t = z_{t-1} + r_z + N(0, \sigma_z)
where rx is the x ray coordinate and rz is the z ray coordinate. The reason for not
including y in this ray scheme is due to the fact that the target will move very little
in the y direction throughout a test sequence. The target will mostly be moving in
the x, z plane.
One deviation from the simple motion model comes from evaluating a particle's location in x image coordinates. When a particle moves close to an image boundary, the propagation direction is influenced such that the rays are directed away from the image boundary and the particle propagates with a larger magnitude than it generally would if it were located more centrally in the image. This is done for
three reasons. First, it is known that the particle must stay within the image bounds
and a particle state that lies outside the image bounds does not make sense. Second,
the robot control architecture is making sure that, for the most part, the target stays
centrally located in the image. Finally, since the tracking is being done indoors in
hallways, image bounds are most likely going to belong to walls or the ceiling. Setting
these conditions on particle motion simply helps to avoid re-sampling too often.
The second mode of operation for the motion model occurs when the target
has not been located for 3 frames. If this is the case, the magnitude of the particle
propagation is increased in order to spread the particles across the image more. This
was implemented to solve cases in which the particles drift off the target person and cannot reacquire the target because the propagation of each particle is too small to get back to the target location.
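The following sketch illustrates the ray-based propagation just described. The ray length, noise sigmas, and the enlarged spread used after three missed detections are placeholder values, since the thesis does not state the exact magnitudes.

# Sketch: ray-based particle propagation in the (x, z) plane with Gaussian
# noise, and a larger spread once the target has been missed for 3 frames.
# Ray length, noise sigmas, and the spread factor are placeholder values.
import numpy as np

N_RAYS = 12
RAY_ANGLES = np.linspace(0.0, 2.0 * np.pi, N_RAYS, endpoint=False)

def propagate(particles, ray_length=0.05, sigma=(0.02, 0.01, 0.02), missed=0):
    """particles: Mx3 array of (x, y, z) states. Returns propagated states."""
    M = len(particles)
    scale = 3.0 if missed >= 3 else 1.0                       # spread more when lost
    angles = RAY_ANGLES[np.random.randint(0, N_RAYS, size=M)] # one ray per particle
    rx = scale * ray_length * np.cos(angles)                  # ray component in x
    rz = scale * ray_length * np.sin(angles)                  # ray component in z
    out = particles.copy()
    out[:, 0] += rx + np.random.normal(0.0, scale * sigma[0], M)
    out[:, 1] += np.random.normal(0.0, sigma[1], M)           # y: noise only
    out[:, 2] += rz + np.random.normal(0.0, scale * sigma[2], M)
    return out

particles = np.zeros((100, 3))
particles = propagate(particles, missed=0)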
4.2.3 Observation model
In order to update the particle weights, each particle state is evaluated against each generic HOG detection in the $(x\ y\ z)^T$ state space. This on its own is not sufficient to bias particle weights toward the target person, because a weight evaluated in state space will be high as long as the particle is near any person. Thus, the individual weight contributions are scaled by an appearance-dependent scale factor. To perform this comparison, an appearance model is constructed around each particle. The appearance model of each particle is then compared to the appearance model of the target taken in the initial frame. The comparison is calculated as in equation 4.2:
\Delta d_s = e^{-\frac{\|d_s^m - d_s^{orig}\|}{2\sigma^2}} \qquad (4.2)
where $d_s^m$ is the particle appearance model, $d_s^{orig}$ is the initial appearance model, and $\Delta d_s$ represents their comparison. Therefore, a large difference in the appearance models results in a small scaling factor $\Delta d_s$. The observations for each value in the state space are then compared against all generic HOG detections. The equations for these calculations can be seen below in equation 4.3:
O_t^{m_x} = \sum_{p=1}^{P} e^{-\frac{(\phi_{t_x}^p - \phi_{t_x}^m)^2}{2\sigma^2}}

O_t^{m_y} = \sum_{p=1}^{P} e^{-\frac{(\phi_{t_y}^p - \phi_{t_y}^m)^2}{2\sigma^2}} \qquad (4.3)

O_t^{m_z} = \sum_{p=1}^{P} e^{-\frac{(\phi_{t_z}^p - \phi_{t_z}^m)^2}{2\sigma^2}}
where P is the number of generic detections and $O_t^{m_x}$ is the summed observation resulting from comparing the x state variable $\phi_{t_x}^m$ of the $m$th particle to each of the x values $\phi_{t_x}^p$ of the P generic person detections. Here, the generic term $O_t$ is taking the place of $p(\Phi_t \mid x_t)$ from Section 3.4, meaning that $O_t^{m_x}$ would represent $p(\phi_{t_x}^m \mid x_{t_x}^m)$. For each particle these observation equations are a calculated sum that takes all generic detections into consideration. This is done for all M particles.
Once the summed observations for a particle have been calculated, each particle weight is updated by multiplying the previous weight value by the sum of the observations scaled by the appearance difference $\Delta d_s$. Mathematically:

w_t^m = w_{t-1}^m \cdot \Delta d_s \cdot \left(O_t^{m_x} + O_t^{m_y} + O_t^{m_z}\right) \qquad (4.4)
From equation 4.4 it can be seen that if a particle state is not close to a generic detection location, or the particle appearance model is not similar to the initial appearance model, its weight will not be high. After all particle weights have been updated, they are normalized so that they sum to 1:

W = \sum_{m=1}^{M} w^m, \qquad w^m = \frac{w^m}{W} \quad \text{for } m = 1 \ldots M \qquad (4.5)
The normalized weights are then used to calculate the expected state, which is taken as the current expected target position. The expected state is calculated as in equation 4.6:

E_x = \sum_{m=1}^{M} x^m \cdot w^m

E_y = \sum_{m=1}^{M} y^m \cdot w^m \qquad (4.6)

E_z = \sum_{m=1}^{M} z^m \cdot w^m
These expected state values are the values returned from the particle filter and indicate the particle filter's expected location of the target person. This location will be checked for occlusion by the specific appearance model to determine whether or not an occlusion has occurred. After the occlusion check, regardless of whether or not the target is detected, a check is performed on the particles to determine if a resample is necessary. The equation for determining the necessity of a resample can be seen in Section 3.4. If a resample is necessary, it is done using the select-with-replacement method.

After the resample check, the tracking algorithm is finished and sends its findings to the robot control architecture, which responds accordingly.
1. Propagate particles

2. For all M particles:

(a) Construct the appearance model around the particle

(b) Compare the particle appearance model to the initial appearance model (Equation 4.2)

(c) Compare particle state to generic detections (Equation 4.3)

(d) Update the particle weight (Equation 4.4)

3. Normalize the weights (Equation 4.5)

4. Compute the expected state (Equation 4.6)

5. Check the expected position for occlusion

6. Re-sample if necessary
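Putting equations 4.2 through 4.6 together, a sketch of the weight update for a single frame might look as follows. The appearance-model construction is represented by a placeholder function, and the sigma values are illustrative rather than the thesis's actual parameters.

# Sketch of the particle weight update (equations 4.2-4.6): Gaussian spatial
# comparison against every generic detection, scaled by an appearance factor.
# appearance_model() is a placeholder for the depth-masked color histogram.
import numpy as np

def update_weights(particles, weights, detections, initial_hist,
                   appearance_model, sigma_obs=0.5, sigma_app=0.25):
    new_weights = np.empty_like(weights)
    for m, (state, w_prev) in enumerate(zip(particles, weights)):
        # Appearance scaling factor (equation 4.2).
        d_m = appearance_model(state)
        delta_ds = np.exp(-np.linalg.norm(d_m - initial_hist) / (2.0 * sigma_app ** 2))
        # Summed observations against all P generic detections (equation 4.3).
        diff = detections - state                                     # P x 3
        obs = np.exp(-(diff ** 2) / (2.0 * sigma_obs ** 2)).sum(axis=0)  # (Ox, Oy, Oz)
        # Weight update (equation 4.4).
        new_weights[m] = w_prev * delta_ds * obs.sum()
    new_weights /= new_weights.sum()                                  # equation 4.5
    expected_state = (particles * new_weights[:, None]).sum(axis=0)   # equation 4.6
    return new_weights, expected_state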
4.3 Generic Detector

In this algorithm, the generic detector is used to find all possible person detections in a given frame. The generic detector used in this algorithm is the HOG person detector; the OpenCV implementation was used, which performs detection at multiple scales.
In this implementation block sizes are 16x16 pixels, cell sizes are 8x8 pixels and the
size of the detection window is 64x128. This means that there are 105 overlapping
blocks per detection window. Since the purpose of the generic detector is not to locate
one single person, but instead to present multiple detections to the particle filter and
specific appearance model, the detection rate for the HOG detector is set high. This
means that after the generic person detection portion of the algorithm runs there are
many false positive detections. This, however, is not a problem for the purposes of
this algorithm, because the false positives will simply be thrown out by the particle
filter and/or appearance model. It is not the job of the generic detector to be precise in its classification of people; we simply want to ensure that there are no false negatives. The false positives are dealt with later, when deciding which detection corresponds to the actual target. Snapshots of the output from the generic detector
can be seen below in Figure 4.4.
Figure 4.4: Generic HOG detections
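A minimal example of running the OpenCV HOG person detector with a permissive threshold, so that false negatives are rare at the cost of extra false positives, is sketched below. The exact threshold and stride values used in this work are not stated here, so those shown are assumptions.

# Sketch: OpenCV's default HOG person detector run with a permissive hit
# threshold so that false negatives are rare; threshold/stride values are
# assumptions, not the thesis's exact settings.
import cv2

hog = cv2.HOGDescriptor()  # 64x128 window, 16x16 blocks, 8x8 cells by default
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame.png")
rects, scores = hog.detectMultiScale(
    frame,
    hitThreshold=-0.5,      # permissive: accept low-scoring candidates
    winStride=(8, 8),
    padding=(8, 8),
    scale=1.05,
)
for (x, y, w, h) in rects:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)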
4.4 Specific Appearance Model

The specific appearance model is the part of the algorithm that makes
distinctions between people possible. It helps create distinction between the generic
detections and provides a method for detecting occlusions.
It is used in the particle filter weight update to scale the state observations.
If the algorithm were to rely solely on the hybrid state locations of each particle for
specific target detection, we would find that the particles would often drift to other
non-target detections. If the particle filter was not scaled by this appearance factor
and relied solely on the spatial information, particles would gain weight when they
became close in position to any person detection and eventually attach themselves
to that detection regardless of their appearance similarity to the target person. This
would result in the eventual loss of the target person. The algorithm would then
assume the new person it is tracking is the target person.
It is also used as a final check on the expected target position to determine if an
occlusion has occurred. The particle filter implementation, regardless of the scaling influence the specific appearance model has on the expected state of the target, still gives an expected target position at every iteration. This means that there will
always be an expected target state coming from the particle filter and we have no
way of knowing for sure if the expected target state is the correct person or not. It
is possible that this is just another non-target person occluding our target person.
Therefore, the specific appearance model is used as a check against the expected
target state to make a final decision as to whether or not this is the correct target.
The specific appearance model is created with a color histogram. The color
histogram used in this algorithm employs 10 bins on each color channel. Each color
channel ranges in value from 0 to 255. This means that on each color channel a bin
has approximately 25.6 values assigned to it; this is of course not exact, because color channel values are integers. With 10 bins per color channel, the entire color
histogram has 1000 possible entries. Refer to section 3.5 for color histogram theory.
The purpose of this color histogram is to make a model of some window area for the
purpose of later comparison.
The initial color histogram is the color histogram that is used for the remainder
of the test period when considering possible person detections as the target detection.
It is used essentially as a template. To create the initial color histogram, the depth information from the RGB-D sensor is first used. When the user initially clicks the image to signify the target person, an average depth is calculated around this clicked area. Then, pixels inside the initial selection window that lie outside a given threshold of this average depth are not considered while creating the color histogram. The threshold used for determining whether or not a pixel is close enough to the average is 0.3 meters.
to construct the color histogram. Notice that the black areas correspond to pixels
that lie outside of the average depth area. Once the initial color histogram has been
constructed, no additional models are constructed.
Figure 4.5: Determining pixels to be used in color histogram
During the particle weight update, the initial specific appearance model (or color histogram) is used to scale the spatial information's influence on the weight.
This is done by taking the particle state of the particle currently being updated and
creating a color histogram for the bounding box that will represent this particle.
The same method is used as was used in creating the initial appearance model; that is, the average depth is found, and any pixels that lie outside the average depth value plus the threshold are not considered when making the color histogram. The color histogram of the particle state under consideration is then compared with the initial color histogram using the L2 norm. This gives the value $\Delta d_s$ from equation 4.2, which is used as the scaling factor on the weight update of the current particle in equation 4.4.
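A sketch of the depth-masked histogram construction and the comparison that produces $\Delta d_s$ is shown below. It reuses the histogram idea from Section 3.5; the 0.3 m threshold comes from the text, while the sigma value and function names are illustrative assumptions.

# Sketch: build a color histogram over a bounding box, ignoring pixels whose
# depth is more than 0.3 m from the average depth, then compare histograms
# with the L2 norm to get the scaling factor delta_ds (equation 4.2).
import numpy as np

def masked_histogram(rgb, depth, box, n_bins=10, depth_thresh=0.3):
    x, y, w, h = box
    patch = rgb[y:y + h, x:x + w].reshape(-1, 3)
    d = depth[y:y + h, x:x + w].reshape(-1)
    valid = np.isfinite(d)
    mean_depth = d[valid].mean()
    keep = valid & (np.abs(d - mean_depth) <= depth_thresh)   # 0.3 m mask
    bins = np.minimum((patch[keep] / (256.0 / n_bins)).astype(int), n_bins - 1)
    idx = bins[:, 0] * n_bins * n_bins + bins[:, 1] * n_bins + bins[:, 2]
    hist = np.bincount(idx, minlength=n_bins ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

def appearance_scale(hist, initial_hist, sigma=0.25):
    """delta_ds from equation 4.2: near 1 for similar appearance, near 0 otherwise."""
    return np.exp(-np.linalg.norm(hist - initial_hist) / (2.0 * sigma ** 2))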
Figure 4.6: Initial template for the color histogram
Figure 4.7: Expected position color histogram templates from frames in the sequence
For the final occlusion check, the same depth-masking procedure is applied to the window at the particle filter's expected position in order to create the color histogram. Once the color histogram has been created for the expected
target position, it is compared against the initial specific appearance model (color
histogram) using the L2 norm. If the resulting difference is below the threshold, ta ,
then the expected target person is assumed to be the correct target person. If the
resulting difference is not below the threshold, ta , then an occlusion is declared. The
threshold value ta is varied between 0.15 and 0.35 in execution of this algorithm. The
condition that must be satisfied can be seen below in Equation 4.7. It should be
noted here that if an occlusion is declared, then a missed detection is added to the
counter that is used by the motion model.
\left\| d_s^{expected} - d_s^{orig} \right\| < t_a \qquad (4.7)
1. Create the color histogram at the expected target position

2. Compare it to the initial color histogram using the L2 norm

3. If the difference is below the threshold ta:

• Target found

4. else

• Target occluded
4.5 Robot Control Architecture

The input to the robot control architecture is the target's last position, the target's current position, and an occlusion flag. The overall procedure of the control
architecture can be seen below in Figure 4.8.
Once the control architecture has the previous and current target positions,
there are two parameters that will instruct the robot to take no movement action
and stop. The first parameter is the occlusion flag. If it has been set, then the robot
executes a stop command. The robot does not update the last position and returns
to the tracking algorithm. However, if the occlusion flag has not been set, then the
second parameter considered is the magnitude of the movement. Since it is unlikely
that from frame to frame the target will move a very large distance, a check is set
Figure 4.8: Robot control architecture
in place to identify a large distance movement. The purpose for this check is to be
careful in the event that the tracker somehow misidentifies the target. We do not
want the robot to try to make a large movement that will set it off course for further
tracking. Therefore, if it is determined that the distance traveled from the previous
target detection is too great the robot will execute a stop command. Once again it
will not update the last position and will return to the tracking algorithm.
If neither of these flags have been set then the robot control architecture will
first determine if a turn is necessary. The need to turn is checked first for three
reasons. First, since testing is being done in a hallway, it is very rare that the robot
will ever have to turn. Second, we want to keep the target centrally located in the
image in order for the particle filter motion model to function properly. And third, if
a rotation is required, we do not want to set a rotate and forward movement command
at the same time because whichever command is given first will not be executed. The
check for a turn command is fairly simple. The control architecture checks to make sure the center of the target is within the middle three fifths of the image. If the target is outside of this central region, then, depending on whether the target is to the right or to the left of it, the robot executes a rotate right or rotate left command. Once this command has been set, the robot control architecture sets the
current target position to the previous target position and returns to the tracking
algorithm.
If a turn is not necessary then the robot evaluates the velocity at which it
needs to move forward or backward. This is done in a few steps. First the control
architecture maps the current position to the ground plane. Then, depending on
which section of the ground plane this mapped point is in, the control architecture chooses a homography. The ground plane is divided into three separate sections. The reason for this is that a given pixel distance between two points far off in the ground plane corresponds to a larger real-world distance than the same pixel distance between two points that are closer in the ground plane.
Three separate transformation homographies were calculated off-line. This
off-line calculation was done by taking pictures (at the same height as the camera
is on the robot) of a piece of cardboard with known lengths. Then the images were
brought into a program that allows the user to click on the corners of the piece of
cardboard. Matching the distances between the clicked points to the known real-world lengths allows for the construction of a homography, as discussed in Section 3.2, that will allow us to transform points from image space to real-world space.
Once the correct homography has been selected, the real world distance be-
tween the previous position and the current position is calculated and the robot
forward velocity is set such that it will move this distance. After the forward velocity command has been set, the robot control architecture sets the current target position to the previous target position and returns control over to the tracking
algorithm.
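The control logic just described can be summarized in the sketch below. The jump threshold, the central-region bounds, the homography selection, and the robot interface (stop/rotate/forward) are placeholders; in particular, the robot object stands in for the ARIA calls, whose exact interface is not reproduced here.

# Sketch of the control decision described above. Thresholds and the robot
# interface (stop/rotate/forward) are placeholders, not ARIA's actual API.
import numpy as np

def control_step(robot, prev_pos, curr_pos, occluded, image_width,
                 image_to_ground, max_jump=1.0):
    if occluded:
        robot.stop()
        return prev_pos                               # keep last known position
    jump = np.linalg.norm(np.subtract(curr_pos[:2], prev_pos[:2]))
    if jump > max_jump:                               # implausibly large motion
        robot.stop()
        return prev_pos
    u = curr_pos[0]                                   # target center column (pixels)
    if u < 0.2 * image_width:                         # outside the middle three fifths
        robot.rotate_left()
    elif u > 0.8 * image_width:
        robot.rotate_right()
    else:
        prev_xy = image_to_ground(*prev_pos[:2])      # homography chosen by depth band
        curr_xy = image_to_ground(*curr_pos[:2])
        robot.forward(np.linalg.norm(curr_xy - prev_xy))  # move this distance
    return curr_pos                                   # becomes the new last position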
4.5.1 Robot Control Pseudocode
1. Read the target's last position, current position, and the occlusion flag

2. If the occlusion flag is set: stop; do not update the last position

3. Else if the movement magnitude is too large: stop; do not update the last position

4. Else if the target is outside the central region of the image: rotate toward the target; update the last position

5. Else

(a) Map the current position to the ground plane

(b) Choose homography based on which third of the ground plane the current position is in

(c) Calculate the real world distance between the current position and last position

(d) Set the forward velocity to move this distance; update the last position
Chapter 5
The algorithm was run on multiple test sets to determine how well it would perform in different situations. There were two test environments, both of which were located indoors in areas of minimal traffic. The setup consists of the Microsoft Kinect camera and the Pioneer P3-AT mobile robot, with processing done on a Dell Studio 18 laptop with an Intel Core i7 processor. The operating system used is Ubuntu 11.04. ROS is used to connect to and retrieve information from the camera, and the ARIA library is used to control the mobile robot.
We want to evaluate the algorithm performance in three main areas: the first is scenes with no occlusion of the target, the second is scenes with minimal occlusion, and the third is scenes with large occlusions. The purpose of the three separate test areas is to address the threefold needs of a person following algorithm: (1) it needs to be established that the baseline system works, (2) it needs to be established that the algorithm can handle brief occlusions, and finally (3) the system should be able to recover the target after a long term occlusion.
5.1 Types of Images
There are several different types of images that will be seen in the following
sections. To avoid repeatedly explaining what the images represent, this section shows one example of each image type for a single frame. The six image types can be seen below in Figure 5.1.
Each image is explained in order from top left to bottom right. The first image is the depth image; it is a representation of the data provided by the depth sensor. The second image shows every HOG person detection provided by the generic person detector. The third image is the current image with the final target detection bounding box and the particle positions overlaid on top of it. The fourth image shows the particle filter expected position bounding box. The fifth image is a representation of the template that will be used to create a color histogram. And the sixth image is simply the current image with the positive target detection bounding box laid over it.
5.2 No Occlusion
The test sets for non-occlusion cases will consist of simple test cases in which
the target person is walking down the hallway in a straight path. There will be no full
or partial occlusions. There will however be other people present in the scenes. The
purpose of these tests is to evaluate the system response to baseline test cases. In this
section we want to determine the system response to different cases of non-occluding
robot/target interaction. To begin, in Figure 5.2 the initial progression of a single
target down the hall can be seen with the corresponding depth images.
In this initial run, the target individual is walking down the hall with only
one additional person present. The additional person is at a different depth. The
system performs as expected in this case and is easily able to follow the target. Next,
we want to test the system when another person appears at approximately the same
depth and relative location as the target. This case can be seen below in Figure 5.3.
It can be seen from the segment in Figure 5.3 that when an additional person
within proximity of the target appears in the scene with a similar depth, the tracker
stays with the target person and is not influenced by the sudden appearance of another person. This is due to the relatively small particle propagation that is allowed
whenever a positive target identification has occurred.
For the final non-occlusion test, the system is tested for a target change in
appearance. Only the initial template is used to create the color histogram, thus, the
integrity of this method must be tested for cases in which the target changes pose.
Below, in Figure 5.4, the initial template for the image sequence can be seen, and in Figure 5.5 the detection images for later frames of the sequence are displayed.
It can be seen that the system is indeed robust to appearance changes in the
target. Regardless of the initial color histogram template that results from the target
facing away from the camera, a side or forward pose of the target is not enough of an
appearance change to set off an occlusion flag in the appearance comparison stage of
the algorithm. It should also be noted that the appearance model is largely unaffected by the illumination from the ceiling lights. The illumination changes the color histogram template by creating streaks across the image; however, the appearance model is still robust enough not to consider this change an occlusion.

Figure 5.4: Initial color histogram template
5.3 Brief Occlusions

The previous section tested the system's robustness, showing that the target was able to be tracked in three situations: the first being a scene with multiple people in the image at different depths, the second being multiple people in the scene at similar depths, and the third being appearance changes of the target. Now we would like to test the system against brief occlusions.
In order to test the system against brief occlusions a scenario is set up such
that the target person is fully occluded for a brief period of time while walking down
the hall. To show the effectiveness of the occlusion detection, the detection images
are compared against the particle filter expected position images, depth images, and
the color histogram templates. Below in Figure 5.6 and 5.7 two separate occlusion
instances, and the systems response to these instances, can be seen. The first occlusion
instance in Figure 5.6 compares the detections against the particle filter expected
positions and the second occlusion instance in Figure 5.7 compares the detections
against the depth images and the color histogram templates (where the first image is
the initial template used to create the color histogram).
Figure 5.6: Brief occlusion instance one: particle filter expected positions
Figure 5.7: Brief occlusion instance two: depth images and color histogram templates
From these two instances, it can be seen that the system is able to handle brief
occlusions. The particle filter expected state exists in the area where the occluding
person is, however, the appearance model on the tail end of the system registers an
appearance change signifying the detected person is an occluding person and should
not be confirmed as the target person. The threshold value ta used for this image
sequence is a value of 0.15. The appearance difference is above the threshold, so
the system declares that an occlusion has occurred and overrides the particle filter
expected position. Were the appearance model not present, the occluding party would
be chosen as the target person and quite possibly end up being the new target person.
5.4 Long Occlusions

Since the testing for brief occlusions was successful, the final system test is the response to a long term occlusion. It needs to be established that the system can
reacquire the target after it has lost the target for more than three frames (which is
the standard in the brief occlusion). To conduct this test, the algorithm is run on
a sequence in which the target is occluded for ten frames before reappearing. The
detection results of this test can be seen in Figure 5.8.
This sequence shows that the system is able to recover from a long term occlusion. This is possible because the appearance model does not change after the initial frame; the low magnitude of particle propagation also contributes. The system's ability to recover from a long term occlusion is a positive aspect; however, a trade-off is made in order for the system to obtain this quality. The trade-off comes in tracking target lateral motion and will be discussed further in the next chapter.
The graphed data show that the system performs quite well overall, maintaining
roughly 75 percent intersection over union throughout the sequence. It should be
noted that not every frame from the sequence is graphed; for a total frame
breakdown, refer to Table 5.1.
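For reference, the intersection-over-union measure quoted above can be computed
for two axis-aligned bounding boxes as sketched below. The (x, y, width, height)
box format and the example numbers are assumptions for illustration only.

def intersection_over_union(box_a, box_b):
    # IoU of two axis-aligned boxes, each given as (x, y, width, height).
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two largely overlapping boxes give an IoU of roughly 0.78.
print(intersection_over_union((100, 50, 60, 120), (105, 55, 60, 120)))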
Table 5.1: Total frame breakdown

True positives      79
True negatives       9
False positives      0
False negatives     12
From this table it can be inferred that the system, though it detects every
occlusion, does make a detection trade-off. There were 12 frames in which the
target was not found even though it was visible in the scene. The reasons for this
are discussed in the conclusions chapter.
Chapter 6
Conclusions
scattered about the image, making it impossible to reacquire the target. This
method has been shown to work, but it does make the system susceptible to false
negatives. These false negatives occur when the target moves rapidly from one side
of the image to the other: because the magnitude of the particle propagation is
small, the particles cannot keep up with the rapid movement. The target expected
state then lies in a location where no person, or the wrong person, is present,
and the appearance model classifies the expected position as occluded.
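The effect can be illustrated numerically. In the sketch below, particles are
propagated with a small random-walk step (the pixel values and standard deviations
are assumed, illustrative numbers rather than parameters from this work); after
one step, no particle is anywhere near a target that has jumped across the image,
so the expected state stays at the old location.

import numpy as np

rng = np.random.default_rng(0)

# 100 particles concentrated near the target's previous horizontal position.
particles_x = rng.normal(loc=200.0, scale=5.0, size=100)

PROP_STD = 4.0          # small propagation magnitude, in pixels (illustrative)
new_target_x = 420.0    # target moves rapidly to the other side of a 640-pixel image

# One propagation step: a low-variance random walk.
particles_x += rng.normal(0.0, PROP_STD, size=particles_x.shape)

# The closest particle is still far from the new target position, so the
# expected state remains at the old location and the appearance check
# reports an occlusion (a false negative for a visible target).
print(np.abs(particles_x - new_target_x).min())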
The algorithm is also susceptible to large changes in the scale of the target. If
the target starts off large and shrinks significantly as the sequence progresses,
the system will no longer be able to detect it. This stems from the fact that the
window size used to estimate the size of the expected position is an average of
all detection windows. Although the generic HOG detector does search at multiple
scales, there is a limit to the smallest detectable scale, most likely imposed by
the training images that were used, which means that people far in the distance
are not detected. For this algorithm, that means the actual target bounding box
size is not included in the average of all detection boxes, so the expected
position bounding box is much larger than it should be and does not provide a good
template for the color histogram. The appearance model then classifies the
expected position as an occlusion and a false negative occurs.
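The sketch below illustrates this failure with made-up numbers: once the distant
target falls below the detector's minimum scale, only the nearby people contribute
detection windows, and the averaged window size is far too large for the actual
target.

import numpy as np

# Hypothetical detection window heights (in pixels) for one frame.
# The distant target (about 70 px tall) falls below the HOG detector's
# minimum scale, so only the nearby, non-target people are detected.
detected_heights = np.array([220.0, 240.0, 210.0])
true_target_height = 70.0     # never enters the average because it is not detected

# The window size used for the expected position is the average of all detections.
expected_height = detected_heights.mean()
print(expected_height)        # about 223 px, roughly three times the target's true size

# A color histogram cropped with this oversized window samples mostly background,
# so the appearance check reports an occlusion and a false negative occurs.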
Other than these two issues, the system performed well. It runs in real time,
which is a necessity for the following portion of the algorithm. In the next
section some possible system improvements are discussed.
6.1 Future Work
Areas of future work should aim to improve on the two major deficiencies of the
algorithm: the particle motion model and the scale issue. The latter could most
likely be solved by training a HOG detector that treats smaller targets as
positive human detections, while the former would entail creating a more
sophisticated motion model for the particle filter. The motion problem might also
be addressed by increasing the magnitude of particle propagation and adding more
particles; however, this would slow the system down considerably. Since the
algorithm must run in real time, increased processing time at each frame is
undesirable.
Another possible improvement would be to incorporate a target motion history. This
could be implemented by keeping a history of target positions in a top-down
coordinate system. The history could then be used to estimate the target
trajectory, which could serve as an additional occlusion check: both the
appearance model and the trajectory would be taken into consideration when
deciding whether an occlusion has occurred.
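A minimal sketch of this idea is given below, assuming a fixed-length history of
top-down (x, z) positions and a simple one-step linear extrapolation; the history
length and distance gate are illustrative choices rather than values from this
thesis.

from collections import deque
import numpy as np

class TrajectoryCheck:
    # Keeps a short history of top-down target positions and flags a candidate
    # as trajectory-inconsistent (a likely occluder) if it lies far from the
    # position extrapolated from the last two confirmed positions.

    def __init__(self, history_len=10, max_jump_m=0.5):
        self.history = deque(maxlen=history_len)
        self.max_jump_m = max_jump_m

    def update(self, position_xz):
        # Call with each confirmed target position, e.g. (x, z) in meters.
        self.history.append(np.asarray(position_xz, dtype=float))

    def consistent(self, candidate_xz):
        if len(self.history) < 2:
            return True  # not enough history to judge
        velocity = self.history[-1] - self.history[-2]
        predicted = self.history[-1] + velocity
        return np.linalg.norm(np.asarray(candidate_xz, dtype=float) - predicted) <= self.max_jump_m

# An occlusion would then be declared only when the appearance differs AND the
# candidate is inconsistent with the target's recent trajectory.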
One final addition that could be made to the algorithm is an adaptive appearance
model. To implement this, we would keep not only the initial appearance model for
comparison but also the previous four or five appearance models. This could help
with the scale issue: if the target is shrinking, it would also be smaller in the
updated appearance models, giving the algorithm a better chance of classifying the
smaller target as non-occluded.
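One way this could be prototyped is sketched below: the initial histogram is
stored alongside the last few confirmed histograms, and a candidate is accepted if
it matches any of them. The buffer size, threshold, and the idea of passing in a
histogram-difference function are illustrative assumptions.

from collections import deque

class AdaptiveAppearanceModel:
    # Keeps the initial color histogram plus the last few confirmed histograms.
    # A candidate is considered non-occluded if it is close to any stored model.

    def __init__(self, initial_histogram, difference_fn, threshold=0.15, buffer_size=5):
        self.initial = initial_histogram
        self.difference_fn = difference_fn   # e.g. a Bhattacharyya-style distance
        self.threshold = threshold
        self.recent = deque(maxlen=buffer_size)

    def matches(self, candidate):
        references = [self.initial] + list(self.recent)
        return any(self.difference_fn(ref, candidate) <= self.threshold
                   for ref in references)

    def confirm(self, candidate):
        # Call only after a detection has been confirmed as the target, so the
        # stored models gradually follow changes in scale and pose.
        self.recent.append(candidate)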