
Structural Analysis of Attributes for Vehicle Re-Identification and Retrieval

Yanzhu Zhao, Chunhua Shen, Huibing Wang, and Shengyong Chen, Senior Member, IEEE

Abstract— Vehicle re-identification plays an important role in video surveillance applications. Despite the efforts made on this problem in the past few years, it remains a challenging task due to various factors such as pose variation, illumination changes, and subtle inter-class differences. We believe that the key information for identification has not been well explored in the literature. In this paper, we first collect a vehicle dataset 'VAC21' which contains 7129 images of five types of vehicles. Then, we carefully label the 21 classes of structural attributes hierarchically with bounding boxes. To our knowledge, this is the first dataset with several detailed attributes labeled. Based on this dataset, we use the state-of-the-art one-stage detection method, Single-Shot Detection (SSD), as a baseline model for detecting attributes. Subsequently, we make a few important modifications tailored for this application to improve accuracy: 1) adding more proposals from low-level layers to improve the accuracy of detecting small objects and 2) employing the focal loss to improve the mean average precision. Furthermore, the results of the attribute detection can be applied to a series of vision tasks that focus on analyzing images of vehicles. Finally, we propose a novel regions-of-interest (ROIs) based vehicle re-identification and retrieval method in which the ROIs' deep features are used as discriminative identifiers, encoding the structure information of a vehicle. These deep features are input to a boosting model to improve accuracy. A set of experiments is conducted on the dataset VehicleID and the experimental results show that our method outperforms the state-of-the-art methods.

Index Terms— Vehicle attribute detection, deep learning, vehicle re-identification, vehicle retrieval.

Manuscript received July 23, 2018; revised December 31, 2018; accepted January 1, 2019. This work was supported by the National Natural Science Foundation of China under Grant U1509207. The Associate Editor for this paper was W. Lin. (Corresponding author: Shengyong Chen.)
Y. Zhao is with the College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China (e-mail: [email protected]).
C. Shen is with the School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia (e-mail: [email protected]).
H. Wang is with the Information Science and Technology College, Dalian Maritime University, Dalian 116026, China.
S. Chen is with the School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China, and also with the School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TITS.2019.2896273

Fig. 1. Examples of two vehicles belonging to the same type. Some subtle differences are marked by the boxes.

I. INTRODUCTION

IN THE past few years, vehicle search and re-identification (Re-ID) [1], [2] have received increasing research interest in computer vision due to their important applications in video surveillance. Vehicle re-identification, which retrieves vehicles captured by different cameras, is a challenging and important problem in intelligent transportation systems.

In terms of object re-identification, most previous works in the literature focus on human face or person identification [3]–[5]. Different from face or person re-identification, vehicle re-identification is more challenging as it is very difficult to discriminate vehicles with similar visual appearances that belong to the same model. In other words, the inter-class (inter-ID) difference can be very subtle. What is more, there even exist newly produced vehicles which look exactly the same.

Although the license plate provides a unique ID for a vehicle, it is not always reliable, as it can be easily faked or is sometimes not recognizable: for instance, the plate can be occluded or removed, or the license image can be of low resolution. As a result, appearance-based vehicle re-identification plays an important role in real-world applications.

Fortunately, even if some vehicles have similar visual appearances, each one still exhibits a 'personality'. This 'personality' demonstrates subtle differences even between two vehicles of the same type, such as windshield stickers, personalized decorations [6], etc. As shown in Fig. 1, even though the two vehicles exhibit almost the same appearance, one can observe some subtle differences.

Inspired by this, we would like to exploit this personalized information, which can be combined into a structural feature, to re-identify vehicles. The first step, as a result, is to be able to capture and encode this appearance information. In this work, we first collect a vehicle dataset and then hierarchically label 21 classes of attributes with bounding boxes, including cars (whole body), wind-shield glasses, paper boxes, etc. The dataset is therefore named 'Vehicle Attributes Classes21 (VAC21)' and its detail is illustrated in Fig. 3. We believe that these combined attributes serve

as discriminative structural features of a vehicle, with which we may be able to improve the performance of vehicle re-identification methods.

At present, Convolutional Neural Network (CNN) based detection methods [7]–[9] have demonstrated great superiority in terms of detection accuracy. With the proposed dataset, we employ the state-of-the-art one-stage detection method SSD [7] as a baseline model for vehicle attribute detection. The performance of SSD is typically very good on detecting large objects such as cars and buses. However, the vanilla SSD method often does not perform well on detecting small objects in a car, such as hangings, tissue boxes or stickers in the windscreen, and these small objects are precisely the unique signals of a vehicle. To improve the detection accuracy on small objects, we propose to add default boxes from lower layers as object proposals, because lower layers in a CNN may sculpture finer details of the input objects. Our experimental results confirm the improvement of the detection accuracy on small objects with this modification. We also employ the idea of the focal loss [10], which tackles the class imbalance in one-stage detection methods, to further improve the mean average precision (mAP). Experimental results demonstrate that the mAP can be further improved with the focal loss under different parameter settings.

The outputs from the detector allow us to select regions of interest (ROIs) and use them to re-identify vehicles. Existing works [6], [11] draw inspiration from face and person re-identification and typically use distance metric learning for vehicle re-identification and retrieval. We believe that there is, however, still much room for improvement. Different from these works, we hypothesize that adding more locally discriminative features into the identification process will improve the accuracy.

When an image is input to a deep convolutional neural network, it produces a sequence of layer activations, which can be considered as image features at different levels. The learned high-level features enjoy good generalization ability and are robust to complex environments. In this work, we extract deep features from a classification model pre-trained on vehicle classification. Then, with the results of the SSD detector, local features of the ROIs can be extracted according to the detected locations.

These ROI features, combined into a structural feature, can mark a vehicle uniquely. However, each individual ROI feature can be limited if it is used for identification on its own. Moreover, a naive combination such as averaging the results may not be able to fully exploit the power of these features. Thus an effective methodology is needed to capture the complex feature dependencies. To this end, we employ gradient tree boosting [12], [13] to combine the weak classifiers (individual ROI features) into a strong classifier. The overall structure of our ROIs-based method for re-identification is shown in Fig. 5.

In the experiments, with the proposed method we show two applications: vehicle retrieval and re-identification on the large-scale dataset VehicleID [6]. Experimental results demonstrate that our method outperforms other state-of-the-art approaches.

In summary, our main contributions are as follows.
1) We collect a vehicle dataset VAC21 and carefully label the images for 21 classes of attributes. With these attributes we are able to learn discriminative features of a vehicle that are useful for vehicle identification. To our knowledge, this is the first dataset with so many fine attributes labeled.
2) With the constructed dataset, we use the one-stage detection method SSD as a baseline model for attribute detection. To improve the accuracy of small object detection, we make modifications to the vanilla SSD method: namely, we show that adding proposals from low-level layers helps improve the accuracy on small objects, and employing the focal loss improves the mAP too. Furthermore, the results of attribute detection can be widely applied to tasks of vehicle analysis.
3) We propose a novel method in which the ROIs' deep features are aggregated into powerful feature representations. The aggregated feature representations boost the performance of vehicle re-identification and retrieval. Experiments are conducted on vehicle retrieval and re-identification on the dataset VehicleID, demonstrating that our method outperforms the state-of-the-art methods.

II. RELATED WORK

Most previous works concentrate on vehicle model classification [14], [15], which recognizes vehicle models rather than the more challenging identities. Recently, vehicle re-identification has gained increasing attention.

Liu et al. [16] proposed a vehicle re-identification dataset VeRi which contains over 40,000 images of 619 vehicles captured by 20 cameras in unconstrained traffic scenes. Based on the dataset, they proposed an appearance-based method that combines low-level features and high-level semantic information extracted by deep neural networks. Later, the works of [17], [18], and [19] attempted to re-identify vehicles by introducing complex spatio-temporal information into appearance-based procedures.

Liu et al. [6] presented a new large-scale vehicle re-identification dataset VehicleID which contains 26,267 different vehicles with over 222,000 images collected from real surveillance cameras and labeled at the identity level. The large scale of the dataset facilitates the training of recent deep learning models on the tasks of vehicle identification. Furthermore, they proposed a two-branch DCNN to learn the relative distance between two different vehicles. Li et al. [11] proposed a deep joint discriminative learning model for vehicle re-identification. The model uses a DCNN which combines several different tasks, including identification, attribute recognition, verification and triplet tasks, to extract discriminative representations for vehicle images.

Different from these works, we try to deal with vehicle re-identification with the help of features of user behaviors. Thus a detector is needed to extract all the personalized features first. In this work, we collect a vehicle dataset VAC21 whose images come from VOC2007 and

surveillance cameras. Then we carefully label 21 classes of attributes hierarchically using bounding boxes. As far as we know, this is the first dataset with so many fine attributes labeled.

A number of results have been reported in the literature on the tasks of vehicle detection, vehicle plate detection and vehicle logo detection. Traditional methods of vehicle detection mainly use information on symmetry [20], [21], colors [22], geometrical features [23] and texture [24]. Recently, DCNNs have shown great superiority in classification and detection. Fan et al. [25] applied Faster R-CNN [9], a method for generic object detection, improved its performance on vehicle detection, and achieved competitive results on the KITTI vehicle dataset.

As for vehicle plate detection, Wu and Li [26] proposed an approach to detect the license plate by selecting the optimal frame. Chang et al. [27] detected license plates based on the geometric features of license plates, i.e., shape, symmetry, height-to-width ratio, color and textures. Gerber and Chung [28] designed a multi-CNN approach to detect license plates for mobile devices: a CNN is used to verify cars, and then the output is provided to a subsequent supervised CNN for vehicle plate detection.

On vehicle logo detection, the most popular methods [29]–[31] follow a hierarchical approach that first identifies salient objects, such as the vehicle license plate or headlights, and then outlines a rough area that potentially contains the logo. Furthermore, a coarse-to-fine scan is performed to extract the logo area using information such as symmetry and edge statistics in the image.

As described above, all these detection tasks on vehicles are tackled separately. To capture all the vehicle attributes together, we follow the idea of generic object detection. Among the set of object detection methods, deep learning based ones have shown great superiority, e.g., SSD [7], Mask R-CNN [8], Faster R-CNN [9], R-FCN [32] and YOLO [33]. The network architectures can be divided into single-shot [7], [33] and two-stage region-proposal based ones [8], [9], [32]. Among these methods, SSD is among those with the best accuracy while being very efficient in processing time [34].

As attribute detection is a pre-processing procedure whose results can be fed to further, more advanced tasks, in this work we therefore choose the state-of-the-art detection method SSD [7] as our detection baseline for the reasons mentioned above. Furthermore, to improve the accuracy of small object detection, we add low-level layers for default box proposals. We also employ the focal loss [10] to improve the mAP.

After training the detector, we propose an ROIs-based vehicle re-identification and retrieval method. Experimental results demonstrate that the ROI features improve the results on vehicle re-identification and retrieval.

III. PROPERTIES OF VAC21

The VAC21 dataset contains 7129 images captured from two different scenarios. Among them, 2256 images were collected from the VOC2007 dataset [35] and contain the vehicle label, and the remaining ones were captured during both daytime and nighttime by multiple real-world surveillance cameras distributed in several cities in China. The collected images include five types of common vehicles, namely cars, buses, trucks, trains and tricycles, with different viewpoints, i.e., front, side and rear.

Based on this dataset, we carefully label 21 classes of attributes with bounding boxes. These attributes are hierarchically labeled to capture different levels of detail. Specifically, they generally fall into three different levels: the vehicle whole-body level, in which the whole body of a vehicle is labeled and described; the vehicle model level, such as the logo or the car light, in which the features usually depict the vehicle model; and the personalized level, such as hangings or stickers (signs of yearly service), in which the attributes describe the user personalization and can be used to reveal the subtle differences between vehicles.

More specifically, these attributes include annual service signs ('anusigns', in abbreviation), back mirror ('backmirror'), bus ('bus'), car ('car'), car light ('carlight'), carrier ('carrier'), car top window ('cartopwindow'), entry license ('entrylicense'), hanging ('hungs'), lay ornament ('layon'), light cover ('lightcover'), logo ('logo'), newer sign ('newersign'), tissue box ('paperbox'), plate ('plate'), safe belt ('safebelt'), train ('train'), tricycle ('tricar'), truck ('truck'), wheel ('wheel'), and wind-shield glass ('windglass'). Examples of annotated images are illustrated in Fig. 2, in which all of the classes mentioned above are displayed. As shown in the figure, the attributes are not completely independent of each other: some of them are included inside others. For instance, the annual service signs and the tissue box are both included in the wind-shield glass, while the two attributes are independent of each other. Thus these attributes are hierarchically structured by this geometric relationship.

In the VAC21 dataset, several classes of attributes are labeled in each image. The quantity of each attribute in the dataset is shown in Table I. Note that the numbers of attributes are unevenly distributed among the different classes, because the attributes of each car are usually unevenly distributed in the real world, especially the attributes that are produced by user behaviors, such as the annual service signs, hangings, lay ornaments, etc.

To significantly reduce background noise, we automatically crop the vehicles from the background in VAC21. As the locations of the attributes are labeled in the images of VAC21, it is easy to crop the vehicles automatically using the known coordinates. In this way a new dataset VAC21_S is produced. Notice that some images contain more than one vehicle, so VAC21_S contains more images than VAC21: the number of images in VAC21_S becomes 12,071, but the number of vehicles is the same as in VAC21. Examples from VAC21_S are also shown in Fig. 2. To facilitate further study, we will make the dataset publicly accessible.

Based on this labeled dataset, a detection model can be trained. As the attributes contain different levels of detail, this dataset can be useful for a number of applications related to vehicle analysis such as vehicle detection [36], vehicle tracking [37], [38], fine-grained car recognition [14],

Fig. 2. Examples of the 21 vehicle attributes. VAC21 is shown on the left side and VAC21_S on the right.

TABLE I
Quantity Distribution of the Labeled Attributes

vehicle model recognition [39], vehicle retrieval or re-identification [1], [2], vehicle logo recognition [31], [40], and vehicle plate recognition [27], etc.

These applications require different degrees of structural detail. For instance, vehicle tracking only needs to obtain the location of the vehicle. However, vehicle model recognition requires more details such as the logo, wind-shield glass, etc., in order to recognize the specific series of vehicles produced by the manufacturer. Furthermore, vehicle re-identification needs more personalized information, which is produced by user preference. Therefore the applications can be hierarchically structured according to their requirements on detail. As both the attributes and the applications are hierarchically structured, there exists a corresponding relation between them. The corresponding applications for different vehicle attribute detections are illustrated in Fig. 3. Among those tasks, vehicle retrieval and re-identification in traffic surveillance systems are very challenging and have not yet been well explored. Next, we show how to aggregate the detected attributes to form a powerful representation for the task of re-identification and retrieval.

Fig. 3. 21 classes of vehicle attributes. Different groups of attributes can be applied to a few vision applications of vehicle analysis.

IV. ATTRIBUTE DETECTION

A. Single-Shot Detection

With the proposed dataset, we perform vehicle attribute detection using the single-shot method SSD [7]. SSD is a single feed-forward convolutional neural network directly predicting classes and anchor offsets without requiring region proposal generation. The architecture is shown in Fig. 4.

The overall objective loss function is a weighted sum of the localization loss (loc) and the classification confidence loss (conf). Let x_{ij}^{d} \in \{1, 0\} be an indicator for matching the i-th default box to the j-th ground truth box (g) of category d, with \sum_i x_{ij}^{d} \geq 1. The overall loss function is represented as

L(x, c, l, g) = \frac{1}{N} \big( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \big) \qquad (1)

where N is the number of matched default boxes. If N = 0, the loss is set to 0. The localization loss is a Smooth L1 loss [41] between the predicted box (l) and the ground truth box (g). The confidence loss is the softmax loss over the class confidence (c):

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{d} \log(\hat{c}_{i}^{d}) - \sum_{i \in Neg} \log(\hat{c}_{i}^{0}), \quad \hat{c}_{i}^{d} = \frac{\exp(c_{i}^{d})}{\sum_{d} \exp(c_{i}^{d})} \qquad (2)

and the weight term \alpha is set to 1 in our experiment.

Fig. 4. The rough structure of SSD.

As we know, SSD improves detection performance by using default boxes at different scales that come from different output layers. For the same size of tiling box, the layers from low to high can capture different sizes of objects. The lower layers typically contain more information about small objects, while the higher layers contain more semantic information but lose information about small objects due to the lower spatial resolution of their feature maps. Different resolutions encode different levels of features. Previous works have shown that using feature maps from lower layers can improve semantic segmentation quality, because the lower layers sculpture finer details of the input objects. Considering the fact that there are many small and obscure (because they are usually located behind the window glass) objects which are important for vehicle re-identification in our dataset, we attempt to add default boxes from the lower output layers to capture small objects.

B. SSD With Focal Loss

SSD, as a one-stage detector, shows great superiority in processing time. However, its accuracy is suppressed since the class imbalance cannot be addressed as well as in R-CNN-like detectors, which use a two-stage cascade and sampling heuristics [10]. During training, most of the samples are negative and come from the background. Thus the samples with category information may not be trained sufficiently, and their capacity may be weakened by the distraction from the large number of negative samples. In this work, to improve the performance, we employ the focal loss [10] in the SSD architecture to combat the imbalance between the foreground and background classes during training.

The focal loss is designed based on the cross entropy (CE) loss CE(p, y) = CE(p_t) = -\log(p_t), where y \in \{\pm 1\} specifies the ground-truth class and p \in [0, 1] is the model's estimated probability for the class with label y = 1. Here p_t is defined as

p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \qquad (3)

Then the focal loss is defined as

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (4)

wherein \alpha_t \in [0, 1] is a weighting factor that balances the importance of positive and negative examples, and (1 - p_t)^{\gamma}, with tunable focusing parameter \gamma \geq 0, is a modulating factor for easy and hard examples.

As described in Section IV-A, the overall loss of SSD, L(x, c, l, g), is a weighted sum of the localization loss (loc) and the confidence loss (conf). Herein the confidence loss, which is the softmax loss over the class confidence, can be replaced with the focal loss over the class confidence:

L^{1}_{conf}(x, c) = -\sum_{i \in Pos}^{N} \alpha_t x_{ij}^{d} (1 - \hat{c}_{i}^{d})^{\gamma} \log(\hat{c}_{i}^{d}) - \sum_{i \in Neg} \alpha_t (1 - \hat{c}_{i}^{0})^{\gamma} \log(\hat{c}_{i}^{0}), \quad \hat{c}_{i}^{d} = \frac{\exp(c_{i}^{d})}{\sum_{d} \exp(c_{i}^{d})} \qquad (5)

Finally, the training loss is the sum of the focal loss over the class confidence and the standard Smooth L1 loss.
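To make the replacement concrete, the following is a minimal NumPy sketch of the focal confidence term in Eq. (5). The function name, the label encoding (class index 0 as background) and the numerical-stability epsilon are our own assumptions for illustration, not the paper's released code.

```python
import numpy as np

def focal_conf_loss(logits, labels, alpha_t=0.25, gamma=1.0):
    """Sketch of the focal confidence loss of Eq. (5).

    logits: (num_boxes, num_classes) raw class scores c_i^d per default box.
    labels: (num_boxes,) matched class index per box; index 0 is assumed
            to be the background class for negative boxes.
    """
    # Softmax over classes gives the estimated probabilities c-hat_i^d.
    shifted = logits - logits.max(axis=1, keepdims=True)
    c_hat = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # p_t: probability assigned to each box's ground-truth class
    # (the background column for negative boxes), as in Eq. (3).
    p_t = c_hat[np.arange(labels.size), labels]
    # (1 - p_t)^gamma down-weights easy examples; alpha_t balances the
    # few positives against the many background boxes.
    return float(np.sum(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```

With gamma = 0 and alpha_t = 1 this reduces to the standard softmax cross entropy of Eq. (2), which serves as a quick sanity check for an implementation.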
V. VEHICLE RE-IDENTIFICATION

The attribute detectors can help us to easily fetch the related ROIs. Since different ROIs correspond to different levels of detail (namely the vehicle whole-body level, the vehicle model level or the personalized level), combining them together will generate a powerful representation of an individual vehicle. Motivated by this observation, we propose a method to structurally aggregate these ROI features to perform vehicle re-identification and retrieval.

Our overall framework is shown in Fig. 5. Given two vehicle images, each one's ROIs are fetched by the detector at first. At the same time, we train a DCNN for vehicle identity classification and extract vehicle feature maps from the trained model. Then each ROI feature can be extracted according to the mapping relationship between the image and its feature maps. Finally all the ROI features are concatenated into a multi-dimensional vector.

To find out whether two vehicle images belong to the same one, we do not simply use the widely used L2 distance. Instead, we train an effective model to capture the complex data dependencies and learn the model of interest from the ROI features in a scalable way. In our method, gradient tree boosting is used to combat this difficulty because of its effectiveness.

A. Feature Extraction

As shown in Fig. 5, the features of the vehicle data are learned by the DCNN. Here the model is trained with a residual network for vehicle classification.

Fig. 5. The structure of the ROIs-based method for re-identification. To find out if two vehicle images belong to the same vehicle ID, each image's ROIs are fetched by the trained SSD detector at first. Then a DCNN for vehicle ID classification is trained and used to extract ROI features. Finally, a boosting model is trained to aggregate the ROI features, and the similarity between the two images can be measured using the boosting classification score.

Then a set of 2-D feature maps is extracted by feeding the data forward into the model. The feature extraction layer is formally represented as an order-3 tensor T with h × w × d elements, wherein the feature maps consist of d 2-D feature maps of size h × w. These maps encode rich local visual and spatial information which is beneficial to represent ROI characteristics.

In order to derive a representation for each ROI, we first search for its corresponding region (y_i, x_i, h_i, w_i) in the feature map and then pool all local features falling into that region into a d-dimensional vector by average pooling. Note that the ROI feature is set to 0 if the corresponding ROI is undetected. Finally, we concatenate all the n ROI features in order into an n·d-dimensional vector (represented as r) as the representation of the whole image.
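The mapping from a detected box to its pooled d-dimensional vector can be sketched as follows. This is an illustration under our own assumptions (a single feature map and boxes given in absolute image coordinates); it is not the paper's released code.

```python
import numpy as np

def roi_feature(feature_map, roi, img_h, img_w):
    """Average-pool the feature-map region under one detected ROI.

    feature_map: (d, h, w) activations from the trained classification CNN.
    roi: (x1, y1, x2, y2) box in image coordinates, or None when the
         attribute was not detected (the feature is then all zeros).
    """
    d, h, w = feature_map.shape
    if roi is None:
        return np.zeros(d)
    x1, y1, x2, y2 = roi
    # Project image coordinates onto the coarser feature-map grid,
    # keeping at least one cell in each direction.
    c1, r1 = int(x1 / img_w * w), int(y1 / img_h * h)
    c2 = max(int(np.ceil(x2 / img_w * w)), c1 + 1)
    r2 = max(int(np.ceil(y2 / img_h * h)), r1 + 1)
    # Average pooling over the region yields the d-dimensional vector.
    return feature_map[:, r1:r2, c1:c2].mean(axis=(1, 2))

def image_representation(feature_map, rois, img_h, img_w):
    # Concatenating the n ROI vectors in a fixed order gives the
    # n*d-dimensional representation r of the whole image.
    return np.concatenate([roi_feature(feature_map, r, img_h, img_w)
                           for r in rois])
```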
B. Boosting

Gradient tree boosting [12], [13] is employed to learn the feature dependencies between the ROIs. The details of gradient tree boosting and its learning objective are described below.

Let D = \{(x_i, y_i)\} (|D| = n, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}) represent a data set with n samples and m features. A tree ensemble model learns K additive functions to estimate the output:

\tilde{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (6)

where \mathcal{F} = \{f(x) = w_{q(x)}\} (q: \mathbb{R}^m \to T, w \in \mathbb{R}^T) represents the search space of regression trees (also known as CART), q is a tree that maps an example to its corresponding leaf, and T indicates the total number of leaves of the tree. Each f_k represents an independent tree that is described by q and its leaf weights w; w_i is a continuous value representing the score on the i-th leaf of the tree. Learning is performed by minimizing the following regularized objective:

\mathcal{L}(\phi) = \sum_{i} l(\tilde{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \qquad (7)

Here l is a differentiable convex loss function measuring the disparity between the estimation \tilde{y}_i and the ground truth y_i. \Omega is the regularization term that penalizes the complexity of the model and smooths the learnt weights to avoid overfitting. The goal of the regularization term is to select a simpler model among all predictive functions.

Instead of being optimized using traditional optimization methods, the ensemble model in Eq. (7) is trained in an additive manner. In particular, let \tilde{y}_i^{(t)} be the prediction output of the i-th instance at the t-th iteration; the next training iteration will then add f_t to minimize the following function:

\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \tilde{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \qquad (8)

This means the f_t that improves the objective in Eq. (7) the most is selected and greedily added to the model. The objective can be quickly optimized using a second-order

approximation:

\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \big[ l(y_i, \tilde{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t) \qquad (9)

where g_i = \partial_{\tilde{y}^{(t-1)}} l(y_i, \tilde{y}_i^{(t-1)}) and h_i = \partial^2_{\tilde{y}^{(t-1)}} l(y_i, \tilde{y}_i^{(t-1)}) are the first- and second-order gradients of the loss function. If the constant terms are removed, the objective at step t can be simplified to:

\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t) \qquad (10)

VI. EXPERIMENTS

In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed methods, wherein attribute detection is carried out first and then the detection results are applied to vehicle re-identification.

A. Attribute Detection

We conduct a series of experiments using the datasets VAC21 and VAC21_S with different input image sizes of 300 × 300 and 500 × 500 pixels. The implementation is done using Caffe with the backbone network VGG. On VAC21, 4940 images are used for training and validation and 2189 images are used for testing. On VAC21_S, 8283 images are used for training and validation and 3788 images are used for testing. Moreover, we also compare the models with the two-stage detector Faster R-CNN.

1) Implementation Details: SSD: All the training follows the methodology of [7] closely with extensive geometric data augmentation, i.e., crop, mirror and warp operations. There is a small difference between input sizes 300 × 300 and 500 × 500. For 300 × 300, both locations and confidences are predicted from conv4_3 to conv11_2, with the smallest default box scale s_min = 0.2, the biggest default box scale s_max = 0.9, and the scale of the default box on conv4_3 set to 0.1. The 500 × 500 model adds an extra conv12_2 for prediction, with s_min set to 0.15, s_max to 0.9, and 0.07 on conv4_3.

a) SSD with lower layers: All the training also follows the methodology of [7] closely with extensive geometric data augmentation using crop, mirror and warp operations. We add default boxes on layer conv3_3 and adjust the default box tiling to keep the total number of boxes similar to the original. There is also a small difference between input resolutions 300 × 300 and 500 × 500. For 300 × 300, both locations and confidences are predicted from conv3_3 to conv11_2, with s_min = 0.15, s_max = 0.95 and 0.07 on conv3_3. Same as above, 500 × 500 adds an extra conv12_2 for prediction, with s_min set to 0.15, s_max to 0.95, and 0.07 on conv3_3.

b) SSD with focal loss: Two groups of α and γ are trained respectively. The combinations α = 0.25, γ = 1 and α = 0.25, γ = 2 are reported effective in [10]. The training is similar to SSD.
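The default-box tiling in the SSD variants above follows the linear scale rule of SSD [7], s_k = s_min + (s_max − s_min)(k − 1)/(m − 1). The sketch below, including the exact number of prediction layers and the idea of prepending one extra small scale for the added conv3_3 layer, is our own illustration of how such an adjustment can fit that rule; it is not the authors' configuration file.

```python
def default_box_scales(num_layers, s_min=0.15, s_max=0.95, s_extra=0.07):
    """Default-box scales for the prediction layers (SSD tiling rule).

    num_layers: prediction layers tiled regularly between s_min and s_max.
    s_extra: a smaller scale for the added low-level layer (conv3_3 here)
             so that tiny attributes such as stickers get proposals.
    """
    scales = [s_min + (s_max - s_min) * k / (num_layers - 1)
              for k in range(num_layers)]
    return [s_extra] + scales

# e.g. conv3_3 plus eight regular prediction layers for the 300x300 model
print(default_box_scales(num_layers=8))
```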
c) Faster R-CNN: The training follows the methodology of [9] closely. The batch size for RPN training is 256 and the batch size for box classifier training is 32. The model is trained with stochastic gradient descent (SGD) with momentum 0.9. The initial learning rate is 10^{-3} and is reduced by a factor of 10 after 60k iterations and by another factor of 10 after 120k iterations.

2) Experimental Results and Analysis: The experimental results are shown in Table V. It meets our expectation that the larger input size 500 × 500 shows better performance than 300 × 300. For example, on dataset VAC21, the mAP of SSD with lower layer at 500 × 500 is 0.9% higher than that at 300 × 300, and the mAP of SSD with focal loss at 500 × 500 is 1.8% higher than that at 300 × 300 for α = 0.25, γ = 2.

The mAPs of SSD with lower layer are close to those of SSD at the two input sizes, but the average precisions (APs) of small object detection are improved, which is expected. For example, for detection of the category 'anusigns' at input resolutions 500 × 500 and 300 × 300, SSD with lower layer is 2.4% and 3.4% higher than SSD, and 1.9% and 2% higher than SSD with focal loss at α = 0.25, γ = 1, respectively.

For category 'paperbox' detection at input resolutions 500 × 500 and 300 × 300, SSD with lower layer is 3.9% and 15.9% higher than SSD, and 1.9% and 13.6% higher than SSD with focal loss at α = 0.25, γ = 1, respectively. Similarly, this tendency also holds on VAC21_S.

The mAPs of SSD with focal loss under the two parameter settings are close (within 0.5%) and both are higher than SSD. For instance, at input sizes 300 × 300 and 500 × 500, SSD with focal loss at α = 0.25, γ = 1 is 1% and 1.4% higher than SSD, respectively.

In addition, SSD with focal loss achieves the best mAP of 65.5% at input resolution 500 × 500 with α = 0.25, γ = 1, which is 1.4% higher than SSD, 1.5% higher than SSD with lower layer, and 4.4% higher than Faster R-CNN.

On the other hand, VAC21_S shows the same tendency as VAC21. Adding the lower layer helps to improve the average precision of small objects such as 'anusigns', 'paperbox', 'hungs', etc. The focal loss tackles the problem of class imbalance well and helps to improve the mAP.

Performance on VAC21_S is better than that on VAC21. One reason is that the cluttered background is removed and the interference is reduced. The other is that the vehicle becomes bigger when the VAC21_S images are resized to the same size as VAC21. For example, on VAC21_S with input size 500 × 500, the mAP of SSD with focal loss at α = 0.25, γ = 2 is 72.3%, which is 6.9% higher than on VAC21.

Notably, all the APs for the category 'newersign' are 0, primarily due to the lack of training data: as shown in Table I, only seven images of 'newersign' are labeled. But the models perform well on larger objects. Taking category 'car' detection as an example, at input resolution 500 × 500, SSD with focal loss at α = 0.25, γ = 1 achieves an AP of 91.3% on VAC21, and the AP of SSD with focal loss at α = 0.25, γ = 2 rises to 96.4% on VAC21_S.

Some detection results on VAC21 and VAC21_S with input size 500 × 500 are shown in Fig. 6. Furthermore, on our

Fig. 6. Detection examples on VAC21 and VAC21_S with input resolution 500 × 500. The top two rows are the results of SSD, the middle two rows are the results of SSD with lower layer, and the last two rows are from SSD with focal loss.

datasets, the performance of SSD is significantly better than that of two-stage state-of-the-art methods such as Faster R-CNN.

B. Vehicle Re-Identification

1) Dataset: In this section, we evaluate our method on the vehicle re-identification and retrieval tasks. We use the recently released large-scale vehicle dataset VehicleID [6], which contains 221,763 images from 26,267 vehicles. VehicleID is split into a training set with 110,178 images of 13,134 vehicles and a testing set with 111,585 images of 13,133 vehicles. Following the protocols in [6], we use the three testing sets of different sizes (i.e., small: 800, medium: 1600 and large: 2400) for the vehicle retrieval and re-identification tasks. For each re-identification test set (small, medium and large), an image from each vehicle is randomly selected into the gallery set, and all the other images are query images. For each retrieval test set (small, medium and large), max(6, N_i − 1) images of each vehicle are randomly selected into the gallery set, and the rest are put into the probe set. The evaluation metrics are top-1 and top-5 accuracy and Cumulative Matching Characteristic (CMC) curves.
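A sketch of this gallery/probe construction is given below. The function and the data layout are our own assumptions made for illustration, mirroring the protocol of [6] as described above.

```python
import random

def vehicleid_split(images_by_id, retrieval=False, seed=0):
    """Build gallery and probe sets for one VehicleID test set.

    images_by_id: dict mapping a vehicle ID to its list of image paths.
    Re-identification: one random image per ID enters the gallery and
    the rest are queries. Retrieval: max(6, N_i - 1) images per ID
    enter the gallery (following the protocol described above) and
    the rest form the probe set.
    """
    rng = random.Random(seed)
    gallery, probe = [], []
    for vid, imgs in images_by_id.items():
        imgs = list(imgs)
        rng.shuffle(imgs)
        k = max(6, len(imgs) - 1) if retrieval else 1
        gallery += [(vid, p) for p in imgs[:k]]
        probe += [(vid, p) for p in imgs[k:]]
    return gallery, probe
```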
Fig. 7. Detection examples on VehicleID with input size 500 × 500. The top and bottom rows are the results of SSD with focal loss and SSD with lower layer, respectively.

2) Implementation Details: The ROIs are mainly fetched using the SSD with focal loss model (500 × 500, α = 0.25, γ = 1), except for the small objects, that is, 'anusigns', 'entrylicense', 'hungs' and 'paperbox', which are detected by the SSD with lower layer model (500 × 500). Some detection results on VehicleID are shown in Fig. 7.

The vehicle classification model for feature extraction is trained with the training set of VehicleID, that is, 110,178 images of 13,134 vehicles. The network is a modified

ResNet [43], a 28-layer residual network that consists of five convolutional layers, eleven building blocks and one fully-connected layer. We implement our network in Caffe [42]. During training, the input images are randomly cropped to 224 × 224 from the resized image (the short edge is resized to 256) and randomly mirrored horizontally. Moreover, the images are shuffled and the corresponding batches are generated.

TABLE II
mAP of the Vehicle Re-Identification Task for Different Combinations of ROIs

Fig. 8. The CMC curves of our proposed method with the combination of six ROIs on the VehicleID dataset (test size: 800).

TABLE III
Comparison of Different Methods for the Vehicle Re-Identification Task

TABLE IV
Comparison of Different Methods for the Vehicle Retrieval Task (Top 1)
The trained model is employed as a feature extractor for extracting the features of the test images. The ROI features of each image can then be extracted according to the detected locations. For both vehicle retrieval and re-identification, the normalized feature representations of the images are extracted in both the gallery and probe sets.

To measure the similarity between two images, we use the boosting score instead of the L2 distance. XGBoost [13] is used to learn the weights of the different features. First, we conduct an exploratory experiment on different combinations of ROIs to find out which regions of interest are effective for identification. To narrow down the scope, we sort the ROIs by their value for identification. Specifically, we choose 800 vehicles from the training data and extract ROI features for them. Following the protocols in [6], an image of each vehicle is randomly selected into the gallery set, and all the other images are query images. For each ROI, we conduct vehicle re-identification using the L2 distance, and sort the ROIs by the mAP. The ranking is listed as follows: 'vehicle' (mainly includes 'car', 'bus' and 'truck'), 'windglass', 'logo', 'anusigns', 'carlight', 'hungs', 'paperbox', 'layon', 'entrylicense', 'safebelt', 'backmirror', 'newersign', 'lightcover', 'cartopwindow', 'carrier', 'wheel'.

Then we simply combine them in this order. For each image, all of the ROI features are concatenated simply. During training, for each pair of vehicle images (r_i, r_j) we extract features as the absolute value of the subtraction, |r_i − r_j|, or as the concatenation of the absolute value of the subtraction with the element-wise multiplication, (|r_i − r_j|, r_i ⊙ r_j). The corresponding label is y ∈ {0, 1}.

Training is done on 400,000 examples, with 200,000 positives and 200,000 negatives, which are randomly selected from the training set. The objective is binary logistic with learning rate 0.01, eta 0.1 and a maximum tree depth of 7. We use a GPU to accelerate the training. The similarity between two vehicles is measured by the boosting score.
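The training setup just described can be sketched with the XGBoost Python API as follows. The file names and the number of boosting rounds are placeholders we introduce for illustration; the remaining hyper-parameters follow the text above.

```python
import numpy as np
import xgboost as xgb

def pair_features(r_a, r_b):
    # 'sub-multi' variant: |r_a - r_b| concatenated with r_a * r_b.
    return np.concatenate([np.abs(r_a - r_b), r_a * r_b])

# X holds pair features for the 200,000 positive and 200,000 negative
# pairs sampled from the training set; y is 1 for same-ID pairs, else 0.
X = np.load("pair_features.npy")  # hypothetical file
y = np.load("pair_labels.npy")    # hypothetical file

params = {
    "objective": "binary:logistic",  # boosting score in [0, 1]
    "eta": 0.1,
    "max_depth": 7,
    # The paper trains on GPU; in recent XGBoost this would be
    # enabled with "device": "cuda".
    "tree_method": "hist",
}
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=300)

# At test time, the similarity of two vehicles is the booster's score
# on the pair feature built from their concatenated ROI representations.
```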
3) Experimental Results and Analysis:
a) Vehicle re-identification: We evaluate our method on the vehicle re-identification task following the mAP protocol widely used in re-identification. The same evaluation and split strategy as in [6] are applied to the three testing image sets of different sizes (i.e., test size = 800, 1,600 and 2,400).

Table II illustrates the results of the different combinations. Here the number stands for the number of factors added, 'sub' represents the absolute value of the subtraction, and 'sub-multi' represents concatenating the absolute value of the subtraction with the

TABLE V
Detection Results Based on VAC21 and VAC21_S

element-wise multiplication. For comparison, we also conduct re-identification using the L2 distance.

The experimental results demonstrate that re-identification with boosting outperforms the L2 distance. Namely, for test size 800, top-1, when one ROI is selected, boosting with the absolute value of the subtraction is 1.2% higher than the L2 distance, and boosting with the absolute value of the subtraction and the element-wise multiplication is 2% higher than the L2 distance. When six ROIs are chosen, boosting with the absolute value of the subtraction is 7.9% higher than the L2 distance, and boosting with the absolute value of the subtraction and the element-wise multiplication is 8.8% higher than the L2 distance. This shows that our ROIs-with-boosting method aggregates the ROI feature dependencies and the model of interest well, which helps to improve the mAP.

With an increasing number of ROIs, re-identification by the L2 distance becomes worse, while our ROIs-with-boosting method shows a tendency of first improving and then degrading. For example, with test size 800, top-1, when the number of ROIs increases from 1 to 15, the mAP of L2 decreases from 0.7 to 0.655, while boosting with the absolute value of the subtraction and the element-wise multiplication increases from 0.72 to 0.761 and then decreases to 0.735. The maximum mAP of boosting comes when the number of ROIs is 6, which demonstrates that not all the ROIs are effective for re-identification. Following the order of ROIs mentioned before, the six ROIs are 'vehicle', 'windglass', 'logo', 'anusigns', 'carlight' and 'hungs'. Here 'vehicle' and 'windglass' contain comprehensive vehicle information; 'carlight' and 'logo' contain the information of the vehicle model; 'anusigns' and 'hungs' depict personalized information.

However, with the addition of more personalized information, such as 'paperbox', 'layon', 'entrylicense' and 'newersign', etc., the mAP of re-identification decreases. The main reason may be that the detection model does not perform well enough for these ROIs, and also that the numbers of these ROIs are quite limited, so their effects are weakened.

In addition, boosting with the absolute value of the subtraction and the element-wise multiplication performs better than boosting with the absolute value of the subtraction alone. This shows that the element-wise multiplication keeps more effective information.

The CMC curves of the combination of six ROIs, concatenating the absolute value of the subtraction with the multiplication, are shown in Fig. 8.

Table III shows the re-identification results of the proposed method and other compared state-of-the-art methods. The experimental results show that our ROIs-based method outperforms the other methods on all three testing datasets.

b) Vehicle retrieval: We also evaluate our method on the vehicle retrieval task following the protocol widely used in retrieval. We follow the split strategy in [6] for the three testing image sets of different sizes. The experiments are similar to the ones for re-identification. Here we only use the best combination of ROIs from re-identification. Table IV shows the retrieval results of the proposed method and other compared state-of-the-art methods. The experimental results also show that our ROIs-based method significantly outperforms the other methods on all three testing datasets.

VII. CONCLUSION

In this paper, we first collect a vehicle dataset VAC21 and carefully label 21 classes of structural attributes. With the constructed dataset, we apply the state-of-the-art one-stage detection method SSD as a baseline model for attribute detection. We make improvements by adding proposals from low-level layers to improve the accuracy of detecting small objects. Furthermore, we employ the focal loss to improve detection accuracy.

Then we propose a novel ROIs-based vehicle re-identification and retrieval method and conduct a series of experiments. The experimental results show that our method outperforms state-of-the-art methods in the literature.

REFERENCES

[1] W. Fang, J. Chen, C. Liang, Y. Wang, and R. Hu, "Vehicle re-identification collaborating visual and temporal-spatial network," in Proc. 5th Int. Conf. Internet Multimedia Comput. Service (ICIMCS), New York, NY, USA, 2013, pp. 121–125.
[2] D. Zapletal and A. Herout, "Vehicle re-identification for automatic video traffic surveillance," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2016, pp. 1568–1574.
[3] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[4] S. Ding, L. Lin, G. Wang, and H. Chao, "Deep feature learning with relative distance comparison for person re-identification," Pattern Recognit., vol. 48, no. 10, pp. 2993–3003, Oct. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320315001296
[5] W. Lin et al., "Learning correspondence structures for person re-identification," IEEE Trans. Image Process., vol. 26, no. 5, pp. 2438–2453, May 2017.
[6] H. Liu, Y. Tian, Y. Yang, L. Pang, and T. Huang, "Deep relative distance learning: Tell the difference between similar vehicles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2167–2175.
[7] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham, Switzerland: Springer, Oct. 2016, pp. 21–37.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[9] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," 2015. [Online]. Available: https://arxiv.org/abs/1506.01497
[10] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," 2017. [Online]. Available: https://arxiv.org/abs/1708.02002
[11] Y. Li, Y. Li, H. Yan, and J. Liu, "Deep joint discriminative learning for vehicle re-identification and retrieval," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 395–399.
[12] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[13] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), New York, NY, USA, 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
[14] Q. Hu, H. Wang, T. Li, and C. Shen, "Deep CNNs with spatially weighted pooling for fine-grained car recognition," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 11, pp. 3147–3156, Nov. 2017.
[15] L. Yang, P. Luo, C. C. Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3973–3981.
[16] X. Liu, W. Liu, H. Ma, and H. Fu, "Large-scale vehicle re-identification in urban surveillance videos," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2016, pp. 1–6.
[17] X. Liu, W. Liu, T. Mei, and H. Ma, "A deep learning-based approach to progressive vehicle re-identification for urban surveillance," in Computer Vision—ECCV, Cham, Switzerland: Springer, 2016, pp. 869–884.
[18] Y. Shen et al., "Learning deep neural networks for vehicle Re-ID with visual-spatio-temporal path proposals," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 1918–1927.
[19] Z. Wang et al., "Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 379–387.
[20] T. Zielke, M. Brauckmann, and W. von Seelen, "Intensity and edge-based symmetry detection with an application to car-following," CVGIP, Image Understand., vol. 58, no. 2, pp. 177–190, 1993.
[21] Y. Gao and H. J. Lee, "Vehicle make recognition based on convolutional neural network," in Proc. 2nd Int. Conf. Inf. Sci. Secur. (ICISS), Dec. 2015, pp. 1–4.
[22] S. D. Buluswar and B. A. Draper, "Color machine vision for autonomous vehicles," Eng. Appl. Artif. Intell., vol. 11, no. 2, pp. 245–256, 1998.
[23] M. Bertozzi, A. Broggi, and S. Castelluccio, "A real-time oriented system for vehicle detection," J. Syst. Archit., vol. 43, nos. 1–5, pp. 317–325, 1997.
[24] T. Kalinke, C. Tzomakas, and W. von Seelen, "A texture-based object detection and an adaptive model-based classification," in Proc. IEEE Intell. Vehicles Symp., 1998, pp. 341–346.
[25] Q. Fan, L. Brown, and J. Smith, "A closer look at Faster R-CNN for vehicle detection," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2016, pp. 124–129.
[26] H. Wu and B. Li, "License plate recognition system," in Proc. Int. Conf. Multimedia Technol., Jul. 2011, pp. 5425–5427.
[27] S.-L. Chang, L.-S. Chen, Y.-C. Chung, and S.-W. Chen, "Automatic license plate recognition," IEEE Trans. Intell. Transp. Syst., vol. 5, no. 1, pp. 42–53, Mar. 2004.
[28] C. Gerber and M. Chung, "Number plate detection with a multi-convolutional neural network approach with optical character recognition for mobile devices," J. Inf. Process. Syst., vol. 12, no. 1, pp. 100–108, 2016.
[29] Y. Wang, Z. Liu, and F. Xiao, "A fast coarse-to-fine vehicle logo detection and recognition method," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2007, pp. 691–696.
[30] D. F. Llorca, R. Arroyo, and M. A. Sotelo, "Vehicle logo recognition in traffic images using HOG features and SVM," in Proc. 16th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2013, pp. 2229–2234.
[31] Y. Huang, R. Wu, Y. Sun, W. Wang, and X. Ding, "Vehicle logo recognition system based on convolutional neural networks with a pretraining strategy," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 4, pp. 1951–1960, Aug. 2015.
[32] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 379–387.
[33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[34] J. Huang et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3296–3297.
[35] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge (VOC) Results. Accessed: Jan. 31, 2019. [Online]. Available: http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html
[36] Z. Sun, G. Bebis, and R. Miller, "On-road vehicle detection: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 5, pp. 694–711, May 2006.
[37] B. C. Matei, H. S. Sawhney, and S. Samarasekera, "Vehicle tracking across nonoverlapping cameras using joint kinematic and appearance features," in Proc. CVPR, Jun. 2011, pp. 3465–3472.
[38] Y. Xiang, C. Song, R. Mottaghi, and S. Savarese, "Monocular multiview object tracking with 3D aspect parts," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 220–235.
[39] A. Psyllos, C.-N. Anagnostopoulos, and E. Kayafas, "Vehicle model recognition from frontal view image measurements," Comput. Standards Interfaces, vol. 33, no. 2, pp. 142–151, Feb. 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0920548910000838
[40] A. P. Psyllos, C.-N. E. Anagnostopoulos, and E. Kayafas, "Vehicle logo recognition using a SIFT-based enhanced matching scheme," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 322–328, Jun. 2010.
[41] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[42] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[44] Y. Yuan, K. Yang, and C. Zhang, "Hard-aware deeply cascaded embedding," 2016. [Online]. Available: https://arxiv.org/abs/1611.05720

Yanzhu Zhao received the Ph.D. degree from the School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China, in 2018. From 2016 to 2017, she was a Visiting Student with The University of Adelaide, Adelaide, Australia. Her research interests include computer vision and deep learning.

Chunhua Shen studied at Nanjing University, Nanjing, China, and the Australian National University, Canberra, ACT, Australia, and received the Ph.D. degree from the University of Adelaide. He was with the Computer Vision Program, National ICT Australia, Canberra Research Laboratory, for about six years. He is currently a Professor with the School of Computer Science, The University of Adelaide. His research interests are in the intersection of computer vision and statistical machine learning. From 2012 to 2016, he held an Australian Research Council Future Fellowship.

Huibing Wang received the Ph.D. degree from the School of Computer Science and Technology, Dalian University of Technology, Dalian, China. From 2016 to 2017, he was a Visiting Student at The University of Adelaide, Adelaide, Australia. He is currently a Research Fellow with Dalian Maritime University, Dalian, China. His research interests include computer vision and machine learning.

Shengyong Chen (M'01–SM'10) received the Ph.D. degree in robot vision from the City University of Hong Kong in 2003. He was with the University of Hamburg from 2006 to 2007. He is currently a Professor with the Zhejiang University of Technology, and also with the Tianjin University of Technology, China. He has published over 100 scientific papers in international journals and is an inventor of over 100 patents. His research interests include computer vision, robotics, and image analysis. He is a Fellow of the IET, and a Senior Member of the IEEE and CCF. He received the National Outstanding Youth Foundation Award of China in 2013, and a fellowship from the Alexander von Humboldt Foundation of Germany.
