Person Re-Identification by Deep Joint Learning of Multi-Loss Classification
subject to the surrounding background clutter within an image, and potentially also misalignment and partial occlusion from poor detection. In this setting, we wish to discover and optimise jointly correlated complementary feature selections in the local and global representations, both subject to the same label constraint concurrently. Whilst the former aims to address detection misalignment and occlusion by localised fine-grained saliency information, the latter exploits holistic coarse-grained context for more robust global matching.

Our contributions are: (I) We propose the idea of learning concurrently both local and global feature selections for optimising feature discriminative capabilities in different contexts whilst performing the same person re-id task. This is currently under-studied in the person re-id literature to the best of our knowledge. (II) We formulate a novel Joint Learning Multi-Loss (JLML) CNN model for not only learning both global and local discriminative features in different contexts by optimising multiple classification losses on the same person label information concurrently, but also utilising their complementary advantages jointly in coping with local misalignment and optimising holistic matching criteria for person re-id. This is achieved with a deep two-branch CNN architecture by imposing inter-branch interaction between the local and global branches, and enforcing a separate objective loss function on each branch for learning independent discriminative capabilities. (III) We introduce a structured sparsity based feature selection learning mechanism for improving the robustness of multi-loss joint feature learning w.r.t. noise and data covariance between local and global representations. Extensive evaluations demonstrate the superiority of the proposed JLML model over a wide range of existing state-of-the-art re-id models on four benchmark datasets.

Related Works. The JLML method is related to the saliency learning based models [Zhao et al., 2013; Wang et al., 2014a] in terms of modelling localised part importance. However, these existing methods consider only the patch appearance statistics within individual locations but no global feature representation learning, let alone the correlation and complementary information discovery between local and global features as modelled by the JLML. Whilst the more recent SCS [Chen et al., 2016] and MCP [Cheng et al., 2016] consider both levels of representation, the JLML model differs significantly from them: (i) The SCS method focuses on supervised metric learning, whilst the JLML aims at joint discriminative feature learning and needs only generic metrics for re-id matching. (ii) The local and global branches of the MCP model are supervised and optimised by a triplet ranking loss, in contrast to the proposed multiple classification loss design. (iii) The JLML is uniquely capable of performing structured feature sparsity regularisation.

2 Model Design

2.1 Problem Definition
We assume a set of $n$ training images $\mathcal{I} = \{I_i\}_{i=1}^{n}$ with the corresponding identity labels $\mathcal{Y} = \{y_i\}_{i=1}^{n}$. These training images capture the visual appearance of $n_{id}$ (where $y_i \in [1, \cdots, n_{id}]$) different people under non-overlapping camera views. We formulate a Joint Learning Multi-Loss (JLML) CNN model that aims to discover and capture concurrently complementary discriminative information about a person image from both local and global visual features of the image, in order to optimise person re-id under significant viewing condition changes across locations. This is in contrast to most existing re-id methods, which typically depend on either local or global features alone.

2.2 Joint Learning Multi-Loss

Figure 1: The Joint Learning Multi-Loss (JLML) CNN model.

The overall design of the proposed JLML model is depicted in Figure 1. The JLML model consists of a two-branch CNN network: (1) a local branch of $m$ streams of an identical structure, with each stream learning the most discriminative local visual features for one of the $m$ local image regions of a person bounding box image; (2) a global branch responsible for learning the most discriminative global-level features from the entire person image. For concurrently optimising per-branch discriminative feature representations and discovering correlated complementary information between local and global feature selections, a joint learning scheme that subjects both local and global branches to the same identity label supervision is formulated with two underlying principles:

(I) Shared low-level features. We construct the global and local branches on a shared lower conv layer, in particular the first conv layer, for facilitating inter-branch common learning. The intuition is that the lower conv layers capture low-level features such as edges and corners which are common to all patterns in the same images. This shared learning is similar in spirit to multi-task learning [Argyriou et al., 2007], where the local and global feature learning branches are two related learning tasks. Sharing the low-level conv layer reduces the model parameter size and therefore the risk of model overfitting. This is especially critical in learning person re-id models when labelled training data is limited.

(II) Multi-task independent learning. To maximise the learning of complementary discriminative features from the local and global representations, the remaining layers of the two branches are learned independently subject to the given identity labels. That is, the JLML model aims to learn concurrently multiple identity feature representations, for different local image regions and for the entire image, all of which aim to maximise the same identity matching both individually and collectively at the same time. Independent multi-task learning aims to preserve both local saliency in feature selection and global robustness in image representation. To that end, the JLML model is designed to perform multi-task independent learning subject to shared identity label constraints by allocating each branch a separate objective loss function. By doing so, the per-branch learning behaviour is conditioned independently on the respective feature representation. We call this branch-wise loss formulation the MultiLoss design.
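To make the two-branch MultiLoss design concrete, the following PyTorch-style sketch (an illustration under simplified assumptions, not the authors' released implementation; the stream internals are stand-ins rather than the exact Table 1 configuration) shows one shared conv layer feeding a global branch and $m$ local stripe streams, with each branch ending in its own identity classifier:

```python
# Minimal sketch of the JLML two-branch, multi-loss design (illustrative only;
# the branch internals are simplified stand-ins, not the Table 1 layers).
import torch
import torch.nn as nn

class JLMLSketch(nn.Module):
    def __init__(self, num_ids, num_stripes=4, feat_dim=512):
        super().__init__()
        self.num_stripes = num_stripes
        # Shared first conv layer: the common low-level feature ground.
        self.shared_conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Global branch: its own conv stack, pooling and feature embedding.
        self.global_branch = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Local branch: one stream per horizontal stripe, identical structure.
        self.local_streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(32, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            for _ in range(num_stripes)
        ])
        self.local_embed = nn.Linear(16 * num_stripes, feat_dim)
        # Separate classifiers -> separate losses (the MultiLoss design).
        self.global_classifier = nn.Linear(feat_dim, num_ids)
        self.local_classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, x):
        shared = self.shared_conv(x)
        f_global = self.global_branch(shared)
        # Slice the shared feature map into horizontal stripes, one per stream.
        stripes = torch.chunk(shared, self.num_stripes, dim=2)
        f_local = self.local_embed(torch.cat(
            [stream(s) for stream, s in zip(self.local_streams, stripes)], dim=1))
        return (self.global_classifier(f_global),
                self.local_classifier(f_local),
                f_global, f_local)
```

During training, each classifier's logits feed a separate cross-entropy loss (Eqs. (3)-(5) below); at deployment, the two 512-D embeddings are concatenated into the 1,024-D representation used in Section 2.3.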
Table 1: JLML-ResNet39 model (1.5 billion FLOPs). MP: Max-Pooling; AP: Average-Pooling; S: Stride; SL: Slice; CA: Concatenation; G: Global; L: Local.

Layer #  Layer    Output Size         | Global Branch                                | Local Branch
1        conv1    112x112             | 3x3, 32, S-2 (shared by both branches)       |
9        conv2_x  G: 56x56, L: 28x56  | 3x3 MP, S-2; [1x1, 32; 3x3, 32; 1x1, 64]x3   | SL-4, 2x2 MP, S-1; [1x1, 16; 3x3, 16; 1x1, 32]x3
9        conv3_x  G: 28x28, L: 14x28  | [1x1, 64; 3x3, 64; 1x1, 128]x3               | [1x1, 32; 3x3, 32; 1x1, 64]x3
9        conv4_x  G: 14x14, L: 7x14   | [1x1, 128; 3x3, 128; 1x1, 256]x3             | [1x1, 64; 3x3, 64; 1x1, 128]x3
9        conv5_x  G: 7x7, L: 4x7      | [1x1, 256; 3x3, 256; 1x1, 512]x3             | [1x1, 128; 3x3, 128; 1x1, 256]x3
1        fc       1x1                 | 7x7 AP; FC-512                               | 4x7 AP, CA-4; FC-512
1        fc       1x1                 | FC-ID #                                      | FC-ID #

Network Construction. We adopt the residual CNN unit [He et al., 2016] as the JLML's building block due to its capacity for deeper model design whilst retaining a smaller model parameter size. Specifically, we customise the ResNet50 architecture in both layer and filter numbers, and design the JLML model as a 39-layer two-branch ResNet (JLML-ResNet39) tailored for re-id tasks. The configuration of JLML-ResNet39 is given in Table 1. Note that the ReLU rectification non-linearity [Krizhevsky et al., 2012] after each conv layer is omitted for brevity.

Feature Selection. To optimise JLML model learning robustness against noise and diverse data sources, we introduce a feature selection capability in JLML by structured sparsity induced regularisation [Kong et al., 2014; Wang et al., 2013]. Our idea is to have a competing-to-survive mechanism in feature learning that discourages irrelevant features whilst encouraging discriminative features concurrently in different local and global contexts, so as to maximise a shared identity matching objective. To that end, we sparsify the global feature representation with a group LASSO [Wang et al., 2013]:

\ell_{2,1} = \|W_G\|_{2,1} = \sum_{i=1}^{d_g} \|w_g^i\|_2    (1)

where $W_G = [w_g^1, \cdots, w_g^{d_g}] \in \mathbb{R}^{c_g \times d_g}$ is the parameter matrix of the global branch feature layer, taking as input $d_g$-dimensional vectors from the previous layer and outputting a $c_g$-dimensional (512-D) feature representation. Specifically, with the $\ell_1$ norm applied on the $\ell_2$ norms of the $w_g^i$, our aim is to learn selectively feature importance subject to both the sparsity principle and the identity label constraint simultaneously. Similarly, we also enforce a local feature sparsity constraint by an exclusive group LASSO [Kong et al., 2014]:

\ell_{1,2} = \|W_L\|_{1,2} = \sum_{i=1}^{c_l} \sum_{j=1}^{m} \|w_{l,j}^i\|_1^2    (2)

where $W_L$ is the parameter matrix of the local branch feature layer with $m \times d_l$ and $c_l$ (512) as the input and output dimensions ($m$ being the image stripe number), and $w_{l,j}^i \in \mathbb{R}^{d_l \times 1}$ defines the parameter vector contributing to the $i$-th output feature dimension from the $j$-th local input feature vector, $j \in [1, 2, \cdots, m]$. This $\ell_{1,2}$ regulariser performs sparse feature selection for individual image regions in conjunction with the global feature selection learning.
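Written out directly, the two penalties are only a few lines each; the sketch below (our own illustrative naming, not code from the paper) computes Eq. (1) over the columns of a global weight matrix and Eq. (2) over the per-stripe column groups of a local weight matrix:

```python
import torch

def group_lasso_l21(W_G: torch.Tensor) -> torch.Tensor:
    """Eq. (1): sum over input dimensions (columns) of the l2 norm of each
    column vector w_g^i, encouraging whole input features to be dropped."""
    # W_G has shape (c_g, d_g): output dimension x input dimension.
    return W_G.norm(p=2, dim=0).sum()

def exclusive_group_lasso_l12(W_L: torch.Tensor, m: int) -> torch.Tensor:
    """Eq. (2): for each output dimension i and stripe j, the squared l1 norm
    of w_{l,j}^i, inducing competition among features within each stripe."""
    c_l, md_l = W_L.shape
    d_l = md_l // m                  # per-stripe input dimension
    W = W_L.view(c_l, m, d_l)        # split the columns into m stripe groups
    return (W.abs().sum(dim=2) ** 2).sum()

# Example: a 512-output global feature layer over a 256-D input, and a
# 512-output local feature layer over m=4 stripes of 128-D each.
l21 = group_lasso_l21(torch.randn(512, 256))
l12 = exclusive_group_lasso_l12(torch.randn(512, 4 * 128), m=4)
```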
Loss Function. For model training, we utilise the cross-entropy classification loss function for both the global and local branches, so as to optimise person identity classification given training labels of multiple person classes extracted from pairwise labelled re-id datasets. Formally, we predict the posterior probability $\tilde{y}_i$ of image $I_i$ over the given identity label $y_i$:

p(\tilde{y}_i = y_i \,|\, I_i) = \frac{\exp(w_{y_i}^{\top} x_i)}{\sum_{k=1}^{n_{id}} \exp(w_k^{\top} x_i)}    (3)

where $x_i$ refers to the feature vector of $I_i$ from the corresponding branch, and $w_k$ the prediction function parameter of training identity class $k$. The training loss on a batch of $n_{bs}$ images is computed as:

l = -\frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \log p(\tilde{y}_i = y_i \,|\, I_i)    (4)

Combined with the group sparsity based feature selection regularisations, we have the final loss functions for the global and local branch sub-networks:

l_{global} = l + \lambda_{global} \|W_G\|_{2,1}, \quad l_{local} = l + \lambda_{local} \|W_L\|_{1,2}    (5)

where $\lambda_{global}$ and $\lambda_{local}$ control the balance between the identity label loss and the feature selection sparsity regularisation. We empirically set $\lambda_{local} = \lambda_{global} = 5 \times 10^{-4}$.
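Reusing the two regulariser helpers sketched above, the per-branch objectives of Eq. (5) reduce to a cross-entropy term (Eqs. (3) and (4) together are exactly the standard softmax cross-entropy) plus the matching sparsity penalty; a minimal, hypothetical rendering:

```python
import torch.nn.functional as F

def branch_losses(global_logits, local_logits, labels,
                  W_G, W_L, m, lam=5e-4):
    """Eq. (5): each branch gets its own cross-entropy (Eqs. (3)-(4)) plus its
    own structured sparsity penalty; both are optimised concurrently."""
    l_global = F.cross_entropy(global_logits, labels) \
        + lam * group_lasso_l21(W_G)
    l_local = F.cross_entropy(local_logits, labels) \
        + lam * exclusive_group_lasso_l12(W_L, m)
    return l_global, l_local
```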
Choice of Loss Function. Our JLML model learning deploys a classification loss function. This differs significantly from the contrastive loss functions used by most existing deep re-id methods, which are designed to exploit pairwise re-id labels defined by both positive and negative pairs, such as the pairwise verification loss [Varior et al., 2016; Subramaniam et al., 2016; Ahmed et al., 2015; Li et al., 2014], the triplet ranking loss [Cheng et al., 2016], or both [Wang et al., 2016a; Chen et al., 2017a]. Our JLML model training does not use any labelled negative pairs inherent to all person re-id training data, and we extract identity class labels from only positive pairs. The motivations for our JLML classification loss based learning are: (i) Significantly simplified training data batch construction, e.g. random sampling with no notorious tricks required, as shown by other deep classification methods [Krizhevsky et al., 2012]. This makes our JLML model more scalable in real-world applications with very large training population sizes when available and/or imbalanced training data sampling from different camera views. This also eliminates the undesirable need for carefully forming pairs and/or triplets in preparing re-id training splits, as in most existing methods, due to the inherently imbalanced negative and positive pair size distributions. (ii) Visual psychophysical findings suggest that representations optimised for classification tasks generalise well to novel categories [Edelman, 1998]. We consider that re-id tasks are about model generalisation to unseen test identity classes given training data on independent seen identity classes. Our JLML model learning exploits this general classification learning principle beyond the strict pairwise relative verification loss in existing re-id models.

Model Training. We adopt the Stochastic Gradient Descent (SGD) optimisation algorithm [Krizhevsky et al., 2012] to perform the batch-wise joint learning of the local and global branches. Note that with SGD we can naturally synchronise the optimisation processes of the two branches by constraining their learning behaviours subject to the same identity label information at each update. This is likely to avoid representation learning divergence between the two branches and helps enhance the correlated complementary learning capability.
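A synchronised SGD update over the two branches, in the spirit of this description, could then look like the following sketch (again illustrative: it wires together the hypothetical JLMLSketch model and helper losses defined above, with the base learning rate and momentum taken from Table 4 below):

```python
import torch

model = JLMLSketch(num_ids=750)  # number of training identities (dataset-dependent)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """One batch-wise update: both branch losses are backpropagated together,
    so the two branches stay synchronised on the same identity labels."""
    opt.zero_grad()
    g_logits, l_logits, _, _ = model(images)
    l_g, l_l = branch_losses(
        g_logits, l_logits, labels,
        W_G=model.global_branch[-1].weight,  # global feature layer, Eq. (1)
        W_L=model.local_embed.weight,        # local feature layer, Eq. (2)
        m=4)
    (l_g + l_l).backward()                   # joint update of both branches
    opt.step()
```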
2.3 Person Re-Id by Generic Distance Metrics
Once the JLML model is learned, we obtain a 1,024-D joint representation by concatenating the local (512-D) and global (512-D) feature vectors (the fc layers in Table 1). For person re-id, we deploy this 1,024-D deep feature representation using only a generic distance metric, e.g. the L2 distance, without any camera-pair specific distance metric learning.
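Deployment is therefore a plain nearest-neighbour ranking in the joint feature space; a sketch under the same assumptions as the model above:

```python
import torch

@torch.no_grad()
def rank_gallery(model, query_img, gallery_imgs):
    """Concatenate the global and local embeddings into the 1,024-D joint
    representation, then rank gallery images by generic L2 distance."""
    model.eval()
    _, _, qg, ql = model(query_img.unsqueeze(0))
    _, _, gg, gl = model(gallery_imgs)
    q = torch.cat([qg, ql], dim=1)                 # 1 x 1024 query feature
    g = torch.cat([gg, gl], dim=1)                 # N x 1024 gallery features
    return torch.cdist(q, g).squeeze(0).argsort()  # gallery indices, nearest first
```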
3 Experiments

Datasets. For evaluation, we used four benchmarking re-id datasets: VIPeR [Gray and Tao, 2008], GRID [Loy et al., 2009], CUHK03 [Li et al., 2014], and Market-1501 [Zheng et al., 2015]. These datasets present a wide range of re-id evaluation scenarios with different population sizes under different challenging viewing conditions (Figure 2 and Table 2). On GRID, 125 people are used for training, with a further 775 distractor people included in the test gallery; we used the benchmarking 10 splits [Loy et al., 2009] and report the averaged performance. On CUHK03, following [Li et al., 2014], we repeated 20 random 1260/100 training/test splits and report the averaged accuracies under the single-shot evaluation setting. On Market-1501, we used the standard training/test split (750/751) [Zheng et al., 2015]. We used the cumulative matching characteristic (CMC) to measure re-id accuracy on all benchmarks, except that on Market-1501 we also used the recall measure of mean Average Precision (mAP).

Competitors. We compared the JLML model against 10 existing state-of-the-art methods as listed in Table 3. They range from hand-crafted and deep learning features to domain-specific distance metric learning methods. We summarise them into three categories: (A) hand-crafted features with domain-specific distance metric learning; (B) deep learning features with domain-specific deep verification metric learning; (C) deep learning features with the generic non-learning L2 distance metric.

Table 3: Person re-id method categorisation by features and metrics. Cat: Category; DL: Deep Learning; CPSL: Camera-Pair Specific Learning; DVM: Deep Verification Metric; DVM, L2: Ensemble of DVM and L2; CHS: Fusion of Colour, HOG, SILTP features.

Cat  Method                             | Feature: Hand-Crafted | Feature: DL | Metric: CPSL | Metric: Generic
A    XQDA [Liao et al., 2015]           | LOMO                  | -           | XQDA         | -
A    GOG [Matsukawa et al., 2016b]      | GOG                   | -           | XQDA         | -
A    NFST [Zhang et al., 2016]          | LOMO, KCCA            | -           | NFST         | -
A    SCS [Chen et al., 2016]            | CHS                   | -           | SCS          | -
B    DCNN+ [Ahmed et al., 2015]         | -                     | DCNN+       | DVM          | -
B    X-Corr [Subramaniam et al., 2016]  | -                     | X-Corr      | DVM          | -
B    MTDnet [Chen et al., 2017a]        | -                     | MTDnet      | DVM, L2      | -
C    S-CNN [Varior et al., 2016]        | -                     | S-CNN       | -            | L2
C    DGD [Xiao et al., 2016]            | -                     | DGD         | -            | L2
C    MCP [Cheng et al., 2016]           | -                     | MCP         | -            | L2
C    JLML (Ours)                        | -                     | JLML        | -            | L2

Table 4: JLML training parameters. BLR: base learning rate; LRP: learning rate policy; MOT: momentum; IT: iteration; BS: batch size.

Parameter  BLR   LRP               MOT  IT #  BS
Pre-train  0.01  step (0.1, 100K)  0.9  300K  32
Train      0.01  step (0.1, 20K)   0.9  50K   32
3.1 Comparisons to State-Of-The-Arts

(I) Evaluation on CUHK03. Table 5 shows the comparison of JLML against 8 existing methods on CUHK03. It is evident that JLML outperforms the existing methods in all categories on both labelled and detected bounding boxes, surpassing the 2nd best performers, DGD and X-Corr, on the corresponding labelled and detected images in Rank-1 by 7.9% (83.2-75.3) and 8.6% (80.6-72.0) respectively. X-Corr/GOG/JLML also suffer the least from auto-detection misalignment, indicating the robustness of the joint learning approach to mining complementary local and global discriminative features.

Table 5: CUHK03 evaluation. 1st/2nd best in red/blue.

Cat  Method  | Labelled: R1  R5   R10  R20  | Detected: R1  R5   R10  R20
A    XQDA    | 55.2 77.1 86.8 83.1          | 46.3 78.9 83.5 93.2
A    GOG     | 67.3 91.0 96.0 -             | 65.5 88.4 93.7 -
A    NFST    | 62.5 90.0 94.8 98.1          | 54.7 84.7 94.8 95.2
B    DCNN+   | 54.7 86.5 93.9 98.1          | 44.9 76.0 83.5 93.2
B    X-Corr  | 72.4 95.5 -    98.4          | 72.0 96.0 -    98.2
B    MTDnet  | 74.7 96.0 97.5 -             | -    -    -    -
C    S-CNN   | -    -    -    -             | 68.1 88.1 94.6 -
C    DGD     | 75.3 -    -    -             | -    -    -    -
C    JLML    | 83.2 98.0 99.4 99.8          | 80.6 96.9 98.7 99.2

(II) Evaluation on Market-1501. We evaluated JLML against four existing models on Market-1501. Table 6 shows the clear performance superiority of JLML over all state-of-the-arts, with more significant Rank-1 advantages over the other methods than on CUHK03, giving 19.3% (85.1-65.8) (SQ) and 13.7% (89.7-76.0) (MQ) gains over the 2nd best S-CNN. This further validates the advantages of our joint learning of multi-loss classification for optimising re-id, especially when the re-id test population size increases (750 people on Market-1501 vs. 100 people on CUHK03).

Table 6: Market-1501 evaluation. 1st/2nd best in red/blue. All person bounding box images were auto-detected.

Cat  Method  | Single-Query: R1  mAP  | Multi-Query: R1  mAP
A    XQDA    | 43.8 22.2              | 54.1 28.4
A    SCS     | 51.9 26.3              | -    -
A    NFST    | 61.0 35.6              | 71.5 46.0
C    S-CNN   | 65.8 39.5              | 76.0 48.4
C    JLML    | 85.1 65.5              | 89.7 74.5

(III) Evaluation on VIPeR. We evaluated the performance of JLML against 8 strong competitors on VIPeR, a more challenging test scenario with fewer training classes (316 people) and lower image resolution. On this dataset, the best performers are hand-crafted feature methods (SCS and NFST) rather than deep models, in contrast to the tests on CUHK03 and Market-1501. Nevertheless, the JLML model remains the best among all deep methods, with or without deep verification metric learning. This validates the superiority and robustness of our deep joint global and local representation learning of multi-loss classification given sparse training data. We attribute this property to the JLML's capability of mining complementary features in different contexts for both handling local misalignment and optimising global matching.

Table 7: VIPeR evaluation. 1st/2nd best in red/blue.

Cat  Method  R1   R5   R10  R20
A    XQDA    40.0 68.1 80.5 91.1
A    GOG     49.7 -    88.7 94.5
A    NFST    51.1 82.1 90.5 95.9
A    SCS     53.5 82.6 91.5 96.7
B    DCNN+   34.8 63.6 75.6 84.5
B    MTDnet  47.5 73.1 82.6 -
C    MCP     47.8 74.7 84.8 91.1
C    DGD     38.6 -    -    -
C    JLML    50.2 74.2 84.3 91.6

(IV) Evaluation on GRID. We compared JLML against 4 competing methods on GRID. In addition to poor image resolution, poor lighting, and a small training size (125 people), GRID also has extra distractors in the testing population, therefore presenting a very challenging but realistic re-id scenario. Table 8 shows a significant superiority of JLML over the existing state-of-the-arts, with Rank-1 12.8% (37.5-24.7) better than the 2nd best method GOG, a 51.8% relative improvement. This demonstrates the unique and practically desirable advantage of JLML in handling more realistically challenging open-world re-id matching where large numbers of distractors are usually present.

Table 8: GRID evaluation. 1st/2nd best in red/blue.

Cat  Method  R1   R5   R10  R20
A    XQDA    16.6 33.8 41.8 52.4
A    GOG     24.7 47.0 58.4 69.0
A    SCS     24.2 44.6 54.1 65.2
B    X-Corr  19.2 38.4 53.6 66.4
C    JLML    37.5 61.4 69.4 77.4

3.2 Further Analysis and Discussions
We further examined the component effects of our JLML model on the Market-1501 dataset in the following aspects.

(I) Complementarity of Global and Local Features. We evaluated the complementary effects of our jointly learned local and global features by comparing their individual re-id performance against that of the joint features. Table 9 shows that: (i) Either of the two feature representations alone is competitive for re-id, e.g. the local JLML feature surpasses S-CNN (Table 6) by Rank-1 13.1% (78.9-65.8) (SQ) and 10.4% (86.4-76.0) (MQ), and by mAP 18.3% (57.8-39.5) (SQ) and 20.0% (68.4-48.4) (MQ). (ii) A further performance gain is obtained from the joint feature representation, yielding a further 6.2% (85.1-78.9) (SQ) and 3.3% (89.7-86.4) (MQ) Rank-1 increase, and a 7.7% (65.5-57.8) (SQ) and 6.1% (74.5-68.4) (MQ) mAP boost. These results show the complementary advantages of jointly learning the local and global features in different contexts using the JLML model.

Table 9: Complementary benefits of global and local features.

Method         | Single-Query: R1  mAP  | Multi-Query: R1  mAP
JLML (Global)  | 77.4 56.0              | 85.0 66.0
JLML (Local)   | 78.9 57.8              | 86.4 68.4
JLML (Joint)   | 85.1 65.5              | 89.7 74.5

(II) Importance of Branch Independence. We evaluated the importance of branch independence by comparing our MultiLoss design with a UniLoss design that merges the two branches into a single loss [Cheng et al., 2016]. Table 10 shows that the proposed MultiLoss model significantly improves the discriminative power of the global and local re-id features, e.g. with a Rank-1 increase of 9.0% (85.1-76.1) (SQ) and 6.0% (89.7-83.7) (MQ), and a mAP improvement of 13.3% (65.5-52.2) (SQ) and 11.7% (74.5-62.8) (MQ). This shows that branch independence plays a critical role in joint learning of multi-loss classification for effective feature optimisation. One plausible reason is the negative effect of a single loss imposed on the learning behaviour of both branches, caused by the potential divergence of discriminative features in different contexts (local and global). This is shown by the significant performance degradation of both the global and local features when the UniLoss model is imposed.
Table 10: Importance of branch independence.

Loss       Feature         | Single-Query: R1  mAP  | Multi-Query: R1  mAP
UniLoss    Global Feature  | 58.3 31.7              | 70.4 43.2
UniLoss    Local Feature   | 46.3 26.3              | 58.0 34.0
UniLoss    Full            | 76.1 52.2              | 83.7 62.8
MultiLoss  Global Feature  | 77.4 56.0              | 85.0 66.0
MultiLoss  Local Feature   | 78.9 57.8              | 86.4 68.4
MultiLoss  Full            | 85.1 65.5              | 89.7 74.5

(III) Benefits from Shared Low-Level Features. We evaluated the effects of the interaction between the global and local branches introduced by the shared conv layer (common ground) by deliberately removing it and then comparing the re-id performance. Table 11 shows the benefits from jointly learning low-level features in the common conv layer, e.g. improving Rank-1 by 1.9% (85.1-83.2) / 1.4% (89.7-88.3) and mAP by 2.4% (65.5-63.1) / 2.4% (74.5-72.1) for single-/multi-query re-id. This confirms a similar finding in the multi-task learning study [Argyriou et al., 2007].

Table 11: Benefits from shared low-level features.

Setting                 | Single-Query: R1  mAP  | Multi-Query: R1  mAP
Without Shared Feature  | 83.2 63.1              | 88.3 72.1
With Shared Feature     | 85.1 65.5              | 89.7 74.5

(IV) Effects of Selective Feature Learning. We evaluated the contribution of our structured sparsity based Selective Feature Learning (SFL) (Eqs. (1) and (2)). Table 12 shows that our SFL mechanism brings additional re-id matching benefits, e.g. improving the Rank-1 rate by 1.7% (85.1-83.4) (SQ) and 1.0% (89.7-88.7) (MQ), and mAP by 1.7% (65.5-63.8) (SQ) and 1.6% (74.5-72.9) (MQ).

Table 12: Effects of selective feature learning (SFL).

Setting      | Single-Query: R1  mAP  | Multi-Query: R1  mAP
Without SFL  | 83.4 63.8              | 88.7 72.9
With SFL     | 85.1 65.5              | 89.7 74.5

(V) Comparisons of Model Size and Complexity. We compared the proposed JLML-ResNet39 model with four seminal classification CNN architectures (AlexNet [Krizhevsky et al., 2012], VGG16 [Simonyan and Zisserman, 2015], GoogLeNet [Szegedy et al., 2015], and ResNet50 [He et al., 2016]) in model size and complexity. Table 13 shows that the JLML has both the 2nd smallest model size (7.2 million parameters) and the 2nd smallest FLOPs (1.54x10^9), despite containing more streams (5 vs. 1 in all the other CNNs) and more layers (39, more than all except ResNet50).

Table 13: Comparisons of model size and complexity. FLOPs: the number of FLoating-point OPerations; PN: Parameter Number.

Model          FLOPs       PN (million)  Depth  Stream #
AlexNet        7.25x10^8   58.3          7      1
VGG16          1.55x10^10  134.2         16     1
ResNet50       3.80x10^9   23.5          50     1
GoogLeNet      1.57x10^9   6.0           22     1
JLML-ResNet39  1.54x10^9   7.2           39     5

4 Conclusion
We presented a novel Joint Learning Multi-Loss (JLML) CNN model (JLML-ResNet39) for person re-identification feature learning. In contrast to existing re-id approaches that employ either global or local appearance features alone, the proposed model is capable of extracting and exploiting both, maximising their correlated complementary effects by learning discriminative feature representations in different contexts subject to multi-loss classification objective functions in a unified framework. This is made possible by the proposed JLML-ResNet39 architecture design. Moreover, we introduced a structured sparsity based selective feature learning mechanism to further improve joint feature learning. Extensive comparative evaluations on four re-id benchmark datasets validate the advantages of the proposed JLML model over a wide range of state-of-the-art methods on both manually labelled and more challenging auto-detected person images. We also provided component evaluations and analysis of the model performance to give insights on the JLML model design.

Acknowledgements
This work was partially supported by the China Scholarship Council, Vision Semantics Ltd and the Royal Society Newton Advanced Fellowship Programme (NA150459).

References
[Ahmed et al., 2015] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
[Argyriou et al., 2007] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, 2007.
[Chen et al., 2016] Dapeng Chen, Zejian Yuan, Badong Chen, and Nanning Zheng. Similarity learning with spatial constraints for person re-identification. In CVPR, 2016.
[Chen et al., 2017a] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. A multi-task deep network for person re-identification. In AAAI, 2017.
[Chen et al., 2017b] Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. Person re-identification by camera correlation aware feature augmentation. TPAMI, 2017.
[Cheng et al., 2016] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
[Edelman, 1998] Shimon Edelman. Representation is representation of similarities. Behavioral and Brain Sciences, 21(04):449-467, 1998.
[Farenzena et al., 2010] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2010.
[Girshick et al., 2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[Gong et al., 2014] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person Re-Identification. Springer, January 2014.
[Gray and Tao, 2008] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[Koestinger et al., 2012] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
[Kong et al., 2014] Deguang Kong, Ryohei Fujimaki, Ji Liu, Feiping Nie, and Chris Ding. Exclusive feature learning on arbitrary structures via l1,2-norm. In NIPS, 2014.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[Kviatkovsky et al., 2013] Igor Kviatkovsky, Amit Adam, and Ehud Rivlin. Color invariants for person reidentification. TPAMI, 35(7):1622-1634, 2013.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[Li et al., 2014] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[Liao et al., 2015] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
[Loy et al., 2009] Chen Change Loy, Tao Xiang, and Shaogang Gong. Multi-camera activity correlation analysis. In CVPR, 2009.
[Ma et al., 2017] Xiaolong Ma, Xiatian Zhu, Shaogang Gong, Xudong Xie, Jianming Hu, Kin-Man Lam, and Yisheng Zhong. Person re-identification by unsupervised video matching. Pattern Recognition, 65:197-210, 2017.
[Matsukawa et al., 2016a] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, 2016.
[Matsukawa et al., 2016b] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, 2016.
[Navon, 1977] David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353-383, 1977.
[Paisitkriangkrai et al., 2015] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Subramaniam et al., 2016] Arulkumar Subramaniam, Moitreya Chatterjee, and Anurag Mittal. Deep neural networks with inexact matching for person re-identification. In NIPS, 2016.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[Torralba et al., 2006] Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766, 2006.
[Varior et al., 2016] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
[Wang et al., 2013] Hua Wang, Feiping Nie, and Heng Huang. Multi-view clustering and feature learning via structured sparsity. In ICML, 2013.
[Wang et al., 2014a] H. Wang, S. Gong, and T. Xiang. Unsupervised learning of generative topic saliency for person re-identification. In BMVC, 2014.
[Wang et al., 2014b] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking. In ECCV, 2014.
[Wang et al., 2016a] Faqiang Wang, Wangmeng Zuo, Liang Lin, David Zhang, and Lei Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 2016.
[Wang et al., 2016b] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.
[Wang et al., 2016c] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by discriminative selection in video ranking. TPAMI, 38(12):2501-2514, 2016.
[Xiao et al., 2016] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
[Xiong et al., 2014] Fei Xiong, Mengran Gou, Octavia Camps, and Mario Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
[Zhang et al., 2016] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
[Zhao et al., 2013] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, 2013.
[Zheng et al., 2013] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Reidentification by relative distance comparison. TPAMI, 35(3):653-668, March 2013.
[Zheng et al., 2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.