Paper 64-Person Re Identification System at Semantic

This paper presents a novel Person Re-Identification (Re-ID) system that integrates a Pedestrian Attribute Ontology (PAO), Local Multi-task Deep Convolutional Neural Network (MDCNN), and an Imbalanced Data Solver (IDS) to enhance the accuracy of identifying individuals across different camera views. The proposed system addresses challenges such as imbalanced data and the need for semantic information by utilizing pedestrian attributes for filtering candidates and improving feature discrimination. Experimental results on the Market1501 dataset demonstrate the effectiveness of the proposed approach compared to existing state-of-the-art methods.

(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 11, No. 2, 2020

Person Re-Identification System at Semantic Level based on Pedestrian Attributes Ontology
Ngoc Q. Ly1, Hieu N. M. Cao2
Department of Computer Vision and Cognitive Cybernetics, VNUHCM-University of Science, Ho Chi Minh City, Vietnam

Thi T. Nguyen3
Computer Vision and Cognitive Cybernetics, VNUHCM-University of Science, Ho Chi Minh City, Vietnam

Abstract—Person Re-Identification (Re-ID) is a very important task in video surveillance systems, such as tracking people, finding people in public places, or analysing customer behavior in supermarkets. Although many works have addressed this problem, challenges remain, such as large-scale datasets, imbalanced data, viewpoint, and fine-grained data (attributes). In addition, local features are not employed at the semantic level in the online stage of the Re-ID task, and the imbalanced data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: Pedestrian Attribute Ontology (PAO), Local Multi-task DCNN (Local MDCNN), and Imbalance Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of PAO, Local MDCNN and IDS to exploit the inner-group correlations of attributes, to pre-filter the mismatched candidates from the Gallery set based on semantic information such as Fashion Attributes and Facial Attributes, and to solve the imbalanced data of attributes without adjusting the network architecture or using data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on the Market1501 dataset than some state-of-the-art Re-ID methods.

Keywords—Person Re-Identification (Re-ID); Pedestrian Attributes Ontology (PAO); Deep Convolution Neuron Network (DCNN); Multi-task Deep Convolution Neuron Network (MDCNN); Local Multi-task Deep Convolution Neuron Network (Local MDCNN); Imbalanced Data Solver (IDS)

I. INTRODUCTION

Re-ID is the problem of recognising and associating a person at different physical locations over time after the person had been previously observed visually elsewhere. Solving the Re-ID problem has gained a rapid increase in attention in both academic research communities and industrial laboratories in recent years. It has many applications, such as tracking people across cameras, image retrieval, or customer behavior analysis [1]. Because it relies on appearance features from input images, this problem suffers from the common challenges in visual recognition: illumination, pose variation, occlusion, intra-class and inter-class variations. Early studies aim to make full use of hand-crafted visual features [2-8] or metric learning [2,4,5,9-11] to build the best descriptor for each person. These methods can solve one or some of the above challenges, but are computationally very expensive and do not reach high accuracy. In recent years, with the growth of convolutional neural networks (CNNs) and carefully annotated benchmarks, CNN-based models, which learn deep features from data, outperform hand-crafted methods by a large margin and achieve remarkable accuracy [12]. To obtain more discriminative features to deal with the inter-class challenge, some works try to extract local features from local regions in different ways, such as pose normalization [13-15], part-based learning [16-19], or attention mechanisms [20-23]. Although these works achieve very high accuracy and mAP, they still employ only deep features, which do not contain semantic information and cannot be explained by humans.

Pedestrian Attribute Recognition is a task that predicts a number of predefined attributes describing a pedestrian. Similar to Re-ID, this task takes bounding boxes of pedestrians captured by cameras as inputs. Attributes are semantic information. They are extracted by an attribute learning model. They can be robust to challenges such as pose, lighting, and camera characteristics. For the Re-ID problem, facial attributes and clothing attributes are considered. Combining a large enough set of attributes can help improve the discrimination of Re-ID features. Furthermore, unlike low-level visual features or high-level deep features, attributes are easy for humans to understand [24]. Attributes can also be extended to a range of other applications, such as clothes retrieval and face retrieval. Most existing Re-ID studies use global features to predict all attributes [24-27]. However, most attributes appear in local positions, so global features are insufficient to recognize them. Some works notice this drawback and improve on it by dividing global features into local parts [28,29], but they still consider attributes as an auxiliary branch to enrich deep features.

In this work, we propose two simple CNN-based models, one for extracting deep global features, and the other for predicting pedestrian attributes. In the learning stage of the attribute recognition model (ARM), we split the feature maps at a specific mid-level layer into multiple branches, with respect to human body parts. Each branch uses a local feature map, which is horizontally split from the global feature map, to predict a group of corresponding relevant attributes. The attribute groups are taken from a predefined PAO, which helps leverage the intra-class correlation of attributes in the learning process. Besides, we take into consideration the
imbalance of attributes and handle it by employing the Matthews correlation coefficient (MCC). In the inference stage, unlike previous methods, for each query image we first use the attribute predictions to filter out mismatched candidates, and then the remaining ones are used to find the best match by deep global features.

The main contributions of this paper are as follows: 1) We propose the Pedestrian Attribute Ontology to guide the Pedestrian Attribute Learning process and the Re-ID process; 2) We propose the Pedestrian Attribute Learning Model based on Local Multi-task learning; 3) We propose integrating an Imbalanced Data Solver based on MCC into the Re-ID system; 4) We propose a new Re-ID method based on Deep Global Features and Pedestrian Semantic Information.

II. RELATED WORKS

A. Hand-Crafted-Features-based Re-ID

Traditional approaches mainly focus on designing discriminative hand-crafted visual features. Colors and textures are usually employed. In [2], RGB and HSV color vectors are extracted from input images and then fed into a Maximum Likelihood model to learn the image similarities. In another approach [3], Gabor filters are used to extract texture features. Covariances of these features are also employed by the Region Covariance Descriptor in [6,7]. In [8], a robust feature named Local Maximal Occurrence (LOMO) is proposed. LOMO is obtained by sliding a window over input images and taking the maximum values of different features from all patches under the window. Apart from proposing discriminative features, other works [5,9-11] try to design an effective metric to learn the similarity and difference between images.

B. CNN-based Re-ID

CNNs have been widely employed in person re-identification due to their great performance in many different computer vision tasks [30-32]. Earlier works use global features extracted from a CNN to train a siamese network [33-35]. For example, [33] proposed a Deep Ranking model aiming to maximize the rank of the Euclidean distance between feature vectors of the same identity. The authors of [34] employ a Recurrent Neural Network to make use of motion information for a more discriminative person description. The authors of [35] proposed a Pyramid Person Matching Network to learn the correspondence of misaligned components in image pairs. Attention mechanisms are applied in many models [20-23] to focus on salient parts and extract more useful and discriminative information. Recent studies have started to consider part-level features as complementary features for their models because of their fine-grained information. Early part-based approaches apply predefined rigid grids on input images as local parts [36,37]. This way of partitioning is insufficient because person detection boxes are not always correct. With a very detailed partition, [38] trains a semantic parsing model to localize pixel-level body parts. A weighted sum layer is then used to fuse global and local features for identity classification. Extra pose estimators [13,14] or spatial constraints [15] are also utilized to normalize deformable pedestrian parts to obtain more robust features. In another way to learn local features [16-18], global features are horizontally pooled and then separated into vectors, each of which is considered as containing information of a corresponding body part. The authors of [17] use the same method but split global features into equal stripes at multiple granularities. Each stripe is then fed separately into an identity classifier. Many of the above methods achieve remarkable performance in the Re-ID task. However, none of them consider semantic information, such as attributes, but only try to make robust deep global/local features.

C. Attribute for Re-ID

Attributes are signatures of concepts. It is difficult to recognize concepts, but recognizing signatures is much easier. Attributes help re-identify pedestrians from coarse to fine, and help to understand images at a detailed level. Therefore, attributes can help increase the discrimination between pedestrians. There are many works investigating attributes as auxiliary information for Re-ID. In [24,25], a DCNN classifier is first trained on an independent attribute-annotated dataset, then the attributes predicted by that model are used to train and fine-tune another person Re-ID model. Note that this kind of method propagates the error of the former attribute model to the Re-ID model, especially since the attribute model still does not consider the imbalance problem. In [26,27], an end-to-end Multi-task DCNN model is proposed to do both attribute recognition and Re-ID tasks simultaneously. In these works, each attribute probability is predicted by forwarding the same global feature vector to a corresponding linear layer. This vector is also used to retrieve nearest neighbors in the inference stage of the Re-ID task. In fact, many attributes appear only in small regions of the human body, so using a single global feature vector to learn all attributes is inefficient. Recognizing this drawback, [28,29] proposed part-based CNN models, in which a global feature map from a middle layer is horizontally split into 4 disjoint equal local feature maps, each of which is then forwarded to some other convolutional layers followed by a last linear layer to predict probabilities of a group of attributes. The improvement in this method is that it uses multiple local features to predict groups of suitable attributes. However, some attributes are distributed over more than one part, so it is unclear which part's output should be used to evaluate attribute recognition. Therefore, in the inference stage of the Re-ID task, the authors do not use attribute predictions anymore, but only enrich features by concatenating all deep local and global features. Furthermore, none of the above methods handle the imbalanced data problem of attributes, which is an inherent problem in many classification tasks.

Therefore, in this paper, we follow the method in [28], which builds a DCNN model that splits a middle global feature map into multiple local parts. But instead of distributing each attribute over multiple parts and concatenating local and global deep features, we propose novel improvements: 1) We build a Pedestrian Attribute Ontology (PAO) for better attribute learning, and also for easy expansion in the future; 2) Based on PAO, we build a Local Multi-task DCNN model (Local MDCNN) to exploit inner group and inter group correlations between attributes; 3) We incorporate an Imbalanced Data Solver (IDS) into our Pedestrian Attribute Recognition module; and 4) we build a novel Person Re-identification system flexibly combining global deep
features and pedestrian semantic information (facial and clothing attributes).

III. METHOD

Briefly, our contribution is a novel Person Re-identification system based on Deep Global Features and Pedestrian Attributes. In this section, we focus on the main points of our system. In the off-line stage, we build a PAO and then a Local MDCNN to learn pedestrian attributes. Besides, we use transfer learning to train a siamese network to learn person global deep features. In the on-line stage, for each query image, we first use pedestrian semantic information, i.e. attributes, to pre-filter candidate images, and then use global deep features to find the nearest neighbor among the remaining ones, or vice versa.

Our Re-ID plays a very important role in a multi-camera person tracking system. Within the viewing range of each camera, the tracking task can be performed by conventional methods, but when the person moves from the viewing range of camera (i) to camera (i+1), Re-ID is very useful for identifying the monitored person whose track has been lost.

Our system is organized into two phases:

1) Offline Phase: This phase is designed to build the PAO to support the Deep Global Features Learning model (DGFL model) and the Pedestrian Attributes Learning model (PAL model, aka the Local MDCNN model). To improve the performance of the learning models, we take the imbalanced data problem into account. This phase also prepares the features of the gallery set for the matching process in the online phase. The PAO is manually built based on domain knowledge in the field of fashion attributes and facial attributes. It is a hierarchical semantic tree. Its purpose is to exploit the correlations between attributes, not only for better learning but also for easier expansion in the future. The DGFL model is designed and trained to extract the deep global features of the input image of each person. The PAL model is designed and trained to extract the predefined attributes of each person. The deep features and the semantic information such as attributes are then mutually combined in the online phase. The imbalanced data problem is also solved in this phase by the IDS. Its purpose is to find the best threshold for each attribute prediction on the training set. In the online phase, the chosen thresholds are used to convert the continuous predicted outputs of the PAL model to corresponding binary values for use in the query process.

2) Online Phase: This phase is organized to run the query process, including deep feature extraction, attribute information extraction and retrieval. After getting deep features from the DGFL model and attribute information from the PAL model, for each query image, we first use attributes to filter out candidate images whose attributes differ from those of the query image, and then use global deep features to find the nearest neighbor among the remaining ones, or do the steps vice versa.

A. Pedestrian Attribute Ontology (PAO)

Our PAO is inspired by our Face Attribute Ontology (FaAO) and our Fashion Attribute Ontology (FasAO) [39]. PAO helps us to exploit inner group and inter group correlations between attributes, which is very useful for training the Local MDCNN. PAO also helps the developer to easily add new attributes in the future. To do this, we first manually classify attributes into groups based on their correlations. These groups are combined into a semantic hierarchical tree, also known as an attribute ontology. To bring the PAO into the deep model, we rely on the observation that most attributes can be seen in a local region instead of the whole human body. For example, "is wearing hat" can be predicted by learning from the "head" region. Therefore, our Local MDCNN learns attributes from local features instead of global ones. Following [39], we employ a clothing attribute dataset named DeepFashion [40] to build a general FasAO, and our prior knowledge to build a general FaAO. In the experiments, we re-build a specific PAO on another dataset that has both Re-ID and attribute labels.

In [41], Gruber stated that an ontology is a formal, explicit specification of shared concepts. An ontology is formed by four principal components: individuals, classes, concepts and relations between them. The class components can have multiple layers. In our case, we formulate the person attribute ontology with five components:

• Person (individuals): a layer representing people objects.

• Regions (classes): a layer representing human body regions, consisting of five parts: head, upper body, lower body, whole body (upper and lower) and foot.

• Categories (classes): a layer representing types of corresponding clothes in each region.

• Attributes (concepts): clothing attributes with respect to each category, and facial attributes.

• Relations: consisting of 3 relations: part of (between regions and individuals), has a (between regions - categories and categories - attributes), is a (between attributes and their values).

The semantic hierarchical tree consists of three main levels: Regions, Categories and Attributes. Fig. 1 shows our PAO. In Fig. 1, the human body is first split into five regions. In each region, there are multiple categories of clothing items. And for each item, there are relevant attributes depending on its kind. Basically, it is not too difficult to know which items should be put into which body regions. Here we take some examples from the DeepFashion dataset and visualize them in Fig. 2 to show some popular kinds of clothing items. The PAO shows two properties of attributes, which are inner group correlation and inter group correlation. These two properties help us in designing the deep model as follows:

• Firstly, when training a deep attribute recognition model, global features are usually used to predict all attributes. But, in real life, people just need to see a local region to find out the attributes related to this region. For example, we can know if a man is wearing a hat or not by looking at his head, and do not need to look at the other regions. This is the inter group correlation between attributes. The ontology helps us to see which attributes should not go together and therefore should not be predicted from the same local features.
• Secondly, there are many attributes having a co-appearance relation to each other. For example, a person having a beard is usually a man, so the two attributes Is Male and Having beard are usually zeros or ones together. This is the inner group correlation. The ontology helps us to group these attributes into the same classes, and therefore the deep model should predict them from the same local features.

Clothing attributes are numerous and varied. Following Ly et al., we choose six types of clothing attributes to demonstrate our ontology, and divide them into two groups: i) general attributes, which are attributes that most items would have, including color, texture and shape; and ii) specific attributes, which are attributes that only exist on some items. Besides, we also show some facial attributes, which can be recognized from the head position. Some examples of clothing attributes and facial attributes are shown in Fig. 3 and Fig. 4, respectively.
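As a rough illustration of how a region-grouped ontology can drive the design of per-region prediction branches, the sketch below stores a simplified two-level view (region to attribute groups) and counts how many binary outputs each region would need. The grouping is copied from Table I of this paper (the Market1501-attribute PAO); the Categories level is left out, and the names `PAO_MARKET1501`, `outputs_per_region` and `region_of` are illustrative assumptions, not part of the authors' implementation.

```python
# Illustrative two-level view of a pedestrian attribute ontology:
# region -> list of (attribute group, number of binary outputs).
# The grouping follows Table I; the Categories level is omitted for brevity.
PAO_MARKET1501 = {
    "head": [("gender", 1), ("hair length", 1), ("wearing hat", 1)],
    "body": [("carrying backpack", 1), ("carrying handbag", 1), ("carrying bag", 1)],
    "upper": [("sleeve length", 1), ("color of upper clothing", 8)],
    "lower": [
        ("length of lower clothing", 1),
        ("type of lower clothing", 1),
        ("color of lower clothing", 8),
    ],
    "foot": [],
}


def outputs_per_region(ontology):
    """Return how many binary attributes each region's branch must predict."""
    return {
        region: sum(count for _, count in groups)
        for region, groups in ontology.items()
    }


def region_of(ontology, attribute_group):
    """Find the region that owns an attribute group (None if unknown)."""
    for region, groups in ontology.items():
        if any(name == attribute_group for name, _ in groups):
            return region
    return None


if __name__ == "__main__":
    sizes = outputs_per_region(PAO_MARKET1501)
    print(sizes)                # binary outputs per region
    print(sum(sizes.values()))  # total number of binary attributes
    print(region_of(PAO_MARKET1501, "wearing hat"))
```

Keeping the grouping in one data structure means that adding an attribute later only changes the ontology entry, which matches the extensibility argument made for the PAO.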
Fig. 1. Our Pedestrian Attribute Ontology.

Fig. 2. Some kinds of Clothing Items Extracted from the PAO.

In summary, the Pedestrian Attribute Ontology is a hierarchical semantic tree, in which attributes are classified into groups. It not only can improve the learning process of the deep model (in comparison to models without it), but also makes it easy to add more attributes if necessary in the future. In the next sub-section, we rely on this ontology to design an effective deep model for the attribute recognition task.

1) Models: Our system is based on two models: a Person Deep Global Features Learning model and a Pedestrian Attribute Learning model. The former is trained to extract deep global feature vectors and the latter is trained to extract attribute vectors. Since deep features achieve high performance on many tasks in computer vision, they should not be ignored in our system. However, deep features do not contain semantic information. With only one input image, we cannot understand what the deep features mean, but with attribute features, we can know which attributes exist on the persons in the input images. Therefore, the two kinds of features can mutually support each other to get high performance in our system.

a) Person Deep Global Features Learning model: Since we are trying to show by experiments that attribute prediction can help improve Re-ID results, we build just a simple person deep global features learning model instead of using a complex architecture like other works. Concretely, we transfer the 50-layer Residual Network [32], which was pretrained on the well-known image classification dataset ImageNet. In our architecture, we remove the last 1000-unit linear layer and append a 1x1 convolutional layer to reduce the dimension from 2048 down to 256. This final 256-D vector is the feature vector of the input bounding box, which is then used for matching in the inference stage. Fig. 5 shows our Re-ID model architecture in the offline phase.

Two images are of the same person when their corresponding deep features have high similarity. To train the model to achieve this goal, in the training stage, we use Triplet Loss [42] as our loss function. This function has been used by many Re-ID models in particular and image retrieval models in general. Its goal is to learn the similarity between inputs with the same ID and the divergence between inputs with different IDs. Equation (1) shows the formula of this function. Once the training process is finished, the Euclidean distance between feature vectors of images of the same person ($f_a$ and $f_p$) should be less than that between feature vectors of images of different persons ($f_a$ and $f_n$) by a margin $m$.

$$\mathrm{TripletLoss} = \max\left(0,\ \lVert f_a - f_p \rVert - \lVert f_a - f_n \rVert + m\right) \qquad (1)$$
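To make Equation (1) concrete, the following minimal sketch evaluates the triplet loss for a single (anchor, positive, negative) triple of plain Python vectors. It assumes the norm in Equation (1) is the Euclidean distance, as the surrounding text suggests; the function names and the toy 2-D vectors are illustrative only, and a real training loop would use a deep learning framework and batched 256-D features.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_a, f_p, f_n, margin):
    """Equation (1): max(0, ||f_a - f_p|| - ||f_a - f_n|| + m)."""
    return max(0.0, euclidean(f_a, f_p) - euclidean(f_a, f_n) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    near = [0.0, 1.0]   # distance 1 from the anchor
    far = [3.0, 4.0]    # distance 5 from the anchor

    # Positive is already closer than the negative by more than the margin.
    print(triplet_loss(anchor, near, far, margin=0.5))  # 0.0
    # Roles swapped: the "positive" is far away, so the loss is positive.
    print(triplet_loss(anchor, far, near, margin=0.5))  # 4.5
```

The loss is zero exactly when the negative is farther from the anchor than the positive by at least the margin, which is the condition the paragraph above describes for a trained model.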

Fig. 3. Some Clothing Attributes with their Values, Extracted from the PAO.

Fig. 4. Some Facial Attributes with their Values, Extracted from the PAO.

Fig. 5. Person Deep Global Features Learning Model. The Inputs of each Timestep Are a Triple of Images, Including Anchor, Positive and Negative one. The Model Parameters are Shared between them.

In the inference stage, for an input image, its corresponding deep feature vector is extracted and then compared to the pre-extracted vectors of the gallery images. The nearest gallery image, i.e. the one with the smallest Euclidean distance, is chosen as the match for the input image, i.e. as the same person. With the above strategy, the more similar two individuals are, the smaller the Euclidean distance between their corresponding deep feature vectors. However, this leads to another drawback of deep features: mismatched results arise when different persons have the same appearance (such as the same clothes or the same pose). In these cases, we need more detailed information to distinguish them, instead of global deep features only. Pedestrian attributes are semantic information at a fine-grained level. Therefore, attributes are used in our system to filter out false positive candidate images in those cases and get better performance for the Re-ID system.

b) Person Attribute Learning model: The Pedestrian Attribute Learning model is designed based on the Pedestrian Attribute Ontology. To make sure that the inner group and inter group correlations from the ontology can be leveraged in the deep model, we build the PAL model with three levels: regions, categories and attributes. Firstly, a middle-layer global feature map is split into four equal parts, which means each one occupies 25% of the height of the person body. From top to bottom, the parts are head, upper body, lower body and foot region, respectively. The body region is the merger of upper body and lower body. Secondly, the split features of each region are learnt in a corresponding local sub-network to extract information relevant to that local part. And finally, the local features of each part are fed into multiple smaller branches, which then predict the attributes in the PAO.

In brief, our PAL model has three parts: i) a global part that learns common features; and multiple attribute learning sub-networks, consisting of ii) local parts that learn local features, and iii) attribute parts that learn specific features and predict a group of suitable attributes. Fig. 6 shows our PAL model and Fig. 7 shows a sub-network from our PAL model, which is taken from the head region. We also transfer the 18-layer Residual Network [32] into our architecture. ResNet18 has 5 layer groups. We apply conv_0, conv_1 and conv_2 to the first part, conv_3 to the second one and conv_4 to the last one, which is also shown in Fig. 6 and Fig. 7.

Fig. 6. Our Pedestrian Attribute Learning Model.

Fig. 7. A Sub-Network from our Pedestrian Attribute Multi-Task Learning Model, which is Taken from the Head Region.
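The horizontal split used by the Pedestrian Attribute Learning model (four equal-height stripes for head, upper body, lower body and foot, with the body region formed by merging upper and lower body) can be sketched as follows. This is a simplified, single-channel illustration on a list of rows; the function name `split_into_regions` and the requirement that the height be divisible by four are assumptions made for the example, and an actual model would slice a multi-channel tensor along its height axis.

```python
def split_into_regions(feature_map):
    """Split a feature map (list of rows, top to bottom) into body regions.

    Each of head, upper, lower and foot takes 25% of the height; the body
    region is the concatenation of the upper and lower stripes.
    """
    height = len(feature_map)
    if height == 0 or height % 4 != 0:
        raise ValueError("height must be a positive multiple of 4")
    step = height // 4
    head, upper, lower, foot = (
        feature_map[i * step:(i + 1) * step] for i in range(4)
    )
    return {
        "head": head,
        "upper": upper,
        "lower": lower,
        "foot": foot,
        "body": upper + lower,
    }


if __name__ == "__main__":
    # Toy 8x3 map whose row values equal the row index.
    toy_map = [[row] * 3 for row in range(8)]
    regions = split_into_regions(toy_map)
    for name, stripe in regions.items():
        print(name, [r[0] for r in stripe])
```

Each stripe would then be passed to the local sub-network of its region, so that an attribute such as "wearing hat" is predicted only from the head stripe.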

Outputs of the model are a vector whose number of parts: training set has 12936 images of 751 persons and test set
dimensions is equal to number of binary attributes. In the has 19281 images of the other 750 persons. From the test set,
learning stage, we use Binary Cross Entropy function for each from 1 to 6 random images of each individual are chosen to
attribute, and get average of those of all attributes as our final form a query set, and remaining are the gallery set. Re-ID task
loss function. With each M-dimension output vector ̂i and a is performed by matching images in query set with images in
M-dimension ground truth vector ̂i, where M is the number of gallery set, and return top-k most similar ones.
binary attributes, then the average Binary Cross Entropy loss
For the pedestrian attribute recognition task, we use this
function formula is shown in Equation (2):
Market1501-attribute data set, which is basically the
AvgBCELoss = – 1/M (aijlog ̂ij + (1 – )log(1– ̂ (2) Market1501 data set, but the attribute annotations are proposed
by other authors Lin et al. There are total 27 attributes, but we
2) Handling imbalanced data: One of the important only use 25 binary attributes which have proportions of
improvements of our method is that we incorporate an positive samples rate more than 0.5%. Table I show these
Imbalance Data Solver into our Person Re-identification attributes which are clustered into local positions by us.
system. Imbalanced data is a common problem in B. Using only Person Deep Global Features Learning Model
classification tasks. There are many ways to handle it, such as:
First, we evaluate our Person Deep Global Features
oversampling/ undersampling, use weighted loss function, etc. Learning (PDGFL) model when do not use complementary
In the task of multi-label classification like pedestrian attributes attribute information. We train our PDGFL model in 60
recognition, we cannot increase or decrease the amount of epochs, by Adam optimizer algorithm, with default hyper-
samples because it will affect all attributes, and weighting loss parameters, except the learning rate is set to 3.10−4. Input
will lead to a bunch of hyper-parameters must to be tuned. images are rescaled to 192x96 before fed into the network.
Therefore, we choose the way of adjusting the thresholds of Feature vector dimensions are 256. We evaluate 3 version of
binary attributes, instead of using common values of 0.5. Residual Network [32]: ResNet18, ResNet50 and ResNet101.
Concretely, for each attribute, we perform a grid search to Results are reported in mAP, top-1, top-5 and top-10 accuracy,
choose a best threshold from a list of predefined candidate which are shown in Table II. In our experiments, ResNet101
thresholds. The best one is which we get the highest value of achieves highest performance. ResNet50’s result is less than
ResNet101’s by a very small gap, but it has only a half of
Matthews correlation coefficient [43] when using it to convert
from probability to binary prediction. The Matthews correlation coefficient (MCC) [43] is a widely used metric for measuring the quality of a binary classifier. Its formula takes into account the four basic counts of a classification problem: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as shown in Equation (3).

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (3)$$

The Matthews correlation coefficient ranges from -1 to 1:

• It reaches its maximum value of 1 when both FP and FN are zero, which means that no sample is mispredicted and the classifier output matches the ground truth exactly.

• Conversely, it reaches its minimum value of -1 when both TP and TN are zero, which means that the classifier output is completely opposite to the ground truth.

• It is 0 when the prediction is no better than random.

Therefore, we choose the threshold that makes the coefficient as high as possible.

IV. RESULTS AND EVALUATION

A. Data Set
We demonstrate our proposed method on a large Re-ID data set named Market1501 [44]. This data set contains 32668 bounding-box images of 1501 persons, captured by six different cameras in front of a supermarket near Tsinghua University, China. The authors divided it into three

number of layers compared to ResNet101. This means that even a more complex and deeper model still cannot distinguish individuals with similar appearance.

TABLE. II. QUERY RESULT OF DIFFERENT MODELS WITHOUT COMPLEMENTARY ATTRIBUTE INFORMATION

| Model | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|
| ResNet18 | 77.7% | 90.6% | 93.5% | 57.9% |
| ResNet50 | 81.4% | 91.8% | 94.7% | 65.1% |
| ResNet101 | 82.0% | 93.1% | 95.6% | 66.0% |

C. Attribute Recognition Model
The attribute recognition model is the principal component of our proposed method. We demonstrate our model in two scenarios: without/with Ontology, and without/with Matthews correlation coefficient (both using Ontology). In all attribute recognition experiments, we use the same train-validation-test split and optimizer as in the Re-ID experiment, except that the number of epochs is now set to 10.

Firstly, we re-build the attribute ontology on the Market1501-attribute data set. The ontology is shown in Fig. 8.

TABLE. I. 25 ATTRIBUTES FROM MARKET1501-ATTRIBUTE DATA SET

| Position | Attribute |
|---|---|
| head | gender, hair length, wearing hat |
| body | carrying backpack, carrying handbag, carrying bag |
| upper | sleeve length, 8 colors of upper clothing |
| lower | length of lower clothing, type of lower clothing, 8 colors of lower clothing |
| foot | none |
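As an illustration of the grouping in Table I, the position-to-attribute structure can be written as a small mapping. This is only a sketch under stated assumptions: the names `PAO_MARKET1501` and `branch_sizes` are hypothetical, the individual attribute names are taken from the rows of Table III, and nothing here is the authors' implementation.

```python
# Hypothetical encoding of the position -> attribute grouping of Table I.
# Attribute names follow the rows of Table III; this is not the authors' code.
UP_COLORS = ["black", "white", "red", "purple", "yellow", "gray", "blue", "green"]
DOWN_COLORS = ["black", "white", "pink", "yellow", "gray", "blue", "green", "brown"]

PAO_MARKET1501 = {
    "head": ["gender", "hair length", "wearing hat"],
    "body": ["backpack", "bag", "handbag"],
    "upper": ["sleeve length"] + [f"up {c}" for c in UP_COLORS],
    "lower": ["lower length", "lower type"] + [f"down {c}" for c in DOWN_COLORS],
    "foot": [],  # Table I lists no attribute for this position
}


def branch_sizes(ontology):
    """Number of binary attribute outputs per position that has attributes."""
    return {pos: len(attrs) for pos, attrs in ontology.items() if attrs}


if __name__ == "__main__":
    sizes = branch_sizes(PAO_MARKET1501)
    print(sizes)
    print("total attributes:", sum(sizes.values()))
```

Under this reading, each non-empty position would correspond to one local branch whose output size is the number of attributes listed for it.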


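The threshold selection rule based on Equation (3) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function names, the toy scores, the `score >= threshold` decision rule, and the convention of returning 0 when the denominator vanishes are all assumptions. Only the formula and the candidate grid from 0.01 to 0.99 with step 0.01 come from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts (Equation 3)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when one of the marginal counts is zero.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def confusion(scores, labels, threshold):
    """Count (TP, TN, FP, FN) when predicting positive for score >= threshold."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def best_threshold(scores, labels):
    """Scan 0.01, 0.02, ..., 0.99 and keep the first threshold with the highest MCC."""
    candidates = [k / 100 for k in range(1, 100)]
    return max(candidates, key=lambda t: mcc(*confusion(scores, labels, t)))


if __name__ == "__main__":
    # Toy, imbalanced validation outputs for a single attribute (made up).
    scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.45]
    labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
    t = best_threshold(scores, labels)
    print(t, mcc(*confusion(scores, labels, t)))
```

With these toy scores a fixed cut-off of 0.5 would predict no positives at all, which is the kind of failure on rare attributes that a per-attribute threshold is meant to avoid.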
Fig. 8. The PAO Implemented on Market1501-Attribute Data Set.

Secondly, we evaluate our proposed attribute recognition model in four versions:

• Baseline: This is simply a ResNet18 network whose last 1000-unit linear layer is replaced by a 25-unit linear layer. In other words, this model predicts all attributes from a single global feature.

• ONTO: This is the network proposed above, which has multiple branches that predict attributes from the appropriate local features. In this version, we do not yet handle the imbalance problem.

• ONTO + MCC: This is the same as the ONTO version, but we handle the data imbalance problem by applying the Matthews correlation coefficient to adjust the thresholds. Threshold candidates are predefined as an arithmetic progression from 0.01 to 0.99 with a step of 0.01. The best thresholds are then used in the test phase to convert continuous outputs to binary values.

• ONTO + MCC + LM: This version has 4 models corresponding to 4 parts: head, body, upper and lower. We observed that, compared with training a single multi-task model containing all local branches, training a separate model for each local branch makes it easier to add new attributes and re-train in the future.

Table III shows the results of the 4 versions. We make the following observations:

• With Ontology: As shown in columns Baseline and ONTO, the average F1-score increases by 12% when using the complementary ontology, compared to a plain network. The F1-scores of 24 of the 25 attributes also increase. This shows that using local features, or concretely using the ontology, is more powerful than using global features in attribute prediction.

• With Matthews correlation coefficient: As shown in columns ONTO and ONTO + MCC, the average F1-score increases by a further 13% when using MCC. The F1-scores of all attributes increase as well, some of them by a large margin. For example, the attribute "wearing hat" has the largest growth, 31%, although its positive rate is only 2.6%. This shows that handling the imbalance problem by adjusting thresholds significantly improves the quality of a multi-label classifier.

• With Local multi-task training: As shown in columns ONTO + MCC and ONTO + MCC + LM, the average F1-score again increases by 13% when training multiple models, one for each local part. The F1-scores of all attributes increase as well. Moreover, several attributes reach results greater than 90%, including up white, up red, up yellow, lower type, and down black. This shows that training separate models for each position can also improve the prediction result.

TABLE. III. RESULTS OF DIFFERENT VERSIONS OF PROPOSED MODEL (F1-SCORES)

| Position | Attribute | Baseline | ONTO | ONTO + MCC | ONTO + MCC + LM |
|---|---|---|---|---|---|
| Head | gender | 40.26 | 71.45 | 78.03 | 88.91 |
| Head | hair length | 51.32 | 65.62 | 73.43 | 86.38 |
| Head | wearing hat | 6.74 | 22.82 | 53.42 | 68.26 |
| Body | backpack | 45.89 | 53.06 | 70.12 | 84.76 |
| Body | bag | 43.57 | 46.94 | 53.54 | 77.10 |
| Body | handbag | 11.40 | 32.47 | 54.57 | 64.77 |
| Upper | sleeve length | 11.96 | 44.86 | 66.57 | 74.94 |
| Upper | up black | 47.63 | 69.53 | 83.06 | 88.75 |
| Upper | up white | 73.58 | 74.05 | 79.58 | 91.92 |
| Upper | up red | 71.47 | 73.88 | 88.93 | 90.05 |
| Upper | up purple | 20.02 | 47.74 | 67.91 | 84.99 |
| Upper | up yellow | 72.86 | 76.80 | 81.37 | 94.16 |
| Upper | up gray | 21.63 | 52.53 | 68.89 | 82.32 |
| Upper | up blue | 45.84 | 58.23 | 68.91 | 84.92 |
| Upper | up green | 41.53 | 44.68 | 64.22 | 84.99 |
| Lower | lower length | 59.56 | 74.41 | 80.85 | 83.26 |
| Lower | lower type | 78.78 | 82.20 | 89.20 | 92.16 |
| Lower | down black | 74.19 | 83.67 | 87.85 | 91.19 |
| Lower | down white | 48.63 | 49.35 | 69.58 | 78.10 |
| Lower | down pink | 62.77 | 61.80 | 66.79 | 89.07 |
| Lower | down yellow | 0.21 | 11.08 | 31.34 | 43.08 |
| Lower | down gray | 51.20 | 57.92 | 64.61 | 78.84 |
| Lower | down blue | 58.77 | 61.43 | 64.04 | 81.34 |
| Lower | down green | 29.29 | 44.52 | 54.96 | 76.30 |
| Lower | down brown | 43.85 | 53.61 | 68.46 | 82.39 |
| Average | | 43.32 | 56.59 | 69.20 | 81.72 |

D. Attribute Filtering for Person Re-Identification
1) Comparison with the case of using only global deep features: According to the results of the attribute recognition models in Table III, not all attributes reach remarkable scores. Therefore, instead of using all of them, we choose only the five attributes with the highest F1-scores to improve Re-ID performance. Concretely, for each query image in the query set of the Market1501 data set:

a) Firstly: We use each of these attributes to filter out candidates in the gallery set that mismatch the attributes of the query image.


b) Secondly: The remaining candidates are used to find the K nearest neighbors by comparing the Euclidean distance between deep features extracted by the above ResNet50 Re-ID model.

The 5 chosen attributes are: up red, up white, up yellow, lower type and down black. Table IV shows the attribute pre-filtering results in mAP and top-K accuracy. As shown in Table IV, pre-filtering by attribute down black gives the best results in top-K accuracy, and pre-filtering by attribute up red gives the best result in mAP. Across all 5 attributes, although pre-filtering improves the top-K accuracy values only slightly, the mAP values increase remarkably, by at least 9.3% compared to the case that does not use attributes in pre-filtering. This indicates that if better attribute recognition models become available in the future, so that more attributes are predicted accurately, Re-ID results would improve as well.

TABLE. IV. QUERY RESULTS BY ATTRIBUTES PRE-FILTERING USING 5 ATTRIBUTES HAVING BEST PREDICTION

| Attribute Pre-filtering | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|
| None | 81.4% | 91.8% | 94.7% | 65.1% |
| up red | 83.3% | 93.9% | 92.6% | 75.2% |
| up white | 83.6% | 93.7% | 95.8% | 74.4% |
| up yellow | 83.8% | 93.6% | 95.9% | 74.9% |
| lower type | 84.2% | 94.1% | 96.0% | 74.7% |
| down black | 85.2% | 95.3% | 96.9% | 74.8% |

2) Comparison with the case of using global and local deep features: Most previous works use attributes as an auxiliary task for enhancing the global/local deep features, and do not employ attribute prediction in the test stage. Therefore, we perform experiments to compare two cases: using global and local deep features, versus using global deep features and attribute information. The combination of global and local deep features in our experiments is extracted as follows:

a) For each position: We take the output of the local part in our above network (Fig. 6), which is a feature map.

b) Then: We apply a max-pooling operation to convert the feature map to a feature vector. This is the local deep feature vector of the corresponding position.

c) Finally: We concatenate the deep global features and the local feature of one of the 4 regions (head, body, upper, lower) to form a single deep feature vector, and then use it to perform the matching process in the test stage.

The combination of deep global features and attribute information is exactly the pre-filtering strategy of the previous section. For each position, we compare two cases: i) using all attributes of that position; and ii) using only the attribute with the best prediction for that position. The comparison results for all 4 positions are shown in Table V; the arrows indicate whether the result of using attribute information is higher or lower than that of using local features, and the bold values are the highest among the 3 cases for that position.

As shown in Table V, when all attributes are used in the filtering step, the top-K accuracies and mAP in all positions are lower than those of the combination of global and local features. However, the opposite holds when only the single best attribute is used. In the position head, attribute gender gives higher top-1, top-5 and mAP (by 3.2, 1.4 and 4.4 points respectively), and its top-10 accuracy is only 0.1% lower than with complementary local features. In the position body, attribute backpack falls behind local features in top-1 accuracy (78.3% versus 80.4%) while being slightly higher on the other metrics, which may be because its prediction F1-score is only moderate, about 84%. In the positions upper and lower, using the corresponding best attribute gives higher performance than deep local features on every metric. Fig. 9 shows some samples whose query results are rearranged and improved by attribute filtering.

TABLE. V. COMPARISON BETWEEN COMBINATION OF GLOBAL AND LOCAL DEEP FEATURES AND COMBINATION OF GLOBAL DEEP FEATURES AND ATTRIBUTE INFORMATION

| Position | Case | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|---|
| Head | Deep Global + Local Feature | 78.4% | 89.7% | 93.9% | 56.8% |
| Head | Deep Global feature + All local attributes | 75.1% | 87.2% | 92.6% | 53.3% |
| Head | Deep Global feature + Only attribute gender | 81.6% | 91.1% | 93.8% | 61.2% |
| Body | Deep Global + Local Feature | 80.4% | 89.7% | 94.9% | 62.0% |
| Body | Deep Global feature + All local attributes | 72.7% | 79.6% | 85.1% | 50.3% |
| Body | Deep Global feature + Only attribute backpack | 78.3% | 91.4% | 95.5% | 63.1% |
| Upper | Deep Global + Local Feature | 80.3% | 91.5% | 94.8% | 61.7% |
| Upper | Deep Global feature + All local attributes | 77.7% | 89.4% | 92.9% | 59.9% |
| Upper | Deep Global feature + Only up yellow | 83.8% | 93.6% | 95.6% | 74.9% |
| Lower | Deep Global + Local Feature | 78.3% | 90.9% | 94.5% | 61.1% |
| Lower | Deep Global feature + All local attributes | 69.1% | 77.4% | 82.5% | 47.8% |
| Lower | Deep Global feature + Only attribute lower type | 84.2% | 94.1% | 96.0% | 74.7% |
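The two-step pre-filtering strategy of Section IV-D (discard gallery candidates whose predicted attribute mismatches the query, then rank the survivors by Euclidean distance between deep features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `prefilter_and_rank`, the tuple layout of the gallery, and the toy 2-D features are hypothetical.

```python
import math


def prefilter_and_rank(query_feat, query_attr, gallery, k):
    """Drop gallery items whose predicted attribute differs from the query's,
    then return the ids of the k nearest survivors by Euclidean distance.

    gallery: list of (item_id, feature_vector, predicted_attribute) tuples.
    """
    survivors = [(gid, feat) for gid, feat, attr in gallery if attr == query_attr]
    survivors.sort(key=lambda item: math.dist(query_feat, item[1]))
    return [gid for gid, _ in survivors[:k]]


if __name__ == "__main__":
    # Toy 2-D features and one binary attribute (for example "down black"); made up.
    gallery = [
        ("g1", [0.1, 0.0], 0),  # closest overall, but the attribute mismatches
        ("g2", [0.3, 0.1], 1),
        ("g3", [0.9, 0.8], 1),
        ("g4", [0.2, 0.2], 1),
    ]
    print(prefilter_and_rank([0.0, 0.0], 1, gallery, k=2))
```

In this toy case the nearest gallery item is removed by the filter before ranking, which shows how a wrong attribute prediction can also discard a true match; this is consistent with the observation above that filtering on all attributes of a position performs worse than filtering on the single best-predicted one.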


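The construction of the combined global and local deep feature (take the local feature map, max-pool it to a vector, concatenate it with the global feature) can be sketched as below. This is an illustration only: the paper does not state the pooling window, so global max-pooling over the whole spatial extent is an assumption here, the function names are hypothetical, and plain nested lists stand in for the tensors a deep learning framework would use.

```python
def global_max_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector
    by taking the maximum activation of each channel."""
    return [max(max(row) for row in channel) for channel in feature_map]


def combine_global_local(global_feat, local_feature_map):
    """Concatenate the global feature vector with the pooled local vector."""
    return list(global_feat) + global_max_pool(local_feature_map)


if __name__ == "__main__":
    # Toy local feature map with 2 channels of size 2 x 2 (made up).
    local_map = [
        [[0.1, 0.5], [0.3, 0.2]],
        [[-1.0, -0.2], [-0.7, -0.9]],
    ]
    global_feat = [1.0, 2.0, 3.0]
    print(combine_global_local(global_feat, local_map))
```

The resulting vector has as many entries as the global feature plus one entry per local channel, and it would then be compared between query and gallery images with the same distance used for the global features alone.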
Fig. 9. Some Samples that Query Results are Improved by Attribute Filtering.

3) Comparison with other methods: We use the attribute "down black", which gives the best improvement in Re-ID results, for the comparison with related works. As shown in Table VI, our method achieves higher performance than the other works in mAP, top-5 and top-10 accuracy. Note that these works only use attributes as auxiliary information in the learning stage. The results show that using an attribute as a pre-filter in the inference stage can achieve equivalent or even better performance. The methods presented in [24], [26], [28] were selected for comparison with our method for the following reason: all of them use pedestrian attributes in Person Re-Identification in different ways, but none of them uses attribute pre-filters or considers data imbalance. We would like to show that our method can overcome their drawbacks listed in Section II (part C).

TABLE. VI. COMPARISON WITH OTHER METHODS ON MARKET1501 DATA SET

| Methods | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|
| Schumann and Stiefelhagen [24] | 83.61% | 92.61% | 95.34% | 62.6% |
| Lin et al. [26] | 84.29% | 93.2% | 95.19% | 64.67% |
| Zhang and Xu [28] | 86.58% | 94.48% | 96.73% | 68.08% |
| Ours, pre-filtering by attribute down black | 85.2% | 95.3% | 96.9% | 74.8% |

V. CONCLUSIONS
In this paper, we present a new method that uses semantic information, namely pedestrian attributes, to improve person re-identification performance. Our method is a unified Re-ID system consisting of two main modules: 1) Pedestrian Attributes Learning model (PAO + Local MDCNN + IDS); 2) Person Re-ID model (Deep Global Features based Person Re-ID + Pedestrian Attribute based Person Re-ID). We show that the performance of our Re-ID system is better than that of some state-of-the-art Re-ID methods. In the future, if more powerful attribute recognition models are proposed, the Re-ID task can be driven to better performance, and the Re-ID system at semantic level will be integrated with Visual Question Answering (VQA) to improve the intelligence of video surveillance systems.

ACKNOWLEDGMENT
This research is funded by Viet Nam National University Ho Chi Minh City (VNUHCM) under grant no. B2018-18-01. We thank AIOZ Pte Ltd. for the valuable support on internship cooperation.

REFERENCES
[1] S. Gong, M. Cristani, S. Yan, C. C. Loy, “Person Re-Identification,” Springer Publishing Company, Incorporated, 2014.
[2] L. Bazzani, M. Cristani, V. Murino, “Symmetry-driven accumulation of local features for human characterization and re-identification,” Computer Vision and Image Understanding 2013, 117, pp. 130–144. doi:10.1016/j.cviu.2012.10.008.
[3] Y. Zhang, S. Li, “Gabor-LBP Based Region Covariance Descriptor for Person Re-identification,” Image and Graphics (ICIG), 2011 Sixth International Conference on 2011. doi:10.1109/ICIG.2011.40.
[4] B. Prosser, W. S. Zheng, S. Gong, T. Xiang, “Person Re-Identification by Support Vector Ranking,” 2010, Vol. 2, pp. 1–11. doi:10.5244/C.24.21.
[5] M. Gou, F. Xiong, O. Camps, M. Sznaier, “Person Re-Identification Using Kernel-Based Metric Learning Methods,” 2014. doi:10.1007/978-3-319-10584-0_1.
[6] W. Ayedi, H. Snoussi, M. Abid, “A fast multi-scale covariance descriptor for object re-identification,” Pattern Recognition Letters - PRL 2011, 33. doi:10.1016/j.patrec.2011.09.006.
[7] S. Bąk, E. Corvee, F. Bremond, M. Thonnat, “Multiple-shot Human Re-Identification by Mean Riemannian Covariance Grid,” 2011, pp. 179–184. doi:10.1109/AVSS.2011.6027316.
[8] S. Liao, Y. Hu, X. Zhu, S. Li, “Person re-identification by Local Maximal Occurrence representation and metric learning,” 2015, pp. 2197–2206. doi:10.1109/CVPR.2015.7298832.
[9] K. Weinberger, J. Blitzer, K. Saul, “Distance Metric Learning for Large Margin Nearest Neighbor Classification,” 2006; Vol. 10.
[10] M. Guillaumin, J. J. Verbeek, C. Schmid, “Is that you? Metric learning approaches for face identification,” 2009 IEEE 12th International Conference on Computer Vision 2009, pp. 498–505.


[11] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, “Large Scale Metric Learning from Equivalence Constraints,” 2012. doi:10.1109/CVPR.2012.6247939.
[12] L. Zheng, Y. Yang, A. G. Hauptmann, “Person Re-identification: Past, Present and Future,” ArXiv 2016, abs/1610.02984.
[13] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, Q. Tian, “Pose-Driven Deep Convolutional Model for Person Re-identification,” 2017 IEEE International Conference on Computer Vision (ICCV) 2017, pp. 3980–3989.
[14] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, W. Gao, “Attention driven person re-identification,” Pattern Recognition 2019, 86, pp. 143–155.
[15] D. Li, X. Chen, Z. Zhang, K. Huang, “Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification,” 2017.
[16] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, J. Sun, “AlignedReID: Surpassing Human-Level Performance in Person Re-Identification,” ArXiv 2017, abs/1711.08184.
[17] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, “Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline),” ECCV, 2017.
[18] X. Bai, M. Yang, T. Huang, Z. Dou, R. Yu, Y. Xu, “Deep-Person: Learning Discriminative Deep Features for Person Re-Identification,” 2017.
[19] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, “Learning Discriminative Features with Multiple Granularities for Person Re-Identification,” ACM Multimedia, 2018.
[20] H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, “End-to-End Comparative Attention Networks for Person Re-Identification,” IEEE Transactions on Image Processing 2017, 26, pp. 3492–3506.
[21] A. Rahimpour, L. Liu, A. Taalimi, Y. Song, H. Qi, “Person re-identification using visual attention,” 2017 IEEE International Conference on Image Processing (ICIP) 2017, pp. 4242–4246.
[22] S. Zhou, J. Wang, D. Meng, Y. Liang, Y. Gong, N. Zheng, “Discriminative Feature Learning with Foreground Attention for Person Re-Identification,” IEEE Transactions on Image Processing 2018.
[23] D. Ouyang, Y. Zhang, J. Shao, “Video-based person re-identification via spatio-temporal attentional and two-stream fusion convolutional networks,” Pattern Recognition Letters 2019, 117, pp. 153–160.
[24] A. Schumann, R. Stiefelhagen, “Person Re-identification by Deep Learning Attribute-Complementary Information,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2017, pp. 1435–1443.
[25] C. Su, S. Zhang, J. Xing, W. Gao, Q. Tian, “Deep Attributes Driven Multi-Camera Person Re-identification,” ArXiv 2016, abs/1605.03259.
[26] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Y. Yang, “Improving Person Re-identification by Attribute and Identity Learning,” ArXiv 2017, abs/1703.07220.
[27] N. McLaughlin, J. M. del Rincón, P. C. Miller, “Person Reidentification Using Deep Convnets With Multitask Learning,” IEEE Transactions on Circuits and Systems for Video Technology 2017, 27, pp. 525–539.
[28] G. Zhang, J. Xu, “Person Re-identification by Mid-level Attribute and Part-based Identity Learning,” ACML, 2018.
[29] Y. Chen, S. Duffner, A. Stoian, J. Y. Dufour, A. Baskurt, “Pedestrian Attribute Recognition with Part-based CNN and Combined Feature Representations,” VISIGRAPP, 2018.
[30] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Neural Information Processing Systems 2012, 25. doi:10.1145/3065386.
[31] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv 1409.1556 2014.
[32] K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, pp. 770–778.
[33] S. Z. Chen, C. C. Guo, J. Lai, “Deep Ranking for Person Re-Identification via Joint Representation Learning,” IEEE Transactions on Image Processing 2016, 25, pp. 2353–2367.
[34] N. McLaughlin, J. M. del Rincón, P. C. Miller, “Recurrent Convolutional Network for Video-Based Person Re-identification,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, pp. 1325–1334.
[35] C. Mao, Y. Li, Z. Zhang, Y. Zhang, X. Li, “Pyramid Person Matching Network for Person Re-identification,” ACML, 2017.
[36] H. Shi, X. Zhu, S. Liao, Z. Lei, Y. Yang, S. Z. Li, “Constrained Deep Metric Learning for Person Re-identification,” ArXiv 2015, abs/1511.07545.
[37] S. Wu, Y. C. Chen, X. Li, A. Wu, J. You, W. S. Zheng, “An enhanced deep feature representation for person re-identification,” 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) 2016, pp. 1–8.
[38] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, M. Shah, “Human Semantic Parsing for Person Re-identification,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018, pp. 1062–1071.
[39] N. Ly, T. Do, B. Nguyen, “Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning,” Computational Intelligence and Neuroscience 2019, 2019, pp. 1–40. doi:10.1155/2019/1483294.
[40] Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, “DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] T. R. Gruber, “Toward principles for the design of ontologies used for knowledge sharing,” Int. J. Hum.-Comput. Stud. 1993, 43, pp. 907–928.
[42] F. Schroff, D. Kalenichenko, J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, pp. 815–823.
[43] S. Boughorbel, F. Jarray, M. El-Anbari, “Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric,” PloS one, 2017.
[44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, “Scalable Person Re-identification: A Benchmark,” Computer Vision, IEEE International Conference on, 2015.