Paper 64-Person Re Identification System at Semantic
Abstract—Person Re-Identification (Re-ID) is a very important task in video surveillance systems, with applications such as tracking people, finding people in public places, or analysing customer behavior in supermarkets. Although there have been many works to solve this problem, challenges remain, such as large-scale datasets, imbalanced data, viewpoint, and fine-grained data (attributes). Local features are not employed at the semantic level in the online stage of the Re-ID task, and the imbalanced data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: Pedestrian Attribute Ontology (PAO), Local Multi-task DCNN (Local MDCNN), and Imbalanced Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of PAO, Local MDCNN and IDS to exploit the inner-group correlations of attributes, to pre-filter mismatched candidates from the gallery set based on semantic information such as fashion attributes and facial attributes, and to solve the imbalanced data problem of attributes without adjusting the network architecture or using data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on Market1501 than some state-of-the-art Re-ID methods.
Keywords—Person Re-Identification (Re-ID); Pedestrian Attributes Ontology (PAO); Deep Convolution Neuron Network (DCNN); Multi-task Deep Convolution Neuron Network (MDCNN); Local Multi-task Deep Convolution Neuron Network (Local MDCNN); Imbalanced Data Solver (IDS)
I. INTRODUCTION
Re-ID is the problem of recognising and associating a person at different physical locations over time after the person has been previously observed visually elsewhere. Solving the Re-ID problem has gained a rapid increase in attention in both academic research communities and industrial laboratories in recent years. It has many applications, such as tracking people across cameras, image retrieval, or customer behavior analysis [1]. Because it relies on appearance features from input images, this problem suffers from the common challenges in visual recognition: illumination, pose variation, occlusion, and intra-class and inter-class variations. Early studies aim to make full use of hand-crafted visual features [2-8] or metric learning [2,4,5,9-11] to build the best descriptor for each person. These methods can solve one or some of the above challenges, but are computationally expensive and do not reach high accuracy. In recent years, with the growth of convolutional neural networks (CNNs) and carefully annotated benchmarks, CNN-based models which learn deep features from data outperform hand-crafted methods by a large margin and achieve remarkable accuracy [12]. To obtain more discriminative features to deal with the inter-class challenge, some works try to extract local features from local regions in different ways, such as pose normalization [13-15], part-based learning [16-19], or attention mechanisms [20-23]. Although these works achieve very high accuracy and mAP, they still employ only deep features, which do not contain semantic information and cannot be interpreted by humans.
Pedestrian Attribute Recognition is a task that predicts a number of predefined attributes describing a pedestrian. Similar to Re-ID, this task takes bounding boxes of pedestrians captured by cameras as inputs. Attributes are semantic information. They are extracted by an attribute learning model. They can be robust to challenges such as pose, lighting, and camera characteristics. For the Re-ID problem, facial attributes and clothing attributes are considered. Combining a large enough set of attributes can help improve the discrimination of Re-ID features. Furthermore, unlike low-level visual features or high-level deep features, attributes are easy for humans to understand [24]. Attributes can also be extended to a range of other applications, such as clothes retrieval and face retrieval. Most existing Re-ID studies use global features to predict all attributes [24-27]. However, most attributes appear in local positions, so global features are insufficient to recognize them. Some works notice this drawback and improve on it by dividing global features into local parts [28,29], but they still consider attributes as an auxiliary branch to enrich deep features.
In this work, we propose two simple CNN-based models, one for extracting deep global features, and the other for predicting pedestrian attributes. In the learning stage of the attribute recognition model (ARM), we split feature maps at a specific mid-level layer into multiple branches, corresponding to human body parts. Each branch uses a local feature map, which is horizontally split from the global feature map, to predict a group of corresponding relevant attributes. The attribute groups are taken from a predefined PAO, which helps bring the intra-class correlation of attributes into the learning process. Besides, we take into consideration the
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 2, 2020
imbalance of attributes and handle it by employing the Matthews correlation coefficient (MCC). In the inference stage, different from previous methods, for each query image, we first use attribute predictions to filter out mismatched candidates, and then the remaining ones are used to find the best match by deep global features.
The main contributions of this paper are as follows: 1) We propose the Pedestrian Attribute Ontology to guide the pedestrian attribute learning process and the Re-ID process; 2) We propose the Pedestrian Attribute Learning Model based on local multi-task learning; 3) We propose integrating an Imbalanced Data Solver based on MCC into the Re-ID system; 4) We propose a new Re-ID method based on deep global features and pedestrian semantic information.
II. RELATED WORKS
A. Hand-Crafted-Features-based Re-ID
Traditional approaches mainly focus on designing discriminative hand-crafted visual features. Color and texture are usually employed. In [2], RGB and HSV color vectors are extracted from input images and then fed into a Maximum Likelihood model to learn the image similarities. In another approach [3], Gabor filters are used to extract texture features. Covariances of these features are also employed by the Region Covariance Descriptor in [6,7]. In [8], a robust feature named Local Maximal Occurrence (LOMO) is proposed. LOMO is obtained by sliding a window over input images and taking the maximum values of different features from all patches under the window. Apart from proposing discriminative features, other works [5, 9-11] try to design an effective metric to learn the similarity and difference between images.
B. CNN-based Re-ID
CNNs have been widely employed in person re-identification due to their great performance in many different computer vision tasks [30–32]. Earlier works use global features extracted from a CNN to train a siamese network [33–35]. For example, [33] proposed a Deep Ranking model aiming to maximize the rank of the Euclidean distance of same-identity feature vectors. The authors of [34] employ a Recurrent Neural Network to make use of motion information for a more discriminative person description. The authors of [35] proposed a Pyramid Person Matching Network to learn the correspondence of misaligned components in image pairs. Attention mechanisms are applied in many models [20-23] to focus on salient parts and extract more useful and discriminative information. Recent studies start to consider part-level features as complementary features for their models because of their fine-grained information. Early part-based approaches apply predefined rigid grids on input images as local parts [36,37]. This way of partitioning is insufficient because person detection boxes are not always correct. As a very detailed way of partitioning, [38] trains a semantic parsing model to localize pixel-level body parts. A weighted sum layer is then used to fuse global and local features for identity classification. Extra pose estimators [13,14] or spatial constraints [15] are also utilized to normalize deformable pedestrian parts to obtain more robust features. In another way of learning local features [16-18], global features are horizontally pooled and then separated into vectors, each of which is considered as containing information of a corresponding body part. The authors of [17] use the same method but split global features into equal stripes at multiple granularities. Each stripe is then used separately for identity classification. Many of the above methods achieve remarkable performance in the Re-ID task. However, none of them consider semantic information, such as attributes; they only try to build robust deep global/local features.
C. Attribute for Re-ID
Attributes are signatures of concepts. It is difficult to recognize concepts, but recognizing signatures is much easier. Attributes help re-identify pedestrians from coarse to fine, and help to understand images at a detailed level. Therefore, attributes can help increase the discrimination between pedestrians. Many works investigate attributes as auxiliary information for Re-ID. In [24,25], a DCNN classifier is first trained on an independent attribute-annotated dataset, then the attributes predicted by that model are used to train and fine-tune another person Re-ID model. This kind of method propagates the error of the former attribute model to the Re-ID model, especially since the attribute model does not consider the imbalance problem. In [26,27], an end-to-end Multi-task DCNN model is proposed to perform both attribute recognition and Re-ID simultaneously. In these works, each attribute probability is predicted by forwarding the same global feature vector to a corresponding linear layer. This vector is also used to retrieve nearest neighbors in the inference stage of the Re-ID task. In fact, many attributes appear only in small regions of the human body, so using a single global feature vector to learn all attributes is inefficient. Recognizing this drawback, [28,29] proposed part-based CNN models, in which a global feature map from a middle layer is horizontally split into 4 disjoint equal local feature maps; each one is then forwarded to some other convolutional layers followed by a last linear layer to predict probabilities of a group of attributes. The improvement in this method is that it uses multiple local features to predict groups of suitable attributes. However, some attributes are distributed over more than one part, so it is unclear which part's output should be chosen to evaluate attribute recognition. Therefore, in the inference stage of the Re-ID task, the authors do not use attribute predictions anymore, but only enrich features by concatenating all deep local and global features. Furthermore, none of the above methods handle the imbalanced data problem of attributes, which is an inherent problem in many classification tasks.
Therefore, in this paper, we follow the method in [28], which builds a DCNN model that splits a middle global feature map into multiple local parts. But instead of distributing each attribute over multiple parts and concatenating local and global deep features, we propose novel improvements: 1) We build a Pedestrian Attribute Ontology (PAO) for better attribute learning, and also for easy expansion in the future; 2) Based on PAO, we build a Local Multi-task DCNN model (Local MDCNN) to exploit inner group and inter group correlations between attributes; 3) We incorporate an Imbalanced Data Solver (IDS) into our Pedestrian Attribute Recognition module; and 4) we build a novel Person Re-identification system flexibly combining global deep
features and pedestrian semantic information (facial and clothing attributes).
III. METHOD
Briefly, our contribution is a novel Person Re-identification system based on Deep Global Features and Pedestrian Attributes. In this section, we focus on the main points of our system. In the off-line stage, we build a PAO and then a Local MDCNN to learn pedestrian attributes. Besides, we use transfer learning to train a siamese network to learn person global deep features. In the on-line stage, for each query image, we first use pedestrian semantic information (attributes) to pre-filter candidate images, and then use global deep features to find the nearest neighbors among the remaining ones, or vice versa.
Our Re-ID system plays a very important role in a multi-camera person tracking system. Within the viewing range of each camera, the tracking task can be performed by conventional methods, but when the person moves from the viewing range of camera (i) to camera (i+1), Re-ID is very useful for identifying the monitored person whose track has been lost.
Our system is organized into two phases: 1) Offline Phase: This phase is designed to build the PAO to support the Deep Global Features Learning model (DGFL model) and the Pedestrian Attributes Learning model (PAL model, aka the Local MDCNN model). In this phase we also take into account the imbalanced data problem to improve the performance of the learning models, and prepare features of the gallery set for the matching process in the online phase. The PAO is manually built based on domain knowledge in the field of fashion attributes and facial attributes. It is a hierarchical semantic tree. Its purpose is to exploit the correlations between attributes, not only for better learning but also for easier expansion in the future. The DGFL model is designed and trained to extract the deep global features of the input image of each person. The PAL model is designed and trained to extract the predefined attributes of each person. The deep features and the semantic information such as attributes are then combined in the online phase. The imbalanced data problem is also solved in this phase by the IDS. Its purpose is to find the best threshold for each attribute prediction on the training set. In the online phase, the chosen thresholds are used to convert the continuous predicted outputs of the PAL model to corresponding binary values for the query process. 2) Online Phase: This phase runs the query process, including deep feature extraction, attribute information extraction and retrieval. After getting deep features from the DGFL model and attribute information from the PAL model, for each query image, we first use attributes to filter out candidate images whose attributes differ from those of the query image, and then use global deep features to find the nearest neighbors among the remaining ones, or perform the steps in the reverse order.
A. Pedestrian Attribute Ontology (PAO)
Our PAO is inspired by our Face Attribute Ontology (FaAO) and our Fashion Attribute Ontology (FasAO) [39]. PAO helps us to exploit inner group and inter group correlations between attributes, which is very useful for training the Local MDCNN. PAO also helps the developer to easily add new attributes in the future. To do this, we first manually classify attributes into groups based on their correlations. These groups are combined into a semantic hierarchical tree, also known as an attribute ontology. To bring the PAO into the deep model, we rely on the observation that most attributes can be seen in a local region instead of the whole human body. For example, "is wearing hat" can be predicted by learning from the "head" region. Therefore, our Local MDCNN learns attributes from local features instead of global ones. Following [39], we employ a clothing attribute dataset named DeepFashion [40] to build a general FasAO, and our prior knowledge to build a general FaAO. In the experiments, we re-build a specific PAO on another dataset that has both Re-ID and attribute labels.
In [41], Gruber stated that an ontology is a formal, explicit specification of shared concepts. An ontology is formed by four principal components: individuals, classes, concepts and relations between them. The class component can have multiple layers. In our case, we formulate the person attribute ontology with five components:
Person (individuals): a layer representing people objects.
Regions (classes): a layer representing human body regions, consisting of five parts: head, upper body, lower body, whole body (upper and lower) and foot.
Categories (classes): a layer representing types of corresponding clothes in each region.
Attributes (concepts): clothing attributes with respect to each category and facial attributes.
Relations: consisting of 3 relations: part of (between regions and individuals), has a (between regions - categories and categories - attributes), is a (between attributes and their values).
The semantic hierarchical tree consists of three main levels: Regions, Categories and Attributes. Fig. 1 shows our PAO. In Fig. 1, the human body is first split into five regions. In each region, there are multiple categories of clothing items. For each item, there are relevant attributes depending on its kind. Basically, it is not too difficult to know which items should be put into which body regions. We take some examples from the DeepFashion dataset and visualize them in Fig. 2 to show some popular kinds of clothing items. The PAO shows two properties of attributes: inner group correlation and inter group correlation. These two properties help us in designing the deep model:
Firstly, when training a deep attribute recognition model, global features are usually used to predict all attributes. But in real life, people just need to see a local region to find out the attributes related to this region. For example, we can know whether a man is wearing a hat by looking at his head, and do not need to look at the other regions. This is the inter group correlation between attributes. The ontology helps us to see which attributes should not go together and therefore should not be predicted from the same local features.
Secondly, many attributes co-occur with each other. For example, a person having a beard is usually a man, so the two attributes Is Male and Having beard are usually both zeros or both ones. This is the inner group correlation. The ontology helps us to group these attributes into the same classes, and therefore the deep model should predict them from the same local features.
Clothing attributes are numerous and varied. Following Ly et al., we choose six types of clothing attributes to demonstrate our ontology, and divide them into two groups: i) general attributes, which most items have, including color, texture, and shape; and ii) specific attributes, which only exist on some items. Besides, we also show some facial attributes, which can be recognized from the head position. Some examples of clothing attributes and facial attributes are shown in Fig. 3 and Fig. 4, respectively.
In summary, the Pedestrian Attribute Ontology is a hierarchical semantic tree in which attributes are classified into groups. It not only can improve the learning process of the deep model (in comparison to models without it), but also makes it easy to add more attributes if necessary in the future. In the next subsection, we design an effective deep model for the attribute recognition task based on this ontology.
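As a rough illustration of the hierarchy summarized above, the ontology can be held in a nested mapping of regions, categories and attributes. The sketch below is only one possible encoding and is not taken from our implementation; the category and attribute names (`hat`, `shirt`, `is_male`, and so on) and the helper functions are hypothetical placeholders, not the actual entries of the PAO in Fig. 1.

```python
# Hypothetical encoding of a Pedestrian Attribute Ontology (PAO) as a
# nested dict: region -> category -> list of attributes.
# All category and attribute names below are illustrative placeholders.
PAO = {
    "head": {
        "face": ["is_male", "has_beard"],
        "hat": ["wearing_hat"],
    },
    "upper_body": {"shirt": ["sleeve_length", "up_color"]},
    "lower_body": {"trousers": ["lower_length", "down_color"]},
    "whole_body": {"dress": ["dress_color"]},
    "foot": {"shoes": ["shoe_color"]},
}


def attributes_by_region(ontology):
    """Flatten the ontology into region -> attribute list (one group per local branch)."""
    return {
        region: [attr for attrs in categories.values() for attr in attrs]
        for region, categories in ontology.items()
    }


def region_of(ontology, attribute):
    """Return the region whose local features should predict `attribute`."""
    for region, attrs in attributes_by_region(ontology).items():
        if attribute in attrs:
            return region
    raise KeyError(attribute)


def add_attribute(ontology, region, category, attribute):
    """Extend the ontology in place without touching the other groups."""
    ontology.setdefault(region, {}).setdefault(category, []).append(attribute)


groups = attributes_by_region(PAO)
```

Under this encoding, attributes in the same region form one group to be predicted from the same local features, and a new attribute is added by a single call such as `add_attribute(PAO, "head", "glasses", "wearing_glasses")`.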
Fig. 1. Our Pedestrian Attribute Ontology.
1) Models: Our system is based on two models: a Person Deep Global Features Learning model and a Pedestrian Attribute Learning model. The former is trained to extract deep global feature vectors and the latter is trained to extract attribute vectors. Since deep features achieve high performance on many computer vision tasks, they should not be ignored in our system. However, deep features do not contain semantic information. Given only one input image, we cannot understand what its deep features mean, but with attribute features, we can know which attributes are present on the person in the input image. Therefore, the two kinds of features can support each other to achieve high performance in our system.
a) Person Deep Global Features Learning model: Since we are trying to prove by experiments that attribute prediction can help improve Re-ID results, we build only a simple person deep global features learning model instead of using a complex architecture like other works. Concretely, we transfer a 50-layer Residual Network [32] pretrained on the ImageNet image classification dataset. In our architecture, we remove the last 1000-unit linear layer and append a 1x1 convolutional layer to reduce the dimension from 2048 down to 256. This final 256-D vector is the feature vector of the input bounding box, which is then used for matching in the inference stage. Fig. 5 shows our Re-ID model architecture in the offline phase.
Two images are considered to show the same person when their corresponding deep features have high similarity. To train the model to achieve this goal, we use Triplet Loss [42] as our loss function in the training stage. This function has been used by many Re-ID models in particular and image retrieval models in general. Its goal is to learn the similarity between inputs with the same ID and the divergence between inputs with different IDs. Equation (1) shows the formula of this function. Once training is finished, the Euclidean distance between feature vectors of images of the same person ($f_a$ and $f_p$) should be smaller than that between feature vectors of images of different persons ($f_a$ and $f_n$) by at least a margin $m$.
Fig. 2. Some kinds of Clothing Items Extracted from the PAO.
$$\mathrm{TripletLoss} = \max\left(0,\ \lVert f_a - f_p \rVert - \lVert f_a - f_n \rVert + m\right) \tag{1}$$
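To make Equation (1) concrete, the following minimal sketch evaluates the loss for a single triplet in plain Python. It is an illustration only, not our training code: the 2-D toy vectors and the margin value of 1.0 are assumptions chosen for easy arithmetic (the feature vectors in our system are 256-D and the margin is a training hyper-parameter).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_a, f_p, f_n, margin):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin, as in Equation (1)."""
    return max(0.0, euclidean(f_a, f_p) - euclidean(f_a, f_n) + margin)


# Toy 2-D features (assumed values); d(anchor, positive) = 5, d(anchor, negative) = 10.
anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]

# Positive is closer than negative by more than the margin: the loss vanishes.
loss_satisfied = triplet_loss(anchor, positive, negative, margin=1.0)

# Swapping the roles violates the constraint: 10 - 5 + 1 = 6.
loss_violated = triplet_loss(anchor, negative, positive, margin=1.0)
```

The loss is zero exactly when the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute gradients during training.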
Fig. 5. Person Deep Global Features Learning Model. The Inputs of each
Timestep Are a Triple of Images, Including Anchor, Positive and Negative
one. The Model Parameters are Shared between them.
The output of the model is a vector whose number of dimensions is equal to the number of binary attributes. In the learning stage, we use the Binary Cross Entropy function for each attribute, and take the average over all attributes as our final loss function. For each M-dimensional output vector $\hat{a}_i$ and M-dimensional ground truth vector $a_i$, where M is the number of binary attributes, the average Binary Cross Entropy loss is given in Equation (2):
$$\mathrm{AvgBCELoss} = -\frac{1}{M} \sum_{j=1}^{M} \left( a_{ij} \log \hat{a}_{ij} + (1 - a_{ij}) \log\left(1 - \hat{a}_{ij}\right) \right) \tag{2}$$
2) Handling imbalanced data: One of the important improvements of our method is that we incorporate an Imbalanced Data Solver into our Person Re-identification system. Imbalanced data is a common problem in classification tasks. There are many ways to handle it, such as oversampling/undersampling or using a weighted loss function. In a multi-label classification task like pedestrian attribute recognition, we cannot increase or decrease the number of samples because it will affect all attributes, and weighting the loss leads to many hyper-parameters that must be tuned. Therefore, we choose to adjust the thresholds of binary attributes, instead of using the common value of 0.5. Concretely, for each attribute, we perform a grid search to choose the best threshold from a list of predefined candidate thresholds. The best one is the threshold that gives the highest value of the Matthews correlation coefficient [43] when it is used to convert probabilities to binary predictions. The Matthews correlation coefficient (MCC) [43] is a well-known metric used to measure the quality of a binary classifier. Its formula takes into consideration the 4 standard counts of a classification problem: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as shown in Equation (3).
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3}$$
The Matthews correlation coefficient ranges from -1 to 1:
It achieves the maximum value of 1 when both FP and FN are zero, which means no sample is wrongly predicted and the classifier result matches the ground truth exactly.
It is 0 if the prediction is random.
Therefore, we choose the threshold that makes the coefficient as high as possible.
IV. RESULTS AND EVALUATION
A. Data Set
We demonstrate our proposed method on a large Re-ID data set named Market1501 [44]. This data set contains 32668 bounding-box images of 1501 persons, which are captured from six different cameras in front of a supermarket near Tsinghua University, China. The authors divided it into three parts: the training set has 12936 images of 751 persons and the test set has 19281 images of the other 750 persons. From the test set, 1 to 6 random images of each individual are chosen to form a query set, and the remaining images form the gallery set. The Re-ID task is performed by matching images in the query set with images in the gallery set, and returning the top-k most similar ones.
For the pedestrian attribute recognition task, we use the Market1501-attribute data set, which is basically the Market1501 data set with attribute annotations proposed by other authors, Lin et al. There are 27 attributes in total, but we only use the 25 binary attributes whose proportion of positive samples is more than 0.5%. Table I shows these attributes, which we cluster into local positions.
B. Using only Person Deep Global Features Learning Model
First, we evaluate our Person Deep Global Features Learning (PDGFL) model without complementary attribute information. We train our PDGFL model for 60 epochs with the Adam optimizer and default hyper-parameters, except that the learning rate is set to $3 \times 10^{-4}$. Input images are rescaled to 192x96 before being fed into the network. Feature vectors have 256 dimensions. We evaluate 3 versions of the Residual Network [32]: ResNet18, ResNet50 and ResNet101. Results are reported in mAP, top-1, top-5 and top-10 accuracy, and are shown in Table II. In our experiments, ResNet101 achieves the highest performance. ResNet50's result is lower than ResNet101's by a very small gap, although it has only half the number of layers of ResNet101. This means that even a more complex and deeper model still cannot distinguish individuals with similar appearance.
C. Attribute Recognition Model
The attribute recognition model is the principal component of our proposed method. We demonstrate our model in 2 scenarios: without/with Ontology, and without/with Matthews correlation coefficient (both using Ontology). In all attribute recognition experiments, we use the same train-validation-test split and optimizer as in the Re-ID experiment, except that the number of epochs is now set to 10.
Firstly, we re-build the attribute ontology on the Market1501-attribute data set. The ontology is shown in Fig. 8.
TABLE I. 25 ATTRIBUTES FROM MARKET1501-ATTRIBUTE DATA SET
Position   Attributes
upper      sleeve length, 8 colors of upper clothing
lower      length of lower clothing, type of lower clothing, 8 colors of lower clothing
foot       none
TABLE II. QUERY RESULT OF DIFFERENT MODELS WITHOUT COMPLEMENTARY ATTRIBUTE INFORMATION
Model      Top-1   Top-5   Top-10  mAP
ResNet18   77.7%   90.6%   93.5%   57.9%
ResNet50   81.4%   91.8%   94.7%   65.1%
ResNet101  82.0%   93.1%   95.6%   66.0%
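The MCC-based threshold selection used in the "with Matthews correlation coefficient" scenario can be sketched as follows. This is a minimal illustration under stated assumptions rather than our actual implementation: the candidate grid 0.1, ..., 0.9, the toy probabilities, the function names, and the convention of returning 0 when the denominator of Equation (3) is zero are all assumptions made for the example.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def confusion(probs, labels, threshold):
    """Count TP, TN, FP, FN when probabilities >= threshold are predicted positive."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def best_threshold(probs, labels, candidates):
    """Grid search: return the (threshold, MCC) pair with the highest MCC."""
    scored = [(mcc(*confusion(probs, labels, t)), t) for t in candidates]
    best_score, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_score


# Toy rare attribute: the model assigns low probabilities even to positives.
probs = [0.05, 0.10, 0.15, 0.20, 0.35, 0.40]
labels = [0, 0, 0, 0, 1, 1]
candidates = [i / 10 for i in range(1, 10)]  # assumed grid: 0.1, 0.2, ..., 0.9
threshold, score = best_threshold(probs, labels, candidates)
default_score = mcc(*confusion(probs, labels, 0.5))
```

On this toy attribute the default threshold of 0.5 predicts every sample as negative, whereas the grid search selects a lower threshold that separates the two classes; one such search would be run independently for each attribute.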
b) Secondly: the remaining candidates are used to find TABLE. IV. QUERY RESULTS BY ATTRIBUTES PRE-FILTERING USING 5
ATTRIBUTES HAVING BEST PREDICTION
K nearest neighbors by comparing Euclidean dissimilarity
distance of deep features extracted by the above ResNet50 Re- Attribute
Top-1 Top-5 Top-10 mAP
ID model. Pre-filtering
The 5 chosen attributes are: up red, up white, up yellow, None 81.4% 91.8% 94.7% 65.1%
lower length and down black. Table IV shows attributes pre- up red 83.3% 93.9% 92.6% 75.2%
filtering results in mAP and top-K. As depicted in Table IV,
pre-filtering by attribute down black gives best results in top-K up white 83.6% 93.7% 95.8% 74.4%
accuracy, and by attribute up red gives best result in mAP. up yellow 83.8% 93.6% 95.9% 74.9%
Consider all of these 5 attributes in pre-filtering, although the lower type 84.2% 94.1% 96.0% 74.7%
top-k accuracy values are improved by a small gap, the mAP
values increase remarkably, at least 9.3% comparing to the down black 85.2% 95.3% 96.9% 74.8%
case that does not use attributes in pre-filtering. This indicates
that, if in the future we can have better attribute recognition TABLE. V. COMPARISON BETWEEN COMBINATION OF GLOBAL AND
LOCAL DEEP FEATURES AND COMBINATION OF GLOBAL DEEP FEATURES AND
models, so that more attributes have high prediction results, ATTRIBUTE INFORMATION
then Re-ID results would be derived to a better performance
too. Case Top-1 Top-5 Top-10 mAP
2) Compare to the case of using global and local deep Position: Head
features: Most of previous works use attributes as auxiliary Deep Global
78.4% 89.7% 93.9% 56.8%
+ Local Feature
task for enhancing the global/local deep features, and do not
Deep Global
employ attributes prediction in the test stage. Therefore, we feature
75.1% 87.2% 92.6% 53.3%
perform some experiments to compare the two cases: using + All local
global and local deep features with using global deep features attributes
and attribute information. The combination of global and local deep features in our experiments is extracted as follows:
a) For each position: we take the output of the local part of our network above (Fig. 6), which is a feature map.
b) Then: we apply a max-pooling operation to convert the feature map into a feature vector. This is the local deep feature vector of the corresponding position.
c) Finally: we concatenate the deep global features with the local feature of one of the 4 regions (head, body, upper, lower) to form a single deep feature vector, which is then used for the matching process in the test stage.
The combination of deep global features and attribute information is exactly the pre-filtering strategy described in the previous section. For each position, we compare two cases: i) using all attributes of that position; and ii) using only the single attribute with the best prediction for that position. The results of the comparisons for all 4 positions are shown in Table V; the arrows indicate whether the result of using attribute information is higher or lower than that of using local features, and the bold values are the highest among the 3 cases for that position.
As depicted in Table V, when all attributes are used in the filtering step, the top-k accuracies and mAP in all positions are lower than those of the combination of global and local features. However, the opposite holds when only the single best attribute is used. In the head position, the attribute gender gives top-1, top-5 and mAP about 3-5% higher, and a top-10 accuracy only 0.1% lower, compared to using complementary local features. In the body position, the attribute backpack is not as good as local features, because its prediction F1-score is only moderate, about 84%. In the upper and lower positions, using the corresponding best attribute gives higher performance than deep local features on all metrics. Fig. 9 shows some samples in which the query results are rearranged and improved by attribute filtering.
Table V (continued)
Method                                            Top-1   Top-5   Top-10  mAP
Position: Head (continued)
Deep Global feature + Only attribute gender       81.6%   91.1%   93.8%   61.2%
Position: Body
Deep Global + Local Feature                       80.4%   89.7%   94.9%   62.0%
Deep Global feature + All local attributes        72.7%   79.6%   85.1%   50.3%
Deep Global feature + Only attribute backpack     78.3%   91.4%   95.5%   63.1%
Position: Upper
Deep Global + Local Feature                       80.3%   91.5%   94.8%   61.7%
Deep Global feature + All local attributes        77.7%   89.4%   92.9%   59.9%
Deep Global feature + Only up yellow              83.8%   93.6%   95.6%   74.9%
Position: Lower
Deep Global + Local Feature                       78.3%   90.9%   94.5%   61.1%
Deep Global feature + All local attributes        69.1%   77.4%   82.5%   47.8%
Deep Global feature + Only attribute lower type   84.2%   94.1%   96.0%   74.7%
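The feature combination in steps a) to c) and the single-attribute pre-filter compared in Table V can be sketched roughly as follows. This is only an illustrative NumPy sketch, not the implementation used in the experiments: the function names, the (C, H, W) feature-map layout, the use of max-pooling over the whole spatial extent, the exact-match attribute filter and the Euclidean distance are all assumptions made for the example.

```python
import numpy as np


def local_feature_vector(feature_map):
    """Step b): max-pool a (C, H, W) feature map over H and W to a (C,) vector.

    Pooling over the whole spatial extent is an assumption of this sketch.
    """
    return feature_map.max(axis=(1, 2))


def combined_feature(global_feat, local_map):
    """Step c): concatenate the global feature with one region's local feature."""
    return np.concatenate([global_feat, local_feature_vector(local_map)])


def rank_with_prefilter(query_feat, query_attr, gallery_feats, gallery_attrs):
    """Attribute pre-filter followed by distance ranking (hypothetical helper).

    Gallery items whose predicted attribute differs from the query's are
    discarded; the remaining items are ranked by Euclidean distance.
    Returns indices into the original gallery, closest first.
    """
    keep = np.flatnonzero(gallery_attrs == query_attr)
    dists = np.linalg.norm(gallery_feats[keep] - query_feat, axis=1)
    return keep[np.argsort(dists, kind="stable")]


# Toy usage with made-up numbers.
fmap = np.arange(24, dtype=float).reshape(2, 3, 4)      # 2 channels, 3x4 map
feat = combined_feature(np.array([1.0, 2.0]), fmap)     # 2 global + 2 local dims

gallery_feats = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.0]])
gallery_attrs = np.array([1, 0, 1, 1])                  # e.g. a binary attribute
ranking = rank_with_prefilter(np.array([0.0, 0.0]), 1, gallery_feats, gallery_attrs)
```

In this toy gallery, item 1 is dropped before any distance is computed because its attribute value differs from the query's, which is the effect the pre-filter is meant to have on mismatched candidates.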
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 2, 2020
Fig. 9. Some Samples whose Query Results are Improved by Attribute Filtering.
3) Comparison with other methods: We use the attribute “down black”, which has the best performance in improving Re-ID results, for the comparison with related works. As depicted in Table VI, our method achieves higher performance than the other works in mAP, top-5 and top-10 accuracy. Note that these works only use attributes as auxiliary information in the learning stage. The results show that using attributes as a pre-filter in the inference stage can achieve equivalent or even better performance. The methods presented in [24], [26], [28] were selected for comparison with our method for the following reason: all of them use pedestrian attributes in person re-identification in different ways, but none of them uses attribute pre-filters or takes data imbalance into account. We would like to show that our method can overcome their drawbacks listed in Section II (part C).
V. CONCLUSIONS
In this paper, we present a new method that uses semantic information, such as pedestrian attributes, to improve person re-identification performance. Our method is a unified Re-ID system consisting of two main modules: 1) a Pedestrian Attributes Learning model (PAO + Local MDCNN + IDS); 2) a Person Re-ID model (Deep Global Features based Person Re-ID + Pedestrian Attribute based Person Re-ID). We show that our Re-ID system outperforms some state-of-the-art Re-ID methods in mAP, top-5 and top-10 accuracy. In the future, if more powerful attribute recognition models are proposed, the Re-ID task could be driven to better performance, and the Re-ID system at semantic level will be integrated into Visual Question Answering (VQA) to improve the intelligence of video surveillance systems.
ACKNOWLEDGMENT
This research is funded by Viet Nam National University Ho Chi Minh City (VNUHCM) under grant no. B2018-18-01. We thank AIOZ Pte Ltd. for its valuable support through internship cooperation.
REFERENCES
[1] S. Gong, M. Cristani, S. Yan, C. C. Loy, “Person Re-Identification,” Springer Publishing Company, Incorporated, 2014.
[2] L. Bazzani, M. Cristani, V. Murino, “Symmetry-driven accumulation of local features for human characterization and re-identification,” Computer Vision and Image Understanding 2013, 117, pp. 130–144. doi:10.1016/j.cviu.2012.10.008.
[3] Y. Zhang, S. Li, “Gabor-LBP Based Region Covariance Descriptor for Person Re-identification,” Image and Graphics (ICIG), 2011 Sixth International Conference on 2011. doi:10.1109/ICIG.2011.40.
[4] B. Prosser, W. S. Zheng, S. Gong, T. Xiang, “Person Re-Identification by Support Vector Ranking,” 2010, Vol. 2, pp. 1–11. doi:10.5244/C.24.21.
[5] M. Gou, F. Xiong, O. Camps, M. Sznaier, “Person Re-Identification Using Kernel-Based Metric Learning Methods,” 2014. doi:10.1007/978-3-319-10584-0_1.
[6] W. Ayedi, H. Snoussi, M. Abid, “A fast multi-scale covariance descriptor for object re-identification,” Pattern Recognition Letters 2011, 33. doi:10.1016/j.patrec.2011.09.006.
[7] S. Bąk, E. Corvee, F. Bremond, M. Thonnat, “Multiple-shot Human Re-Identification by Mean Riemannian Covariance Grid,” 2011, pp. 179–184. doi:10.1109/AVSS.2011.6027316.
[8] S. Liao, Y. Hu, X. Zhu, S. Li, “Person re-identification by Local Maximal Occurrence representation and metric learning,” 2015, pp. 2197–2206. doi:10.1109/CVPR.2015.7298832.
[9] K. Weinberger, J. Blitzer, K. Saul, “Distance Metric Learning for Large Margin Nearest Neighbor Classification,” 2006, Vol. 10.
[10] M. Guillaumin, J. J. Verbeek, C. Schmid, “Is that you? Metric learning approaches for face identification,” 2009 IEEE 12th International Conference on Computer Vision 2009, pp. 498–505.
[11] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, “Large Scale Metric Learning from Equivalence Constraints,” 2012. doi:10.1109/CVPR.2012.6247939.
[12] L. Zheng, Y. Yang, A. G. Hauptmann, “Person Re-identification: Past, Present and Future,” ArXiv 2016, abs/1610.02984.
[13] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, Q. Tian, “Pose-Driven Deep Convolutional Model for Person Re-identification,” 2017 IEEE International Conference on Computer Vision (ICCV) 2017, pp. 3980–3989.
[14] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, W. Gao, “Attention driven person re-identification,” Pattern Recognition 2019, 86, pp. 143–155.
[15] D. Li, X. Chen, Z. Zhang, K. Huang, “Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification,” 2017.
[16] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, J. Sun, “AlignedReID: Surpassing Human-Level Performance in Person Re-Identification,” ArXiv 2017, abs/1711.08184.
[17] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, “Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline),” ECCV, 2017.
[18] X. Bai, M. Yang, T. Huang, Z. Dou, R. Yu, Y. Xu, “Deep-Person: Learning Discriminative Deep Features for Person Re-Identification,” 2017.
[19] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, “Learning Discriminative Features with Multiple Granularities for Person Re-Identification,” ACM Multimedia, 2018.
[20] H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, “End-to-End Comparative Attention Networks for Person Re-Identification,” IEEE Transactions on Image Processing 2017, 26, pp. 3492–3506.
[21] A. Rahimpour, L. Liu, A. Taalimi, Y. Song, H. Qi, “Person re-identification using visual attention,” 2017 IEEE International Conference on Image Processing (ICIP) 2017, pp. 4242–4246.
[22] S. Zhou, J. Wang, D. Meng, Y. Liang, Y. Gong, N. Zheng, “Discriminative Feature Learning with Foreground Attention for Person Re-Identification,” IEEE Transactions on Image Processing 2018.
[23] D. Ouyang, Y. Zhang, J. Shao, “Video-based person re-identification via spatio-temporal attentional and two-stream fusion convolutional networks,” Pattern Recognition Letters 2019, 117, pp. 153–160.
[24] A. Schumann, R. Stiefelhagen, “Person Re-identification by Deep Learning Attribute-Complementary Information,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2017, pp. 1435–1443.
[25] C. Su, S. Zhang, J. Xing, W. Gao, Q. Tian, “Deep Attributes Driven Multi-Camera Person Re-identification,” ArXiv 2016, abs/1605.03259.
[26] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Y. Yang, “Improving Person Re-identification by Attribute and Identity Learning,” ArXiv 2017, abs/1703.07220.
[27] N. McLaughlin, J. M. del Rincón, P. C. Miller, “Person Reidentification Using Deep Convnets With Multitask Learning,” IEEE Transactions on Circuits and Systems for Video Technology 2017, 27, pp. 525–539.
[28] G. Zhang, J. Xu, “Person Re-identification by Mid-level Attribute and Part-based Identity Learning,” ACML, 2018.
[29] Y. Chen, S. Duffner, A. Stoian, J. Y. Dufour, A. Baskurt, “Pedestrian Attribute Recognition with Part-based CNN and Combined Feature Representations,” VISIGRAPP, 2018.
[30] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Neural Information Processing Systems 2012, 25. doi:10.1145/3065386.
[31] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv 1409.1556 2014.
[32] K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, pp. 770–778.
[33] S. Z. Chen, C. C. Guo, J. Lai, “Deep Ranking for Person Re-Identification via Joint Representation Learning,” IEEE Transactions on Image Processing 2016, 25, pp. 2353–2367.
[34] N. McLaughlin, J. M. del Rincón, P. C. Miller, “Recurrent Convolutional Network for Video-Based Person Re-identification,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, pp. 1325–1334.
[35] C. Mao, Y. Li, Z. Zhang, Y. Zhang, X. Li, “Pyramid Person Matching Network for Person Re-identification,” ACML, 2017.
[36] H. Shi, X. Zhu, S. Liao, Z. Lei, Y. Yang, S. Z. Li, “Constrained Deep Metric Learning for Person Re-identification,” ArXiv 2015, abs/1511.07545.
[37] S. Wu, Y. C. Chen, X. Li, A. Wu, J. You, W. S. Zheng, “An enhanced deep feature representation for person re-identification,” 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) 2016, pp. 1–8.
[38] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, M. Shah, “Human Semantic Parsing for Person Re-identification,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018, pp. 1062–1071.
[39] N. Ly, T. Do, B. Nguyen, “Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning,” Computational Intelligence and Neuroscience 2019, 2019, pp. 1–40. doi:10.1155/2019/1483294.
[40] Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, “DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] T. R. Gruber, “Toward principles for the design of ontologies used for knowledge sharing,” Int. J. Hum.-Comput. Stud. 1993, 43, pp. 907–928.
[42] F. Schroff, D. Kalenichenko, J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, pp. 815–823.
[43] S. Boughorbel, F. Jarray, M. El-Anbari, “Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric,” PloS one, 2017.
[44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, “Scalable Person Re-identification: A Benchmark,” Computer Vision, IEEE International Conference on, 2015.