
Future Generation Computer Systems 123 (2021) 94–104

Contents lists available at ScienceDirect

Future Generation Computer Systems


journal homepage: www.elsevier.com/locate/fgcs

Semantic segmentation for multiscale target based on object recognition using the improved Faster-RCNN model

Du Jiang a, Gongfa Li a,b,c,∗, Chong Tan d, Li Huang e,f,∗, Ying Sun d, Jianyi Kong a

a The Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan, 430081, China
b Research Center for Biomimetic Robot and Intelligent Measurement and Control, Wuhan University of Science and Technology, Wuhan, 430081, China
c The Institute of Precision Manufacturing, Wuhan University of Science and Technology, Wuhan, 430081, China
d The Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan, 430081, China
e College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430081, China
f Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, 430081, China

Article info

Article history: Received 22 January 2021; Received in revised form 11 April 2021; Accepted 27 April 2021; Available online 4 May 2021.

Keywords: Semantic segmentation; Object recognition; Multiscale target; Multi-task; Faster-RCNN

Abstract

Image semantic segmentation has received great attention in computer vision. Its aim is to segment different objects and give them different semantic category labels, so that the computer can fully obtain the semantic information of the scene. However, current research mainly focuses on color image data for training, on outdoor scenes, and on single-task semantic segmentation. This paper builds a multi-task semantic segmentation model with joint target detection for complex indoor environments, using RGB-D image information and based on the improved Faster-RCNN algorithm; it can simultaneously realize multiple visual tasks: indoor scene semantic segmentation, target classification, and detection. In view of the influence of uneven lighting in the environment, the method of fusing RGB images and depth images is improved; while enhancing the feature information of the fused image, it also improves the efficiency of model training. Simultaneously, to meet the needs of operating on multiscale target objects, the non-maximum suppression (NMS) algorithm is improved to raise the model's performance. To realize the output of the model's multi-task information, the loss function has also been redesigned and optimized. The indoor scene semantic segmentation model constructed in this paper not only has good performance and high efficiency, but can also clearly segment the contours of objects at different scales and adapt to indoor uneven-lighting environments.

© 2021 Published by Elsevier B.V.

1. Introduction

As one of the most convenient ways to obtain object information, images play an important role in information transmission. At present, Computer Vision (CV) technology is developing rapidly because of the outbreak of Artificial Intelligence (AI) and Deep Learning (DL). This technology allows intelligent robots to observe and understand the world through vision like humans, and to adapt to the environment autonomously [1–3]. For example, the combination of learning and image processing helps humans resist diseases such as the coronavirus disease (COVID-19) [4] and supports automatic driving [5], and their combination can effectively improve the efficiency of urban management, helping to establish relevant disaster-prevention mechanisms [6].

Image semantic segmentation is a basic research direction in computer vision [7,8]. Its purpose is to segment the targets in a scene with various colors for semantic color labeling, and to label each pixel in the image with semantic information. Here, semantics refers to the category names of the different targets in the image, which is extremely important for image understanding, target recognition and detection, and object tracking. At the same time, target detection is inextricably bound up with semantic segmentation: the former needs to obtain the position information of the object, while the latter considers not just the position but also the content information. Therefore, more refined vision tasks can be achieved by combining them, and they are widely used in mobile robots, smart security, unmanned driving, industrial vision, virtual reality and other fields.

∗ Corresponding authors.
E-mail addresses: [email protected] (D. Jiang), [email protected] (G. Li), [email protected] (C. Tan), [email protected] (L. Huang), [email protected] (Y. Sun), [email protected] (J. Kong).

https://doi.org/10.1016/j.future.2021.04.019
0167-739X/© 2021 Published by Elsevier B.V.

Compared with image segmentation for a single object, scene semantic segmentation has to deal with the more difficult issue of providing a predefined semantic category label for every pixel of a scene image or video [9,10]. Indoor scenes also pose many challenges, such as abundant semantic categories, mutual occlusion, uneven lighting [11,12], and the similarity of diverse objects. At the same time, semantic segmentation consumes a lot of computing resources, so improving the efficiency of the algorithm is quite necessary, especially on edge devices [13]. With the wide application of service robots, indoor scene understanding, which is closely related to indoor scene semantic segmentation, has attracted the attention of many researchers. Based on this, and aiming at the fusion of color information and depth information, this paper contributes to reducing the amount of fused information, so as to cut down the computational load on the equipment.

The key contributions of this work are as follows:
(i) A multi-task semantic segmentation model for multiscale targets is proposed on the basis of the improved Faster-RCNN model.
(ii) A fusion method of depth image and color image is applied to improve the performance of the model.
(iii) An improved NMS method is proposed for better selection of local candidate regions.
(iv) The performance of the proposed algorithm is analyzed and verified with the results of several experiments.

This article mainly consists of 5 parts. Part 2 reviews the relevant research results and discusses the rationality of the methods. The third part then illustrates the data set and pre-processing methods. The next part presents the experimental results of this paper.

2. Related works

Before deep learning technology became widespread, traditional image semantic segmentation mainly performed related operations on the target area of the image, using artificially designed feature extractors to extract relevant features such as texture, color, and shape, which would be sent to a classifier (such as an SVM) or other intelligent algorithms to predict the target category in the image [14–18]. However, these methods often carry little information because of their single features, resulting in unsatisfactory segmentations.

In 2001, Lafferty et al. [19] proposed Conditional Random Field models (CRFs) for text analysis. After continuous development, they were gradually used in other tasks such as image recognition and segmentation, becoming one of the most successful probability graph models for segmentation; their advantage is that they can fuse the local features of multiple types of images, incorporate global context information, and finally perform posterior probability modeling [20,21]. However, pairwise conditional random field models (Pairwise CRFs), generally used for image segmentation, only use the local texture information of the image and do not introduce its global shape features, which often leads to poor recognition results. To solve this problem, Kohli et al. [22] proposed higher order conditional random field models (Higher Order CRFs, HCRFs) on the basis of the CRFs model in 2008. By introducing a high-order potential function, the pixel categories in an image segmentation block were constrained on the pixel set to maintain consistency. At the same time, it was proved that the Graph Cuts algorithm can be applied to solve the model, and the high-order potential function can be optimized by solving a minimum-cut problem. Although segmentation algorithms based on CRFs can make up for the shortcomings of traditional semantic segmentation algorithms, this type of algorithm runs slowly and needs improvement in image segmentation accuracy and target recognition rate. It is not suitable for current video and computer vision applications, which demand higher real-time performance.

Deep learning has been combined with classifiers to implement image pixel classification, which can quickly extract the semantic information of low-level, middle-level, and high-level images, thereby improving the accuracy of image semantic segmentation. Wang et al. [23] used a CNN model to obtain features from images and performed clustering operations on color images to obtain superpixels. Finally, the superpixels were classified with a classifier, and thus semantic segmentation for indoor scenes was achieved with deep learning. A deeper CNN network was applied by Farabet et al. [24] to extract and integrate the features of images at different resolutions, in which the rough pixel blocks in the image were predicted smoothly with a segmentation-tree algorithm. In 2014, Girshick et al. [25] proposed Regions with CNN Features (RCNN) based on CNN, which uses a selective search algorithm to extract several candidate regions whose features are extracted with a CNN and classified with an SVM. However, the RCNN algorithm relied heavily on candidate regions, which would cause image distortion, resulting in poor image segmentation accuracy and slow segmentation speed. To overcome these shortcomings, Hariharan et al. [26] used the Multiscale Combinatorial Grouping (MCG) algorithm in the Simultaneous Detection and Segmentation (SDS) method, which could separate the foreground of the target area; the relevant image features were extracted from the target candidate regions, joint training was then performed on the features of the two parts, and finally the Non-maximum Suppression (NMS) algorithm was applied for region refinement, which further improves the performance of image segmentation. Nevertheless, RCNN still suffers from the large number of generated target candidate regions, the shapes of those regions, and the large amount of network computation. He et al. [27] built a spatial pyramid pooling network (SPPNet) and connected it behind the convolutional layers to reduce the amount of repeated calculation in the feature extraction process of the network. Ren et al. [28] introduced a Region Proposal Network (RPN) on the Fast-RCNN network to quickly generate high-quality candidate regions. Chen et al. [29] put forward a pyramid module with the contextual information of different regions, which improved the quality of scene understanding tasks. In addition, relevant researchers used a bottleneck module similar to the residual network to solve the problem of real-time segmentation, mainly through a series of downsampling, dimensionality reduction, and dilated (hole) convolution operations [30]; although these improved real-time performance, the loss of much detailed information often leads to unsatisfactory accuracy. To solve the problems of low computational efficiency and high storage overhead of early deep-learning-based image semantic segmentation, Long et al. [31] designed fully convolutional network structures (FCNs), mainly based on the VGG-16 network, which replace the fully connected layers in the CNN. The sizes of the input and output images were unified using methods such as upsampling and cropping, which made the end-to-end algorithm possible.

To sum up, the most popular method is to use deep learning to achieve image semantic segmentation. However, when facing indoor scenes, segmentation efficiency, accuracy and visualization effect cannot fully meet the requirements because of varying lighting environments and multiscale problems in the scene. Therefore, this paper mainly studies the deep learning method, further optimizes the RGB-D information fusion method, reduces the amount of information in the process of image processing, and enhances the efficiency of the model. At the same time, by optimizing the other relevant algorithms, the stability of the model in the indoor scene is enhanced.

Fig. 1. Multi-task semantic segmentation algorithm (MSSA-RCNN) flowchart.

3. Semantic segmentation for multiscale objects fusing color and depth information

This paper built a multi-task semantic segmentation model for multiscale objects in indoor scenes with joint target detection, which could detect the target's location information while interpreting the target's semantic information and outputting the confidence of target object detection. In addition, by fusing depth image information, not only could the influence of indoor lighting be overcome, but more comprehensive image features could also be extracted, improving the training and testing speed of the model.

3.1. Description of algorithm design

Among the target recognition algorithms based on the idea of candidate regions, the Faster-RCNN model is superior to other network models in terms of accuracy and real-time performance. So, the Faster-RCNN model was adopted as the basis to build a multi-task semantic segmentation algorithm (MSSA-RCNN), shown in Fig. 1, which adds information fusion compared with the method in Ref. [2]. It mainly consists of three parts: RGB and depth image fusion in the early stage, the object recognition framework, and the semantic segmentation branch based on FCN. The fusion of color images and depth images serves, on the one hand, to avoid the influence of illumination factors on color images and, on the other hand, to avoid training color images and depth images separately during network training, which benefits the detection speed of target objects. The fused image not only contains all the information of the color image, but also the visual information of the depth image, and its feature expression ability is stronger than that of methods trained on color or depth images alone. To further improve the accuracy of target detection, this paper designed a new method of screening target candidate frames, which takes into account the overlap degree of the candidate frame area and the number of surrounding frames, changing the choice of candidate frames made by the traditional NMS algorithm. In the semantic segmentation branch, the introduction of RoI Align based on the FCN removed the quantization operation and better solved the misalignment problem of the two quantizations in RoI Pooling, so that the original image pixels and the output image pixels could be fully aligned while avoiding pixel error.

3.2. Fusing RGB-D images

In indoor scenes, uneven illumination often affects the extraction of RGB image features, leading to a lack of feature information. To solve this problem, this paper combined the color and depth images in the early stage to obtain more feature-rich images for training. Drawing on Gupta et al. [32], the depth images in the NYUDv2 dataset were converted into HHA three-channel images (horizontal disparity, height above the ground, and the angle of the surface normal vector). This paper analyzed the frequency-domain characteristics of RGB images and HHA images with the Fourier transform formula (Eq. (1)) to get the frequency spectrum of the original color image and HHA image:

F(u, v) = (1/(WH)) · Σ_{x_img=0}^{W−1} Σ_{y_img=0}^{H−1} f(x_img, y_img) · e^{−j2π(u·x_img/W + v·y_img/H)}   (1)

in which:
(u, v) and (x_img, y_img) are the frequency-domain and spatial coordinates respectively;
W and H are the width and height of the image respectively;
f(x_img, y_img) is the pixel value at (x_img, y_img).

Fig. 2. Spectrum image of each channel of RGB image and HHA image of indoor scene.

The spectrum images of each channel of the RGB image and the HHA image are shown in Fig. 2. The frequency distributions of the three channels of the color scene image are relatively concentrated and similar, indicating that the features of the three channel images of the color image are relatively similar and the visual features are obvious. Although the frequency distributions of the two H channels of the HHA image are not completely consistent, they are relatively similar and concentrated, indicating that the original images of the two channels have obvious visual features and represent different features. The frequency distribution of the A channel is relatively scattered, showing that the original image corresponding to this channel has no obvious features.
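As an illustration of this frequency-domain analysis, the following is a minimal sketch (not the authors' code) of producing per-channel spectra like those in Fig. 2 with NumPy's 2-D FFT, which computes the discrete transform of Eq. (1) up to the 1/(WH) scale; the file names and channel layout are assumptions.

import numpy as np
import cv2

def channel_spectrum(channel: np.ndarray) -> np.ndarray:
    """Log-magnitude spectrum of one image channel, cf. Eq. (1)."""
    f = np.fft.fft2(channel.astype(np.float64))  # 2-D discrete Fourier transform
    f = np.fft.fftshift(f)                       # move zero frequency to the center
    return np.log1p(np.abs(f))                   # compress dynamic range for display

# Hypothetical inputs: an RGB frame and its HHA encoding as 3-channel images.
rgb = cv2.imread("scene_rgb.png")   # B, G, R channels
hha = cv2.imread("scene_hha.png")   # disparity, height, normal-angle channels (assumed order)

for name, img in [("RGB", rgb), ("HHA", hha)]:
    for i in range(3):
        spec = channel_spectrum(img[:, :, i])
        out = cv2.normalize(spec, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        cv2.imwrite(f"spectrum_{name}_ch{i}.png", out)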

Fig. 3. Spectrum image of each channel of HHG image.

Therefore, this article chooses to remove the third channel of the HHA image and use the grayscale image of the color image to replace the A channel, thereby fusing the color and depth images to obtain the fused HHG image (the first H is the horizontal disparity, the second H is the height above the ground, and the G channel is the grayscale image of the color image, which reduces the amount of information while preserving image features as much as possible). The HHG image (Fig. 3) fully considers the color image and depth image information, so it is unnecessary to train the color and depth images separately during training. While accelerating model training, better experimental results are also obtained.
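A minimal sketch of the HHG construction just described (assuming OpenCV image I/O and that the HHA channels are stored in disparity-height-angle order; an illustration, not the paper's implementation):

import cv2
import numpy as np

def fuse_hhg(rgb: np.ndarray, hha: np.ndarray) -> np.ndarray:
    """Fuse RGB and HHA into an HHG image: keep the two H channels
    (horizontal disparity, height above ground) and replace the weakly
    informative A channel with the grayscale of the color image."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)   # the G channel
    h_disparity = hha[:, :, 0]                     # assumed channel order
    h_height = hha[:, :, 1]
    return np.dstack([h_disparity, h_height, gray])

rgb = cv2.imread("scene_rgb.png")
hha = cv2.imread("scene_hha.png")
cv2.imwrite("scene_hhg.png", fuse_hhg(rgb, hha))

The fused three-channel image can then be fed to the backbone exactly like an ordinary color image, which is why no separate depth branch is needed during training.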

3.3. Backbone network

A standard convolutional neural network (ResNet50 or ResNet101) was selected as the backbone feature extractor. ResNet (Residual Network) was proposed by He et al. [33] in 2015, adding a direct connection channel following the Highway Network. The conventional module is shown in Fig. 4(a), whose output is a simple nonlinear transformation of the input; the improved module is shown in Fig. 4(b). The Highway Network makes it possible to pass on a certain proportion of the output of the previous network layer, so the current layer can learn the residual of the previous layer's output rather than the whole output.

To obtain stronger identity-mapping capability, ResNet mainly designs the Skip Connection structure, which allows the network to be deepened and its performance to be improved. As shown in Fig. 4, the underlying mapping is defined so that the superimposed nonlinear layers satisfy another mapping H(x_input) = G(x_input) − x_input, so the original feature is mapped to H(x_input) + x_input. The whole process can be expressed as:

y_output = H(x_input, {ω_i}) + x_input   (2)

in which x_input and y_output respectively represent the input and output of the convolutional layer, and H(x_input, {ω_i}) is a residual mapping that can be learned.

In actual network training, after a deep convolutional neural network reaches a certain number of layers, continuing to add layers does not improve the performance of target classification but makes convergence very slow, to the point that the network cannot be trained. The ResNet network protects the integrity of the information by routing the input directly to the output; the whole network only needs to learn the difference between input and output, thus simplifying the learning objective and its difficulty, and solving the above problems to a certain extent.

Fig. 4. ResNet module [2].
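For concreteness, a minimal Keras sketch of the identity shortcut of Eq. (2) (a generic two-convolution residual block, not the exact ResNet50/101 bottleneck configuration):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x_input, filters: int):
    """y_output = H(x_input, {w_i}) + x_input, as in Eq. (2):
    H is the learned residual mapping, and the skip connection
    routes the input directly to the output."""
    h = layers.Conv2D(filters, 3, padding="same", activation="relu")(x_input)
    h = layers.Conv2D(filters, 3, padding="same")(h)   # residual mapping H
    y = layers.Add()([h, x_input])                     # skip connection
    return layers.Activation("relu")(y)

# Example: one block applied to a 56 x 56 feature map with 64 channels.
inputs = tf.keras.Input(shape=(56, 56, 64))
model = tf.keras.Model(inputs, residual_block(inputs, 64))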
3.4. RPN network

Fig. 5. RPN network [28].

RPN [28] is a lightweight neural network that mainly uses sliding windows to scan areas in search of targets in the image, as shown in Fig. 5. Each area scanned by the RPN outputs k anchor boxes; if the size of the feature map is H × W, the number of anchor frames generated is H × W × k. The RPN has two outputs: one is the target/non-target probability, which determines whether the vector is a target object or background (the 2k scores in the figure); the other is the rectangle coordinates x, y, width w and height h of the frame (the 4k coordinates in the figure). In this paper, an RPN network was used to extract region proposals from the feature map, whose features were simultaneously extracted with a CNN network. More RoIs were then obtained by combining them, and these were input to the RoI Align layer.

The RPN shares the convolutional features of the preceding backbone network with the designed multi-task semantic segmentation algorithm. In Faster-RCNN, about 17,000 candidate regions are generated after the anchor frames are regressed; after the NMS algorithm, 2000–3000 anchor frames remain, and the threshold for judging whether to discard overlapping frames is 0.7. However, among the overlapping borders there are also borders with overlap values between 0.5 and 0.7. These borders can be removed, so that the final number of borders is about 2000. In addition, for a border with lower confidence, if it overlaps many borders of higher confidence, the information it contains is already covered by the higher-confidence borders, and a low-confidence border satisfying this condition can be discarded. Therefore, this paper applies a further improvement after the standard NMS has been executed (the improved step is denoted NMS′ below). The overlap rate of a candidate frame and the number of frames around it are used as the basis for selecting frames, as follows.

First, as in the method of [2], the non-maximum suppression algorithm (NMS) is applied to fine-tune position and scale to obtain the final optimal target bounding box [34]. Fig. 6 is a schematic diagram of the implementation of the NMS algorithm. Taking the four objects (chair, book, table and keyboard) in Fig. 6 as an example, there are N rectangular boxes containing each object; they are ranked from high to low confidence by NMS. By setting a threshold, rectangles with low confidence are deleted, and this is repeated until the best rectangle is selected to achieve the position output of the target.

Intersection-over-Union (IoU) is often applied in target detection; it represents the overlap rate between a generated candidate bound and the ground-truth bound, that is, the ratio of their intersection to their union. The ideal situation is a ratio equal to 1, i.e., complete overlap. With IoU, RCNN determines the overlap between the candidate frames and the real, artificially labeled frames among a large number of target candidate frames. The threshold is set to 0.5: if IoU > 0.5, these candidate frames are taken as the ones needed by default, and vice versa.

Each frame in the improved NMS is represented by a five-tuple (x1, y1, x2, y2, score), where (x1, y1) and (x2, y2) are the position coordinates of the upper-left and lower-right corners of the target frame respectively, and score is the confidence that the target object is contained in the frame. The improved NMS algorithm first sorts the tuples in ascending order of score, and then calculates the overlap ratio between frames according to formula (3), which is the quotient of the intersection and union of the two frame regions:

U_(p,q) = inter_(p,q) / (area_(p) + area_(q) − inter_(p,q)),   p ∈ [1, n−1], q ∈ [p+1, n]   (3)

where U_(p,q) and inter_(p,q) are the overlap rate and overlap area of frames p and q respectively (%, mm²); area_(p) and area_(q) are the areas of frames p and q respectively (mm²); n is the total number of frames; and α is the cut-off threshold.

For frame p, the number of frames q ∈ [p+1, n] with U_(p,q) ≥ β is counted as Sum_p; if Sum_p ≥ α, frame p is discarded, otherwise it is kept. The value of α found in the experiment is determined by the number of frames n: when n < 2050, the improved NMS algorithm is not applied; when n > 3000, the first 2000 frames with higher scores are kept. Implementing the improved NMS algorithm keeps the number of frames at about 2000. The quantities inter_(p,q) and area_(p) can be calculated by Eqs. (4) and (5):

inter_(p,q) = max(0, min(x2^p, x2^q) − max(x1^p, x1^q) + 1) × max(0, min(y2^p, y2^q) − max(y1^p, y1^q) + 1)   (4)

area_(p) = (x2^p − x1^p + 1) × (y2^p − y1^p + 1)   (5)

Fig. 6. Implementation of non-maximum suppression (NMS) algorithm [2].

Fig. 7. Schematic diagram of the improved NMS scheme.

The schematic diagram of NMS′ in the experiment is shown in Fig. 7. The dashed line is the border with higher confidence, and the solid lines are the borders with lower confidence in the local area, for which all U_(p,q) are larger than 0.5. The solid frames are abandoned after NMS′ is executed. Although the NMS algorithm alone can quickly reduce tens of thousands of candidate regions to 2000–3000 in a short time, performing NMS first and then NMS′ not only reduces the number of candidate regions to about 2000, but also better selects local candidate regions.
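The following sketch shows the improved selection step (NMS′) as we read Eqs. (3)–(5): for each frame p, the number of later-ranked frames whose overlap with p is at least β is counted, and p is discarded when that count reaches α. The value α = 3 is a placeholder; in the paper α depends on the frame count n.

import numpy as np

def area(b):
    """Frame area, Eq. (5); b = (x1, y1, x2, y2)."""
    return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)

def overlap(p, q):
    """Overlap rate U_(p,q) of frames p and q, Eqs. (3) and (4)."""
    ix = max(0.0, min(p[2], q[2]) - max(p[0], q[0]) + 1)
    iy = max(0.0, min(p[3], q[3]) - max(p[1], q[1]) + 1)
    inter = ix * iy
    return inter / (area(p) + area(q) - inter)

def nms_prime(frames, scores, alpha=3, beta=0.5):
    """Second-stage NMS' (sketch): discard frame p when
    Sum_p = #{q after p : U_(p,q) >= beta} reaches alpha."""
    order = np.argsort(scores)                 # sort by score, as in the text
    boxes = [frames[i] for i in order]
    keep = []
    for p in range(len(boxes)):
        sum_p = sum(overlap(boxes[p], boxes[q]) >= beta
                    for q in range(p + 1, len(boxes)))
        if sum_p < alpha:
            keep.append(order[p])
        # else: p's information is already covered by overlapping frames
    return keep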
3.5. Loss function

The model constructed in this paper includes three tasks (classification, detection and segmentation), so the loss function is also divided into three parts, one for each corresponding error. Thereby, the loss function of the model is defined as:

L_total = L_cls + L_box + L_seg   (6)

where L_cls and L_box represent the classification loss and regression loss respectively; as in the loss function defined for the target detection algorithm Faster-RCNN, the fully connected layer is used to predict the category of each RoI and the coordinate position of its rectangular box. L_seg is the loss function of the semantic segmentation branch. The output dimension of the segmentation branch for each RoI is k·m², representing k binary semantic segmentation masks of resolution m × m, one binary mask of that resolution per category, where k is the number of categories. Therefore, this experiment applies a sigmoid to each pixel, and L_seg is defined as the average binary cross-entropy loss. For an RoI whose real category is k, the loss is calculated only on the kth semantic segmentation mask; the outputs of the other masks are not included.
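A sketch of how the composite loss of Eq. (6) could be assembled in TensorFlow. The Huber box loss and the tensor shapes are our assumptions; the paper specifies only the three-way sum and the per-pixel sigmoid cross-entropy of the kth mask for L_seg.

import tensorflow as tf

def total_loss(cls_logits, cls_labels, box_pred, box_true,
               mask_logits, mask_true, k_true):
    """L_total = L_cls + L_box + L_seg, Eq. (6).
    mask_logits: [N, m, m, k]; mask_true: [N, m, m] binary (float);
    k_true: [N] int32 true class of each RoI."""
    # Classification loss over RoIs (softmax cross-entropy).
    l_cls = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=cls_labels, logits=cls_logits))
    # Box regression loss (Huber as a stand-in for Faster-RCNN's smooth L1).
    l_box = tf.reduce_mean(tf.keras.losses.huber(box_true, box_pred))
    # Segmentation loss: only the mask of the true class k contributes.
    n = tf.shape(mask_logits)[0]
    idx = tf.stack([tf.range(n), k_true], axis=1)
    mask_k = tf.gather_nd(tf.transpose(mask_logits, [0, 3, 1, 2]), idx)  # [N, m, m]
    l_seg = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=mask_true, logits=mask_k))
    return l_cls + l_box + l_seg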
4. Multi-task semantic segmentation experiment

4.1. Data set preparation

This paper used a Kinect color-depth camera to collect indoor scene images from various angles and under various illumination backgrounds, and constructed an indoor-scene RGB-D dataset as the experimental data. The data set had 2900 color images and 2900 depth images, of which 2100 formed the training set and the rest the test set. The scene images contain four types of common objects in daily life:

Chair, Book, Table and Keyboard; all other objects were regarded as Background (Fig. 8).

Fig. 8. Indoor scene images under different angles and different illuminations.

This article used TensorFlow to configure and build the environment under Ubuntu 16.04 on x86 hardware. The CPU was an Intel Core i7, the GPU a GTX 2080, and the memory 32 GB; CUDA 9.0 with cuDNN 7.0, TensorFlow 1.12, Python 3.6 and OpenCV 2.0 were used. After completing the experimental environment configuration and data set preprocessing, the model was trained. The parameters in the experiment were set as follows: batch_size = 8, epochs = 19, steps_per_epoch = 10k, i.e., a total of 190,000 iterations; learning_rate = 0.001, learning_momentum = 0.9, weight_decay = 0.0001.
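The reported configuration, gathered here as a hyperparameter dictionary for reference (the key names mirror the paper's wording; how they map onto a specific training loop is an assumption):

# Training hyperparameters as reported above.
TRAIN_CONFIG = {
    "batch_size": 8,
    "epochs": 19,
    "steps_per_epoch": 10_000,   # 19 epochs x 10k steps = 190,000 iterations
    "learning_rate": 0.001,
    "learning_momentum": 0.9,
    "weight_decay": 0.0001,
}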
4.2. Semantic segmentation algorithm evaluation criteria

There are many ways to evaluate the performance of a segmentation algorithm, such as its recall rate and precision rate. When requirements on both are high, F1 (or F-score) can be used to combine them. However, mAP also considers the ranking of retrieval results and avoids the single-point-value limitation of those metrics, so this article chooses the MAP value. Before calculating the MAP value, it is necessary to calculate precision and recall. Precision is the ratio of the number of true positive samples among the predicted positives to the number of all predicted positives, calculated by formula (7); recall is the ratio of the number of true positive samples to the number of all actual positive samples, calculated by formula (8):

precision = TP_num / (TP_num + FP_num)   (7)

recall = TP_num / (TP_num + FN_num)   (8)

where TP_num is the number of positive samples predicted correctly; FP_num is the number of samples incorrectly classified as positive; and FN_num is the number of positive samples incorrectly classified as negative.

With precision and recall, the average precision (AP) can be calculated: the P–R curve is drawn from precision and recall, and the area under the P–R curve is the average precision AP. The calculation formula is:

AP = ∫₀¹ p(r) dr   (9)

MAP (Mean Average Precision) is the average over all categories:

MAP = (Σ_{k=1}^{N} AP(k)) / N   (10)

where N represents the number of categories in the picture; the number of categories in this experiment is 4.
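A small numerical sketch of Eqs. (7)–(10); the trapezoidal integration of the P–R curve is our choice, as implementations differ in how they interpolate the curve.

import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (7) and (8)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Eq. (9): area under the P-R curve (trapezoidal rule)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """Eq. (10): mean of per-class AP over the N categories."""
    return float(np.mean(ap_per_class))

# Example with four categories (placeholder AP values, not the paper's results):
print(mean_average_precision([0.95, 0.93, 0.92, 0.94]))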
4.3. Results of the experiment

4.3.1. Loss function

The loss function represents the difference between the predicted value of the model and the training sample: the smaller the value, the closer the prediction is to the real sample and the better the robustness of the model; conversely, a larger value means a greater difference between the predicted and the real samples [35].

It can be seen from Fig. 9 that when multi-task semantic segmentation is trained on color images, the total loss drops to about 0.05 at around 140,000 iterations and oscillates around 0.05, basically reaching convergence; at the same time the regression loss, classification loss and segmentation loss all reach convergence, indicating that the model trains well. Fig. 10 plots the various loss functions of training on depth images against the number of iterations. It can be seen that only in Fig. 10(b), when the number of iterations reaches 160,000, does the regression loss value drop to around 0.002 and reach basic convergence.
Fig. 9. Loss function curve based on color image training model.


Fig. 10. Loss function curve based on depth image training model.

Fig. 11. Loss function curve of training model based on RGB-D fusion.

Meanwhile, the overall loss, classification loss and segmentation loss under depth-image training do not converge as the number of iterations grows. It can be seen from Fig. 11 that the overall loss of multi-task semantic segmentation trained on fused RGB-D images drops to about 0.1 after 100,000 iterations, reaching convergence, and the various loss functions converge faster and better.

It can be seen from these results that: (1) All three kinds of image data can realize multi-task semantic segmentation with joint target detection, which shows that the multi-task semantic segmentation algorithm proposed in this paper is feasible and effective. (2) The loss functions under RGB-D fusion training have the fastest convergence speed and the best convergence effect, and their curves change more stably, indicating that the trained model is better than those trained on color images or depth images alone. This also shows that the RGB-D image fusion method in this paper is feasible: the image features after fusion are richer, and more image features are learned during network training.

4.3.2. Mean average precision

This paper used the mean average precision (MAP) over the four semantic segmentation categories in the scene to measure the performance of the three image-data training models. After completing model training, the test set in the database was used to test the models trained on the three kinds of images.

Fig. 12. Prediction based on color image training model.
Fig. 13. Prediction based on depth image training model.
Fig. 14. Prediction based on the fused image training model.
Fig. 15. Comparison of test accuracy of multi-task semantic segmentation under three image data training models.

From the predictions in Figs. 12 and 14, we can see that the models trained on color image data and on fused image data successfully predicted the objects in the test set, and the various objects achieved good prediction results, indicating that the models trained well. It can be seen from Fig. 13 that when the model trained on depth image data predicts the objects in the test set, it finds two classes of books although there is only one class of book in the actual scene, so a prediction error occurs. This shows that the model trained on depth image data alone is not effective. Analysis shows that this is caused by the depth image data lacking image information, so the network extracts fewer image features.

The comparison of the test accuracy of multi-task semantic segmentation under the three image-data training models is shown in Fig. 15. The multi-task semantic segmentation algorithm has a MAP value of 90.925% under color image training and 75.9% under depth image training. After fusion, the MAP value under fused-image training reached 93.575%, higher than training with color and depth images alone by 2.65% and 17.675% respectively. At the same time, the accuracy of each of the four categories obtained with fusion-data training is significantly higher than the accuracy obtained by training with color or depth images alone.

4.3.3. Visualization of model prediction results

The visualization results based on the color image training model are shown in Fig. 16, those based on the depth image training model in Fig. 17, and those based on the color-depth fusion training model in Fig. 18.

Fig. 16. Visualization based on color image training model.
Fig. 17. Visualization based on depth image training model.
Fig. 18. Visualization based on the fused image training model.

According to the visualized results, the recognition of objects in the scene was better after information fusion. The results showed that the constructed multi-task model, combined with information fusion, could not only realize multi-output results but also achieve clearer edge-contour segmentation of objects at different scales, such as the books and keyboards displayed in the scene.
but there is only one type of book in the actual scene, and the model are shown in Fig. 16, the visualization results based on

5. Conclusion

In order to realize the semantic segmentation of multiscale objects in indoor scenes, a multi-task semantic segmentation model is proposed, which can not only obtain the location information of objects, but also further obtain their semantic information. The model is mainly improved on the basis of Faster-RCNN, for example by optimizing and improving the NMS process and adding information fusion, and it was trained on the self-built indoor scene dataset. The model is evaluated from three aspects: loss function, MAP value and visualized images. From the experimental results, it can be found that the model can adapt to the needs of multi-task and multiscale output. At the same time, the information fusion method helps the model overcome the influence of illumination on the one hand, and accelerates the training of the model on the other. Compared with other research results, the main contribution of this paper is to build a multi-task semantic segmentation model based on information fusion, which ensures accurate semantic segmentation while further improving the efficiency of the model. In the future, scenes with objects of more complex scales should be studied, and occlusion in real environments and the dynamic changes of objects should be further considered.

CRediT authorship contribution statement

Du Jiang: Methodology, Software, Writing - original draft. Gongfa Li: Conceptualization, Software. Chong Tan: Data curation, Writing - original draft. Li Huang: Methodology, Software. Ying Sun: Writing - review & editing, Formal analyses. Jianyi Kong: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by grants of the National Natural Science Foundation of China (Grant Nos. 52075530, 51575407, 51505349, 61733011, 41906177); the grants of the Hubei Provincial Department of Education, China (D20191105); the grants of the National Defense Pre-Research Foundation of Wuhan University of Science and Technology, China (GF201705); and the Open Fund of the Key Laboratory for Metallurgical Equipment and Control of Ministry of Education in Wuhan University of Science and Technology, China (2018B07, MECOF2019B13).

References

[1] S. Lowry, N. Sünderhauf, P. Newman, J.J. Leonard, D. Cox, P. Corke, M.J. Milford, Visual place recognition: A survey, IEEE Trans. Robot. 32 (1) (2015) 1–19.
[2] L. Huang, M. He, C. Tan, D. Jiang, G. Li, H. Yu, Jointly network image processing: multi-task image semantic segmentation of indoor scene based on CNN, IET Image Process. 14 (15) (2020) 3689–3697.
[3] J. You, W. Liu, J. Lee, A DNN-based semantic segmentation for detecting weed and crop, Comput. Electron. Agric. 178 (2020) 105750.


[4] S. Bhattacharya, P.K. Reddy Maddikunta, Q.-V. Pham, T.R. Gadekallu, S.R. Krishnan S, C.L. Chowdhary, M. Alazab, M. Jalil Piran, Deep learning and medical image processing for coronavirus (COVID-19) pandemic: A survey, Sustainable Cities Soc. 65 (2021) 102589, http://dx.doi.org/10.1016/j.scs.2020.102589.
[5] C. Jiang, R. Li, T. Chen, C. Xu, L. Li, S. Li, A two-lane mixed traffic flow model with drivers' intention to change lane based on cellular automata, Int. J. Bio-Inspir. Comput. 6 (4) (2020) 229–240.
[6] S. Khan, K. Muhammad, S. Mumtaz, S.W. Baik, V.H.C. de Albuquerque, Energy-efficient deep CNN for smoke detection in foggy IoT environment, IEEE Internet Things J. 6 (6) (2019) 9237–9245, http://dx.doi.org/10.1109/JIOT.2019.2896120.
[7] P. Hu, F. Perazzi, F.C. Heilbron, O. Wang, Z. Lin, K. Saenko, S. Sclaroff, Real-time semantic segmentation with fast attention, IEEE Robot. Autom. Lett. 6 (1) (2020) 263–270.
[8] Y. Weng, Y. Sun, D. Jiang, B. Tao, Y. Liu, J. Yun, D. Zhou, Enhancement of real-time grasp detection by cascaded deep convolutional neural networks, Concurr. Comput.: Pract. Exper. 33 (5) (2020) e5976, http://dx.doi.org/10.1002/CPE.5976.
[9] N. Marchal, C. Moraldo, H. Blum, R. Siegwart, C. Cadena, A. Gawel, Learning densities in feature space for reliable segmentation of indoor scenes, IEEE Robot. Autom. Lett. 5 (2) (2020) 1032–1038.
[10] D. Jiang, G. Li, Y. Sun, J. Hu, J. Yun, Y. Liu, Manipulator grabbing position detection with information fusion of color image and depth image using deep learning, J. Ambient Intell. Hum. Comput. (2021) http://dx.doi.org/10.1007/s12652-020-02843-w.
[11] Y. Sun, Y. Weng, B. Luo, G. Li, B. Tao, D. Jiang, D. Chen, Gesture recognition algorithm based on multi-scale feature fusion in RGB-D images, IET Image Process. 14 (15) (2020) 3662–3668.
[12] H. Duan, Y. Sun, W. Cheng, D. Jiang, J. Yun, Y. Liu, Y. Liu, D. Zhou, Gesture recognition based on multi-modal feature weight, Concurr. Comput.: Pract. Exper. 33 (5) (2020) e5991, http://dx.doi.org/10.1002/CPE.5991.
[13] Z. Zhou, H. Yu, C. Xu, Z. Chang, S. Mumtaz, J. Rodriguez, BEGIN: Big data enabled energy-efficient vehicular edge computing, IEEE Commun. Mag. 56 (12) (2018) 82–89.
[14] D. Jiang, G. Li, Y. Sun, J. Kong, B. Tao, Gesture recognition based on skeletonization algorithm and CNN with ASL database, Multimedia Tools Appl. 78 (21) (2019) 29953–29970.
[15] D. Jiang, G. Li, Y. Sun, J. Kong, B. Tao, D. Chen, Grip strength forecast and rehabilitative guidance based on adaptive neural fuzzy inference system using sEMG, Pers. Ubiquitous Comput. (2019) http://dx.doi.org/10.1109/TRO.2015.2496823.
[16] D. Jiang, Z. Zheng, G. Li, Y. Sun, J. Kong, G. Jiang, H. Xiong, B. Tao, S. Xu, H. Yu, et al., Gesture recognition based on binocular vision, Cluster Comput. 22 (Supplement 6) (2019) 13261–13271.
[17] W. Cheng, Y. Sun, G. Li, G. Jiang, H. Liu, Jointly network: a network based on CNN and RBM for gesture recognition, Neural Comput. Appl. 31 (Supplement 1) (2019) 309–323.
[18] F. Xiao, G. Li, D. Jiang, Y. Xie, J. Yun, Y. Liu, L. Huang, Z. Fang, An effective and unified method to derive the inverse kinematics formulas of general six-DOF manipulator with simple geometry, Mech. Mach. Theory 159 (2021) 104265, http://dx.doi.org/10.1016/j.mechmachtheory.2021.104265.
[19] J. Lafferty, M. Andrew, P. Fernando, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 282–289.
[20] J. Shotton, J. Winn, C. Rother, A. Criminisi, Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vis. 81 (1) (2009) 2–23.
[21] L. Wei, Y. Zhang, C. Huang, Z. Wang, Q. Huang, F. Yin, Y. Guo, L. Cao, Inland lakes mapping for monitoring water quality using a detail/smoothing-balanced conditional random field based on landsat-8/levels data, Sensors 20 (5) (2020) 1345.
[22] P. Kohli, P.H. Torr, et al., Robust higher order potentials for enforcing label consistency, Int. J. Comput. Vis. 82 (3) (2009) 302–324.
[23] R. Wang, W. Wan, K. Di, R. Chen, X. Feng, A high-accuracy indoor-positioning method with automated RGB-D image database construction, Remote Sens. 11 (21) (2019) 2572.
[24] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2012) 1915–1929.
[25] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: 27th IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[26] B. Hariharan, P. Arbelaez, R. Girshick, J. Malik, Object instance segmentation and fine-grained localization using hypercolumns, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2016) 627–639.
[27] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[28] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2016) 1137–1149.
[29] Y. Chen, Y. Lin, Y. Niu, X. Ke, T. Huang, Pyramid context contrast for semantic segmentation, IEEE Access 7 (2019) 173679–173693.
[30] J. Leng, Y. Liu, S. Chen, Context-aware attention network for image recognition, Neural Comput. Appl. 31 (12) (2019) 9295–9305.
[31] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 640–651.
[32] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from RGB-D images for object detection and segmentation, in: European Conference on Computer Vision, Springer, 2014, pp. 345–360.
[33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[34] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: 18th International Conference on Pattern Recognition, ICPR'06, 3, IEEE, 2006, pp. 850–855.
[35] H. Zhang, Y. Tian, K. Wang, W. Zhang, F.-Y. Wang, Mask SSD: An effective single-stage approach to object instance segmentation, IEEE Trans. Image Process. 29 (1) (2019) 2078–2093.

Du Jiang received the B.S. degree in mechanical engineering and automation from Wuhan University of Science and Technology, Wuhan, China, in 2017. He is currently pursuing the Ph.D. degree in mechanical design and theory at Wuhan University of Science and Technology. His current research interests include image processing and intelligent control.

Gongfa Li received the Ph.D. degree from Wuhan University of Science and Technology, Wuhan, China. He is currently a professor at Wuhan University of Science and Technology. His major research interests are computer-aided engineering, mechanical CAD/CAE, and modeling and optimal control of complex industrial processes.

Chong Tan received the M.S. degree in mechanical engineering and automation from Wuhan University of Science and Technology, Wuhan, China, in 2020. His current research interests include image processing and intelligent control.

Li Huang is currently an associate professor of computer science at the School of Computer Science and Technology, WUST, Wuhan, China. She received her Ph.D. in computer science from Huazhong University of Science and Technology in 2011. Her research interests include data management, semantic web and knowledge.


Ying Sun is currently a professor at Wuhan University of Science and Technology. Her major research focuses on teaching research in Mechanical Engineering.

Jianyi Kong received the Ph.D. degree from Helmut Schmidt University, Germany. He is currently a professor at Wuhan University of Science and Technology. His research interests are intelligent machines and controlled mechanisms, mechanical and dynamic design and fault diagnosis of electrical systems, mechanical CAD/CAE, and intelligent design and control.
