Auxiliary Bounding Box Regression For Object Detection
https://doi.org/10.1007/s11220-020-00319-x
ORIGINAL PAPER
Abstract
Object detection in optical remote sensing imagery must deal with arbitrary orientations and complex object appearance, which remains a major challenge. To address this problem, the post-processing of bounding boxes (BBs) is evaluated and discussed for object detection applications. The proposed method is divided into two stages: the first stage thresholds BBs with respect to their confidence values, and the second stage performs area-based BB regression (BBR). In BBR, the area of each BB is estimated, and oversized and undersized BBs are removed with respect to the size of the objects being detected. The widely known region-based approaches RCNN, Fast-RCNN and Faster-RCNN are used for evaluation, and comparative analysis validates the proposed framework. The results show that the proposed post-processing is effective for each kind of region-based detector.
1 Introduction
Advanced developments in computer vision are the reason for the enhanced performance of object detection and localization [1–6]. Object detection plays a vital role in creating comfortable and secure environments; for example, vehicle detection is very useful for traffic control systems, surveillance, monitoring and management [1]. Among recent cutting-edge developments, Convolutional Neural Network (CNN) based techniques are the most promising, yet object detection remains a prominent open problem. Furthermore, bounding boxes (BBs) play a vital role in object detection tasks, and regression of these BBs is necessary to advance detection performance.

* Shahid Karim
[email protected]
1 Department of Computer Science, ILMA University, Karachi, Pakistan
2 School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
3 School of Computer Science and Technology, Xidian University, Xi’an 710071, China
Bounding box regression (BBR) is a core stage in computer vision pipelines. In the literature, numerous BBR approaches have been proposed to improve detection performance. Firstly, it is important to locate the candidate regions inside the image, usually referred to as region proposals. A test image may contain multiple objects, and it is necessary to locate all the prominent objects it contains. Secondly, the resulting proposals must be refined for better detection performance. Among earlier approaches, the sliding window method is well known for generating such candidate regions in an image [7].
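The sliding-window scheme mentioned above can be sketched in a few lines. This is an illustrative example only; the window size, stride and (x, y, w, h) coordinate convention are our own assumptions, not details taken from [7]:

```python
def sliding_window_proposals(img_w, img_h, win_w, win_h, stride):
    """Enumerate candidate regions (x, y, w, h) by sliding a fixed-size
    window over the image with a fixed stride."""
    proposals = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            proposals.append((x, y, win_w, win_h))
    return proposals

# e.g., a 256x256 image scanned with a 64x64 window and stride 64
boxes = sliding_window_proposals(256, 256, 64, 64, 64)
print(len(boxes))  # 16 windows (4 positions per axis)
```

In practice the window is slid at several scales and aspect ratios, which is why exhaustive sliding-window search generates so many candidates compared to learned proposal methods.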
Similarly, an optimal BBR approach was proposed for the Region-based CNN (RCNN), which enhanced object detection performance [8]. On the other hand, various causes of decreased detection accuracy have been discussed, and solutions suggested to overcome these problems [9]; the specific problems relate to occlusion, localization errors, object size, viewpoint, visibility and aspect ratio. More importantly, incorrect detections reduce performance, so the leading advantage of a BBR approach is to refine the detections. BBR is low-cost and well suited to localization applications: an optimal BBR is computationally inexpensive and can compensate for imperfect detections and localizations.
Our work mainly addresses the bounding box regression problem. Firstly, we analyze the detection performance of state-of-the-art algorithms such as RCNN, Fast-RCNN and Faster-RCNN. Then, we propose a bounding box regression approach based on thresholding to improve the detection results. Subsequently, we adopt an area-based bounding box regression approach, which substantially improves the detection results. Lastly, the proposed approach is compared with the default performance of state-of-the-art detection methods.
The remainder of the paper is arranged as follows: Sect. 2 elaborates on previous work; Sect. 3 presents the methodology; Sect. 4 presents the results and discussion; and Sect. 5 concludes our work with future directions.
2 Previous Work
Sensing and Imaging (2021) 22:5
[5 × 5] and so on. At the completion of the process, the best match is found among all the predictions [12].
In the last few years, two types of detection approaches have been proposed: region-based approaches and regression-based approaches. In region-based approaches, candidate regions are found at the pre-processing stage [7, 8, 13, 14]: the candidate regions are first classified and then refined for improvement. The main disadvantage of region-based approaches is the computationally expensive regression, which can be tolerated given their optimal detection accuracy. On the other hand, regression-based approaches do not utilize BBR at the pre-processing step, and improve detection performance through their own typical strategies [12, 15, 16]. In the You Only Look Once (YOLO) approach, which is specifically based on spatial constraints, the number of BBs is significantly reduced compared to the selective search (SS) [17] used in R-CNN [15]. The Single Shot MultiBox Detector (SSD) requires the ground truth and an input image at the starting stage; it then estimates a set of BBs (e.g., 4) with distinct aspect ratios at every location in the feature maps and predicts the confidence and offsets for every single BB [12].
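The SSD default-box idea described above can be illustrated as follows. The scale value, the particular aspect ratios and the normalized (cx, cy, w, h) encoding are assumptions made for this sketch, not values from the SSD paper or from this work:

```python
def default_boxes(fmap_size, scale, ratios=(1.0, 2.0, 0.5, 3.0)):
    """Generate SSD-style default boxes (cx, cy, w, h) in normalized
    image coordinates: one box per aspect ratio at every cell of a
    square fmap_size x fmap_size feature map."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size  # cell center
            cy = (i + 0.5) / fmap_size
            for r in ratios:
                # width/height stretched by sqrt(r) so area stays ~scale^2
                boxes.append((cx, cy, scale * r ** 0.5, scale / r ** 0.5))
    return boxes

# an 8x8 feature map with 4 ratios yields 8 * 8 * 4 = 256 default boxes
print(len(default_boxes(8, 0.2)))  # 256
```

The detector then predicts a confidence and four coordinate offsets for each of these default boxes, which is the "confidence and offsets" step mentioned above.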
Furthermore, detection accuracy is measured by the mean average precision (mAP) at multiple thresholds [18, 19]. Optimal localization can be enforced, and unwanted mislocalization limited, by specifying an intersection over union (IoU) threshold: if a predicted box overlaps a ground truth by more than the given IoU, the localization/detection is accepted [19]. However, the IoU is uninformative for non-overlapping BBs, as it cannot measure the distance between boxes that do not overlap. To overcome this major issue, a Generalized Intersection over Union (GIoU) based regression method was introduced by incorporating GIoU as a loss into standard detection approaches [20]. The BBR loss is also a valuable metric for enhancing localization accuracy; localization variance, mostly observed relative to the ground truths, allows neighboring BBs to be merged during non-maximum suppression (NMS) [21]. By incorporating both metrics, IoU and BBR loss, detection performance can be optimized constructively [22]. In addition, the Distance-IoU (DIoU) was proposed to overcome the limitations of IoU and GIoU, and has been utilized in NMS to suppress redundant BBs [23]. The DIoU loss retains the benefit provided by the IoU loss while also reducing the distance between the two boxes, which makes target estimation more accurate [24].
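The IoU and GIoU quantities discussed above can be computed in a short sketch. This is an illustration of the published definitions, not code from this paper; the (x1, y1, x2, y2) corner convention is our assumption:

```python
def iou_and_giou(a, b):
    """IoU and GIoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # intersection rectangle (clamped to zero if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C; GIoU = IoU - |C \ union| / |C|
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c - union) / c

# non-overlapping boxes: IoU saturates at 0, but GIoU still reflects distance
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))
```

For the two unit boxes separated by a gap, IoU is 0.0 while GIoU is negative (about -0.33 here), which is exactly the non-overlap limitation of IoU that GIoU was designed to fix.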
3 Methodology
Several methods for bounding box regression (BBR) have been contemplated in previous work. BBR is a fundamental requirement in object detection, localization, recognition and classification applications. Initially, we trained the detectors RCNN [8], Fast-RCNN [25] and Faster-RCNN [7] in the conventional way and then proceeded to post-processing. Our proposed method reduces the number of BBs in two steps: (1) thresholding and (2) area-based bounding box regression. The parameters of the selected detectors were left at their defaults: the number of epochs was set to 10, the batch size to 32 and the learning rate to 1e−6. These parameters are meaningful for further fine-tuning of the network. The flow diagram of our proposed framework is shown in Fig. 1.
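The two post-processing steps can be summarized as a minimal sketch. The confidence threshold and area range mirror values used later in the paper, while the (x, y, w, h, confidence) detection tuple format is our own assumption for illustration:

```python
def postprocess(detections, conf_thresh=0.9, area_min=100.0, area_max=150.0):
    """Two-stage post-processing on (x, y, w, h, confidence) tuples:
    (1) drop boxes below the confidence threshold,
    (2) drop boxes whose area w*h falls outside [area_min, area_max]."""
    kept = [d for d in detections if d[4] >= conf_thresh]
    return [d for d in kept if area_min <= d[2] * d[3] <= area_max]

dets = [(10, 10, 12, 10, 0.95),   # kept: confidence 0.95, area 120
        (40, 40, 12, 10, 0.60),   # dropped at stage 1 (low confidence)
        (70, 70, 30, 20, 0.99)]   # dropped at stage 2 (area 600, oversized)
print(postprocess(dets))  # [(10, 10, 12, 10, 0.95)]
```

Because both stages are simple filters over the detector's output list, they can be bolted onto any detector without retraining, which is the point of the framework.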
3.1 Dataset
Datasets of optical remote sensing imagery are motivating new developments; a recently developed example is the dataset for object detection in aerial images (DOTA) [26], which is useful for several remote sensing applications. It consists of many images of dissimilar objects such as vehicles, airplanes, oil storage tanks, playgrounds, tennis courts, swimming pools, and so on. Because it offers many images for multiple-object detection, we utilized DOTA for our proposed framework. The training set comprised 446 images with 5271 ROIs for airplane detection, 47 images with 1913 ROIs for oil tank detection, 500 images with 729 ROIs for playground detection, 103 images with 1091 ROIs for swimming pool detection, and 239 images with 2003 ROIs for tennis court detection. All training images were taken from DOTA except the playground images, which were taken from Google Earth.
3.2 Thresholding
There are several ways to optimize an object detection model, among which thresholding is very common. In the given object detection framework, score/confidence-based thresholding is utilized to enhance detection accuracy. The outcome of airplane detection without thresholding can be observed in Fig. 3b: there are several unwanted BBs whose confidence is less than 0.9, and the same happens for other objects in the detection process. In our experiments, the threshold was set to 0.9 to reduce unwanted false alarms. Several limits were analyzed, such as 0.75, 0.8 and 0.85, and the optimal results were found at a threshold of 0.9. Five BBs were eliminated by thresholding, as shown in Fig. 3c.
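The threshold sweep described above can be mimicked with hypothetical confidence scores. The scores below are invented for illustration; only the candidate thresholds (0.75, 0.80, 0.85, 0.90) come from the text:

```python
def count_kept(detections, thresh):
    """Count detections whose confidence meets the threshold."""
    return sum(1 for _, conf in detections if conf >= thresh)

# hypothetical (box_id, confidence) pairs from one test image
scores = [("bb%d" % i, c) for i, c in
          enumerate([0.98, 0.97, 0.95, 0.93, 0.91,
                     0.88, 0.84, 0.79, 0.72, 0.55])]

for t in (0.75, 0.80, 0.85, 0.90):
    print(t, count_kept(scores, t))
# with these invented scores, raising the threshold to 0.90 keeps the
# five strongest boxes and eliminates the five weakest
```

Sweeping the threshold this way, and picking the value that best trades false alarms against missed detections on a validation set, is the procedure the experiments describe.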
Fig. 3 The images are presented to describe the significance of the proposed framework: a original image, b airplane detection without any post-processing, c airplane detection with thresholding and d airplane detection with thresholding and area-based BBR
3.3 Area-Based Bounding Box Regression
Oversized and undersized BBs were removed by area-based BBR. Suppose the optimal results come from BBs with areas in the range of 100–150 square meters; then BBs with an area higher than 150 or lower than 100 square meters will be eliminated by area-based BBR. Lastly, area-based BBR was applied to reduce the BBs, and the outcomes were promising compared to the results obtained without thresholding and area-based BBR. The yellow boxes in Fig. 2 represent the required bounding boxes, while the red and green bounding boxes were eliminated by our proposed framework. The area of a bounding box is denoted by k, as shown in Fig. 2; k must lie in the specified range for the bounding box to be kept, otherwise the box is eliminated. The effect of the proposed method can be visualized from the difference in BBs between Fig. 3b, d: nine BBs were eliminated by area-based BBR, as shown in Fig. 3d.
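The area test on k can be sketched as follows. The 100–150 range comes from the example above, while the (x, y, w, h) box format and the label names are our own illustration:

```python
def classify_by_area(box, k_min=100.0, k_max=150.0):
    """Label a box by its area k = w * h: boxes with k in [k_min, k_max]
    are kept, while undersized and oversized boxes are eliminated."""
    x, y, w, h = box
    k = w * h
    if k < k_min:
        return "undersized"   # eliminated
    if k > k_max:
        return "oversized"    # eliminated
    return "kept"

print(classify_by_area((0, 0, 11, 11)))  # kept (k = 121)
print(classify_by_area((0, 0, 20, 20)))  # oversized (k = 400)
print(classify_by_area((0, 0, 5, 5)))    # undersized (k = 25)
```

The filter assumes all target objects in a scene have similar physical size, which is exactly the constraint noted in the conclusions.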
4 Results and Discussion
In this section, we evaluate and discuss the effects of the proposed framework on object detection. The large DOTA dataset was utilized to train the well-known region-based CNN models, and the performance of the proposed framework was analyzed. Firstly, we trained the object detectors RCNN, Fast-RCNN and Faster-RCNN on DOTA images; multiple detectors were selected to evaluate the effectiveness of the proposed method, and the comparative study leads towards better recommendations. Secondly, the proposed framework was implemented to enhance detection performance. The results associated with BBR are shown in Fig. 3, in which (a) is the original test image, (b) the detection result without post-processing, (c) the result after thresholding and (d) the result after area-based BBR; these results were obtained with Faster-RCNN. The analysis for airplane detection was conducted on a small test set of 35 test images containing 218 airplanes. The accuracy assessment was conducted on the test images, and the metrics before and after area-based BBR are given in Tables 1 and 2, respectively, where TP, FP and FN denote true positives, false positives and false negatives.
Similarly, optimal results were obtained for other objects after applying thresholding and area-based BBR, as shown in Fig. 4a–d. The accuracy assessment of the proposed framework is given in Table 1, which clearly shows that the proposed method enhances the recall and precision rates for object detection. To verify the improvement, we investigated the precision and recall rates for airplanes. The optimal detection results for oil tanks, playgrounds, tennis courts and swimming pools are shown in Fig. 4. These results confirm that the proposed approach reduces false alarms.
Fig. 4 The images are presented to describe the significance of the proposed framework after implementation of thresholding and area-based BBR for the detection of similar-sized objects individually, such as a oil tank, b playground, c swimming pool, and d tennis court
In total, fourteen unwanted BBs were cleared away, as shown in Fig. 3.
According to previous work, the performance of RCNN, Fast-RCNN and Faster-RCNN is promising, and each method is worthwhile for specific applications. The proposed methodology improves the performance of each detector individually: because it is a post-processing step, it is easy to adapt and can be applied to any kind of detection method at the post-processing stage, i.e., after the training stage of the detectors or after the detection process itself. Conclusively, it enhances detector accuracy after each single step, namely thresholding and area-based BBR.
Several methods for object detection are being analyzed to make improvements, and the core differences lie in the working strategies of the distinct methods. For instance, the key characteristic of SSD is that it adds feature maps, taken from distinct layers, on top of a YOLO-like architecture, which affects the speed and accuracy of SSD. Consequently, SSD takes more memory than YOLO due to the added feature maps, and the computational speed of YOLO is faster than that of Fast and Faster RCNN. In contrast, the accuracy and mAP of YOLO are lower than those of Fast and Faster RCNN [15].
Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)
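Eqs. (1) and (2) can be verified with a tiny sketch. The TP/FP/FN counts below are hypothetical illustrations, not the values reported in Tables 1 and 2:

```python
def precision_recall(tp, fp, fn):
    """Eqs. (1) and (2): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# illustrative counts only: suppose 200 airplanes detected correctly,
# 10 false alarms, 18 missed (the test set contains 218 airplanes)
p, r = precision_recall(200, 10, 18)
print(round(p, 3), round(r, 3))  # 0.952 0.917
```

Removing false-alarm boxes via thresholding and area-based BBR lowers FP while (ideally) leaving TP untouched, which is why the post-processing raises precision without sacrificing recall.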
5 Conclusions
In this paper, we have presented a novel approach to reduce false positive alarms in the object detection and localization process. The proposed approach is based on thresholding and area-based BBR. According to our experiments, thresholding effectively reduces the unwanted boxes, and area-based BBR is even more effective than thresholding. The results show that the proposed method removes a substantial proportion of unwanted BBs. The widely known object detectors RCNN, Fast-RCNN and Faster-RCNN were employed for a comparative study of object detection. The only constraint of area-based BBR is that, to obtain optimal detection results, the objects being detected should be of similar size in all test images; if the object size changes significantly, it becomes difficult to reduce the BBs reliably.
Optimal solutions for further accuracy enhancement of object detection remain to be found and can be considered in future work. Furthermore, the proposed approach could be extended to the detection of objects of multiple sizes.
Acknowledgments This work was supported by the National Natural Science Foundation of China under Grant 61471148.
References
1. Tayara, H., Soo, K. G., & Chong, K. T. (2018). Vehicle detection and counting in high-resolution aerial images using convolutional regression neural network. IEEE Access, 6, 2220–2230.
2. Li, K., Cheng, G., Bu, S., & You, X. (2018). Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 56(4), 2337–2348.
3. Koga, Y., Miyazaki, H., & Shibasaki, R. (2018). A CNN-based method of vehicle detection from aerial images using hard example mining. Remote Sensing, 10(1), 124.
4. Bazi, Y., & Melgani, F. (2018). Convolutional SVM networks for object detection in UAV imagery. IEEE Transactions on Geoscience and Remote Sensing, 56(6), 3107–3118.
5. ElMikaty, M., & Stathaki, T. (2018). Car detection in aerial images of dense urban areas. IEEE Transactions on Aerospace and Electronic Systems, 54(1), 51–63.
6. Qiu, S., Wen, G., Deng, Z., Liu, J., & Fan, Y. (2018). Accurate non-maximum suppression for object detection in high-resolution remote sensing images. Remote Sensing Letters, 9(3), 237–246.
7. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).
8. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2–9). https://doi.org/10.1109/CVPR.2014.81.
9. Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In European conference on computer vision (pp. 340–353).
10. Karim, S., Zhang, Y., Asif, M. R., & Ali, S. (2017). Comparative analysis of feature extraction methods in satellite imagery. Journal of Applied Remote Sensing, 11(4), 42618.
11. Karim, S., Zhang, Y., Ali, S., & Asif, M. R. (2018). An improvement of vehicle detection under shadow regions in satellite imagery. In Proceedings of SPIE, the International Society for Optical Engineering (Vol. 10615). https://doi.org/10.1117/12.2303518.
12. Liu, W., et al. (2016). SSD: Single shot multibox detector. In European conference on computer vision (pp. 21–37).
13. He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (pp. 346–361).
14. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (pp. 379–387).
15. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).
16. Najibi, M., Rastegari, M., & Davis, L. S. (2016). G-CNN: An iterative grid based object detector. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2369–2377).
17. Uijlings, J. R. R., Van De Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.
18. Gidaris, S., & Komodakis, N. (2016). LocNet: Improving localization accuracy for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 789–798).
19. Dickerson, N. L. (2017). Refining bounding-box regression for object localization. Dissertations and Theses, Paper 3940. https://doi.org/10.15760/etd.5824.
20. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 658–666).
21. He, Y., Zhu, C., Wang, J., Savvides, M., & Zhang, X. (2019). Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2888–2897).
22. Qian, X., Lin, S., Cheng, G., Yao, X., Ren, H., & Wang, W. (2020). Object detection in remote sensing images based on improved bounding box regression and multi-level features fusion. Remote Sensing, 12(1), 143.
23. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-IoU loss: Faster and better learning for bounding box regression. In AAAI (pp. 12993–13000).
24. Yuan, D., Chang, X., & He, Z. (2020). Accurate bounding-box regression with distance-IoU loss for visual tracking. arXiv:2007.01864.
25. Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448).
26. Xia, G.-S., et al. (2018). DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of CVPR.
27. Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2015). Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 447–456).