Chen et al. - 2018 - An Algorithm for Highway Vehicle Detection Based on Convolutional Neural Network
Abstract
In this paper, we present an efficient and effective framework for vehicle detection and classification from traffic surveillance cameras. First, we cluster the vehicle scales and aspect ratios in the vehicle datasets. Then, we use a convolutional neural network (CNN) to detect vehicles. We utilize feature fusion techniques to concatenate high-level and low-level features and detect vehicles of different sizes on different features. To improve speed, we adopt a fully convolutional architecture instead of fully connected (FC) layers. Furthermore, recent complementary advances such as batch normalization, hard example mining, and inception have been adopted. Extensive experiments on the JiangSuHighway Dataset (JSHD) demonstrate the competitive performance of our method. Our framework obtains a significant improvement over Faster R-CNN of 6.5% mean average precision (mAP). With 1.5 GB of GPU memory at the test phase, the speed of the network is 15 FPS, three times faster than Faster R-CNN.
Keywords: Vehicle detection, Convolutional neural network, k-means, Feature concatenation
Chen et al. EURASIP Journal on Image and Video Processing (2018) 2018:109 Page 2 of 7
SSD extracts candidate regions on a high-level feature map; the high-level feature has more semantic information but cannot locate objects well. (3) Vehicle detection requires high real-time performance, but Faster R-CNN adopts FC layers; it takes about 0.2 s per image with the VGG16 [13] network.

Aiming at these problems of the general object detection methods, we present an efficient and effective framework for vehicle detection and classification from traffic surveillance cameras. This method fuses the advantages of the two-stage and one-stage approaches. Meanwhile, we use some tricks such as hard example mining [14], data augmentation, and inception [15]. The main contributions of our work are summarized as follows:

1) We use the k-means algorithm to cluster the vehicle scales and aspect ratios in the vehicle datasets. This process can improve mean average precision (mAP) by 1.6%.
2) We detect vehicles on different feature maps according to their sizes.
3) We fuse the low-level and high-level feature maps, so the low-level feature map gains more semantic information.

Our detector is time and resource efficient. We evaluate our framework on the JiangSuHighway Dataset (JSHD) (Fig. 1) and obtain a significant improvement over the state-of-the-art Faster R-CNN by 6.5% mAP. Furthermore, our framework achieves 15 FPS on an NVIDIA TITAN XP, three times faster than the seminal Faster R-CNN.

2 Related work
In this section, we give a brief introduction to vehicle detection in traffic surveillance cameras. Vision-based vehicle detection algorithms can be divided into three categories: motion-based approaches, hand-crafted feature-based approaches, and CNN-based approaches.

Motion-based approaches include frame subtraction, optical flow, and background subtraction. Frame subtraction computes the differences between two or three consecutive frames to detect moving objects. It is characterized by simple calculation and adaptation to dynamic backgrounds, but it is not ideal for motion that is too fast or too slow. Optical flow [16] calculates the motion vector of each pixel and tracks these pixels, but this approach is complex and time-consuming. Background subtraction methods such as GMM are widely used in vehicle detection by modeling the distributions of the background and foreground [2]. However, these approaches cannot classify vehicles or detect still ones.

Hand-crafted feature-based approaches include Histogram of Oriented Gradients (HOG) [17], SIFT [18], and Haar-like features. Before the success of CNN-based approaches, hand-crafted feature approaches such as the deformable part-based model (DPM) [19] achieved state-of-the-art performance. DPM explores an improved HOG feature to describe each part of a vehicle, followed by classifiers like SVM and AdaBoost. However, hand-crafted features have limited representation power.

CNN-based approaches have shown rich representation power and achieved promising results [4–6, 9, 11]. R-CNN uses object proposals generated by selective search [20] to train a CNN for detection tasks. Under the R-CNN framework, SPP-Net [7] and Fast R-CNN [5] speed up detection by generating region proposals on the feature map; these approaches need to compute the features only once. Faster R-CNN [4] uses a region proposal network instead of selective search, so it can be trained end to end, and both speed and accuracy improve. R-FCN [8] tries to reduce the computation time with position-sensitive score maps. Considering their high efficiency, one-stage approaches have attracted much more attention recently. YOLO [9] uses a single feed-forward convolutional network to directly predict object classes and locations, which is extremely fast. SSD [11] extracts anchors of different aspect ratios and scales on multiple feature maps. It obtains competitive detection results at higher speed; for example, SSD runs at 58 FPS on an NVIDIA TITAN X for 300 × 300 input, nine times faster than Faster R-CNN.
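As a concrete reference for the frame-subtraction baseline described above, a minimal NumPy sketch follows; the frame size and the intensity threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Binary motion mask from two consecutive grayscale frames.

    The threshold (25 intensity levels here) is an illustrative choice;
    in practice it is tuned to the scene.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Toy example: a bright "vehicle" patch appears between two frames.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame[2:4, 2:4] = 200  # object present only in the second frame

mask = frame_difference_mask(prev_frame, curr_frame)
print(mask.sum())  # 4 pixels flagged as motion
```

As the related work notes, such a mask marks only changed pixels, which is why still vehicles are invisible to this family of methods.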
Fig. 2 Clustering vehicle sizes and aspect ratios on JSHD. Left: the result for aspect ratios. Right: the result for vehicle sizes
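The clustering shown in Fig. 2 can be sketched as a plain k-means over scalar box statistics; the sample aspect ratios, the value of k, and the 1-D simplification below are illustrative assumptions (the paper runs the clustering on JSHD annotations to obtain five vehicle sizes and three aspect ratios).

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Plain k-means on scalars (e.g., box aspect ratios or areas)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initialize from the data
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Recompute centers as cluster means; keep old center if empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

# Illustrative aspect ratios (width / height) from hypothetical annotations.
ratios = [0.8, 0.9, 1.0, 1.1, 2.3, 2.4, 2.5, 4.0, 4.2]
print(kmeans_1d(ratios, k=3))  # three representative aspect ratios
```

The cluster centers then serve directly as the reference-box shapes of Sect. 3.4, replacing hand-picked anchor settings.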
size vehicles on different-size feature pyramid layers, and this can improve detection accuracy.

3.4 Reference boxes
In order to detect vehicles of different sizes on different feature maps, we generate candidate proposals on each feature map. As we all know, different feature maps have different receptive field sizes; a low-level feature map has smaller receptive fields than a high-level one. We run k-means on JSHD to get five vehicle sizes and three aspect ratios. The width and height of each box are computed with respect to the aspect ratio. In total, there are two scales and three aspect ratios at each feature map location. Finally, we combine them together and use non-maximum suppression (NMS) to refine the results.

3.5 Inception
We use some new techniques to improve detection results, including inception and batch normalization. We employ the inception module in our network; in this paper, we just use the simplest structure, InceptionV1, as shown in Fig. 4. The inception module can improve the feature representation and detection accuracy. Batch normalization leads to significant improvements in convergence while training. By adding batch normalization on all feature layers in our network, we obtain more than 1.9% improvement in mAP. Batch normalization also helps regularize the model.

we adopt data augmentation. We mainly use two data augmentation methods to construct a robust model that adapts to vehicle variation. One is flipping the input image. The second is randomly sampling a patch whose edge length is {0.5, 0.6, 0.7, 0.8, 0.9} of the original image and within which at least one vehicle's center lies. Please refer to [11] for more details.

4.2 Hard negative mining
After the matching step, most of the default boxes are negatives; similar to [11], we use hard negative mining to mitigate the extreme class imbalance. At each mini-batch, we keep the ratio between negatives and positives at most 3:1, instead of using all negative anchors in training.

4.3 Loss function
We use a multi-task loss function to jointly train our network end to end. The loss function consists of two parts, the loss of classification and the loss of bounding box regression. We denote the loss of bounding box regression
with Lloc. It optimizes the smoothed L1 loss between the predicted bounding box locations t = (tx, ty, tw, th) and the target offsets t∗ = (t∗x, t∗y, t∗w, t∗h). We adopt the parameterizations of the four coordinates as follows (2):

tx = (x − xa)/wa,    ty = (y − ya)/ha,
tw = log(w/wa),      th = log(h/ha),
t∗x = (x∗ − xa)/wa,  t∗y = (y∗ − ya)/ha,
t∗w = log(w∗/wa),    t∗h = log(h∗/ha),        (2)

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, xa, and x∗ are for the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, and h).

The loss of classification is denoted as Lcls, which is computed by a Softmax over K + 1 outputs for each anchor.

Therefore, our loss function is defined as follows (3):

L = (1/Ncls) Lcls + λ (1/Nloc) Lloc        (3)

where λ balances the two parts. We consider the two parts equally important, and λ = 1 is set to compute the multi-task loss.

5 Experiments, results, and discussion
To evaluate the effectiveness of our network, we conduct experiments on our vehicle dataset (JSHD). The experiments are implemented on Ubuntu 16.04 with a GPU (NVIDIA TITAN XP) and an i7 7700 CPU.

5.1 Dataset
There are a large number of various types of vehicles in highway surveillance video, and it is suitable for traffic video analysis because of the large and long view of the road. So we construct a new dataset from 25 videos of the JiangSu highway, Jiangsu province, which we call JSHD, as shown in Figs. 1 and 5. The dataset contains 5000 frames captured from the videos. The vehicles are classified into four categories (bus, minibus, car, truck). We do not care about vehicles that are too small; specifically, vehicles whose height is less than 20 pixels are ignored. We use a random 3000 frames to train our network and the remaining 2000 frames to test.

5.2 Optimization
During the training phase, stochastic gradient descent (SGD) is used to optimize our network. We initialize the parameters of all newly added layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. We set the learning rate to 0.001 for the first 60k iterations and 0.0001 for the next 60k iterations. The batch size is 16 for a 320 × 320 model. We use a momentum of 0.9 and a weight decay of 0.005.

5.3 Results
To evaluate the effectiveness of our proposed network, we compare it to state-of-the-art detectors on JSHD. Table 1 shows the results of our experiments. It is obvious that our network outperforms the other algorithms: we achieve a significant overall improvement of 6.5% mAP over the state-of-the-art Faster R-CNN and 6.9% over SSD300. It is clear that the precision for the car class is lower than that for the bus and truck because the deep learning detection

Table 1 Results on the JSHD

Models         Overall (mAP)   Car    Bus    Minibus   Truck
Fast R-CNN     67.2            53.6   83.2   62.5      69.5
Faster R-CNN   69.2            55.2   85.4   64.7      71.3
YOLO           58.9            46.6   75.2   53.4      60.5
SSD300         68.8            54.4   85.1   64.3      71.5
SSD512         71.2            57.4   87.2   66.8      73.4
RON320 [25]    73.6            60.2   89.5   69.1      75.5
Ours           75.7            62.4   91.8   71.3      77.3
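The box parameterization of Eq. (2) and the multi-task loss of Eq. (3) can be sketched as follows; the smoothed L1 form and the toy boxes below are illustrative assumptions consistent with the text, not values from the paper.

```python
import math

def encode_box(box, anchor):
    """Eq. (2): offsets (tx, ty, tw, th) of a box relative to an anchor.

    Boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def smooth_l1(pred, target):
    """Smoothed L1 loss summed over the four offsets."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multi_task_loss(l_cls, n_cls, l_loc, n_loc, lam=1.0):
    """Eq. (3): L = (1/Ncls) * Lcls + lambda * (1/Nloc) * Lloc."""
    return l_cls / n_cls + lam * l_loc / n_loc

# Toy example with one anchor, one prediction, and one ground-truth box.
anchor = (50.0, 50.0, 20.0, 40.0)
pred_box = (52.0, 51.0, 22.0, 38.0)
gt_box = (54.0, 50.0, 24.0, 40.0)

t_pred = encode_box(pred_box, anchor)
t_star = encode_box(gt_box, anchor)
l_loc = smooth_l1(t_pred, t_star)
print(multi_task_loss(l_cls=0.7, n_cls=1, l_loc=l_loc, n_loc=1))
```

Encoding offsets relative to the anchor (rather than regressing raw pixels) keeps the regression targets roughly scale-invariant, which is why the same head can serve anchors of very different sizes.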
Table 2 Module ablation analysis

                                   Faster R-CNN   Ours
k-means?                                          √      √      √      √
Feature concatenate?                                     √      √      √
Detecting on different features?                                √      √
Batch normalization?                                                   √
JSHD (mAP)                         69.2           70.8   72.8   73.8   75.7

ongoing, and those data and materials are still required by the author and co-authors for further investigations.

Authors' contributions
LC and YR designed the research. FY, HF, and QC analyzed the data. LC wrote and edited the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
We approved.