
588 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 3, NO. 1, JANUARY 2018

weedNet: Dense Semantic Weed Classification Using Multispectral Images and MAV for Smart Farming

Inkyu Sa, Zetao Chen, Marija Popović, Raghav Khanna, Frank Liebisch, Juan Nieto, and Roland Siegwart

Abstract—Selective weed treatment is a critical step in autonomous crop management as related to crop health and yield. However, a key challenge is reliable and accurate weed detection to minimize damage to surrounding plants. In this letter, we present an approach for dense semantic weed classification with multispectral images collected by a micro aerial vehicle (MAV). We use the recently developed encoder–decoder cascaded convolutional neural network, SegNet, that infers dense semantic classes while allowing any number of input image channels and class balancing with our sugar beet and weed datasets. To obtain training datasets, we established an experimental field with varying herbicide levels, resulting in field plots containing only either crop or weed and enabling us to use the normalized difference vegetation index as a distinguishable feature for automatic ground truth generation. We train six models with different numbers of input channels and training conditions (fine-tuning) to achieve ∼0.8 F1-score and 0.78 area-under-the-curve classification metrics. For model deployment, an embedded Graphics Processing Unit (GPU) system (Jetson TX2) is tested for MAV integration. The dataset used in this letter is released to support the community and future work.

Fig. 1. NIR (top-left) and Red channel (bottom-left) input images with ground truth (top-right). Bottom-right is the probability output from our dense semantic segmentation framework.

Index Terms—Aerial systems, applications, agricultural automation, robotics in agriculture and forestry.

I. INTRODUCTION

TO SUSTAIN a growing worldwide population with sufficient farm produce, new smart farming methods are required to increase or maintain crop yield while minimizing environmental impact. Precision agriculture techniques achieve this by spatially surveying key indicators of crop health and applying treatment, e.g., herbicides, pesticides, and fertilizers, only to relevant areas. Here, robotic systems can often be used as flexible, cost-efficient platforms replacing laborious manual procedures.

Specifically, weed treatment is a critical step in autonomous farming as it directly associates with crop health and yield [1]. Reliable and precise weed detection is a key requirement for effective treatment as it enables subsequent processes, e.g., selective stamping, spot spraying, and mechanical tillage, while minimizing damage to surrounding vegetation. However, accurate weed detection presents several challenges. Traditional object-based classification approaches are likely to fail due to unclear crop-weed boundaries, as exemplified in Fig. 1. This aspect also impedes manual data labeling, which is required for supervised learning algorithms.

To address these issues, we employ a state-of-the-art dense (pixel-wise) Convolutional Neural Network (CNN) for segmentation. We chose this CNN because it can predict relatively faster than other algorithms while maintaining competitive performance. The network utilizes a modified VGG16 [2] architecture (i.e., dropping the last two fully-connected layers) as an encoder, and a decoder is formed with upsampling layers that are counterparts for each convolutional layer in the encoder. Our CNN encodes visual data as low-dimensional features capturing semantic information, and decodes them back to higher dimensions by up-sampling. To deal with intensive manual labeling tasks, we maneuver a micro aerial vehicle (MAV) with a downward-facing multispectral camera over a designed field with varying herbicide levels applied to different crop field areas.

Manuscript received September 9, 2017; accepted November 1, 2017. Date of publication November 20, 2017; date of current version December 11, 2017. This paper was recommended for publication by Associate Editor J. Kim and Editor W. K. Chung upon evaluation of the reviewers' comments. This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreements 644227 and 644128, and in part by the Swiss State Secretariat for Education, Research and Innovation under Contracts 15.0029 and 15.0044. (Corresponding author: Inkyu Sa.)

I. Sa, M. Popović, R. Khanna, J. Nieto, and R. Siegwart are with the Autonomous Systems Lab, ETH Zurich, Zürich 8092, Switzerland (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Z. Chen is with the Vision for Robotics Lab, Department of Mechanical and Process Engineering, ETH Zurich, Zürich 8092, Switzerland (e-mail: [email protected]).

F. Liebisch is with Crop Science, Department of Environmental Systems Science, ETH Zurich, Zürich 8092, Switzerland (e-mail: [email protected]).

Digital Object Identifier 10.1109/LRA.2017.2774979

To this end, we obtain three types of multispectral image datasets which contain (i) only crop, (ii) only weed, and (iii) both crop and weed. For (i) and (ii), we can easily create ground truth by extracting the Normalized Difference Vegetation Index (NDVI) indicating vegetation cover. Unfortunately, we could not avoid manual labeling for (iii), which took about 60 mins per image (with 30 testing images). Given these training and testing datasets, we train 6 different models with varying input channels and training conditions (i.e., fine-tuning) to see the impact of these variances. The trained model is deployed on an embedded GPU computer that can be mounted on a small-scale MAV.

The contributions of this system paper are:
- Release of pixel-wise labelled (ground-truth) sugar beet/weed datasets collected from a controlled field experiment,¹
- A study on crop/weed classification using dense semantic segmentation with varying multispectral input channels.

Since the dense semantic segmentation framework predicts the probability for each pixel, the outputs can be easily used by high-level path planning algorithms, e.g., monitoring- [3] or exploration-based [4], for informed data collection and complete autonomy on the farm [5].

The remainder of this letter is structured as follows. Section II presents the state-of-the-art on pixel-wise semantic segmentation, vegetation detection using a MAV, and CNNs with multispectral images. Section III describes how the training/testing dataset is obtained and details our model training procedure. We present our experimental results in Section IV, and conclude the letter in Section V.

II. RELATED WORK

This section briefly reviews the state-of-the-art in deep segmentation models, general methods of detecting vegetation, and segmentation based on multispectral images.

A. Pixel-Wise Segmentation Using Deep Neural Network

The aim of the image segmentation task is to infer a human-readable class label for each image pixel, which is an important and challenging task. The most successful approaches in recent years rely on CNNs. Early CNN-based methods perform segmentation in a two-step pipeline, which first generates region proposals and then classifies each proposal to a pre-defined category [6], [7]. Recently, fully Convolutional Neural Networks (FCNNs) have become a popular choice in image segmentation, due to their rich feature representation and end-to-end training [8], [9]. However, these FCNN-based methods usually have a limitation of low-resolution prediction, due to the sequential max-pooling and down-sampling operations. SegNet [10], the module that our weedNet is based on, is a recently proposed pixel-wise segmentation module that carefully addresses this issue. It has an encoder and a corresponding decoder network. The encoder network learns to compress the image into a lower-resolution feature representation, while the decoder network learns to up-sample the encoder feature maps to full input resolution for per-pixel segmentation. The encoder and decoder networks are trained simultaneously and end-to-end.

B. MAV-Based Crop-Weed Detection

For smart farming applications, it is becoming increasingly important to accurately estimate the type and distribution of the vegetation on the field. There are several existing approaches which exploit different types of features and machine learning algorithms to detect vegetation [11]–[13]. Torres-Sánchez et al. [14] and Khanna et al. [15] investigate the use of NDVI and the Excess Green Index (EGI) to automatically detect vegetation from the soil background. Comparatively, Guo et al. [16] exploit spectral features from RGB images and decision trees to separate vegetation. A deeper level of smart farming is an automatic interpretation of the detected vegetation into classes of crop and weed. Pérez-Ortiz et al. [17] utilize multispectral image pixel values as well as crop row geometric information to generate features for classifying image patches into valuable crop, weed, and soil. Similarly, Peña et al. [18] exploit spatial and spectral characteristics to first extract image patches, and then use the geometric information of the detected crop rows to distinguish crops and weeds. In [19], visual features as well as geometric information of the detected vegetation are employed to classify the detected vegetation into crops and weeds using Random Forest algorithms [20]. All the above-mentioned approaches either directly operate on raw pixels or rely on a fixed set of handcrafted features and learning algorithms [21].

However, in the presence of large data, recent developments in deep learning have shown that end-to-end learning approaches [22] outperform traditional hand-crafted feature learning [6], [23]. Inspired by this, Mortensen et al. [24] proposed a CNN-based semantic segmentation approach to classify different types of crops and estimate their individual amount of biomass. Compared to their approach, which operates on RGB images, the approach in this letter extracts information from multispectral images using a different state-of-the-art per-pixel segmentation model.

C. Applications Using Multispectral Images

Multispectral images provide the possibility to create vegetation-specific indices based on radiance ratios which are more robust under varying lighting conditions and are therefore widely explored for autonomous agriculture robotics [25]–[28]. In [29], images captured from a six band multispectral camera were exploited to segment different parts of sweet pepper plants. Similarly, in [18], the authors managed to compute weed maps in maize fields from multispectral images. In [30], multispectral images were used to separate sugar beets from a thistle. In our study, we apply a CNN-based pixel-wise segmentation model directly on the multispectral inputs to accurately segment crops from weeds.

As shown above, there are intensive research interests in the agricultural robotics domain employing state-of-the-art CNNs, multispectral images, and MAVs for rapid field scouting. Particularly, precise and fast weed detection is the key front-end module that can lead subsequent treatments to success. To our best knowledge, this work is the first trial applying real-time (≈2 Hz) CNN-based dense semantic segmentation using multispectral images taken from a MAV to an agricultural robotics application while maintaining applicable performance. Additionally, we release the dataset utilized in this letter to support the community and future work, since there are insufficient publicly available weed datasets [13].

¹ Available at: https://goo.gl/UK2pZq

TABLE I
IMAGE DATASETS FOR TRAINING AND TESTING

(NIR+Red+NDVI)   Crop   Weed   Crop-weed   Num. multispec
Training         132    243    —           375
Testing          —      —      90          90
Altitude (m)     2      2      2           —

Fig. 2. Aerial view of our controlled field with varying herbicide levels. The maximum amount of herbicide is applied to the left crop training rows (yellow), and no herbicide is utilized for the right weed training rows (green). The middle shows mixed variants due to medium herbicide usage (red).

Fig. 3. An input crop image (left), with NDVI extracted using the NIR+Red channels (mid.). An auto-threshold boundary detection method outputs clear edges between vegetation and others (shadows, soil, and gravel) (right).

III. METHODOLOGIES

In this section, we present our approaches to dataset acquisition and pre-processing through image alignment and NDVI extraction before outlining our model training method.

A. Dataset Acquisition

Dense manual annotation of crop/weed species is a challenging task, as shown in Fig. 1. Unlike urban street (e.g., [31] and [32]) or indoor scenes [33] that can easily be understood and inferred intuitively, plant boundaries in our datasets are difficult to distinguish and may require domain-specific knowledge or experience. They also require finer selection tools (e.g., pixel-level selection and zoom-in/out) rather than polygon-based outline annotation with multiple control points. Therefore, it may be difficult to use coarse outsourced services [34].

To address this issue, we designed the 40 m × 40 m weed test field shown in Fig. 2. We applied different levels of herbicide to the field: max, mid, and min, corresponding to left (yellow), mid (red), and right (green), respectively. Therefore, as expected, images from the left-to-right field patches contain crop-only, crop/weed, and weed-only, respectively. We then applied basic automated image processing techniques to extract NDVI from the left and right patch images, as described in the following sub-section. The crop-weed images could only be annotated manually following advice from crop science experts. This process took on average about 60 mins/image. As shown in Table I, we annotated 132, 243, and 90 multispectral images of crops, weeds, and crop-weed mixtures. Each training/testing image consisted of near-infrared (NIR, 790 nm), Red channel (660 nm), and NDVI imagery.

B. Data Pre-Processing

For image acquisition, we use a Sequoia multispectral sensor that features four narrow band global shutter imagers (1.2 MP) and one rolling shutter RGB camera (16 MP). From corresponding NIR and Red images, we extract the NDVI, given by NDVI = (NIR − Red)/(NIR + Red). This vegetation index clearly indicates the difference between soil and plant, as exemplified in Fig. 3 (NDVI raw).
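As an illustration, the per-pixel NDVI computation can be written as a minimal Python sketch, assuming the aligned NIR and Red channels are available as floating-point arrays (variable and function names are ours, not from the released code):

```python
import numpy as np

def compute_ndvi(nir, red, eps=1e-6):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red).

    nir, red: 2-D float arrays of the aligned NIR and Red channels.
    eps guards against division by zero on dark pixels.
    """
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)
```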
1) Image Alignment: To calculate indices, we performed basic image processing for the NIR and Red images through image undistortion, estimation of a geometric transformation ∈ SE3 using image correlation, and cropping. Note that the processing time for these procedures is negligible, since these transformations need to be computed only once for cameras attached rigidly with respect to each other. It is also worth mentioning that we could not align the other image channels, e.g., Green and Red Edge, in the same way due to the lack of similarities. Furthermore, it is difficult to match them correctly without estimating the depth of each pixel accurately. Therefore, our method assumes that the camera baseline is much smaller than the distance between the camera and the ground (∼two orders of magnitude). We only account for camera intrinsics, and do not apply radiometric and atmospheric corrections.
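The correlation-based alignment can be sketched with OpenCV's ECC (enhanced correlation coefficient) routine. This is a hedged illustration of the idea rather than the exact implementation used in the letter; it estimates a planar Euclidean warp between the undistorted NIR and Red images:

```python
import cv2
import numpy as np

def align_red_to_nir(nir, red):
    """Estimate a Euclidean warp Red -> NIR by image correlation (ECC) and apply it.

    nir, red: single-channel images after undistortion.
    Returns the Red image warped into the NIR frame.
    """
    nir_f = nir.astype(np.float32)
    red_f = red.astype(np.float32)
    warp = np.eye(2, 3, dtype=np.float32)  # initial identity transform
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    # Note: some OpenCV 4.x builds additionally require inputMask and gaussFiltSize args.
    _, warp = cv2.findTransformECC(nir_f, red_f, warp, cv2.MOTION_EUCLIDEAN, criteria)
    h, w = nir.shape[:2]
    return cv2.warpAffine(red, warp, (w, h),
                          flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
```

Because the warp is computed once for the rigidly mounted imagers, it can simply be cached and re-applied to every frame.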
2) NDVI Extraction: We then applied a Gaussian blur to the aligned images (threshold = 1.2), followed by a sharpening procedure to remove fine responses (e.g., shadows, small debris). An intensity histogram clustering algorithm, Otsu's method [35], is used for threshold selection on the resultant image, and blob detection is finally executed with a minimum of 300 connected pixels. Fig. 3 shows the detected boundary of a crop image, and each class is labeled as {Bg, Crop, Weed} = {0, 1, 2}. Bg indicates background (mostly soil, but not necessarily); Crop and Weed are the sugar beet plant and weed classes, respectively.
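A minimal OpenCV sketch of this automatic ground-truth step (Gaussian smoothing, Otsu thresholding of the NDVI image, and rejection of blobs smaller than 300 pixels) is given below; parameter values such as the blur sigma are illustrative assumptions, not the exact settings used here:

```python
import cv2
import numpy as np

def vegetation_mask(ndvi, min_blob_px=300, blur_sigma=1.2):
    """Binary vegetation mask from an NDVI image via Otsu thresholding.

    ndvi: 2-D float array in roughly [-1, 1].
    Returns a uint8 mask where 1 marks vegetation blobs >= min_blob_px pixels.
    """
    # Scale NDVI to 8-bit for Otsu, then smooth to suppress fine responses.
    ndvi_u8 = cv2.normalize(ndvi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    blurred = cv2.GaussianBlur(ndvi_u8, (0, 0), blur_sigma)
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Keep only connected components with enough pixels (blob detection).
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    mask = np.zeros_like(binary, dtype=np.uint8)
    for lbl in range(1, n_labels):  # label 0 is the image background
        if stats[lbl, cv2.CC_STAT_AREA] >= min_blob_px:
            mask[labels == lbl] = 1
    return mask
```

For crop-only plots the resulting blobs are labeled Crop, for weed-only plots Weed, and the remaining pixels Bg.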

Fig. 4. An encoder-decoder cascaded dense semantic segmentation framework [10]. It has 26 convolution layers followed by ReLU activation, with 5 max-pooling layers for the encoder (first half) and 5 up-sampling layers for the decoder (second half) linked with pooling indices. The first concatenation layer allows for any number of input channels. The output depicts the probability of 3 classes.

C. Dense Semantic Segmentation Framework

The annotated images are fed into SegNet, a state-of-the-art dense segmentation framework shown in Fig. 4. We retain the original network architecture (i.e., VGG16 without fully-connected layers, with an additional upsampling layer as a counterpart for each max-pooling layer) [10] and only present our modifications.

Firstly, the frequency of appearance (FoA) of each class is adapted based on our training dataset for better class balancing [36]. This is used to weigh each class inside the neural network loss function and requires careful tuning. For example, as the weed class appears less frequently than bg and crop, its FoA is lower in comparison. If a false-positive or false-negative is detected in weed classification (i.e., a pixel is incorrectly classified as weed), then the classifier is penalized more than for the other classes. A class weight can be written as:

w_c = \frac{\overline{FoA}(c)}{FoA(c)}    (1)

FoA(c) = \frac{I_C^{Total}}{\sum_{j} I_{Cj}}    (2)

where \overline{FoA}(c) is the median of FoA(c), I_C^{Total} is the total number of pixels in class c, and I_{Cj} is the number of pixels in the jth image in which class c appears, with j ∈ {1, 2, 3, ..., N} as the image sequence number.
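A compact NumPy sketch of this median-frequency class balancing follows; it is our own illustration of (1)–(2) in the spirit of [36], assuming `label_images` is a list of integer label maps with {0, 1, 2} = {Bg, Crop, Weed}:

```python
import numpy as np

def class_weights(label_images, num_classes=3):
    """Median-frequency balancing weights w_c = median(FoA) / FoA(c)."""
    class_pixels = np.zeros(num_classes)  # I_C^Total: pixels of class c over all images
    image_pixels = np.zeros(num_classes)  # sum of sizes of images in which class c appears
    for lab in label_images:
        counts = np.bincount(lab.ravel(), minlength=num_classes)
        for c in range(num_classes):
            if counts[c] > 0:
                class_pixels[c] += counts[c]
                image_pixels[c] += lab.size
    foa = class_pixels / image_pixels  # frequency of appearance per class
    return np.median(foa) / foa        # rarer classes (e.g., weed) get larger weights
```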
Secondly, we implemented a simple input/output layer that reads images and outputs them to the subsequent concatenation layer. This allows us to feed any number of input images to the network, which is useful for hyperspectral image processing [37].
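For illustration, the effect of this input layer is simply to stack an arbitrary set of co-registered channels into one multi-channel array before the first convolution. A hedged NumPy sketch (channel names and ordering are our assumptions):

```python
import numpy as np

def stack_channels(channels):
    """Stack co-registered single-channel images (e.g., [nir, red, ndvi])
    into one H x W x C input array for the network.

    Any number of channels can be passed, which is what enables the
    1-, 2-, and 3-channel model variants evaluated in Section IV.
    """
    return np.dstack([c.astype(np.float32) for c in channels])

# Example: x3 = stack_channels([nir, red, ndvi]); x2 = stack_channels([nir, red])
```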
IV. EXPERIMENTAL RESULTS

In this section, we present our experimental setup, followed by a qualitative and quantitative evaluation of our proposed approach. We also demonstrate a preliminary performance evaluation of our model deployed on an embedded computer.

A. Experimental Setup

As shown in Fig. 2, we cultivated a 40 m × 40 m test sugar beet field with varying herbicide levels applied for automated ground-truth acquisition following the procedures in Section III-B2.

Fig. 5. Data collection setup. A DJI Mavic flies over our experimental field with a 4-band multispectral camera. In this letter, we consider only NIR and Red channels due to difficulties in image registration of the other bands. The graph illustrates the reflectance of each band for healthy and sick plants, and soil (image courtesy of Micasense).

A downward-facing Sequoia multispectral camera is mounted on a commercial MAV, a DJI Mavic, recording datasets at 1 Hz (Fig. 5). The MAV is manually controlled in position-hold mode assisted by GPS and an internal stereo-vision system at 2 m height. It can fly around 15–17 mins with an additional payload of 274.5 g, including an extra battery pack, the Sequoia camera, a radiation sensor, a DC-converter, and 3D-printed mounting gear. Once a dataset is acquired, the information is transferred manually to a ground station.

For model training, we use NVIDIA's Titan X GPU module on a desktop computer, and a Tegra TX2 embedded GPU module with an Orbitty carrier board for model inference.

We use MATLAB to convert the collected datasets to the SegNet data format and annotate the images. A modified version of Caffe [38] with cuDNN processes input data using CUDA, C++, and Python 2.7. For model training, we set the following parameters: learning rate = 0.001, maximum iterations = 40,000 (640 epochs), batch size = 6, weight decay rate = 0.005, and the Stochastic Gradient Descent (SGD) solver [39] is used for the optimization. The average model training time given the maximum number of iterations is 12 hrs. Fig. 6 shows the loss and average class accuracy over 40,000 iterations. This figure suggests that 10,000–20,000 maximum iterations are sufficient since there is a very subtle performance improvement beyond this.
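A hedged sketch of how these hyperparameters map onto a Caffe SGD solver definition is shown below. The file names, snapshot settings, and learning-rate policy are illustrative assumptions (they are not stated in the letter), and the batch size of 6 lives in the data layer of the network prototxt rather than in the solver:

```python
# Assumed file name; train_segnet.prototxt must define the data layer (batch_size: 6).
solver_spec = """
net: "train_segnet.prototxt"
base_lr: 0.001        # learning rate used in the letter
lr_policy: "fixed"    # assumption; the letter does not specify the schedule
max_iter: 40000       # ~640 epochs on our training set
weight_decay: 0.005
snapshot: 10000
snapshot_prefix: "snapshots/weednet"
solver_mode: GPU
"""

with open("solver.prototxt", "w") as f:
    f.write(solver_spec)

import caffe
caffe.set_mode_gpu()
solver = caffe.SGDSolver("solver.prototxt")  # SGD optimization [39]
solver.solve()                               # runs the 40,000 iterations
```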
B. Quantitative Results

For quantitative evaluation, we use a harmonic F1 score measure capturing both precision and recall performance as:

F1(c) = 2 \cdot \frac{precision_c \cdot recall_c}{precision_c + recall_c}

precision_c = \frac{TP_c}{TP_c + FP_c},  recall_c = \frac{TP_c}{TP_c + FN_c}    (3)

where precision_c and TP_c indicate the precision and number of true positives, respectively, for class c. The same subscript convention is applied to the other notation. Note that the output of SegNet is the probability of each pixel belonging to each defined class. In order to compute the four fundamental numbers (i.e., TP, TN, FP, and FN), the probability is converted into a binary value. We simply assign the class label of maximum probability and compute precision, recall, accuracy, and F1-score. All models presented in this section are trained and tested with the datasets in Table I.
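The conversion from per-pixel class probabilities to the metrics in (3) can be sketched as follows, a minimal NumPy illustration under the assumption that `prob` is an H x W x 3 softmax output and `gt` an H x W ground-truth label map:

```python
import numpy as np

def per_class_f1(prob, gt, num_classes=3):
    """Argmax the per-pixel probabilities, then compute precision/recall/F1 per class."""
    pred = np.argmax(prob, axis=-1)  # maximum-probability class label per pixel
    scores = {}
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        precision = tp / float(tp + fp + 1e-12)
        recall = tp / float(tp + fn + 1e-12)
        scores[c] = 2 * precision * recall / (precision + recall + 1e-12)
    return scores  # e.g., {0: F1_bg, 1: F1_crop, 2: F1_weed}
```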

Fig. 6. Loss and average class accuracy of three input channels with fine-tuning over various iterations. The maximum number of iterations is set to 40,000, which takes 12 hrs.

Fig. 7. F1-scores of the 6 models per class (horizontal axis). Higher values indicate better detection performance. Note that the right bars show the weed classification F1-score.

Fig. 8. AUC for our 3 channel model corresponding to the dark blue bars (left-most) in Fig. 7. Note that the numbers do not match since these measures capture slightly different properties.

Fig. 7 shows the F1-scores of 6 different models trained with varying input data and conditions. There are three classes: Bg, Crop, and Weed. All models perform reasonably well (above 80% for all classes) considering the difficulty of the dataset. In our experiments, we vary two conditions: using a pre-trained model (i.e., VGG16) for network initialization (fine-tuning) and varying the number of input channels.

As shown in Fig. 7, the dark blue and orange bars (with and without fine-tuning given 3 channel input images, respectively) show that fine-tuning does not impact the output significantly in comparison with more general object detection tasks (e.g., urban scenes or daily life objects). This is mainly because our training datasets (sugar beet and weed images) have different appearance and spectra (660–790 nm) from the pre-trained model based on the RGB ImageNet dataset [40]. There may be very few scenes that are similar to our training data. Moreover, the size of our dataset is relatively small compared with the pre-trained ImageNet (1.2 million images).

The yellow, gray, light blue, and green bars in Fig. 7 present performance measures with varying numbers of input channels. We expected more input data to yield better results since the network captures more useful features that help to distinguish between classes. This can be seen by comparing the results of the 2 channel model (NIR+Red, yellow bar in Fig. 7) and the 1 channel models of NIR and Red (gray and light blue, respectively): the 2 channel model outperforms both for crop and weed classification. However, interestingly, more input data does not always guarantee a performance improvement. For instance, the NIR+Red model surpasses the 3 channel model for weed classification performance. We are still investigating this and suspect that it could be due to: i) the NDVI image being produced from the NIR and Red images, meaning that NDVI depends on those two images rather than capturing new information; ii) inaccurate image alignment of the NIR and Red channels at the edges of the image, where larger distortions exist than in the optical center. Inference performance with varying input data is discussed in Section IV-E.

We also use area-under-the-curve (AUC) measures for quantitative evaluation. For example, Fig. 8 shows the AUC of the 3 channel model. It can be seen that there is small performance variation. As these measures capture different classifier properties, they cannot be directly compared.
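For reference, a per-class AUC in this one-vs-rest, per-pixel setting can be computed directly from the probability maps; a brief scikit-learn sketch (our illustration, not the exact evaluation script used for Fig. 8):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_auc(prob, gt, num_classes=3):
    """One-vs-rest AUC per class from per-pixel probabilities.

    prob: H x W x C softmax output; gt: H x W integer label map.
    """
    flat_prob = prob.reshape(-1, num_classes)
    flat_gt = gt.ravel()
    return {c: roc_auc_score((flat_gt == c).astype(int), flat_prob[:, c])
            for c in range(num_classes)}
```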
C. Qualitative Results

We perform a qualitative evaluation with our best performing model. For visual inspection, we present 7 instances of all input data, ground-truth, and network probability output in Fig. 9. Each row displays an image frame, with the first three columns showing the input of our 3 channel model. NDVI is displayed on a heat map scale (colors closer to red depict higher response). The fourth column is the annotated ground-truth, and the fifth is our probability output. Note that each class probability is color-coded as the intensity of the corresponding color, such that background, crop, and weed are represented by blue, red, and green, respectively. It can be seen that some boundary areas have mixed color (e.g., a pixel value of [0.4, 0.6, 0.7] indicates a probability of 0.4 background, 0.6 crop, and 0.7 weed). There are noticeable misclassification areas of crop detection in the last row and evident weed misclassification in the second and fifth. This mostly occurs when crops or weeds are surrounded by each other, implying that the network captures not only low-level features, such as edges or intensities, but also object shapes and textures.

Fig. 9. The qualitative results of 7 frames (row-wise). The first three columns are input data to the CNN, with the fourth and fifth showing ground-truths and probability predictions. Because we map the probability of each class to the intensity of R, G, B for visualization, some boundary areas have mixed color.

D. Discussion, Limitations, and Outlook

We demonstrated a crop/weed detection pipeline that uses training/testing datasets obtained from consistent environmental conditions. Although it shows applicable qualitative and quantitative results, it is important to validate how well it handles scale variance, and its spatio-temporal consistency.

We study these aspects using an independent image in Fig. 10. The image was captured from a different sugar beet field a month prior to the dataset we used (same altitude and sensor). It clearly shows that most of the crops are classified as weed (green), meaning that our classifier requires more temporal training data.

Fig. 10. Test image from a different growth stage of sugar beet plants (left) and its output (right). The same color code is applied for each class as in Fig. 9, i.e., blue = bg, red = crop, and green = weed. This shows that our model reports a high false-positive rate because our training dataset has bigger crops and a different type of weed. Most likely, this is caused by a limited crop/weed temporal training dataset.

Fig. 11. (a) An embedded GPU module (190 g and 7.5 W) on which we deploy our trained model. (b) Network forward-pass processing time (y-axis) for different models (x-axis) using Titan X (red) and Jetson TX2 (blue).

As shown in Fig. 9, we trained a model with larger crop and weed images than in the new image. Additionally, there may be a different type of weed in the field that the model has not been trained for.

This exemplifies an open issue concerning supervised learning approaches. To address this, we require more training data covering multi-scale, wide weed varieties over longer time periods to develop a weed detector with spatio-temporal consistency, or smarter data augmentation strategies. Even though manually annotating each image would be a labor intensive task, we are planning to incrementally construct a large dataset in the near future.

E. Inference on an Embedded Platform

Recent developments in deep CNNs have played a significant role in computer vision and machine learning (especially object detection and semantic feature learning). However, there is still an issue of running trained models at relatively fast speed (2∼5 Hz) on a physically constrained device that can be deployed on a mobile robot. To address this, researchers often utilize a ground station that has decent GPU computing power with a WiFi connection [41]. However, this may cause a large time delay, and it may be difficult to ensure wireless communication between a robot and the ground station if coverage is somewhat limited.

Using an onboard GPU computer can resolve this issue. Recently an embedded GPU module, the Jetson TX2, has been released that performs reasonably well, as shown in Fig. 11(a). It has 2 GHz hexa-core CPUs, 256 GPU cores at 1.3 GHz, and consumes 7.5 W while idle and 14 W at maximum utilization. Fig. 11(b) shows a processing time comparison between the Titan X (red) and TX2 (blue). We process 300 images using the 4 models denoted on the x-axis. The Titan X performs 3.6 times faster than the TX2, but considering their power consumption ratio of 17.8 (250 W for Titan X maximum utilization), the TX2 performs significantly well. Another interesting observation in Fig. 11(b) is that the number of input channels does not affect the network forward-pass processing time much. This is because the first multi-convolution layer of all these models has the same filter size of 64. While the varying number of inputs affects the contents of these low-level features (e.g., capturing different properties of a plant), their size remains identical.

Note that offline inference (i.e., loading images from stored files) is performed on the TX2 since the multispectral camera only allows for saving images to a storage device. We are planning to integrate another hyperspectral camera (e.g., Xemia snapshot camera) for online and real-time object detection that can be interfaced with control [42] or informative path planning [3] modules.
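The forward-pass timing in Fig. 11(b) amounts to repeatedly running the trained network on stored images and averaging the wall-clock time. A hedged pycaffe sketch is given below; the file names and the input blob name "data" are assumptions for illustration only:

```python
import time
import caffe

caffe.set_mode_gpu()
# Assumed deployment files for one of the trained models.
net = caffe.Net("deploy_segnet.prototxt", "weednet.caffemodel", caffe.TEST)

def mean_forward_time(images, n_warmup=5):
    """Average forward-pass time over a list of preprocessed C x H x W inputs."""
    for x in images[:n_warmup]:               # warm-up runs are not timed
        net.blobs["data"].data[0, ...] = x
        net.forward()
    start = time.time()
    for x in images:
        net.blobs["data"].data[0, ...] = x    # copy image into the input blob
        net.forward()                         # dense per-pixel class probabilities
    return (time.time() - start) / len(images)

# e.g., ~0.55 s/frame on the TX2 would correspond to the ~1.8 Hz reported in Section V.
```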
V. CONCLUSIONS

We demonstrated CNN-based dense semantic classification for weed detection with aerial multispectral images taken from an MAV. The encoder-decoder cascaded deep neural network is trained on a dataset obtained from a herbicide-controlled sugar beet field to address labor intensive labeling tasks. The data obtained from this field is categorized into images containing only crops or weeds, or a crop-weed mixture. For the homogeneous imagery data, vegetation is automatically distinguished by extracting NDVI from multispectral images and applying classic image processing for model training. For the mixed imagery data, we performed manual annotation taking ∼30 hours.

We trained 6 different models on varying numbers of input channels and training conditions, and evaluated them quantitatively using F1-scores and AUC as metrics. A qualitative assessment was then performed by a visual comparison of ground-truth with probability prediction outputs. Given the test dataset (mixed), the proposed approach reports an acceptable performance of ∼0.8 F1-score for weed detection. However, we found spatio-temporal inconsistencies in our model due to limitations in the dataset it was trained on.

We then deploy the model on an embedded system that can be carried by a small MAV, and compare its performance to a high-performance desktop GPU in terms of inference speed and accuracy. Our experimental results estimate that the proposed deep neural network system can run the high-level perception task at 1.8 Hz on the embedded platform, which can be deployed on a MAV.

Finally, the multispectral weed and crop images with the corresponding ground truth used in this letter are released for the robotics community and to support future work in the agricultural robotics domain.

ACKNOWLEDGMENT

The author would like to acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research and H. Zellweger for the management of the experimental field.

REFERENCES

[1] D. Slaughter, D. Giles, and D. Downey, "Autonomous robotic weed control systems: A review," Comput. Electron. Agric., vol. 61, no. 1, pp. 63–78, 2008.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[3] M. Popović, T. Vidal-Calleja, G. Hitz, I. Sa, R. Y. Siegwart, and J. Nieto, "Multiresolution mapping and informative path planning for UAV-based terrain monitoring," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Vancouver, BC, Canada, 2017.
[4] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart, "Receding horizon 'next-best-view' planner for 3D exploration," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 1462–1468.
[5] I. Sa et al., "Build your own visual-inertial odometry aided cost-effective and open-source autonomous drone," arXiv:1708.06652, 2017.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[7] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. Eur. Conf. Computer Vision, Springer, 2014, pp. 297–312.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," arXiv:1606.00915, 2016.
[9] J. Dai, K. He, and J. Sun, "BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1635–1643.
[10] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[11] F. Liebisch et al., "Automatic UAV-based field inspection campaigns for weeding in row crops," in Proc. EARSeL SIG Imag. Spectrosc. Workshop, Zurich, Switzerland, 2017.
[12] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool, "DeepFruits: A fruit detection system using deep neural networks," Sensors, vol. 16, no. 8, 2016, Art. no. 1222.
[13] S. Haug and J. Ostermann, "A crop/weed field image dataset for the evaluation of computer vision based precision agriculture tasks," in Computer Vision—ECCV 2014 Workshops. Zürich, Switzerland: Springer, 2014, pp. 105–116.
[14] J. Torres-Sánchez, F. López-Granados, and J. M. Peña, "An automatic object-based method for optimal thresholding in UAV images: Application for vegetation detection in herbaceous crops," Comput. Electron. Agric., vol. 114, pp. 43–52, 2015.
[15] R. Khanna, M. Möller, J. Pfeifer, F. Liebisch, A. Walter, and R. Siegwart, "Beyond point clouds-3D mapping and field parameter measurements using UAVs," in Proc. IEEE 20th Conf. Emerg. Technol. Factory Autom., 2015, pp. 1–4.
[16] W. Guo, U. K. Rage, and S. Ninomiya, "Illumination invariant segmentation of vegetation for time series wheat images based on decision tree model," Comput. Electron. Agric., vol. 96, pp. 58–66, 2013.
[17] M. Pérez-Ortiz, J. Pena, P. A. Gutiérrez, J. Torres-Sánchez, C. Hervás-Martínez, and F. López-Granados, "A semi-supervised system for weed mapping in sunflower crops using unmanned aerial vehicles and a crop row detection method," Appl. Soft Comput., vol. 37, pp. 533–544, 2015.
[18] J. M. Peña, J. Torres-Sánchez, A. I. de Castro, M. Kelly, and F. López-Granados, "Weed mapping in early-season maize fields using object-based analysis of unmanned aerial vehicle (UAV) images," PLoS One, vol. 8, no. 10, 2013, Art. no. e77151.
[19] P. Lottes, R. Khanna, J. Pfeifer, R. Siegwart, and C. Stachniss, "UAV-based crop and weed classification for smart farming," in Proc. IEEE Int. Conf. Robot. Autom., May 2017, pp. 3024–3031.
[20] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[21] I. Sa et al., "Peduncle detection of sweet pepper for autonomous crop harvesting—Combined color and 3-D information," IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 765–772, Apr. 2017.
[22] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2096–2103, Oct. 2017.
[23] M. Di Cicco, C. Potena, G. Grisetti, and A. Pretto, "Automatic model based dataset generation for fast and accurate crop and weeds detection," arXiv:1612.03019, 2016.
[24] A. K. Mortensen, M. Dyrmann, H. Karstoft, R. N. Jørgensen, and R. Gislum, "Semantic segmentation of mixed crops using deep convolutional neural network," in Proc. Int. Conf. Agric. Eng., 2016.
[25] C. McCool, I. Sa, F. Dayoub, C. Lehnert, T. Perez, and B. Upcroft, "Visual detection of occluded crop: For automated harvesting," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 2506–2512.
[26] A. Lucieer, Z. Malenovský, T. Veness, and L. Wallace, "HyperUAS—Imaging spectroscopy from a multirotor unmanned aircraft system," J. Field Robot., vol. 31, no. 4, pp. 571–590, 2014.
[27] R. Khanna, I. Sa, J. Nieto, and R. Siegwart, "On field radiometric calibration for multispectral cameras," in Proc. IEEE Int. Conf. Robot. Autom., 2017, pp. 6503–6509.
[28] C. Hung, J. Nieto, Z. Taylor, J. Underwood, and S. Sukkarieh, "Orchard fruit segmentation using multi-spectral feature learning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2013, pp. 5314–5320.
[29] C. Bac, J. Hemming, and E. Van Henten, "Robust pixel-based classification of obstacles for robotic harvesting of sweet-pepper," Comput. Electron. Agric., vol. 96, pp. 148–162, 2013.
[30] F. J. Garcia-Ruiz, D. Wulfsohn, and J. Rasmussen, "Sugar beet (Beta vulgaris L.) and thistle (Cirsium arvensis L.) discrimination based on field spectral data," Biosyst. Eng., vol. 139, pp. 1–15, 2015.
[31] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, pp. 1231–1237, 2013.
[32] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[33] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 567–576.
[34] A. Sorokin and D. Forsyth, "Utility data annotation with Amazon Mechanical Turk," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2008, pp. 1–8.
[35] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[36] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2650–2658.
[37] R. Khanna, I. Sa, J. Nieto, and R. Siegwart, "On field radiometric calibration for multispectral cameras," in Proc. IEEE Int. Conf. Robot. Autom., May 2017, pp. 6503–6509.
[38] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
[39] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. COMPSTAT'2010, Springer, 2010, pp. 177–186.
[40] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[41] A. Giusti et al., "A machine learning approach to visual perception of forest trails for mobile robots," IEEE Robot. Autom. Lett., vol. 1, no. 2, pp. 661–667, Jul. 2016.
[42] M. Kamel, T. Stastny, K. Alexis, and R. Siegwart, "Model predictive control for trajectory tracking of unmanned aerial vehicles using robot operating system," in Robot Operating System (ROS): The Complete Reference, vol. 2, A. Koubaa, Ed. New York, NY, USA: Springer, 2017.
