Song 2014

This document summarizes a research paper that presents a new two-step method for automatically recognizing and counting fruits from images taken in cluttered greenhouses. The first step uses a bag-of-words model to locate fruits in a single image, and the second step statistically aggregates estimates from multiple images to improve detection rates. The method is demonstrated on a dataset of over 28,000 images of pepper plants, achieving a 74.2% correlation with manual counts without adjustment.


Biosystems Engineering 118 (2014) 203–215

Available online at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/locate/issn/15375110

Research Paper

Automatic fruit recognition and counting from multiple images

Y. Song a, C.A. Glasbey a,*, G.W. Horgan b, G. Polder c, J.A. Dieleman d, G.W.A.M. van der Heijden c

a Biomathematics and Statistics Scotland, Edinburgh EH9 3JZ, UK
b Biomathematics and Statistics Scotland, Aberdeen AB21 9SB, UK
c Biometris, Wageningen UR, P.O. Box 100, 6700 AC Wageningen, Netherlands
d Wageningen UR Greenhouse Horticulture, P.O. Box 644, 6700 AP Wageningen, Netherlands

article info

Article history: received 20 March 2013; received in revised form 28 November 2013; accepted 13 December 2013; published online 20 January 2014.

In our post-genomic world, where we are deluged with genetic information, the bottleneck to scientific progress is often phenotyping, i.e. measuring the observable characteristics of living organisms, such as counting the number of fruits on a plant. Image analysis is one route to automation. In this paper we present a method for recognising and counting fruits from images in cluttered greenhouses. The plants are 3-m high peppers with fruits of complex shapes and varying colours similar to the plant canopy. Our calibration and validation datasets each consist of over 28,000 colour images of over 1000 experimental plants. We describe a new two-step method to locate and count pepper fruits: the first step is to find fruits in a single image using a bag-of-words model, and the second is to aggregate estimates from multiple images using a novel statistical approach to cluster repeated, incomplete observations. We demonstrate that image analysis can potentially yield a good correlation with manual measurement (94.6%) and our proposed method achieves a correlation of 74.2% without any linear adjustment for a large dataset.

© 2013 IAgrE. Published by Elsevier Ltd. All rights reserved.

1. Introduction

There are an increasing number of robotics applications aimed at detecting fruits from images or videos (De-An, Jidong, Wei, Ying, & Yu, 2011; Ji et al., 2012; Linker, Cohen, & Naor, 2012; Tanigaki, Fujiura, Akase, & Imagawa, 2008). Although various research efforts have been made in this field, challenges still remain for complex scenes with varying lighting conditions, low contrast between fruits and leaves, foreground occlusions and cluttered backgrounds. Most of these applications have been to find the fruits for automatic harvesting. A recent new direction is to find the fruits for plant breeding purposes (Alimi et al., 2013): to automatically recognise, count and measure the fruits in order to assess the differences in quality of the genetic material. When the measurements are made by a computer, this is often referred to as digital phenotyping and the field is growing in importance, e.g. Furbank and Tester (2011).

The aim in our application is to locate and count green and red pepper fruits on large, dense pepper plants growing in a greenhouse. Alimi et al. (2013) described the use of manual fruit measurements (manual phenotyping) for predicting yield in pepper plants. Our work is to automatically detect and

* Corresponding author. Tel.: +44 1316504899.
E-mail address: [email protected] (C.A. Glasbey).
1537-5110/$ – see front matter © 2013 IAgrE. Published by Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.biosystemseng.2013.12.008

Fig. 1. Examples in the training data. The top three rows are fruit examples, and the bottom three are background. The background templates are much larger than the fruit templates, and their sizes have been adjusted for display purposes.

count any fruit in images of dense pepper plants, to reduce manual measurement and labour requirements, and to increase objectivity. In a recent paper, van der Heijden et al. (2012) showed that several manual measurements could be replaced by image analysis leading to the same QTL (positions on a genetic map, which show a relation with the trait under study). Besides, they showed that image analysis could aid in the identification of additional physiological traits that are hard or impossible to measure by human operators.

Machine vision applications developed for fruit have been reviewed by Brosnan and Sun (2004) and Lee et al. (2010). Compared with previous fruit applications, e.g. finding red apples in green canopies (Bulanon, Kataoka, Ota, & Hiroma, 2002), we are looking for predominantly green fruits. Stajnko, Lakota, and Hocevar (2004) described the use of thermal imaging for measuring apple fruits. In their work, they used morphological operations and a constant shape constraint to separate the round apple fruits from leaves. This is not possible for our images, since the differences in colour and shape between fruits and other plant parts are small.

Jimenez, Jain, Ceres, and Pons (1999) provided a review of different vision systems to recognise fruits for automated harvesting using a laser range-finder. Zhao, Tow, and Katupitiya (2005) presented methods to recognise apples grown on trees, which used texture and redness colour. It was shown that redness works equally well for green apples as for red ones. Yang, Dickinson, Wu, and Lang (2007) proposed methods to recognise mature fruit and locate cluster positions for tomato harvest applications. Kitamura and Oka (2005) described a picking robot to recognise and cut sweet peppers in greenhouses, but their image analysis methods were developed only for this specific application under fixed lighting conditions.

In this paper we describe a new method to locate and count green peppers in a cluttered complex image, using a two-step approach. In a first step, the fruits are located in a single image and in a second step multiple views are combined to increase the detection rate of the fruits. The approach to find the pepper fruits in a single image is based on a combination of (1) finding points of interest, (2) applying a complex high-dimensional feature descriptor of a patch around the point of interest and (3) using a so-called bag-of-words (Nilsback & Zisserman, 2006; Sivic & Zisserman, 2008) for classifying the patch. For complex images, every object detector will yield both false positives and missing detections. If the application is video-based, one could apply a number of tracking-by-detection approaches (Breitenstein, Reichlin, Leibe, Koller-Meier, & Van Gool, 2009). These methods continuously perform a detection algorithm in individual frames and then associate detections across frames. In our case, we are not using a video-based approach, but since images are recorded every 5 cm, we can use multiple views of the same fruit. We show a new statistical approach to combine information from multiple views to improve the detection rate of the fruits.

Fig. 2. Overview of our fruit recognition method in Section 3.

2. The datasets

The plant material used in this paper consists of pepper plants of 148 recombinant inbred lines resulting from a cross between a large-fruited bell pepper ('Yolo Wonder') and a small-fruited chilli pepper ('Criollo de Morelos 334'). Including parents and F1, there were 151 genotypes, and they were randomised over four compartments in an incomplete block design. There were 264 experimental plots grown in a standard double-row arrangement, and each plot consisted of eight plants. The four plants in the centre of a plot were experimental plants, and the other four were border plants, making 1056 experimental plants in total. Two trials were conducted in 2009. The first trial was in spring (from December 2008 to May 2009) and the second in autumn (from June to September 2009). Further information regarding the trials can be found in Alimi et al. (2013). In this paper, the first trial was used for training, and the second for validation.

Our aim is to develop a high-throughput image analysis tool and record images of plants in their growing conditions without transporting the plants to a controlled environment. We used an imaging robot known as SPYSEE to capture images of pepper plants. A trolley was equipped with cameras, computer, illumination and wheel encoder to record images of pepper plants in the greenhouse, while moving between the rows of pepper plants. The trolley was pushed along heating pipes, 60 cm apart, by a human operator. As pepper plants can grow to more than 3 m in height, and the distance between plants and camera is relatively small due to the narrow space between two adjacent rows, we used a vertical stack of four cameras. Every 5 cm, the wheel encoder on the trolley automatically triggered the flashes and cameras. The distance from the cameras to the plants was short, requiring a large field of view lens. Colour cameras with a high resolution, type CSFS20CC2 Teli FireWire, were used with a resolution of 1024 × 1280 pixels and a Lensagon lens, type CMFA0420ND, with a 75° field of view. The region of interest was set to 480 × 1280 (width × height) pixels. For illumination, a Xenon flashlight (VIGI-Lux™ MVS 5002 Strobe, PerkinElmer) with a light pulse duration of 30 μs was used, which allowed a short shutter time of 70 μs for the cameras, resulting in sharp images, hardly influenced by ambient light. For further details, see Polder, van der Heijden, Glasbey, Song, and Dieleman (2009).

The total number of colour images in each trial exceeded 28,000. Since these plants belonged to different genotypes, their fruits varied greatly in size and shape (Alimi et al., 2013). Some examples can be seen in Fig. 1.

For every plant in the experiment, fruits were physically counted and harvested shortly after image collection. We randomly selected a row with 10 experimental plots from the validation trial. There were 408 images in total, and we manually labelled all fruits visible in each image. This set was created as the ground truth in order to evaluate the performance of our methods.

For algorithm training, we randomly extracted 110 fruit templates that have one fruit (see Fig. 1) and 80 background templates that do not have any fruit. The size of the fruit templates varies from 18 × 60 to 72 × 119. The 110 fruit templates were manually classified into red and green fruits. There were 104 green fruits and only 6 red fruits. As seen in Fig. 1, the fruit templates are cluttered with some background pixels in addition to the fruit. Background templates were also collected in the same greenhouse environment, and contained plant parts (e.g. leaf and branch, but no fruits) as well as background objects (e.g. growth pots).

3. Fruit recognition

Our approach is to allocate a support window (e.g. a rectangular patch) for every pixel in an image using a sliding window approach (Dalal & Triggs, 2005; Ferrari, Fevrier, Jurie, & Schmid, 2008), and then verify whether there are sufficient features within the sliding window to classify it as a fruit object. Using a sliding window for every pixel in a 480 × 1280 image would require more than 600,000 windows per image,

which would not be practical for analysing such a large dataset. We therefore applied a simple, fast method to identify approximately 10,000 points of interest (POI) per image, discarding points that clearly were not fruits, thus significantly reducing the number of required operations. An overview of our methods can be found in Fig. 2. Similar approaches have been taken by Leibe, Leonardis, and Schiele (2004), Leibe, Seemann, and Schiele (2005) and others.

3.1. Initial points of interest

The steps we have taken to identify POIs are somewhat arbitrary. However, we argue that this is not important. There are typically 10 fruits in an image, so a cautious 60-fold pruning from the original 600,000 points down to about 10,000 by discarding non-interesting points is small compared to the second stage 1000-fold selection process. Alternative approaches that could have been considered include the thresholding used by Reis et al. (2012) to distinguish red and white grapes from leaves, and the linear colour model approach of Teixido et al. (2012).

Colour transformation. A classifier is trained on colour information to identify the initial points of interest. Many pepper fruits are green, and we transform the RGB colour intensity in order to distinguish between the fruit and other green plant parts. For each colour pixel (R,G,B), the first transformation G − B quantifies the intensity difference between green and blue. The second transformation G − R quantifies the intensity difference between green and red. The final transformation G/(R + G + B) quantifies the proportion of green. This simple, straightforward colour transformation was chosen because it is less sensitive to changing illumination conditions than the original R, G and B values, but no attempt was made to optimise the transformation (Gevers & Smeulders, 1999).

Colour pixels in a training template are first transformed into an N × 3 vector with columns G − B, G − R and G/(R + G + B). For example, for a 1280 × 480 template, N = 1280 × 480 = 614,400. To reduce N, for each template, two clusters were found in the transformed space using K-means clustering with a Euclidean metric. The means and numbers of pixels in the two clusters are then used to represent the template. We found that two clusters were sufficient to capture the variability in pixel values in templates, whereas one cluster, or equivalently sample means, lost this variability. Figure 3 shows the mean values extracted from the fruit and background templates. Note that (R,G,B) has a value range of [0,1] and we applied a linear rescaling so that G − B, G − R and G/(R + G + B) also lie in the range [0,1].
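As a worked illustration (our own sketch, not the authors' code; the function name is an assumption), the three transformed channels can be computed for an RGB array with values in [0, 1], including the linear rescaling to [0, 1] described above:

```python
import numpy as np

def colour_features(rgb):
    """Map an (H, W, 3) RGB array in [0, 1] to the three channels used for
    the initial points of interest: G - B, G - R and G/(R + G + B),
    each linearly rescaled to lie in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gb = (g - b + 1.0) / 2.0              # G - B: [-1, 1] -> [0, 1]
    gr = (g - r + 1.0) / 2.0              # G - R: [-1, 1] -> [0, 1]
    pg = g / np.maximum(r + g + b, 1e-9)  # proportion of green, already in [0, 1]
    return np.stack([gb, gr, pg], axis=-1)
```

A saturated green pixel (0, 1, 0) maps to (1, 1, 1) while a grey pixel maps to (0.5, 0.5, 1/3), so strongly green pixels separate from neutral foliage and background along all three axes.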

Fig. 3. Relationship between the fruit group and the background group in G − B, G − R and G/(R + G + B) from all training templates. The x-axis, y-axis and z-axis are normalised values for G − B, G − R and G/(R + G + B) respectively. Red and green dots are for red and green fruits respectively. The background group is represented by blue dots.
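The per-template two-cluster summary plotted in Fig. 3 can be sketched as follows. This is a toy fixed-iteration K-means of our own (function name, iteration count and seeding are assumptions), not the authors' implementation:

```python
import numpy as np

def two_cluster_summary(features, iters=20, seed=0):
    """Summarise an (N, 3) matrix of transformed template colours by the
    means and pixel counts of two K-means clusters (Euclidean metric)."""
    rng = np.random.default_rng(seed)
    # initialise the two cluster means from two random pixels
    means = features[rng.choice(len(features), size=2, replace=False)].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # assign each pixel to the nearer of the two current means
        d = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                means[k] = features[labels == k].mean(axis=0)
    return means, np.bincount(labels, minlength=2)
```

The pair of cluster means plus their pixel counts is the compact template representation described in the text; a single mean would collapse the within-template variability.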
b i o s y s t e m s e n g i n e e r i n g 1 1 8 ( 2 0 1 4 ) 2 0 3 e2 1 5 207

Table 1. Means (standard deviations) of normalised, transformed colours.

Template group   Probability %   G − B           G/(R + G + B)   G − R
Background       99.27           0.463 (0.078)   0.332 (0.036)   0.532 (0.024)
Red fruits       0.04            0.508 (0.024)   0.305 (0.073)   0.451 (0.093)
Green fruits     0.69            0.583 (0.065)   0.398 (0.036)   0.547 (0.027)
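As an illustration of how such a classifier can be evaluated (our sketch; the paper publishes no code), a diagonal Gaussian Naive Bayes can be built directly from the Table 1 priors, means and standard deviations, ignoring correlations between the three variables as the paper does:

```python
import numpy as np

# Class parameters taken from Table 1 (columns: G-B, G/(R+G+B), G-R).
PRIORS = {"background": 0.9927, "red": 0.0004, "green": 0.0069}
MEANS  = {"background": [0.463, 0.332, 0.532],
          "red":        [0.508, 0.305, 0.451],
          "green":      [0.583, 0.398, 0.547]}
SDS    = {"background": [0.078, 0.036, 0.024],
          "red":        [0.024, 0.073, 0.093],
          "green":      [0.065, 0.036, 0.027]}

def posterior(x):
    """Posterior class probabilities for one transformed colour vector x,
    assuming independent Gaussian variables within each class."""
    x = np.asarray(x, dtype=float)
    joint = {}
    for c in PRIORS:
        m, s = np.array(MEANS[c]), np.array(SDS[c])
        loglik = -0.5 * np.sum(((x - m) / s) ** 2 + np.log(2 * np.pi * s ** 2))
        joint[c] = PRIORS[c] * np.exp(loglik)
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}
```

Because the background prior dominates, the fruit posterior only becomes appreciable for pixels close to the fruit means, which is what the threshold Tp exploits.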

Colour classifier. Given the transformed vectors for the templates of red fruit, green fruit and background, we then constructed a Naive Bayes classifier with these three classes using prior parameters given in Table 1. For simplicity and speed, we ignored correlation between variables. The classifier was applied separately to the transformed colours of every pixel in an image, and the posterior probabilities of red and green fruits were combined. Figure 4 shows an example of the posterior probabilities for the combined fruit group and the background group. We applied a threshold Tp on the posterior probabilities to obtain the initial points of interest.

3.2. Bag-of-Words model

Our approach is inspired by, and builds on, Nilsback and Zisserman (2006), who used a 'bag of words' (BoW) approach to classify flowers. We combine this with the use of Maximally Stable Colour Region (MSCR) features (Forssén, 2007). The methodology is well established in detection and tracking of pedestrians, and here we explore its potential to obviate the need for many different shapes and sizes of templates to detect our highly variable fruit shapes.

For each initial point, we allocate a support window centred at the point in order to provide sufficient image information for recognition. In this work, the size of the support window used was 40 × 90 pixels, which was based on the average size of the fruit templates.

Feature extraction. To determine whether a fruit is present in a support window, we describe the window using two different feature sets: MSCR features (Forssén, 2007) and texture features obtained by local range filters.

An MSCR feature set is a set of descriptors of coloured ellipses in a window. These descriptors are found using an MSCR detector, which is an extension to colour of the maximally stable extremal region (MSER) covariant region detector (Matas, Chum, Martin, & Pajdla, 2002). The original MSER detector finds regions (ellipses) that are stable over a wide range of thresholdings of a grey-scale image. In MSCR, regions are

Fig. 4. An example illustrating the fruit recognition method. We first identify a number of possible fruit positions (initial points of interest), and then verify each fruit position for the removal of false and duplicated estimates by a Bag-of-Words model. Initial fruit probability is calculated based on G − B, G − R and G/(R + G + B). Fruit recognition is obtained by applying the Bag-of-Words model on the initial points of interest.
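The removal of duplicated estimates can be sketched as follows (our own minimal version, assuming the >50% intersection-over-union rule and the SVM weight W described in Section 3.3; function names are ours):

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) rectangles."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def suppress(detections):
    """detections: list of (window, weight). Among windows overlapping by
    more than 50%, keep only the one with the largest SVM weight W."""
    kept = []
    for win, w in sorted(detections, key=lambda d: -d[1]):
        if all(iou(win, k) <= 0.5 for k, _ in kept):
            kept.append((win, w))
    return kept
```

Sorting by descending weight first guarantees that the survivor of each overlapping group is its highest-weight member.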

detected that are stable across a range of time-steps in an agglomerative clustering of image pixels, based on proximity and similarity in colour (Forssén, 2007). Default parameters as described in Forssén (2007) were used. The obtained feature set provides an approximate description of the 'objects' in a window (see Fig. 5(b)) in the form of ellipses, which constitute an affine-invariant object representation when viewed from different angles. We used the geometric shape (five variables) and mean colour (three variables) of the fitted ellipses as a feature set.

Besides MSCR features, texture features from local range filters are also used. A local range filter simply calculates per colour the difference between the largest and smallest intensity in the filter window. Nilsback and Zisserman (2006) used a set of filters with 4 sizes: 3, 7, 11, 15, to define the texture. To reduce the amount of computational load and memory, we used only two filter sizes, and good results were obtained with the filter sizes 5 × 5 and 9 × 9 (see Figures 5(c) and (d)).

Bag-of-Words (BoW) frequency distribution. Next a so-called 'bag of words' approach is used as proposed by Zisserman and collaborators (Nilsback & Zisserman, 2006; Sivic & Zisserman, 2008). Nilsback and Zisserman (2006) describe a set of flower images by creating a flower vocabulary, using three different vocabularies for colour, shape (SIFT features) and texture (using the filter bank). Each vocabulary vector is quantised (discretised) to obtain so-called Visual Words. The frequency histogram of the visual words forms a so-called bag of words. These frequency histograms can then be used to calculate similarities, yielding a quick search method for images or videos (Sivic & Zisserman, 2008).

We used K-means clustering to construct a vocabulary with 1000 'words' to represent the MSCR features, and similarly for the local range features, following Nilsback and Zisserman (2006). Then, using the constructed vocabularies, we learned the frequency distribution of the combined vocabularies (2000 words) for the training data.

SVM classifier. Finally a support-vector-machine (SVM) classifier was used on all the frequency distributions of the training data to represent two groups, Fruit and Others. In effect, the BoW model represents each image by a frequency distribution of its visual vocabularies.

3.3. Using the bag-of-words model

For processing a validation image, we first find the initial points of interest in an image, as described in Section 3.1. Next the MSCR and local range features are calculated per window at each initial point. From the quantised vector of these two vocabularies, the frequency distribution in the bag-of-words frequency histogram is calculated, which subsequently is classified to a fruit class or not, using the SVM.

The outputs include fruit locations, and each estimate also has a weight threshold for the two classes. The weight threshold W is the (arbitrary) distance computed from the SVM classification, and a higher value means that the window is more likely to belong to that class. In fact, the values quantify how far an object of interest is from the decision line separating the two groups. A smaller value means closer to the borderline, while a larger value means it is more likely to belong to that class.

When points of interest are 'close' together, we obtain multiple classifications. In that case, we select the point/window with the highest weight threshold W. 'Close' is defined here as the overlap between the two windows, and we consider that two windows which have more than 50%

Fig. 5. MSCR features and image textures. Each ellipse in (b) is an MSCR feature, and its region is filled by the average colour of that ellipse. The MSCR features provide an approximation to an image. Regions not covered by the MSCR features are shown as blue. The 5 × 5 and 9 × 9 textures are obtained by a range filter on the colour image. The colour indicates the magnitude of local colour variation, e.g. green regions indicate large variation in green colour. (a) Image (b) MSCR (c) 5 × 5 texture (d) 9 × 9 texture.
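The range-filter textures in panels (c) and (d) can be reproduced with a simple sketch of our own (a plain, unoptimised implementation for one colour channel):

```python
import numpy as np

def local_range(channel, k):
    """Local range filter: max minus min over each k x k window of a 2-D
    array, returned at the same size (edges handled by replicate padding)."""
    p = k // 2
    padded = np.pad(channel, p, mode="edge")
    h, w = channel.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            win = padded[i:i + k, j:j + k]
            out[i, j] = win.max() - win.min()
    return out
```

Applied per colour channel with k = 5 and k = 9, this gives the two texture maps used alongside the MSCR features; flat regions produce zeros, while edges and cluttered foliage produce large responses.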

overlap with each other are 'close'. Overlap is calculated by the intersection of two detection windows divided by their union.

Overall, the recognition method performs reasonably well given the challenges we faced (see discussion section). The relationship between successive views must be investigated to help filter out isolated false positives, find occluded fruits and produce total fruit count.

4. Fruit counting from multiple views

Since we observe the same plant/fruit in multiple images, we need to combine this information into a single result. The aim is to count the correct number of fruits K in a plot, while preventing double counting of the same fruit which may appear in multiple images, as well as correctly counting fruits that are possibly missed in certain views (e.g. because of occlusion). Note that a fruit is approximately shifted by a fixed amount (γ) in the horizontal direction in consecutive images, depending on its distance from the camera. This property will be used to find the same fruit in multiple images.

Consider a contiguous sequence of images from a single experimental plot at one of the four camera heights. Let (x_ij, y_ij) denote the column and row coordinates of the jth fruit located in the ith image, where i = 0,…,I and j = 1,…,J_i. For example, Fig. 6 shows illustrative data for a short sequence of 3 images. It is likely that some of these data are repeat observations of the same fruit in different images, because some row coordinates (y) are very similar and column coordinates (x) shift by similar amounts between images. In order to estimate the number of fruits we need to determine which are repeat observations.

Suppose that there are K fruits observed at least once in an experimental plot, indexed by k = 1,…,K. To simplify exposition, we will only consider a single camera height, though in the results we sum the K's at the 4 heights to obtain total fruit count. Let (α_k, β_k) denote the true column and row coordinates of fruit k in image 0, and γ_k the true shift in column coordinate between consecutive images. We propose as our observation model:

\[
\begin{pmatrix} x_{ij} \\ y_{ij} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \alpha_{k(ij)} + i\,\gamma_{k(ij)} \\ \beta_{k(ij)} \end{pmatrix},
\begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}
\right)
\tag{1}
\]

where k(ij) denotes the correct fruit label of the observation indexed (i,j), and (σ_x², σ_y²) denote the variances of the normally distributed observation errors. There are 3K parameters (α, β, γ) associated with each experimental plot, together with 2 variance parameters which are common to all plots. The challenge is to estimate the number of fruits, K, in the presence of the remaining nuisance parameters.

In principle, we could estimate all parameters by maximising the likelihood of the model specified by (1), conditional on K, and estimate K using likelihood ratio tests. However, this would involve an enormous combinatorial search to assign fruit labels k(ij), so direct optimisation is computationally infeasible. Reversible jump Markov chain Monte Carlo is another possible approach to tackle this problem, but is problematic because of the large dataset of 40,000 observations, so we have instead developed a simpler, much faster, more ad hoc method. We first estimate σ_y² by fitting a mixture distribution to differences in y between pairs of observations. Similarly we estimate σ_x² by considering triplets of observations. Finally, for each plot we apply a 95% significance threshold rule to identify non-overlapping sets of observations of a single fruit, starting with the largest possible

Fig. 7. Precision-recall curves showing detection performance of our fruit recognition method for 10 experimental plots. Red and blue curves represent initial points and the BoW model applied on initial points respectively. The parameter Tp for initial points (described in Section 3.1) was in the range of [0.3, 0.9], and the range of parameter W for the BoW model is shown in Table 2. The green contours outline F1 scores from 0.2 to 0.9. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Illustrative data for a sequence of 3 images from one experimental plot at a single camera height, with row coordinates (y) plotted against column coordinates (x) for image 0 (*), 1 (+) and 2 (×).
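A much-simplified sketch of our own (not the mixture-fitting and significance-threshold algorithm of the Appendix) shows how repeat observations like those in Fig. 6 can be merged, under the simplifying assumption that the shift γ is known and common to all fruits:

```python
import numpy as np

def count_fruits(observations, gamma, tol=10.0):
    """observations: list of (i, x, y) detections, where i is the image
    index. Shift each detection back to image-0 coordinates using the
    assumed common shift gamma, then greedily merge points lying within
    tol of an existing cluster mean. Returns the estimated fruit count K."""
    centres = []  # running cluster means in image-0 coordinates
    for i, x, y in observations:
        p = np.array([x - i * gamma, y])
        for c in centres:
            if np.linalg.norm(p - c["mean"]) < tol:
                c["n"] += 1
                c["mean"] += (p - c["mean"]) / c["n"]  # update running mean
                break
        else:
            centres.append({"mean": p, "n": 1})
    return len(centres)
```

The paper's actual method instead estimates γ_k per fruit and σ_x, σ_y from the data, and uses a χ² significance rule to accept or split candidate sets; this sketch only conveys the core idea of de-duplicating shifted repeat observations.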

set size (I) and progressively reducing until we are left with singletons. The number of sets is our estimate of K. See the Appendix for details.

5. Results

To quantify the performance of our fruit recognition method (i.e. first finding the points of interest and then classifying with the bag-of-words), we used the precision-recall curve, and the ground truth consisted of manually labelled fruit positions for a single row of 10 experimental plots (408 validation images in total, see Section 2). If the overlap between a window classified as fruit (detection) and a similar sized window around the ground truth position is greater than 50%, then the detection is considered a true positive; otherwise the detection is a false positive. We treat each detection as unique for a fruit: if there are multiple detections satisfying the overlap criterion, the one with maximum overlap is the true positive and the others are considered false positives. A false negative represents a ground truth position which has no corresponding detection. The precision is defined as

Precision = TruePositive / (TruePositive + FalsePositive)

The recall is:

Recall = TruePositive / (TruePositive + FalseNegative)

We also combine both precision and recall into a single score F1,

F1 = 2 × Precision × Recall / (Precision + Recall)

Figure 7 presents the precision-recall performance for our fruit recognition methods. The result for the initial points of interest was obtained by varying the threshold Tp (0.3, 0.4, 0.5, …, 0.9). In case of overlapping detection windows, we used the one with the highest posterior probability Tp. There were many false positives using the initial colour classifier, resulting in low precision (0.45). However, precision is less relevant here, as this only provides initial estimates. We do want a high recall for initial points, to make sure that we do not miss fruits. For the BoW model, Tp was set to 0.9, and the weight threshold W was in the range from 0 to 1000 as in Table 2. The BoW model eliminated two-thirds of the false positives in the initial estimates, and the minimum precision is 0.61. False positives can almost be eliminated, but the number of true positives reduces and the missed detection rate can become quite high. For example, the highest precision was 0.97, but the recall was only 0.17 and the F1 score was lower than 0.3. For the 10 experimental plots, the highest F1 score was 0.65 at W = 100, and the F1 score was also above 0.6 for thresholds {0, 200, 300}.

It should be noted that the fruit count of a plot could not be estimated using single images only. Plants were visible in a successive sequence of images, and there were multiple plants in an image. We therefore applied the multiple-view fruit counting algorithm to the 10 experimental plots where locations of fruits had been visually identified from images (i.e. the ground truth for evaluating the fruit recognition method). Using the methods described in the Appendix, we obtained σ̂_x² = σ̂_y² = 5². We note that these are smaller than those for automatic fruit detection, showing the superiority of the human eye. Figure 8 shows the results, with 94.6% correlation between K and K̂_VIS. We see that fruit numbers are overestimated for all but one plot.

This is most likely due to fruits from border plants appearing in images although they are excluded from manual counting. Also, as we had four vertical levels of images, some fruits may have appeared at both the top of one image and the bottom of the one above. However, these sources of overestimation will be convolved with those from underestimation, mainly from occlusion due to fruits being hidden behind leaves.

We applied the multiple-view algorithm to all experimental plots in the validation trial with at least 12 images at each of the 4 camera heights, totalling 435, for a range of weight thresholds W. This is more than the 264 plots stated

Table 2. Choice of weight threshold W to maximise correlation between manually counted number of fruits per experimental plot (K) and estimated value (K̂).

W      No. data   % Correlation
0      38,600     63.1
100    28,600     67.8
200    21,400     72.4
300    16,300     74.2
400    12,900     74.0
500    10,300     73.0
600    8300       71.6
700    6600       70.0
800    5200       67.4
900    4000       64.6
1000   3000       62.3

Fig. 8. Plot of manually counted number of fruits per experimental plot (K) against estimated value using visually identified fruits in images (K̂_VIS) for 10 experimental plots, together with 1:1 line.
Biosystems Engineering 118 (2014) 203–215

in Section 2 because we view each plot twice, once from the aisle on either side, and have treated these views separately. Recall that, although our exposition has only considered a single camera height, in practice we sum the K̂'s at the 4 heights to obtain the total fruit count. Table 2 shows the correlation between observed and estimated numbers of fruits per experimental plot for a range of thresholds, from which we see that a threshold of 300 is best, maximising the correlation at 74.2%.

Table 3 – Results of 100 simulations of full dataset, using algorithm with range of % limits for S² ≤ χ²_{2n−3}(%).

Limit %    K̂ bias
75         0.83
90         0.24
95         0.00
99         0.19
99.5       0.21
99.9       0.27

Figure 9 shows the agreement between K and K̂ for each of the 435 experimental plots in the validation trial, from which
it is striking that not only is the correlation maximised but also K̂ is a good estimate of K without needing linear adjustment for intercept and scale. The standard error of prediction is 11.3 fruits. In Fig. 9 four plots from each of three genotypes with large numbers of fruits have been displayed using different symbols, from which we see that prediction errors are not independent of genotype, as all plots for two genotypes lie above the 1:1 line whereas for the other they all lie below. Some bias is to be expected, as fruits are more likely to be occluded in genotypes with denser foliage, and some shapes and sizes of fruits may be harder to detect by our automatic algorithm. Of the overall variability in K, 55% (i.e. 0.742², the squared correlation) is explained by K̂, and a further 38% by the 146 genotypes, but this is only 0.26% per genotype. When manual and automatic fruit counts are averaged over replicates of each genotype, the correlation rises to 79.4%. Remaining sources of over- and under-estimation are as we discussed for Fig. 8. However, unlike in Fig. 8, where over-estimation predominated, in Fig. 9 the two on average approximately balance out, with as many points above as below the 1:1 line, though this is only the case when the specific threshold value of W = 300 is used in the BoW model.

We conducted a simulation trial to further validate the multiple-view algorithm. Table 3 shows the bias in K̂ averaged over 100 simulations of the full dataset of 435 experimental plots, using both the selected threshold for S² of χ²_{2n−3}(95%) and alternative probability levels. We see that 95% is unbiased, whereas other values lead to biases in K̂.

Fig. 9 – Plot of manually counted number of fruits per experimental plot (K) against estimated value (K̂) for 435 experimental plots, together with 1:1 line. For three genotypes with large numbers of fruits the four plots have been displayed using three distinct symbols.

6. Discussion

The aim of our work was to reliably estimate the number of fruits in a crop. The main challenge is that fruits can have a variety of colours, from green to red, amidst green leaves of varying colour. A more comprehensive solution might be achieved by using a full 3D approach, as we had earlier attempted (Song, Glasbey, van der Heijden, Polder, & Dieleman, 2011). However, construction of such a 3D map is not trivial and introduces errors in itself. Even if a 3D map can be constructed, it does not fully solve the problem of automatic fruit recognition. Therefore this approach was not adopted here, and focus was on a 2D approach, using multiple views to limit the problem of occlusion.

In earlier stages of this research several methods were adapted, including a template-correlation approach in which the specular reflection was used to locate the fruit, but results proved to be unsuccessful. Because of the complexities of our images, with green and red fruits in a green canopy, we had to use sophisticated, computationally-intensive segmentation methods. Reis et al. (2012) showed a solution based on relatively simple RGB thresholds to distinguish between white and red grapes versus leaves. In our case, such a threshold approach was not sufficient, as the green leaves and green fruits were very much alike. Teixido et al. (2012) describe a systematic approach of colour invariance to detect peaches using distances to linear colour models in the RGB colour cube. This enables them to distinguish peaches and leaves in more and less illuminated areas. However, if considerable overlap is present between the different organs in the colour cube (in our case, between the green leaves and green fruits), the linear colour model approach fails and more contextual information is needed to solve the problem. We tried to use correlation-based templates to include this contextual information. Standard approaches were, however, still insufficient and detection rates were low. Therefore we switched to a combination of MSCR and BoW, which combines invariance, contextual information and advanced high-dimensional clustering to obtain a reasonable rate of detection. Still, human vision is superior in this case.
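The precision, recall and F1 definitions used in the evaluation above translate directly into code; a minimal sketch, with counts chosen purely for illustration (the 0.45 precision matches the figure quoted for the initial colour classifier):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: many false positives give low precision but high recall,
# which is acceptable for the initial points of interest
p, r, f1 = precision_recall_f1(tp=45, fp=55, fn=5)
```

As the text notes, a low-precision, high-recall operating point is tolerable at the initial stage, since the BoW model filters the candidates afterwards.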

Table 4 – Challenges addressed in this paper and a summary of our methods related to each challenge.

Challenges                            Relevant methods
1 Multiple plants in one image        Multiple views, Section 4
2 Plant spans across several images   Multiple views, Section 4
3 Complex fruit shape                 Bag-of-words model, Section 3.2
4 High intraclass variation           Bag-of-words model, Section 3.2
5 Low interclass variation            Colour classifier, Section 3.1
6 Occlusions                          Multiple views, Section 4

For finding initial points in our fruit recognition method, approaches other than using colour information could be taken, including corners (Leibe et al., 2004) and robust features (Leibe et al., 2005). In addition, other classifiers could be used instead of Naive Bayes to find the points. Which feature set and classifier work best is generally application specific, but many methods will yield similar results, and the method as such is not critical, as it is only used to reduce computing time.

We also proposed an algorithm for fruit counting using multiple views, and a correlation of 94.6% was achieved when fruits in images were visually identified, as shown in Fig. 8. This high correlation between the estimated number K̂_VIS and the true number of fruits K demonstrates that image analysis has the potential to replace manual measurement. The algorithm could be extended to include features of identified fruits, such as shape, size and colour, in addition to (x, y) locations, to determine which are repeat views.

The pepper fruit considered in this paper is a difficult object to recognise, and we faced six challenges, shown in Table 4, that have not been tackled by Ji et al. (2012), Linker et al. (2012), Tanigaki et al. (2008) or De-An et al. (2011). For example, pepper fruit (Fig. 1) have a range of curved shapes, unlike the circle-shaped apples in Ji et al. (2012) and Linker et al. (2012). Moreover, pepper fruit have two distinctive colours and our images were captured under varying lighting conditions (intraclass variation), and the low contrast in colour between the green fruit and other plant parts (interclass variation) led to low precision (see Fig. 3). Our fruit recognition method therefore accounted for blob (i.e. MSCR) and texture features in addition to colour.

Fruit counting for a large number of plants (e.g. we had over 28,000 images recording over 1000 experimental plants) and its challenges (particularly the first two challenges in Table 4) have not been addressed before. We achieved a correlation of 74.2%, as shown in Fig. 9, and K̂ was a good estimate of K without any linear adjustment. On average, there were 21.2 fruits in a plot, with a standard deviation of 16.3, while our prediction achieved a standard error of 11.3 fruits.

Although our fruit recognition and counting methods were designed for digital phenotyping, they can in principle be used for other robotics applications such as automatic harvesting. Using a standard PC (3 GHz processor with 8 GB memory), the running time for our fruit recognition method (MATLAB) was under 10 s for a 480 × 1280 validation image without any reduction in resolution, and the multiple-view algorithm requires a few seconds for counting.

In this paper, we have shown how to use the 'bag of words' (BoW) approach for recognising fruits with two distinctive colours and complex shapes in an image. Furthermore, a high-throughput imaging setup such as that used by Polder et al. (2009) usually captures a continuous set of images or uses a video camera, which also causes bias in counting when the same fruit is visible in more than one image. To reduce this bias, we have described a multiple-view approach that can aggregate estimates from a number of images or video frames. The multiple-view algorithm also minimises the error caused by the occlusion problem, which improves the detection rate of the fruits. The BoW approach has already been successfully adopted in practical applications in other fields (Nilsback & Zisserman, 2006; Sivic & Zisserman, 2008), and we would like to draw the attention of the biological systems community to the potential use of this method.

7. Conclusions

Our conclusions are:

- Image analysis is one way to automate plant phenotyping, such as to count fruit numbers on plants.
- Our images are of 3-m high peppers with fruits of complex shapes and varying colours, collected every 5 cm along greenhouse aisles.
- We have developed a two-stage method to locate and count fruits: 1) find fruits in single images using a bag-of-words model; 2) aggregate estimates from multiple images using a statistical approach to cluster repeated, incomplete observations.
- We achieved a correlation of 74.2% between automatic and manual counts of fruit numbers in 435 plots, whereas stage 2 alone, with stage 1 replaced by manual identification of fruits in images, achieved a correlation of 94.6%.

Acknowledgements

This work is part of the Smart tools for Prediction and Improvement of Crop Yield (SPICY) project supported by the European Community and funded by the KBBE FP7 programme (Grant agreement number KBBE-2008-211347). We also acknowledge Scottish Government funding.

Appendix. Algorithm to estimate fruit numbers from multiple views

To estimate the parameters in the model specified by (1), we first consider all pairs of observations from the same experimental plot, indexed by (i1, j1) and (i2, j2), such that i1 < i2, and compute

    Δ = y_{i2,j2} − y_{i1,j1}   and   γ̂ = (x_{i2,j2} − x_{i1,j1}) / (i2 − i1).

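In code, the pairwise statistics Δ and γ̂ defined above can be computed over all between-image pairs within a plot; a minimal sketch, with detections represented as hypothetical (image index, x, y) tuples:

```python
from itertools import combinations

def pair_statistics(obs):
    """obs: list of (i, x, y) fruit detections in one plot,
    where i is the image index and (x, y) are column/row coordinates.
    Returns (delta, gamma_hat) for every pair of detections with i1 < i2."""
    stats = []
    for (i1, x1, y1), (i2, x2, y2) in combinations(sorted(obs), 2):
        if i1 < i2:  # skip pairs from the same image
            delta = y2 - y1              # row difference
            gamma = (x2 - x1) / (i2 - i1)  # column shift per image step
            stats.append((delta, gamma))
    return stats
```

Repeat observations of the same fruit should give delta near 0 and gamma near the inter-image camera shift, which is what produces the cluster visible in Figure A.1(a).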
Figure A.1 – Derived data used to estimate σ̂²_y and σ̂²_x: (a) differences between pairs of row coordinates (Δ) plotted against estimated shift (γ̂) for restricted range of values; (b) square-root of residual sums of squares of model fit to triplets of column coordinates (S_x) plotted against estimated shift (γ̂) for restricted range of values; (c) histogram of values of Δ and maximum likelihood fit (red line) of mixture of normal and uniform distributions; (d) histogram of values of S_x and maximum likelihood fit (red line) of mixture of half-normal and uniform distributions.
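The maximum likelihood fits in panels (c) and (d) combine a "repeat observation" component with a uniform background. As an illustration, here is a minimal sketch of the normal-plus-uniform log-likelihood for Δ (panel (c)), with a crude grid search standing in for the numerical maximisation used in the paper; the sample differences and grid values are made up:

```python
import math

def mixture_loglik(deltas, rho, var2y, delta_m=100.0):
    """Log-likelihood of the mixture for Delta, restricted to |Delta| <= delta_m:
    with probability rho, Delta ~ N(0, 2*sigma_y^2) (var2y = 2*sigma_y^2);
    otherwise Delta ~ U(-delta_m, delta_m)."""
    ll = 0.0
    for d in deltas:
        dens = ((1.0 - rho) / (2.0 * delta_m)
                + rho * math.exp(-d * d / (2.0 * var2y))
                / math.sqrt(2.0 * math.pi * var2y))
        ll += math.log(dens)
    return ll

# Crude grid search over (rho, 2*sigma_y^2); made-up row differences
deltas = [0.4, -1.8, 2.5, 0.9, 47.0, -83.0]
grid = [(r, v) for r in (0.3, 0.5, 0.7, 0.9) for v in (50.0, 16.5 ** 2, 500.0)]
rho_hat, var_hat = max(grid, key=lambda p: mixture_loglik(deltas, *p))
```

A proper fit would use a continuous optimiser rather than a grid, but the likelihood itself matches the displayed formula term by term.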

Figure A.1(a) shows Δ plotted against γ̂ for the full dataset, restricted to |Δ| ≤ Δ_M ≡ 100 and 30 ≤ γ̂ ≤ 150, which is a conservative range of values that γ can take for fruits, given the distance from the plants to the cameras. (For clarity, only a random 10% of data are plotted.) We see a cluster of values around (Δ, γ) ≈ (0, 60), which are likely to be repeat observations of the same fruit. Figure A.1(c) shows the histogram of Δ, which looks well approximated by a mixture of a normal and a uniform distribution, and this agrees with what we expect if distances between neighbouring fruits can be assumed to be approximately uniformly distributed:

    (Δ | |Δ| ≤ Δ_M) ~ N(0, 2σ²_y)    if k(i1, j1) = k(i2, j2),
                     U(−Δ_M, Δ_M)    otherwise.

We estimate the normal distribution variance (2σ²_y) and proportion (ρ) by numerically maximising the log-likelihood:

    Σ_l ln[ (1 − ρ)/(2Δ_M) + ρ/√(4πσ²_y) · exp(−Δ_l²/(4σ²_y)) ],

where summation over pairs l is restricted to the ranges in Figure A.1(a). Figure A.1(c) shows the fitted distribution, with 2σ̂²_y = 16.50². We note that there is some evidence for the distribution being more spiked than a normal, but we are not overly concerned with this discrepancy, as statistical inference is usually robust to normality assumptions and the use of other distributions would greatly complicate the estimation to follow.

We next consider all observation triplets from the same plot. However, for subsequent usage, we will express this in the greater generality of n observations {(i1, j1), (i2, j2), …, (in, jn)} such that i1 < i2 < … < in. Given such a set, we can estimate (α, β, γ) by least squares, analytically using standard formulae, and we can also compute residual sums of squares S²_x and S²_y. For row coordinates (y):

    β̂ = (1/n) Σ_{l=1..n} y_{il,jl},    S²_y = Σ_{l=1..n} (y_{il,jl} − β̂)²;

and for column coordinates (x):

    (α̂, γ̂) = argmin_{(α,γ)} Σ_{l=1..n} (x_{il,jl} − α − il·γ)²,    S²_x = Σ_{l=1..n} (x_{il,jl} − α̂ − il·γ̂)².

We note that, if the set are all repeat observations of the same fruit, then S²_y ~ σ²_y·χ²_{n−1} and S²_x ~ σ²_x·χ²_{n−2}, from which it follows that:

    S² = S²_x/σ̂²_x + S²_y/σ̂²_y ~ χ²_{2n−3}

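The S² statistic above drives the set-acceptance test in the algorithm that follows; a minimal sketch in code (the least-squares fits are the closed-form formulae above; the variance defaults are the appendix estimates σ̂²_x = 6.67² and 2σ̂²_y = 16.50², and the triplet is made-up data):

```python
def s2_statistic(obs, var_x=6.67 ** 2, var_y=16.50 ** 2 / 2):
    """obs: candidate repeat observations as (image index i, x, y) tuples
    with strictly increasing i.  Returns S^2 = S_x^2/var_x + S_y^2/var_y,
    which is ~ chi-squared on 2n-3 d.f. if all observations are one fruit."""
    n = len(obs)
    iv, xv, yv = zip(*obs)
    beta = sum(yv) / n                       # least-squares fit for rows
    s2y = sum((y - beta) ** 2 for y in yv)
    ibar, xbar = sum(iv) / n, sum(xv) / n
    sii = sum((i - ibar) ** 2 for i in iv)
    gamma = sum((i - ibar) * (x - xbar) for i, x in zip(iv, xv)) / sii
    s2x = sum((x - xbar - gamma * (i - ibar)) ** 2 for i, x in zip(iv, xv))
    return s2x / var_x + s2y / var_y

# Accept a triplet as one fruit if S^2 <= chi^2_3(95%) = 7.8 (value from the text)
triplet = [(1, 100.0, 50.0), (2, 160.0, 51.0), (3, 221.0, 49.0)]
accept = s2_statistic(triplet) <= 7.8
```

The intercept α̂ is never needed explicitly, since the residual sum of squares can be written entirely in centred form.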
In particular, for a triplet (n = 3), S²_x ~ σ²_x·χ²_1, so S_x ~ N⁺(0, σ²_x), the positive half of a normal distribution.

Figure A.1(b) shows a plot of S_x against γ̂ for triplets from all experimental plots in the dataset, restricted to S_x ≤ S_M ≡ 50 and 30 ≤ γ̂ ≤ 150. We also restrict to S²_y ≤ σ̂²_y·χ²_2(95%) to ensure that values of y are consistent with repeat observations of a single fruit, at a 95% level of significance. (For clarity, again only a random 10% of data are plotted.) Similar to Δ, the data are consistent with:

    (S_x | S_x ≤ S_M) ~ N⁺(0, σ²_x)    if k(i1, j1) = k(i2, j2) = k(i3, j3),
                        U(0, S_M)      otherwise.

Figure A.1(d) shows the histogram of S_x from the full dataset, and the maximum likelihood fit, with σ̂²_x = 6.67². Again, there is some evidence for the distribution being more spiked than a normal, which again does not overly concern us. We also note that σ̂²_x < σ̂²_y, indicating that column locations of fruit are more easily determined than row locations.

Now that we have estimates of the 2 variance parameters, we can consider data from each experimental plot separately to estimate K. Although it is possible to estimate the number of pairs and triplets of observations from the same fruit, it is not possible to extend this to direct estimation of K. Instead, by the following algorithm we can identify sets of observations which are inferred to have been of a single fruit, at a 95% level of significance. For each plot:

1. Initialise set size n ← (I + 1) and estimated number of fruits K̂ ← 0;
2. Find the set of size n, {(i1, j1), (i2, j2), …, (in, jn)}, which minimises S², subject to i1 < i2 < … < in (but no contiguity constraint) and 50 ≤ γ̂ ≤ 130 (note, we use this realistic range of values for γ, rather than the conservative range in Figure A.1);
3. If S² ≤ χ²_{2n−3}(95%), then accept this set, remove the n data points from further consideration, set K̂ ← (K̂ + 1), and return to step 2;
4. Set n ← (n − 1), and return to step 2 provided n ≥ 2;
5. Set K̂ ← (K̂ + number of remaining singletons).

The algorithm is fast because in step 2 most sets can be excluded with fewer than n points, because S² increases monotonically as points are added to a set, so a tree search can be used. Figure A.2 shows the results of the algorithm applied to the data in Fig. 6. As there are only 3 images in this illustrative example, we start with n = 3. The group marked 'A' is the first identified, with S² = 2.0, followed by 'B' with S² = 5.3. No other triple of observations remaining has S² ≤ χ²_3(95%) = 7.8, so we then search for sets of size n = 2. We find 5 pairs, labelled 'C'…'G', with increasing values of S² ≤ χ²_1(95%) = 3.8. Figure A.2 also shows the fitted values. Three unassigned points remain, singletons labelled 'H', 'I', 'J', and we infer the total number of observed fruits to be K̂ = 2 + 5 + 3 = 10.

Figure A.2 – Data in Fig. 6, showing sets of points identified by the algorithm, with 'A'…'J' denoting identification order (see text), and fitted values (red dots).

References

Alimi, N., Bink, M., Dieleman, J., Nicola, M., Wubs, M., Heuvelink, E., et al. (2013). Genetic and QTL analyses of yield and a set of physiological traits in pepper. Euphytica, 190, 181–201.
Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., & Van Gool, L. (2009). Robust tracking-by-detection using a detector confidence particle filter. In IEEE 12th international conference on computer vision (pp. 1515–1522).
Brosnan, T., & Sun, D.-W. (2004). Improving quality inspection of food products by computer vision – a review. Journal of Food Engineering, 61(1), 3–16.
Bulanon, D., Kataoka, T., Ota, Y., & Hiroma, T. (2002). A segmentation algorithm for the automatic recognition of Fuji apples at harvest. Biosystems Engineering, 83(4), 405–412.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition (pp. 886–893).
De-An, Z., Jidong, L., Wei, J., Ying, Z., & Yu, C. (2011). Design and control of an apple harvesting robot. Biosystems Engineering, 110(2), 112–122.
Ferrari, V., Fevrier, L., Jurie, F., & Schmid, C. (2008). Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 36–51.
Forssén, P.-E. (2007). Maximally stable colour regions for recognition and matching. In IEEE conference on computer vision and pattern recognition (pp. 1–8).
Furbank, R. T., & Tester, M. (2011). Phenomics – technologies to relieve the phenotyping bottleneck. Trends in Plant Science, 16(12), 635–644.
Gevers, T., & Smeulders, W. M. (1999). Color-based object recognition. Pattern Recognition, 32, 453–464.
Jimenez, A., Jain, A., Ceres, R., & Pons, J. (1999). Automatic fruit recognition: a survey and new results using range/attenuation images. Pattern Recognition, 32(10), 1719–1736.
Ji, W., Zhao, D., Cheng, F., Xu, B., Zhang, Y., & Wang, J. (2012). Automatic recognition vision system guided for apple harvesting robot. Computers & Electrical Engineering, 38(5), 1186–1195.
Kitamura, S., & Oka, K. (2005). Recognition and cutting system of sweet pepper for picking robot in greenhouse horticulture. In IEEE international conference on mechatronics and automation (Vol. 4, pp. 1807–1812).
Lee, W., Alchanatis, V., Yang, C., Hirafuji, M., Moshou, D., & Li, C. (2010). Sensing technologies for precision specialty crop production. Computers and Electronics in Agriculture, 74(1), 2–33.
Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape

model. In ECCV workshop on statistical learning in computer vision (pp. 17–32).
Leibe, B., Seemann, E., & Schiele, B. (2005). Pedestrian detection in crowded scenes. In IEEE computer society conference on computer vision and pattern recognition (pp. 878–885).
Linker, R., Cohen, O., & Naor, A. (2012). Determination of the number of green apples in RGB images recorded in orchards. Computers and Electronics in Agriculture, 81, 45–57.
Matas, J., Chum, O., Urban, M., & Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British machine vision conference (pp. 384–393).
Nilsback, M.-E., & Zisserman, A. (2006). A visual vocabulary for flower classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 1447–1454).
Polder, G., van der Heijden, G. W. A. M., Glasbey, C. A., Song, Y., & Dieleman, J. A. (2009). Spy-See – advanced vision system for phenotyping in greenhouses. In Proceedings of the MINET conference: Measurement, sensation and cognition (pp. 115–117). National Physical Laboratory.
Reis, M. J. C. S., Morais, R., Peres, E., Pereira, C., Contente, O., Soares, S., et al. (2012). Automatic detection of bunches of grapes in natural environment from color images. Journal of Applied Logic, 10(4), 285–290.
Sivic, J., & Zisserman, A. (2008). Efficient visual search for objects in videos. Proceedings of the IEEE, 96(4), 548–566.
Song, Y., Glasbey, C. A., van der Heijden, G. W. A. M., Polder, G., & Dieleman, J. A. (2011). Combining stereo and Time-of-Flight images with application to automatic plant phenotyping. In 17th Scandinavian conference on image analysis, SCIA 2011 (pp. 467–478).
Stajnko, D., Lakota, M., & Hocevar, M. (2004). Estimation of number and diameter of apple fruits in an orchard during the growing season by thermal imaging. Computers and Electronics in Agriculture, 42(1), 31–42.
Tanigaki, K., Fujiura, T., Akase, A., & Imagawa, J. (2008). Cherry-harvesting robot. Computers and Electronics in Agriculture, 63(1), 65–72.
Teixido, M., Font, D., Palleja, T., Tresanchez, M., Nogues, M., & Palacin, J. (2012). Definition of linear color models in the RGB vector color space to detect red peaches in orchard images taken under natural illumination. Sensors, 12, 7701–7718.
van der Heijden, G., Song, Y., Horgan, G., Polder, G., Dieleman, A., Bink, M., et al. (2012). SPICY: towards automated phenotyping of large pepper plants in the greenhouse. Functional Plant Biology, 39(11), 870–877.
Yang, L., Dickinson, J., Wu, Q., & Lang, S. (2007). A fruit recognition method for automatic harvesting. In 14th International conference on mechatronics and machine vision in practice (pp. 152–157).
Zhao, J., Tow, J., & Katupitiya, J. (2005). On-tree fruit recognition using texture properties and color data. In International conference on intelligent robots and systems (pp. 263–268).