
Computers in Industry 98 (2018) 183–191


Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM)

Edmund J. Sadgrove, Greg Falzon, David Miron, David W. Lamb
Precision Agriculture Research Group, University of New England, Armidale, NSW 2351, Australia

ARTICLE INFO

Article history:
Received 31 August 2017
Received in revised form 6 March 2018
Accepted 15 March 2018
Available online xxx

Keywords:
Extreme learning machine
Object detection
Machine vision
Unmanned aerial vehicle
Agriculture
Robotics

ABSTRACT

It is necessary for autonomous robotics in agriculture to provide real time feedback, but due to a diverse array of objects and a lack of landscape uniformity this objective is inherently complex. The current study presents two implementations of the multiple-expert colour feature extreme learning machine (MEC-ELM). The MEC-ELM is a cascading algorithm that has been implemented alongside a summed area table (SAT) for fast feature extraction and object classification, giving a fully functioning object detection algorithm. The MEC-ELM is an implementation of the colour feature extreme learning machine (CF-ELM), which is an extreme learning machine (ELM) with a partially connected hidden layer, taking three colour bands as inputs. The colour implementation used with the SAT enables the MEC-ELM to find and classify objects quickly, with 84% precision and 91% recall in weed detection in the Y’UV colour space and in 0.5 s per frame. The colour implementation is, however, limited to low resolution images and for this reason a colour level co-occurrence matrix (CLCM) variant of the MEC-ELM is proposed. This variant uses the SAT to produce a CLCM and texture analyses, with the texture values processed as inputs to the MEC-ELM. This enabled the MEC-ELM to achieve 78–85% precision and 81–93% recall in cattle, weed and quad bike detection, in times between 1 and 2 s per frame. Both implementations were benchmarked on a standard i7 mobile processor. Thus the results presented in this paper demonstrate that the MEC-ELM with SAT grid and CLCM makes an ideal candidate for fast object detection in complex and/or agricultural landscapes.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Agricultural systems require autonomous robotics for weed spraying, livestock detection and vehicle safety. The ability to detect and process objects quickly is a requirement of many of these systems. Agricultural scenarios can be exceedingly complex compared to other industrial robotics settings, and object detection algorithms in agriculture may have to contend with any number of structurally similar or diverse objects. This makes for a complex environment with the potential for many false positives and false negatives. A potential solution to this problem is to adopt a cascading or multiple expert approach. These types of solutions have varying levels of success in both flora and fauna detection and there are numerous implementations, ranging from rudimentary to more complex. The simpler approaches include plain colour and texture detection, as well as broad template matching, and have been implemented for weed, horse and wildlife detection [1–3]. The more complex solutions use a wide range of techniques, which can include large data structures such as deep or exemplar neural networks and varying levels of texture and shape based analyses; these have been applied to assessing the ripeness of bananas and to other generic object detectors [4–7].

The approaches discussed often rely on grey-scale images and, in many cases, on the detection of prominent features or stand-out colour attributes. Processing speed is the key advantage in this case. Deep learning architectures are much slower and more complex and for this reason are often avoided in favour of real time solutions. Notably, with a high-end GPU it is possible to process a large number of frames per second [8], but in remote mobile computing GPUs might not be attainable. A disadvantage of many approaches is the reliance on prominent key characteristics in the classification process. This often leads to poor overall classification accuracy, particularly in more complex scenarios: in weed detection there may not be any prominent features; in cattle detection there may be variations in colour within the same breed; and in other cases an object's features may appear different at different angles or rotations.

https://doi.org/10.1016/j.compind.2018.03.014

The goal of this research is then to explore methods that can be used to deliver both fast and accurate feature extraction and object classification. For this, an implementation of the multiple-expert colour feature extreme learning machine (MEC-ELM) [9] is proposed. The MEC-ELM is a cascading implementation of the colour feature extreme learning machine (CF-ELM) [10], which is itself an implementation of the extreme learning machine (ELM) [11] for colour object detection. The ELM, in part due to its efficient implementation and fast processing speeds, has demonstrated suitability for computer vision problems in agricultural scenarios, including soybean classification [12] and unmanned aerial vision for palm tree detection [13], and it has been benchmarked against notable feature extraction techniques [7]. The MEC-ELM can be used as both a feature extraction and a classification technique. This can be achieved by adopting the summed area table (SAT) [14] (or integral image) and thereby reducing a landscape image (or video frame) to a grid of coloured blocks. The purpose of this is to provide a generic approach to HAAR features [15] and hence take advantage of the SAT's fast, multi-scale feature extraction architecture. Fast processing may require low resolution image data and this can result in a potential loss of pixel based information. To meet this challenge, the output of the SAT grid is used to generate three colour level co-occurrence matrices (CLCM) [16], one for each colour band (red, green and blue). The outcome is a texture based analysis, with the values provided as the input of each CF-ELM. The two implementations of the MEC-ELM were implemented and their processing speeds and overall accuracy compared. The algorithm is designed with the objective of fast object detection in complex and unpredictable terrain, making it an ideal candidate for use in the agriculture industry. The objective in this case is to implement the MEC-ELM as a weed detector for spraying, for unobtrusive cattle tracking and as a vehicle interaction and avoidance tool. This paper benchmarks the MEC-ELM with pre-recorded video data, for eventual use in a remote laptop interface interacting with an unmanned aerial vehicle (UAV or drone) and stationary surveillance devices.

2. Theory/calculation

2.1. Multiple-expert colour extreme learning machine (MEC-ELM)

The ELM is a single layer, feed forward neural network that is known for its fast and analytical training phase. In this phase the output of the neural network's hidden layer is stored in a matrix designated H; the output weights are then determined analytically. This can be expressed [17,18]:

\[
H(W_1,\ldots,W_{\tilde{N}},\,b_1,\ldots,b_{\tilde{N}},\,x_1,\ldots,x_N)=
\begin{bmatrix}
g(W_1\cdot x_1+b_1) & \cdots & g(W_{\tilde{N}}\cdot x_1+b_{\tilde{N}})\\
\vdots & \cdots & \vdots\\
g(W_1\cdot x_N+b_1) & \cdots & g(W_{\tilde{N}}\cdot x_N+b_{\tilde{N}})
\end{bmatrix}_{N\times\tilde{N}}
\tag{1}
\]

where H is the hidden layer output matrix, N is the number of samples used in the training phase and \(\tilde{N}\) is the number of neurons in the hidden layer. In the activation function g() [19], W is the input weight, x is the input sample pixels and b is the bias.

The colour feature extreme learning machine (CF-ELM) is similar in architecture to the ELM, comprising a single layer, feed forward neural network with a partially connected hidden layer and a fully connected output layer. The hidden layer is divided into 3 sections and this gives the CF-ELM the ability to be used with different colour models, including red, green, blue (RGB); luminance, chrominance red, chrominance blue (Y’UV); and hue, saturation, value (HSV) [20]. Y’UV (also known as YCrCb) is defined by the International Telecommunication Union as ITU-R BT.601 [32]. Equivalent to the standard ELM, it uses randomly assigned weights in the hidden layer and, by using the pseudo inverse, it can analytically determine the output weights from a fully connected output layer. By dividing the hidden layer into 3 sections, each colour attribute can be processed in a separate section of the hidden layer, and it is for this reason that the number of neurons in the hidden layer must be a multiple of 3. By storing the output of these 3 sections into 3 sections of the H matrix, the matrix can be used to determine the output weights in the same way as the standard ELM. For Y’UV the CF-ELM hidden layer can be expressed [10]:

\[
H(W_1,\ldots,W_{\tilde{N}},\,b_1,\ldots,b_{\tilde{N}},\,Y'_1,\ldots,Y'_N,\,U_1,\ldots,U_N,\,V_1,\ldots,V_N)=
\begin{bmatrix}
g(W_1\cdot Y'_1+b_1) & \cdots & g(W_{\tilde{N}}\cdot Y'_1+b_{\tilde{N}})\\
\vdots & \cdots & \vdots\\
g(W_1\cdot Y'_N+b_1) & \cdots & g(W_{\tilde{N}}\cdot Y'_N+b_{\tilde{N}})\\
g(W_1\cdot U_1+b_1) & \cdots & g(W_{\tilde{N}}\cdot U_1+b_{\tilde{N}})\\
\vdots & \cdots & \vdots\\
g(W_1\cdot U_N+b_1) & \cdots & g(W_{\tilde{N}}\cdot U_N+b_{\tilde{N}})\\
g(W_1\cdot V_1+b_1) & \cdots & g(W_{\tilde{N}}\cdot V_1+b_{\tilde{N}})\\
\vdots & \cdots & \vdots\\
g(W_1\cdot V_N+b_1) & \cdots & g(W_{\tilde{N}}\cdot V_N+b_{\tilde{N}})
\end{bmatrix}_{3N\times\tilde{N}}
\tag{2}
\]

where Y', U and V are the individual colour pixel matrices for each image and are stored in the H matrix at the output of the hidden layer. The hidden layer process is repeated in the output layer, with the hidden layer output becoming the multiplier for the output weights \(\beta\). The output T of the CF-ELM is then the result of \(\beta\) and H:

\[ T = \beta H \tag{3} \]

Here \(\beta\) can be expressed:

\[
\beta=\begin{bmatrix}\beta_1^T\\ \vdots\\ \beta_{\tilde{N}}^T\end{bmatrix}_{\tilde{N}\times m}
\tag{4}
\]

where m is the number of neurons in the output layer, which is equivalent to the number of outputs of the ANN. The matrix of target outputs T can be expressed as:

\[
T=\begin{bmatrix}t_1^T\\ \vdots\\ t_N^T\end{bmatrix}_{N\times m}
\tag{5}
\]

where for each \(t_N\) the value is stored based on the input training sample and its desired output. This leaves \(\beta\) as the one unknown; by making \(\beta\) the subject we get:

\[ \beta = H^{\dagger}T \tag{6} \]

where \(H^{\dagger}\) is the Moore–Penrose pseudo inverse of matrix H. The output values of this process are then stored in \(\beta\) and used as the weights in the output layer, removing the need for a long gradient descent based training process.
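To make the analytic training step of Eqs. (1)–(6) concrete, the sketch below builds the hidden layer matrix H for one input section and obtains the output weights with a least-squares solve, which plays the role of the Moore–Penrose pseudo inverse in Eq. (6). It is a minimal illustration under stated assumptions (sigmoid activation, a single output neuron, more samples than hidden neurons); the identifiers are not taken from the authors' implementation, which Section 3.1 notes used the C LAPACK library.

```c
/* Minimal ELM training sketch for one hidden layer section (Eqs. (1)-(6)).
 * Assumptions: sigmoid activation, single output neuron, N >= Ntilde samples,
 * row-major double arrays; all names are illustrative only. */
#include <math.h>
#include <lapacke.h>

static double g(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Fill H (N x Ntilde) with H[i][j] = g(W_j . x_i + b_j), as in Eq. (1).
 * X is N x d samples, W is Ntilde x d random input weights, b is Ntilde biases. */
void hidden_layer(const double *X, const double *W, const double *b,
                  double *H, int N, int d, int Ntilde)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < Ntilde; j++) {
            double z = b[j];
            for (int k = 0; k < d; k++)
                z += W[j * d + k] * X[i * d + k];
            H[i * Ntilde + j] = g(z);
        }
}

/* Solve H * beta = T in the least-squares sense (beta ~ pseudo inverse of H times T).
 * H is overwritten; on return the first Ntilde entries of T hold beta. */
int train_output_weights(double *H, double *T, int N, int Ntilde)
{
    return LAPACKE_dgels(LAPACK_ROW_MAJOR, 'N', N, Ntilde, 1, H, Ntilde, T, 1);
}
```

For the CF-ELM the same construction would simply be repeated for the Y', U and V sections, stacking the three partial H blocks into the 3N x Ñ matrix of Eq. (2) before the solve.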
The MEC-ELM is then a set of CF-ELMs, where each CF-ELM can be trained on a different set of sample images, a different colour system or different image analysis techniques. The MEC-ELM becomes a global consensus of all CF-ELMs or experts; the goal is then to find individual CF-ELMs of high classification accuracy but with varying consensus [21]. The method used in this paper was inspired by the Exemplar SVM [22] and the Ensemble ELM (EN-ELM) [23], where training samples are divided among different instances of CF-ELMs. The training phase of the MEC-ELM is depicted in Fig. 1, where K is the number of CF-ELMs; each instance is trained on one Kth of the training set and 1(N/K) to K(N/K) (or N) denotes the final image for each instance. A predetermined number of values is sent to the input of each CF-ELM, where RGB values are first processed; this will be discussed further in the methodology in Section 3.1.

Fig. 1. The MEC-ELM with each CF-ELM trained on a different portion of the training set.

2.2. Summed area table (SAT)

The SAT, which is also known as the integral image (II), was made popular by the Viola and Jones object detection framework [24]. The SAT allows fast object detection by converting an image into a summed representation of all pixel values, and from these values a HAAR based feature set can be used for broad template matching. The HAAR based feature set can be quite large and the AdaBoost algorithm [25] is often used to find the features most commonly associated with a set of training images. The HAAR features are rectangular features with alternating dark and light areas designed to match the dark and light areas (or light intensity) associated with a target object. These blocks match the block based summation values found in the SAT, which means that these dark and light areas can be matched to areas within the image very quickly. Only four sets of coordinates are required to calculate the light intensity of one sub block. To create the SAT, all pixel values in the table are stored as summations of the pixel value at the current location and the preceding summation values. In this fashion, the SAT can be generated from just one pass over the image. This can be expressed [14]:

\[ II(i,j)=(i,j)+(i-\Delta i,\,j)+(i,\,j-\Delta j)-(i-\Delta i,\,j-\Delta j) \tag{7} \]

where II(i, j) is a single coordinate of dimensions i and j of the SAT and the decrements (\(\Delta i\), \(\Delta j\)) refer to the preceding pixel summation value; here \(\Delta i, \Delta j = 1\). The summed light intensity value of any rectangular area can then be calculated using the four coordinates of the corners of the rectangle. This can be expressed:

\[ S(i,j,i-w,j-h)=(i,j)+(i-w,\,j-h)-(i-w,\,j)-(i,\,j-h) \tag{8} \]

where S refers to the sum of the rectangle's pixel values, with bottom right coordinates i and j and dimensions w (width) and h (height). The resulting magnitudes are then normalised by dividing the magnitude by the number of pixels within the rectangle. This will produce an average light intensity value between 0 and 255. In this paper red, green and blue SATs were created instead of one SAT for grey-scale (or light intensity). This allowed the CF-ELM to work with each individual colour band as required.

2.3. Colour-level co-occurrence matrix (CLCM)

The CLCM [16] (or grey level GLCM) is an image analysis tool based on light intensity that looks at the relationship between pixels and their nearest neighbour [26]. The CLCM has 3 or more possible variants, including first order, second order and third order. Each order describes a tabulation of different combinations of pixels. The first order is essentially a one dimensional histogram of pixel intensities; it does not look at the relationship between neighbouring pixels and can be used to determine statistics such as mean and variance [27]. Second order CLCMs look at the relationship between a pixel and one of its neighbours, creating a two dimensional array (or matrix) of combination occurrences. The second order can be used for texture analyses and will be utilised in this paper. Third or higher order CLCMs look at 2 or more of a pixel's nearest neighbours; these implementations are computationally more expensive and more difficult to interpret, and for this reason they were not used in this research. The initialisation of a second order co-occurrence matrix can be expressed [28]:

\[
P(i,j)=\sum_{x=1}^{n}\sum_{y=1}^{n}
\begin{cases}
1, & \text{if } I(x,y)=i \text{ and } I(x+\Delta x,\,y+\Delta y)=j\\
0, & \text{otherwise}
\end{cases}
\tag{9}
\]

where P(i, j) represents total events and is an individual element in the CLCM, n by n is the dimensions of a rectangular image, x and y are coordinates within the image and \(\Delta x\) and \(\Delta y\) are the distance from a pixel to its neighbour in the right direction (the same combinations are checked in the left direction as well); in this case \(\Delta x, \Delta y = 1\). For this research the intensity levels were reduced from 256 to 16 during initialisation of the CLCM. This was done to reduce the occurrence of nil values. After initialisation, values are normalised by dividing all values in the CLCM by all possible combinations in the CLCM. The total potential combinations can be calculated from the known dimensions as 2 x n x n, where 2 represents the left and right directions. Texture analysis values can then be calculated from the values in the CLCM matrix. Five statistics were chosen; these were selected as they were the least computationally intensive and provided good results in preliminary testing. They included energy, entropy, contrast, homogeneity and CLCM mean, which can be expressed [28,26]:

\[ \mathrm{Energy}=\sum_{i,j=0}^{D-1}P(i,j)^2, \tag{10} \]

\[ \mathrm{Entropy}=\sum_{i,j=0}^{D-1}-\ln(P(i,j))\,P(i,j), \tag{11} \]

\[ \mathrm{Contrast}=\sum_{i,j=0}^{D-1}P(i,j)\,(i-j)^2, \tag{12} \]

\[ \mathrm{Homogeneity}=\sum_{i,j=0}^{D-1}\frac{P(i,j)}{1+(i-j)^2}, \tag{13} \]

\[ \mathrm{CLCM\ Mean}=\sum_{i,j=0}^{D-1}i\,P(i,j) \tag{14} \]

where D is the number of values within the CLCM. In this research three CLCMs were created to match the three colour band requirements of the CF-ELM. These included a red level, a green level and a blue level co-occurrence matrix. To avoid confusion these will be referred to as colour-level co-occurrence matrices (CLCM), but it is noteworthy that the grey-level variant (GLCM) is more common in the literature.
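To make Eqs. (7) and (8) concrete, a minimal single-band sketch in C is given below: one pass builds the table and a four-corner lookup returns a block average. The routine and variable names are illustrative assumptions, not the authors' implementation.

```c
/* Minimal single-band summed area table sketch (Eqs. (7) and (8)).
 * Assumptions: 8-bit pixels in row-major order; names are illustrative. */

/* Build the SAT in one pass: each cell holds the sum of all pixels
 * above and to the left of it, inclusive. */
void build_sat(const unsigned char *pix, double *sat, int w, int h)
{
    for (int j = 0; j < h; j++)
        for (int i = 0; i < w; i++) {
            double left = (i > 0)          ? sat[j * w + (i - 1)]       : 0.0;
            double up   = (j > 0)          ? sat[(j - 1) * w + i]       : 0.0;
            double diag = (i > 0 && j > 0) ? sat[(j - 1) * w + (i - 1)] : 0.0;
            sat[j * w + i] = pix[j * w + i] + left + up - diag;
        }
}

/* Average intensity of the wr x hr block whose bottom-right corner is (i, j),
 * using the four-corner lookup of Eq. (8); corners outside the image count as 0. */
double block_average(const double *sat, int w, int i, int j, int wr, int hr)
{
    double a = sat[j * w + i];
    double b = (i - wr >= 0 && j - hr >= 0) ? sat[(j - hr) * w + (i - wr)] : 0.0;
    double c = (i - wr >= 0) ? sat[j * w + (i - wr)] : 0.0;
    double d = (j - hr >= 0) ? sat[(j - hr) * w + i] : 0.0;
    return (a + b - c - d) / (double)(wr * hr);
}
```

In the paper one such table is kept per colour band, so the same routines would simply run three times (Y', U and V, or R, G and B).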
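The CLCM tabulation of Eq. (9) and the statistics of Eqs. (10)–(14) can be sketched in the same spirit. The 16-level quantisation, the \(\Delta x = 1\) horizontal neighbour counted in both directions and the 2 x n x n normalisation follow the description above, while the identifiers are assumptions of this sketch and not taken from the authors' code.

```c
/* Second order co-occurrence matrix for one colour band of the SAT grid,
 * quantised to 16 levels, with the five texture statistics of Eqs. (10)-(14).
 * Sketch only; identifiers are not from the authors' code. */
#include <math.h>
#include <string.h>

#define LEVELS 16

typedef struct { double energy, entropy, contrast, homogeneity, mean; } clcm_stats;

/* grid holds n x n block averages in 0..255 for one colour band. */
clcm_stats clcm_features(const unsigned char *grid, int n)
{
    double P[LEVELS][LEVELS];
    memset(P, 0, sizeof P);

    /* Eq. (9): tabulate co-occurrences of quantised levels, Dx = 1,
     * checked in both the right and the left direction. */
    for (int y = 0; y < n; y++)
        for (int x = 0; x + 1 < n; x++) {
            int i = grid[y * n + x]     / (256 / LEVELS);
            int j = grid[y * n + x + 1] / (256 / LEVELS);
            P[i][j] += 1.0;   /* right neighbour */
            P[j][i] += 1.0;   /* left neighbour  */
        }

    /* Normalise by the total possible combinations, 2 x n x n as described above. */
    double total = 2.0 * n * n;
    clcm_stats s = {0.0, 0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < LEVELS; i++)
        for (int j = 0; j < LEVELS; j++) {
            double p  = P[i][j] / total;
            double d2 = (double)((i - j) * (i - j));
            s.energy      += p * p;                 /* Eq. (10) */
            if (p > 0.0) s.entropy += -log(p) * p;  /* Eq. (11) */
            s.contrast    += p * d2;                /* Eq. (12) */
            s.homogeneity += p / (1.0 + d2);        /* Eq. (13) */
            s.mean        += i * p;                 /* Eq. (14) */
        }
    return s;
}
```

Running this once per red, green and blue level matrix yields the three sets of five texture values that the CLCM variant feeds to the CF-ELMs, as described in Section 3.1.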
3. Methodology

All instances of the MEC-ELM were programmed and tested using the C programming language; C was chosen due to its portability and faster processing speeds, and it is also possible to use variants of C in many embedded devices [29]. All benchmarking was conducted using a laptop computer with a Linux based system, 16 gigabytes of RAM, a solid state drive and a 4th generation i7 mobile processor. All time benchmarking was conducted using the clock_gettime function with the CLOCK_MONOTONIC option from the time.h library. All images were stored as JPEG and decompressed using the jpegIO.h library (also known as libjpeg); before pre-processing, images were stored in memory as RGB values between 0 and 255. JPEG was chosen for storage and file transfer reasons, and it is also an output available in a number of remote camera interfaces. Images were stored using 4:4:4 sub-sampling, which is perceptually lossless, and were saved using the default 4:2:0 (lossy) setting in libjpeg for user feedback/external processing (smaller files). In agriculture this means that the lesser quality images are better prepared for the potential wireless transfer constraints of remote environments.

3.1. The algorithm

The algorithm is a multiple stage process based primarily on Sadgrove et al. [9,10] for the initialisation of the CF-ELM for remote computer based classification. These stages included (i) an initialisation stage, where the neural network is initialised into memory with initial random weights; (ii) a weight biasing stage, where the weights were biased to the training data, based on the CIW-ELM [30]; (iii) a training stage, where the output weights were analytically determined with assistance from the C LAPACK library [31]; (iv) a tuning stage, where optimal threshold values were found based on a tuning set of 100 images — this stage was also used to assist in an off-line training process, where optimal individual weight sets for each CF-ELM were saved in text based data files so that they could be imported instead of retraining; and (v) a testing stage, where the individual CF-ELMs were tested on the image frames. The weight biasing, training, tuning and testing stages differed for each of the two implementations and from the standard CF-ELM algorithm. These differences will be discussed. Of the two implementations of the MEC-ELM tested in this paper, the primary difference is that one utilises the CLCM and the other does not. This means that the CLCM MEC-ELM has CF-ELMs with just 5 inputs to match the 5 texture values, while the Y’UV MEC-ELM has inputs matching the number of grid blocks from the integral image. In both cases the grid blocks will be referred to as the SAT grid. In this implementation the RGB image is converted to Y’UV before being processed as a SAT grid.

• Frame Extraction: prior to testing, each video described in Section 3.2 was extracted into individual frames. The frames were extracted from MPEG video format to JPEG using the FFmpeg package available to bash in Linux. The frames were extracted at two frames per second to match the expected processing time of the algorithm, although this did not take into account the CLCM and tracking algorithms. The video used for testing in the case of the cattle aerial footage was only 18 s long; for this reason six frames per second were extracted instead. To elaborate, the extraction of two frames per second was deemed adequate for object detection in remote environments, but the emphasis in this paper was on evaluating the effectiveness of the algorithm and for this reason more frames were required.
• The SAT Grid: during the training, testing and tuning processes, all images were first converted to a summed area table. The SAT was then used to extract an average colour grid of the image or target area. There was no limit on the number of coloured blocks within the grid, but in this research between 25 and 100 blocks were used, depending on how successful classification was at each block number. As processing speed was a priority, lower block numbers were preferred. For the Y’UV variant of the MEC-ELM these values were sent directly to each CF-ELM for processing. This is on display in Fig. 2, where 3SAT is the three individual summed area tables for the colour bands Y', U and V. There is also an example of an image converted to a SAT grid with 196 blocks in Fig. 3.
• Scaling: each frame was read as a selection of rectangles. The rectangle size was produced by dividing the width and height of the image frame by a number determined to be effective in pretesting. At each iteration the search rectangle was moved through the image frame (typically by 10–20 pixels at a time). To search with a larger or smaller rectangle the dividing number was increased or decreased, causing a scale change.
• The CLCM: the block RGB values extracted from the SAT were processed into a CLCM for use with the CLCM variant of the MEC-ELM. During this process the red, green and blue level co-occurrence matrices were produced and from each an energy, entropy, contrast, homogeneity and CLCM mean value were produced. The five values were then sent to the five inputs of each CF-ELM for processing. A diagram is on display in Fig. 4. Note the extra stage 3CLCM, where the R, G and B SATs are used to make 3 CLCMs for the texture values energy, entropy, contrast, homogeneity and CLCM mean.

Fig. 2. Diagram of the standard Y’UV variant of the MEC-ELM.

Fig. 3. Left image is a SAT grid of the image on the right with 196 blocks.
Fig. 4. Diagram of the CLCM variant of the MEC-ELM.

• Object Detection: to improve the precision and recall rates of the algorithm, areas of interest were saved every time a tested rectangle within the image frame produced a positive classification. If an area produced multiple classifications at multiple scales and in close proximity, a square was drawn in the area based on the saved coordinates. Each classification was stored with an (x, y) coordinate for the top right hand corner and the bottom left hand corner of the rectangle. A classification was considered to be in the same area if the coordinates of a tested rectangle were within a percentile area of previously saved coordinates. The number of classifications that triggered a square to be drawn differed for each test set; typically, if 1–10 detections of the target object were found in an area, this was a good indication that something was there. Each search area was within a 10–100% radius of the first classified rectangle.
• Tracking Assist: to assist with tracking objects across multiple frames, a single neuron extreme learning machine (SNELM) was utilised, with the output of the SAT used as its input. The SNELM was initialised based on just one image and the RGB values were reduced to a single 16 bit colour value [33]. Preliminary testing with grey-level values did not improve results. The training image was arbitrarily selected from the tuning set. After an object is detected the base SNELM is updated to reflect the detected object; in the proceeding frame the SNELM is then primed to detect the same or similar object. As training is based on a single image, the pseudo inverse is not required. This can be expressed (a minimal code sketch follows this list):

\[ \beta(i)=\frac{T}{(65536\,R_i+256\,G_i+B_i)\cdot C} \tag{15} \]

where R, G and B are the arrays of colour values from the grid extracted from the SAT and C is the number of grid blocks. \(\beta\) is one dimensional and serves as an input weight container, with length equal to C, and T contains only one value, as only one image is used in training; this value is a target and can be set to 1. Processing an object can then be expressed:

\[ \mathrm{Output}=\sum_{i=0}^{C-1}\beta(i)\,(65536\,R_i+256\,G_i+B_i) \tag{16} \]

The output is then checked against a predefined threshold (typically between 0.1 and 1).
3.2. Datasets

Four datasets were used in training, tuning and testing. The number of images in each dataset is listed in Table 1. The tuning set contained 50 images of the object from the training set and 50 images of random surrounding terrain from the test frames. All images used in testing and tuning were 100 by 100 pixels in resolution. This resolution was chosen based on a trade off between the higher precision of high resolution images and an improvement in processing time; high resolution images were not considered important due to the SAT grid producing a low resolution version of each image. The resolution of the video frames is based on the resolution of the original pre-recorded video. The four datasets were chosen based on potential use in the agriculture industry, with a particular emphasis placed on unmanned aerial vehicle (UAV or drone) technology. The datasets included a weed detection scenario involving Cirsium vulgare (or Bull Thistle), two cattle detection scenarios and an all terrain vehicle (ATV or quad bike) scenario. The Bull Thistle scenario was chosen as thistle competes with other more valuable pasture, which is a potential problem for livestock; detection of the thistle allows for both weed surveillance and targeted spraying, saving money, pasture and the environment. The two cattle detection scenarios involved cattle recorded by a river bed with a stationary surveillance camera and cattle recorded by a drone hovering over green pasture. The cattle detection scenario displays the algorithm's potential as a cattle tracking and counting algorithm. The quad bike scenario involved a rider driving through rural countryside; vehicle detection can be used in both accident detection and prevention. The technical aspects of each dataset are listed:

Table 1
Number of images in each dataset, with resolution of video frames.

Dataset             Training   Tuning   Testing   Resolution
Bull Thistle        688        100      120       2000 x 1500
Cattle Drone        1331       100      110       1920 x 1080
Cattle Stationary   500        100      120       640 x 480
ATV                 608        100      144       1280 x 720

• Bull Thistle: this dataset was sourced in two different ways. The training set was photographed using a 10 mega-pixel Fujifilm hand held camera on default settings and at a fixed distance of 2 m to simulate aerial surveillance. The images were captured in JPEG format at a resolution of 3648 by 2736 before cropping to 100 by 100. The video used to produce the testing frames was captured by a DJI Phantom quadcopter in MP4 for 60 s at 4069 by 2178 pixels, at a distance of 10 m and on default settings. Both were captured from pasture fields on the University of New England SMART Farm (Longitude 151° 35 min 40 s E, Latitude 30° 26 min 09 s S), in sunny conditions between midday and late afternoon.
• Cattle Drone: this data was sourced from van Gemert et al. [34], allowing comparison of the results of the MEC-ELM and the Exemplar Support Vector Machines (Exemplar-SVM) presented there. The video footage was recorded using an AscTec Pelican quadcopter with a mounted GoPro 3: Black Edition action camera, in overcast conditions and at varying distances (around 30–50 m). The video was recorded at 1920 by 1080 pixels at 60 frames per second. The video selected for testing was 18 s long. The training data was made up of cropped and resized images from the remaining videos in the dataset. To extend the training dataset, images were flipped vertically using a bash command.
• Cattle Stationary: the dataset is a collection of surveillance videos taken from a nearby farm using a Scoutguard SG860C camera at 640 by 480 pixels, at 16 frames per second for 60 s, in AVI format (before conversion to MP4). The videos included Poll Hereford cattle surrounding a river bed and other areas for grazing. The training set was made up of images cropped from some of these videos; a video not used to generate the training images was used as the test video. Each video was captured at different times of day and in different weather conditions. The distance to the cow depended on how close they got to the camera.
• ATV: the video of the quad bike rider was captured using a DJI Phantom 3 Adv in areas of Lough Coolin and Mount Gable in Ireland, in overcast conditions and at varying distances (around 20–50 m). Precise specifications of the video were not available. The video was downloaded from Youtube for fair use in MP4 format and at a resolution of 1280 x 720 pixels. The video was submitted to Youtube by the user and channel name "JGK Drone" [35]. As only one video was available, the training and test sets were made up of different sections of the 3 min 37 s video; exactly 144 frames or 1 min and 12 s of video were used for the testing sequences and the rest was used as the training set. Areas that were overly dark or shadowy were avoided for reasons that will be explored in the discussion, although these were still used in the training dataset. Frames used in testing were cropped to surround the quad bike. Images were flipped vertically using bash to increase the number of training images. Some training images overlapped slightly with the test set.

4. Results

The results from each dataset are split, with the Y’UV input results in Table 2 and the CLCM input results in Table 3. The tables include the name of each dataset, the number of blocks used in the SAT grid, the average accuracy of individual CF-ELMs achieved in the tuning phase, the average time taken in testing per frame, and the precision and recall rates for all frames. Accuracy, precision and recall can be expressed:

\[ \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \tag{17} \]

\[ \mathrm{Precision}=\frac{TP}{TP+FP}, \tag{18} \]

\[ \mathrm{Recall}=\frac{TP}{TP+FN}, \tag{19} \]

where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives.

The results in Tables 2 and 3 indicate better overall performance in the cattle and ATV datasets for the MEC-ELM with CLCM texture inputs. The MEC-ELM with Y’UV inputs performed best on the thistle dataset, with a precision rate of 98% and recall at 84%. The Y’UV variant produced much faster processing times, with all datasets processing in around half a second to a second per frame. The CLCM variant took between 1 and 2 s longer due to the time needed to process the CLCM and obtain the texture values. The Y’UV variant did not perform very well in the cattle datasets; in both cases it required more grid blocks and recorded many false positives. This included the reflection of the cattle in the water in the case of the stationary camera, and a large number of false positives in the tree line and on the helipad in the drone dataset; this dataset was also the slowest of the four because of the multiple scales required to get true positives. The CLCM variant recorded a number of false positives on the helipad as well, particularly in the last ten frames; discarding the last ten frames puts the precision at 84%. Sample output frames from each of the 4 datasets are in Fig. 5, where B (cattle stationary) and D (bull thistle) are from the Y’UV variant and A (quad bike) and C (cattle drone) are from the CLCM variant. Note the detection of the cow reflection in B; this problem did not occur in the CLCM variant.

In Fig. 6 there is a receiver operator characteristic (ROC) curve depicting the average classification rates for different sized SAT grids, with area under the curve (AUC). There were four different sized SAT grids tested and these results were conducted using the tuning images, with 50 images of the target object and 50 images of surrounding landscape. For this test the thistle dataset with the Y’UV variant was chosen as a demonstration dataset, as the difference between true positive and false positive rates appeared most obvious. The test was conducted once with SAT grids of sizes 9, 25, 100 and 2500. The four MEC-ELMs were trained and tested on the tuning images and the average TP and FP rates from the four were saved; the average rates were then used to make the ROC curve. As can be seen, the larger the SAT grid the better the results, with the best results coming from the SAT grid with 2500 blocks. Notably the results are not as good as the tuning accuracies found in the result Tables 2 and 3. The testing in this case was based on an on-line training method where all weights were randomised and tested immediately after, whereas the MEC-ELMs used to get the precision and recall rates were trained using an off-line method, where the CF-ELMs were initialised a number of times and the weights for the CF-ELM producing the best results were saved in a text based file [9]. This meant the results in Fig. 6 were based on the average of randomly trained classifiers, rather than the best of 100 classifiers. In further testing, processing with 9 blocks took an average of 0.35 s per frame, 25 blocks took 0.44 s per frame, 100 blocks took 1.1 s per frame and 2500 blocks took around 67 s per frame.
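For completeness, the metrics of Eqs. (17)–(19) reduce to a few lines of C once true/false positive and negative counts have been accumulated over the test frames; the helper below is illustrative only and not taken from the authors' code.

```c
/* Evaluation metrics from Eqs. (17)-(19); counts accumulated over all frames. */
typedef struct { long tp, tn, fp, fn; } counts;

double accuracy(counts c)  { return (double)(c.tp + c.tn) / (double)(c.tp + c.tn + c.fp + c.fn); }
double precision(counts c) { return (double)c.tp / (double)(c.tp + c.fp); }
double recall(counts c)    { return (double)c.tp / (double)(c.tp + c.fn); }
```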

Table 2
Testing results for each dataset using Y’UV inputs, based on all frames of the test videos, with accuracy detected during the tuning stage.

Dataset (Y’UV)      SAT grid blocks   Accuracy in tuning   Time/frame   Precision   Recall
Bull Thistle        25                97%                  0.44 s       98%         84%
Cattle Drone        100               81%                  1.19 s       23%         53%
Cattle Stationary   100               89.25%               0.53 s       84%         78%
ATV                 25                93%                  0.49 s       91%         78%
Table 3
Testing results for each dataset using the CLCM texture values, based on all frames of the test videos, with accuracy detected during the tuning stage.

Dataset (CLCM)      SAT grid blocks   Accuracy in tuning   Time/frame   Precision   Recall
Bull Thistle        25                91%                  1.2 s        84%         81%
Cattle Drone        25                93.5%                1.9 s        78%         93%
Cattle Stationary   25                94%                  1.3 s        85%         84%
ATV                 25                90%                  2.3 s        84%         91%

Fig. 5. Sample result from detection in each dataset.

Fig. 6. ROC curve measuring the average classification rates for different sized SAT grids.

5. Discussion

Multiple expert or cascading approaches to object detection are often limited to rudimentary techniques, mostly due to time constraints, particularly in the pursuit of real time results. This paper has presented a multiple-expert colour feature extreme learning machine (MEC-ELM) for real time feature extraction and object classification. The algorithm allowed real time results by adopting a summed area table and hence converting the images into a low resolution and multiple scale environment. This allowed the MEC-ELM to process video frames and classify objects in less than a second per frame for Y’UV inputs and in 1–2 s in the case of the CLCM implementation on a standard mobile CPU. The multiple scale environment gave the MEC-ELM the ability to process at multiple scales per second, with 2–4 different scales used in each of the datasets. This means that objects can be detected up close, far away or in smaller forms (a calf, or thistle in rosette). Another advantage of using the SAT grid is the ability to produce low resolution images. Although this would be a problem for classifiers that require high resolution images, a cascading approach such as the MEC-ELM can improve accuracy by forming a consensus among experts. The SAT grid is essentially a naive scale [36] and at 25 blocks per image the resolution would be considered quite low. There are many advantages to this: by reducing sections of the image into blocks there is the possibility of reducing noise in the image, and by reducing the size of each frame, the video feed could be processed at a much lower resolution. This would decrease processing times by a considerable amount. It would also have an impact on hardware restrictions; that is, processing smaller images would lessen the memory requirements for on board computing and allow faster transfer speeds while lessening storage issues.

Low resolution images have some drawbacks; loss of pixel information, for example, can be a serious problem. Efforts should be made to preserve and/or retrieve some of this lost data and this is the reason the CLCM variant of the MEC-ELM was proposed. The CLCM was slower than the Y’UV variant, but managed to alleviate some of the problems found in the cattle datasets. The Y’UV variant, although very good at detecting objects where colour is of some importance (the red quad bike and green thistle), struggled to detect multi-coloured objects, such as black and white cows, and delivered a lot of false positives. The CLCM removed many of the false positives and was able to detect multiple colour objects with little difficulty, in 1–2 s at multiple scales. This result was an improvement when compared to similar solutions in the literature. The "Verschoor Aerial Cow Dataset" was previously tested at 32–144 s per frame and at around 66% precision and 87% recall [34]. The CLCM variant, although it was only tested on one video from the dataset, produced 78% precision and 94% recall in 1.9 s. These times make the MEC-ELM comparable to the fast processing Yolo object detection framework, which has processing times of around 6–12 s per frame [37] on a standard CPU. It is however difficult to compare at this stage, as Yolo is commonly benchmarked on a high end GPU and the above citation was programmed in Python and not C; a comparison between C and Python is available in Fourment and Gillings [38]. This paper has limited itself to a mobile CPU for the purpose of field based analysis. It is conceivable however that GPUs will become more commonplace in the agricultural environment. Conveniently, the SAT grid is quite adaptable to different resource constraints and, as the availability of hardware resources increases, the number of blocks used could also increase. This will increase the accuracy of the algorithm as the constraints are lessened. At lower resolutions there would still be an advantage in using a GPU, allowing processing of much larger areas while keeping processing times down.

Notably, the results on the datasets were not fully balanced; the differences between precision and recall in some of the datasets, for example, could have been much closer. This could be achieved with more precise tuning or a well established tracking algorithm [39,40]. The tracking algorithm developed was minimal to save processing time; future research could explore more sophisticated tracking algorithms [41]. Another issue that could be resolved is the problem of shadows or overcast conditions. The quad bike video, for example, went through a lot of transitional lighting and classification would suffer during these phases. This can depend on the training set of course, but the incorporation of illumination invariant or illumination robust images would likely further improve the performance of the algorithm.

The results demonstrate the ability of the MEC-ELM as a low resolution real-time object detection algorithm for drone and
surveillance video capture in the agriculture industry. The choice of the best algorithm depends on the dataset, with the Y’UV variant performing better on weeds and the CLCM variant performing better in cattle detection, while both performed well in ATV detection. The choice of datasets and capture methods gives a good indication of how the algorithms will perform in a live scenario. The algorithms were benchmarked on a laptop, while real-time wireless transfer has already been established [42]. Calibrating the algorithm for live use should now be a process of choosing the right frame rate and compression methods. JPEG images were chosen for this reason, as they produce smaller file sizes for image transfer. Global positioning systems (GPS) can be incorporated into drone and surveillance equipment to give location feedback.

The usefulness of the algorithms can be exemplified in each of the chosen datasets. In weed detection, the algorithm could be used to locate infestations quickly; a farm hand or ground based robot could then be used to deliver chemical to affected areas. Delivering chemicals to precise areas rather than entire paddocks would save chemical and reduce cost. The cattle detection scenario could be used to track and count cattle in an unobtrusive way. ATV detection could be used for vehicle safety monitoring, locating missing or injured persons and in collision avoidance systems. In each case the algorithm could be used in a stand alone device such as a drone, unmanned ground vehicle (UGV) or stationary surveillance camera.

This paper has demonstrated through quantifiable properties the MEC-ELM's potential as a real time object detection algorithm for on board and/or remote computer based technology. The datasets used placed particular emphasis on drones and displayed the algorithm's ability to perform with aerial video data. This is particularly important given the convenience of drones, with their ability to fly long distances over difficult terrain and report back results in real time [42]. This point is particularly important in agriculture, but the MEC-ELM was designed to be a generic approach to object detection in complex environments and for this reason could prove useful in many different scenarios, particularly where fast processing is required.

6. Conclusion

Two implementations of the multiple expert colour feature extreme learning machine have been benchmarked as real time feature extraction and object classification algorithms. These included a low resolution Y’UV colour implementation and a CLCM texture based implementation. The Y’UV implementation produced superior results on the datasets where colours were uniform and a defining characteristic, including weed detection with a precision of 98% and a recall of 84%, while producing lower accuracy on multiple coloured datasets, such as cattle detection. The CLCM variant was able to produce consistent results through all the datasets and higher results overall in three, including quad bike detection with 84% precision and 91% recall, cattle detection from a stationary camera at 85% precision and 84% recall, and cattle detection from a drone at 78% precision and 93% recall. The Y’UV variant was able to process video frames in half a second for three of the datasets and just over 1 s for cattle detection from a drone; the CLCM processing times were between 1 and 2 s. These processing times indicate the algorithm's ability to process images within a time frame suitable for agricultural robotics applications, which is particularly notable given its ability as a low resolution classifier. Future research will involve testing the algorithm on an embedded device attached to a drone or land vehicle and testing the algorithm in different lighting conditions using different illumination equalisation techniques. There is also potential for testing the MEC-ELM on a GPU for a comparison to similar real-time object detection algorithms.

Acknowledgements

Dr Paul Meek, NSW Department of Primary Industries, for the provision of cattle surveillance data. Animal Ethics UNE AEC12-042, collected under scientific licence Sci Lic SL 100634.
Mr Andrew Rieker of V-TOL Aerospace PTY Limited, for providing footage of weeds using their quadcopter.
Mr Paul Arnott, UNE Kirby Smart Farm, for access to paddocks for photography of weeds.
Mr E. Sadgrove is supported by an Australian Postgraduate Award.

References

[1] T. Berge, S. Goldberg, K. Kaspersen, J. Netland, Towards machine vision based site-specific weed management in cereals, Comput. Electron. Agric. 81 (2012) 79–86.
[2] M.S. Uddin, A.Y. Akhi, Horse detection using Haar like features, Int. J. Comput. Theory Eng. 8 (5) (2016) 415–418.
[3] J. Wawerla, S. Marshall, G. Mori, K. Rothley, P. Sabzmeydani, Bearcam: automated wildlife monitoring at the arctic circle, Mach. Vis. Appl. 20 (2009) 303–317, doi: http://dx.doi.org/10.1007/s00138-008-0128-0.
[4] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 779–788.
[5] T. Malisiewicz, A. Gupta, A.A. Efros, Ensemble of Exemplar-SVMs for object detection and beyond, International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 89–96.
[6] P.M. Pandiyan, C. Hema, P. Krishnan, S. Radzi, Colour recognition algorithm using a neural network model in determining the ripeness of a banana, International Conference on Man-Machine Systems (ICoMMS), Batu Ferringhi, Malaysia, 2009, pp. 2B71–2B74.
[7] J. Xu, H. Zhou, G.-B. Huang, Extreme learning machine based fast object recognition, 15th International Conference on Information Fusion (FUSION), IEEE, Singapore, 2012, pp. 1490–1496.
[8] X. Chen, H. Mulam, An implementation of faster RCNN with study for region sampling, Tech. rep., Cornell University, 2017.
[9] E.J. Sadgrove, G. Falzon, D. Miron, D. Lamb, Fast object detection in pastoral landscapes using a multiple expert colour feature extreme learning machine, The International Tri-Conference for Precision Agriculture (PA17), 1st Asian-Australia Conference on Precision Pastures and Livestock Farming (2017).
[10] E.J. Sadgrove, G. Falzon, D. Miron, D. Lamb, Fast object detection in pastoral landscapes using a colour feature extreme learning machine, Comput. Electron. Agric. 139 (2017) 204–212.
[11] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[12] R. Moreno, F. Corona, A. Lendasse, M. Graña, L.S. Galvão, Extreme learning machines for soybean classification in remote sensing hyperspectral images, Neurocomputing 128 (2014) 207–216, doi: http://dx.doi.org/10.1016/j.neucom.2013.03.057. http://www.sciencedirect.com/science/article/pii/S0925231213010102.
[13] S. Malek, Y. Bazi, N. Alajlan, Efficient framework for palm tree detection in UAV images, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7 (12) (2014) 4692–4703.
[14] G. Facciolo, N. Limare, E. Meinhardt-Llopis, Integral images for block matching, Image Process. On Line 4 (2014) 344–369.
[15] A. Mohamed, A. Issam, B. Mohamed, B. Abdellatif, Real-time detection of vehicles using the Haar-like features and artificial neuron networks, Procedia Comput. Sci. 73 (2015) 24–31.
[16] M. Benco, R. Hudec, S. Matuska, M. Zachariasova, One-dimensional color-level co-occurrence matrices, Elektro, IEEE, 2012, doi: http://dx.doi.org/10.1109/ELEKTRO.2012.6225600.
[17] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, Proceedings 2004 IEEE International Joint Conference on Neural Networks, vol. 2 (2004) 985–990.
[18] Ö.F. Ertugrul, Y. Kaya, A detailed analysis on extreme learning machine and novel approaches based on ELM, (2015). http://www.openscienceonline.com/journal/ajcse.
[19] G.S.S. da Gomes, T.B. Ludermir, L.M.M.R. Lima, Comparison of new activation functions in neural network for forecasting financial time series, Neural Comput. Appl. 20 (3) (2011) 417–439.
[20] M. Loesdau, S. Chabrier, A. Gabillon, Hue and saturation in the RGB colour space, 6th International Conference, Vol. 8509 of Lecture Notes in Computer Science, Cherbourg, France, 2014, pp. 203–212.
[21] Studio Encoding Parameters of Digital Television for Standard 4:3 and Wide-screen 16:9 Aspect Ratios: Recommendation ITU-R BT.601-6 (Question ITU-R 1/6), International Telecommunication Union, 2007. URL https://books.google.com.au/books?id=NpAwPwAACAAJ.
[22] S. Bashbaghi, E. Granger, R. Sabourin, G.-A. Bilodeau, Ensembles of exemplar-SVMs for video face recognition from a single sample per person, 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (2015) 1–6.
[23] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Process. Lett. 17 (8) (2010) 754–757.
[24] P. Viola, M. Jones, Robust real-time object detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[25] R. Wang, AdaBoost for feature selection, classification and its relation with SVM, a review, Phys. Procedia 25 (2012) 800–807.
[26] A.H. Bishak, Z. Ghandriz, T. Taheri, Face recognition using a co-occurrence matrix of local average binary pattern (CMLABP), Cyber J. Multidiscip. J. Sci. Technol. J. Sel. Areas Telecommun. (JSAT) (2012) 15–19.
[27] G.N. Srinivasan, G. Shoba, Statistical texture analysis, World Academy of Science, Engineering and Technology, vol. 36 (2008) 1264–1269.
[28] A. Eleyan, H. Demirel, Co-occurrence matrix and its statistical features as a new approach for face recognition, Turk. J. Elect. Eng. Comput. Sci. 19 (1) (2011) 97–107.
[29] J. Leskela, J. Nikula, M. Salmela, OpenCL embedded profile prototype in mobile device, Signal Processing Systems, IEEE, 2009, doi: http://dx.doi.org/10.1109/SIPS.2009.5336267.
[30] J. Tapson, P. de Chazal, A. van Schaik, Explicit computation of input weights in extreme learning machines, International Conference on Extreme Learning Machines, Vol. 1 of Algorithms and Theories, Springer, Switzerland, 2015, pp. 41–49.
[31] Netlib.org, The LAPACKE C interface to LAPACK, 2013, November. http://www.netlib.org/lapack/lapacke.html.
[32] Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios, Tech. rep., International Telecommunications Union, Electronic Publication, Geneva, 2015.
[33] S. Hartig, Basic image analysis and manipulation in ImageJ, Chapter 14, (2013) Unit 14.15.
[34] J.C. van Gemert, C.R. Verschoor, P. Mettes, K. Epema, L.P. Koh, S. Wich, Nature conservation drones for automatic localization and counting of animals, European Conference on Computer Vision Workshop, ECCV Workshop on Computer Vision in Vehicle Technology (2014) 255–270.
[35] JGK Drone, Quad biking drone footage, 2016, November. https://www.youtube.com/watch?v=x5I-y7bWD9E.
[36] D.E. Culler, J.P. Singh, A. Gupta, Workload-driven evaluation, Parallel Computer Architecture: A Hardware/Software Approach, 1st ed., (1998), pp. 266.
[37] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, (2016, December).
[38] M. Fourment, M. Gillings, A comparison of common programming languages used in bioinformatics, BMC Bioinf. 9 (2008) 82.
[39] D. Ghosh, N. Kaabouch, A survey on image mosaicing techniques, J. Vis. Commun. Image Represent. 34 (2016) 1–11.
[40] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision (IJCAI), Proceedings of the 7th International Joint Conference on Artificial Intelligence, vol. 81 (1981).
[41] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848, doi: http://dx.doi.org/10.1109/TPAMI.2014.2388226.
[42] G. Zhou, C. Li, P. Cheng, Unmanned aerial vehicle (UAV) real-time video registration for forest fire monitoring, Vol. 3 of Geoscience and Remote Sensing Symposium, IEEE International, Seoul, South Korea, 2005.

You might also like