Edmund J. Sadgrove, Greg Falzon, David Miron, David W. Lamb

Computers in Industry
journal homepage: www.elsevier.com/locate/compind
https://doi.org/10.1016/j.compind.2018.03.014
Article history:
Received 31 August 2017
Received in revised form 6 March 2018
Accepted 15 March 2018
Available online xxx

Keywords:
Extreme learning machine
Object detection
Machine vision
Unmanned aerial vehicle
Agriculture
Robotics

Abstract

It is necessary for autonomous robotics in agriculture to provide real time feedback, but due to a diverse array of objects and a lack of landscape uniformity this objective is inherently complex. The current study presents two implementations of the multiple-expert colour feature extreme learning machine (MEC-ELM). The MEC-ELM is a cascading algorithm that has been implemented alongside a summed area table (SAT) for fast feature extraction and object classification, giving a fully functioning object detection algorithm. The MEC-ELM is an implementation of the colour feature extreme learning machine (CF-ELM), which is an extreme learning machine (ELM) with a partially connected hidden layer, taking three colour bands as inputs. The colour implementation used with the SAT enables the MEC-ELM to find and classify objects quickly, with 84% precision and 91% recall in weed detection in the Y'UV colour space and in 0.5 s per frame. The colour implementation is, however, limited to low resolution images, and for this reason a colour level co-occurrence matrix (CLCM) variant of the MEC-ELM is proposed. This variant uses the SAT to produce a CLCM and texture analyses, with the texture values processed as inputs to the MEC-ELM. This enabled the MEC-ELM to achieve 78–85% precision and 81–93% recall in cattle, weed and quad bike detection, in times between 1 and 2 s per frame. Both implementations were benchmarked on a standard i7 mobile processor. The results presented in this paper demonstrate that the MEC-ELM with SAT grid and CLCM is an ideal candidate for fast object detection in complex and/or agricultural landscapes.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

Agriculture systems require autonomous robotics for weed spraying, livestock detection and vehicle safety. The ability to detect and process objects quickly is a desire of many of these systems. Agricultural scenarios can be exceedingly complex compared to other industrial robotics systems: object detection algorithms in agriculture may compete with any number of structurally similar or diverse objects. This makes for a complex environment with the potential for many false positives and false negatives. A potential solution to this problem is to adopt a cascading or multiple expert approach. These types of solutions have varying levels of success in both flora and fauna detection and there are numerous implementations, ranging from rudimentary to more complex. The simpler approaches include colour and texture detection, as well as broad template matching, and have been implemented for weed, horse and wildlife detection [1–3]. The more complex solutions use a wide range of techniques, which can include large data structures such as deep or exemplar neural networks and varying levels of texture and shape based analyses; examples include detection of the ripeness of bananas and other generic object detectors [4–7].

The approaches discussed often rely on grey-scale images and, in many cases, on the detection of prominent features or stand-out colour attributes. Processing speed is the key advantage in this case. Deep learning architectures are much slower and more complex and for this reason are often avoided in favour of real time solutions. Notably, with a high-end GPU it is possible to process a large number of frames per second [8], but in remote mobile computing GPUs might not be attainable. A disadvantage of many approaches is the reliance on prominent key characteristics in the classification process. This often leads to poor overall classification accuracy, particularly in more complex scenarios: in weed detection there may not be any prominent features; in cattle detection there may be variations in colour within the same breed; and in other cases an object's features may appear different at different angles or rotations.
The goal of this research is then to explore methods that can be used to deliver both fast and accurate feature extraction and object classification. For this, an implementation of the multiple-expert colour feature extreme learning machine (MEC-ELM) [9] is proposed. The MEC-ELM is a cascading implementation of the colour feature extreme learning machine (CF-ELM) [10], which is itself an implementation of the extreme learning machine (ELM) [11] for colour object detection. The ELM, in part due to its efficient implementation and fast processing speeds, has demonstrated suitability for computer vision based problems in agricultural scenarios, including soybean classification [12] and unmanned aerial vision for palm tree detection [13], and it has been benchmarked against notable feature extraction techniques [7]. The MEC-ELM can be used as both a feature extraction and classification technique. This can be achieved by adopting the summed area table (SAT) [14] (or integral image) and thereby reducing a landscape image (or video frame) to a grid of coloured blocks. The purpose of this is to provide a generic approach to Haar features [15] and hence take advantage of the SAT's fast, multi-scale feature extraction architecture. Fast processing may require low resolution image data and this can result in a potential loss of pixel based information. To meet this challenge, the output of the SAT grid is used to generate three colour level co-occurrence matrices (CLCM) [16], one for each colour band (red, green and blue). The outcome is a texture based analysis, with the values provided as input to each CF-ELM. Both implementations of the MEC-ELM were built and their processing speeds and overall accuracy compared. The algorithm is designed with the objective of fast object detection in complex and unpredictable terrain, making it an ideal candidate for use in the agriculture industry. The objective in this case is to implement the MEC-ELM as a weed detector for spraying, for unobtrusive cattle tracking and as a vehicle interaction and avoidance tool. This paper will benchmark the MEC-ELM with pre-recorded video data, for eventual use in a remote laptop interface for interaction with an unmanned aerial vehicle (UAV or drone) and stationary surveillance devices.
2. Theory/calculation

2.1. Multiple-expert colour feature extreme learning machine (MEC-ELM)

The ELM is a single layer, feed forward neural network that is known for its fast and analytical training phase. In this phase the output of the neural network's hidden layer is stored in a matrix designated H and the output weights are then determined analytically. This can be expressed [17,18]:

\[
H(W_1, \ldots, W_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, x_1, \ldots, x_N) =
\begin{bmatrix}
g(W_1 \cdot x_1 + b_1) & \cdots & g(W_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\
\vdots & \ddots & \vdots \\
g(W_1 \cdot x_N + b_1) & \cdots & g(W_{\tilde{N}} \cdot x_N + b_{\tilde{N}})
\end{bmatrix}_{N \times \tilde{N}}
\tag{1}
\]

where H is the hidden layer output matrix, N is the number of samples used in the training phase and Ñ is the number of neurons in the hidden layer. In the activation function g() [19], W is the input weight, x is the input sample pixels and b is the bias.

The colour feature extreme learning machine (CF-ELM) is similar in architecture to the ELM, comprising a single layer, feed forward neural network with a partially connected hidden layer and a fully connected output layer. The hidden layer is divided into 3 sections and this gives the CF-ELM the ability to be used with different colour models, including red, green, blue (RGB); luminance, chrominance red, chrominance blue (Y'UV); and hue, saturation, value (HSV) [20]. Y'UV (also known as YCrCb) is defined by the International Telecommunications Union as ITU-R BT.601 [32]. Equivalent to the standard ELM, the CF-ELM uses randomly assigned weights in the hidden layer and, by using the pseudo inverse, it can analytically determine the output weights of a fully connected output layer. By dividing the hidden layer into 3 sections, each colour attribute can be processed in a separate section of the hidden layer, and it is for this reason that the number of neurons in the hidden layer must be a multiple of 3. Storing the output of these 3 sections into 3 sections of the H matrix allows the matrix to be used to determine the output weights in the same way as the standard ELM. For Y'UV the CF-ELM hidden layer can be expressed [10]:

\[
H(W_1, \ldots, W_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, Y'_1, \ldots, Y'_N, U_1, \ldots, U_N, V_1, \ldots, V_N) =
\begin{bmatrix}
g(W_1 \cdot Y'_1 + b_1) & \cdots & g(W_{\tilde{N}} \cdot Y'_1 + b_{\tilde{N}}) \\
\vdots & \ddots & \vdots \\
g(W_1 \cdot Y'_N + b_1) & \cdots & g(W_{\tilde{N}} \cdot Y'_N + b_{\tilde{N}}) \\
g(W_1 \cdot U_1 + b_1) & \cdots & g(W_{\tilde{N}} \cdot U_1 + b_{\tilde{N}}) \\
\vdots & \ddots & \vdots \\
g(W_1 \cdot U_N + b_1) & \cdots & g(W_{\tilde{N}} \cdot U_N + b_{\tilde{N}}) \\
g(W_1 \cdot V_1 + b_1) & \cdots & g(W_{\tilde{N}} \cdot V_1 + b_{\tilde{N}}) \\
\vdots & \ddots & \vdots \\
g(W_1 \cdot V_N + b_1) & \cdots & g(W_{\tilde{N}} \cdot V_N + b_{\tilde{N}})
\end{bmatrix}_{3N \times \tilde{N}}
\tag{2}
\]

where Y', U and V are the individual colour pixel matrices for each image and are stored in the H matrix at the output of the hidden layer. The hidden layer process is repeated in the output layer, with the hidden layer output H becoming the input multiplier for the output weights β. The output T of the CF-ELM is then the result of Hβ:

\[
T = H\beta \tag{3}
\]

Here β can be expressed:

\[
\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \tag{4}
\]

where m is the number of neurons in the output layer, which is equivalent to the number of outputs of the ANN. The matrix of target outputs T can be expressed as:

\[
T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \tag{5}
\]

where each t_N is set based on the input training sample and its desired output. This leaves β as the one unknown; making β the subject gives:

\[
\beta = H^{\dagger} T \tag{6}
\]

where H† is the Moore–Penrose pseudo inverse of matrix H. The output values of this process are then stored in β and used as the weights in the output layer, removing the need for a long gradient descent based training process.

The MEC-ELM is then a set of CF-ELMs, where each CF-ELM can be trained on a different set of sample images, a different colour system or different image analysis techniques. The MEC-ELM becomes a global consensus of all CF-ELMs or experts; the goal is then to find individual CF-ELMs of high classification accuracy but with varying consensus [21]. The method used in this paper was inspired by the Exemplar SVM [22] and the Ensemble ELM (EN-ELM) [23], where training samples are divided among different instances of CF-ELMs. The training phase of the MEC-ELM is depicted in Fig. 1.
2.2. Summed area table (SAT)

The SAT [14] (or integral image) allows sub blocks of the image to be evaluated very quickly; only four sets of coordinates are required to calculate the light intensity of one sub block. To create the SAT, all pixel values in the table are stored as summations of the pixel value at the current location and the preceding summation values. In this fashion, the SAT can be generated from just one pass over the image. This can be expressed [14]:

\[
s(i, j) = p(i, j) + s(i - \Delta i, j) + s(i, j - \Delta j) - s(i - \Delta i, j - \Delta j)
\]

where p(i, j) is the pixel value at location (i, j), s is the summed area table, and the decrements (Δi, Δj) refer to the preceding pixel summation value; here Δi = Δj = 1. The sum light intensity value of any sub block can then be calculated from the four corner coordinates of that block in the SAT. In this paper red, green and blue SATs were created instead of one SAT for grey-scale (or light intensity). This allowed the CF-ELM to work with each individual colour band as required.
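A minimal sketch of the single-pass SAT construction and the four-look-up block sum described above is given below, assuming an 8-bit single-band image in row-major order; the function names are illustrative. Dividing block_sum() by the number of pixels in the block gives the average value used for one block of the SAT grid, and repeating this on the red, green and blue tables gives the three per-band values.

```c
/*
 * Minimal sketch of the summed area table described above, assuming an
 * 8-bit single-band image in row-major order; names are illustrative only.
 */
#include <stdint.h>

/* One pass over the image builds the table (Delta i = Delta j = 1). */
void build_sat(const uint8_t *img, int w, int h, uint32_t *sat)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            uint32_t up     = (y > 0) ? sat[(y - 1) * w + x] : 0;
            uint32_t left   = (x > 0) ? sat[y * w + (x - 1)] : 0;
            uint32_t upleft = (y > 0 && x > 0) ? sat[(y - 1) * w + (x - 1)] : 0;
            sat[y * w + x] = img[y * w + x] + up + left - upleft;
        }
}

/* Sum of the block with top-left (x0, y0) and bottom-right (x1, y1),
 * inclusive, from only four table look-ups. */
uint32_t block_sum(const uint32_t *sat, int w, int x0, int y0, int x1, int y1)
{
    uint32_t d = sat[y1 * w + x1];
    uint32_t b = (y0 > 0) ? sat[(y0 - 1) * w + x1] : 0;
    uint32_t c = (x0 > 0) ? sat[y1 * w + (x0 - 1)] : 0;
    uint32_t a = (y0 > 0 && x0 > 0) ? sat[(y0 - 1) * w + (x0 - 1)] : 0;
    return d - b - c + a;
}
```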
2.3. Colour-level co-occurrence matrix (CLCM)

The CLCM [16] (or grey-level co-occurrence matrix, GLCM) is an image analysis tool based on light intensity that looks at the relationship between pixels and their neighbours. The texture measures used in this research include energy, entropy and contrast:

\[
\mathrm{Energy} = \sum_{i,j=0}^{D-1} P(i,j)^2, \tag{10}
\]

\[
\mathrm{Entropy} = \sum_{i,j=0}^{D-1} -\ln(P(i,j))\, P(i,j), \tag{11}
\]

\[
\mathrm{Contrast} = \sum_{i,j=0}^{D-1} P(i,j)\,(i-j)^2, \tag{12}
\]

where P(i, j) is the co-occurrence value at position (i, j) and D is the number of values within the CLCM. In this research three CLCMs were created to match the three colour band requirements of the CF-ELM. These included a red level, green level and blue level co-occurrence matrix. To avoid confusion these will be referred to as colour-level co-occurrence matrices (CLCM), but it is noteworthy that the grey-level form (GLCM) is more common in the literature.
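The texture values can be computed in a single pass over a normalised co-occurrence matrix. The sketch below covers the five values used later in Section 3 (energy, entropy, contrast, homogeneity and CLCM mean); the energy, entropy and contrast terms follow Eqs. (10)–(12), while the homogeneity and mean expressions are the standard grey-level co-occurrence forms and are an assumption here, since their definitions do not appear in this excerpt.

```c
/*
 * Minimal sketch of the CLCM texture values of Eqs. (10)-(12), computed from
 * a D x D co-occurrence matrix P normalised to sum to 1. The homogeneity and
 * mean expressions are the standard Haralick forms and are an assumption;
 * the published source may define them differently.
 */
#include <math.h>

typedef struct {
    double energy, entropy, contrast, homogeneity, mean;
} clcm_features;

clcm_features clcm_texture(const double *P, int D)
{
    clcm_features f = {0, 0, 0, 0, 0};
    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++) {
            double p = P[i * D + j];
            f.energy      += p * p;                          /* Eq. (10) */
            if (p > 0.0)
                f.entropy -= log(p) * p;                     /* Eq. (11) */
            f.contrast    += p * (i - j) * (i - j);          /* Eq. (12) */
            f.homogeneity += p / (1.0 + (i - j) * (i - j));  /* assumed form */
            f.mean        += i * p;                          /* assumed form */
        }
    return f;
}
```

The five values from each of the red, green and blue matrices are what the CLCM variant feeds to the five inputs of each CF-ELM (Section 3.1).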
3. Methodology

All instances of the MEC-ELM were programmed and tested using the C programming language; C was chosen for its portability and fast processing speeds, and it is also possible to use variants of C in many embedded devices [29]. All benchmarking was conducted using a laptop computer with a Linux based system, 16 gigabytes of RAM, a solid state drive and a 4th generation i7 mobile processor. All time benchmarking was conducted using the clock_gettime function with the CLOCK_MONOTONIC option from the time.h library. All images were stored as JPEG and decompressed using the jpegIO.h library (also known as libjpeg); before pre-processing, images were stored in memory as RGB values between 0 and 255. JPEG was chosen for storage and file transfer reasons, and it is also an output available in a number of remote camera interfaces. Images were stored using 4:4:4 sub-sampling, which is perceptually lossless, and were saved using the default 4:2:0 (lossy) setting in libjpeg for user feedback/external processing (smaller files). In agriculture this means that the lesser quality images are better suited to the potential wireless transfer constraints of remote environments.
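As an illustration of the timing method, the pattern below reproduces the per-frame wall-clock measurement with clock_gettime and CLOCK_MONOTONIC; process_frame() is a hypothetical stand-in for the detector, not a function from the published source.

```c
/* Per-frame wall-clock timing with CLOCK_MONOTONIC, as used for the
 * benchmarks in this paper; process_frame() is a hypothetical stub. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static void process_frame(void)
{
    /* the MEC-ELM detector would run on one frame here */
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    process_frame();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("time per frame: %.3f s\n", secs);
    return 0;
}
```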
3.1. The algorithm

The algorithm is a multiple stage process based primarily on Sadgrove et al. [9,10] for the initialisation of the CF-ELM for remote computer based classification. These stages included: (i) an initialisation stage, where the neural network is initialised into memory with initial random weights; (ii) a weight biasing stage, where the weights were biased to the training data, based on the CIW-ELM [30]; (iii) a training stage, where the output weights were analytically determined with assistance from the C LAPACK library [31]; (iv) a tuning stage, where optimal threshold values were found based on a tuning set of 100 images (this stage was also used to assist in an off-line training process, where optimal individual weight sets for each CF-ELM were saved in text based data files so that they can be imported instead of retraining); and (v) a testing stage, where the individual CF-ELMs were tested on the image frames. The weight biasing, training, tuning and testing stages differed for each of the two implementations and from the standard CF-ELM algorithm; these differences will be discussed. Of the two implementations of the MEC-ELM tested in this paper, the primary difference is that one utilises the CLCM and the other does not. This means that the CLCM MEC-ELM has CF-ELMs with just 5 inputs, to match the 5 texture values, while the Y'UV MEC-ELM has inputs matching the number of grid blocks from the integral image. In both cases the grid blocks will be referred to as the SAT grid. In this implementation the RGB image is converted to Y'UV before being processed as a SAT grid.
Each test video was converted to frames at two frames per second to match the expected processing time of the algorithm, although this did not take into account the CLCM and tracking algorithms. The video used for testing in the case of the cattle aerial footage was only 18 s long; for this reason six frames per second were extracted instead. To elaborate, the extraction of two frames per second was deemed adequate for object detection in remote environments, but the emphasis in this paper was on evaluating the effectiveness of the algorithm and for this reason more frames were required.

The SAT grid: during the training, testing and tuning processes, all images were first converted to a summed area table. The SAT was then used to extract an average colour grid of the image or target area. There was no limit on the number of coloured blocks within the grid, but in this research between 25 and 100 blocks were used, depending on how successful classification was at each block number. As processing speed was a priority, lower block numbers were preferred. For the Y'UV variant of the MEC-ELM these values were sent directly to each CF-ELM for processing. This is on display in Fig. 2, where 3SAT is the three individual summed area tables for the colour bands Y', U and V. There is also an example of an image converted to a SAT grid with 196 blocks in Fig. 3.

Scaling: each frame was read as a selection of rectangles. The rectangle size was produced by dividing the width and height of the image frame by a number determined to be effective in pretesting. At each iteration the search rectangle was moved through the image frame (typically by 10–20 pixels at a time). To search with a larger or smaller rectangle the dividing number was increased or decreased, causing a scale change; a minimal sketch of this search loop is given after these definitions.

The CLCM: the block RGB values extracted from the SAT were processed into a CLCM for use with the CLCM variant of the MEC-ELM. During this process the red, green and blue level co-occurrence matrices were produced and from each an energy, entropy, contrast, homogeneity and CLCM mean value were produced. The five values were then sent to the five inputs of each CF-ELM for processing. A diagram is on display in Fig. 4; note the extra stage 3CLCM, where the R, G and B SATs are used to make 3 CLCMs for the texture values energy, entropy, contrast, homogeneity and CLCM mean.
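The following is a minimal sketch of the scaling and search procedure defined above; classify_window() is a hypothetical callback that would build the SAT grid (or CLCM values) for the window and return the MEC-ELM decision, and the divisors and step size are illustrative rather than the tuned values from the experiments.

```c
/*
 * Minimal sketch of the sliding-rectangle search described under "Scaling".
 * classify_window() is a hypothetical callback; divisors and step are
 * illustrative, not the tuned values from the published experiments.
 */
typedef int (*window_classifier)(int x, int y, int w, int h);

void scan_frame(int frame_w, int frame_h, window_classifier classify_window)
{
    static const int divisors[] = { 4, 6, 8 };  /* larger divisor => smaller rectangle */
    const int step = 15;                        /* 10-20 pixels per iteration */

    for (unsigned s = 0; s < sizeof divisors / sizeof divisors[0]; s++) {
        int win_w = frame_w / divisors[s];
        int win_h = frame_h / divisors[s];
        for (int y = 0; y + win_h <= frame_h; y += step)
            for (int x = 0; x + win_w <= frame_w; x += step)
                if (classify_window(x, y, win_w, win_h)) {
                    /* positive detection: record or draw the rectangle here */
                }
    }
}
```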
Cattle Stationary: the dataset is a collection of surveillance videos taken from a nearby farm using a Scoutguard SG860C camera at 640 by 480 pixels, at 16 frames per second for 60 s and in AVI format (before conversion to MP4). The videos include Poll Hereford cattle surrounding a river bed and other areas for grazing. The training set was made up of images cropped from some of these videos. A video not used to generate the training images was used as the test video. Each video was captured at different times of day and in different weather conditions. The distance to the cattle depended on how close they got to the camera.

ATV: the video of the quad bike rider was captured using a DJI Phantom 3 Adv in areas of Lough Coolin and Mount Gable in Ireland, in overcast conditions and at varying distances (around 20–50 m). Precise specifications of the video were not available. The video was downloaded from Youtube for fair use in MP4 format and at a resolution of 1280 × 720 pixels. The video was submitted to Youtube by the user and channel name "JGK Drone" [35]. As only one video was available, the training and test sets were made up of different sections of the 3 min 37 s video; exactly 144 frames, or 1 min and 12 s of video, were used for the testing sequences and the rest was used as the training set. Areas that were overly dark or shadowy were avoided for reasons that will be explored in the discussion, although these were still used in the training dataset. Frames used in testing were cropped to surround the quad bike. Images were flipped vertically using bash to increase the number of training images. Some training images overlapped slightly with the test set.
4. Results

The results from each dataset are split, with the Y'UV input results in Table 2 and the CLCM input results in Table 3. The tables include the name of each dataset, the number of blocks used in the SAT grid, the average accuracy of individual CF-ELMs achieved in the tuning phase, the average time taken in testing per frame, and the precision and recall rates for all frames. Accuracy, precision and recall can be expressed:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{17}
\]

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{18}
\]

\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{19}
\]

where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives respectively.

The CLCM variant took between 1 and 2 s longer per frame due to the time needed to process the CLCM and obtain the texture values. The Y'UV variant did not perform very well in the cattle datasets; in both cases it required more grid blocks and recorded many false positives. These included the reflection of the cattle in the water in the case of the stationary camera, and a large number of false positives in the tree line and on the helipad in the drone dataset; this dataset was also the slowest of the four because of the multiple scales required to get true positives. The CLCM variant recorded a number of false positives on the helipad as well, particularly in the last ten frames; discarding the last ten frames puts the precision at 84%. Sample output frames from each of the 4 datasets are in Fig. 5, where B (cattle stationary) and D (bull thistle) are from the Y'UV variant and A (quad bike) and C (cattle drone) are from the CLCM variant. Note the detection of the cow reflection in B; this problem did not occur in the CLCM variant.

Fig. 6 shows a receiver operator characteristic (ROC) curve depicting the average classification rates for different sized SAT grids, with the area under the curve (AUC). Four different sized SAT grids were tested and the results were produced using the tuning images, with 50 images of the target object and 50 images of the surrounding landscape. For this test the thistle dataset used with the Y'UV variant was chosen as a demonstration dataset, as the difference between true positive and false positive rates appeared most obvious. The test was conducted once with SAT grids of sizes 9, 25, 100 and 2500. The four MEC-ELMs were trained and tested on the tuning images and the average TP and FP rates from the four were saved; the average rates were then used to make the ROC curve. As can be seen, the larger the SAT grid the better the results, with the best results coming from the SAT grid with 2500 blocks. Notably the results are not as good as the tuning accuracies found in Tables 2 and 4. The testing in this case was based on an on-line training method where all weights were randomised and tested immediately after, whereas the MEC-ELMs used to obtain the precision and recall rates were trained using an off-line method, where the CF-ELMs were initialised a number of times and the weights for the CF-ELM producing the best results were saved in a text based file [9]. This meant the results in Fig. 6 were based on the average of randomly trained classifiers, rather than the best of 100 classifiers. In further testing, processing with 9 blocks took an average of 0.35 s per frame, 25 blocks took 0.44 s per frame, 100 blocks took 1.1 s per frame and 2500 blocks took around 67 s per frame.
Table 2
Testing results for each dataset using Y'UV inputs, based on all frames of the test videos, with accuracy detected during the tuning stage.

Y'UV dataset      | SAT grid blocks | Accuracy in tuning | Time/frame | Precision | Recall
Bull Thistle      | 25              | 97%                | 0.44 s     | 98%       | 84%
Cattle Drone      | 100             | 81%                | 1.19 s     | 23%       | 53%
Cattle Stationary | 100             | 89.25%             | 0.53 s     | 84%       | 78%
ATV               | 25              | 93%                | 0.49 s     | 91%       | 78%

Table 3
Testing results for each dataset using the CLCM texture values, based on all frames of the test videos, with accuracy detected during the tuning stage.

CLCM dataset      | SAT grid blocks | Accuracy in tuning | Time/frame | Precision | Recall
Bull Thistle      | 25              | 91%                | 1.2 s      | 84%       | 81%
Cattle Drone      | 25              | 93.5%              | 1.9 s      | 78%       | 93%
Cattle Stationary | 25              | 94%                | 1.3 s      | 85%       | 84%
ATV               | 25              | 90%                | 2.3 s      | 84%       | 91%

5. Discussion
[23] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Process. Lett. 17 (8) (2010) 754–757.
[24] P. Viola, M. Jones, Robust real-time object detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[25] R. Wang, AdaBoost for feature selection, classification and its relation with SVM, a review, Phys. Procedia 25 (2012) 800–807.
[26] A.H. Bishak, Z. Ghandriz, T. Taheri, Face recognition using a co-occurrence matrix of local average binary pattern (CMLABP), Cyber J. Multidiscip. J. Sci. Technol. J. Sel. Areas Telecommun. (JSAT) (2012) 15–19.
[27] G.N. Srinivasan, G. Shoba, Statistical texture analysis, World Academy of Science, Engineering and Technology, vol. 36 (2008) 1264–1269.
[28] A. Eleyan, H. Demirel, Co-occurrence matrix and its statistical features as a new approach for face recognition, Turk. J. Elect. Eng. Comput. Sci. 19 (1) (2011) 97–107.
[29] J. Leskela, J. Nikula, M. Salmela, OpenCL embedded profile prototype in mobile device, Signal Processing Systems, IEEE, 2009, doi: 10.1109/SIPS.2009.5336267.
[30] J. Tapson, P. de Chazal, A. van Schaik, Explicit computation of input weights in extreme learning machines, International Conference on Extreme Learning Machines, Vol. 1 of Algorithms and Theories, Springer, Switzerland, 2015, pp. 41–49.
[31] Netlib.org, The LAPACKE C interface to LAPACK, November 2013. http://www.netlib.org/lapack/lapacke.html.
[32] Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios, Tech. rep., International Telecommunications Union, Electronic Publication, Geneva, 2015.
[33] S. Hartig, Basic image analysis and manipulation in ImageJ, Chapter 14, (2013) Unit 14.15.
[34] J.C. van Gemert, C.R. Verschoor, P. Mettes, K. Epema, L.P. Koh, S. Wich, Nature conservation drones for automatic localization and counting of animals, European Conference on Computer Vision Workshop, ECCV Workshop on Computer Vision in Vehicle Technology (2014) 255–270.
[35] JGK Drone, Quad biking drone footage, November 2016. https://www.youtube.com/watch?v=x5I-y7bWD9E.
[36] D.E. Culler, J.P. Singh, A. Gupta, Workload-driven evaluation, Parallel Computer Architecture: A Hardware/Software Approach, 1st ed., (1998), pp. 266.
[37] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, (December 2016).
[38] M. Fourment, M. Gillings, A comparison of common programming languages used in bioinformatics, BMC Bioinf. 9 (2008) 82.
[39] D. Ghosh, N. Kaabouch, A survey on image mosaicing techniques, J. Vis. Commun. Image Represent. 34 (2016) 1–11.
[40] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), vol. 81 (1981).
[41] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848, doi: 10.1109/TPAMI.2014.2388226.
[42] G. Zhou, C. Li, P. Cheng, Unmanned aerial vehicle (UAV) real-time video registration for forest fire monitoring, Vol. 3 of Geoscience and Remote Sensing Symposium, IEEE International, Seoul, South Korea, 2005.