Developing Bacterial Wilt Detection Model On Enset Crop Using A Deep Learning Approach
OCTOBER, 2019 GC
Approval
This is to certify that the thesis prepared by Mr. Yidnekachew Kibru Afework entitled
“Developing Bacterial Wilt Detection Model on Enset Crop Using A Deep Learning
Approach” and submitted in partial fulfillment of the requirements for the degree of Master of Science
complies with the regulations of the University and meets the accepted standards with
respect to originality, content, and quality.
Declaration
I hereby declare that this thesis entitled “Developing Bacterial Wilt Detection Model on
Enset Crop Using A Deep Learning Approach” was prepared by me, with the guidance
of my advisor. The work contained herein is my own except where explicitly stated
otherwise in the text, and this work has not been submitted, in whole or in part, for any
other degree or professional qualification.
Witnessed by:
Dedication
Abstract
Ethiopia is one of the African countries with huge potential for cultivating a wide variety
of crops. Many cultivated crops serve as staple foods in different regions of the country.
Among them is Enset, a staple food for around 15 million people in the central, southern,
and southwestern regions of Ethiopia. The Enset crop is affected by diseases caused by
bacteria, fungi, and viruses; of these, bacterial wilt of Enset is the most determinant
constraint on Enset production. Identification of the disease needs special attention from
experienced experts in the area, and it is not possible for plant pathologists to reach each
and every Enset crop to observe the disease, because the crop is physically big. Thus, a
computer vision model that can be deployed on drones to automatically identify the disease
can help support the communities that cultivate the Enset crop. To this end, a deep learning
approach for automatic identification of Enset bacterial wilt disease is proposed. The
proposed approach has three main phases. The first phase is the collection of healthy and
diseased Enset images, with the help of agricultural experts, from different farms to create
a dataset. The second is the design of a convolutional neural network that classifies a given
image as healthy or diseased. Finally, the designed model is trained and tested on the
collected dataset and compared with different pre-trained convolutional neural network
models, namely VGG16 and InceptionV3.
The dataset contains 4896 healthy and diseased Enset images. Of these, 80% are used for
training and the rest for testing the model. During training, data augmentation is used to
generate more images to fit the proposed model. The experimental results demonstrate that
the proposed technique is effective for the identification of Enset bacterial wilt disease.
The proposed model can successfully classify a given image with a mean accuracy of
98.5%, even though the images are captured under challenging real-scene conditions such
as varying illumination, complex backgrounds, and different resolutions and orientations.
Acknowledgments
First and foremost, I would like to express my deepest gratitude to the Almighty God, by
whose blessing this thesis has been successfully concluded. I would then like to thank my
advisor, Dr. Sreenivasa Rao, for his advice throughout this thesis.
I would also like to express my gratitude to my co-advisor, Mr. Taye Girma, for his help
and constructive guidance from the implementation through to the completion of the work.
Many thanks and appreciation go to him; the discussions with him always made me think
that things are possible.
I am also very thankful to Mr. Abebe, a plant science expert in Mihurna Aklil wereda,
Gurage zone, SNNPR, and to Mr. Sabura Shara, a Ph.D. scholar at Arbaminch University,
for giving me expert advice about bacterial wilt disease of Enset and for helping me obtain
images of Enset affected by the disease.
Finally, I would like to thank my family, who always supported me throughout these years.
I acknowledge the constant hard work and moral support of my loving brother, Mr. Tilahun
Kibru. He always supported and encouraged me to achieve new goals.
Table of Contents
Approval ............................................................................................................................. ii
Declaration ......................................................................................................................... iii
Abstract ............................................................................................................................... v
Acknowledgments.............................................................................................................. vi
Abbreviations and Acronyms ............................................................................................. x
List of Tables .................................................................................................................... xi
List of Figures ................................................................................................................... xii
Chapter 1 ....................................................................................................................... 1
Introduction ....................................................................................................................... 1
Background .......................................................................................................... 1
SWOT Analysis.................................................................................................... 2
Motivation ............................................................................................................ 4
Statement of the Problem .................................................................................... 5
Objectives ............................................................................................................. 6
1.5.1 General Objective ......................................................................................... 6
1.5.2 Specific Objectives ....................................................................................... 6
Scope and Limitation of the Study ....................................................................... 7
Significance of the Study ..................................................................................... 7
Organization of the Thesis ................................................................................... 8
Chapter 2 ....................................................................................................................... 9
Literature Review ............................................................................................................. 9
Enset Crop ............................................................................................................ 9
Enset Bacterial Wilt ........................................................................................... 10
Machine Learning .............................................................................................. 11
Artificial Neural Network .................................................................................. 13
2.4.1 Multi-Layer Networks ................................................................................ 14
2.4.2 Backpropagation Algorithm........................................................................ 15
2.4.3 Activation Function .................................................................................... 16
Deep Learning .................................................................................................... 18
2.5.1 Convolutional Neural Network ................................................................... 21
2.5.2 CNN Architectures...................................................................................... 28
2.5.3 Application of CNN in Crop Disease Detection ......................................... 30
Related Works .................................................................................................... 32
Summary ............................................................................................................ 34
Chapter 3 ..................................................................................................................... 36
Research Methodologies ................................................................................................. 36
Research Flow .................................................................................................... 36
Data Preparation ................................................................................................. 37
3.2.1 Data Preprocessing...................................................................................... 37
3.2.2 Data Partitioning ......................................................................................... 38
3.2.3 Data Augmentation ..................................................................................... 39
Software Tools ................................................................................................... 39
Hardware Tools .................................................................................................. 40
Evaluation Technique ......................................................................................... 41
Chapter 4 ..................................................................................................................... 42
Design and Experiment .................................................................................................. 42
Model Selection.................................................................................................. 42
Overview of BWE Detection ............................................................................. 43
Training Components of the Proposed Model ................................................... 44
4.3.1 Proposed Model Description....................................................................... 45
Feature Extraction Using Proposed Model ........................................................ 48
Classification Using Proposed Model ................................................................ 50
Classification Using Pre-Trained Models .......................................................... 51
Experimental setup ............................................................................................. 52
4.7.1 Augmentation Parameters ........................................................................... 53
4.7.2 Hyperparameter Settings ............................................................................. 53
Chapter 5 ..................................................................................................................... 56
Results and Discussions .................................................................................................. 56
Experimental Result ........................................................................................... 56
Pre-trained CNN ................................................................................................. 56
5.2.1 Detection of BWE by using VGG16 Pre-trained Model ............................ 57
5.2.2 Result Analysis of VGG16 ......................................................................... 58
5.2.3 Detection of BWE by using InceptionV3 Pre-trained Model ..................... 60
5.2.4 Result Analysis of InceptionV3 .................................................................. 61
Detection of BWE by using the Proposed CNN Model ..................................... 62
5.3.1 Scenario 1: Changing Training and Testing Dataset Ratio. ........................ 62
5.3.2 Scenario 2: Changing Learning Rate. ......................................................... 63
5.3.3 Scenario 3: Using Different Activation Function. ...................................... 63
5.3.4 Result Analysis for the Proposed BWE Detection Model .......................... 64
Discussion .......................................................................................................... 66
Chapter 6 ..................................................................................................................... 69
Conclusion and Recommendations ............................................................................... 69
Conclusion.......................................................................................................... 69
Recommendations .............................................................................................. 70
References ........................................................................................................................ 71
Appendix A: Experiment of Proposed Model .............................................................. 76
Abbreviations and Acronyms
ANN Artificial Neural Network
BCE Binary Cross-Entropy
BWE Bacterial Wilt of Enset
CNN Convolutional Neural Network
DBN Deep Belief Networks
DBSCAN Density-Based Spatial Clustering of Applications with Noise
FC Fully Connected
FCN Fully Convolutional Network
GPU Graphics Processing Unit
HOT Histogram of Template
ILSVRC ImageNet Large Scale Visual Recognition Challenge
KNN K-Nearest Neighbor
LSTM Long Short-Term Memory
ML Machine Learning
MSE Mean Squared Error
NLP Natural Language Processing
OCR Optical Character Recognition
ReLU Rectified Linear Unit
RGB Red, Green, and Blue
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SVM Support Vector Machine
List of Tables
Table 1.1 SWOT analysis of Enset Production .................................................................. 3
Table 2.1. Summary of related works ............................................................................... 34
Table 4.1. Summary of proposed model parameters ........................................................ 48
Table 4.2. Augmentation techniques used ........................................................................ 53
Table 4.3. Summary of hyperparameters used during model training .............................. 55
Table 5.1. Mean accuracy and loss of VGG16 pre-trained model.................................... 59
Table 5.2. Mean accuracy and loss of InceptionV3 pre-trained model ............................ 62
Table 5.3. Result of experiments by using different training and testing dataset ratio ..... 62
Table 5.4. Result of the proposed model by using different learning rate ........................ 63
Table 5.5. Results of the proposed model by using different activation functions ........... 63
Table 5.6. Mean accuracy and loss of the proposed model .............................................. 66
List of Figures
Figure 2.1. Example of Enset crop.................................................................................... 10
Figure 2.2. Example of healthy (left) and infected (right) Enset leaves ........................... 10
Figure 2.3 Map of regions that cultivate Enset in Ethiopia .............................................. 11
Figure 2.4. Example of single layer perceptron ................................................................ 14
Figure 2.5. Example of multilayer Network ..................................................................... 15
Figure 2.6. Example of CNN Architecture ....................................................................... 22
Figure 2.7. Example of Input volume and filter................................................................ 23
Figure 2.8. Example of the Convolution operation........................................................... 24
Figure 2.9. Example of convolution of a 3D input volume .............................................. 25
Figure 2.10. Example of convolution operation with 2 filters .......................................... 26
Figure 2.11. An Example of one convolution layer with activation function................... 26
Figure 2.12. Example of max pooling .............................................................................. 27
Figure 2.13. Example of fully connected Layer ............................................................... 28
Figure 3.1. Research flow ................................................................................................. 37
Figure 3.2. Resized image ................................................................................................. 38
Figure 4.1. Block diagram of the detection of Bacterial Wilt disease .............................. 43
Figure 4.2. Proposed model .............................................................................................. 45
Figure 4.3. Feature Extraction in the proposed model ...................................................... 49
Figure 4.4. Classification in the proposed model ............................................................. 50
Figure 4.5. Transfer learning ............................................................................................ 52
Figure 5.1. Training and validation accuracy for VGG16 Pre-trained model .................. 59
Figure 5.2. Training and validation loss for the VGG16 pre-trained model..................... 59
Figure 5.3. Example of the Inception module................................................................... 60
Figure 5.4. Training and validation accuracy of InceptionV3 pre-trained model ............ 61
Figure 5.5. Training and validation loss of InceptionV3 pre-trained model .................... 61
Figure 5.6. Training and validation accuracy of the proposed model .............................. 65
Figure 5.7. Training and validation loss of proposed model ............................................ 65
Figure 5.8. Mean accuracy of the three experiments ........................................................ 67
Figure 5.9. Mean Loss of the three experiments ............................................................... 67
CHAPTER 1
INTRODUCTION
Background
The Ethiopian economy depends mainly on agriculture. Nearly 85% of Ethiopian people
depend on agriculture as their principal means of livelihood [1]. In this context, agriculture
plays a vital role in the Ethiopian economy. In recent decades, agricultural production has
become much more important than in the past, when plants were used only to feed humans
and animals. Agriculture is also an important source of raw materials for many
agriculture-based industries.
Ethiopia is one of the African countries with huge potential for cultivating a wide variety
of crops. Among the many crops used as main food sources in Ethiopia is Enset (እንሰት).
Enset belongs to the family Musaceae and is a herbaceous, monocarpic crop. The physical
appearance of Enset resembles that of banana, but Enset is taller and fatter, and, most
importantly, its fruits are not edible; hence, Enset is known as the 'false banana'. Enset has
a gigantic underground rhizome, or corm, which is used for propagation. The corm
produces suckers, and each sucker develops into a new fruit-bearing Enset plant. In central,
southern, and southwestern Ethiopia, Enset is considered the primary cultivated food,
food-security, and cash crop. A total of 302,143 hectares of land is cultivated with Enset in
Ethiopia [2], and the crop provides human food, animal forage, fiber, construction
materials, and medicines for 20% of the country's population. Most importantly, Enset is a
staple food for more than 15 million people in Ethiopia [3].
Enset production is affected by biotic and abiotic factors, such as diseases and insect pests,
which contribute to the low yield and low quality of Enset production. Of these factors,
diseases caused by bacteria, fungi, viruses, and nematodes are the most severe biological
problems. Among them, Bacterial Wilt of Enset (BWE) is the most determinant constraint
on Enset production [4, 5].
Detecting diseases plays an important role in agriculture because most plant diseases are
not easily visible when they first occur. To identify the disease affecting an Enset crop, it
is usually necessary to look at the Enset closely; examine its leaves, stems, and sometimes
roots; and do some detective work to determine the possible causes. The identification of
Enset disease, more specifically bacterial wilt disease, can be done by plant pathologists
(experts in the field of agriculture). However, getting experts to identify the disease is
expensive for farmers who live far from where the experts are found; this is the main
weakness of current practice. To minimize this problem and properly identify bacterial wilt
disease, it is possible to develop a computerized model that detects the disease using
computer vision and deep learning techniques.
The opportunity lies in the fact that bacterial wilt disease produces symptoms on the leaves
of the crop, which are the main indicators of the disease in the field. Based on those
symptoms, which appear directly on the crop leaf, we can develop a model that identifies
the disease in the crop. Therefore, the development of bacterial wilt disease detection is
quite useful. In this thesis, an automatic Enset bacterial wilt detection mechanism is
developed using computer vision techniques, more specifically deep learning algorithms.
SWOT Analysis
SWOT analysis is a strategic planning technique used to help a person or organization
identify strengths, weaknesses, opportunities, and threats related to business competition
or project planning. One of the vital steps in the planning process is an analysis of strengths,
weaknesses, opportunities and threats. Prior identification of weaknesses and threats helps
to identify appropriate approaches for internal improvement and justification of factors that
may result in adverse impacts beyond the control of the agriculture sector. Recognition of
strengths and opportunities enables gaining the maximum benefits from internal and
external environments toward achieving the goals and targets set.
Enset production and marketing have strong opportunities; however, this golden crop faces
problems and threats such as bacterial wilt (Xanthomonas campestris) disease and a lack
of improved technologies, planting materials, and post-harvest processing technologies.
The lack of improved Enset varieties and the absence of external farm inputs may affect
production. Farmers mostly rely on organic farmyard manure to supply nutrients to the
Enset plant, which may not be sufficient to raise production. Another problem is that Enset
production is done under a subsistence farming system and is not directly linked to the
central market. There are several gaps and weaknesses in the production, processing, and
marketing of kocho and bulla. Farming and post-harvest tools and implements are still
traditional, with low use efficiency. Moreover, the equipment used in Enset processing
consists of very traditional, locally made tools. This indicates that a lot of work is needed
to improve the processing methods. Gender-dependent division of labor in Enset
processing has a negative impact on productivity. Traditional kocho storage methods and
bulla drying methods lead to product losses. Kocho and bulla marketed locally are highly
liable to losses and spoilage due to a lack of storage and market facilities [6, 7].
Table 1.1 SWOT analysis of Enset Production
Opportunities:
5. Potential export market, especially for bulla
6. Potential raw material for textile and paper industries
Weaknesses:
5. Shortage of land, poor soil fertility, and unsuitable topography for agriculture
Motivation
Plants provide food, shelter, fiber, medicine, and fuel in our world. Green plants produce
our basic food; moreover, modern technologies allow humans to produce more food to
meet the demand of our planet's population. Many factors affect food security, such as
climate change and plant disease, and most plant loss is caused by plant disease. The most
important step in plant disease management is identifying the disease when it first appears
on the farm, which can be aided by computing techniques such as computer vision and
deep learning.
The main motivation for choosing this thesis is that my grandparents' Enset farm was
destroyed by bacterial wilt when I was a grade 9 student. My grandpa and the people of
that village were desperately seeking a cure for the disease in the traditional way at the
time. Finally, my grandpa said to me, "Study hard, my son; you will find medicine when
you grow up." Now I have a chance to help detect the disease early, before it spreads to all
of the crops on the farm and destroys the entire Enset farm.
Statement of the Problem
In the central, southern, and southwestern regions of Ethiopia, Enset is the main (in some
areas, the only) cultivated crop [3]. However, the cultivation process faces a number of
challenges, such as diseases and pests, of which bacterial wilt disease is the most
determinant. Like most crop diseases, the identification of Enset bacterial wilt disease
needs special attention from experienced experts in the area [9]. Research has shown that
the disease is causing a high amount of yield loss in the Enset-cultivating areas of Ethiopia
[9]. Up to 80% of Enset farms in Ethiopia are currently infected with Enset Xanthomonas
wilt. The disease has forced farmers to abandon Enset production, resulting in critical food
shortages in the densely populated areas of southern Ethiopia. It directly affects the
livelihood of more than 20% of farmers in the country [9, 10].
Most farmers in Ethiopia are uneducated and do not get correct and complete information
about the diseases of the Enset crop, so they need expert advice. Besides, there is a shortage
of resources and expertise in Enset pathology in the regions that cultivate the crop. In
addition, it is not possible for crop pathologists to reach every farm, and since even the
pathologists rely on manual eye observation, the manual prediction method is not very
accurate, is time-consuming, and takes a lot of effort. The other main problem is that once
the disease occurs, there is no medicine that can cure bacterial wilt of Enset; the only
current solution is to burn the infected plant and destroy the farm [11].
Although the identification of plant disease using computer vision is important and the
area has been studied for more than 30 years with promising outputs, the advances achieved
are not enough [12]. In other words, there are many plants and diseases that are not
addressed by current technologies. Therefore, we need to extend existing work to address
more diseases and plants. Moreover, some previous disease identification and
classification studies were conducted under strict conditions. For example, the images used
for training and testing were taken under controlled conditions, such as in a laboratory with
a proper lighting system and a strict angle of capture; sample images were collected from
publicly available databases instead of capturing real-world images from the field;
traditional image processing techniques were used; and so on [13, 14, 6, 15, 16].
Therefore, there is a need to design an automatic disease detection model that assists
farmers in the early detection of the disease with greater accuracy. In the literature, many
works have been conducted to detect bacterial wilt disease in different plants, but these
methods have not yet been applied to the Enset crop. Thus, we need to use deep learning
techniques to detect bacterial wilt disease in the Enset crop. The computer vision approach
is a noninvasive technique that provides consistent, reasonably accurate, less
time-consuming, and cost-effective solutions for farmers to identify bacterial wilt disease.
The following research questions are formulated in this thesis.
1. How can computer vision techniques best be used to detect bacterial wilt disease
of Enset?
2. What method could be used to detect bacterial wilt disease of Enset?
3. How should datasets be collected in order to accomplish this task?
Objectives
Scope and Limitation of the Study
This thesis concentrates mainly on the design and development of a bacterial wilt disease
detection model for the Enset crop. The model uses, as its main input, healthy and infected
Enset leaf images collected from Enset farms. Images were captured on different farms in
Mihurina Aklil woreda, Gurage zone, in the Southern Nations, Nationalities, and Peoples'
Region (SNNPR), Ethiopia. The research work took nine (9) months, from problem
formulation to the experimental results. The sample images were collected using a digital
camera. This thesis addresses only the detection of bacterial wilt disease on the Enset crop;
it does not cover other diseases, such as Sigatoka (leaf spot). It is also restricted to one
crop, Enset, even though bacterial wilt disease affects other plants such as banana and
cassava. Recommending medicine or appropriate treatment after the disease is detected,
and estimating the severity of the disease, are beyond the scope of this thesis. The main
limitation in conducting this thesis was hardware resources, such as Graphics Processing
Units (GPUs), which are the most important resources for anyone working on deep
learning algorithms, especially for image processing.
Significance of the Study
• In the first place, this thesis will enable agricultural experts to appreciate the
importance of computer vision in the field of agriculture.
• The thesis will help achieve high yields, since the disease can be detected early
without finding agricultural experts.
• The thesis will help reduce production costs that bring huge losses to farmers due
to the excessive use of pesticides on their crops.
• It will reduce the cost of experts for the continuous monitoring of crops on large
farms.
• The outcome of this thesis will help different authorities take proper measures in
situations where Bacterial Wilt disease is present.
• Finally, this thesis will serve as reference material for researchers who conduct
their research in computer vision, especially research related to plant disease
identification.
CHAPTER 2
LITERATURE REVIEW
This chapter focuses on background information and a review of the literature in the
domain of this thesis. It includes a detailed explanation of the Enset crop, Enset Bacterial
Wilt disease, machine learning and deep learning algorithms, and related works. The
chapter concludes with a summary of related works and the main gaps that this thesis aims
to address.
Enset Crop
The Enset (Ensete ventricosum (Welw.) Cheesman) crop is commonly known as the
Ethiopian banana, Abyssinian banana, false banana, or Ensete [17, 4]. It is domesticated
only in Ethiopia but is found in many countries in central and eastern Africa [18]. It is a
very big, monocarpic, evergreen perennial plant (Figure 2.1), 4 to 6 meters tall (sometimes
up to 12 meters) and 1 meter in diameter [17, 18]. The Enset crop has a thick, strong
pseudostem (false stem) of tightly overlapping leaf bases, with large banana-like leaves
that are wider and taller, up to 5 m long and a meter wide, with a midrib. Like banana,
Enset flowers once, in the center of the plant, at the end of the plant's life [18]. The major
food produced from the Enset crop is locally called Kocho (ቆጮ), obtained by fermenting
a mixture of the scraped pulp of the pseudostem, the pulverized corm, and the stalk of the
inflorescence; the second product is the corm, called Amicho (አሚቾ) [17, 9]. In addition
to Kocho and Amicho, Enset produces another food locally called Bula (ቡላ) [9]. Kocho
can be stored for a long time without any problem at variable temperatures. At the time of
flowering, the quality of Kocho and Amicho production is higher than that of Enset without
a flower. After flowering, the plant dies.
Enset produces a greater amount of food than other cereal crops: 40 to 60 Enset plants will
provide enough food for a family of 6 members for 1 year [10, 18]. Each plant takes 4 to 6
years to mature, or be ready to be eaten, and a matured Enset plant gives 40 to 50 kg of
food at a time. Domesticated Enset crops are propagated vegetatively from suckers,
although a few cultivated plants are still produced from seeds; one mother plant can
produce up to 400 suckers [18, 5, 17].
Figure 2.1. Example of Enset crop
Figure 2.2. Example of healthy (left) and infected (right) Enset leaves.
Enset Bacterial Wilt
Once established in an area, the disease spreads rapidly and results in total yield loss [9].
The main symptoms of the disease are wilting of the leaves, yellowing of the leaves, and
vascular discoloration [9, 10]; in addition, when the disease severely affects the crop, a
cream- or yellow-colored ooze exudes within a few minutes of cutting the pseudostem. The
symptoms of the disease start from the central part of the leaf and spread to the rest of the
plant. The disease is mainly transmitted through infected farming tools, infected planting
materials, animals that feed on infected crops, and insect pests [9, 10], even though there
are many other types of disease caused by bacteria. The following figure (Figure 2.3)
shows the regions where Bacterial Wilt of Enset affects the crop [9].
Figure 2.3 Map of regions that cultivate Enset in Ethiopia
Machine Learning
Machine learning is an application of AI that enables a machine to learn and improve
automatically without being explicitly programmed [20]. Unlike a classical computer
program, which performs a task explicitly programmed by the programmer, a machine
learning program uses a generic algorithm that can extract information from a set of data
without any custom, problem-specific program; i.e., instead of writing a new program for
each specific problem, we feed data to the generic algorithm, which processes the data and
builds its own logic based on it [21]. The learning process in a machine learning algorithm
begins with data [22], such as examples, direct experience, or instructions, in order to look
for patterns in the data and make better decisions in the future based on the examples we
provide [23]. The goal is to allow the computer to learn automatically, without human help,
and adjust accordingly. Machine learning is basically divided into two main categories
based on how the algorithms learn about the data to make predictions: supervised and
unsupervised learning.
The first category, supervised (predictive) learning, learns a mapping function f from an
input variable x to an output variable y:

$$y = f(x) \qquad (2.1)$$

In Equation (2.1), the main objective is to learn the mapping function so well that, when
new input data x (unseen before) arrives, it can be used to predict the output variable y for
that data. For example, for the disease identification problem in this thesis, training images
labeled as diseased and healthy were used. After learning from the images, the algorithm
is able to make predictions on images unseen during training.
The second main category of machine learning is unsupervised (descriptive) learning. This
approach has little or no knowledge of the output, and we try to find patterns or groupings
within the data. The goal is to find interesting patterns or to model the underlying structure
of the data in order to learn more about it [23]. This type of algorithm is used when there
is input data x (the input variable) but no corresponding output data y (the output variable).
The most common tasks in unsupervised learning are clustering, density estimation, and
representation learning.
Currently, learning algorithms are widely used in computer vision applications. Hence,
Machine Learning (ML) is the main component of computer vision algorithms [25].
Computer vision has made exciting progress in the past decades, bringing us self-driving
cars, automated scene parsing, medical diagnosis, and more [26]. Behind this revolution,
machine learning is the driving force.
Artificial Neural Network
The Artificial Neural Network (ANN) is one of the most widely used supervised machine
learning models. The primary focus of this thesis is a special type of NN known as the
Convolutional Neural Network (CNN).
ANNs, sometimes simply called neural networks, are computer programs developed to
mimic the human brain [27, 22]. The term "neural network" originated in 1943, in an effort
to find a mathematical representation of biological information processing [27]. Like
humans, ANNs are trained through experience, by being given appropriate examples,
without any special programming. ANNs are excellent at finding patterns that are too
complex for humans to extract. They gain knowledge by collecting relationships and
patterns in the data provided during training [21, 23]. An ANN contains multiple layers,
each with a number of neurons. A neuron is the smallest building block of the network: it
accepts an input, applies some computation, and generates an output [13].
Even though neural networks are inspired by the human brain, we cannot conclude that
they are exactly the same. A human brain contains approximately 100 billion neurons, and
each neuron is connected to 1,000 to 10,000 other neurons, all working in parallel. ANNs,
in contrast, are mathematical functions implemented on computers that run one process at
a time, in a serial fashion. Therefore, ANNs are not designed to model the human brain
exactly [22].
The first simplified neuron model was introduced by Warren McCulloch and Walter Pitts
and is called the M-P model [28, 20]. This model is also known as a linear threshold gate.
It has a set of inputs $(x_1, x_2, x_3, \ldots, x_n)$ and one output $y$. The linear threshold
simply classifies the set of inputs into two different classes, so $y$ is binary. In addition, it
has a set of weights $(w_1, w_2, w_3, \ldots, w_n)$ associated with the input lines, with
values in the range $(0, 1)$ or $(-1, 1)$. Years later, in the late 1950s, an enhanced version
of the M-P model built around the concept of the perceptron was proposed by Rosenblatt
(Figure 2.4) [20]. Rosenblatt's model of the neuron, the perceptron, enhanced the M-P
model with two features. First, whereas the weight values in the M-P model were fixed,
the perceptron makes them variable. Second, the perceptron adds an extra input that
represents the bias.
Figure 2.4. Example of a single-layer perceptron: inputs $x_1, \ldots, x_n$ are multiplied
by weights $w_1, \ldots, w_n$, summed together with a bias, and passed through an
activation function to produce the output $y$.
The neuron $k$ receives $n$ input parameters $x_i$ and has $n$ weight parameters
$w_{ki}$. A bias term $b$, with a matching dummy input fixed at 1, is included with the
weight parameters. The inputs and weights are multiplied and summed (the dot product of
the input and weight vectors). The bias term $b$ is added to this sum, and the result is given
as input to the activation function $\varphi$, which produces the output $y_k$ of the neuron
[20]:

$$y_k = \varphi\left(\sum_{i=1}^{n} w_{ki}\, x_i + b\right) \qquad (2.2)$$

where $y_k$ is the output of the neuron, $w_{ki}$ is the weight of input $i$, $x_i$ is the
input, $\varphi$ is the activation function, $n$ is the number of inputs, and $b$ is the bias
term.
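A minimal NumPy sketch of Equation (2.2) is given below; the use of ReLU as the
activation $\varphi$ is only an illustrative choice.

```python
import numpy as np

# One artificial neuron implementing Equation (2.2): y_k = φ(Σ w_ki·x_i + b).
def neuron(x, w, b, phi=lambda s: np.maximum(0.0, s)):  # φ = ReLU (illustrative)
    return phi(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # n = 3 inputs
w = np.array([0.2, 0.4, -0.1])  # one weight per input
b = 0.1                         # bias term
print(neuron(x, w, b))          # φ(0.1 - 0.4 - 0.2 + 0.1) = φ(-0.4) = 0.0
```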
2.4.1 Multi-Layer Networks
A multi-layer network consists of an input layer, one or more hidden layers, and an output
layer. Neurons in the input layer accept information from outside the network and pass it
on without modification (no computation is performed). The hidden layers (the layers that
are neither input nor output) perform mathematical computations and transfer information
from the input layer toward the output layer; most of the computation in the network is
performed in the hidden layers. Neurons in the output layer perform computations and
transfer information out of the network: the output layer transforms the activations of the
hidden layers into the actual output, for example, a classification or prediction. Multi-layer
networks (or multi-layer perceptrons) are also known as feed-forward neural networks.
2.4.2 Backpropagation Algorithm
In a feed-forward network, information flows from the input layer through the hidden
layers to the output layer; this process is called forward propagation [22]. The
backpropagation algorithm (commonly called backprop) allows information to flow in the
reverse direction: information flows backward from the output neurons to the input
neurons through the hidden layers in order to compute the gradient [24, 20]. During the
training of the neural network, the weights are selected appropriately, so the network learns
to predict the target output from known inputs [30]. Even though computing the weights
of the neurons by an analytical expression is conceptually straightforward, it is
computationally expensive. So we need a simple and effective algorithm that helps us find
the weights. The backpropagation algorithm provides a simple and effective way of solving
for the weights iteratively in order to reduce the error (minimizing the difference between
the actual output and the desired output) in the neural network model [22, 30, 20].
The weights of the network's neurons are initialized with small random values before an
input vector is propagated forward through the network. Using a loss function, the
predicted output (the output of the network) and the desired output (the output from the
training example) are compared to obtain the error value of the network; the error value is
simply the difference between the actual output and the desired output. The error values
are then propagated back from the output layer to the input layer through the hidden layers,
and the error values of the hidden layers are calculated. In this process, the weights of the
hidden layers are updated; this updating is the learning that takes place during the training
of the neural network. As the weights are iteratively updated, the neural network gets
better. The algorithm continues this process, accepting new inputs, until the error value is
below the limit we set beforehand [20].
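The toy sketch below illustrates the iterative update just described in the smallest possible
case: a single linear neuron trained by gradient descent on one (input, target) pair. The
values and learning rate are arbitrary, chosen only for illustration.

```python
import numpy as np

# Toy illustration of iterative weight updates: a single linear neuron
# trained by gradient descent to reduce the squared error 0.5 * error^2.
x = np.array([1.0, 2.0])                  # input vector
t = 1.0                                   # desired (target) output
w = np.random.uniform(-0.1, 0.1, size=2)  # small random initial weights
b, lr = 0.0, 0.1                          # bias and learning rate

for _ in range(50):
    y = np.dot(w, x) + b  # forward propagation
    error = y - t         # actual output minus desired output
    w -= lr * error * x   # backward pass: gradient w.r.t. the weights
    b -= lr * error       # gradient w.r.t. the bias
print(np.dot(w, x) + b)   # the output approaches the target 1.0
```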
2.4.3 Activation Function
An activation function determines the output of a neuron, which is then passed to the
neurons of the next layer of the network. If we do not use an activation function, the output
of the neural network will simply be a linear function. A linear function is not suitable for
algorithms that need to learn complex functional mappings of data [32]. The main reason
for using non-linearity is that we want an NN model that can learn and represent any
arbitrary function mapping inputs to outputs.
In this thesis, the most widely used activation function, the Rectified Linear Unit (ReLU),
is used in the hidden layers of the network to make the model more powerful and to learn
complex features from the data. It is used to create a lightweight and effective non-linear
network [22, 33]. ReLU has become popular in the past few years and is now the
state-of-the-art activation function for hidden layers [24, 20]. The mathematical form of
this function is:

$$\varphi(x) = \max(0, x) \qquad (2.3)$$

In Equation (2.3), if $x \geq 0$ then $\varphi(x) = x$, and if $x < 0$ then
$\varphi(x) = 0$. Hence, as the equation shows, the ReLU activation function is simple
and efficient (especially for the backpropagation algorithm). The main reason ReLU is
simple and efficient is that it activates only some of the neurons at a time: if the input is
negative ($x < 0$), it is converted to zero and the neuron is not activated. ReLU cannot be
applied in the output layer of the neural network, and this is the main drawback of this
activation function.
The sigmoid activation function is used for the output layer of the model. The sigmoid
activation function is the best activation function for binary classification, and its output
lies between 0 and 1 [20]. It is the best choice for models with probabilistic output, since
the probability of anything lies between 0 and 1. Unlike with the SoftMax activation
function, the outputs of sigmoid functions do not necessarily sum to 1.
The SoftMax function accepts an arbitrary set of $n$ inputs and gives $n$ output values in
the range between 0 and 1, which represent the probabilities of the different classes for the
input; the sum of the output values is always equal to 1. SoftMax is the best choice of
activation function for neural network models built for multiclass classification [20].
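The three activation functions discussed above can be sketched in a few lines of NumPy;
the implementations below are standard textbook forms, not code from this thesis.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # φ(x) = max(0, x), Equation (2.3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # each output lies in (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()               # outputs lie in (0, 1) and sum to 1

z = np.array([2.0, -1.0, 0.5])
print(relu(z))     # [2.  0.  0.5]
print(sigmoid(z))  # values in (0, 1); they need not sum to 1
print(softmax(z))  # class probabilities that sum to 1
```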
Deep Learning
Deep learning is a subfield of machine learning that uses neural networks for its
architecture, and its learning is based on learning representations of the data rather than on
task-specific algorithms [34, 24]. In the last decade, neural network applications have
grown faster than ever, mainly because of powerful, inexpensive processing units such as
GPUs and the availability of large amounts of data. As discussed in Section 2.4 above, an
ANN has one or more processing layers. The number of layers used in the network differs
depending on the problem we want to solve. If the number of layers is not very large, say
two or three, we call the network a shallow architecture. When an ANN architecture
contains a very large number of layers, the network is called a deep architecture, and deep
learning refers to this deep architecture of NNs [35, 24].
Multilayer networks have been known since the 1980s but, for several reasons, they were
not used to train neural networks with many hidden layers [22]. The main problem
preventing the use of multilayer networks at that time was the curse of dimensionality: as
the number of feature dimensions grows, the number of possible configurations increases,
and the number of data samples needed for training grows exponentially. Therefore,
collecting sufficient training datasets was time-consuming, and storing them was not
cost-effective [22, 36]. Nowadays, most neural networks are called deep neural networks,
and they are widely used. We can train a neural network with many hidden layers because
huge amounts of data, storage space, and computational resources are available.
Traditional machine learning algorithms need separate, hand-tuned feature extraction
before the learning phase. Deep learning has only one neural network phase: at the
beginning of the network, the layers learn to recognize the basic features of the data, and
these features are fed forward to the other layers of the network for additional computation
[22].
As NNs are inspired by the human brain, one of the major applications of deep learning,
computer vision, is inspired by the human visual system. Deep learning has achieved great
success in computer vision and speech recognition in the last two decades [28, 34]. Deep
learning models are also applied to many other problem areas, including text classification,
speech recognition (natural language processing), visual object recognition (computer
vision), object detection, and other domains such as drug discovery and genomics [24, 37].
The number and type of problems that a neural network can address depend on the
different deep learning architectures developed in the last two decades. Some of the most
commonly used deep learning architectures are the Recurrent Neural Network (RNN),
Long Short-Term Memory (LSTM), CNN, Deep Belief Networks (DBN), and
autoencoders.
• RNN is one of the first deep learning architectures, and it provided a road map for
developing other deep learning algorithms. It is commonly used in speech
recognition and natural language processing [38]. An RNN is designed to
recognize the sequential characteristics of data (it remembers previous entries).
When we analyze time series data, the network has a memory (hidden state) that
stores previously analyzed data. To perform the present task, a plain RNN can
effectively use only recent information (short-term dependency), and this is its
main drawback. An RNN differs from a standard neural network in that it takes a
sequence of data defined over time [38].
• LSTM is a special type of RNN explicitly designed to overcome the problem of
long-term dependencies by making the model remember values over arbitrary
time intervals. The main problems of RNNs are vanishing and exploding gradients
(the gradient is the change of weight with respect to the change in error). LSTM
is well suited to processing and predicting time series with time lags of unspecified
duration. For example, an RNN forgets earlier inputs if we want to predict a
sequence of one thousand intervals instead of ten, but an LSTM remembers such
long histories. The main reason an LSTM can remember its input over a long
period of time is that it has a memory, like the memory of a computer, which
allows the LSTM to read, write, and delete information [39]. It is mostly applied
to natural language text compression, handwriting recognition, speech
recognition, gesture recognition, and image captioning.
• CNN is the most popular deep learning architecture for computer vision tasks,
especially image recognition. It is a multilayer network inspired by the animal
visual system (visual cortex). CNN is used in this thesis for the detection of
Bacterial Wilt disease on the Enset crop; details are given in Section 2.5.1.
• DBN is a class of deep neural networks with multiple hidden layers, where the
layers of the network are connected to each other but the neurons within a layer
are not. The training of a DBN occurs in two phases: it is composed of layers of
Restricted Boltzmann Machines (RBMs) for the unsupervised pretraining phase
and a feedforward network for the supervised fine-tuning phase. During the first
phase (pretraining), it learns a layer of features from the input layer. After
pretraining is completed, the fine-tuning phase begins: it takes the features of the
input layer as input and learns features in the second hidden layer. Then
backpropagation or gradient descent is used to train the full network, including
the final layer [40]. DBNs are applied in image recognition, information retrieval,
natural language understanding, and video sequence recognition.
• Autoencoders are a specific type of feed-forward neural network designed for
unsupervised learning, i.e., for when the data is not labeled. The inputs and outputs
of an autoencoder are intended to be the same: it accepts the input, compresses it
into a lower-dimensional code, and then reconstructs the output from the
compressed code. Autoencoders have three components, namely the encoder, the
code, and the decoder. The encoder accepts the input and produces the code,
whereas the decoder reconstructs the output from the code. Anomaly detection is
one of the most popular applications of the autoencoder. A minimal sketch of this
structure is shown below.
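The sketch below shows the encoder–code–decoder structure of an autoencoder in Keras,
assuming TensorFlow/Keras is available; the 784-dimensional input and 32-dimensional
code are illustrative choices, not values used in this thesis.

```python
# A minimal encoder-code-decoder sketch, assuming TensorFlow/Keras.
from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),               # flattened input
    layers.Dense(32, activation="relu"),      # encoder: compress to the code
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct the input
])
# Inputs and outputs are the same, so training minimizes reconstruction error.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
```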
Deep learning techniques are new and rapidly evolving. Nowadays, deep learning
performs better than traditional machine learning approaches because of the availability of
large amounts of data and high-performance computing components such as GPUs [20].
Deep learning methods use multilayer processing (many hidden layers) with better
accuracy, and, unlike the traditional machine learning approach, there is no explicit feature
extraction: in a deep learning architecture, features are extracted automatically from the
raw data, and feature extraction and classification (or recognition, depending on the
problem) are performed at once, so we design only a single model. Research has shown
that deep learning can achieve state-of-the-art results for many problems that AI and ML
have faced for a long time in the areas of computer vision, Natural Language Processing
(NLP), and robotics [41, 20]. To cope with this complexity, deep learning methods use the
backpropagation algorithm, loss functions, and very many parameters, which enable the
model to learn complex features.
2.5.1 Convolutional Neural Network
The basic idea of the CNN was inspired by the receptive field, a biological term describing
a feature of the animal visual cortex [20, 21]. Receptive fields are parts of sensory neurons
and act as detectors that are sensitive to a stimulus, for example, edges. The term receptive
field is also applied in the context of ANNs, most often in relation to CNNs, where the
biological computations are approximated in computers using convolution operations. In
computer vision, images can be filtered using convolution operations to produce different
visible effects. A CNN has convolutional filters that are used to detect objects in a given
image, such as edges, analogous to the biological receptive field. Since the late 1980s and
early 1990s, CNNs have given interesting results in handwritten digit classification and
face recognition [42]. The following figure (Figure 2.6) [33] illustrates a CNN architecture.
Figure 2.6. Example of CNN Architecture
A. Convolution Layer
The main objective of the convolution layer is to extract useful features from the input
image. In a computer, every image is represented as a matrix of pixel values. An image
captured by a standard digital camera has three channels, Red, Green, and Blue (RGB),
and is represented as three 2D matrices stacked over each other (one per color), each having
pixel values in the range 0 to 255. A convolution layer is formed from a set of convolutional
filters (also known as kernels or feature detectors), which are small matrices with sizes like
3 × 3, 9 × 9, and so on [29]. The filters are treated as neuron parameters and are learnable.
Every filter is smaller than the input volume in spatial size (width and height) but extends
through the full depth of the input volume (input image). For example, a typical filter might
have size 5 × 5 × 3 (5 wide, 5 high, and 3 deep for the three color channels). Only a part
of the image is connected to the next convolution layer, because connecting every pixel
(which would make the network fully connected rather than convolutional) would be
expensive to compute.
The convolution operation is performed by sliding the filter over the input image, from left
to right and across its width and height, and computing the dot product between the filter
and the input image at each position. The output of this operation is called a feature map
(also known as a convolved feature or activation map). The filters are thus used to extract
useful features from the input image; whenever the values of a filter change, the extracted
features (the feature map) also change. The following illustration (Figure 2.7) uses a 2D
input image of size 5 × 5 and a 3 × 3 kernel.
Figure 2.7. Example of Input volume and filter
Figure 2.8. Example of the Convolution operation
The area where the convolution operation is performed is called the receptive field, and its
size is 3 × 3 here because it is always the same as the size of the filter. We perform as
many convolution operations as we want on the input using different filters, and we get
distinct feature maps. Finally, we stack all the feature maps together, and that is the final
output of the convolution layer.
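The sliding-window operation just described can be written directly in NumPy. The sketch
below convolves a 5 × 5 input with a 3 × 3 filter (stride 1, no padding), producing a 3 × 3
feature map as in Figures 2.7 and 2.8; the pixel and filter values are arbitrary illustrations.
(As in CNN libraries, the filter is not flipped, so strictly this is cross-correlation.)

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # receptive field: the image patch currently under the filter
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # 5 × 5 input
kernel = np.array([[1., 0., -1.]] * 3)            # 3 × 3 vertical-edge filter
print(conv2d(image, kernel))                      # 3 × 3 feature map
```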
The size of the output volume (the feature map) is controlled by three hyperparameters:
depth, stride, and padding (also known as zero-padding). These parameters must be
decided before the convolution operation is performed [29].
• Depth is the number of filters used in the convolution operation. The larger the
number of filters, the stronger the model we produce, but there is a risk of
overfitting due to the increased parameter count. If we use three different filters
during the convolution operation, we produce three different feature maps; these
feature maps are stacked, so the depth of the output would be three.
• Stride is the number of pixels the filter slides over the input volume at a time.
When the stride is 1, the filter matrix slides 1 pixel over the input volume at a time;
when the stride is 2, the filter jumps 2 pixels at a time, and so on. The higher the
stride, the smaller the output volume.
• Padding is the addition of zeros around the borders of the input volume. It helps
to keep more information around the borders of the input and allows us to control
the size of the feature map.
A filter size of 3, a stride of 2, and a padding of 1 are commonly used hyperparameters in
CNNs, but they can be changed depending on the input volume we have [29].
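The text above does not state the output-size relation explicitly, but the standard formula
for a square input is O = (W − F + 2P)/S + 1, where W is the input width, F the filter size,
P the padding, and S the stride. A small sketch:

```python
def conv_output_size(W, F, P, S):
    """Standard output size of a convolution: O = (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(5, 3, 0, 1))    # 3: the 5 × 5 input, 3 × 3 filter example
print(conv_output_size(224, 3, 1, 2))  # 112, for a hypothetical 224-pixel input
```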
The example above (Figure 2.8) applies to a grayscale image, because the matrix has a
depth of only one. In this thesis, the convolutions are performed in 3D, because color
images captured by a digital camera are used, and such images are represented as 3D
matrices with dimensions of width, height, and depth (the depth represents the three color
channels). For example, if we have an input of 6 × 6 × 3 and a filter of size 3 × 3 × 3 (the
depths of the input and the filters are always the same), the convolution operation is
performed as before; the only difference from the 2D case is that the sum of the elementwise
multiplication runs over 3D instead of 2D, as shown in Figure 2.9 below.
Figure 2.9. Example of convolution of a 3D input volume
The figure above (Figure 2.9) [43] shows an input volume of 6 × 6 × 3 and a filter of 3 × 3 × 3. The number of filters is one, and it slides 1 pixel at a time, i.e. the stride is 1. We can use many different filters in the convolution layer to detect multiple features, and the output of the convolution layer will have the same number of channels as the number of filters. The following figure (Figure 2.10) [43] is the same as Figure 2.9 above but with two filters. The depth of the feature map is the same as the number of filters, as we see in Figure 2.10 below.
Figure 2.10. Example of convolution operation with 2 filters
To control the number of free parameters in the convolution layer, there is a systematic method called parameter sharing. If one feature is useful to compute at some spatial position, it should also be useful at another position. In other words, if we use the same filter (commonly called weights) over all parts of the input volume, the number of free parameters decreases. The neurons in the convolutional layer share their parameters and are only connected to some parts of the input volume (local connectivity). The parameter sharing resulting from convolution contributes to the translation invariance of CNNs. When the input volume has some specific centered structure and we want the CNN to learn different features at different spatial locations, parameter sharing no longer makes sense; in that case the parameters are not shared, and the layer is called a locally connected layer [29].
Finally, to make a single convolution layer, we need to add the activation function (ReLU) and a bias (b) to the output volume. The following figure (Figure 2.11) [43] shows one convolution layer of a CNN with the ReLU activation function.
To reduce the number of parameters, to extract dominant features at each spatial location, to progressively reduce the spatial size of the convolved feature, and to control the problem of overfitting in the network, we need to add pooling layers (also called subsampling or downsampling) between some successive convolution layers in a CNN [29]. This layer helps to reduce the computational power required to train the network. The pooling operation is performed by sliding a filter over the convolved feature.
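A minimal NumPy sketch of max pooling (an illustration, not the thesis implementation) makes the down-sampling explicit:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Down-sample by taking the maximum of each size x size window."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

fm = np.random.rand(6, 6)
print(max_pool(fm).shape)  # (3, 3): spatial size halved, no learnable parameters
```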
The fully connected layer is the same as the traditional multilayer perceptron discussed in Section 2.4.1 above. In a fully connected layer, every neuron in the previous layer is connected to every neuron in the next layer. This layer accepts the output of the convolution or pooling layer, which contains high-level features of the input volume. These high-level features are in the form of a 3D matrix, but the fully connected layer accepts a 1D vector of numbers. Therefore, we need to convert the 3D volume of data into a 1D vector, an operation called flattening, which becomes the input to the fully connected layer. The flattened vector is given to the fully connected layer, which performs the mathematical computation of an ordinary ANN as discussed in Section 2.4 above in Equation (2.2). Activation functions such as ReLU are used in the hidden layers to apply non-linearity. Using a sigmoid activation function, the last layer (output layer) of the fully connected part performs classification (the probability of the input belonging to a particular class) based on the training data. For example, in this thesis, the image classification has two classes: diseased and healthy.
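The following hedged Keras sketch shows such a classifier head; the 12 × 12 × 64 input shape is an arbitrary placeholder for whatever volume the last pooling layer produces:

```python
from tensorflow.keras import layers, models

# Hypothetical head: flatten the 3D feature volume, two ReLU hidden
# layers, and a single sigmoid unit for the diseased/healthy decision.
head = models.Sequential([
    layers.Flatten(input_shape=(12, 12, 64)),  # 3D volume -> 1D vector
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),     # P(image is diseased)
])
head.summary()
```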
In the ImageNet (http://www.image-net.org/) large-scale visual recognition challenge, the winners have been researchers using deep learning algorithms, especially CNNs, for the classification and recognition of very large image datasets with hundreds or thousands of classes. Most CNN architectures were driven by the ImageNet challenge. Some of the well-known CNN architectures are LeNet [44], AlexNet [33], VGGNet [45], ZFNet [42], GoogLeNet [46], and ResNet [47].
Yann LeCun et al. developed a CNN model in 1998 to recognize and classify handwritten digits for the postal service. This model is called LeNet-5, and it has been used in many banks and insurance companies to recognize handwritten numbers on cheques. The architecture receives a 32 × 32 × 1 input (a grayscale image) and uses a filter size of 5 × 5 with stride 1. The architecture was not scalable to large images at the time because of limited computational power. It has two convolutional layers, each followed by an average pooling layer, and after these there are two fully connected layers with a SoftMax activation function. The total number of parameters in this model is 60,000 [20].
In 2012, a CNN architecture developed by Alex Krizhevsky et al. called AlexNet won the ImageNet challenge by decreasing the top-five error rate from 26% to 15.3%. This architecture is similar to LeNet-5 but much deeper, with a greater number of kernels (11 × 11, 5 × 5, 3 × 3) per layer and stacked convolutional layers. AlexNet adds techniques such as dropout, data augmentation, and ReLU activations. During training, the network is split into two pipelines trained simultaneously on two GPUs. It has 60 million parameters and 650,000 neurons, and it took 5 to 6 days to train on two GTX 580 3GB GPUs [33].
Simonyan and Zisserman from the VGG group at Oxford created a model called VGG with 16 weight layers. It improves on AlexNet by replacing 11 × 11 and 5 × 5 filters with many stacked 3 × 3 filters. VGG achieves a top-five error rate of 7.3% and took second place in the ImageNet challenge in 2014. It is more efficient to use multiple small stacked kernels than a single large kernel to learn many different complex features. VGG has 138 million parameters, and this is the main drawback of the architecture because it needs greater resources; the VGG architecture was trained for 2 to 3 weeks on 4 GPUs [45].
The winner of the 2013 ImageNet challenge was also a CNN architecture, called ZFNet, achieving a top-five error rate of 14.8%. It uses the same architecture as AlexNet with some modifications, such as changing the filter size from 11 × 11 to 7 × 7 and the stride from 4 × 4 to 2 × 2. To overcome the loss of information caused by using a bigger kernel in the earlier layers, they use smaller kernels early on and increase the kernel size as the network goes deeper [42].
GoogLeNet, developed by Google, is the winner of the ILSVRC 2014 competition. It achieves a top-five error rate of 6.67%, which is very close to human-level performance. GoogLeNet (sometimes called Inception V1) has deeper paths with parallel convolutions of different filter sizes. There are 4 million parameters (far fewer than AlexNet) and 22 deep layers in the GoogLeNet architecture. It is inspired by LeNet, but GoogLeNet uses 1 × 1 convolutions (to reduce the number of parameters) in the middle of the network, and there is no fully connected layer at the end of the network; instead it has global average pooling. The block of parallel convolutions of different filter sizes, combined with 1 × 1 convolutions, is called the inception module [46].
A deep CNN approach in [48] is applied to classify rice disease based on healthy and unhealthy rice leaves. In this study, a dataset containing a total of 857 images was collected from rice fields using a digital camera and from publicly available images of rice on the internet. The authors used a manual a priori classification, with assistance from agricultural offices, to label the image dataset. They used the AlexNet transfer learning architecture of the CNN algorithm to classify the input images into three groups, namely healthy, unhealthy, and snail infested. Finally, the network achieved 91.23% accuracy using stochastic gradient descent (SGD).
Another deep learning approach to detect plant disease is presented in [49]. In this work, the authors proposed a deep learning model for real-time detection of tomato diseases and pests. They used different digital devices to collect 5000 images of tomato leaves from different farms. The proposed approach identifies and classifies the disease into nine different classes and finds the location of the disease on the tomato plant, which makes the study different from many other approaches in this area. For object recognition and classification, CNN algorithms such as Faster Region-Based CNN (Faster R-CNN) [50], Single Shot Multi-box Detector (SSD) [51], and Region-based Fully Convolutional Networks (R-FCN) [52] are used. By combining each of these three CNN architectures with feature extractors such as VGG net [45] and Residual Network (ResNet) [47], their model effectively recognizes and classifies the diseases and pests. Finally, the authors recommended that using data annotation and data augmentation methods helps to increase the accuracy of the result.
Another deep learning technique, proposed by the authors in [53], uses a deep CNN for plant disease detection and classification. In this study, images from internet search results were collected for training and testing the model. The authors used a dataset of 30,880 total images after augmentation and transformation. The dataset was manually assessed by agriculture experts, and the CaffeNet [54] architecture of CNN was used for training after being modified to fifteen categories (classes). Finally, the model successfully categorized 13 different classes of disease with an accuracy of 96.3%.
The author in [35] developed a CNN model to detect and diagnose plant disease using simple leaf images. The datasets were collected from globally available datasets taken under laboratory conditions, supplemented with images taken under real cultivation conditions in the field. In this study, 25 different plants were selected for 58 distinct classes of disease. Several CNN architectures were trained in the study, such as AlexNet, GoogLeNet, Overfeat [55], and VGG; among these, VGG gave the most successful identification of plant/disease combinations, meaning the system generates a pair of plant and corresponding disease with high accuracy.
Related Works
Currently, in the field of agriculture, reduced productivity and loss of yield are mainly caused by plant disease. To reduce these losses, there is a need to develop state-of-the-art, automated methods for plant disease detection. Advancements in agricultural technology are already doing a great job, including disease detection using image processing techniques, and over the last two decades the technology has been producing faster and more accurate output. A lot of work has been done on plant disease detection using image processing and machine learning approaches.
However, most of the studies conducted on the identification of plant disease use traditional image processing techniques, and they follow common steps: image acquisition, image preprocessing, image feature extraction, and finally classification [14, 6, 15, 56]. In the image acquisition step, the images are collected using different digital devices such as digital cameras and smartphones from the field (in our case from Enset farms) or from an existing image dataset. The second step is preprocessing; the main goal of this step is the improvement of the image data by removing unwanted features, enhancing the image, and performing image segmentation. Segmenting the image is used to identify boundaries in the image using segmentation methods such as thresholding. In the feature extraction step, features useful for disease identification, such as color and texture, are extracted from the image. The fourth and main step is classification, in which disease identification and classification are performed. Different classification techniques are used in the literature, such as Neural Networks [57] and support vector machines (SVM), and some studies use both SVM and NN [58]. In the following, we discuss literature in the area of disease detection and classification that is directly related to this thesis.
A machine learning approach is presented in [16] to detect and classify banana Bacterial Wilt and banana black Sigatoka. In this study, 623 diseased and healthy images of banana leaves collected from the field were used. Color features are extracted based on threshold values of the green pixel components of the image, and shape features are extracted by thresholding at different levels, extracting connected components, and calculating morphological features of each connected component. For the classification of the disease, the authors used seven different classifiers, including Nearest Neighbor [59], Decision Tree [60], Random Forest [61], Extremely Randomized Trees [62], Naïve Bayes [63], and SVM. After testing the seven classifiers, Extremely Randomized Trees gave the best classification accuracy: 96% for banana Bacterial Wilt and 91% for banana black Sigatoka.
An automated tool is presented in [13] to identify and classify banana leaf diseases. The authors try to identify and classify diseases caused by fungi, known as banana Sigatoka and banana Speckle. In this study, a globally available dataset from the PlantVillage project is used to collect the infected and healthy banana leaf images. The leaves infected by the disease are determined based on the color difference between healthy and infected leaves. The authors preprocess the entire dataset by resizing each image to 60 × 60 pixels and converting the images to grayscale. Feature extraction and classification are performed by applying a CNN algorithm, and the trained model gives interesting classification results with good accuracy.
The use of ANNs for classification and grading of banana plants is presented in [64]. In this study, a total of 35 diseased images of banana leaves captured in the field were used to train the NN. Color features were extracted by converting RGB images to HSV, and Histogram of Template (HOT) features were also extracted. For classification, a feed-forward neural network was trained. As the authors discuss in the paper, the trained model successfully classifies five different banana plant diseases based on the given images.
A paper presented in [65] addresses the detection of banana black Sigatoka and the calculation of the infected area using segmentation. In this paper, the authors captured images of banana plants from the field. The percentage of infection is calculated as the infected area divided by the total area, multiplied by 100. The following table summarizes previously conducted works that are related to this thesis.
Summary
As mentioned in the previous sections, the studies show that computer vision, more specifically machine learning and deep learning algorithms such as NNs and CNNs, has been widely used in the field of agriculture, especially for crop disease identification, and has obtained interesting results. The Enset crop is related to the banana crop in several ways, and Bacterial Wilt disease also affects the banana plant. Computer vision techniques have been applied for the detection and classification of different diseases in the banana plant, including banana Bacterial Wilt, but there is still a need to develop a more accurate and efficient model. As we see in the related works (Section 2.5.1), all previously conducted papers have some problems that we need to overcome in this thesis. For example, most of the papers take their datasets from internet searches or publicly available databases such as PlantVillage. Using a publicly available dataset is recommended, but the images in most of the previous research were captured under controlled environments such as laboratory setups; there are many laborious preprocessing stages such as handcrafted feature extraction, color histograms, texture features, and shape features; and, most importantly, the methods used by previous research works are not state of the art, i.e. most of the studies in the literature on crop (especially banana) disease identification follow traditional image processing techniques [13, 14, 6, 15, 16]. The other main point of this thesis is that no image processing technique (whether traditional machine learning or deep learning) has been designed to detect or classify Enset disease so far. Hence, an accurate and efficient CNN-based model (avoiding handcrafted feature extraction) for the detection of Bacterial Wilt disease in Enset crop using leaf images of infected and healthy Enset crops is designed and developed.
CHAPTER THREE
RESEARCH METHODOLOGIES
This chapter describes the methodologies used to accomplish this thesis, including the methods used to implement the model, data collection, data preparation, the software and hardware configuration of the system, and the evaluation techniques used to evaluate the model. In this thesis, an experimental research approach is used, where one set of variables is kept constant while another set of variables is measured as the subject of the experiment. Different experiments are carried out using different dataset ratios, and many additional experiments are conducted using different activation functions and hyperparameters.
Research Flow
In this thesis, an experimental research method is followed. In order to achieve the objective of this thesis, the following process flow (Figure 3.1) is followed. As we can see in the block diagram, this thesis is conducted in three main phases. The first phase includes identifying the problem domain, which means understanding the problem by reviewing different kinds of literature; then the objectives of the thesis are formulated, including the general and specific objectives. The second phase is about data preparation and the design of the thesis. During data preparation, data is collected from farms, labeled with the help of agriculture experts, and finally split into training, validation, and testing sets. After data preparation, the design of the model is performed. The third phase is the implementation of the thesis; in this phase, the designed model is implemented with appropriate tools and methods, then trained and tested with the appropriate data. During training, the performance of the model is evaluated. After obtaining the optimal model during evaluation, the model is tested with the test data. Finally, the model is compared with other pre-trained models.
Figure 3.1. Research flow
Data Preparation
The most important requirement when using neural networks or deep learning algorithms in research is getting the data used to train the model. In this thesis, Enset leaf image data is used as the main input to the model. However, there is no publicly available database containing thousands of Enset leaf images that we could download and use for training. So, only images of healthy and infected plants captured from different Enset farms are used.
The images of Enset crops were collected from the southern regions of Ethiopia with the help of farmers and agriculture experts. All diseased and healthy images were captured on Enset farms, and some of the images were captured in fields specifically set up to analyze Bacterial Wilt disease of Enset by Arbaminch University researchers (Chencha, SNNPR, Ethiopia). The images were captured using a digital camera under normal conditions, i.e. without controlling the lighting, and the relative position of the subject to the camera was also not controlled. All the images were checked by domain experts (plant science experts) of Mihurina Aklil woreda, Gurage Zone, SNNPR, Ethiopia.
The CNN algorithm does not require explicit preprocessing of the dataset, because it can take the raw pixels of the image and learn the features by itself. But the images contained in the prepared dataset have different sizes. Therefore, size normalization is performed on the dataset in order to get a uniform size for all images for the CNN algorithm and to decrease the computational time of training, because the model is trained on a standard PC with limited hardware resources such as processor and memory. Finally, all the images contained in the dataset are resized to 127 × 127 pixels.
Figure 3.2. Resizing an image from 2592 × 1944 to 127 × 127 pixels
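A minimal sketch of this size normalization step is shown below; the directory layout and file extension are assumptions for illustration, not the actual organization of the dataset:

```python
from pathlib import Path
from PIL import Image

# Resize every image in the dataset to 127 x 127 pixels.
# 'dataset/raw' and 'dataset/resized' are hypothetical directory names.
src, dst = Path('dataset/raw'), Path('dataset/resized')
dst.mkdir(parents=True, exist_ok=True)
for path in src.glob('*.jpg'):
    Image.open(path).resize((127, 127)).save(dst / path.name)
```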
To keep an equal number of images in each category, the dataset is split randomly into training, validation, and test sets according to the ratio stated above. Using an equal number of images from each class for training and validation helps to avoid the problem of overfitting, because during training the updating of the weights will not be biased toward one of the categories.
Software Tools
An investigation of available software tools and their libraries was conducted in order to select the appropriate tool for the implementation of the CNN algorithm for Enset image classification. During the investigation, we saw that there are tools that are general to both deep learning and classical machine learning algorithms and tools that are specific to only one of them. Before selecting the tools, we considered some criteria that are helpful for selecting the appropriate software tools with their corresponding libraries. The main criterion is the choice of programming language that will be used to implement the algorithm. The other criteria are to select tools with enough learning materials (such as free video tutorials), existing experience, and the ability to run on machines with limited resources (CPU only). The software tools used to implement the CNN algorithm are Python as the programming language with the TensorFlow and Keras libraries in the Anaconda environment. These tools fulfill all the consideration criteria, and they are used with Python, which is familiar to us.
Anaconda (https://www.anaconda.com/download/) is used for the implementation of the model. It is a free and open-source distribution of the Python and R programming languages for data science and machine learning applications that aims to simplify package management and deployment. It contains different IDEs that can be used to write the code, such as Jupyter Notebook and Spyder. We have used Jupyter Notebook to implement the code; it is easy to use and runs in a web browser.
TensorFlow (https://www.tensorflow.org/install/install_windows) is a free and open-source library developed by Google, and it is currently among the most popular deep learning libraries [21]. It can be used on any desktop running Windows, macOS, or Linux; in the cloud as a service; and on mobile devices running iOS or Android. The TensorFlow architecture supports preprocessing the data, building the model, training the model, and evaluating the model. All computations in TensorFlow involve tensors (n-dimensional arrays), which represent all kinds of data. TensorFlow also uses a graph framework for the graphical representation of the series of computations during training. It has two distributions, for CPU and GPU.
Keras is a high-level neural network API written in Python which runs on top of either TensorFlow, Theano (a low-level Python library that runs on top of NumPy and is not used here for the direct implementation of the CNN), or the Microsoft Cognitive Toolkit (CNTK). It makes it very simple to develop a model, is user-friendly and easily extensible with Python, and, most importantly, contains pre-trained CNN models such as VGG16 and Inception that we use during the experiments. It allows easy and fast prototyping and supports both CNNs and RNNs, or a combination of the two [21].
Visio 2019 (https://www.microsoft.com/am-et/p/visio-professional-2019) is used for designing the system architecture. This tool is used to create, collaborate on, and share data-linked diagrams easily with ready-made templates, helping to simplify complex information.
Hardware Tools
A Sony Cyber-shot DSC-W230 (12.1 megapixel) digital camera was used to capture the sample images in the field. To implement the CNN algorithm with the selected software tools, a very slow machine with an Intel(R) Core(TM) i5-5200 CPU @ 2.20GHz processor and 8 GB of memory was used, with no GPU, which is the most important hardware for deep learning in computer vision research.
Evaluation Technique
After training our model, we need to know how the model generalizes to never-before-seen data. This helps us to say whether the model classifies new data well, or whether it does well only on the training data (memorizing the data fed to it before) but not on new data. Therefore, model evaluation is the process of estimating the generalization accuracy of the model on unseen data (in our case, test data). It is not recommended to use training data for evaluating a model, because the model remembers the data samples fed to it during training, i.e. it predicts correctly for the data points in the training set but not necessarily for data it has not seen during training. In this thesis, classification accuracy metrics are used, which is the recommended technique for classification problems where all the classes of the dataset have the same number of samples [21]. In this technique, the dataset is divided into training, validation, and testing sets. During training, we feed the validation split to the model to get performance metrics; the model returns the accuracy and loss on the training data and the accuracy and loss on the validation data, i.e. training accuracy, validation accuracy, training loss, and validation loss. So, we can plot loss and accuracy graphs with respect to epochs using these metrics. Finally, the testing data (images that have not been used in either the training or validation sets) is given to the trained model to test its performance, and the model returns the accuracy and loss on the testing data, which it never saw during training.
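A hedged sketch of this evaluation workflow in Keras follows; `model` and the data arrays are assumed to be defined elsewhere, and the 20% validation split is illustrative:

```python
import matplotlib.pyplot as plt

# `model`, `train_x`, `train_y`, `test_x`, and `test_y` are assumed
# to be defined elsewhere; this only sketches the workflow.
history = model.fit(train_x, train_y,
                    validation_split=0.2,  # hold out part of the training data
                    epochs=30, batch_size=32)

# Plot accuracy with respect to epochs from the returned history.
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()

# Final generalization estimate on images never seen during training.
test_loss, test_acc = model.evaluate(test_x, test_y)
```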
CHAPTER FOUR
DESIGN AND EXPERIMENT
This chapter focuses on the design of the proposed model and its experimental setup. Specifically, the design of the proposed model, how features are extracted, and how classification is performed in the proposed model and in other pre-trained models using the technique called transfer learning are described briefly.
Model Selection
A deep learning algorithm, the CNN, is chosen based on the literature on computer vision, especially image classification. CNNs represent an interesting method for adaptive image processing. The algorithm is used for feature extraction, classification, training, and testing, as well as for evaluating the accuracy of the model. CNNs take raw data, without the need for a separate preprocessing or feature extraction stage; the feature extraction and classification stages occur naturally within a single framework.
The main advantage of using the CNN algorithm for the detection of BWE is that it is more robust and automated than classical machine learning algorithms [21]. In classical machine learning there is a need to develop different algorithms for different problems, and therefore it relies on more handcrafted components; with CNNs, once we have developed an algorithm for the detection of Bacterial Wilt in Enset crop, it can be applied to other related plants like banana and cassava, so it is easier to generalize and reuse for different but related problems [20]. Some of the main reasons that CNNs are used in this thesis are:
• A lot of previous research has shown that CNNs outperform other classification algorithms and are the state of the art for computer vision applications.
• CNNs are designed by emulating the human understanding of vision, so for image-related tasks CNNs are better than other deep learning models.
• Most classical machine learning approaches require explicit extraction of the features to be learned from the image before classification and prediction.
• Most neural network algorithms only accept vectors (1D), and most real-world images are tensors (3-dimensional), so there is a need to flatten the actual image (input) to a 1D vector, which is difficult and computationally expensive; CNNs accept 3-dimensional images directly.
• CNNs can capture temporal and spatial dependencies with the help of relevant kernels.
Figure 4.1. The training and testing phases of the proposed approach (training and validation data, image preprocessing, augmentation, model training, evaluation, and the testing phase)
The model uses the augmented data for training and the original validation data to produce performance metrics. Inside the stacked layers of the CNN, useful features of each image are extracted and classification based on the extracted features is performed; this process is called model training. During the training of the model, we can assess its performance using the validation dataset, which is basically used to measure the performance of the model. After assessing performance, the best-performing model is saved and used as the predictive model. Then the testing phase is performed by giving images unseen during training to the predictive model. The model finally gives a class prediction, which is the probability of the image belonging to one of the classes given during training (in our case, diseased or healthy).
Figure 4.2. Proposed model
Output size (W2) = (W1 − F + 2P)/S + 1    (4.1)
Where: W1 is the size of the input volume, F is the filter size, P is the amount of zero padding, and S is the stride.
The initial spatial size of the input volume (W1) is 127 × 127 × 3, and it changes after each convolution operation; the initial filter size F and stride S are 5 × 5 × 3 and 2, respectively, and these values change after some convolution and pooling operations; there is no zero padding in our network, so the value of P is always zero throughout the model. In the following table, all of the parameters in each layer are described according to Equation (4.1) above.
The spatial parameters have mutual constraints. For example, when the input volume has size W1 = 10, no zero padding is used (P = 0), and the filter size is F = 3, then it is impossible to use stride S = 2, since Equation (4.1) gives 4.5, which is not an integer, indicating that the neurons do not “fit” neatly and symmetrically across the input. We considered this constraint when resizing the images contained in our dataset. If this arrangement is not respected, the libraries used to implement the CNN model will either throw an exception, zero-pad the rest of the area, or crop the image to make it fit.
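The check can be written as a small helper implementing Equation (4.1); this is an illustration, not code from the implementation:

```python
def output_size(w, f, p, s):
    """Equation (4.1): spatial output size of a convolution layer."""
    size = (w - f + 2 * p) / s + 1
    if size != int(size):
        raise ValueError(f'invalid setting: neurons do not fit ({size})')
    return int(size)

print(output_size(127, 5, 0, 2))  # 62: the first conv layer of the model
print(output_size(10, 3, 0, 2))   # raises ValueError: (10 - 3)/2 + 1 = 4.5
```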
Input layer: the input layer of our CNN model accepts RGB images of size 127 × 127 × 3 belonging to two different classes (diseased and healthy). This layer only passes the input to the first convolution layer without any computation; therefore, there are no learnable features, and the number of parameters in this layer is 0.
Convolutional layer: the proposed model has five convolutional layers. The first convolutional layer filters the 127 × 127 × 3 input image using 32 kernels of size 5 × 5 × 3 with a stride of 2 pixels. Since (127 − 5)/2 + 1 = 62, and since this layer has a depth of K = 32, the output volume of this layer is 62 × 62 × 32. The product of the output volume dimensions gives the total number of neurons in the first conv layer, which is 123,008. Each of the 62 ∗ 62 ∗ 32 neurons in this volume is connected to a region of size 5 × 5 × 3 in the input volume. To control the number of parameters in the convolution layers, parameter sharing is used. Without it, there are 62 ∗ 62 ∗ 32 = 123,008 neurons in the first convolution layer, each with 5 ∗ 5 ∗ 3 = 75 weights and 1 bias; together this adds up to 123,008 ∗ 75 = 9,225,600 weights in the first layer of the model. Clearly, this number is very high and impossible to train on our machine. Here the concept of parameter sharing, one of the advantages over the traditional neural network, comes in. With parameter sharing, if one feature is useful at some spatial location, say (x, y), then it should also be useful at some other position (x2, y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice, the volume of size 62 × 62 × 32 has 32 depth slices, each using a 5 × 5 × 3 filter, and we make the neurons in each depth slice use the same weights and bias. With this parameter sharing, the first convolution layer of the proposed model has only 32 unique sets of weights (one per depth slice), for a total of 32 ∗ 5 ∗ 5 ∗ 3 = 2,400 unique weights, or 2,432 parameters after adding the 32 biases, and all 62 ∗ 62 neurons in each depth slice have the same parameters. The output volume and parameters of each learnable layer in the proposed model are described in Table 4.1 below.
The second convolutional layer takes as input the (pooled) output of the first convolutional layer and filters it using 32 kernels of size 3 × 3 × 32. The third, fourth, and fifth convolutional layers are connected to each other without intervening pooling layers. The third convolutional layer takes as input the output of the second pooled convolutional layer and filters it with 64 kernels of size 3 × 3 × 32. The fourth convolutional layer has 64 kernels of size 5 × 5 × 64, and the fifth convolutional layer has 64 kernels of size 3 × 3 × 64. All the convolutional layers of the proposed model use the ReLU non-linearity as their activation function. ReLU is chosen because it is faster than other non-linearities, such as tanh, for training deep CNNs with gradient descent [33].
Pooling layer: there are three max-pooling layers, after the first, second, and fifth convolutional layers of the proposed model. The first max-pooling layer reduces the output of the first convolutional layer with a filter of size 3 × 3 and stride 1. The second max-pooling layer takes as input the output of the second convolutional layer and pools using 2 × 2 filters with stride 1. The third max-pooling layer has a filter of size 2 × 2 with stride 2. These layers have no learnable features; they only perform a down-sampling operation along the spatial dimensions of the input volume, hence the number of parameters in these layers is 0.
Fully Connected (FC) layer: the proposed model has three fully connected layers, including the output layer. The first two fully connected layers have 64 neurons each, and the final layer, which is the output layer of the model, has only one neuron. The first FC layer accepts the output of the fifth conv layer after the 3D volume of data has been converted into a vector (flattening). This layer computes the class score, and the number of neurons in the layer is predefined during the development of the model. It works the same as an ordinary NN and, as the name implies, each neuron in this layer is connected to all the values in the previous layer.
Output layer: the output layer is the last layer (the third FC layer) of the model, and it has 1 neuron with a sigmoid activation function, because the model is designed to classify 2 classes (binary classification).
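A minimal Keras sketch of the architecture as described above follows; the strides of the middle convolutions and other unstated details are assumptions, so the exact parameter count may differ slightly from Table 4.1:

```python
from tensorflow.keras import layers, models

# Sketch of the proposed model. Where a stride is not stated in the text,
# the Keras default of 1 is assumed.
model = models.Sequential([
    layers.Conv2D(32, (5, 5), strides=2, activation='relu',
                  input_shape=(127, 127, 3)),   # -> 62 x 62 x 32
    layers.MaxPooling2D((3, 3), strides=1),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2), strides=1),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Conv2D(64, (5, 5), activation='relu'),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),      # diseased vs. healthy
])
model.summary()
```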
As we can see in the table above, the proposed model has 763,681 parameters, which is extremely small compared to other deep learning architectures such as AlexNet with 60 million parameters, VGG with 138 million parameters, and GoogLeNet with 4 million parameters. Deep learning models are considered to have massive numbers of parameters, and therefore to need huge computational power and very large amounts of data to train from scratch. But the proposed model is trained with a minimal amount of resources and data, and it performs very well.
Hence, the proposed model gives the output (the predefined classes) based on the color features of the input image, which are learned during training. When training a CNN, the network learns what type of features to extract from the input image. As discussed in Section 2.5.1, features are extracted by the convolution layers of the CNN, and feature extraction is the main purpose of these layers. These layers have a series of filters or learnable kernels (Figure 4.3) which aim to extract local features from the input image.
Mi = Σk (wik ∗ xk) + b    (4.2)
Where: wik is the ith filter applied to the kth channel of the input, xk is the kth channel of the input image, and b is the bias term.
The features in this case capture different color patterns of the given image. Each value of the feature map is then passed through an activation function to add non-linearity to the network. After the non-linearity, the feature map is fed into the pooling layer to reduce the resolution of the feature map and the computational complexity of the network. The process of extracting useful features from the input image consists of multiple similar steps, cascading convolution layers, non-linearities, and pooling layers.
Each neuron in the fully connected layer performs a dot product of the input data (the features extracted by the convolution and pooling layers) and its weights to produce a single value.
Figure 4.5. Transfer learning
There are two commonly used approaches to transfer learning. The first is to retrain the whole convolutional base of the pre-trained model and only replace the fully connected layers; the other is to train only part of the convolutional base, freezing the remaining weights of the pre-trained model, and to replace the fully connected layers. Training only some parts of the convolutional base is called fine-tuning. We have trained two pre-trained models, namely VGG16 and InceptionV3, using our dataset, and compared the results with the proposed model.
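A hedged Keras sketch of the second approach (fine-tuning by freezing the early conv blocks, as done later with VGG16) could look as follows; the size of the new fully connected head is an assumption:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 without its fully connected top and attach a new binary head.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(127, 127, 3))
for layer in base.layers:
    if not layer.name.startswith(('block4', 'block5')):
        layer.trainable = False  # freeze the first three conv blocks

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation='relu'),     # hypothetical head size
    layers.Dense(1, activation='sigmoid'),   # diseased vs. healthy
])
```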
Experimental Setup
Three scenarios are considered in the experiments of this thesis. The first two scenarios classify the images with a transfer learning approach, and the third scenario proposes a CNN model based on the VGG16 architecture. For transfer learning, the well-known CNN architectures VGG16 [45] and InceptionV3 [46], which won the largest image classification competition, are chosen; these models were trained on millions of images and thousands of classes. The proposed model is a modified version of the VGG16 model that dramatically decreases the 138 million parameters to 763,681 (Table 4.1).
4.7.1 Augmentation Parameters
The images are generated using the different augmentation parameters described in Table 4.2 below. In this way, a sufficient number of images is generated, because the dataset is extended using different augmentation techniques.
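As an illustration, a Keras ImageDataGenerator can produce such augmented images on the fly; the parameter values below are placeholders, not the actual settings of Table 4.2:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings (assumed values).
datagen = ImageDataGenerator(
    rescale=1. / 255,        # pixel values from [0, 255] to [0, 1]
    rotation_range=40,       # random rotations
    horizontal_flip=True,    # random mirroring
    zoom_range=0.2,          # random zoom
    width_shift_range=0.2,   # random horizontal shifts
    height_shift_range=0.2,  # random vertical shifts
)
# 'dataset/train' is a hypothetical directory with one subfolder per class.
train_flow = datagen.flow_from_directory(
    'dataset/train', target_size=(127, 127),
    batch_size=32, class_mode='binary')
```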
The following hyperparameters are considered during the experiments:
• Learning rate: we have experimented with different values of the learning rate. In our experiments, we have seen that a smaller learning rate takes longer to train than a larger one, but a smaller value produces a more optimal model than a larger learning rate. The experiments were done using learning rates of 0.001, 0.01, and 0.1; a learning rate of 0.001 was the optimal one across all of the experiments, even though it takes longer to train.
• Loss function: the choice of the loss function is directly related to the activation function used in the output layer (last fully connected layer) of the model and to the type of problem we are trying to solve (regression or classification). In the proposed model, sigmoid is used as the activation function in the last fully connected layer, and the problem we are solving is a classification problem, specifically binary classification. We have used Binary Cross-Entropy (BCE) as the loss function for our model. Although there are other loss functions, such as Categorical Cross-Entropy (CCE) and Mean Squared Error (MSE), binary cross-entropy is the recommended choice of loss function for binary classification [20, 21]. It performs well for models that output probabilities, i.e. it measures the distance between the actual output and the desired output. The experiments were done using both the BCE loss and the CCE loss.
• Activation function: experiments were conducted using two different activation functions, SoftMax and sigmoid, in the proposed model, and sigmoid performed better. In the output layer of the model, the sigmoid activation function is used because it is the best choice for a binary classification problem [21, 20].
• Number of epochs: the number of times the entire dataset passes forward and backward through the network. In our experiments, the model was trained using different numbers of epochs, from 10 to 150. During training, we saw that when we use too small or too large a number of epochs, the model gets a high gap between the training error and the validation error. After many experiments, the model was optimal at thirty (30) epochs.
• Batch size: the number of input samples we pass through the network at once. It is too hard to give all the data to the computer in a single step, so we need to divide the input into several smaller batches; this is preferred in model training to minimize the computational time of the machine. A batch size of 32 is used during model training in our experiments (see the sketch below).
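Putting these hyperparameters together, a hedged compilation and training sketch could look as follows; the Adam optimizer is an assumption, since the optimizer itself is not named in this list:

```python
from tensorflow.keras.optimizers import Adam

# `model` and `train_flow` are assumed to be defined as sketched earlier.
model.compile(optimizer=Adam(learning_rate=0.001),  # optimizer is an assumption
              loss='binary_crossentropy',           # BCE for the sigmoid output
              metrics=['accuracy'])
model.fit(train_flow, epochs=30)  # batch size of 32 is set in the generator
```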
CHAPTER FIVE
EXPERIMENTAL RESULTS
To classify the input image, the color features of the image were used, as described in detail in Section 4.4. The main reason that color features were chosen to classify the images is that, when we look at an image, we can simply tell whether it is healthy or diseased. Three different classification scenarios were conducted during the experiments to test classification performance.
The first two scenarios are based on pre-trained CNN models, and the third one uses the proposed model. Like most deep learning classification pipelines, our experiments have two main phases: the training phase and the testing phase. In the training phase, data is repeatedly presented to the classifier while the weights are updated to obtain the desired response. In the testing phase, the trained algorithm is applied to data never seen by the classifier (test data) to test the performance of the classification algorithm. In the following, we present the experimental results in detail.
Pre-trained CNN
Two pre-trained CNN models, VGG and InceptionV3, which are widely used pre-trained architectures on ImageNet, are used and fine-tuned. The VGG model is chosen because of its simplicity, and the Inception model is used because of its more complicated structure. Experiments are therefore conducted on both a relatively simple model and a complex one to get the classification accuracy of these models on our dataset. All the experiments are conducted on the same dataset with the same hyperparameter settings.
5.2.1 Detection of BWE by using VGG16 Pre-trained Model
The VGG model is characterized by its simplicity, using only 3 × 3 convolution layers stacked on each other in increasing depth. There are two versions of the VGG model: VGG16 and VGG19. VGG16 has 16 weight layers, and VGG19 has 19 weight layers. The model accepts 224 × 224 RGB images as input and outputs the 1000 classes of the ImageNet dataset (the ILSVRC subset contains over a million images belonging to 1000 classes) [45]. The input is passed through the stacked convolution layers of the model with a 3 × 3 receptive field, followed by the ReLU non-linearity. The model uses a stride of 1 and spatial padding of 1 for all of the 3 × 3 convolutions. After every 2 or 3 consecutive convolution layers, there is a max-pooling layer with a window size of 2 × 2 and stride 2 to reduce the spatial size of the output of the convolution layers. There are a total of 16 convolution layers in the VGG19 architecture and 13 convolution layers in the VGG16 architecture, with 5 max-pooling layers in both. Finally, for the classification, there are 3 fully connected layers that follow the stack of conv layers. The first two have a channel depth of 4096, and the final layer has a channel depth of 1000, equal to the number of classes in the ImageNet dataset, with a SoftMax activation function.
In our experiments, down-sampled RGB images of size 127 × 127 are given as input to the model, and the model is fine-tuned to give the 2 output classes of our dataset. The original VGG16 model has a total of 138 million parameters, which is very large. We trained the model with 15,894,849 parameters, because the spatial dimension of the images in our model is smaller and we only trained some parts of the model. As discussed in the previous sections, we fine-tuned the VGG16 model using only the conv base of the network. We conducted several experiments in order to find the optimal pre-trained model by training different conv blocks. When the model was trained using the entire conv base of the network and only the fully connected layers were changed, the result showed high overfitting. Overfitting happens because the model weights were trained on millions of images different from our dataset and on thousands of classes, and we tried to retrain the model using only the 4896 original images. Hence, we needed to update some of the weights of the network and increase the number of images using the data augmentation technique. Therefore, we decided to freeze some of the layers (conv blocks) of the model and conduct different experiments using the augmented data. After several experiments, we noticed that freezing the first 3 conv blocks is optimal compared to freezing the first 2 or 4 conv blocks when using the 96,000 images generated by the augmentation techniques. The training of the network was performed using the hyperparameters described in Table 4.3 above. The experiment yielded a mean training accuracy of 96.7% and a mean test accuracy of 92.4%.
The following two plots show the classification accuracy and loss with respect to epochs, using the classification accuracy metrics (training and validation accuracy, training and validation loss) of the VGG16 pre-trained model in the experiment where we made some changes to the original pre-trained model so that it could classify our dataset well. The training accuracy at the first epoch is around 84%; it increases steadily and passes 95% at epoch 5. Between epochs 5 and 10, the training accuracy stays above 95%, and after the 14th epoch the accuracy exceeds 97%. As we can see in the graph, the accuracy gets high within the first few epochs; this is because of the dataset: the patterns in the crop images in our dataset are clearly visible even to the human eye, and they are easy for the CNN model to differentiate. In general, as we can see in the following plots, the validation accuracy curve is almost in sync with the training accuracy curve, and at the same time the validation loss curve is in sync with the training loss curve. Even though the validation accuracy and validation loss curves are not perfectly smooth, they show that the model is not overfitting; in other words, the validation loss is decreasing, not increasing, and the validation accuracy is increasing, not decreasing.
Figure 5.1. Training and validation accuracy for VGG16 Pre-trained model
Figure 5.2. Training and validation loss for the VGG16 pre-trained model
The results obtained from the experiment with the pre-trained VGG16 model are presented in the following table using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately.
5.2.3 Detection of BWE by using InceptionV3 Pre-trained Model
Inception is an efficient deep CNN architecture for computer vision developed by Google as GoogLeNet, and it derives its name from the famous internet meme “We need to go deeper” [46]. This architecture proposes a deeper network (a large number of layers and a large number of neurons per layer) with less computational power. There is one thing to consider with deeper networks: when we increase the number of layers, the network becomes more prone to overfitting, and when we increase the number of neurons in each layer, it needs more computational resources. The Inception model addresses this problem by introducing a sparsely connected network (filters of multiple sizes in the same layer, as shown in Figure 5.3 [46]) that replaces the FC layer, especially inside the convolution layers, and this approach lets us maintain the computational cost while increasing the depth of the network.
This model was trained on the ImageNet dataset, accepting images of size 299 × 299 × 3 as input and giving a final output of 1000 classes. It has a total of 42 layers, yet it is computationally faster than the VGG model even though VGG has only 16 or 19 layers.
In our experiment, the Inception pre-trained model was trained on our dataset of 127 × 127 color images; the entire model was trained without any fine-tuning technique in the conv base, only changing the output to 2 classes. The total number of images given to the network was 4896, and it gives a promising output without an overfitting problem.
5.2.4 Result Analysis of InceptionV3
In the following plot, the training accuracy in the first epoch is around 70% and the validation accuracy is around 75%. Both validation and training accuracy increase, as we can see from the values at epoch 5, and after epoch 10 the values get higher. The validation accuracy increases almost linearly without decreasing, while the validation loss decreases almost linearly without increasing, and there is not much of a gap between the training and validation accuracy and loss. Therefore, there is no overfitting problem in the model when trained on our dataset.
The results obtained from the experiment with the pre-trained InceptionV3 model are presented in the following table using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately.
The results obtained from the experiments with the proposed model, using different training and testing split ratios, are presented in the following table using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately.
Table 5.3. Result of experiments by using different training and testing dataset ratio
The proposed model gives promising results with the different training and testing dataset ratios, as we can see in Table 5.3. Among those experiments, using 80% of the data for training and 20% for testing performs best. A ratio of 8:2 means that 80% of the whole dataset is used for training and 20% for testing. In addition, the validation data is taken from the training data: with the 6:4 ratio, the validation data is taken as 40% of the training data (not of the whole dataset); with the 7:3 ratio, the validation data is taken as 30% of the training data; and so on.
The results obtained from the experiments with the proposed model using different learning rates are presented in the following table, using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately. As we can see in the results, higher learning rates give lower accuracy than smaller learning rates; therefore, a learning rate of 0.001 is considered optimal for the proposed model.
Table 5.4. Result of the proposed model by using different learning rate
The results obtained from the experiments with the proposed model using different activation functions are presented in the following table, using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately.
Table 5.5. Results of the proposed model by using different activation functions
As we can see in Table 5.5, using the sigmoid activation function in the last fully connected layer (the output layer) for binary classification is better than SoftMax, which is preferred for multiclass classification problems.
Finally, the proposed model successfully classifies the given images with a mean training accuracy of 98.5% and a mean test accuracy of 97.86%, using a learning rate of 0.001, a sigmoid activation function in the output layer, and a training-to-testing dataset ratio of 8:2. We can see that the validation accuracy is in sync with the training accuracy and the validation loss is in sync with the training loss; the validation and training accuracy curves are nearly linear, and likewise the validation and training loss curves. The curves show that there is no overfitting in the proposed model, because the validation accuracy is increasing, not decreasing, the validation loss is decreasing, not increasing, and, most importantly, there is not much of a gap between the training and validation accuracy, nor between the training and validation loss. Therefore, we can say that our model generalizes well, since the loss on the validation set is only slightly higher than the training loss.
Figure 5.6. Training and validation accuracy of the proposed model
The results obtained from the experiment with the proposed model are presented in the following table using the classification accuracy metrics, in percentage form, for the training data, validation data, and test data separately.
Table 5.6. Mean accuracy and loss of the proposed model

Metric      Training    Validation    Test
Accuracy    98.49%      98.48%        97.86%
Loss        4.3%        5.7%          6.23%
Discussion
As presented in the previous sections, the experiments were conducted using three different CNN models: two pre-trained models and the proposed model. All of the experiments were conducted with the same hardware configuration. The number of images used to train each model differs according to the depth of the model and its number of parameters: of the pre-trained models, VGG16 was trained with a total of 96,000 images and InceptionV3 with 4896 images, while the proposed CNN model was trained with a total of 111,060 images. All of the models were tested with a separate dataset, unseen during training, and obtained good results. Classification accuracy metrics were used to measure the performance of the models, and when we compare the performance of the proposed model with the two pre-trained models and with the models described in the related works section (Table 2.1), the proposed model has better classification results.
As we can see in the following plots, the mean training accuracy of VGG16, InceptionV3, and the proposed CNN model is 96.6%, 91.7%, and 98.49%, respectively; the models are giving good results on the training dataset. The mean validation accuracy for VGG16, InceptionV3, and the proposed model is 96.8%, 91.5%, and 98.48%, respectively. The difference between mean training accuracy and mean validation accuracy for each of the three experiments is very small, and the mean training and validation accuracy are almost identical for the proposed model. This shows that there is no overfitting in the models, and we can say that the generalization ability of the proposed model is high. Turning to the mean training loss, which measures the inconsistency between the predicted and actual values, the three experiments (VGG16, InceptionV3, and the proposed model) obtained 7, 14.3, and 4.3, respectively. The mean validation losses are 7.7, 14, and 5.7, which are nearly the same as the mean training losses when we compute the difference between them.
Figure 5.8. Mean training, validation, and test accuracy (%): VGG16 (96.6, 96.8, 92.4), InceptionV3 (91.7, 91.5, 90.4), and the proposed model (98.49, 98.48, 97.86)
Figure 5.9. Mean training, validation, and test loss (%): VGG16 (7, 7.7, 22.1), InceptionV3 (14.3, 14, 27.2), and the proposed model (4.3, 5.7, 6.23)
The test losses of all the experiments are shown in Figure 5.9 above, and the value for
the proposed model is lower than those of the two pre-trained models: the test loss is
22.1% for VGG16, 27.2% for InceptionV3, and 6.23% for the proposed model, which is
a good result. Therefore, the proposed model performs well on both the training and the
testing datasets.
The main reasons the proposed model gives better results are, first, the dataset used to
train the model, whose images are easily classified by the human eye, and second, the
smaller-sized filters used in the convolution layers of the network. Smaller convolution
filters help identify the very small features that distinguish the input images, and the
probability of losing an important feature is very low.
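To illustrate the point about small filters, the following is a minimal sketch of a
convolutional network built from 3x3 filters in Keras; the layer sizes and input shape are
illustrative assumptions, not the exact architecture of the proposed model.

    from tensorflow.keras import layers, models

    model = models.Sequential([
        # Small 3x3 filters capture fine local features (leaf spots,
        # wilt streaks) with little risk of skipping over them.
        layers.Conv2D(32, (3, 3), activation='relu',
                      input_shape=(224, 224, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # healthy vs. diseased
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])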
Most deep learning algorithms, especially computer vision models for image
classification, are trained on high-performance computing machines with fast GPUs,
using huge numbers of images (in the millions) and tens of millions of parameters. Our
results show, however, that good results can also be obtained with small networks that
have fewer parameters, consume less hardware, and need less data.
More accurate results would be obtained if the images in the dataset were captured under
stable environmental conditions, that is, at a stable distance from the object to the camera
and with proper lighting and focus. In addition, preprocessing the images to remove
noise and unwanted features would increase the accuracy of the model.
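As an example of such preprocessing, the following is a minimal sketch of denoising a
leaf image with OpenCV before feeding it to the model; the file path and filter
parameters are illustrative assumptions.

    import cv2

    # Load a leaf image (hypothetical path) and apply non-local means
    # denoising, which removes noise while preserving edges so that
    # disease features are not smoothed away.
    image = cv2.imread('enset_leaf.jpg')
    # Arguments: src, dst, h, hColor, templateWindowSize, searchWindowSize.
    denoised = cv2.fastNlMeansDenoisingColored(image, None, 10, 10, 7, 21)
    resized = cv2.resize(denoised, (224, 224))  # match the model input size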
CHAPTER SIX
Conclusion
Nowadays Enset production suffers from a severe problem, the Bacterial Wilt disease,
which reduces the production and quality of Enset yield. Besides, the shortage of
diagnostic tools in developing countries like Ethiopia has a devastating impact on their
development and quality of life. Therefore, there is an urgent need to detect the disease at
an early stage with affordable, easy-to-use technological solutions. To make early
identification of the disease possible, we have proposed and implemented a deep
learning approach using a CNN algorithm. We have presented a CNN model that
identifies and classifies Bacterial Wilt of Enset using leaf images of the crop as input.
The proposed CNN model can be used as a tool to identify Bacterial Wilt disease of
Enset.
The first contribution of this thesis to the research community and the wider population
is the design and development of a CNN model that correctly detects and classifies the
well-known Enset disease, Bacterial Wilt, using images taken in real scenes and under
challenging conditions such as complex backgrounds, varying image resolution, and
different illumination and orientation. The second main contribution of this thesis is a
well-organized and managed dataset of Enset images. To accomplish these, we
conducted several experiments using pre-trained models and the proposed model.
During the experiments we used images collected directly from farms with the help of
agricultural experts. We trained the two pre-trained models, namely VGG16 and
InceptionV3, and the proposed model. After several experiments, all of the models
achieved good classification results. The VGG16 model gives a training accuracy of
96.6% and a testing accuracy of 92.4%, the InceptionV3 pre-trained model gives a
training accuracy of 91.7% and a testing accuracy of 90.4%, and the proposed model
gives a training accuracy of 98.49% and a testing accuracy of 97.86%.
The results of our experiments show that the proposed CNN model can significantly
support accurate detection of Bacterial Wilt of Enset with little computational effort and
with far fewer images than is usually expected for deep learning algorithms, since most
deep learning algorithms are trained on millions of images using high computational
resources. We are encouraged by the results obtained from the experiments and intend to
extend the work by testing our model on more Enset diseases.
Recommendations
Since the Ethiopian economy depends on agriculture and agricultural products,
protecting crops from disease should be a main aim of the agricultural sector. Image
analysis techniques are of paramount importance in the early identification and
classification of crop diseases. In Ethiopia, no research had been conducted so far on the
detection of Bacterial Wilt on the Enset crop using image analysis techniques; hence this
thesis may encourage researchers to work more in the area. Image analysis, especially
using deep learning techniques, for the identification of Enset Bacterial Wilt can be
further investigated.
In the future, we are interested in training and testing our model to detect other Enset
diseases such as black Sigatoka and leaf speckle. We also recommend testing the model
with a larger number of images and a more complex configuration, increasing the
number of layers and the number of parameters in each layer in order to extract very
complex features from the images. It would also be worthwhile to train on the dataset
with other pre-trained deep learning models such as ResNet, which were not included in
our experiments due to limited computational resources such as GPUs; a sketch of such a
setup is given below.
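The following is a minimal sketch of how such a ResNet50 transfer-learning experiment
could be set up in Keras; the freezing strategy and layer sizes are illustrative
assumptions, not settings we have validated.

    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras import layers, models

    # Load ResNet50 pre-trained on ImageNet, without its classifier head.
    base = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pre-trained feature extractor

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # healthy vs. diseased
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])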
In addition, we plan to make the model estimate the severity of the disease automatically,
to help farmers decide whether and how to intervene against the disease. To apply this
research in the field of agriculture, we recommend developing a mobile app that takes a
picture of an Enset plant, gives automatic results about the severity of the disease in the
captured image, and provides helpful expert advice to the user.
Appendix A: Experiment of Proposed Model