
ADDIS ABABA SCIENCE AND TECHNOLOGY UNIVERSITY

DEVELOPING BACTERIAL WILT DETECTION


MODEL ON ENSET CROP USING A DEEP
LEARNING APPROACH
By

YIDNEKACHEW KIBRU AFEWORK

A Thesis Submitted in Partial Fulfillment of the Requirements for the

Award of the Degree of Master of Science in Software Engineering

to

DEPARTMENT OF SOFTWARE ENGINEERING

COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING

OCTOBER, 2019 GC
Approval
This is to certify that the thesis prepared by Mr. Yidnekachew Kibru Afework entitled
“Developing Bacterial Wilt Detection Model on Enset Crop Using A Deep Learning
Approach” and submitted in partial fulfillment of the requirements for the degree of
Master of Science complies with the regulations of the University and meets the accepted
standards with respect to originality, content, and quality.

Signed by the Examining Board

Main Advisor : Signature, Date

__________________________ _______________ __________

Co-Advisor: Signature, Date

__________________________ _______________ __________

External Examiner: Signature, Date

__________________________ _______________ __________

Internal Examiner: Signature Date

__________________________ _____________ ___________

Chairperson: Signature Date

_________________________ ______________ __________

DGC Chairperson: Signature Date

____________________________ __________ __________

College Dean/Associate Dean for GP: Signature Date

________________________________ ___________ __________

Declaration
I hereby declare that this thesis entitled “Developing Bacterial Wilt Detection Model on
Enset Crop Using A Deep Learning Approach” was prepared by me, with the guidance
of my advisor. The work contained herein is my own except where explicitly stated
otherwise in the text, and this work has not been submitted, in whole or in part, for any
other degree or professional qualification.

Author: Signature Date:

Yidnekachew Kibru _________ _______

Witnessed by:

Name of student advisor: Signature Date

Vuda Sreenivasa Rao (PhD) _________ __________

Name of student co-advisor: Signature Date

Taye Girma (MSc) _________ __________

Dedication

To my mom: ክብነሽ ወልደ መስቀል

Abstract
Ethiopia is one of the African countries with huge potential for the development of
different varieties of crops. Many cultivated crops are used as staple foods in different
regions of the country. One of these crops is Enset, which is used as a staple food by
around 15 million people in the central, south, and southwestern regions of Ethiopia. The
Enset crop is affected by diseases caused by bacteria, fungi, and viruses. Among these,
bacterial wilt of Enset is the most significant constraint to Enset production.

Identification of the disease needs special attention from experienced experts in the area,
and it is not possible for plant pathologists to reach each and every Enset crop to observe
the disease, because the crop is physically big. Thus, developing a computer vision model
that can be deployed on drones to automatically identify the disease can help support the
communities that cultivate the Enset crop. To this end, a deep learning approach for
automatic identification of Enset bacterial wilt disease is proposed. The proposed approach
has three main phases. The first phase is the collection of healthy and diseased Enset
images, with the help of agricultural experts, from different farms to create a dataset. Then
a convolutional neural network that can classify a given image as healthy or diseased is
designed. Finally, the designed model is trained and tested on the collected dataset and
compared with different pre-trained convolutional neural network models, namely VGG16
and InceptionV3.

The dataset contains 4896 healthy and diseased Enset images. Of these, 80% of the images
are used for training and the rest for testing the model. During training, data augmentation
techniques are used to generate more images to fit the proposed model. The experimental
results demonstrate that the proposed technique is effective for the identification of Enset
bacterial wilt disease. The proposed model can successfully classify a given image with
a mean accuracy of 98.5%, even though the images are captured under challenging
conditions such as varying illumination, complex backgrounds, and the different
resolutions and orientations of real-scene images.

Keywords: Convolutional Neural Network, Deep Learning, Image Classification, Disease Detection, Enset Bacterial Wilt.

Acknowledgments
First and foremost, I would like to express my deepest gratitude to the Almighty God for
the blessing with which this thesis has successfully been concluded. After that, I would
like to thank my advisor, Dr. Sreenivasa Rao, for his advice throughout this thesis.

I would also like to express my gratitude to my co-advisor, Mr. Taye Girma, for his help
and constructive guidance from the implementation through to the completion of the work.
Many thanks and appreciation go to him; the discussions with him always made me
think that things are possible.

I am also very thankful to Mr. Abebe, a plant science expert at Mihurna Alkil wereda,
Gurage zone, SNNPR, and Mr. Sabura Shara, a Ph.D. scholar at Arbaminch University,
for giving me expert advice about the bacterial wilt disease of Enset and for helping me
obtain images of Enset affected by the disease.

Finally, I would like to thank my family, who always supported me throughout these years.
I acknowledge the constant hard work and moral support of my loving brother, Mr. Tilahun
Kibru. He always supported and encouraged me to achieve new goals.

Table of Contents

Approval ............................................................................................................................. ii
Declaration ......................................................................................................................... iii
Abstract ............................................................................................................................... v
Acknowledgments.............................................................................................................. vi
Abbreviations and Acronyms ............................................................................................. x
List of Tables ..................................................................................................................... xi
List of Figures ................................................................................................................... xii
Chapter 1 ....................................................................................................................... 1
Introduction ....................................................................................................................... 1
1.1 Background .......................................................................................................... 1
1.2 SWOT Analysis.................................................................................................... 2
1.3 Motivation ............................................................................................................ 4
1.4 Statement of the Problem ..................................................................................... 5
1.5 Objectives ............................................................................................................. 6
1.5.1 General Objective ......................................................................................... 6
1.5.2 Specific Objectives ....................................................................................... 6
1.6 Scope and Limitation of the Study ....................................................................... 7
1.7 Significance of the Study ..................................................................................... 7
1.8 Organization of the Thesis ................................................................................... 8
Chapter 2 ....................................................................................................................... 9
Literature Review ............................................................................................................. 9
2.1 Enset Crop ............................................................................................................ 9
2.2 Enset Bacterial Wilt ........................................................................................... 10
2.3 Machine Learning .............................................................................................. 11
2.4 Artificial Neural Network .................................................................................. 13
2.4.1 Multi-Layer Networks ................................................................................ 14
2.4.2 Backpropagation Algorithm........................................................................ 15
2.4.3 Activation Function .................................................................................... 16
2.5 Deep Learning .................................................................................................... 18
2.5.1 Convolutional Neural Network ................................................................... 21
2.5.2 CNN Architectures...................................................................................... 28
2.5.3 Application of CNN in Crop Disease Detection ......................................... 30
2.6 Related Works .................................................................................................... 32

2.7 Summary ............................................................................................................ 34
Chapter 3 ..................................................................................................................... 36
Research Methodologies ................................................................................................. 36
3.1 Research Flow .................................................................................................... 36
3.2 Data Preparation ................................................................................................. 37
3.2.1 Data Preprocessing...................................................................................... 37
3.2.2 Data Partitioning ......................................................................................... 38
3.2.3 Data Augmentation ..................................................................................... 39
3.3 Software Tools ................................................................................................... 39
3.4 Hardware Tools .................................................................................................. 40
3.5 Evaluation Technique ......................................................................................... 41
Chapter 4 ..................................................................................................................... 42
Design and Experiment .................................................................................................. 42
4.1 Model Selection.................................................................................................. 42
4.2 Overview of BWE Detection ............................................................................. 43
4.3 Training Components of the Proposed Model ................................................... 44
4.3.1 Proposed Model Description....................................................................... 45
4.4 Feature Extraction Using Proposed Model ........................................................ 48
4.5 Classification Using Proposed Model ................................................................ 50
4.6 Classification Using Pre-Trained Models .......................................................... 51
4.7 Experimental Setup ............................................................................................ 52
4.7.1 Augmentation Parameters ........................................................................... 53
4.7.2 Hyperparameter Settings ............................................................................. 53
Chapter 5 ..................................................................................................................... 56
Results and Discussions .................................................................................................. 56
5.1 Experimental Result ........................................................................................... 56
5.2 Pre-trained CNN ................................................................................................. 56
5.2.1 Detection of BWE by using VGG16 Pre-trained Model ............................ 57
5.2.2 Result Analysis of VGG16 ......................................................................... 58
5.2.3 Detection of BWE by using InceptionV3 Pre-trained Model ..................... 60
5.2.4 Result Analysis of InceptionV3 .................................................................. 61
5.3 Detection of BWE by using the Proposed CNN Model ..................................... 62
5.3.1 Scenario 1: Changing Training and Testing Dataset Ratio ......................... 62
5.3.2 Scenario 2: Changing Learning Rate .......................................................... 63

5.3.3 Scenario 3: Using Different Activation Functions ..................................... 63
5.3.4 Result Analysis for the Proposed BWE Detection Model .......................... 64
5.4 Discussion .......................................................................................................... 66
Chapter 6 ..................................................................................................................... 69
Conclusion and Recommendations ............................................................................... 69
6.1 Conclusion.......................................................................................................... 69
6.2 Recommendations .............................................................................................. 70
References ........................................................................................................................ 71
Appendix A: Experiment of Proposed Model .............................................................. 76

Abbreviations and Acronyms
ANN Artificial Neural Network
BCE Binary Cross-Entropy
BWE Bacterial Wilt of Enset
CNN Convolutional Neural Network
DBN Deep Belief Networks
DBSCAN Density-Based Spatial Clustering of Application with Noise
FC Fully Connected
FCN Fully Convolutional Network
GPU Graphics Processing Units
HOT Histogram of Template
ILSVRC ImageNet Large Scale Visual Recognition Challenge
KNN K-Nearest Neighbor
LSTM Long Short-Term Memory
ML Machine Learning
MSE Mean Squared Error
NLP Natural Language Processing
OCR Optical Character Recognition
ReLU Rectified Linear Unit
RGB Red, Green, and Blue
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SVM Support Vector Machine

List of Tables
Table 1.1 SWOT analysis of Enset Production .................................................................. 3
Table 2.1. Summary of related works ............................................................................... 34
Table 4.1. Summary of proposed model parameters ........................................................ 48
Table 4.2. Augmentation techniques used ........................................................................ 53
Table 4.3. Summary of hyperparameters used during model training .............................. 55
Table 5.1. Mean accuracy and loss of VGG16 pre-trained model.................................... 59
Table 5.2. Mean accuracy and loss of InceptionV3 pre-trained model ............................ 62
Table 5.3. Result of experiments by using different training and testing dataset ratio ..... 62
Table 5.4. Result of the proposed model by using different learning rate ........................ 63
Table 5.5. Results of the proposed model by using different activation functions ........... 63
Table 5.6. Mean accuracy and loss of the proposed model .............................................. 66

List of Figures
Figure 2.1. Example of Enset crop.................................................................................... 10
Figure 2.2. Example of healthy (left) and infected (right) Enset leaves .......................... 10
Figure 2.3 Map of regions that cultivate Enset in Ethiopia .............................................. 11
Figure 2.4. Example of single layer perceptron ................................................................ 14
Figure 2.5. Example of multilayer Network ..................................................................... 15
Figure 2.6. Example of CNN Architecture ....................................................................... 22
Figure 2.7. Example of Input volume and filter................................................................ 23
Figure 2.8. Example of the Convolution operation........................................................... 24
Figure 2.9. Example of convolution of a 3D input volume .............................................. 25
Figure 2.10. Example of convolution operation with 2 filters .......................................... 26
Figure 2.11. An Example of one convolution layer with activation function................... 26
Figure 2.12. Example of max pooling .............................................................................. 27
Figure 2.13. Example of fully connected Layer ............................................................... 28
Figure 3.1. Research flow ................................................................................................. 37
Figure 3.2. Resized image ................................................................................................. 38
Figure 4.1. Block diagram of the detection of Bacterial Wilt disease .............................. 43
Figure 4.2. Proposed model .............................................................................................. 45
Figure 4.3. Feature Extraction in the proposed model ...................................................... 49
Figure 4.4. Classification in the proposed model ............................................................. 50
Figure 4.5. Transfer learning ............................................................................................ 52
Figure 5.1. Training and validation accuracy for VGG16 Pre-trained model .................. 59
Figure 5.2. Training and validation loss for the VGG16 pre-trained model..................... 59
Figure 5.3. Example of the Inception module................................................................... 60
Figure 5.4. Training and validation accuracy of InceptionV3 pre-trained model ............ 61
Figure 5.5. Training and validation loss of InceptionV3 pre-trained model .................... 61
Figure 5.6. Training and validation accuracy of the proposed model .............................. 65
Figure 5.7. Training and validation loss of proposed model ............................................ 65
Figure 5.8. Mean accuracy of the three experiments ........................................................ 67
Figure 5.9. Mean Loss of the three experiments ............................................................... 67

CHAPTER 1
INTRODUCTION

1.1 Background
The Ethiopian economy depends mainly on agriculture. Nearly 85% of Ethiopian people
depend on agriculture as their principal means of livelihood [1]. In this context, agriculture
plays a vital role in the Ethiopian economy. In recent decades, agricultural production has
become much more important than in the past, when plants were only used to feed humans
and animals. Agriculture is also an important source of raw materials for many
agriculture-based industries.

Ethiopia is one of the African countries with huge potential for the development of
different varieties of crops. Many crops are used as main food sources in Ethiopia, one of
which is Enset (እንሰት). Enset belongs to the family Musaceae and is a herbaceous,
monocarpic crop. The physical appearance of Enset resembles that of the banana, but
Enset is taller and thicker, and most importantly the fruits of Enset are not edible; hence,
Enset is known as the ‘false banana’. Enset has a gigantic underground rhizome or corm
which is used for propagation. The corm produces suckers, and each sucker develops into
a new fruit-bearing Enset crop. In central, south, and southwestern Ethiopia, Enset is
considered the first cultivated food, food security, and cash crop. A total of 302,143
hectares of land is cultivated with Enset in Ethiopia [2], and the crop is used as human
food, animal forage, fiber, construction material, and medicine for 20% of the country’s
population. Most importantly, Enset is a staple food for more than 15 million people in
Ethiopia [3].

Enset production is affected by biotic and abiotic factors, such as diseases and insect pests,
which contribute to the low yield and low quality of Enset production. Among these
factors, diseases caused by bacteria, fungi, viruses, and nematodes are the most severe
biological problems, and among these, Bacterial Wilt of Enset (BWE) is the most
significant constraint to Enset production [4, 5].

Detecting diseases plays an important role in the field of agriculture because most plant
diseases are not easily visible when they first occur. To identify the disease affecting an
Enset crop, it is usually necessary to look at the plant closely; examine the leaves and
stems, and sometimes the roots; and do some detective work to determine the possible
causes of the disease. The identification of Enset disease, more specifically bacterial wilt
disease, can be done by plant pathologists (experts in the field of agriculture). However,
finding experts is expensive for farmers who are far from where the experts are located,
and this is the main weakness of the current practice. To minimize this problem and
properly identify bacterial wilt disease, it is possible to develop a computerized model that
detects the disease using computer vision and deep learning techniques.

The opportunity is that bacterial wilt disease produces symptoms on the leaves of the crop,
which are the main indicators of the disease in the field. Based on those symptoms, which
are directly visible on the crop leaves, a model that identifies the disease in the crop can
be developed. Therefore, the development of bacterial wilt disease detection is quite
useful. In this thesis, an automatic Enset bacterial wilt detection mechanism is developed
using computer vision techniques, more specifically deep learning algorithms.

1.2 SWOT Analysis
SWOT analysis is a strategic planning technique used to help a person or organization
identify the strengths, weaknesses, opportunities, and threats related to business
competition or project planning. One of the vital steps in the planning process is an
analysis of strengths, weaknesses, opportunities, and threats. Prior identification of
weaknesses and threats helps in choosing appropriate approaches for internal improvement
and in accounting for factors that may result in adverse impacts beyond the control of the
agriculture sector. Recognition of strengths and opportunities enables gaining the
maximum benefit from internal and external environments toward achieving the goals and
targets set.

Enset production and marketing have strong opportunities; however, there are some
problems and threats facing this golden crop, such as Bacterial Wilt (Xanthomonas
campestris) disease and the lack of improved technologies, planting materials, and
post-harvest processing technologies. The lack of improved Enset varieties and the
absence of external farm inputs may affect production. Farmers mostly rely on organic
farmyard manures to supply nutrients to the Enset plant, which may not be sufficient for
raising production. Another problem is that Enset production is done under a subsistence
farming system, and it is observed that it is not directly linked with the central market.
There are several gaps and weaknesses in the production, processing, and marketing of
kocho and bulla. Farming and post-harvest tools and implements are still traditional, with
low use efficiency. Moreover, the equipment used in Enset processing consists of very
traditional, locally made tools. This indicates that a lot of work is needed to improve the
processing methods. The gender-dependent division of work in Enset processing has a
negative impact on productivity. Traditional kocho storage methods and the bulla drying
process lead to losses of product. Marketing kocho and bulla in local markets is very liable
to losses and spoilage due to the lack of storage and market facilities [6, 7].
Table 1.1 SWOT analysis of Enset production

Strengths (S)
1. Valuable and multipurpose crop
2. High quality and delicious taste of the kocho and bulla produced
3. Hard-working, experienced, and enduring farmers
4. Cooperative working culture among both men and women through different traditional associations
5. Products very suitable for processing and upgrading, especially bulla

Weaknesses (W)
1. Lack of improved cultivars and fertilizers
2. High labor requirements
3. Working culture places a burden on women
4. Market facility problems, such as warehouses
5. Small production scale and low specialization
6. Underdeveloped farming and post-harvest processing technology
7. Weak support from the government and scientists

Opportunities (O)
1. Suitable climate for Enset production
2. Proximity to the central marketplace
3. Availability of different kinds of Enset cultivars
4. Highly demanded products in the local market
5. Potential export market, especially for bulla
6. Potential raw material for the textile and paper industries

Threats (T)
1. The new generation is not accepting the hard work needed for Enset production and is migrating to cities
2. Enset production requires high labor
3. Poor and old road conditions
4. High disease incidence (bacterial wilt)
5. Shortage of land, poor soil fertility, and unsuitable topography for agriculture

1.3 Motivation
Food, shelter, fiber, medicine, and fuel are provided by plants in our world. Green plants
produce our basic food; moreover, modern technologies allow humans to produce enough
food to meet the demand of our planet’s population. Many factors affect food security,
such as climate change and plant disease, and most plant loss is caused by plant disease.
The most important task in plant disease management is identifying the disease when it
first appears on the farm, which can be supported by computing techniques such as
computer vision and deep learning.

The main motivation for choosing this thesis is that my grandparents’ Enset farm was
destroyed by bacterial wilt when I was a grade 9 student. My grandpa and the people of
that village were desperately seeking a cure for the disease in the traditional way at the
time. Finally, my grandpa said to me, “Study hard, my son; you will find the medicine
when you grow up.” Now I have a chance to detect the disease early, before it spreads to
all of the crops on the farm and destroys the entire Enset farm.

In addition, computer vision currently provides many advantages in various fields. It is
applied in areas such as medical diagnosis, industrial automation, aerial surveillance
(biometrics), remote sensing (satellite observation of Earth), automated sorting and
grading of agricultural products, and more [8]. In developing countries, including
Ethiopia, the application of computer vision or image processing has not been used in a
significant manner. Using this technology for crop disease detection has many benefits.
Currently, Enset bacterial wilt disease identification is done in the traditional way by
agricultural experts. Therefore, the application of image technology to the detection of
bacterial wilt disease will be of paramount importance for increasing the yield of the
Enset crop in Ethiopia.

1.4 Statement of the Problem
In the central, south, and southwestern regions of Ethiopia, Enset is the main (in some
areas, the only) cultivated crop [3]. However, in the cultivation process there are a number
of challenges that affect the crop, such as diseases and pests; among these challenges,
bacterial wilt disease is the most significant. Like most crop diseases, the identification of
Enset bacterial wilt disease needs special attention and requires experienced experts in the
area [9]. Research has shown that the disease is causing a high amount of yield loss in
areas that cultivate Enset in Ethiopia [9]. Up to 80% of Enset farms in Ethiopia are
currently infected with Enset Xanthomonas wilt. The disease has forced farmers to
abandon Enset production, resulting in critical food shortages in the densely populated
areas of southern Ethiopia. This disease directly affects the livelihood of more than 20%
of farmers in the country [9, 10].

Most of the farmers in Ethiopia are uneducated and do not get correct and complete
information about the diseases of the Enset crop, so they need expert advice. Besides,
there is a shortage of resources and expertise on Enset pathology in the regions that
cultivate the crop. In addition, it is not possible for crop pathologists to reach every farm,
and since even the crop pathologists rely on manual eye observation, the manual prediction
method is not very accurate, is time-consuming, and takes a lot of effort. The other main
problem is that once the disease occurs, there is no medicine that can cure bacterial wilt
of Enset. The only current solution is to burn the infected plant and destroy the farm [11].

Although the identification of plant disease using computer vision has been studied for
more than 30 years and has given promising outputs, the advances achieved are not
enough [12]. In other words, there are many plants and diseases that are not addressed by
current technologies. Therefore, the work needs to be extended to address more diseases
and plants. Moreover, some previous disease identification and classification studies were
conducted under strict conditions. For example, the images used for training and testing
were taken under controlled conditions, such as in a laboratory with a proper lighting
system and a fixed angle of capture; sample images were collected from publicly available
databases instead of capturing real-world images from the field; and traditional image
processing techniques were used [13, 14, 6, 15, 16].

Therefore, there is a need to design an automatic disease detection model that assists
farmers in the early detection of the disease with greater accuracy. In the literature, many
works have been conducted to detect bacterial wilt disease in different plants, but these
methods have not yet been applied to the Enset crop. Thus, deep learning techniques are
needed to detect bacterial wilt disease in Enset. The computer vision approach is a
noninvasive technique that provides consistent, reasonably accurate, less time-consuming,
and cost-effective solutions for farmers to identify bacterial wilt disease. The following
research questions are formulated in this thesis.

1. How best can computer vision techniques be used to detect bacterial wilt disease
of Enset?
2. What method could be used to detect bacterial wilt disease of Enset?
3. How should datasets be collected in order to accomplish this task?

1.5 Objectives

1.5.1 General Objective


The main objective of this thesis is to design and develop an automatic Enset bacterial
wilt disease identification model using a deep learning approach.

1.5.2 Specific Objectives


In light of this general objective, the specific objectives of this thesis are:

• Review literature on previous studies of crop disease identification using computer vision techniques.
• Collect healthy and infected Enset leaf images directly from Enset farm.
• Select an appropriate methodology and tool to analyze the image dataset.
• Design a convolutional neural network model to detect the disease.
• Train the proposed model by using the collected dataset.
• Test the proposed model with an unseen dataset.
• Measure the performance of the proposed model.

1.6 Scope and Limitation of the Study
This thesis mainly concentrates on the design and development of a bacterial wilt disease
detection model for the Enset crop. The model uses healthy and infected leaf images of
Enset crops, collected from Enset farms, as the main input. Images were captured on
different farms in Mihurina Aklil woreda, Gurage zone, in the Southern Nations,
Nationalities, and Peoples’ Region (SNNPR), Ethiopia. The research work took nine (9)
months, from problem formulation to the experimental results. The sample images were
collected using a digital camera. This thesis is conducted only to detect bacterial wilt
disease on the Enset crop; it does not include other diseases such as Sigatoka (leaf spot).
The thesis is also restricted to one crop, Enset, even though bacterial wilt disease affects
other plants such as banana and cassava. After the detection of the disease on the crop,
recommending medicine or an appropriate treatment for the disease and estimating the
severity of the disease are beyond the scope of this thesis. The main limitation in
conducting this thesis is hardware resources, such as Graphics Processing Units (GPUs),
which are the most important resource when working on deep learning algorithms,
especially for image processing.

1.7 Significance of the Study

We all agree that technology is here to stay and there is no way we can avoid its use.
Therefore, we must use technology to improve our living conditions. Deep learning for
computer vision is one of the current technologies and is being used heavily for different
purposes. Because the physical appearance of Enset is very big, detection of bacterial wilt
disease with the naked eye needs a lot of time and effort; at the same time, it is less
accurate and applicable only in a limited area. In contrast, if automatic disease
identification techniques and methods that can be deployed on drones are used, detection
will take less time and effort, be more accurate, and cover a larger area.

In addition, the significance of this thesis is described as follows.

• In the first place, this thesis will enable agriculture experts to appreciate the
importance of computer vision in the field of agriculture.
• The thesis will help to achieve high yields, as the disease can be detected early
without finding agricultural experts.
• The thesis will help to reduce production costs that bring huge losses to farmers
due to the excessive use of pesticides on their crops.
• It will reduce the cost of experts for the continuous monitoring of crops on large farms.
• The outcome of this thesis will help authorities to provide proper measures in
situations where bacterial wilt disease is present.
• Finally, this thesis will serve as reference material for researchers who will
conduct their research in computer vision, especially research related to plant
disease identification.

1.8 Organization of the Thesis

The remaining part of this thesis is organized as follows. Chapter two presents a review
of the literature on the Enset crop, Enset bacterial wilt, machine learning, and deep
learning, as well as studies conducted in those domains; at the end of chapter two, studies
directly related to this thesis are discussed and summarized. Chapter three briefly
discusses the methodologies used to conduct this thesis, including the data collection
methods, the materials and tools used, and the evaluation techniques. Chapter four deals
with the selection and architectural design of the proposed model and the experimental
details of bacterial wilt disease detection on the Enset crop using infected and healthy leaf
images. Chapter five presents the different experimental results with detailed discussions.
Finally, the conclusion and recommendations are presented in chapter six.

CHAPTER 2
LITERATURE REVIEW
This chapter mainly focuses on the background information and the review of literature
in the domain of this thesis. It includes a detailed explanation of the Enset crop, Enset
bacterial wilt disease, machine learning and deep learning algorithms, and related works.
Finally, the chapter concludes with a summary of related works and the main gaps to be
addressed in this thesis.

2.1 Enset Crop
Enset (Ensete ventricosum (Welw.) Cheesman) is commonly known as the Ethiopian
banana, Abyssinian banana, false banana, or Ensete [17, 4]. It is domesticated only in
Ethiopia but is found in many countries in central and eastern Africa [18]. It is a very big
monocarpic evergreen perennial plant (Figure 2.1), with a height of 4 to 6 meters
(sometimes up to 12 meters) and a diameter of 1 meter [17, 18]. The Enset crop has a thick
and strong pseudostem (false stem) of tightly overlapping leaf bases, with large
banana-like leaves that are wider and taller, up to 5 m in height and a meter wide, with a
midrib. Like the banana, Enset also flowers once, in the center of the plant, at the end of
the plant’s life [18]. The major food produced from the Enset crop is locally called Kocho
(ቆጮ), which is obtained by fermenting the scraped pulp of the pseudostem; the second
product is the pulverized corm, called Amicho (አሚቾ), together with the stalk of the
inflorescence [17, 9]. In addition to Kocho and Amicho, Enset produces another food
locally called Bula (ቡላ) [9]. Kocho can be stored for a long time without any problem at
variable temperatures. At the time of flowering, the quality of Kocho and Amicho
production is higher than that of Enset without a flower. After flowering, the plant dies.

Enset produces more food than other cereal crops, and 40 to 60 Enset plants will provide
enough food for a family of 6 members for 1 year [10, 18]. Each plant takes 4 to 6 years
to mature and be ready to be eaten, and a mature Enset plant gives 40 to 50 kg of food at
a time. Domestic Enset crops are propagated vegetatively from suckers, though a few
cultivated plants are still produced from seeds; one mother plant can produce up to 400
suckers [18, 5, 17].

Figure 2.1. Example of Enset crop

2.2 Enset Bacterial Wilt

Enset bacterial wilt, or Xanthomonas wilt, is a bacterial disease caused by Xanthomonas
campestris pv. musacearum. The symptoms were first observed in Ethiopia in the 1930s,
and the bacterial wilt disease was first identified on Enset in Ethiopia in the 1960s [11].
Currently, it is found in all regions of the country that cultivate the Enset crop; it attacks
Enset at any stage of growth and affects all species of Enset [19]. Research has shown
that BWE is causing a high amount of yield loss in areas that cultivate Enset in Ethiopia
[9]. Since 2011 the disease has been reported in Uganda, the eastern Democratic Republic
of Congo, Rwanda, Tanzania, Kenya, and Burundi [11].

Figure 2.2. Example of healthy (left) and infected (right) Enset leaves.
Once established in an area, the disease spreads rapidly and results in total yield loss [9].
The main symptoms of the disease are wilting of the leaves, yellowing of the leaves, and
vascular discoloration [9, 10]; in addition, when the disease severely affects the crop, a
cream or yellow-colored ooze exudes within a few minutes of cutting the pseudostem.
The symptoms of the disease start from the central part of the leaf and spread to the rest
of the plant. The disease is mainly transmitted through infected farming tools, infected
planting materials, animals that feed on infected crops, and insect pests [9, 10]. There are
many different types of diseases caused by bacteria; the following figure shows the
regions where bacterial wilt of Enset affects the crop [9].

Figure 2.3 Map of regions that cultivate Enset in Ethiopia


In general, bacterial wilt disease of Enset has not been given an equal amount of attention
compared to other diseases, such as those caused by fungi (black and yellow Sigatoka).
In Ethiopia, bacterial wilt is causing a significant impact on Enset, and there is no
well-known management practice adopted by farmers [11].

2.3 Machine Learning
Machine learning is an application of AI that makes a machine learn and improve
automatically without being explicitly programmed [20]. Unlike classical computer
programs, which perform tasks explicitly programmed by the programmer, a machine
learning program uses a generic algorithm that can extract information from a set of data
without any custom program specific to the problem; i.e., instead of writing a new
program for each specific problem, we only feed data to the generic algorithm, it processes
that data, and the algorithm builds its own logic based on the given data [21]. The process
of learning in a machine learning algorithm begins with data [22], such as examples,
direct experience, or instructions, in order to look for patterns in the data and make better
decisions in the future based on the examples that we provide [23]. The goal is to allow
the computer to learn automatically, without human help, and adjust its actions
accordingly. Machine learning is basically divided into two main categories based on the
way the algorithms learn about the data to make predictions: supervised and unsupervised
learning.

Supervised (sometimes called predictive) learning: in this approach we have knowledge
of the inputs and outputs, and we train (supervise) the program to predict the right
outcome via trial and error [23, 24]. The algorithm can apply what has been learned
previously to new data, using labeled examples to predict future events. Supervised
learning is applied where we have two variables x and y and we use a mapping function
from the input to the output:

y = f(x)    (2.1)

where y is the output variable and x is the input variable.

In Equation (2.1) above, the main objective is to learn the mapping function so well that,
given new input data x (unseen before), it can predict the output variable y for that data.
For example, for the disease identification problem in this thesis, training images labeled
as diseased and healthy were used. After learning from these images, the algorithm is able
to make predictions on images it did not see during training.

The second main category of machine learning is unsupervised (descriptive) learning.
This approach has little or zero knowledge of the output, and we try to find patterns or
groupings within the data. The goal is to find interesting patterns or to model the
underlying structure of the data in order to learn more about it [23]. This algorithm is
used when there is input data x (the input variable) but no corresponding output data y
(the output variable). The most common cases of unsupervised learning are clustering,
density estimation, and representation learning.
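A corresponding sketch of unsupervised learning follows, using k-means clustering from scikit-learn to group hypothetical unlabeled points; the data are illustrative only.

```python
# Minimal sketch of unsupervised learning: k-means finds groupings in the
# data without any output variable y. The 2-D points are illustrative.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.95]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # e.g. [0 0 1 1]: two clusters discovered in the data
```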

Currently, learning algorithms are widely used in computer vision applications; hence,
Machine Learning (ML) is the main component of computer vision algorithms [25].
Computer vision has made exciting progress in the past decades, bringing us self-driving
cars, automated scene parsing, medical diagnosis, and more [26]. Behind this revolution,
machine learning is the driving force.

2.4 Artificial Neural Network
The Artificial Neural Network (ANN) is one of the most widely used supervised machine
learning models. The primary focus of this thesis is a special type of NN known as the
Convolutional Neural Network (CNN).

ANNs, sometimes simply called neural networks, are computer programs developed to
mimic the human brain [27, 22]. The term “neural network” originated in 1943 in an effort
to find a mathematical representation of biological information processing [27]. Like
humans, ANNs are trained through experience, by giving appropriate examples, without
any special programming. ANNs are excellent at finding patterns that are too complex for
humans to extract. They gain knowledge by collecting relationships and patterns in the
data provided during training [21, 23]. An ANN contains multiple layers, where each
layer has a number of neurons. A neuron is the smallest building block of the network; it
accepts an input, applies some computation, and generates a unique output [13].

Even though neural networks are inspired by the human brain, we cannot conclude that
they are completely the same. A human brain contains approximately 100 billion neurons,
and each neuron is connected to 1,000 to 10,000 other neurons that work in a parallel
fashion. ANNs, in contrast, are mathematical functions implemented on computers that
run one process at a time in a serial fashion. Therefore, ANNs are not designed to model
the human brain [22].

The first simplified neuron model was introduced by Warren McCulloch and Walter Pitts
and is called the M-P model [28, 20]. This model is also known as a linear threshold gate.
It has a set of inputs (x₁, x₂, x₃, …, xₙ) and one output y. The linear threshold gate simply
classifies the set of inputs into two different classes, so y is binary. In addition, it has a
set of weights (w₁, w₂, w₃, …, wₙ) associated with the input lines, which take values in
the range (0, 1) or (−1, 1). Years later, in the late 1950s, an enhanced version of the M-P
model based on the concept of the perceptron was proposed by Rosenblatt (Figure 2.4)
[20]. Rosenblatt’s neuron model, the perceptron, enhanced the M-P model with two
features. First, in the M-P model the weight values were fixed, whereas the perceptron
makes them variable. Second, the Rosenblatt model adds an extra input that represents
the bias.

Figure 2.4. Example of a single-layer perceptron: inputs x₁, …, xₙ are multiplied by weights w₁, …, wₙ, summed together with a bias, and passed through an activation function to produce the output y.

Neuron k receives n input parameters xᵢ. The neuron also has n weight parameters wₖᵢ. A
bias term b, with a matching dummy input fixed at the value 1, is included with the weight
parameters. The input parameters and the weights are multiplied and summed (the dot
product of the input and weight vectors). The bias term b is added to this sum, and the
result is given to the activation function φ, which then produces the output yₖ of the
neuron [20].

yₖ = φ(∑ᵢ₌₁ⁿ wₖᵢ xᵢ + b)    (2.2)

where yₖ is the output of the neuron, wₖᵢ is the weight of input i, xᵢ is the input, φ is the
activation function, n is the number of inputs, and b is the bias term.
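A minimal NumPy sketch of Equation (2.2) follows, using a simple threshold activation as in the original linear threshold gate; the input and weight values are illustrative only.

```python
# Sketch of a single perceptron (Equation 2.2): weighted sum of the inputs
# plus a bias, passed through a threshold activation function.
import numpy as np

def perceptron(x, w, b):
    s = np.dot(w, x) + b           # weighted sum plus bias
    return 1 if s >= 0 else 0      # threshold activation φ

x = np.array([0.5, -1.0, 0.3])     # illustrative inputs x_i
w = np.array([0.4, 0.7, -0.2])     # illustrative weights w_ki
b = 0.1                            # bias term
print(perceptron(x, w, b))         # -> 0, since the sum is -0.46 < 0
```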

2.4.1 Multi-Layer Networks


ANNs are a combination of multiple artificial neurons grouped in layers [13, 21]. Except
for single-layer networks (networks without hidden layers), most ANNs have three types
of layers: the input layer, one or more hidden layers, and the output layer. Multi-layer
networks have one or more hidden layers. Each layer of the network consists of one or
more neurons. The neurons in the input layer accept information from outside the network
and transfer it to the hidden layers; the input layer passes the data on without modification
(no computation is performed). The hidden layers (layers that are neither output nor input)
perform mathematical computations and transfer information from the input layer toward
the output layer; most of the computation in the network is performed in the hidden layers.
Neurons in the output layer perform computations and transfer information outside the
network. The output layer transforms the activations of the hidden layer into the actual
output, for example a classification or a prediction. Multi-layer networks (or multi-layer
perceptrons) are also known as feed-forward neural networks.

Figure 2.5. Example of a multilayer network


As shown in Figure 2.5 [29] above, each output of a layer of neurons is received as an
input by each neuron in the next layer; this kind of neural network is called a fully
connected feed-forward neural network. In this type of network, neurons in the input layer
receive the original input data, while neurons in the other layers receive the outputs of
previous neurons. In a feed-forward neural network, information flows from the input
layer to the output layer through the hidden layers without going back. Each neuron in
the network has a number of weights equal to the number of neurons in the previous
layer [27].
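The forward pass through such a fully connected feed-forward network can be sketched in a few lines of NumPy; the layer sizes and random weights below are illustrative, not those of the proposed model.

```python
# Sketch of forward propagation: input -> hidden layer (ReLU) -> output
# layer (sigmoid). Weights are random placeholders for illustration.
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x  = rng.random(4)                 # input layer: 4 features
W1 = rng.random((3, 4))            # hidden layer: 3 neurons, 4 inputs each
b1 = np.zeros(3)
W2 = rng.random((1, 3))            # output layer: 1 neuron, 3 inputs
b2 = np.zeros(1)

h = relu(W1 @ x + b1)                   # hidden-layer activations
y = 1 / (1 + np.exp(-(W2 @ h + b2)))    # sigmoid output in (0, 1)
print(y)
```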

2.4.2 Backpropagation Algorithm


In a feed-forward neural network, the output y is produced by accepting the input x, and
information flows forward from input to output. The initial information is provided by
the input x, which is then propagated through the hidden layers to produce the output y.
This process is called forward propagation [22]. The backpropagation algorithm
(commonly called backprop) allows information to flow in the reverse direction:
information flows backward from the output neurons to the input through the hidden
layers in order to compute the gradient [24, 20]. During the training of the neural network,
the weights are selected appropriately so that the network learns to predict the target
output from known inputs [30]. Even though computing the weights of the neurons from
an analytical expression is straightforward, it is computationally expensive, so we need a
simple and effective algorithm that helps us find the weights. The backpropagation
algorithm provides a simple and effective way of solving for the weights iteratively in
order to reduce the error (minimizing the difference between the actual output and the
desired output) of the neural network model [22, 30, 20].

The weights of the network’s neurons are initialized to small random values, and an input
vector is propagated forward through the neural network. Using a loss function, the
predicted output (the output of the network) and the desired output (the output from the
training example) are compared to obtain the error value of the network; the error value
is simply the difference between the actual output and the desired output. The error values
are then propagated back from the output layer to the input layer through the hidden
layers, and the error values of the hidden layers are calculated. In this process, the weights
of the hidden layers are updated; this is what is called learning during the training process
of the neural network. As the weights are iteratively updated, the neural network gets
better. The algorithm continues this process, accepting new inputs, until the error value
falls below a limit we set beforehand [20].
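This training loop can be illustrated with a minimal NumPy sketch for a single sigmoid neuron, a toy stand-in for a full network; the dataset (logical AND), learning rate, and epoch count are all illustrative.

```python
# Sketch of the backpropagation training loop: forward pass, error between
# predicted and desired outputs, hand-derived gradient for one sigmoid
# neuron, and iterative weight updates.
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([0., 0., 0., 1.])               # desired outputs (logical AND)

rng = np.random.default_rng(1)
w, b, lr = rng.normal(0, 0.1, 2), 0.0, 0.5   # small random initial weights

for epoch in range(5000):
    y = 1 / (1 + np.exp(-(X @ w + b)))       # forward propagation
    err = y - t                              # actual minus desired output
    delta = err * y * (1 - y)                # backpropagated error signal
    w -= lr * (X.T @ delta)                  # weight update
    b -= lr * delta.sum()                    # bias update

y = 1 / (1 + np.exp(-(X @ w + b)))
print(np.round(y, 2))    # outputs approach the targets [0, 0, 0, 1]
```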

2.4.3 Activation Function


The final output of each neuron in the neural network is determined by the activation
function φ. Activation functions decide whether a neuron should activate (fire) or not by
calculating the weighted sum and adding the bias to it [31]. Activation functions introduce
non-linear properties into the NN to overcome the drawback of early neural networks
(perceptrons), namely their inability to compute non-linear and complex functions. The
main purpose of the activation function is to convert the input of a neuron into an output;
the output of that neuron is then used as an input by neurons in the next layer of the
network. If we do not use an activation function, the output of the neural network is
simply a linear function, and a linear function cannot be applied in algorithms that need
to learn complex functional mappings of the data [32]. The main reason for using
non-linearity is that we want an NN model that can learn and represent any arbitrary
function mapping inputs to outputs.

In this thesis, the most widely used activation function, the Rectified Linear Unit (ReLU),
has been used in the hidden layers of the network to make the model more powerful and
able to learn complex features from the data. It is used to create a lightweight and effective
non-linear network [22, 33]. ReLU has become popular in the past few years, and it is
now the state-of-the-art activation function for hidden layers [24, 20]. The mathematical
form of this function is represented as follows.

φ(x) = max(0, x)    (2.3)

where φ is the activation function and x is the input.

In Equation (2.3) above, if x ≥ 0 then φ(x) = x, and if x < 0 then φ(x) = 0. Hence, as seen
in the mathematical equation, the ReLU activation function is simple and efficient
(especially for the backpropagation algorithm). The main reason ReLU is simple and
efficient is that it activates only some of the neurons at a time: if the input is negative
(x < 0), it is converted to zero and the neuron is not activated. ReLU cannot be applied in
the output layer of the neural network, and this is the main drawback of this activation
function.

The sigmoid activation function has been used for the output layer of the model. The
sigmoid activation function is the best activation function for binary classification, and
its output lies between 0 and 1 [20]. It is the best choice for models that output a
probability, since the probability of anything lies between 0 and 1. Unlike the outputs of
the SoftMax activation function, the outputs of sigmoid functions do not sum to 1.

The SoftMax function accepts n arbitrary inputs and gives n output values in the range
between 0 and 1, representing the probability that the input belongs to each of the different
classes; the sum of the output values is always equal to 1. SoftMax is the best choice of
activation function for neural network models built for multiclass classification [20].
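Minimal NumPy versions of the three activation functions discussed above are sketched below; ReLU follows Equation (2.3), and the test vector is illustrative.

```python
# ReLU for hidden layers, sigmoid for a binary output layer, and softmax
# for multiclass outputs.
import numpy as np

def relu(x):
    return np.maximum(0, x)           # max(0, x), Equation (2.3)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))       # each output lies in (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))         # shift inputs for numerical stability
    return e / e.sum()                # outputs sum to 1

z = np.array([-1.0, 0.5, 2.0])
print(relu(z))       # [0.  0.5 2. ]
print(sigmoid(z))    # values between 0 and 1; they need not sum to 1
print(softmax(z))    # class probabilities summing to 1
```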

2.5 Deep Learning
Deep learning is a subfield of machine learning that uses neural networks for its
architecture, and its learning is based on data representation algorithms instead of
task-specific algorithms [34, 24]. In the last decade, neural network applications have
grown faster than ever, mainly because of powerful computers (inexpensive processing
units such as GPUs) and large amounts of data. As discussed in Section 2.4 above, an
ANN has one or more processing layers, and the number of layers used in the network
differs depending on the problem we want to solve. If the number of layers is not very
large, say two or three, we call the network a shallow architecture. An ANN architecture
that contains a very large number of layers is called a deep architecture, and deep learning
refers to this deep architecture of NNs [35, 24].

Multilayer networks have been known since the 1980s, but for several reasons they were not used to train neural networks with multiple hidden layers [22]. The main problem that prevented the use of multilayer networks at that time was the curse of dimensionality: as the number of feature dimensions grows, the number of possible configurations increases, and the number of data samples required for training increases exponentially. Therefore, collecting sufficient training datasets was time-consuming and not cost-effective in terms of storage space [22, 36]. Nowadays, most neural networks are often called deep neural networks and are widely used; we can train a neural network with many hidden layers because a huge amount of data, as well as storage space and computational resources, is available.

Traditional machine learning algorithms need separate, hand-tuned feature extraction before the learning phase. Deep learning has only one neural network phase: at the beginning of the network, the layers learn to recognize the basic features of the data, and those features are fed forward to the other layers of the network for further computation [22].

As the NN is inspired by the human brain, one of the major applications of deep learning, computer vision, is inspired by the human visual system. Deep learning has achieved great success in computer vision and speech recognition over the last two decades [28, 34]. Deep learning models have also been applied to many other problem areas, among them text classification, speech recognition (natural language processing), visual object recognition (computer vision), object detection, and other domains such as drug discovery and genomics [24, 37]. The number and type of problems that a neural network can address depend on the different deep learning algorithms developed in the last two decades. Some of the most commonly used deep learning architectures are Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), CNN, Deep Belief Networks (DBN), and Autoencoders.

• RNN is one of the first deep learning architectures and gave a road map for developing other deep learning algorithms. It is commonly used in speech recognition and natural language processing [38]. RNN is designed to recognize the sequential characteristics of data (it remembers previous entries): when analyzing time series data, the network has a memory (hidden state) that stores previously analyzed data. However, to perform the present task a standard RNN can only look at recent information (short-term dependency), and this is its main drawback. RNN differs from a feed-forward neural network in that it takes a sequence of data defined over time [38].
• LSTM is a special type of RNN which is explicitly designed to overcome the problem of long-term dependencies by making the model remember values over arbitrary time intervals. The main problems of RNN are vanishing and exploding gradients; the gradient is the change of weight with regard to the change in error. LSTM is well suited to process and predict time series given time lags of unspecified duration. For example, an RNN forgets earlier inputs if we want to predict a sequence of one thousand intervals instead of ten, but an LSTM remembers them. The main reason that LSTM can remember its input over a long period of time is that it has a memory, like the memory of a computer, which allows the LSTM to read, write, and delete information [39]. It is mostly applied to natural language text compression, handwriting recognition, speech recognition, gesture recognition, and image captioning.
• CNN is the most popular deep learning architecture for computer vision tasks, especially image recognition. It is a multilayer network inspired by the animal visual system (visual cortex). CNN is used in this thesis for the detection of Bacterial Wilt disease on Enset crop, and the details are given in Section 2.5.1.
• DBN is a class of deep neural networks with multiple hidden layers, where the layers of the network are connected to each other but the neurons within a layer are not. The training of DBN occurs in two phases: it is composed of layers of Restricted Boltzmann Machines (RBMs) for the unsupervised pretraining phase and a feedforward network for the supervised fine-tuning phase. During the first phase (pretraining), it learns a layer of features from the input layer. After the pretraining is completed, the fine-tuning phase begins: it accepts the features of the input layer as input and learns features in the second hidden layer, and then backpropagation or gradient descent is used to train the full network including the final layer [40]. DBN is applied in image recognition, information retrieval, natural language understanding, and video sequence recognition.
• Autoencoders are a specific type of feed-forward neural network designed for unsupervised learning, i.e. when the data is not labeled. The inputs and outputs of autoencoders are the same: the network compresses the input into a lower-dimensional code and then reconstructs the output from that compressed code. Autoencoders have three components, namely the encoder, the code, and the decoder. The encoder accepts the input and produces the code, whereas the decoder reconstructs the output from the code. Anomaly detection is one of the most popular applications of the autoencoder.

Deep learning techniques are new and rapidly evolving. Nowadays deep learning performs better than traditional machine learning approaches because of the availability of large amounts of data and high-performance computing components such as GPUs [20]. Deep learning methods use multilayer processing (many hidden layers) with better accuracy, and unlike the traditional machine learning approach there is no explicit feature extraction: in a deep learning architecture, features are extracted automatically from the raw data, and feature extraction and classification (or recognition, depending on the problem) are performed at once, so we only design a single model.

Research has shown that deep learning can achieve state-of-the-art results for many problems that AI and ML have faced for a long time in the areas of computer vision, Natural Language Processing (NLP), and robotics [41, 20]. To handle the complexity of the design, deep learning methods use the backpropagation algorithm, loss functions, and a very large number of parameters that allow the model to learn complex features.

2.5.1 Convolutional Neural Network


A convolutional neural network (CNN), sometimes called a convnet, is a multilayer perceptron or deep learning model similar to a feed-forward NN and commonly applied to analyze visual imagery. It is very similar to a regular neural network but mainly focuses on computer vision tasks. CNNs evolved from the traditional neural network and are commonly applied to problems with repeating patterns, such as image recognition [20]. In a regular neural network, there is a problem of dimensionality for image processing or computer vision applications, because images usually contain an enormous amount of information. For example, a grayscale image with a size of 1280 by 720 contains 921,600 pixels; if a fully connected network accepts the pixel intensities of this image as input, each neuron requires 921,600 weights. Likewise, 2,073,600 weights would be required for a 1920 × 1080 image, and if the image is colored (polychrome) the number of weights is multiplied by three (3). Therefore, as the image size increases, the number of free parameters in the network increases dramatically; as the model gets larger, the performance of the network decreases and overfitting occurs [27]. Overfitting is a problem of machine learning algorithms that happens when the network is large and there is not enough data to fit the model; it decreases the generalization ability of the machine learning model [21]. CNN overcomes this problem by providing layers with neurons arranged in three dimensions (3D): width, height, and depth. Every layer of a CNN accepts a 3D volume of input data (in this case an image) and generates another 3D volume of data as output through a differentiable function.

The basic idea of CNN was inspired by the receptive field, a biological term describing a feature of the animal visual cortex [20, 21]. Receptive fields are parts of sensory neurons and act as detectors that are sensitive to a stimulus, for example edges. The term receptive field is also applied in the context of ANN, most often in relation to CNN, where the biological computations are approximated in computers using convolution operations. In computer vision, images can be filtered using the convolution operation to produce different visible effects. CNN has convolutional filters that are used to detect objects in a given image, such as edges, in the same way as the biological receptive field. Since the late 1980s and through the 1990s, CNNs have given interesting results in handwritten digit classification and face recognition [42]. The following figure (Figure 2.6) [33] illustrates a CNN architecture.

Figure 2.6. Example of CNN Architecture


In this thesis, CNN is used to detect BWE by giving leaf images of infected and healthy Enset crops as input. For the classification process, a CNN composed of various sequential layers is used, and every layer of the network transforms one volume of activations to another using different functions [29]. The basic and commonly used layers of CNN are the convolution layer, the pooling layer, and the fully connected layer [22].

A. Convolution Layer

The main objective of the convolution layer is to extract useful features from the input image. Every image is represented as a matrix of pixel values in a computer. An image captured by a standard digital camera has three channels: Red, Green, and Blue (RGB). This type of image is represented as three 2D matrices stacked over each other (one for each color), each having pixel values in the range of 0 to 255. The convolution layer is formed from a combination of a set of convolutional filters (aka kernels or feature detectors), which are small matrices with sizes like 3 × 3, 9 × 9, and so on [29]. The filters are treated as neuron parameters and are learnable. Every filter is smaller than the input volume in spatial size (width and height) but extends through the full depth of the input volume (input image). For example, a typical filter might have size 5 × 5 × 3 (5 width, 5 height, and 3 depth for the three color channels). Only a part1 of the image is connected to the next convolution layer, because connecting all pixels of the image would be computationally expensive.

The convolution operation is performed by sliding the filter over the input image from left to right across its width and height and computing the dot product between the filter and the input image at each position. The output of this operation is called a feature map (aka convolved feature or activation map). The filters are thus used to extract useful features from the input image; whenever the values of the filters change, the extracted features (the feature map) also change. In the following illustration (Figure 2.7), we have prepared a 2D input image of size 5 × 5 and a 3 × 3 kernel.

Figure 2.7. Example of Input volume and filter


Given the input and the filter, the next step is to perform the convolution operation by sliding (convolving) the filter over the input. At every location, the dot product is computed (by performing element-wise matrix multiplication and summing the result) and the result is stored in a new matrix called the feature map (Figure 2.8). As we can see in the illustration, the output of the first convolution operation is 4 and the second is 3; these results are added to the feature map. The whole process is performed by sliding the filter to the right and adding the results to the feature map.

1
If we connect all parts, it is called a Fully Connected Network (FCN), as opposed to a CNN.

Figure 2.8. Example of the Convolution operation
The area where the convolution operation is performed is called the receptive field, and its size is 3 × 3 because it is always the same as the size of the filter. We perform as many convolution operations as needed on the input by using different filters and obtain distinct feature maps. Finally, we stack all the feature maps together, and this is the final output of the convolution layer.
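
As an illustration of the operation described above, the following NumPy sketch convolves a 5 × 5 input with a 3 × 3 filter using stride 1 and no padding; the input and kernel values are hypothetical, not the ones in Figure 2.7.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution of a single-channel image."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiplication of the receptive field and the
            # kernel, summed into one value of the feature map
            region = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(region * kernel)
    return feature_map

image = np.random.randint(0, 2, (5, 5))   # toy 5 x 5 binary input
kernel = np.random.randint(0, 2, (3, 3))  # toy 3 x 3 filter
print(conv2d(image, kernel))              # 3 x 3 feature map
```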

The size of the output volume (the feature map) is controlled by three hyperparameters: depth, stride, and padding (aka zero padding). These parameters should be decided before the convolution operation is performed [29].

• Depth is the number of filters we use in the convolution operation. The larger the number of filters, the stronger the model we produce, but there is a risk of overfitting due to the increased parameter count. If we use three different filters during the convolution operation, we produce three different feature maps; these feature maps are stacked as 2D matrices, so the depth of the output would be three.
• Stride is the number of pixels the filter slides over the input volume at a time. When the stride is 1, the filter matrix slides 1 pixel over the input volume at a time; when the stride is 2, the filter jumps 2 pixels at a time, and so on. The larger the stride, the smaller the output volume.
• Padding is adding zeros around the borders of the input volume. It is convenient to pad the input volume with zeros because it helps to keep more information around the borders of the input and allows us to control the size of the feature map.

A filter size of 3, a stride of 2, and a padding of 1 are commonly used hyperparameter values in CNN, but they can be changed depending on the input volume [29], as the sketch below illustrates.
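
In Keras, the library used later in this thesis, these hyperparameters map directly onto the arguments of the Conv2D layer; the values below are illustrative only.

```python
from tensorflow.keras.layers import Conv2D

# depth = 32 filters, filter size 3 x 3, stride 2. Note that Keras expresses
# padding as 'valid' (none) or 'same' (enough zeros to preserve the spatial
# size) rather than as an explicit pixel count such as 1.
conv = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2),
              padding='same', activation='relu')
```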

The above example (Figure 2.8) is for a grayscale image, because the matrix only has a depth of one. In this thesis, the convolutions are performed in 3D, because color images captured by a digital camera have been used, which are represented as 3D matrices with dimensions of width, height, and depth (the depth represents the three color channels). For example, if we have an input of 6 × 6 × 3 and a filter of size 3 × 3 × 3 (the depth of the input and the filter are always the same), we perform the convolution operation as before; the only difference from the 2D case is that the element-wise multiplication and summation are performed over a 3D region instead of a 2D one, as shown in Figure 2.9 below.

Figure 2.9. Example of convolution of a 3D input volume

The figure above (Figure 2.9) [43] shows an input volume of 6 × 6 × 3 and a filter of 3 × 3 × 3. The number of filters is one, and it slides 1 pixel at a time, i.e. the stride is 1. We can use many different filters in the convolution layer to detect multiple features, and the output of the convolution layer will have the same number of channels as the number of filters. The following figure (Figure 2.10) [43] is the same as Figure 2.9 above but with two filters; the depth of the feature map equals the number of filters, as shown in Figure 2.10 below.

Figure 2.10. Example of convolution operation with 2 filters
To control the number of free parameters in the convolution layer, there is a systematic method called parameter sharing. If one feature is useful to compute at some spatial position, it should also be useful at another position. In other words, if we use the same filter (commonly called weights) in all parts of the input volume, the number of free parameters decreases. The neurons in the convolutional layer share their parameters and are only connected to some parts of the input volume (local connectivity). The parameter sharing resulting from convolution contributes to the translation invariance of CNN. When the input volume has some specific centered structure and we want the CNN to learn different features at different spatial locations, parameter sharing can be relaxed, and the layer is then called a locally connected layer [29].

Finally, to make a single convolution layer, we add the activation function (ReLU) and a bias (b) to the output volume. The following figure (Figure 2.11) [43] shows one convolution layer of a CNN with the ReLU activation function.

Figure 2.11. An Example of one convolution layer with activation function


B. Pooling Layer

To reduce the number of parameters, extract dominant features at each spatial location, progressively reduce the spatial size of the convolved feature, and control the problem of overfitting in the network, we add pooling layers (also called subsampling or downsampling) between some successive convolution layers in CNN [29]. This layer helps to reduce the computational power required to train the network. The pooling operation is performed by sliding a filter over the convolved feature.

Figure 2.12. Example of max pooling


There are three types of pooling: max pooling, average pooling, and the less commonly used sum pooling. Max pooling (Figure 2.12) [29] is the most commonly used pooling operation, and its output is the maximum value from the portion of the image covered by the filter. Average pooling returns the average of all the values from the portion of the image covered by the filter, and sum pooling returns their sum. Max pooling performs de-noising along with dimensionality reduction, whereas average pooling only performs dimensionality reduction; therefore, max pooling is better than average pooling. The pooling operation is applied to all of the depth slices of the image after the convolution operation, and the commonly used configuration is a 2 × 2 filter with stride 2, though this can be changed. For example, with the commonly used 2 × 2 filter (as shown in Figure 2.12), max pooling returns the maximum value from the four values [29].
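
A minimal NumPy sketch of the commonly used 2 × 2 max pooling with stride 2 follows; average or sum pooling would simply replace np.max with np.mean or np.sum.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the maximum value in each window
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = np.max(window)
    return pooled

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 0],
               [2, 1, 9, 8],
               [7, 3, 4, 2]])
print(max_pool(fm))  # [[6. 5.]
                     #  [7. 9.]]
```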

C. Fully Connected Layer

The fully connected layer is the same as the traditional multilayer perceptron discussed in Section 2.4.1 above. In a fully connected layer, every neuron in the previous layer is connected to every neuron in the next layer. This layer accepts the output of the convolution or pooling layer, which contains high-level features of the input volume. These high-level features are in the form of a 3D matrix, but the fully connected layer accepts a 1D vector of numbers. Therefore, we need to convert the 3D volume of data into a 1D vector, a process called flattening, and the flattened vector becomes the input to the fully connected layer. The fully connected layer then performs mathematical computation like any ANN, as discussed in Section 2.4 above in Equation (2.2). Activation functions such as ReLU are used in the hidden layers to apply non-linearity. Using a sigmoid activation function, the last layer (output layer) of the fully connected network performs classification (producing the probability of inputs belonging to a particular class) based on the training data. For example, in this thesis, the image classification has two classes: diseased and healthy.

Figure 2.13. Example of fully connected Layer


In addition to classification, the fully connected layer is a good way of learning non-linear combinations of the features returned from the convolution and pooling layers. Some convolutional networks do not have fully connected layers and are called Fully Convolutional Networks (FCN).
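
In Keras, the flattening and fully connected stages described above correspond to the Flatten and Dense layers. The sketch below uses illustrative sizes that happen to match the proposed model described in Chapter 4:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential([
    # ... convolution and pooling layers would come first ...
    Flatten(input_shape=(12, 12, 64)),  # 3D volume -> 1D vector of 9216 values
    Dense(64, activation='relu'),       # hidden fully connected layer
    Dense(1, activation='sigmoid'),     # binary output: diseased vs. healthy
])
```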

2.5.2 CNN Architectures


There is a project called ImageNet2 that contains a massive dataset of images designed to help image and computer vision (visual image recognition) research, and it holds an annual software competition known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In recent years, the winners of this challenge, the world's most significant in computer vision, were researchers that used deep learning algorithms, especially CNNs, to classify and recognize very large image datasets with hundreds and thousands of classes. Most CNN architectures were developed to compete in the ImageNet challenge. Some of the well-known CNN architectures are LeNet [44], AlexNet [33], VGGNet [45], ZFNet [42], GoogLeNet [46], and ResNet [47].

2
http://www.image-net.org/

Yann LeCun et al. developed a CNN model in 1998 to recognize and classify handwritten digits for the postal service. This model, called LeNet-5, is currently used in many banks and insurance companies to recognize handwritten numbers on cheques. The architecture receives 32 × 32 × 1 (grayscale) images, with a filter size of 5 × 5 and stride 1. The architecture was not scalable to large images at that time because of limitations in computational power. It has two convolutional layers, each followed by an average pooling layer, and after these there are two fully connected layers with a SoftMax activation function. The total number of parameters in this model is 60,000 [20].

In 2012, a CNN architecture developed by Alex Krizhevsky et al. called AlexNet won the ImageNet challenge, decreasing the top-five error rate from 26% to 15.3%. The architecture is similar to LeNet-5 but much deeper, with a greater number of kernels (of sizes 11 × 11, 5 × 5, and 3 × 3) per layer and with stacked convolutional layers. AlexNet adds techniques such as dropout, data augmentation, and ReLU activations. During training, the network is split into two pipelines trained simultaneously on two GPUs. It has 60 million parameters and 650,000 neurons, and took 5 to 6 days to train on 2 GTX 580 3GB GPUs [33].

Simonyan and Zisserman from the VGG group at Oxford created a model called VGG with 16 convolutional layers. It improves on AlexNet by replacing the 11 × 11 and 5 × 5 filters with many 3 × 3 filters: it is more efficient to use multiple small stacked kernels than a single large kernel to learn many different complex features. VGG achieved a top-five error rate of 7.3% and took second place in the ImageNet challenge in 2014. VGG has 138 million parameters, which is the main drawback of this architecture because it needs greater resources; the VGG architecture was trained for 2 to 3 weeks on 4 GPUs [45].

In 2013, the winner of the ImageNet challenge was also a CNN architecture, called ZFNet, which achieved a top-five error rate of 14.8%. It uses the same architecture as AlexNet with some modifications, such as changing the filter size from 11 × 11 to 7 × 7 and the stride from 4 × 4 to 2 × 2. To overcome the loss of information caused by using a bigger kernel in an earlier layer, the authors use smaller kernels in earlier layers and increase the kernel size as the network goes deeper [42].

GoogLeNet, the winner of the ILSVRC 2014 competition, was developed by Google. It achieved a top-five error rate of 6.67%, which is very close to human-level performance. GoogLeNet (sometimes called Inception V1) has deeper paths with parallel convolutions of different filter sizes. There are 4 million parameters (less than AlexNet) and 22 layers in the GoogLeNet architecture. It is inspired by LeNet, but GoogLeNet has 1 × 1 convolutions (to reduce the number of parameters) in the middle of the network, and there is no fully connected network at the end; instead it has global average pooling. The building block that combines parallel convolutions with 1 × 1 convolutions is called the inception module [46].

The Residual Neural Network (ResNet), introduced by Kaiming He et al. in 2015, was the winner of ILSVRC that year. ResNet has a feature called the skip connection, i.e. jumping over some layers to avoid the problem of vanishing gradients by reusing the activation of a previous layer until the current layer learns its weights. The accuracy of deep networks is reduced by vanishing gradients: as the layers go deeper and deeper, the gradient decreases and the performance of the network decreases with it. To overcome this challenge, Kaiming He et al. introduced the residual connection, which is nothing but connecting the output of an earlier layer directly to a later layer, skipping the layers in between. By applying this method, ResNet won the largest computer vision competition, and the network reached state-of-the-art results on all standard computer vision benchmarks [47]. ResNet achieved a top-five error rate of 3.57%, which is more accurate than humans on this dataset.

2.5.3 Application of CNN in Crop Disease Detection


Deep CNNs are applied in different computer vision applications to solve various problems. In the following, we review literature that uses CNN architectures for the detection and classification of crop diseases from crop images.

A deep CNN approach is applied in [48] to classify rice diseases based on healthy and unhealthy rice leaves. In this study, a dataset containing a total of 857 images was collected from rice fields using a digital camera, together with publicly available images of rice from the internet. The authors used a manual a priori classification, with assistance from agricultural offices, to label the image dataset. They used the AlexNet transfer learning architecture of the CNN algorithm to classify the input images into three groups, namely healthy, unhealthy, and snail-infested. The network achieved 91.23% accuracy using stochastic gradient descent (SGD).

Another deep learning approach to detect plant disease is presented in [49]. In this work, the authors proposed a deep learning model for real-time detection of tomato diseases and pests. They used different digital devices to collect 5,000 images of tomato leaves from different farms. The proposed approach identifies and classifies the disease into nine different classes and finds the location of the disease in the tomato plant, which makes the study different from many other approaches in this area. For object recognition and classification, CNN algorithms such as Faster Region-Based CNN (Faster R-CNN) [50], Single Shot Multi-box Detector (SSD) [51], and Region-based Fully Convolutional Networks (R-FCN) [52] were used. By combining each of these three CNN architectures with feature extractors such as VGG net [45] and Residual Network (ResNet) [47], their model effectively recognizes and classifies the diseases and pests. The authors recommended using data annotation and data augmentation to increase the accuracy of the results.

Another deep learning technique proposed in [53] uses a deep CNN for plant disease detection and classification. In this study, images from internet search results were collected for training and testing the model. The authors used a dataset with 30,880 total images after augmentation and transformation. The dataset was manually assessed by agriculture experts, and the CaffeNet [54] architecture of CNN, modified to fifteen categories (classes), was used for training. The model successfully categorized 13 different classes of disease with an accuracy of 96.3%.

The author in [35] developed a CNN model to detect and diagnose plant diseases using simple leaf images. The datasets were collected from globally available datasets taken under laboratory conditions, in addition to images taken under real cultivation conditions in the field. In this study, 25 different plants were selected, covering 58 distinct disease classes. Several CNN architectures were trained, such as AlexNet, GoogLeNet, Overfeat [55], and VGG; among these, VGG gave the most successful identification of plant–disease combinations, meaning the system generates a pair of plant and corresponding disease with high accuracy.

Related Works
Currently, in the field of agriculture, the reduction of productivity and loss of yield is mainly caused by plant disease. To reduce these losses, there is a need to develop state-of-the-art, automated methods for plant disease detection. Advancements in agricultural technology are already doing a great job, including disease detection using image processing techniques, and over the last two decades the technology has become faster and more accurate. A lot of work has been done on plant disease detection using image processing and machine learning approaches.

However, most of the studies conducted on the identification of plant disease use traditional image processing techniques and follow a common set of steps: image acquisition, image preprocessing, image feature extraction, and finally classification [14, 6, 15, 56]. In the image acquisition step, images are collected using different digital devices, like digital cameras and smartphones, from the field (in our case, from Enset farms) or from an existing image dataset. The second step is preprocessing; its main goal is the improvement of the image data by removing unwanted features, enhancing the image, and performing image segmentation. Segmentation is used to identify the boundaries within the image using different methods such as thresholding. In the feature extraction step, features useful for disease identification, like color and texture, are extracted from the image. The final and main step is classification, in which disease identification and classification are performed. Different classification techniques are used in the literature, such as neural networks [57], support vector machines (SVM), and, in some studies, both SVM and NN [58]. In the following, we discuss literature in the area of disease detection and classification that is directly related to this thesis.

A machine learning approach is presented in [16] to detect and classify banana Bacterial Wilt and banana black Sigatoka. In this study, 623 diseased and healthy images of banana leaves collected from the field were used. Color features were extracted based on threshold values of the green pixel components of the image, and shape features were extracted by thresholding at different levels, extracting connected components, and calculating morphological features of each connected component. For the classification of the disease, the authors tested seven different classifiers, including Nearest Neighbor [59], Decision Tree [60], Random Forest [61], Extremely Randomized Trees [62], Naïve Bayes [63], and SVM. Of these, Extremely Randomized Trees gave the best classification accuracy: 96% for banana Bacterial Wilt and 91% for banana black Sigatoka.

An automated tool is presented in [13] to identify and classify banana leaf diseases. The authors identify and classify diseases caused by fungi, known as banana Sigatoka and banana Speckle. In this study, a globally available dataset from the PlantVillage project was used to collect the infected and healthy leaf images of banana. Leaves infected by the disease are determined based on the color difference between the healthy and the infected leaves. The authors preprocessed the entire dataset by resizing each image to 60 × 60 pixels and converting the images to grayscale. They performed feature extraction and classification by applying a CNN algorithm, and the trained model gave promising classification accuracy.

ANNs for the classification and grading of banana plants are presented in [64]. In this study, a total of 35 diseased images of banana leaves captured in the field were used to train the NN. Color features were extracted by converting RGB images to HSV, and Histogram of Template (HOT) features were also extracted. For classification, a feed-forward neural network was trained. As the authors discuss in the paper, the trained model successfully classifies five different banana plant diseases based on the given images.

A paper presented in [65] covers the detection and calculation of the area of infection of banana black Sigatoka using segmentation and area calculation. In this paper, the authors captured images of banana plants from the field. The percentage of infection is calculated with the formula: infected area divided by total area, multiplied by 100. The following table summarizes previously conducted works related to this thesis.

Table 2.1. Summary of related works

Author: Godliver Owomugisha et al (2014) [16]
Title: Automated Vision-Based Diagnosis of Banana Bacterial Wilt Disease and Black Sigatoka Disease
Methodology used: Thresholding for feature extraction; Extremely Randomized Trees for classification
Accuracy: 96%
Remark:
• The traditional and simple machine learning approach is used.
• Different hand-crafted feature extraction methods applied.
• Not suitable for a very large dataset.

Author: Jihen Amara et al (2017) [13]
Title: A Deep Learning-based Approach for Banana Leaf Diseases Classification
Methodology used: CNN architecture (LeNet) was used to train the model
Accuracy: 97.3%
Remark:
• The datasets used in this study are from a publicly available dataset of PlantVillage.
• Segmentation is used before the training.
• The number of images in the dataset is not enough to train the CNN algorithm.

Author: Basavaraj Tigadi and Bhavana Sharma (2016) [64]
Title: Banana Plant Disease Detection and Grading Using Image Processing
Methodology used: Histogram of Template (HOT) for feature extraction, NN for classification
Accuracy: 90%
Remark:
• Only 35 images are used to train the NN.
• They used handcrafted feature extraction.
• Used simple and traditional machine learning techniques.
• The method will not work in a very large dataset.

Author: Sandip P. Bhamare and Samadhan C. Kulkarni [65]
Title: Detection of Black Sigatoka on Banana Tree using Image Processing Techniques
Methodology used: Calculating the percentage of the infected area
Accuracy: Not specified in percentage
Remark:
• Detection only by calculating the area of the infected area in each image.
• No image processing technique is used.

Summary
As mentioned in the previous sections, the studies show that computer vision has been widely used in the field of agriculture, especially for crop disease identification, and has obtained interesting results, more specifically with machine learning and deep learning algorithms such as NN and CNN. The Enset crop is related to the banana crop in several ways, and Bacterial Wilt disease also affects the banana plant. Computer vision techniques have been applied for the detection and classification of different diseases in the banana plant, including banana Bacterial Wilt, but there is still a need to develop a more accurate and efficient model. As we saw in the related works, all previously conducted papers have some problems which we need to overcome in this thesis. For example, most of the papers used datasets from internet searches or publicly available databases such as PlantVillage; using a publicly available dataset is recommended, but the images in most of the previous research were captured under controlled environments such as laboratory setups. There are also many laborious preprocessing stages, such as handcrafted feature extraction, color histograms, texture features, and shape features. Most importantly, the methods used in previous research works are not state of the art, i.e. most of the studies in the literature on crop (especially banana) disease identification follow traditional image processing techniques [13, 14, 6, 15, 16]. Another key point of this thesis is that no image processing technique (whether traditional machine learning or deep learning) has been designed to detect or classify Enset disease so far. Hence, an accurate and efficient CNN-based model (avoiding handcrafted feature extraction) for the detection of Bacterial Wilt disease in Enset crop, using leaf images of infected and healthy Enset crops, is designed and developed.

CHAPTER THREE

RESEARCH METHODOLOGIES
This chapter describes the methodologies used to accomplish this thesis, including the methods used to implement the model, data collection, data preparation, the software and hardware configuration of the system, and the evaluation techniques used to evaluate the model. In this thesis, an experimental research approach is used, in which one set of variables is kept constant while another set of variables is measured as the subject of the experiment. Different experiments are carried out using different dataset ratios; in addition, a number of experiments are conducted using different activation functions and hyperparameters.

Research Flow
In this thesis, an experimental research method is followed. In order to achieve the objective of this thesis, the process flow shown in Figure 3.1 is followed. As the block diagram shows, this thesis is conducted in three main phases. The first phase includes identifying the domain of the problem, that is, understanding the problem by reviewing different kinds of literature; the objectives of the thesis, both general and specific, are then formulated. The second phase concerns data preparation and the design of the thesis: during data preparation, data is collected from farms, labeled with the help of agriculture experts, and finally split into training, validation, and testing sets; after data preparation, the design of the model is performed. The third phase is the implementation of the thesis, in which the designed model is implemented with appropriate tools and methods, then trained and tested with the appropriate data. During the training of the model, its performance is evaluated; after obtaining the optimal model during evaluation, the model is tested with the test data. Finally, the model is compared with other pre-trained models.

Figure 3.1. Research flow

Data Preparation
The most important requirement when using neural network or deep learning algorithms in research is obtaining the data used to train the model. In this thesis, Enset leaf image data is the main input to the model. However, there is no publicly available database containing thousands of Enset leaf images that can be downloaded and used for training the model. Therefore, only images of healthy and infected plants captured from different Enset farms are used.

The images of Enset crops were collected from the southern regions of Ethiopia with the help of farmers and agriculture experts. All diseased and healthy images were captured on Enset farms, and some of the images were captured in fields specifically prepared to analyze Bacterial Wilt disease of Enset by Arbaminch University researchers (Chencha, SNNPR, Ethiopia). The images were captured using a digital camera under normal conditions, i.e. without controlling lighting conditions or the relative position of the camera to the plant. All the images were checked by domain experts (plant science experts) of Mihurina Aklil woreda, Gurage Zone, SNNPR, Ethiopia.

3.2.1 Data Preprocessing


Preprocessing is usually a major task in image processing: it is the transformation of the raw data before it is fed into the neural network or deep learning algorithm. However, as discussed in Section 2.5.1, in CNN there is no need for explicit preprocessing of the dataset, because the algorithm can take the raw pixels of the image and learn the features by itself. But the images contained in the prepared dataset have different sizes. Therefore, size normalization is performed on the dataset in order to give all images a common size for the CNN algorithm and to decrease the computational time of training, because the model is trained on a standard PC with limited hardware resources such as processor and memory. All the images contained in the dataset are resized to 127 × 127 pixels.

Figure 3.2. Resizing an original 2592 × 1944 image to 127 × 127 pixels
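
A minimal sketch of this size normalization using the Pillow library is shown below; the folder names are hypothetical.

```python
import os
from PIL import Image

SRC, DST = 'dataset/raw', 'dataset/resized'  # hypothetical folder names
os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    img = Image.open(os.path.join(SRC, name)).convert('RGB')
    img = img.resize((127, 127))  # e.g. 2592 x 1944 -> 127 x 127
    img.save(os.path.join(DST, name))
```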

3.2.2 Data Partitioning


In the first place, the dataset is divided into two parts: training and test. The training split is used to train the model, and the test split is used to test the model on data unseen during training. A validation split is used to assess the performance of the model during training and to fine-tune model parameters in order to select the best-performing model. The number of original images collected from Enset farms was 4,896. The literature recommends using a training split of between 60% and 90% of the total dataset, with the rest for testing [29, 35]. In this thesis, experiments were conducted using four different ratios: 6:4, 7:3, 8:2, and 9:1. The ratio of training images to testing images that gave the best results was 8:2, meaning 80% of the dataset is used for training and 20% for testing. From the training split, 20% of the images are taken for validation. Therefore, the training dataset contains 3,138 images, the validation dataset contains 780 images, and the test dataset contains 978 images. Since the two classes (healthy and diseased) have an equal number of images in each category, the dataset is split randomly into train, validation, and test sets according to the ratios stated above. Using an equal number of images from each class for training and validation helps to avoid the problem of overfitting, because the updating of weights during training will not be biased toward one of the categories. A sketch of this split is shown below.
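
The split itself can be done with any shuffling routine; the following sketch uses scikit-learn (an assumption, since the thesis does not name the tool used for splitting) and reproduces the 8:2 ratio with a further 20% of the training split held out for validation.

```python
from sklearn.model_selection import train_test_split

images = [...]  # placeholder: paths of the 4,896 Enset leaf images
labels = [...]  # placeholder: 0 = healthy, 1 = diseased

# 80% train / 20% test; stratify keeps the two classes balanced
train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)
# hold out 20% of the training split for validation
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.2, stratify=train_y, random_state=42)
# sizes roughly match the 3,138 / 780 / 978 split reported above
```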

3.2.3 Data Augmentation


In most cases of image classification (specifically for deep learning applications), research datasets are very large. Data augmentation is a process of increasing the number of training data points in a dataset by generating more data from the existing training samples [21]. It is important to increase the number of data points even when a large dataset is available: it helps the network learn more complex features from the data and prevents the problem of overfitting [21]. In this thesis, various data augmentation techniques have been applied to the original images to generate additional images for the dataset. Data augmentation can be performed before the data is fed into the model (aka offline augmentation), or the data can be augmented during training. In this thesis, data augmentation is performed during the training of the network using Keras libraries, so each image sent into the network during training is generated from an original image.
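
A minimal sketch of this on-the-fly augmentation with the Keras ImageDataGenerator follows. The directory path is hypothetical, and the specific transformations shown are illustrative; the exact set used in this thesis is listed in Table 4.2.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,     # normalize pixel intensities to [0, 1]
    rotation_range=20,     # illustrative transformations only;
    horizontal_flip=True,  # see Table 4.2 for the set actually used
    zoom_range=0.2)

train_flow = train_gen.flow_from_directory(
    'dataset/train',          # hypothetical path, one sub-folder per class
    target_size=(127, 127),
    batch_size=32,
    class_mode='binary')      # two classes: diseased and healthy
```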

Software Tools
An investigation of available software tools and their libraries was conducted in order to select the appropriate tools for implementing the CNN algorithm for Enset image classification. During the investigation, we saw that some tools are general, covering both deep learning and machine learning algorithms, and some are specific to one of them. Before selecting the tools, we considered some criteria that are helpful for selecting appropriate software tools with their corresponding libraries. The main criterion is the choice of programming language used to implement the algorithm; other criteria are the availability of learning materials such as free video tutorials, existing experience, and the requirement that the tools run on machines with limited resources (CPU only). The software tools used to implement the CNN algorithm are Python as the programming language, with the TensorFlow and Keras libraries, in the Anaconda environment. These tools fulfill all the selection criteria and are used with Python, which is familiar to us.

Anaconda3 is used for the implementation of the model. It is a free and open-source distribution of the Python and R programming languages for data science and machine learning applications that aims to simplify package management and deployment. It contains different IDEs for writing code, such as Jupyter Notebook and Spyder. We used Jupyter Notebook to implement the code; it is easy to use and runs in a web browser.

TensorFlow4 is a free and open-source library developed by Google, and it is currently the most famous and fastest deep learning library [21]. It can be used on any desktop running Windows, macOS, or Linux, in the cloud as a service, and on mobile devices running iOS and Android. The TensorFlow architecture supports preprocessing the data, building the model, training the model, and evaluating the model. All computations in TensorFlow involve tensors (n-dimensional arrays) that represent all kinds of data. TensorFlow also uses a graph framework for the graphical representation of the series of computations during training. It has two distributions, for CPU and GPU.

Keras is a high-level neural network API written in Python which runs on top of either TensorFlow, Theano5, or the Microsoft Cognitive Toolkit (CNTK). It makes it very simple to develop a model, is user-friendly, is easily extensible with Python, and, most importantly, contains pretrained CNN models such as VGG16 and Inception that are used during the experiments. It allows easy and fast prototyping and supports both CNN and RNN, or a combination of the two [21].

Visio 20196 is used for designing the system architecture. This tool was used to create, collaborate on, and share data-linked diagrams easily with ready-made templates, helping to simplify complex information.

Hardware Tools
A Sony Cyber-shot DSC-W230 (12.1 megapixel) digital camera was used to capture the sample images from the field. To implement the CNN algorithm with the selected software tools, a very slow machine with an Intel(R) Core(TM) i5-5200 CPU @ 2.20GHz processor and 8 GB of memory was used, with no GPU, which is the most important hardware component in deep learning for computer vision research.

3
https://www.anaconda.com/download/
4
https://www.tensorflow.org/install/install_windows
5
Low-level Python library that runs on top of NumPy; currently not used for direct implementation of CNN.
6
https://www.microsoft.com/am-et/p/visio-professional-2019

Evaluation Technique
After training our model, we need to know how well it generalizes to never-before-seen data. This tells us whether the model classifies new data well, or only performs well on the training data (memorizing the data fed to it) and not on unseen data. Model evaluation, therefore, is the process of estimating the generalization accuracy of the model on unseen data (in our case, the test data). It is not recommended to use the training data for evaluating a model, because the model remembers the data samples fed during training, i.e. it predicts correctly for the data points seen during training but not necessarily for data it has not seen. In this thesis, the classification accuracy metric is used, which is the recommended technique for classification problems where all the classes of the dataset have the same number of samples [21]. Using this technique, the dataset is divided into training, validation, and testing sets. During training, the validation split is fed to the model to get performance metrics: the model returns the accuracy and loss on the training data and the accuracy and loss on the validation data, i.e. training accuracy, validation accuracy, training loss, and validation loss, which can be plotted against epochs. Finally, the testing data (images that have been used in neither the training nor validation sets) is given to the trained model to test its performance, and the model returns the accuracy and loss on the testing data, which was never seen during training.
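
A sketch of this evaluation workflow in Keras is shown below. The History object returned by fit() holds the four metrics named above, and evaluate() reports the accuracy and loss on the held-out test data; the model and data generators are assumed to have been defined as in the earlier sketches.

```python
import matplotlib.pyplot as plt

# model, train_flow, val_flow, and test_flow are assumed to exist already
history = model.fit(train_flow, validation_data=val_flow, epochs=50)

# plot training/validation accuracy against epochs
# (the keys may be 'acc'/'val_acc' in older Keras versions)
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()

# final generalization estimate on images never seen during training
test_loss, test_acc = model.evaluate(test_flow)
```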

CHAPTER FOUR
DESIGN AND EXPERIMENT
This chapter focuses on the design of the proposed model and its experimental setup. Specifically, it describes the design of the proposed model, how features are extracted and classification is performed in the proposed model, and how other pretrained models are used through the technique called transfer learning.

Model Selection
A deep learning algorithm, CNN, was chosen based on the literature on computer vision, especially image classification. CNNs represent an interesting method for adaptive image processing. The algorithm is used for feature extraction, classification, training, and testing, as well as for evaluating the accuracy of the model. CNNs take raw data, without the need for a separate pre-processing or feature extraction stage; in addition, the feature extraction and classification stages occur naturally within a single framework.

The main advantage of using the CNN algorithm for the detection of BWE is that it is more robust and automated than classical machine learning algorithms [21]. In classical machine learning, there is a need to develop different algorithms for different problems, so more handcrafted algorithms are used; with CNN, once we have developed an algorithm for the detection of Bacterial Wilt in Enset crop, it can be applied to related plants like banana and cassava, so it is easier to generalize and reuse for different but related problems [20]. Some of the main reasons that CNN is used in this thesis are:

• A lot of previously conducted research has shown that CNN is better than other classification algorithms and is the state of the art for computer vision applications.
• CNNs are designed by emulating the human visual system, so for image-related tasks CNN is better than other deep learning models.
• Most classical machine learning approaches require explicit extraction of the features used for learning from the image before classification and prediction.
• Most neural network algorithms only accept vectors (1D), while most real-world images are tensors (3-dimensional), so the actual image (input) must be flattened into a 1D vector, which is difficult and computationally expensive; CNN, by contrast, accepts 3-dimensional images.
• CNN can capture temporal and spatial dependencies with the help of relevant kernels.

Overview of BWE Detection


Before starting the detection process, there are two main phases: training and testing. In the training phase of the network, the CNN algorithm accepts the size-normalized images contained in the training and validation datasets. In this phase, more training data is generated to fit the CNN model by using the data augmentation techniques specified in Table 4.2. The following figure (Figure 4.1) illustrates the overall architecture of the process of BWE detection using the CNN algorithm.

Figure 4.1. Block diagram of the detection of Bacterial Wilt disease (training phase: preprocessing and augmentation of the training and validation images, then feature extraction and classification in the stacked CNN layers of the proposed and pretrained models, followed by model evaluation; testing phase: preprocessing of the test image and prediction with the saved predictive model)

The model uses the augmented data for training and the original validation data to give performance metrics. Inside the stacked layers of the CNN, useful features of each image are extracted, and classification based on the extracted features is performed; this process is called model training. During training, we can assess the performance of the model using the validation dataset, which is basically used to measure model performance. After assessing performance, the best-performing model is saved and used as the predictive model. The testing phase is then performed by giving the predictive model images that were unseen during training. The model finally gives a class prediction, which is the probability of the image belonging to one of the classes given during training (in our case, diseased and healthy).

Training Components of the Proposed Model


The selection of a CNN architecture is a very difficult part, because most architectures are deployed in large-scale applications such as ILSVRC, contain millions of parameters with thousands of classes, and need high computational power. In this thesis, the architecture is deployed on limited hardware resources and designed for only two classes. In order to find an appropriate model, a CNN model is designed that works well on a small number of images with very low computational resources (CPU and GPU). The proposed model has 5 convolution layers and 3 fully connected layers; ReLU is included as the activation function in the hidden layers to add non-linearity during the training of the network, and dropout is included after the first two fully connected layers to prevent the problem of overfitting. Due to the smaller number of classes, the hardware resources available, and the number of images, we scaled down the number of neurons, parameters, and filters compared to pre-trained CNN models.

Figure 4.2. Proposed model

4.3.1 Proposed Model Description


Using the input image size (W1), receptive field size (F), stride (S), and the amount of zero padding (P), we can compute the spatial size of the output volume in each layer [29]. The following equation gives the exact output volume size of every layer in the proposed model.

𝑂𝑢𝑡𝑝𝑢𝑡 𝑠𝑖𝑧𝑒 (𝑊2) = (𝑊1 − 𝐹 + 2𝑃)/𝑆 + 1 (4.1)

Where: W1 is the size of the input volume, F is filter size, P is the number of zero
paddings, and S is the stride.

The initial spatial size of the input volume (W1) is 127 × 127 × 3, and it changes after each convolution or pooling operation. The initial filter size F and stride S are 5 × 5 × 3 and 2 respectively, and these also change in later convolution and pooling operations. There is no zero padding in our network; the value of P is always zero throughout the model. In the following table, all of the parameters in each layer are described according to Equation (4.1) given above.

The spatial parameters have mutual constraints. For example, when the input volume has size W1 = 10, no zero-padding is used (P = 0), and the filter size is F = 3, then it would be impossible to use stride S = 2, since Equation (4.1) gives 4.5, which is not an integer, indicating that the neurons do not "fit" neatly and symmetrically across the input. We considered this constraint on the parameters when resizing the images contained in our dataset. If this arrangement is not respected, the libraries used to implement the CNN model will either throw an exception, zero-pad the rest of the area, or crop the image to make it fit.
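
Equation (4.1) and its integer constraint can be checked with a small helper function; a minimal sketch follows.

```python
def conv_output_size(w1, f, s, p=0):
    """W2 = (W1 - F + 2P) / S + 1; raises if the neurons do not fit."""
    w2 = (w1 - f + 2 * p) / s + 1
    if w2 != int(w2):
        raise ValueError(f'invalid setting: output size {w2} is not an integer')
    return int(w2)

print(conv_output_size(127, 5, 2))  # 62: the first conv layer of the model
conv_output_size(10, 3, 2)          # raises: (10 - 3)/2 + 1 = 4.5
```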

Input layer: the input layer of our CNN model accepts RGB images of size 127 × 127 × 3
with two different classes (diseased and healthy). This layer only passes the input to the
first convolution layer without any computation. Therefore, there are no learnable features
and the number of parameters in this layer is 0.

Convolutional layer: the proposed model has five convolutional layers. The first convolutional layer filters the 127 × 127 × 3 input image using 32 kernels of size 5 × 5 × 3 with a stride of 2 pixels. Since (127 − 5)/2 + 1 = 62, and since this layer has a depth of K = 32, the output volume of this layer is 62 × 62 × 32. The product of the output volume dimensions gives the total number of neurons in this first conv layer, which is 123,008. Each of the 62 ∗ 62 ∗ 32 neurons in this volume is connected to a region of size 5 × 5 × 3 in the input volume. Parameter sharing is used to control the number of parameters in the convolution layers. Without it, each of the 62 ∗ 62 ∗ 32 = 123,008 neurons in the first convolution layer would have 5 ∗ 5 ∗ 3 = 75 weights and 1 bias, which adds up to 123,008 ∗ 75 = 9,225,600 weights in the first layer alone. Clearly this number is very high and infeasible to train on our machine. Here the concept of parameter sharing, one of the advantages of CNNs over traditional neural networks, comes in: if one feature is useful to compute at some spatial location (𝑥, 𝑦), then it should also be useful to compute at some other position (𝑥₂, 𝑦₂). In other words, denoting a single 2-dimensional slice of depth as a depth slice, the volume of size 62 × 62 × 32 has 32 depth slices, each of size 62 × 62, and we constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing, the first convolution layer of the proposed model has only 32 unique sets of weights (one for each depth slice), for a total of 32 ∗ 5 ∗ 5 ∗ 3 = 2,400 unique weights, or 2,432 parameters after adding the 32 biases; all 62 ∗ 62 neurons in each depth slice share the same parameters. The output volume and parameters of each learnable layer in the proposed model are described in Table 4.1 below.
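
The effect of parameter sharing can be verified with a few lines of arithmetic using the numbers above; this is only a sanity check of the counts reported in Table 4.1.

```python
# Parameter count of the first convolution layer, with and without sharing.
neurons = 62 * 62 * 32             # neurons in the first conv layer's output volume
weights_per_neuron = 5 * 5 * 3     # one 5x5x3 receptive field per neuron

print(neurons * weights_per_neuron)   # without sharing: 9,225,600 weights
print(32 * weights_per_neuron + 32)   # with sharing: 2,400 weights + 32 biases = 2,432
```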

The second convolutional layer takes as input the (pooled) output of the first convolutional layer and filters it using 32 kernels of size 3 × 3 × 32. The third, fourth, and fifth convolutional layers are connected to each other without intervening pooling layers. The third convolutional layer takes as input the output of the second pooled convolutional layer and filters it with 64 kernels of size 3 × 3 × 32. The fourth convolutional layer has 64 kernels of size 5 × 5 × 64, and the fifth convolutional layer has 64 kernels of size 3 × 3 × 64. All convolutional layers of the proposed model use the ReLU nonlinearity as activation function. ReLU is chosen because deep CNNs train faster with gradient descent under ReLU than under saturating nonlinearities such as tanh [33].

Pooling layer: there are three max-pooling layers, placed after the first, second, and fifth convolutional layers of the proposed model. The first max-pooling layer reduces the output of the first convolutional layer with a filter of size 3 × 3 and stride 1. The second max-pooling layer takes as input the output of the second convolutional layer and pools using 2 × 2 filters with stride 1. The third max-pooling layer has a filter of size 2 × 2 with stride 2. These layers have no learnable features and only perform a down-sampling operation along the spatial dimensions of the input volume; hence the number of parameters in these layers is 0.

Fully Connected (FC) layer: the proposed model has three fully connected layers, including the output layer. The first two fully connected layers have 64 neurons each, and the final layer, which is the output layer of the model, has only one neuron. The first FC layer accepts the output of the fifth conv layer after the 3D volume of data is converted into a vector (flattening). This layer computes the class score, and its number of neurons is predefined during the development of the model. It works like an ordinary NN layer and, as the name implies, each neuron in this layer is connected to all the activations in the previous layer.

Output layer: the output layer is the last layer (the third FC layer) of the model, and it has 1 neuron with a sigmoid activation function, because the model is designed to separate 2 classes, i.e., it performs binary classification.

Table 4.1. Summary of proposed model parameters

| Layer | Filter | Depth | Stride | No. of Param. | Output size |
|---|---|---|---|---|---|
| Input Image | - | - | - | 0 | 127 × 127 × 3 |
| 1 conv2D + ReLU | 5 × 5 | 32 | 2 | 2,432 | 62 × 62 × 32 |
| maxPool2D | 3 × 3 | - | 1 | 0 | 60 × 60 × 32 |
| 2 conv2D + ReLU | 3 × 3 | 32 | 1 | 9,248 | 58 × 58 × 32 |
| maxPool2D | 2 × 2 | - | 1 | 0 | 57 × 57 × 32 |
| 3 conv2D + ReLU | 3 × 3 | 64 | 1 | 18,496 | 55 × 55 × 64 |
| 4 conv2D + ReLU | 5 × 5 | 64 | 2 | 102,464 | 26 × 26 × 64 |
| 5 conv2D + ReLU | 3 × 3 | 64 | 1 | 36,928 | 24 × 24 × 64 |
| maxPool2D | 2 × 2 | - | 2 | 0 | 12 × 12 × 64 |
| flatten | - | - | - | 0 | 9216 |
| 6 FC + ReLU | - | 64 | - | 589,888 | 64 |
| Dropout | - | - | - | 0 | 64 |
| 7 FC + ReLU | - | 64 | - | 4,160 | 64 |
| Dropout | - | - | - | 0 | 64 |
| Output FC + Sigmoid | - | 1 | - | 65 | 1 |
| Total Number of Parameters | | | | 763,681 | |

As we can see in the table above, the proposed model has 763,681 parameters, which is extremely small compared to other deep learning architectures such as AlexNet (60 million parameters), VGG (138 million parameters), and GoogLeNet (about 4 million parameters). Deep learning models are generally considered to have a massive number of parameters, and therefore need huge computational power and very large amounts of data to be trained from scratch. The proposed model, by contrast, is trained with a minimum amount of resources and data and still performs very well.
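
The layer settings in Table 4.1 can be reproduced in Keras (the library used in this thesis) with the following sketch. The dropout rate is not specified in the table, so the value 0.5 used here is an assumption.

```python
# A Keras sketch of the proposed architecture following Table 4.1.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (5, 5), strides=2, activation="relu",
                  input_shape=(127, 127, 3)),                 # 62 x 62 x 32
    layers.MaxPooling2D((3, 3), strides=1),                   # 60 x 60 x 32
    layers.Conv2D(32, (3, 3), activation="relu"),             # 58 x 58 x 32
    layers.MaxPooling2D((2, 2), strides=1),                   # 57 x 57 x 32
    layers.Conv2D(64, (3, 3), activation="relu"),             # 55 x 55 x 64
    layers.Conv2D(64, (5, 5), strides=2, activation="relu"),  # 26 x 26 x 64
    layers.Conv2D(64, (3, 3), activation="relu"),             # 24 x 24 x 64
    layers.MaxPooling2D((2, 2), strides=2),                   # 12 x 12 x 64
    layers.Flatten(),                                         # 9216
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                                      # rate assumed; not given in Table 4.1
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                    # binary output
])

model.summary()  # reports the 763,681 total parameters of Table 4.1
```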

Feature Extraction Using Proposed Model


Feature extraction is the main stage of most image classification problems: before the classification stage starts, the important features used to classify the images are extracted by the CNN algorithm. For the detection of BWE, one feature is used, namely color, because in the traditional system it is the visual color difference that allows human vision to identify whether the crop is infected by the disease or not. Hence, the proposed model produces its output (the predefined classes) based on the color features of the input image, which are learned during training. When training a CNN, the network learns what type of features to extract from the input image. As discussed in Section 2.5.1, features are extracted by the convolution layers of the CNN, and feature extraction is the main purpose of these layers. They contain a series of filters, or learnable kernels (Figure 4.3), which aim to extract local features from the input image.

Figure 4.3. Feature Extraction in the proposed model


The filters in the convolution layer slide across the input image to detect features. During feature extraction, the convolution layer accepts the pixel values of the input image; these values are multiplied and summed with the values of the filter (a set of weights) to give an output called a feature map. The feature map is the extracted representation of the input image: it contains the patterns used to distinguish the given images. The feature map ($M_i$) is computed as:

$M_i = \sum_k w_{ik} * x_k + b \qquad (4.2)$

Where: $w_{ik}$ is the filter applied to the input, $x_k$ is the kth channel of the input image, and b is the bias term.

The features in this case contain the different color patterns of the given image. Each value of the feature map is then passed through an activation function to add nonlinearity to the network. After the nonlinearity, the feature map is fed into a pooling layer to reduce its resolution and the computational complexity of the network. The process of extracting useful features from the input image consists of multiple similar steps, cascading convolution layers, nonlinearities, and pooling layers.
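
A single feature-map entry of Equation (4.2) can be illustrated with NumPy as follows; the filter and pixel values here are random placeholders rather than learned weights.

```python
# A NumPy sketch of Equation (4.2) at one spatial position: filter values are
# multiplied with the input pixels channel by channel, summed, and a bias is
# added to give one entry of the feature map.
import numpy as np

rng = np.random.default_rng(0)
patch = rng.random((5, 5, 3))   # a 5x5x3 region of the input image
w = rng.random((5, 5, 3))       # one 5x5x3 filter (set of weights)
b = 0.1                         # bias term

m_i = np.sum(w * patch) + b     # one feature-map value M_i
print(m_i)
```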

Classification Using Proposed Model


In the proposed model, classification is performed in the fully connected layers. As shown in the model above (Figure 4.2), there are a total of three fully connected layers, including the output layer. The main function of these layers is to classify the input image based on the features extracted by the convolution layers. The first fully connected layer accepts the output of the convolution and pooling layers, but those outputs are joined together and flattened into a single vector before being fed into the fully connected layer. Each value of the vector represents the evidence that a certain feature (color, in our dataset) belongs to a class. The following figure illustrates how the input values flow into the FC layers of the network.

Figure 4.4. Classification in the proposed model


In our case, if the image is diseased, yellow colors represent a high probability for the class diseased. The values are multiplied by weights and passed through an activation function (ReLU), and the result is passed forward to the output layer, in which every neuron represents a classification label (diseased or healthy). In other words, the fully connected layer performs a dot product of the input data (the features extracted from the convolution and pooling layers) and the weights to produce a single value.

Classification Using Pre-Trained Models


There are CNN models that have been trained on a very large number of images, typically on a large-scale image classification task with thousands of classes [21]. Because these models are trained on millions of images and thousands of classes, their ability to generalize is high. Therefore, the features learned by the models can be reused for many other real-world problems, even when those problems differ from the original task [21]. Such models are called pre-trained models, and the process of training them once and reusing them is called transfer learning. Anyone can take these pre-trained models and train them on their own data instead of training a large CNN model from scratch, which is computationally expensive and sometimes impossible. In this thesis, pre-trained models are imported and trained on our dataset. Most pre-trained CNN models have two main parts: the first is the convolutional base, which includes the convolution and pooling layers that extract useful features from the input image, and the second is the classifier, which classifies the input image based on the features extracted by the convolutional base. The convolutional base contains the features learned during training; in other words, it serves as the knowledge base. Within the convolutional base, the convolution layers near the input layer learn features general to all the given images, whereas the convolution layers near the classifier hold features specific to the original training images. Therefore, we can take the convolution blocks near the input of the model, reuse them for another problem, and expect them to generalize well [21].

Figure 4.5. Transfer learning
There are two commonly used approaches to transfer learning. The first is to retrain the entire convolutional base of the pre-trained model, changing only the fully connected layers; the other is to retrain only part of the convolutional base, freezing the remaining pre-trained weights, and replace the fully connected layers. Retraining only some parts of the convolutional base is called fine-tuning. We have trained two pre-trained models, namely VGG16 and InceptionV3, on our dataset and compared the results with the proposed model.

Experimental setup
Three scenarios are considered in the experiments of this thesis. The first two scenarios classify the images using the transfer learning approach, and the third scenario uses a proposed CNN model based on the VGG16 architecture. For transfer learning, the well-known CNN architectures VGG16 [45] and InceptionV3 [46], which achieved top results in the largest image classification competition, are chosen; these models were trained on millions of images and thousands of classes. The proposed model is a modified version of the VGG16 model, dramatically reducing its 138 million parameters to 763,681 (Table 4.1).

4.7.1 Augmentation Parameters
The images are generated using the different augmentation parameters described in Table 4.2 below. In this way, a sufficient number of images is generated, because the dataset is extended using different augmentation techniques.

Table 4.2. Augmentation techniques used

| Augmentation parameter | Augmentation Factor |
|---|---|
| Horizontal Flip | 1 (True) |
| Shear Range | 0.3 |
| Width Shift Range | 0.2 |
| Height Shift Range | 0.2 |
| Zoom Range | 0.2 |
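
The settings of Table 4.2 map directly onto Keras' ImageDataGenerator; the sketch below shows this mapping. The rescaling factor and the directory path are assumptions for illustration.

```python
# A sketch of the augmentation settings in Table 4.2 with Keras'
# ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,         # assumed normalization of pixel values to [0, 1]
    horizontal_flip=True,      # Horizontal Flip = 1 (True)
    shear_range=0.3,           # Shear Range = 0.3
    width_shift_range=0.2,     # Width Shift Range = 0.2
    height_shift_range=0.2,    # Height Shift Range = 0.2
    zoom_range=0.2,            # Zoom Range = 0.2
)

train_generator = train_datagen.flow_from_directory(
    "dataset/train",           # hypothetical path with healthy/ and diseased/ subfolders
    target_size=(127, 127),
    batch_size=32,
    class_mode="binary",
)
```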

4.7.2 Hyperparameter Settings


Hyperparameters are configurations external to the deep learning algorithm whose values are set before the training process begins. There is no standard rule for choosing the best hyperparameters for a given problem [20]; therefore, many experiments were conducted to choose them. The hyperparameters chosen for the model are described below.

• Optimization algorithm: the proposed model is trained using the gradient descent optimization algorithm to minimize the error rate, and the backpropagation-of-error algorithm is used to update the weights. Gradient descent is by far the most popular and most widely used optimization algorithm in deep learning research [20, 66], and every state-of-the-art deep learning library, including Keras (used in this thesis), contains implementations of gradient descent optimization algorithms. It updates the weights of the model and tunes the parameters, thereby minimizing the loss function. To optimize the gradient descent, the Adaptive Moment Estimation (Adam) optimizer is used [67]. Adam computes an adaptive learning rate for each parameter, using squared gradients to scale the learning rate as well as a moving average of the gradient.
• Learning rate: because backpropagation is used to train the proposed model, a learning rate is applied during the weight update; it controls the amount by which the weights change during backpropagation [20]. The challenging part of our experiments was choosing a proper learning rate. We observed that a very small learning rate takes longer to train than a larger one, but yields a more optimal model. Experiments were done with learning rates of 0.001, 0.01, and 0.1; a learning rate of 0.001 proved optimal across all experiments, even though it takes longer to train.
• Loss function: the choice of loss function is directly related to the activation function used in the output layer (the last fully connected layer) of the model and to the type of problem being solved (regression or classification). In the proposed model, sigmoid is used as the activation function in the last fully connected layer, and the problem being solved is a classification problem, specifically binary classification. We therefore used Binary Cross-Entropy (BCE) as the loss function for our model. Although other loss functions exist, such as Categorical Cross-Entropy (CCE) and Mean Squared Error (MSE), binary cross-entropy is the recommended choice for binary classification [20, 21]. It performs well for models that output probabilities, i.e., it measures the distance between the actual output and the desired output. The experiments were done using both BCE and CCE loss.
• Activation function: experiments were conducted with two different activation functions, SoftMax and Sigmoid, in the proposed model, and Sigmoid performed better. The Sigmoid activation function is used in the output layer of the model because it is the best choice for a binary classification problem [21, 20].
• Number of epochs: the number of times the entire dataset passes forward and backward through the network. In our experiments, the model was trained with different numbers of epochs, ranging from 10 to 150. We observed that with too few or too many epochs, the model develops a large gap between the training error and the validation error. After many experiments, the model proved optimal at thirty (30) epochs.
• Batch size: the number of input samples passed through the network at once. It is impractical to give all the data to the computer in a single pass, so the input is divided into several smaller batches; this is preferred in model training to minimize the computational time of the machine. A batch size of 32 is used during model training in our experiments.

Table 4.3. Summary of hyperparameters used during model training

| Parameter | Epoch | Batch size | Activation Function | Loss Function | Optimization algorithm | Learning rate |
|---|---|---|---|---|---|---|
| Value | 30 | 32 | Sigmoid | BCE | Adam | 0.001 |
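
Compiling and training the model with these hyperparameters could look as follows; `model` is the proposed network sketched earlier, and `train_generator`/`val_generator` are assumed to be built like the augmentation generator above.

```python
# A sketch of training with the hyperparameters of Table 4.3.
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=0.001),   # Adam, learning rate 0.001
    loss="binary_crossentropy",            # BCE for binary classification
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,                       # batch size 32 is set in the generator
    validation_data=val_generator,
    epochs=30,                             # 30 epochs
)
```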

CHAPTER FIVE

RESULTS AND DISCUSSIONS


This chapter describes the implementation of the classification of Bacterial Wilt disease on the Enset crop using the CNN algorithm, which was specified in detail in the previous chapter. All the experimentation details, such as the results of each experiment and a discussion of those results, are presented, with the results shown in different graphs and tables.

Experimental Result
To classify the input image, the color feature of the image is used, as described in detail in Section 4.4. The main reason the color feature is chosen is that when we look at an image, we can directly tell whether the crop is healthy or diseased. Three different classification scenarios were conducted during the experiments to test the classification performance.

The first two scenarios are based on pre-trained CNN models and the third uses the proposed model. Like most deep learning classification pipelines, our experiments have two main phases: training and testing. In the training phase, data is repeatedly presented to the classifier while the weights are updated to obtain the desired response. In the testing phase, the trained algorithm is applied to data the classifier has never seen (test data) to evaluate the performance of the classification algorithm. The experimental results are presented in detail below.

Pre-trained CNN
Two pre-trained CNN models, VGG16 and InceptionV3, which are widely used architectures trained on ImageNet, are used and fine-tuned. The VGG model is chosen for its simplicity and the Inception model for its more complex design; experiments are thus conducted on both a relatively simple model and a complex one to obtain their classification accuracy on our dataset. All experiments are conducted on the same dataset with the same hyperparameter settings.

5.2.1 Detection of BWE by using VGG16 Pre-trained Model

The VGG model is characterized by its simplicity, using only 3 × 3 convolution layers stacked on each other with increasing depth. There are two common versions of the VGG model: VGG16, with 16 weight layers, and VGG19, with 19 weight layers. The model accepts 224 × 224 RGB images as input and outputs scores for the 1000 classes of the ImageNet challenge dataset (ImageNet itself contains over 14 million images; the challenge uses a 1000-class subset) [45]. The input is passed through the stacked convolution layers of the model, each with a 3 × 3 receptive field and followed by a ReLU nonlinearity. The model uses a stride of 1 and spatial padding of 1 for all 3 × 3 convolutions. After each block of two or three consecutive convolution layers, there is a max-pooling layer of window size 2 × 2 with stride 2 to reduce the spatial size of the convolution output. In total there are 16 convolution layers in the VGG19 architecture and 13 in the VGG16 architecture, with 5 max-pooling layers in both. Finally, for classification, 3 fully connected layers follow the stack of conv layers: the first two have a depth of 4096 channels, and the final layer has 1000 channels, equal to the number of classes in the ImageNet dataset, with a SoftMax activation function.

In our experiment, a down-sampled RGB image of size 127 × 127 is given as input to the model, and the model is fine-tuned to output the 2 classes of our dataset. The original VGG16 model has a total of 138 million parameters, which is very large; we trained the model with 15,894,849 parameters, because the spatial dimension of the image in our model is smaller and we only trained parts of the model. As discussed in previous sections, we fine-tuned the VGG16 model using only the conv base of the network. We conducted several experiments to find the optimal configuration by training different conv blocks of the model. When the model was trained with the whole conv base trainable, changing only the fully connected layers, the result showed heavy overfitting. This overfitting occurs because the model's weights were originally trained on millions of images, different from our dataset, across thousands of classes, while we tried to retrain it using only 4,896 original images. Hence, we needed to update only some of the weights of the network and to increase the number of images using data augmentation. We therefore froze some of the layers (conv blocks) of the model and conducted different experiments using the augmented data. After several experiments, we found that freezing the first 3 conv blocks was optimal, compared to freezing the first 2 or the first 4, when training on the 96,000 images generated by the augmentation techniques. The training of the network was performed using the hyperparameters described in Table 4.3 above. The experiment yielded a mean training accuracy of 96.7% and a mean test accuracy of 92.4%.
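
The fine-tuning setup described above can be sketched in Keras as follows: VGG16 is loaded with ImageNet weights, the first three conv blocks are frozen, and a new binary classifier head replaces the original fully connected layers. The head layout (a single 64-unit layer) is an assumption, since the thesis only states that the fully connected layers were replaced.

```python
# A sketch of fine-tuning VGG16: freeze conv blocks 1-3, retrain blocks 4-5
# together with a new binary classifier head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(127, 127, 3))
for layer in base.layers:
    # Layers of blocks 1-3 keep their ImageNet weights; blocks 4-5 stay trainable.
    layer.trainable = not layer.name.startswith(("block1", "block2", "block3"))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation="relu"),    # assumed head size
    layers.Dense(1, activation="sigmoid"),  # diseased vs. healthy
])
```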

5.2.2 Result Analysis of VGG16

The following two plots show the classification accuracy and loss with respect to epochs, using metrics such as training and validation accuracy and training and validation loss, for the VGG16 pre-trained model after the changes we made to the original so that it could classify our dataset well. The training accuracy in the first epoch is around 84%; it rises steadily, passing 95% at epoch 5. Between epochs 5 and 10 the training accuracy stays above 95%, and after the 14th epoch it exceeds 97%. As the graph shows, the accuracy climbs within the first few epochs; this is because of the dataset: the patterns in the crop images are visible even to the human eye, and they are easy for the CNN model to differentiate. In general, the validation accuracy line is almost in sync with the training accuracy line, and the validation loss line is likewise in sync with the training loss. Even though the validation accuracy and validation loss curves are not perfectly smooth, they show that the model is not overfitting; in other words, the validation loss is decreasing rather than increasing, and the validation accuracy is increasing rather than decreasing.

Figure 5.1. Training and validation accuracy for VGG16 Pre-trained model

Figure 5.2. Training and validation loss for the VGG16 pre-trained model
The results obtained from the experiment with the pre-trained VGG16 model are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately.

Table 5.1. Mean accuracy and loss of VGG16 pre-trained model

| Metrics | Training Accuracy | Validation Accuracy | Test Accuracy | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|
| Value | 96.6% | 96.8% | 92.4% | 7.0% | 7.7% | 22.1% |

5.2.3 Detection of BWE by using InceptionV3 Pre-trained Model
Inception is an efficient deep CNN architecture for computer vision developed by Google as GoogLeNet; it derives its name from the famous internet meme "We Need to Go Deeper" [46]. The architecture proposes a deeper network (a large number of layers and a large number of neurons in each layer) with less computational cost. There is a trade-off to consider in deeper networks: increasing the number of layers makes the network more prone to overfitting, while increasing the number of neurons in each layer demands more computational resources. The Inception model addresses this by introducing sparsely connected modules (filters of multiple sizes in the same layer, as shown in Figure 5.3 [46]) in place of dense connections, especially inside the convolution layers; this approach keeps the computational cost manageable while increasing the depth of the network.

Figure 5.3. Example of the Inception module

This model is trained on the ImageNet dataset, accepting 299 × 299 × 3 images as input and producing a final output of 1000 classes. It has a total of 42 layers, yet it is computationally faster than the VGG model even though VGG has only 16 or 19 layers.

In our experiment, the InceptionV3 pre-trained model was trained on our dataset of 127 × 127 color images; the entire model was retrained without applying any fine-tuning (freezing) in the conv base, changing only the output to 2 classes. A total of 4,896 images were given to the network, and it produced a promising result without an overfitting problem.
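
This setup differs from the VGG16 experiment in that nothing is frozen; a sketch of it follows, with the pooling choice and the single sigmoid output unit being assumptions.

```python
# A sketch of retraining InceptionV3 end to end with a new binary output.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(127, 127, 3), pooling="avg")  # pooling choice assumed
base.trainable = True   # entire conv base retrained; no layers frozen

model = models.Sequential([
    base,
    layers.Dense(1, activation="sigmoid"),  # output changed to the 2-class problem
])
```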

5.2.4 Result Analysis of InceptionV3
In the following plot, the training accuracy in the first epoch is around 70% and the validation accuracy is around 75%. Both then increase rapidly: by epoch 5 the values are clearly higher, and after epoch 10 they rise further. The validation accuracy increases steadily without decreasing, the validation loss decreases steadily without increasing, and there is not much gap between the training and validation accuracy and loss. Therefore, there is no overfitting problem in the model when trained on our dataset.

Figure 5.4. Training and validation accuracy of InceptionV3 pre-trained model

Figure 5.5. Training and validation loss of InceptionV3 pre-trained model

The results obtained from the experiment with the pre-trained InceptionV3 model are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately.

Table 5.2. Mean accuracy and loss of InceptionV3 pre-trained model

| Metrics | Training Accuracy | Validation Accuracy | Test Accuracy | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|
| Value | 91.7% | 91.5% | 90.4% | 14.3% | 14.0% | 27.2% |

Detection of BWE by using the Proposed CNN Model


In this thesis, a CNN model was designed by modifying the VGG16 architecture so that it can run on minimal hardware resources while giving promising results. As discussed in Section 4.3.1, the model has a total of 8 weight layers: five convolutional layers and three dense layers. Like the pre-trained models evaluated in this thesis, it accepts 127 × 127 color images and gives an output of 2 classes. The proposed model is trained with a total of 111,060 images after data augmentation is applied to the dataset during training. Many experiments were conducted with the proposed model: changing the ratio of the training and testing datasets, using different learning rates, and finally using different activation functions.

5.3.1 Scenario 1: Changing Training and Testing Dataset Ratio.

The results obtained from the experiments with the proposed model using different training/testing split ratios are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately.

Table 5.3. Result of experiments by using different training and testing dataset ratio

| Training Split | Test Split | Training Acc. | Validation Acc. | Test Acc. | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|---|
| 60% | 40% | 92.6% | 93.6% | 93.5% | 20.4% | 18.2% | 22% |
| 70% | 30% | 93.8% | 95.4% | 97% | 17.4% | 14% | 7.8% |
| 80% | 20% | 98.49% | 98.48% | 97.86% | 4.3% | 5.7% | 6.23% |
| 90% | 10% | 95.3% | 96.5% | 96.6% | 14.1% | 10.1% | 7.2% |

62
As Table 5.3 shows, the proposed model gives promising results across the different training and testing dataset ratios. Among these experiments, using 80% of the data for training and 20% for testing performed best. An 8:2 ratio means 80% of the whole dataset is used for training and 20% for testing. In addition, the validation data is taken from the training data: in the 6:4 ratio, the validation data was taken as 40% of the training data (not of the whole dataset); in the 7:3 ratio, as 30% of the training data; and so on.
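
The 8:2 split with validation data carved out of the training portion can be sketched with scikit-learn as follows; the placeholder arrays and the 20% validation fraction for the 8:2 case are assumptions for illustration.

```python
# A sketch of the 8:2 train/test split, with validation taken from training.
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the loaded images and labels.
images = np.zeros((100, 127, 127, 3), dtype=np.float32)
labels = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, random_state=42)    # 80% train / 20% test

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42)  # validation from the training split
```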

5.3.2 Scenario 2: Changing Learning Rate.

The results obtained from the experiments with the proposed model using different learning rates are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately. As the results show, higher learning rates yield lower accuracy than smaller ones. Therefore, a learning rate of 0.001 is considered optimal for the proposed model.

Table 5.4. Result of the proposed model by using different learning rate

| Learning Rate | Training Acc. | Validation Acc. | Test Acc. | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|
| 0.1 | 91.2% | 91.4% | 74.7% | 21.7% | 20.1% | 60.4% |
| 0.01 | 94.2% | 95.2% | 97.4% | 16.7% | 13.5% | 8.2% |
| 0.001 | 98.49% | 98.48% | 97.86% | 4.3% | 5.7% | 6.23% |

5.3.3 Scenario 3: Using Different Activation Function.

The results obtained from the experiments with the proposed model using different activation functions are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately.

Table 5.5. Results of the proposed model by using different activation functions

| Activation function | Training Acc. | Validation Acc. | Test Acc. | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|
| SoftMax | 94.6% | 95.2% | 97.1% | 15.5% | 13.2% | 8.6% |
| Sigmoid | 98.49% | 98.48% | 97.86% | 4.3% | 5.7% | 6.23% |

As we can see in Table 5.5, using the sigmoid activation function in the last fully connected (output) layer for binary classification is better than SoftMax, which is preferred for multiclass classification problems.

Finally, the proposed model successfully classifies the given images with a mean training accuracy of 98.5% and a mean test accuracy of 97.86%, using a learning rate of 0.001, a sigmoid activation function in the output layer, and a training/testing dataset ratio of 8:2.

5.3.4 Result Analysis for the Proposed BWE Detection Model


As we can see in the following plot (Figure 5.6), at the beginning of training the training accuracy curve reaches 90% and the validation accuracy curve is around 91%; both then rise until epoch 5, after which both curves pass 98% and increase very slowly. Looking at the training and validation loss curves in Figure 5.7, both decrease nearly linearly from 0.2 toward 0.025 between the first epoch and epoch 30. After epoch 7, both the validation and training loss curves drop below 0.05, having decreased sharply from the beginning of training, and they never fall below 0.025, the smallest value observed during training.

Finally, the validation accuracy is in sync with the training accuracy, and the validation loss is in sync with the training loss; both pairs of curves are nearly linear. The curves show that there is no overfitting in the proposed model: the validation accuracy is increasing rather than decreasing, the validation loss is decreasing rather than increasing, and, most importantly, there is not much gap between training and validation accuracy, nor between training and validation loss. Therefore, we can say that the model's generalization capability is very good, since the validation loss is only slightly higher than the training loss.

Figure 5.6. Training and validation accuracy of the proposed model

Figure 5.7. Training and validation loss of proposed model

The results obtained from the experiment with the proposed model are presented in the following table, using the classification accuracy metrics in percentage form for the training, validation, and test data separately.

Table 5.6. Mean accuracy and loss of the proposed model

| Metrics | Training Accuracy | Validation Accuracy | Test Accuracy | Training Loss | Validation Loss | Test Loss |
|---|---|---|---|---|---|---|
| Value | 98.49% | 98.48% | 97.86% | 4.3% | 5.7% | 6.23% |

Discussion
As presented in the previous sections, the experiments were conducted using three different CNN models: two pre-trained models and the proposed model, all on the same hardware configuration. The number of images used to train each model differs according to the depth of the model, i.e., its number of parameters. Among the pre-trained models, VGG16 was trained with a total of 96,000 images and InceptionV3 with 4,896 images; the proposed CNN model was trained with a total of 111,060 images. All models were tested on a separate dataset, unseen during training, and obtained good results. Classification accuracy metrics were used to measure the models' performance, and when the proposed model is compared against the two pre-trained models and the models described in our related works section (Table 2.1), the proposed model has better classification results.

As shown in the plots below, the mean training accuracies of VGG16, InceptionV3, and the proposed CNN model are 96.6%, 91.7%, and 98.49% respectively, showing that all models perform well on the training dataset. The mean validation accuracies are 96.8%, 91.5%, and 98.48% respectively. The difference between mean training accuracy and mean validation accuracy is very small for each of the three experiments, and for the proposed model the two are almost identical. This shows that the models are not overfitting, and that the generalization ability of the proposed model is high.

Turning to the mean training loss, which measures the inconsistency between predicted and actual values, the three experiments (VGG16, InceptionV3, and the proposed model) obtained 7%, 14.3%, and 4.3% respectively. The mean validation losses are 7.7%, 14%, and 5.7%, which are nearly the same as the mean training losses when the differences are computed.
[Bar chart: mean training, validation, and test accuracy of the pre-trained VGG16, pre-trained InceptionV3, and proposed modified models.]

Figure 5.8. Mean accuracy of the three experiments


By testing the models with unseen data, we obtained promising results. The test accuracy of VGG16 is 92.4%, that of InceptionV3 is 90.4%, and that of the proposed model is 97.86%. These results show that the proposed model can classify a given image as diseased or healthy with better accuracy than the VGG16 and InceptionV3 pre-trained models.

[Bar chart: mean training, validation, and test loss of the pre-trained VGG16, pre-trained InceptionV3, and proposed modified models.]

Figure 5.9. Mean Loss of the three experiments

The test losses of all the experiments are shown in Figure 5.9 above, and the value for the proposed model is lower than that of the two pre-trained models: the test loss is 22.1% for VGG16, 27.2% for InceptionV3, and only 6.23% for the proposed model. Therefore, the proposed model performs well on both the training and the testing dataset.

The main reasons the proposed model gives better results are, first, the dataset used to train the model, i.e., the images in the dataset are easily classified even by the human eye; and second, the proposed model uses smaller filters in the convolution layers of the network. Using smaller convolutions helps to identify the very small features that distinguish the input images, and the probability of losing an important feature is much lower.

Most deep learning algorithms, especially computer vision models for image classification, are trained on high-performance computing machines with fast GPUs, huge numbers of images (in the millions), and tens of millions of parameters. Our results show, however, that good results can be obtained with small networks that have fewer parameters, lower hardware consumption, and less data.

Higher accuracy would be obtained if the images in the dataset were captured under stable environmental conditions, i.e., a stable distance from the object to the camera, proper lighting, and proper focus. In addition, preprocessing the images by removing noise and unwanted features would increase the accuracy of the model.

CHAPTER SIX

CONCLUSION AND RECOMMENDATIONS

Conclusion
Nowadays, Enset production suffers from a severe problem, Bacterial Wilt disease, which reduces the yield and quality of Enset. Moreover, the shortage of diagnostic tools in developing countries like Ethiopia has a devastating impact on development and quality of life. There is therefore an urgent need to detect the disease at an early stage with affordable and easy-to-use technological solutions. To enable early identification of the disease, we proposed and implemented a deep learning approach using the CNN algorithm. We presented a CNN model that identifies and classifies Bacterial Wilt of Enset using leaf images of the crop as input. The proposed CNN model can be used as a tool to identify Bacterial Wilt disease of Enset.

The first contribution of this thesis to the research community and the wider population is the design and development of a CNN model that correctly detects and classifies the well-known Enset disease, Bacterial Wilt, using images taken in real scenes under challenging conditions such as complex backgrounds, varying image resolution, and different illumination and orientation. The second main contribution is a well-organized and managed dataset of Enset images. To accomplish these, we conducted several experiments using pre-trained models and the proposed model.

In the experiments, we used images collected directly from farms with the help of agricultural experts. We trained the two pre-trained models, VGG16 and InceptionV3, and the proposed model. After several experiments, all of the models achieved good classification results: the VGG16 model gives a training accuracy of 96.6% and a testing accuracy of 92.4%; the Inception pre-trained model gives a training accuracy of 91.7% and a testing accuracy of 90.4%; and the proposed model gives a training accuracy of 98.5% and a testing accuracy of 97.9%.

The results of our experiments show that the proposed CNN model can significantly support the accurate detection of Bacterial Wilt of Enset with little computational effort and far fewer images than is usually expected for deep learning algorithms, most of which are trained on millions of images with large computational resources. Encouraged by these results, we intend to continue this work and test our model on more Enset diseases.

Recommendations
Since the Ethiopian economy depends on agriculture and agricultural products, protecting crops from disease should be a main aim of the agriculture sector. Image analysis techniques are of paramount importance for the early identification and classification of crop disease. In Ethiopia, no research had been conducted so far on the detection of Bacterial Wilt on the Enset crop using image analysis techniques; hence this thesis may encourage researchers to work more in this area. Image analysis, especially using deep learning techniques, for the identification of Enset Bacterial Wilt can be investigated further.

In the future, we are interested in training and testing our model to detect other Enset diseases such as black Sigatoka and leaf speckle. We also recommend testing the model with a larger number of images and a more complex configuration, increasing the number of layers and the number of parameters per layer, in order to extract very complex features from the images. It would also be worthwhile to train the dataset on other pre-trained deep learning models, such as ResNet, which were not included in our experiments due to limited computational resources such as GPUs.

In addition, we plan to make the model estimate the severity of the disease automatically, helping farmers decide whether to intervene. To apply this research in the field of agriculture, we recommend developing a mobile app that takes a picture of the Enset crop, gives an automatic assessment of the severity of the disease in the captured image, and provides helpful expert advice to the user.

References

[1] Federal Democratic Republic of Ethiopia Central Statistical Agency (FDRECSA), "KEY
FINDINGS OF THE 2014/2015 (2007 E.C.) AGRICULTURAL SAMPLE SURVEYS,"
Addis Ababa, 2015.

[2] CSA (Central Statistical Agency), "Agricultural in figures key findings of 2008/09–2010/11
Agricultural Samples Survey for All Sectors and Seasons," Addis Ababa, 2012.

[3] S. W. Fanta and S. Neela, "A review on nutritional profile of the food from enset: A staple
diet for more than 25 per cent population in Ethiopia," Nutrition & Food Science, vol. 49,
no. 5, pp. 824--843, 2019.

[4] G. Welde-Michael, K. Bobosha, G. Blomme, A. Temesgen, T. Mengesha, and S.


Mekonnen, "Evaluation of Enset Clones against Enset Bacterial Wilt," Afr. Crop Sci. J.,
vol. 16, no. 1, pp. 89-95, 2008.

[5] D. Yirgou and J. Bradbury, "Bacterial wilt of Enset (Enset ventricosum) incited by
Xanthomonas campestris sp," Phytopathology, pp. 111-112, 1968.

[6] D. Al Bashish, M. Braik and S. Bani-Ahmad, "Detection and classification of leaf diseases
using K-means-based segmentation and Neural networks-based classification," Information
Technology Journal, vol. 10, no. 2, pp. 267-275, 2011.

[7] A. Tuffa, T. Amentae and G. Gebresenbet, "Value chain analysis of warqe food Products in
Ethiopia," International Journal of Managing Value and Supply Chains, vol. 8, no. 1, pp.
23-42, 2017.

[8] A. Tinku and Ajoy, Image Processing Principles and Applications, Jhon Wiley, 2005.

[9] M. Wolde, A. Ayalew and A. Chala, "Assessment of bacterial wilt (Xanthomonas


campestris pv. musacearum) of enset in Southern Ethiopia," African Journal of Agricultural
Research, vol. 11, no. 19, pp. 1724-1733, 2016.

[10] Z. Yemataw, A. Mekonen, A. Chala, K. Tesfaye, K. Mekonen, D. J. Studholme, and K. Sharma, "Farmers’ knowledge and perception of enset Xanthomonas wilt in southern Ethiopia," Agriculture & Food Security, vol. 6, no. 62, 2017.

[11] B. Guy, D. Miguel, S. J. Kim, P. V. Luis, M. Agustin, O. Walter, P. Stephane and P.


Philippe, "Bacterial Diseases of Bananas and Enset: Current State of Knowledge and
Integrated Approaches Toward Sustainable Management," Frontiers in Plant science, vol.
8, no. 1290, pp. 1-25, 2017.

[12] J. Barbedo and A. Garcia, "Digital image processing techniques for detecting, quantifying
and classifying plant diseases," Springer Plus, vol. 2, no. 660, pp. 1-12, 2013.

[13] J. Amara, B. Bouaziz and A. Algergawy, "A Deep Learning-based Approach for Banana
Leaf Diseases Classification," in BTW (Workshops), 2017, pp. 79-88.

[14] H. Al-Hiary, S. Bani-Ahmad, M. Reyalat, M. Braik and Z. ALRahamneh, "Fast and accurate
detection and classification of plant diseases," Machine learning, vol. 14, no. 5, pp. 31-38,
2011.

[15] D. Cui, Q. Zhang, M. Li, G. L. Hartman and Y. Zhao, "Image processing methods for
quantitatively detecting soybean rust from multispectral images," Biosystems engineering,
vol. 107, no. 3, pp. 186-193, 2010.

[16] G. Owomugisha, J. A. Quinn, E. Mwebaze and J. Lwasa, "Automated Vision-Based


Diagnosis of Banana Bacterial Wilt Disease and Black Sigatoka Disease," International
Conference on the Use of Mobile ICT in Africa, pp. 1-5, 2014.

[17] A. Tsegaye and P. Struik, "Enset (Ensete ventricosum (Welw.) Cheesman) kocho yield
under different crop establishment methods as compared to yields of other carbohydrate-
rich food crops," NJAS-Wageningen Journal of Life Sciences, vol. 49, no. 1, pp. 81-94,
2001.

[18] G. Birmeta, H. Nybom and E. Bekele, "Distinction between wild and cultivated enset
(Ensete ventricosum) gene pools in Ethiopia using RAPD markers," Hereditas, vol. 140,
no. 2, pp. 139-148, 2004.

[19] SA. Brandt, A. Spring, C. Hiebsch, ST. McCabe, E. Tabogie, M. Diro, G. Welde-Michael,
G. Yntiso, M. Shigeta, and S. Tesfaye, "The 'Tree Against Hunger'. Enset-based
Agricultural Systems in Ethiopia," American Association for the Advancement of science, p.
56, 1997.

[20] P. Josh and G. Adam, Deep Learning A Practitioner’s Approach, Sebastopol: O’Reilly
Media, 2017.

[21] C. François, Deep Learning with Python, New York: Manning Publications, 2017.

[22] G. Ian, B. Yoshua and C. Aaron, Deep Learning, MIT Press, 2016.

[23] K. P. Murphy, Machine Learning A Probabilistic Perspective, London: MIT Press, 2012.

[24] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," nature, vol. 521, no. 7553, pp. 436-
444, 2015.

[25] I. Cohen, A. Garg and T. S. Huang, Machine learning in computer vision, vol. 29, Springer
Science & Business Media, 2005.

[26] R. Mohan, "Deep deconvolutional networks for scene parsing," arXiv preprint
arXiv:1411.4101, pp. 1-8, 2014.

[27] B. Christopher M, Pattern Recognition and Machine Learning, New York: Springer-Verlag,
2006.

[28] T. Cheng, P. Wen and Y. Li, "Research status of artificial neural network and its application
assumption in aviation," 12th International Conference on Computational Intelligence and
Security, pp. 407-410, 2016.

[29] F. Li, J. Justin and Y. Serena, "CS231n: Convolutional Neural Networks for Visual
Recognition," Stanford University, Spring 2018. [Online]. Available:
https://cs231n.stanford.edu/index.html. [Accessed 20 January 2019].

[30] M. A. Nielsen, Neural Network and Deep Learning, Determination press, 2015.

[31] S. G. Disha, "analytics vidhya," 23 October 2017. [Online]. Available:


https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-
functions-when-to-use-them/. [Accessed 14 December 2018].

[32] S. W. Anish, "Towards Data Science," 29 May 2017. [Online]. Available:


https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-
a9a5310cc8f. [Accessed 14 December 2018].

[33] A. Krizhevsky, I. Sutskever and G. E. Hinton, "Imagenet classification with deep


convolutional neural networks," Advances in neural information processing systems, pp.
1097-1105, 2012.

[34] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural networks, vol.
61, pp. 85-117, 2015.

[35] K. P. Ferentinos, "Deep learning models for plant disease detection and diagnosis,"
Computers and Electronics in Agriculture, vol. 145, pp. 311-318, 2018.

[36] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural
networks," science, vol. 313, no. 5786, pp. 504-507, 2006.

[37] Y. Bengio, A. Courville and P. Vincent, "Representation learning: A review and new
perspectives," IEEE transactions on pattern analysis and machine intelligence, vol. 35, no.
8, pp. 1798-1828, 2013.

[38] A. Graves, A.-r. Mohamed and G. Hinton, "Speech recognition with deep recurrent neural
networks," 2013 IEEE international conference on acoustics, speech and signal processing,
pp. 6645-6649, 2013.

[39] F. A. Gers, J. Schmidhuber and F. Cummins, "Learning to forget: Continual prediction with
LSTM," 1999.

[40] G. E. Hinton, "Deep belief networks," Scholarpedia, vol. 4, no. 5, p. 5947, 2009.

[41] A. Kamilaris and F. X. Prenafeta-Boldu, "Deep learning in agriculture: A survey,"


Computers and electronics in agriculture, vol. 147, pp. 70-90, 2018.

[42] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks,"


European conference on computer vision, pp. 818-833, 2014.

[43] A. Ng, "Convolutional Neural Networks," corsera, [Online]. Available:


https://www.coursera.org/learn/convolutional-neural-networks/lecture/hELHk/pooling-
layers. [Accessed 10 march 2019].

[44] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to


document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.

[45] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image
recognition," arXiv preprint arXiv:1409.1556, pp. 1-14, 2014.

[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke


and A. Rabinovich, "Going deeper with convolutions," Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 1-9, 2015.

[47] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition,"
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-
778, 2016.

[48] R. Ronnel and P. Daechul, "A Multiclass Deep Convolutional Neural Network Classifier
for Detection of Common Rice Plant Anomalies," International Journal of Advanced
Computer Science and Applications (IJACSA) , vol. 9, no. 1, pp. 67-70, 2018.

[49] A. Fuentes, S. Yoon, S. C. Kim and D. S. Park, "A robust deep-learning-based detector for
real-time tomato plant diseases and pests recognition," Sensors, vol. 17, no. 9, pp. 1-21,
2017.

[50] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: towards real-time object detection
with region proposal networks," IEEE Transactions on Pattern Analysis & Machine
Intelligence, no. 6, pp. 1137-1149, 2016.

[51] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg, "Ssd:
Single shot multibox detector," in European conference on computer vision, Springer,
2016, pp. 21-37.

[52] J. Dai, Y. Li, K. He and J. Sun, "Object detection via region-based fully convolutional
networks," 30th Conference on Neural Information Processing Systems (NIPS 2016), pp. 1-
9, 2016.

[53] S. Sladojevic, M. Arsenovic, A. Anderla, D. Culibrk and D. Stefanovic, "Deep neural


networks based recognition of plant diseases by leaf image classification," Computational
intelligence and neuroscience, vol. 2016, pp. 1-11, 2016.

[54] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T.


Darrell, "Caffe: Convolutional architecture for fast feature embedding," Proceedings of the
22nd ACM international conference on Multimedia, pp. 675-678, 2014.

[55] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, "Overfeat:


Integrated recognition, localization and detection using convolutional networks," arXiv
preprint arXiv:1312.6229, pp. 1-16, 2013.

[56] S. Patil and A. Chandavale, "A survey on methods of plant disease detection," International
Journal of Science and Research (IJSR), vol. 4, no. 2, pp. 1392-1396, 2015.

[57] S. S. Sannakki, V. S. Rajpurohit, V. Nargund and P. Kulkarni, "Diagnosis and classification


of grape leaf diseases using neural networks," Fourth International Conference on
Computing, Communications and Networking Technologies (ICCCNT), pp. 1-5, 2013.

[58] R. Rajmohan, M. Pajany, R. Rajesh, D. R. Raman and U. Prabu, "Smart Paddy Crop
Disease Identification and Management Using Deep Convolution Neural Network And

SVM Classifier," International journal of pure and applied mathematics, vol. 118, no. 15,
pp. 255-264, 2018.

[59] Z. Ma and A. Kaban, "K-Nearest-Neighbours with a novel similarity measure for intrusion
detection," UKCI, vol. 13, pp. 266-271, 2013.

[60] R.-H. Li and G. G. Belford, "Instability of decision tree classification algorithms,"


Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery
and data mining, pp. 570-575, 2002.

[61] A. Liaw, M. Wiener and others, "Classification and regression by randomForest," R news,
vol. 2, no. 3, pp. 18-22, 2002.

[62] P. Geurts, D. Ernst and L. Wehenkel, "Extremely randomized trees," Machine learning, vol.
63, no. 1, pp. 3-42, 2006.

[63] H. Zhang, "The optimality of naive Bayes," AA, vol. 1, no. 2, 2004.

[64] B. Tigadi and B. Sharma, "Banana Plant Disease Detection and Grading Using Image
Processing," International Journal of Engineering Science, vol. 6, no. 6, pp. 6512 -6516,
2016.

[65] S. P. Bhamare and S. C. Kulkarni, "Detection of Black Sigatoka on Banana Tree using
Image Processing Techniques," Second International Conference on Emerging Trends in
Engineering (SICETE), vol. 1, no. 14, pp. 60-65, 2013.

[66] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H.


Shahrzad, A. Navruzyan and N. Duffy, "Evolving deep neural networks," in Artificial
Intelligence in the Age of Neural Networks and Brain Computing, Elsevier, 2019, pp. 293-
312.

[67] B. McMahan and M. Streeter, "Delay-tolerant algorithms for asynchronous distributed


online learning," Advances in Neural Information Processing Systems, pp. 2915-2923,
2014.

[68] M. Mariette, S. Marc, I. Bergh, S. Boudy, V. Bernard, B. Guy, G. Svetlana, N. Emmanuel


and L. Cees, "Xanthomonas Wilt of Banana (BXW) in Central Africa: Opportunities,
challenges, and pathways for citizen science and ICT-based control and prevention
strategies," NJAS - Wageningen Journal of Life Sciences, pp. 1-12, 2018.

Appendix A: Experiment of Proposed Model

